* [RFC Design Doc] Add vNVDIMM support for Xen
From: Haozhong Zhang @ 2016-02-01  5:44 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	George Dunlap, Andrew Cooper, Stefano Stabellini, Ian Jackson,
	Jan Beulich, Jun Nakajima, Xiao Guangrong, Keir Fraser

Hi,

The following document describes the design of adding vNVDIMM support
for Xen. Any comments are welcome.

Thanks,
Haozhong


Content
=======
1. Background
 1.1 Access Mechanisms: Persistent Memory and Block Window
 1.2 ACPI Support
  1.2.1 NFIT
  1.2.2 _DSM and _FIT
 1.3 Namespace
 1.4 clwb/clflushopt/pcommit
2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU
 2.1 NVDIMM Driver in Linux Kernel
 2.2 vNVDIMM Implementation in KVM/QEMU
3. Design of vNVDIMM in Xen
 3.1 Guest clwb/clflushopt/pcommit Enabling
 3.2 Address Mapping
  3.2.1 My Design
  3.2.2 Alternative Design
 3.3 Guest ACPI Emulation
  3.3.1 My Design
  3.3.2 Alternative Design 1: switching to QEMU
  3.3.3 Alternative Design 2: keeping in Xen
References


Non-Volatile DIMM (NVDIMM) is a type of RAM device that provides
persistent storage, i.e. it retains data across reboots and even power
failures. This document describes the design to support virtual NVDIMM
devices (vNVDIMM) in Xen.

The rest of this document is organized as follows.
 - Section 1 briefly introduces the background knowledge of NVDIMM
   hardware, which is used by other parts of this document.

 - Section 2 briefly introduces the current/future NVDIMM/vNVDIMM
   support in Linux kernel/KVM/QEMU. They will affect the vNVDIMM
   design in Xen.

 - Section 3 proposes design details of vNVDIMM in Xen. Several
   alternatives are also listed in this section.



1. Background

1.1 Access Mechanisms: Persistent Memory and Block Window

 NVDIMM provides two access mechanisms: byte-addressable persistent
 memory (pmem) and block window (pblk). An NVDIMM can contain multiple
 ranges and each range can be accessed through either pmem or pblk
 (but not both).

 The byte-addressable persistent memory mechanism (pmem) maps an
 NVDIMM, or ranges of an NVDIMM, into the system physical address
 (SPA) space, so that software can access the NVDIMM via normal memory
 loads and stores. If a virtual address is used, the MMU translates it
 to the physical address.

 In a virtualization environment, we can pass through a pmem range, or
 part of it, to a guest by mapping it in EPT (i.e. mapping guest
 vNVDIMM physical addresses to host NVDIMM physical addresses), so
 that guest accesses are applied directly to the host NVDIMM device
 without the hypervisor's interception.

 The block window mechanism (pblk) provides one or multiple block
 windows (BW). Each BW is composed of a command register, a status
 register and an 8 KB aperture register. Software fills in the
 direction of the transfer (read/write) and the start address (LBA)
 and size on the NVDIMM it is going to transfer. If nothing goes
 wrong, the transferred data can be read/written via the aperture
 register. The status and errors of the transfer can be obtained from
 the status register. Other vendor-specific commands and status can be
 implemented for BW as well. Details of the block window access
 mechanism can be found in [3].

 In a virtualization environment, different pblk regions on a single
 NVDIMM device may be accessed by different guests, so the hypervisor
 needs to emulate BW, which would introduce a high overhead for I/O
 intensive workloads.

 Therefore, we plan to implement only pmem for vNVDIMM. The rest of
 this document will mostly concentrate on pmem.


1.2 ACPI Support

 ACPI provides two kinds of support for NVDIMM. First, NVDIMM devices
 are described by firmware (BIOS/EFI) to the OS via the ACPI-defined
 NVDIMM Firmware Interface Table (NFIT). Second, several functions of
 NVDIMM, including operations on namespace labels, S.M.A.R.T. and
 hotplug, are provided by ACPI methods (_DSM and _FIT).

1.2.1 NFIT

 NFIT is a new system description table added in ACPI v6 with
 signature "NFIT". It contains a set of structures.

 - System Physical Address Range Structure
   (SPA Range Structure)

   The SPA range structure describes the system physical address
   ranges occupied by NVDIMMs and the types of those regions. (A C
   sketch of this structure's layout follows the list below.)

   If the Address Range Type GUID field of an SPA range structure is
   "Byte Addressable Persistent Memory (PM) Region", then the
   structure describes an NVDIMM region that is accessed via pmem. The
   System Physical Address Range Base and Length fields describe the
   start system physical address and the length occupied by that
   NVDIMM region.

   An SPA range structure is identified by a non-zero SPA range
   structure index.

   Note: [1] reserves E820 type 7: OSPM must comprehend this memory as
         having non-volatile attributes and handle it distinctly from
         conventional volatile memory (Table 15-312 of [1]). The
         memory region supports byte-addressable non-volatility. E820
         type 12 (OEM defined) may also be used for legacy NVDIMMs
         prior to ACPI v6.

   Note: Besides the OS, EFI firmware may also parse NFIT for booting
         drives (Section 9.3.6.9 of [5]).

 - Memory Device to System Physical Address Range Mapping Structure
   (Range Mapping Structure)

   An NVDIMM region described by an SPA range structure can be
   interleaved across multiple NVDIMM devices. A range mapping
   structure is used to describe the single mapping on each NVDIMM
   device. It describes the size and the offset in an SPA range that
   an NVDIMM device occupies. It may refer to an Interleave Structure
   that contains details of the entire interleave set. This
   information is used in pblk by the NVDIMM driver for address
   translation.

   The NVDIMM device described by the range mapping structure is
   identified by a unique NFIT Device Handle.

 Details of NFIT and other structures can be found in Section 5.2.25 of [1].
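
 For reference, the sketch below spells out the SPA range structure
 layout in C, following the field order in ACPI 6.0 (the field names
 mirror Linux's struct acpi_nfit_system_address; treat it as
 illustrative rather than a normative definition):

     #include <stdint.h>

     /* Common header shared by all NFIT sub-structures. */
     struct nfit_header {
         uint16_t type;              /* 0 = SPA Range Structure */
         uint16_t length;            /* length of this structure in bytes */
     } __attribute__((packed));

     /* System Physical Address (SPA) Range Structure, ACPI 6.0. */
     struct nfit_spa_range {
         struct nfit_header header;
         uint16_t range_index;       /* non-zero SPA range structure index */
         uint16_t flags;
         uint32_t reserved;
         uint32_t proximity_domain;
         uint8_t  range_guid[16];    /* e.g. the PM region GUID for pmem */
         uint64_t address;           /* System Physical Address Range Base */
         uint64_t length;            /* System Physical Address Range Length */
         uint64_t memory_mapping;    /* memory mapping attribute */
     } __attribute__((packed));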

1.2.2 _DSM and _FIT

 The ACPI namespace device uses the _HID ACPI0012 to identify the root
 NVDIMM interface device. An ACPI namespace device is also present
 under the root device for each NVDIMM device. These ACPI namespace
 devices are defined in the SSDT.

 _DSM methods are present under the root device and under each NVDIMM
 device. _DSM methods are used by drivers to access the label storage
 area, get health information, perform vendor-specific commands,
 etc. Details of all _DSM methods can be found in [4].

 The _FIT method is under the root device and is evaluated by OSPM to
 get the NFIT of a hotplugged NVDIMM. The hotplugged NVDIMM is
 indicated to the OS using an ACPI namespace device with the PNP ID
 PNP0C80 and a device object notification value of 0x80. Details of
 NVDIMM hotplug can be found in Section 9.20 of [1].


1.3 Namespace

 [2] describes a mechanism to sub-divide NVDIMMs into namespaces,
 which are logical units of storage similar to SCSI LUNs and NVM
 Express namespaces.

 The namespace information is described by namespace labels stored in
 the persistent label storage area on each NVDIMM device. The label
 storage area is excluded from the range mapped by the SPA range
 structure and can only be accessed via _DSM methods.

 There are two types of namespaces defined in [2]: persistent memory
 namespaces and block namespaces. Persistent memory namespaces are
 built only on pmem NVDIMM regions, while block namespaces are built
 only on pblk regions. Only one persistent memory namespace is allowed
 per pmem NVDIMM region.

 Besides being accessed via _DSM, namespaces are managed and
 interpreted by software. OS vendors may decide not to follow [2] and
 store other types of information in the label storage area.


1.4 clwb/clflushopt/pcommit

 Writes to an NVDIMM may be buffered in CPU caches, so certain
 flushing operations must be performed to make them persistent on the
 NVDIMM. clwb is used in favor of clflushopt and clflush to flush
 writes from caches to memory. A following pcommit then makes them
 finally persistent (power-failure protected) on the NVDIMM.

 Details of clwb/clflushopt/pcommit can be found in Chapter 10 of [6].
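
 As an illustration, the following is a minimal sketch of the
 store-and-flush sequence in C with GCC inline assembly. It assumes an
 assembler that understands the clwb and pcommit mnemonics and a CPU
 that reports them in CPUID; a real implementation would check the
 CPUID bits and fall back to clflushopt/clflush where needed.

     #include <stddef.h>
     #include <stdint.h>

     #define CACHELINE_SIZE 64

     /* Flush [buf, buf + len) from the CPU caches and commit it to the NVDIMM. */
     static void pmem_persist(const void *buf, size_t len)
     {
         uintptr_t p = (uintptr_t)buf & ~(uintptr_t)(CACHELINE_SIZE - 1);
         uintptr_t end = (uintptr_t)buf + len;

         for (; p < end; p += CACHELINE_SIZE)
             asm volatile("clwb %0" : "+m" (*(volatile char *)p));

         asm volatile("sfence" ::: "memory");   /* order clwb before pcommit */
         asm volatile("pcommit" ::: "memory");  /* commit flushed data to the NVDIMM */
         asm volatile("sfence" ::: "memory");   /* order pcommit before later stores */
     }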



2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU

2.1 NVDIMM Driver in Linux Kernel

 The Linux kernel has added support for ACPI-defined NVDIMM devices
 since version 4.2.

 The NVDIMM driver in Linux probes NVDIMM devices through ACPI
 (i.e. NFIT and _FIT). It is also responsible for parsing the
 namespace labels on each NVDIMM device, recovering namespaces after
 power failure (as described in [2]) and handling NVDIMM hotplug.
 There are also some vendor drivers that perform vendor-specific
 operations on NVDIMMs (e.g. via _DSM).

 Compared to ordinary RAM, NVDIMM is used more like a persistent
 storage drive because of its persistence. For each persistent memory
 namespace, or each label-less pmem NVDIMM range, the NVDIMM driver
 implements a block device interface (/dev/pmemX).

 Userspace applications can mmap(2) the whole pmem device into their
 own virtual address space. The Linux kernel maps the system physical
 address range occupied by pmem into the virtual address space, so
 that normal memory loads/stores, with proper flushing instructions,
 are applied to the underlying pmem NVDIMM regions.
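
 For example, a user-space program might use a pmem block device like
 this (a minimal sketch with error handling trimmed; as noted in
 Section 1.4, the stores still need clwb/pcommit, or msync(2) on newer
 kernels, to be guaranteed persistent):

     #include <fcntl.h>
     #include <string.h>
     #include <sys/mman.h>
     #include <unistd.h>

     int main(void)
     {
         size_t len = 2UL << 20;          /* map the first 2 MB of the device */
         int fd = open("/dev/pmem0", O_RDWR);
         void *pmem;

         if (fd < 0)
             return 1;

         pmem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
         if (pmem == MAP_FAILED)
             return 1;

         /* Ordinary stores go straight to the NVDIMM region backing /dev/pmem0. */
         memcpy(pmem, "hello, pmem", 12);

         munmap(pmem, len);
         close(fd);
         return 0;
     }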

 Alternatively, a DAX file system can be created on /dev/pmemX. Files
 on that file system can be used in the same way as above. As the
 Linux kernel maps the system address space range occupied by those
 files on the NVDIMM to the virtual address space, reads/writes on
 those files are applied to the underlying NVDIMM regions as well.

2.2 vNVDIMM Implementation in KVM/QEMU

 An overview of the vNVDIMM implementation in KVM (Linux kernel v4.2) / QEMU
 (commit 70d1fb9 and patches in review or planned) is shown in the following
 figure.


                                       +---------------------------------+
 Guest                             GPA |                    | /dev/pmem0 |
                                       +---------------------------------+
           parse        evaluate                            ^            ^
            ACPI          _DSM                              |            |
              |            |                                |            |
 -------------|------------|--------------------------------|------------|----
              V            V                                |            |
          +-------+    +-------+                            |            |
 QEMU     | vACPI |    | v_DSM |                            |            |
          +-------+    +-------+                            |            |
                           ^                                |            |
                           | Read/Write                     |            |
                           V                                |            |
          +...+--------------------+...+-----------+        |            |
    VA    |   | Label Storage Area |   |    buf    |  KVM_SET_USER_MEMORY_REGION(buf)
          +...+--------------------+...+-----------+        |            |
                                       ^  mmap(2)  ^        |            |
 --------------------------------------|-----------|--------|------------|----
                                       |           +--------~------------+
                                       |                    |            |
 Linux/KVM                             +--------------------+            |
                                                            |            |
                                                       +....+------------+
                                                SPA    |    | /dev/pmem0 |
                                                       +....+------------+
                                                                   ^
                                                                   |
                                                            Host NVDIMM Driver
-------------------------------------------------------------------|---------
                                                                   |
 HW                                                          +------------+
                                                             |   NVDIMM   |
                                                             +------------+


 A part not shown in the figure above is enabling guest
 clwb/clflushopt/pcommit, which exposes those instructions to the guest via
 guest cpuid.

 Besides instruction enabling, there are two primary parts of vNVDIMM
 implementation in KVM/QEMU.

 (1) Address Mapping

  As described before, the host Linux NVDIMM driver provides a block
  device interface (/dev/pmem0 at the bottom) for a pmem NVDIMM
  region. QEMU can then mmap(2) that device into its virtual address
  space (buf). QEMU is responsible for finding a guest physical
  address range that is large enough to hold /dev/pmem0. QEMU then
  passes the virtual address of the mmapped buf to the KVM API
  KVM_SET_USER_MEMORY_REGION, which maps in EPT the host physical
  address range of buf to the guest physical address range where the
  virtual pmem device will be.

  In this way, all guest reads/writes on the virtual pmem device are
  applied directly to the host one.

  Besides, the above implementation also allows a virtual pmem device
  to be backed by an mmapped regular file or a piece of ordinary RAM.
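
  A rough sketch of this mapping step is below. It is simplified: real
  QEMU goes through its memory-region/KVMSlot machinery rather than
  issuing the ioctl directly, and the slot number is a made-up value
  for illustration.

      #include <linux/kvm.h>
      #include <sys/ioctl.h>

      /* Map an mmap(2)ed /dev/pmem0 (buf) into the guest at gpa. */
      static int map_vnvdimm(int vm_fd, void *buf, __u64 gpa, __u64 size)
      {
          struct kvm_userspace_memory_region region = {
              .slot            = 1,      /* a free memory slot (illustrative) */
              .flags           = 0,
              .guest_phys_addr = gpa,    /* where the virtual pmem device lives */
              .memory_size     = size,
              .userspace_addr  = (__u64)(unsigned long)buf,
          };

          /* KVM then maps in EPT the host pages backing buf to the guest range. */
          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
      }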

 (2) Guest ACPI Emulation

  As the guest system physical address and the size of the virtual
  pmem device are determined by QEMU, QEMU is responsible for
  emulating the guest NFIT and SSDT. Basically, it builds the guest
  NFIT and its sub-structures that describe the virtual NVDIMM
  topology, and a guest SSDT that defines the ACPI namespace devices
  of the virtual NVDIMM.

  Because the guest pmem device can be backed by a portion of a host
  pmem device, by a regular file or by ordinary RAM, the label storage
  area on the host pmem cannot always be passed through to the guest.
  Therefore, guest reads/writes on the label storage area are emulated
  by QEMU. As described before, the _DSM method is utilized by OSPM to
  access the label storage area, and therefore it is emulated by
  QEMU. The _DSM buffer is registered as MMIO, and its guest physical
  address and size are described in the guest ACPI. Every
  command/status read/write from the guest is trapped and emulated by
  QEMU.
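
  A heavily simplified sketch of the trap-and-emulate part against
  QEMU's memory API is below; the handler bodies and helper names are
  hypothetical, not QEMU's actual hw/acpi/nvdimm.c code.

      #include "exec/memory.h"

      static uint64_t dsm_read(void *opaque, hwaddr addr, unsigned size)
      {
          /* Guest reads _DSM status/output; a real handler would return the
           * emulated buffer contents (hypothetical helper omitted here). */
          return 0;
      }

      static void dsm_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
      {
          /* Guest writes a _DSM command; QEMU would emulate it against the
           * label storage area (hypothetical helper omitted here). */
      }

      static const MemoryRegionOps dsm_ops = {
          .read = dsm_read,
          .write = dsm_write,
          .endianness = DEVICE_LITTLE_ENDIAN,
      };

      /* Register the _DSM buffer as MMIO; its GPA and size are also advertised
       * in the guest ACPI so that the _DSM AML knows where to write. */
      static void register_dsm_buffer(Object *owner, MemoryRegion *sysmem,
                                      MemoryRegion *mr, hwaddr gpa, uint64_t size)
      {
          memory_region_init_io(mr, owner, &dsm_ops, NULL, "nvdimm-dsm", size);
          memory_region_add_subregion(sysmem, gpa, mr);
      }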

  The guest _FIT method will be implemented similarly in the future.



3. Design of vNVDIMM in Xen

 Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
 three parts:
 (1) Guest clwb/clflushopt/pcommit enabling,
 (2) Memory mapping, and
 (3) Guest ACPI emulation.

 The rest of this section presents the design of each part
 respectively. The basic design principle is to reuse existing code in
 the Linux NVDIMM driver and QEMU as much as possible. Following the
 recent discussions on both the Xen and QEMU mailing lists about the
 v1 patch series, alternative designs are also listed below.


3.1 Guest clwb/clflushopt/pcommit Enabling

 The instruction enabling is simple and we do the same work as in KVM/QEMU.
 - All three instructions are exposed to guest via guest cpuid.
 - L1 guest pcommit is never intercepted by Xen.
 - L1 hypervisor is allowed to intercept L2 guest pcommit.
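
 For reference, all three feature bits live in CPUID leaf 7
 (EAX=7, ECX=0), register EBX: CLFLUSHOPT is bit 23, CLWB is bit 24 and
 PCOMMIT is bit 22 [6]. A guest can verify that the bits were actually
 exposed with a small check like the sketch below (GCC on x86-64,
 assuming the CPU supports CPUID leaf 7):

     #include <cpuid.h>
     #include <stdio.h>

     int main(void)
     {
         unsigned int eax, ebx, ecx, edx;

         /* CPUID.(EAX=7,ECX=0):EBX */
         __cpuid_count(7, 0, eax, ebx, ecx, edx);

         printf("clflushopt: %u\n", (ebx >> 23) & 1);
         printf("clwb      : %u\n", (ebx >> 24) & 1);
         printf("pcommit   : %u\n", (ebx >> 22) & 1);
         return 0;
     }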


3.2 Address Mapping

3.2.1 My Design

 The overview of this design is shown in the following figure.

                 Dom0                         |               DomU
                                              |
                                              |
 QEMU                                         |
     +...+--------------------+...+-----+     |
  VA |   | Label Storage Area |   | buf |     |
     +...+--------------------+...+-----+     |
                     ^            ^     ^     |
                     |            |     |     |
                     V            |     |     |
     +-------+   +-------+        mmap(2)     |
     | vACPI |   | v_DSM |        |     |     |        +----+------------+
     +-------+   +-------+        |     |     |   SPA  |    | /dev/pmem0 |
         ^           ^     +------+     |     |        +----+------------+
 --------|-----------|-----|------------|--   |             ^            ^
         |           |     |            |     |             |            |
         |    +------+     +------------~-----~-------------+            |
         |    |            |            |     |        XEN_DOMCTL_memory_mapping
         |    |            |            +-----~--------------------------+
         |    |            |            |     |
         |    |       +----+------------+     |
 Linux   |    |   SPA |    | /dev/pmem0 |     |     +------+   +------+
         |    |       +----+------------+     |     | ACPI |   | _DSM |
         |    |                   ^           |     +------+   +------+
         |    |                   |           |         |          |
         |    |               Dom0 Driver     |   hvmloader/xl     |
 --------|----|-------------------|---------------------|----------|---------------
         |    +-------------------~---------------------~----------+
 Xen     |                        |                     |
         +------------------------~---------------------+
 ---------------------------------|------------------------------------------------
                                  +----------------+
                                                   |
                                            +-------------+
 HW                                         |    NVDIMM   |
                                            +-------------+


 This design treats host NVDIMM devices as ordinary MMIO devices:
 (1) The Dom0 Linux NVDIMM driver is responsible for detecting
     (through NFIT) and driving host NVDIMM devices (implementing the
     block device interface). Namespaces and file systems on host
     NVDIMM devices are handled by Dom0 Linux as well.

 (2) QEMU mmap(2)s the pmem NVDIMM device (/dev/pmem0) into its
     virtual address space (buf).

 (3) QEMU gets the host physical address of buf, i.e. the host system
     physical address occupied by /dev/pmem0, and calls the Xen
     hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.

 (ACPI part is described in Section 3.3 later)

 The above (1) and (2) have already been done in current QEMU. Only
 (3) needs to be implemented in QEMU. No change is needed in Xen for
 address mapping in this design.
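
 For illustration, once QEMU knows the host frame numbers backing buf,
 step (3) boils down to a call like the sketch below, which uses
 libxc's existing wrapper for XEN_DOMCTL_memory_mapping (the function
 and frame numbers here are placeholders for whatever QEMU computes):

     #include <xenctrl.h>

     /* Map nr_frames host frames starting at pmem_mfn into the guest at guest_gfn. */
     static int map_pmem_to_guest(xc_interface *xch, uint32_t domid,
                                  unsigned long guest_gfn,
                                  unsigned long pmem_mfn,
                                  unsigned long nr_frames)
     {
         return xc_domain_memory_mapping(xch, domid,
                                         guest_gfn,        /* first guest frame */
                                         pmem_mfn,         /* first host frame  */
                                         nr_frames,
                                         DPCI_ADD_MAPPING  /* add, not remove   */);
     }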

 Open: It seems no system call/ioctl is provided by the Linux kernel
       to get the physical address corresponding to a virtual address.
       /proc/<qemu_pid>/pagemap provides the mapping information from
       VA to PA. Is it an acceptable solution to let QEMU parse this
       file to get the physical address (as sketched below)?
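
 To make the first open concrete, the sketch below shows how such
 parsing could look. Each 64-bit pagemap entry holds the PFN in bits
 0-54 and a "present" flag in bit 63, and recent kernels require
 CAP_SYS_ADMIN to see real PFNs.

     #include <fcntl.h>
     #include <stdint.h>
     #include <unistd.h>

     /* Translate a virtual address of the calling process to a physical address;
      * returns 0 on failure or if the page is not present. */
     static uint64_t va_to_pa(uintptr_t va)
     {
         long page_size = sysconf(_SC_PAGESIZE);
         uint64_t entry = 0;
         int fd = open("/proc/self/pagemap", O_RDONLY);

         if (fd < 0)
             return 0;

         /* One 8-byte entry per virtual page. */
         if (pread(fd, &entry, sizeof(entry),
                   (off_t)(va / page_size) * sizeof(entry)) != sizeof(entry)) {
             close(fd);
             return 0;
         }
         close(fd);

         if (!(entry & (1ULL << 63)))           /* page not present */
             return 0;

         return (entry & ((1ULL << 55) - 1)) * page_size + va % page_size;
     }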

 Open: For a large pmem, mmap(2) may well not map all of the SPA
       occupied by pmem at the beginning, i.e. QEMU may not be able to
       get all SPAs of pmem from buf (in the virtual address space)
       when calling XEN_DOMCTL_memory_mapping.
       Can the mmap flag MAP_LOCKED or mlock(2) be used to enforce the
       entire pmem being mmapped?

3.2.2 Alternative Design

 Jan Beulich's comments [7] on my question "why must pmem resource
 management and partition be done in hypervisor":
 | Because that's where memory management belongs. And PMEM,
 | other than PBLK, is just another form of RAM.
 | ...
 | The main issue is that this would imo be a layering violation

 George Dunlap's comments [8]:
 | This is not the case for PMEM.  The whole point of PMEM (correct me if
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
 | I'm wrong) is to be used for long-term storage that survives over
 | reboot.  It matters very much that a guest be given the same PRAM
 | after the host is rebooted that it was given before.  It doesn't make
 | any sense to manage it the way Xen currently manages RAM (i.e., that
 | you request a page and get whatever Xen happens to give you).
 |
 | So if Xen is going to use PMEM, it will have to invent an entirely new
 | interface for guests, and it will have to keep track of those
 | resources across host reboots.  In other words, it will have to
 | duplicate all the work that Linux already does.  What do we gain from
 | that duplication?  Why not just leverage what's already implemented in
 | dom0?
 and [9]:
 | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
 | then you're right -- it is just another form of RAM, that should be
 | treated no differently than say, lowmem: a fungible resource that can be
 | requested by setting a flag.

 However, pmem is used more as persistent storage than as fungible
 RAM, and my design is for the former usage. I would like to leave the
 detection, driver and partitioning (either through namespaces or file
 systems) of NVDIMM in the Dom0 Linux kernel.

 I notice that the current XEN_DOMCTL_memory_mapping does not perform
 sanity checks on the physical address and size passed from the caller
 (QEMU). Can QEMU always be trusted? If not, we would need to make Xen
 aware of the SPA ranges of pmem so that it can refuse to map physical
 addresses that are in neither normal RAM nor pmem.

 Instead of duplicating the detection code (parsing NFIT and
 evaluating _FIT) from the Dom0 Linux kernel in Xen, we decide to
 patch the Dom0 Linux kernel to pass parameters of host pmem NVDIMM
 devices to the Xen hypervisor:
 (1) Add a global
       struct rangeset pmem_rangeset
     in Xen hypervisor to record all SPA ranges of detected pmem devices.
     Each range in pmem_rangeset corresponds to a pmem device.

 (2) Add a hypercall
       XEN_SYSCTL_add_pmem_range
     (should it be a sysctl or a platform op?)
     that receives a pair of parameters (addr: starting SPA of the
     pmem region, len: size of the pmem region) and adds the range
     (addr, addr + len - 1) to pmem_rangeset.

 (3) Add a hypercall
       XEN_DOMCTL_pmem_mapping
     that takes the same parameters as XEN_DOMCTL_memory_mapping and
     maps a given host pmem range to a guest. It checks whether the
     given host pmem range is in pmem_rangeset before making the
     actual mapping. (A rough sketch of (1)-(3) follows this list.)

 (4) Patch Linux NVDIMM driver to call XEN_SYSCTL_add_pmem_range
     whenever it detects a pmem device.

 (5) Patch QEMU to use XEN_DOMCTL_pmem_mapping for mapping host pmem
     devices.
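
 A rough sketch of the Xen-side bookkeeping for steps (1)-(3), written
 against Xen's existing rangeset API, is below. The hypercall plumbing
 is omitted and the function names are placeholders.

     /* Sketch only; the hypercall argument handling is not shown. */
     #include <xen/errno.h>
     #include <xen/rangeset.h>

     static struct rangeset *pmem_rangeset;

     /* Called once at boot: records host pmem SPA ranges, owned by no domain. */
     int pmem_rangeset_init(void)
     {
         pmem_rangeset = rangeset_new(NULL, "host pmem regions", 0);
         return pmem_rangeset ? 0 : -ENOMEM;
     }

     /* Backend of XEN_SYSCTL_add_pmem_range: Dom0's NVDIMM driver reports a region. */
     int pmem_add_range(unsigned long spa, unsigned long len)
     {
         return rangeset_add_range(pmem_rangeset, spa, spa + len - 1);
     }

     /* Used by XEN_DOMCTL_pmem_mapping before doing the actual mapping. */
     int pmem_range_valid(unsigned long spa, unsigned long len)
     {
         return rangeset_contains_range(pmem_rangeset, spa, spa + len - 1);
     }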


3.3 Guest ACPI Emulation

3.3.1 My Design

 Guest ACPI emulation is composed of two parts: building the guest
 NFIT and SSDT that define ACPI namespace devices for NVDIMM, and
 emulating guest _DSM.

 (1) Building Guest ACPI Tables

  This design reuses and extends hvmloader's existing mechanism that
  loads passthrough ACPI tables from binary files, in order to load the
  NFIT and SSDT tables built by QEMU:
  1) Because the current QEMU does not build any ACPI tables when it
     runs as the Xen device model, this design needs to patch QEMU to
     build the NFIT and SSDT (so far only those two tables) in this case.

  2) QEMU copies NFIT and SSDT to the end of guest memory below
     4G. The guest address and size of those tables are written into
     xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).

  3) hvmloader is patched to probe and load device model passthrough
     ACPI tables from the above xenstore keys. The detected ACPI tables
     are then appended to the end of the existing guest ACPI tables,
     just like what the current construct_passthrough_tables() does.
     (A sketch of this step is given after this list.)
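
  Step 3) in hvmloader could then look roughly like the sketch below.
  It is very much simplified: xenstore_read() is hvmloader's existing
  helper, the key names follow step 2) above, and parse_uint64() is a
  hypothetical string-to-integer helper.

      /* Sketch of the hvmloader side; not actual code. */
      #include "util.h"                       /* xenstore_read(), memcpy(), ... */

      extern uint64_t parse_uint64(const char *s);    /* hypothetical helper */

      /* Returns the number of bytes of passthrough ACPI tables copied to dst. */
      static unsigned long load_dm_acpi_tables(void *dst)
      {
          const char *addr_str = xenstore_read("hvmloader/dm-acpi/address", NULL);
          const char *len_str  = xenstore_read("hvmloader/dm-acpi/length", NULL);
          uint64_t addr, len;

          if ( !addr_str || !len_str )
              return 0;                       /* no tables passed from QEMU */

          addr = parse_uint64(addr_str);      /* guest address where QEMU put them */
          len  = parse_uint64(len_str);

          /* The tables already sit in guest memory below 4G; append a copy to
           * the end of the ACPI tables hvmloader itself builds. */
          memcpy(dst, (void *)(unsigned long)addr, len);
          return len;
      }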

  Reasons for this design are listed below:
  - The NFIT and SSDT in question are quite self-contained, i.e. they
    do not refer to other ACPI tables and do not conflict with the
    existing guest ACPI tables in Xen. Therefore, it is safe to copy
    them from QEMU and append them to the existing guest ACPI tables.

  - A primary portion of the current and future vNVDIMM implementation
    is about building ACPI tables. This design also leaves the
    emulation of _DSM to QEMU, which needs to keep it consistent with
    the NFIT and SSDT it builds. Therefore, reusing the NFIT and SSDT
    from QEMU eases the maintenance.

  - Anthony's work to pass ACPI tables from the toolstack to hvmloader
    does not move building SSDT (and NFIT) to toolstack, so this
    design can still put them in hvmloader.

 (2) Emulating Guest _DSM

  Because the same NFIT and SSDT are used, we can leave the emulation
  of guest _DSM to QEMU. Just as it does with KVM, QEMU registers the
  _DSM buffer as an MMIO region with Xen and then all guest
  evaluations of _DSM are trapped and emulated by QEMU.

3.3.2 Alternative Design 1: switching to QEMU

 Stefano Stabellini's comments [10]:
 | I don't think it is wise to have two components which both think are
 | in control of generating ACPI tables, hvmloader (soon to be the
 | toolstack with Anthony's work) and QEMU. From an architectural
 | perspective, it doesn't look robust to me.
 |
 | Could we take this opportunity to switch to QEMU generating the whole
 | set of ACPI tables?

 So an alternative design could be to switch to QEMU for generating the
 whole set of guest ACPI tables. In this way, no conflict would arise
 between the multiple agents QEMU and hvmloader. (Is this what Stefano
 Stabellini means by 'robust'?)

 However, looking at the code building ACPI tables in QEMU and
 hvmloader, they are quite different. As ACPI tables are important for
 the OS to boot and operate devices, it is critical to ensure that ACPI
 tables built by QEMU would not break existing guests on Xen. Though I
 believe it could be done after a thorough investigation and
 adjustment, it may take quite a lot of work and testing and should be
 a separate project from enabling vNVDIMM in Xen.

3.3.3 Alternative Design 2: keeping in Xen

 As an alternative to switching to QEMU, another design would be to
 build the NFIT and SSDT in hvmloader or the toolstack.

 The number and parameters of sub-structures in the guest NFIT vary
 with the vNVDIMM configuration and cannot be decided at compile time.
 In contrast, the current hvmloader and toolstack can only build static
 ACPI tables, i.e. their contents are decided statically at compile
 time and are independent of the guest configuration. In order to build
 the guest NFIT at runtime, this design may take the following steps:
 (1) xl converts NVDIMM configurations in xl.cfg to corresponding QEMU
     options,

 (2) QEMU accepts above options, figures out the start SPA range
     address/size/NVDIMM device handles/..., and writes them in
     xenstore. No ACPI table is built by QEMU.

 (3) Either xl or hvmloader reads above parameters from xenstore and
     builds the NFIT table.

 For the guest SSDT, it would take more work. The ACPI namespace
 devices are defined in the SSDT in AML, so an AML builder would be
 needed to generate those definitions at runtime (see the sketch
 below).
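
 To give an idea of what such an AML builder has to produce, the sketch
 below uses QEMU's aml-build API (which already exists on the QEMU
 side; hvmloader or libxl would need an equivalent) to define the root
 NVDIMM device and one child NVDIMM device:

     /* Sketch using QEMU's hw/acpi/aml-build.h API, for illustration only. */
     #include "hw/acpi/aml-build.h"

     static void build_vnvdimm_devices(Aml *ssdt_scope)
     {
         Aml *root = aml_device("NVDR");      /* root NVDIMM interface device */
         Aml *nv0 = aml_device("NV00");       /* one NVDIMM device            */

         aml_append(root, aml_name_decl("_HID", aml_string("ACPI0012")));

         /* _ADR carries the NFIT Device Handle of this NVDIMM; real code would
          * also append the _DSM method(s) here. */
         aml_append(nv0, aml_name_decl("_ADR", aml_int(0)));

         aml_append(root, nv0);
         aml_append(ssdt_scope, root);
     }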

 This alternative design still needs more work than the first design.



References:
[1] ACPI Specification v6,
    http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2] NVDIMM Namespace Specification,
    http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
[3] NVDIMM Block Window Driver Writer's Guide,
    http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
[4] NVDIMM DSM Interface Example,
    http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
[5] UEFI Specification v2.6,
    http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_6.pdf
[6] Intel Architecture Instruction Set Extensions Programming Reference,
    https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf
[7] http://www.gossamer-threads.com/lists/xen/devel/414945#414945
[8] http://www.gossamer-threads.com/lists/xen/devel/415658#415658
[9] http://www.gossamer-threads.com/lists/xen/devel/415681#415681
[10] http://lists.xenproject.org/archives/html/xen-devel/2016-01/msg00271.html


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
From: Andrew Cooper @ 2016-02-01 18:25 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell, George Dunlap,
	Stefano Stabellini, Ian Jackson, xen-devel, Jan Beulich,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 01/02/16 05:44, Haozhong Zhang wrote:
> Hi,
>
> The following document describes the design of adding vNVDIMM support
> for Xen. Any comments are welcome.
>
> Thanks,
> Haozhong

Thank you for doing this.  It is a very comprehensive document, and a
fantastic example for future similar situations.


To start with however, I would like to clear up my confusion over the
the usecases of pmem vs pblk.

pblk, using indirect access, is less efficient than pmem.  NVDIMMs
themselves are slower (and presumably more expensive) than equivalent
RAM, and presumably still have a finite number of write cycles, so I
don't buy an argument suggesting that they are a plausible replacement
for real RAM.

I presume therefore that a system would only choose to use pblk mode in
situations where the host physical address space is a limiting factor. 
Are there other situations which I have overlooked?

Secondly, I presume that pmem vs pblk will be a firmware decision and
fixed from the point of view of the Operating System?

Thanks,

~Andrew


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
From: Tian, Kevin @ 2016-02-02  3:27 UTC (permalink / raw)
  To: Andrew Cooper, Zhang, Haozhong
  Cc: Juergen Gross, Wei Liu, Ian Campbell, George Dunlap,
	Stefano Stabellini, Ian Jackson, xen-devel, Jan Beulich,
	Nakajima, Jun, Xiao Guangrong, Keir Fraser

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Tuesday, February 02, 2016 2:25 AM
> 
> On 01/02/16 05:44, Haozhong Zhang wrote:
> > Hi,
> >
> > The following document describes the design of adding vNVDIMM support
> > for Xen. Any comments are welcome.
> >
> > Thanks,
> > Haozhong
> 
> Thankyou for doing this.  It is a very comprehensive document, and a
> fantastic example for future similar situations.

Agreed. It's a very good doc to help the following discussions.

> 
> 
> To start with however, I would like to clear up my confusion over the
> the usecases of pmem vs pblk.
> 
> pblk, using indirect access, is less efficient than pmem.  NVDIMMs
> themselves are slower (and presumably more expensive) than equivalent
> RAM, and presumably still has a finite number of write cycles,  so I
> don't buy an argument suggesting that they are a plausible replacement
> for real RAM.

For pblk, I think it's more meaningful to compare it with today's SSDs.

> 
> I presume therefore that a system would only choose to use pblk mode in
> situations where the host physical address space is a limiting factor.
> Are there other situations which I have overlooked?

I think you mean the RAM size limitation here. Today the 48-bit
physical address space is not a limit yet. :-)

> 
> Secondly, I presume that pmem vs pblk will be a firmware decision and
> fixed from the point of view of the Operating System?
> 

Yes, from the way ACPI describes it.

Thanks
Kevin


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
From: Haozhong Zhang @ 2016-02-02  3:44 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell, George Dunlap,
	Stefano Stabellini, Ian Jackson, xen-devel, Jan Beulich,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/01/16 18:25, Andrew Cooper wrote:
> On 01/02/16 05:44, Haozhong Zhang wrote:
> > Hi,
> >
> > The following document describes the design of adding vNVDIMM support
> > for Xen. Any comments are welcome.
> >
> > Thanks,
> > Haozhong
> 
> Thankyou for doing this.  It is a very comprehensive document, and a
> fantastic example for future similar situations.
> 
> 
> To start with however, I would like to clear up my confusion over the
> the usecases of pmem vs pblk.
> 
> pblk, using indirect access, is less efficient than pmem.  NVDIMMs
> themselves are slower (and presumably more expensive) than equivalent
> RAM, and presumably still has a finite number of write cycles,  so I
> don't buy an argument suggesting that they are a plausible replacement
> for real RAM.
> 
> I presume therefore that a system would only choose to use pblk mode in
> situations where the host physical address space is a limiting factor. 
> Are there other situations which I have overlooked?
>

Limited physical address space is one concern. Another concern is that
pblk can be used by drivers to provide better RAS, like better error
detection and power-fail write atomicity. See Section "NVDIMM Driver"
in Chapter 1 of [3] for more details.

> Secondly, I presume that pmem vs pblk will be a firmware decision and
> fixed from the point of view of the Operating System?
>

The specifications at hand [1-4] do not mention which component is in
charge of partitioning an NVDIMM into pmem and pblk. However, as NFIT
uses separate SPA range structures for pmem and pblk regions, I also
presume that the firmware (BIOS/EFI, or firmware on the NVDIMM
devices) determines the partitioning.

In addition, some NVDIMM vendors may provide specific _DSM commands to
allow software (OS/drivers) to reconfigure the pmem/pblk partitioning,
but those changes only take effect after a reboot. If the OS/drivers or
system administrators decide to do so, IMO they should make sure no
users are currently using those NVDIMMs and that data on the NVDIMMs
has already been properly handled.

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
From: Tian, Kevin @ 2016-02-02  6:33 UTC (permalink / raw)
  To: Zhang, Haozhong, xen-devel
  Cc: Juergen Gross, Wei Liu, Ian Campbell, George Dunlap,
	Andrew Cooper, Stefano Stabellini, Ian Jackson, Jan Beulich,
	Nakajima, Jun, Xiao Guangrong, Keir Fraser

> From: Zhang, Haozhong
> Sent: Monday, February 01, 2016 1:44 PM
> 
[...]
> 
> 1.2 ACPI Support
> 
>  ACPI provides two factors of support for NVDIMM. First, NVDIMM
>  devices are described by firmware (BIOS/EFI) to OS via ACPI-defined
>  NVDIMM Firmware Interface Table (NFIT). Second, several functions of
>  NVDIMM, including operations on namespace labels, S.M.A.R.T and
>  hotplug, are provided by ACPI methods (_DSM and _FIT).
> 
> 1.2.1 NFIT
> 
>  NFIT is a new system description table added in ACPI v6 with
>  signature "NFIT". It contains a set of structures.

Can I consider only NFIT as the minimal requirement, while the other
parts (_DSM and _FIT) are optional?

> 
> 
> 2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU
> 
> 2.1 NVDIMM Driver in Linux Kernel
> 
[...]
> 
>  Userspace applications can mmap(2) the whole pmem into its own
>  virtual address space. Linux kernel maps the system physical address
>  space range occupied by pmem into the virtual address space, so that every
>  normal memory loads/writes with proper flushing instructions are
>  applied to the underlying pmem NVDIMM regions.
> 
>  Alternatively, a DAX file system can be made on /dev/pmemX. Files on
>  that file system can be used in the same way as above. As Linux
>  kernel maps the system address space range occupied by those files on
>  NVDIMM to the virtual address space, reads/writes on those files are
>  applied to the underlying NVDIMM regions as well.

Does it mean only a file-based interface is supported by Linux today,
and a pmem-aware application cannot use a normal memory allocation
interface like malloc for this purpose?

> 
> 2.2 vNVDIMM Implementation in KVM/QEMU
> 
>  (1) Address Mapping
> 
>   As described before, the host Linux NVDIMM driver provides a block
>   device interface (/dev/pmem0 at the bottom) for a pmem NVDIMM
>   region. QEMU can than mmap(2) that device into its virtual address
>   space (buf). QEMU is responsible to find a proper guest physical
>   address space range that is large enough to hold /dev/pmem0. Then
>   QEMU passes the virtual address of mmapped buf to a KVM API
>   KVM_SET_USER_MEMORY_REGION that maps in EPT the host physical
>   address range of buf to the guest physical address space range where
>   the virtual pmem device will be.
> 
>   In this way, all guest writes/reads on the virtual pmem device is
>   applied directly to the host one.
> 
>   Besides, above implementation also allows to back a virtual pmem
>   device by a mmapped regular file or a piece of ordinary ram.

What's the point of backing pmem with ordinary RAM? I can buy into
the value of the file-backed option, which although slower does sustain
the persistency attribute. However, with the RAM-backed method there's
no persistency, so it violates the guest's expectation.

btw, how is persistency guaranteed in KVM/QEMU across guest
power off/on? I guess since the QEMU process is killed the allocated
pmem will be freed, so you may switch to the file-backed method to keep
persistency (however, the copy would take time for a large pmem chunk).
Or will you find some way to keep pmem managed separately from the QEMU
life-cycle (then pmem is not efficiently reused)?

> 3. Design of vNVDIMM in Xen
> 
> 3.2 Address Mapping
> 
> 3.2.2 Alternative Design
> 
>  Jan Beulich's comments [7] on my question "why must pmem resource
>  management and partition be done in hypervisor":
>  | Because that's where memory management belongs. And PMEM,
>  | other than PBLK, is just another form of RAM.
>  | ...
>  | The main issue is that this would imo be a layering violation
> 
>  George Dunlap's comments [8]:
>  | This is not the case for PMEM.  The whole point of PMEM (correct me if
>    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
>  | I'm wrong) is to be used for long-term storage that survives over
>  | reboot.  It matters very much that a guest be given the same PRAM
>  | after the host is rebooted that it was given before.  It doesn't make
>  | any sense to manage it the way Xen currently manages RAM (i.e., that
>  | you request a page and get whatever Xen happens to give you).
>  |
>  | So if Xen is going to use PMEM, it will have to invent an entirely new
>  | interface for guests, and it will have to keep track of those
>  | resources across host reboots.  In other words, it will have to
>  | duplicate all the work that Linux already does.  What do we gain from
>  | that duplication?  Why not just leverage what's already implemented in
>  | dom0?
>  and [9]:
>  | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
>  | then you're right -- it is just another form of RAM, that should be
>  | treated no differently than say, lowmem: a fungible resource that can be
>  | requested by setting a flag.
> 
>  However, pmem is used more as persistent storage than fungible ram,
>  and my design is for the former usage. I would like to leave the
>  detection, driver and partition (either through namespace or file
>  systems) of NVDIMM in Dom0 Linux kernel.

After reading the whole introduction I vote for this option too. One
immediate reason why a resource should be managed in Xen is that Xen
itself also uses it, e.g. normal RAM. In that case Xen has to control
the whole resource to protect itself from Dom0 and other user VMs. Given
a resource that Xen does not use at all, it's reasonable to leave it to
Dom0, which reduces code duplication and unnecessary maintenance burden
on the Xen side, as we have done for the whole PCI sub-system and other
I/O peripherals. I'm not sure whether there's future value in using pmem
in Xen itself; at least for now the primary requirement is about
exposing pmem to guests. From that angle, reusing the NVDIMM driver in
Dom0 looks like the better choice, with less enabling effort to catch up
with KVM.

Thanks
Kevin


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
From: Zhang, Haozhong @ 2016-02-02  7:39 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Juergen Gross, Wei Liu, Ian Campbell, George Dunlap,
	Andrew Cooper, Stefano Stabellini, Ian Jackson, xen-devel,
	Jan Beulich, Nakajima, Jun, Xiao Guangrong, Keir Fraser

Hi Kevin,

Thanks for your review!

On 02/02/16 14:33, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Monday, February 01, 2016 1:44 PM
> > 
> [...]
> > 
> > 1.2 ACPI Support
> > 
> >  ACPI provides two factors of support for NVDIMM. First, NVDIMM
> >  devices are described by firmware (BIOS/EFI) to OS via ACPI-defined
> >  NVDIMM Firmware Interface Table (NFIT). Second, several functions of
> >  NVDIMM, including operations on namespace labels, S.M.A.R.T and
> >  hotplug, are provided by ACPI methods (_DSM and _FIT).
> > 
> > 1.2.1 NFIT
> > 
> >  NFIT is a new system description table added in ACPI v6 with
> >  signature "NFIT". It contains a set of structures.
> 
> Can I consider only NFIT as a minimal requirement, while other stuff
> (_DSM and _FIT) are optional?
>

No. ACPI namespace devices for NVDIMM should also be present. However,
the _DSM under those ACPI namespace devices can be implemented to
support no functions. _FIT is optional and is used for NVDIMM hotplug.

> > 
> > 
> > 2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU
> > 
> > 2.1 NVDIMM Driver in Linux Kernel
> > 
> [...]
> > 
> >  Userspace applications can mmap(2) the whole pmem into its own
> >  virtual address space. Linux kernel maps the system physical address
> >  space range occupied by pmem into the virtual address space, so that every
> >  normal memory loads/writes with proper flushing instructions are
> >  applied to the underlying pmem NVDIMM regions.
> > 
> >  Alternatively, a DAX file system can be made on /dev/pmemX. Files on
> >  that file system can be used in the same way as above. As Linux
> >  kernel maps the system address space range occupied by those files on
> >  NVDIMM to the virtual address space, reads/writes on those files are
> >  applied to the underlying NVDIMM regions as well.
> 
> Does it mean only file-based interface is supported by Linux today, and 
> pmem aware application cannot use normal memory allocation interface
> like malloc for the purpose?
>

right

> > 
> > 2.2 vNVDIMM Implementation in KVM/QEMU
> > 
> >  (1) Address Mapping
> > 
> >   As described before, the host Linux NVDIMM driver provides a block
> >   device interface (/dev/pmem0 at the bottom) for a pmem NVDIMM
> >   region. QEMU can than mmap(2) that device into its virtual address
> >   space (buf). QEMU is responsible to find a proper guest physical
> >   address space range that is large enough to hold /dev/pmem0. Then
> >   QEMU passes the virtual address of mmapped buf to a KVM API
> >   KVM_SET_USER_MEMORY_REGION that maps in EPT the host physical
> >   address range of buf to the guest physical address space range where
> >   the virtual pmem device will be.
> > 
> >   In this way, all guest writes/reads on the virtual pmem device is
> >   applied directly to the host one.
> > 
> >   Besides, above implementation also allows to back a virtual pmem
> >   device by a mmapped regular file or a piece of ordinary ram.
> 
> What's the point of backing pmem with ordinary ram? I can buy-in
> the value of file-backed option which although slower does sustain
> the persistency attribute. However with ram-backed method there's
> no persistency so violates guest expectation.
>

Well, it is not a necessity. The current vNVDIMM implementation in
QEMU reuses QEMU's DIMM device, which happens to support a RAM backend.
A possible usage is debugging vNVDIMM on machines without NVDIMM.

> btw, how is persistency guaranteed in KVM/QEMU, cross guest 
> power off/on? I guess since Qemu process is killed the allocated pmem
> will be freed so you may switch to file-backed method to keep 
> persistency (however copy would take time for large pmem trunk). Or
> will you find some way to keep pmem managed separated from qemu
> qemu life-cycle (then pmem is not efficiently reused)?
>

It all depends on guests themselves. The clwb/clflushopt/pcommit
instructions are exposed to the guest and are used by guests to make
writes to pmem persistent.

Haozhong

> > 3. Design of vNVDIMM in Xen
> > 
> > 3.2 Address Mapping
> > 
> > 3.2.2 Alternative Design
> > 
> >  Jan Beulich's comments [7] on my question "why must pmem resource
> >  management and partition be done in hypervisor":
> >  | Because that's where memory management belongs. And PMEM,
> >  | other than PBLK, is just another form of RAM.
> >  | ...
> >  | The main issue is that this would imo be a layering violation
> > 
> >  George Dunlap's comments [8]:
> >  | This is not the case for PMEM.  The whole point of PMEM (correct me if
> >    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
> >  | I'm wrong) is to be used for long-term storage that survives over
> >  | reboot.  It matters very much that a guest be given the same PRAM
> >  | after the host is rebooted that it was given before.  It doesn't make
> >  | any sense to manage it the way Xen currently manages RAM (i.e., that
> >  | you request a page and get whatever Xen happens to give you).
> >  |
> >  | So if Xen is going to use PMEM, it will have to invent an entirely new
> >  | interface for guests, and it will have to keep track of those
> >  | resources across host reboots.  In other words, it will have to
> >  | duplicate all the work that Linux already does.  What do we gain from
> >  | that duplication?  Why not just leverage what's already implemented in
> >  | dom0?
> >  and [9]:
> >  | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
> >  | then you're right -- it is just another form of RAM, that should be
> >  | treated no differently than say, lowmem: a fungible resource that can be
> >  | requested by setting a flag.
> > 
> >  However, pmem is used more as persistent storage than fungible ram,
> >  and my design is for the former usage. I would like to leave the
> >  detection, driver and partition (either through namespace or file
> >  systems) of NVDIMM in Dom0 Linux kernel.
> 
> After reading the whole introduction I vote for this option too. One immediate
> reason why a resource should be managed in Xen, is whether Xen itself also
> uses it, e.g. normal RAM. In that case Xen has to control the whole resource to 
> protect itself from Dom0 and other user VMs. Given a resource not used by
> Xen completely, it's reasonable to leave it to Dom0 which reduces code duplication
> and unnecessary maintenance burden in Xen side, as we have done for whole
> PCI sub-system and other I/O peripherals. I'm not sure whether there's future
> value to use pmem in Xen itself, at least for now the primary requirement is 
> about exposing pmem to guest. From that angle reusing NVDIMM driver in
> Dom0 looks the better choice with less enabling effort to catch up with KVM.
> 
> Thanks
> Kevin


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
From: Tian, Kevin @ 2016-02-02  7:48 UTC (permalink / raw)
  To: Zhang, Haozhong
  Cc: Juergen Gross, Wei Liu, Ian Campbell, George Dunlap,
	Andrew Cooper, Stefano Stabellini, Ian Jackson, xen-devel,
	Jan Beulich, Nakajima, Jun, Xiao Guangrong, Keir Fraser

> From: Zhang, Haozhong
> Sent: Tuesday, February 02, 2016 3:39 PM
> 
> > btw, how is persistency guaranteed in KVM/QEMU, cross guest
> > power off/on? I guess since Qemu process is killed the allocated pmem
> > will be freed so you may switch to file-backed method to keep
> > persistency (however copy would take time for large pmem trunk). Or
> > will you find some way to keep pmem managed separated from qemu
> > qemu life-cycle (then pmem is not efficiently reused)?
> >
> 
> It all depends on guests themselves. clwb/clflushopt/pcommit
> instructions are exposed to guest that are used by guests to make
> writes to pmem persistent.
> 

I meant that from the guest's p.o.v., a range of pmem should be
persistent across VM power off/on, i.e. the content needs to be
maintained somewhere so the guest can get it at the next power on...

Thanks
Kevin


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
From: Zhang, Haozhong @ 2016-02-02  7:53 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Juergen Gross, Wei Liu, Ian Campbell, George Dunlap,
	Andrew Cooper, Stefano Stabellini, Ian Jackson, xen-devel,
	Jan Beulich, Nakajima, Jun, Xiao Guangrong, Keir Fraser

On 02/02/16 15:48, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Tuesday, February 02, 2016 3:39 PM
> > 
> > > btw, how is persistency guaranteed in KVM/QEMU, cross guest
> > > power off/on? I guess since Qemu process is killed the allocated pmem
> > > will be freed so you may switch to file-backed method to keep
> > > persistency (however copy would take time for large pmem trunk). Or
> > > will you find some way to keep pmem managed separated from qemu
> > > qemu life-cycle (then pmem is not efficiently reused)?
> > >
> > 
> > It all depends on guests themselves. clwb/clflushopt/pcommit
> > instructions are exposed to guest that are used by guests to make
> > writes to pmem persistent.
> > 
> 
> I meant from guest p.o.v, a range of pmem should be persistent
> cross VM power on/off, i.e. the content needs to be maintained
> somewhere so guest can get it at next power on...
> 
> Thanks
> Kevin

It's just like what we do for guest disks: as long as we always assign
the same host pmem device, or the same files on file systems on a host
pmem device, to the guest, the guest can find its last data on pmem.

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
From: Tian, Kevin @ 2016-02-02  8:03 UTC (permalink / raw)
  To: Zhang, Haozhong
  Cc: Juergen Gross, Wei Liu, Ian Campbell, George Dunlap,
	Andrew Cooper, Stefano Stabellini, Ian Jackson, xen-devel,
	Jan Beulich, Nakajima, Jun, Xiao Guangrong, Keir Fraser

> From: Zhang, Haozhong
> Sent: Tuesday, February 02, 2016 3:53 PM
> 
> On 02/02/16 15:48, Tian, Kevin wrote:
> > > From: Zhang, Haozhong
> > > Sent: Tuesday, February 02, 2016 3:39 PM
> > >
> > > > btw, how is persistency guaranteed in KVM/QEMU, cross guest
> > > > power off/on? I guess since Qemu process is killed the allocated pmem
> > > > will be freed so you may switch to file-backed method to keep
> > > > persistency (however copy would take time for large pmem trunk). Or
> > > > will you find some way to keep pmem managed separated from qemu
> > > > qemu life-cycle (then pmem is not efficiently reused)?
> > > >
> > >
> > > It all depends on guests themselves. clwb/clflushopt/pcommit
> > > instructions are exposed to guest that are used by guests to make
> > > writes to pmem persistent.
> > >
> >
> > I meant from guest p.o.v, a range of pmem should be persistent
> > cross VM power on/off, i.e. the content needs to be maintained
> > somewhere so guest can get it at next power on...
> >
> > Thanks
> > Kevin
> 
> It's just like what we do for guest disk: as long as we always assign
> the same host pmem device or the same files on file systems on a host
> pmem device to the guest, the guest can find its last data on pmem.
> 
> Haozhong

This is the detail which I'd like to learn. If it's QEMU that requests
the host pmem and then frees it on exit, the very same pmem may be
allocated to another process later. How do you achieve the 'as
long as'?

Thanks
Kevin


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
From: Zhang, Haozhong @ 2016-02-02  8:49 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Juergen Gross, Wei Liu, Ian Campbell, George Dunlap,
	Andrew Cooper, Stefano Stabellini, Ian Jackson, xen-devel,
	Jan Beulich, Nakajima, Jun, Xiao Guangrong, Keir Fraser

On 02/02/16 16:03, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Tuesday, February 02, 2016 3:53 PM
> > 
> > On 02/02/16 15:48, Tian, Kevin wrote:
> > > > From: Zhang, Haozhong
> > > > Sent: Tuesday, February 02, 2016 3:39 PM
> > > >
> > > > > btw, how is persistency guaranteed in KVM/QEMU, cross guest
> > > > > power off/on? I guess since Qemu process is killed the allocated pmem
> > > > > will be freed so you may switch to file-backed method to keep
> > > > > persistency (however copy would take time for large pmem trunk). Or
> > > > > will you find some way to keep pmem managed separated from qemu
> > > > > qemu life-cycle (then pmem is not efficiently reused)?
> > > > >
> > > >
> > > > It all depends on guests themselves. clwb/clflushopt/pcommit
> > > > instructions are exposed to guest that are used by guests to make
> > > > writes to pmem persistent.
> > > >
> > >
> > > I meant from guest p.o.v, a range of pmem should be persistent
> > > cross VM power on/off, i.e. the content needs to be maintained
> > > somewhere so guest can get it at next power on...
> > >
> > > Thanks
> > > Kevin
> > 
> > It's just like what we do for guest disk: as long as we always assign
> > the same host pmem device or the same files on file systems on a host
> > pmem device to the guest, the guest can find its last data on pmem.
> > 
> > Haozhong
> 
> This is the detail which I'd like to learn. If it's Qemu to request 
> host pmem and then free when exit, the very pmem may be 
> allocated to another process later. How do you achieve the 'as
> long as'?
> 

QEMU receives the following parameters

     -object memory-backend-file,id=mem1,share,mem-path=/dev/pmem0,size=10G \
     -device nvdimm,memdev=mem1,id=nv1
     
which configure a vNVDIMM device backed by the host pmem device
/dev/pmem0. The device can also be replaced by files on the file
system on /dev/pmem0. The system address space range occupied by
/dev/pmem0, or by files on /dev/pmem0, is then mapped into the guest
physical address space, and all accesses from the guest are then
directly applied to the host device without any interception by QEMU.

If we always provide the same vNVDIMM parameters (especially mem-path
and size), the guest can observe the same vNVDIMM devices across
boots.

In Xen, I'll implement similar configuration options in xl.cfg, e.g.
     nvdimms = [ '/dev/pmem0', '/dev/pmem1', '/mnt/pmem2/file_on_pmem2' ]

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
From: Andrew Cooper @ 2016-02-02 11:09 UTC (permalink / raw)
  To: xen-devel, Jan Beulich, Stefano Stabellini,
	Konrad Rzeszutek Wilk, George Dunlap, Ian Jackson, Ian Campbell,
	Juergen Gross, Wei Liu, Kevin Tian, Xiao Guangrong, Keir Fraser,
	Jun Nakajima

On 02/02/16 03:44, Haozhong Zhang wrote:
> On 02/01/16 18:25, Andrew Cooper wrote:
>> On 01/02/16 05:44, Haozhong Zhang wrote:
>>> Hi,
>>>
>>> The following document describes the design of adding vNVDIMM support
>>> for Xen. Any comments are welcome.
>>>
>>> Thanks,
>>> Haozhong
>> Thankyou for doing this.  It is a very comprehensive document, and a
>> fantastic example for future similar situations.
>>
>>
>> To start with however, I would like to clear up my confusion over the
>> the usecases of pmem vs pblk.
>>
>> pblk, using indirect access, is less efficient than pmem.  NVDIMMs
>> themselves are slower (and presumably more expensive) than equivalent
>> RAM, and presumably still have a finite number of write cycles, so I
>> don't buy an argument suggesting that they are a plausible replacement
>> for real RAM.
>>
>> I presume therefore that a system would only choose to use pblk mode in
>> situations where the host physical address space is a limiting factor. 
>> Are there other situations which I have overlooked?
>>
> Limited physical address space is one concern. Another concern is that
> pblk can be used by drivers to provide better RAS, like better error
> detection and power-fail write atomicity. See Section "NVDIMM Driver"
> in Chapter 1 of [3] for more details.

Ah ok.  So even with no limiting factors to consider, it would be a
plausible design choice to use it in pblk mode.

>
>> Secondly, I presume that pmem vs pblk will be a firmware decision and
>> fixed from the point of view of the Operating System?
>>
> The specifications at hand [1-4] do not mention which one is in charge
> of partitioning NVDIMM into pmem and pblk. However, as NFIT uses
> separate SPA range structures for pmem and pblk regions, I also
> presume that firmware (BIOS/EFI, or firmware on NVDIMM devices)
> determines the partition.
>
> In addition, some NVDIMM vendors may provide specific _DSM commands to
> allow software (OS/drivers) to reconfigure the pmem/pblk partition,
> but those changes only take effect after reboot. If OS/drivers or
> system administrators decide to do so, IMO they should make sure no
> users are currently using those NVDIMMs and data on NVDIMMs is already
> properly handled.

Ok.  Either way, it is going to be an administrator decision, and the
layout is not going to change under the feet of a running operating system.

~Andrew


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-01  5:44 [RFC Design Doc] Add vNVDIMM support for Xen Haozhong Zhang
  2016-02-01 18:25 ` Andrew Cooper
  2016-02-02  6:33 ` Tian, Kevin
@ 2016-02-02 17:11 ` Stefano Stabellini
  2016-02-03  7:00   ` Haozhong Zhang
  2016-02-02 19:15 ` Konrad Rzeszutek Wilk
  2016-02-18 17:17 ` Jan Beulich
  4 siblings, 1 reply; 121+ messages in thread
From: Stefano Stabellini @ 2016-02-02 17:11 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell, George Dunlap,
	Andrew Cooper, Stefano Stabellini, Ian Jackson, xen-devel,
	Jan Beulich, Jun Nakajima, Xiao Guangrong, Keir Fraser

Haozhong, thanks for your work!

On Mon, 1 Feb 2016, Haozhong Zhang wrote:
> 3.2 Address Mapping
> 
> 3.2.1 My Design
> 
>  The overview of this design is shown in the following figure.
> 
>                  Dom0                         |               DomU
>                                               |
>                                               |
>  QEMU                                         |
>      +...+--------------------+...+-----+     |
>   VA |   | Label Storage Area |   | buf |     |
>      +...+--------------------+...+-----+     |
>                      ^            ^     ^     |
>                      |            |     |     |
>                      V            |     |     |
>      +-------+   +-------+        mmap(2)     |
>      | vACPI |   | v_DSM |        |     |     |        +----+------------+
>      +-------+   +-------+        |     |     |   SPA  |    | /dev/pmem0 |
>          ^           ^     +------+     |     |        +----+------------+
>  --------|-----------|-----|------------|--   |             ^            ^
>          |           |     |            |     |             |            |
>          |    +------+     +------------~-----~-------------+            |
>          |    |            |            |     |        XEN_DOMCTL_memory_mapping
>          |    |            |            +-----~--------------------------+
>          |    |            |            |     |
>          |    |       +----+------------+     |
>  Linux   |    |   SPA |    | /dev/pmem0 |     |     +------+   +------+
>          |    |       +----+------------+     |     | ACPI |   | _DSM |
>          |    |                   ^           |     +------+   +------+
>          |    |                   |           |         |          |
>          |    |               Dom0 Driver     |   hvmloader/xl     |
>  --------|----|-------------------|---------------------|----------|---------------
>          |    +-------------------~---------------------~----------+
>  Xen     |                        |                     |
>          +------------------------~---------------------+
>  ---------------------------------|------------------------------------------------
>                                   +----------------+
>                                                    |
>                                             +-------------+
>  HW                                         |    NVDIMM   |
>                                             +-------------+
> 
> 
>  This design treats host NVDIMM devices as ordinary MMIO devices:
>  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
>      and drive host NVDIMM devices (implementing block device
>      interface). Namespaces and file systems on host NVDIMM devices
>      are handled by Dom0 Linux as well.
> 
>  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
>      virtual address space (buf).
> 
>  (3) QEMU gets the host physical address of buf, i.e. the host system
>      physical address that is occupied by /dev/pmem0, and calls Xen
>      hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.

How is this going to work from a security perspective? Is it going to
require running QEMU as root in Dom0, which will prevent NVDIMM from
working by default on Xen? If so, what's the plan?



>  (ACPI part is described in Section 3.3 later)
> 
>  Above (1)(2) have already been done in current QEMU. Only (3) is
>  needed to implement in QEMU. No change is needed in Xen for address
>  mapping in this design.
> 
>  Open: It seems no system call/ioctl is provided by Linux kernel to
>        get the physical address from a virtual address.
>        /proc/<qemu_pid>/pagemap provides information of mapping from
>        VA to PA. Is it an acceptable solution to let QEMU parse this
>        file to get the physical address?

Does it work in a non-root scenario?


>  Open: For a large pmem, mmap(2) is very possible to not map all SPA
>        occupied by pmem at the beginning, i.e. QEMU may not be able to
>        get all SPA of pmem from buf (in virtual address space) when
>        calling XEN_DOMCTL_memory_mapping.
>        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
>        entire pmem being mmaped?

Ditto


> 3.2.2 Alternative Design
> 
>  Jan Beulich's comments [7] on my question "why must pmem resource
>  management and partition be done in hypervisor":
>  | Because that's where memory management belongs. And PMEM,
>  | other than PBLK, is just another form of RAM.
>  | ...
>  | The main issue is that this would imo be a layering violation
> 
>  George Dunlap's comments [8]:
>  | This is not the case for PMEM.  The whole point of PMEM (correct me if
>    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
>  | I'm wrong) is to be used for long-term storage that survives over
>  | reboot.  It matters very much that a guest be given the same PRAM
>  | after the host is rebooted that it was given before.  It doesn't make
>  | any sense to manage it the way Xen currently manages RAM (i.e., that
>  | you request a page and get whatever Xen happens to give you).
>  |
>  | So if Xen is going to use PMEM, it will have to invent an entirely new
>  | interface for guests, and it will have to keep track of those
>  | resources across host reboots.  In other words, it will have to
>  | duplicate all the work that Linux already does.  What do we gain from
>  | that duplication?  Why not just leverage what's already implemented in
>  | dom0?
>  and [9]:
>  | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
>  | then you're right -- it is just another form of RAM, that should be
>  | treated no differently than say, lowmem: a fungible resource that can be
>  | requested by setting a flag.
> 
>  However, pmem is used more as persistent storage than fungible ram,
>  and my design is for the former usage. I would like to leave the
>  detection, driver and partition (either through namespace or file
>  systems) of NVDIMM in Dom0 Linux kernel.
> 
>  I notice that the current XEN_DOMCTL_memory_mapping does not make a sanity
>  check of the physical address and size passed from the caller
>  (QEMU). Can QEMU always be trusted? If not, we would need to make Xen
>  aware of the SPA ranges of pmem so that it can refuse to map physical
>  addresses that are in neither normal ram nor pmem.

Indeed


>  Instead of duplicating the detection code (parsing NFIT and
>  evaluating _FIT) in Dom0 Linux kernel, we decide to patch Dom0 Linux
>  kernel to pass parameters of host pmem NVDIMM devices to Xen
>  hypervisor:
>  (1) Add a global
>        struct rangeset pmem_rangeset
>      in Xen hypervisor to record all SPA ranges of detected pmem devices.
>      Each range in pmem_rangeset corresponds to a pmem device.
> 
>  (2) Add a hypercall
>        XEN_SYSCTL_add_pmem_range
>      (should it be a sysctl or a platform op?)
>      that receives a pair of parameters (addr: starting SPA of pmem
>      region, len: size of pmem region) and add a range (addr, addr +
>      len - 1) in nvdimm_rangset.
> 
>  (3) Add a hypercall
>        XEN_DOMCTL_pmem_mapping
>      that takes the same parameters as XEN_DOMCTL_memory_mapping and
>      maps a given host pmem range to guest. It checks whether the
>      given host pmem range is in the pmem_rangeset before making the
>      actual mapping.
> 
>  (4) Patch Linux NVDIMM driver to call XEN_SYSCTL_add_pmem_range
>      whenever it detects a pmem device.
> 
>  (5) Patch QEMU to use XEN_DOMCTL_pmem_mapping for mapping host pmem
>      devices.
> 
> 
> 3.3 Guest ACPI Emulation
> 
> 3.3.1 My Design
> 
>  Guest ACPI emulation is composed of two parts: building guest NFIT
>  and SSDT that defines ACPI namespace devices for NVDIMM, and
>  emulating guest _DSM.
> 
>  (1) Building Guest ACPI Tables
> 
>   This design reuses and extends hvmloader's existing mechanism that
>   loads passthrough ACPI tables from binary files to load NFIT and
>   SSDT tables built by QEMU:
>   1) Because the current QEMU does not build any ACPI tables when
>      it runs as the Xen device model, this design needs to patch QEMU
>      to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
> 
>   2) QEMU copies NFIT and SSDT to the end of guest memory below
>      4G. The guest address and size of those tables are written into
>      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
> 
>   3) hvmloader is patched to probe and load device model passthrough
>      ACPI tables from above xenstore keys. The detected ACPI tables
>      are then appended to the end of existing guest ACPI tables just
>      like what current construct_passthrough_tables() does.
> 
>   Reasons for this design are listed below:
>   - NFIT and SSDT in question are quite self-contained, i.e. they do
>     not refer to other ACPI tables and not conflict with existing
>     guest ACPI tables in Xen. Therefore, it is safe to copy them from
>     QEMU and append to existing guest ACPI tables.
> 
>   - A primary portion of current and future vNVDIMM implementation is
>     about building ACPI tables. And this design also leaves the
>     emulation of _DSM to QEMU, which needs to stay consistent with the
>     NFIT and SSDT it builds. Therefore, reusing NFIT and SSDT from
>     QEMU can ease the maintenance.
> 
>   - Anthony's work to pass ACPI tables from the toolstack to hvmloader
>     does not move building SSDT (and NFIT) to toolstack, so this
>     design can still put them in hvmloader.

If we start asking QEMU to build ACPI tables, why should we stop at NFIT
and SSDT? Once upon a time somebody made the decision that ACPI tables
on Xen should be static and included in hvmloader. That might have been
a bad decision but at least it was coherent. Loading only *some* tables
from QEMU, but not others, it feels like an incomplete design to me.

For example, QEMU is currently in charge of emulating the PCI bus, why
shouldn't it be QEMU that generates the PRT and MCFG?


>  (2) Emulating Guest _DSM
> 
>   Because the same NFIT and SSDT are used, we can leave the emulation
>   of guest _DSM to QEMU. Just as what it does with KVM, QEMU registers
>   the _DSM buffer as MMIO region with Xen and then all guest
>   evaluations of _DSM are trapped and emulated by QEMU.
> 
> 3.3.2 Alternative Design 1: switching to QEMU
> 
>  Stefano Stabellini's comments [10]:
>  | I don't think it is wise to have two components which both think are
>  | in control of generating ACPI tables, hvmloader (soon to be the
>  | toolstack with Anthony's work) and QEMU. From an architectural
>  | perspective, it doesn't look robust to me.
>  |
>  | Could we take this opportunity to switch to QEMU generating the whole
>  | set of ACPI tables?
> 
>  So an alternative design could be switching to QEMU to generate the
>  whole set of guest ACPI tables. In this way, no controversy would
>  happen between multiple agents QEMU and hvmloader. (is this what
>  Stefano Stabellini mean by 'robust'?)

Right


>  However, looking at the code building ACPI tables in QEMU and
>  hvmloader, they are quite different. As ACPI tables are important for
>  OS to boot and operate device, it's critical to ensure ACPI tables
>  built by QEMU would not break existing guests on Xen. Though I
>  believe it could be done after a thorough investigation and
>  adjustment, it may take quite a lot of work and tests and should be
>  another project besides enabling vNVDIMM in Xen.
>
> 3.3.3 Alternative Design 2: keeping in Xen
> 
>  Alternative to switching to QEMU, another design would be building
>  NFIT and SSDT in hvmloader or toolstack.
> 
>  The amount and parameters of sub-structures in guest NFIT vary
>  according to different vNVDIMM configurations that can not be decided
>  at compile-time. In contrast, current hvmloader and toolstack can
>  only build static ACPI tables, i.e. their contents are decided
>  statically at compile-time and independent from the guest
>  configuration. In order to build guest NFIT at runtime, this design
>  may take following steps:
>  (1) xl converts NVDIMM configurations in xl.cfg to corresponding QEMU
>      options,
> 
>  (2) QEMU accepts above options, figures out the start SPA range
>      address/size/NVDIMM device handles/..., and writes them in
>      xenstore. No ACPI table is built by QEMU.
> 
>  (3) Either xl or hvmloader reads above parameters from xenstore and
>      builds the NFIT table.
> 
>  For guest SSDT, it would take more work. The ACPI namespace devices
>  are defined in SSDT by AML, so an AML builder would be needed to
>  generate those definitions at runtime.
> 
>  This alternative design still needs more work than the first design.

I prefer switching to QEMU building all ACPI tables for devices that it
is emulating. However this alternative is good too because it is
coherent with the current design.


> References:
> [1] ACPI Specification v6,
>     http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> [2] NVDIMM Namespace Specification,
>     http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> [3] NVDIMM Block Window Driver Writer's Guide,
>     http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> [4] NVDIMM DSM Interface Example,
>     http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> [5] UEFI Specification v2.6,
>     http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_6.pdf
> [6] Intel Architecture Instruction Set Extensions Programming Reference,
>     https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf
> [7] http://www.gossamer-threads.com/lists/xen/devel/414945#414945
> [8] http://www.gossamer-threads.com/lists/xen/devel/415658#415658
> [9] http://www.gossamer-threads.com/lists/xen/devel/415681#415681
> [10] http://lists.xenproject.org/archives/html/xen-devel/2016-01/msg00271.html
> 


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-02  6:33 ` Tian, Kevin
  2016-02-02  7:39   ` Zhang, Haozhong
@ 2016-02-02 19:01   ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 121+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-02-02 19:01 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Juergen Gross, Zhang, Haozhong, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Nakajima, Jun, Xiao Guangrong,
	Keir Fraser

> > 2.2 vNVDIMM Implementation in KVM/QEMU
> > 
> >  (1) Address Mapping
> > 
> >   As described before, the host Linux NVDIMM driver provides a block
> >   device interface (/dev/pmem0 at the bottom) for a pmem NVDIMM
> >   region. QEMU can then mmap(2) that device into its virtual address
> >   space (buf). QEMU is responsible for finding a proper guest physical
> >   address space range that is large enough to hold /dev/pmem0. Then
> >   QEMU passes the virtual address of the mmapped buf to a KVM API,
> >   KVM_SET_USER_MEMORY_REGION, which maps in EPT the host physical
> >   address range of buf to the guest physical address space range where
> >   the virtual pmem device will be.
> > 
> >   In this way, all guest writes/reads on the virtual pmem device are
> >   applied directly to the host one.
> > 
> >   Besides, the above implementation also allows backing a virtual pmem
> >   device with an mmapped regular file or a piece of ordinary ram.
> 
> What's the point of backing pmem with ordinary ram? I can buy into
> the value of the file-backed option, which, although slower, does sustain
> the persistency attribute. However, with the ram-backed method there's
> no persistency, so it violates guest expectations.

Containers - like the Intel Clear Containers? You can use this work
to stitch an exploded initramfs on a tmpfs right in the guest.
And you could do that for multiple guests.

Granted, this has nothing to do with pmem, but this work would allow
one to set up containers this way.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-01  5:44 [RFC Design Doc] Add vNVDIMM support for Xen Haozhong Zhang
                   ` (2 preceding siblings ...)
  2016-02-02 17:11 ` Stefano Stabellini
@ 2016-02-02 19:15 ` Konrad Rzeszutek Wilk
  2016-02-03  8:28   ` Haozhong Zhang
  2016-02-15  8:43   ` Haozhong Zhang
  2016-02-18 17:17 ` Jan Beulich
  4 siblings, 2 replies; 121+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-02-02 19:15 UTC (permalink / raw)
  To: xen-devel, Jan Beulich, Stefano Stabellini, George Dunlap,
	Andrew Cooper, Ian Jackson, Ian Campbell, Juergen Gross, Wei Liu,
	Kevin Tian, Xiao Guangrong, Keir Fraser, Jun Nakajima

> 3. Design of vNVDIMM in Xen

Thank you for this design!

> 
>  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
>  three parts:
>  (1) Guest clwb/clflushopt/pcommit enabling,
>  (2) Memory mapping, and
>  (3) Guest ACPI emulation.


.. MCE? and vMCE?

> 
>  The rest of this section present the design of each part
>  respectively. The basic design principle to reuse existing code in
>  Linux NVDIMM driver and QEMU as much as possible. As recent
>  discussions in the both Xen and QEMU mailing lists for the v1 patch
>  series, alternative designs are also listed below.
> 
> 
> 3.1 Guest clwb/clflushopt/pcommit Enabling
> 
>  The instruction enabling is simple and we do the same work as in KVM/QEMU.
>  - All three instructions are exposed to guest via guest cpuid.
>  - L1 guest pcommit is never intercepted by Xen.

I wish there were some watermarks like the PLE has.

My fear is that an unfriendly guest can issue sfence all day long,
flushing out other guests' MMC queues (the writes followed by pcommits).
Which means that a guest may have degraded performance as its
memory writes are being flushed out immediately, as if they were
being written to UC instead of WB memory.

In other words - the NVDIMM resource does not provide any resource
isolation. However, this may not be any different from what we have
nowadays with CPU caches.


>  - L1 hypervisor is allowed to intercept L2 guest pcommit.

clwb?

> 
> 
> 3.2 Address Mapping
> 
> 3.2.1 My Design
> 
>  The overview of this design is shown in the following figure.
> 
>                  Dom0                         |               DomU
>                                               |
>                                               |
>  QEMU                                         |
>      +...+--------------------+...+-----+     |
>   VA |   | Label Storage Area |   | buf |     |
>      +...+--------------------+...+-----+     |
>                      ^            ^     ^     |
>                      |            |     |     |
>                      V            |     |     |
>      +-------+   +-------+        mmap(2)     |
>      | vACPI |   | v_DSM |        |     |     |        +----+------------+
>      +-------+   +-------+        |     |     |   SPA  |    | /dev/pmem0 |
>          ^           ^     +------+     |     |        +----+------------+
>  --------|-----------|-----|------------|--   |             ^            ^
>          |           |     |            |     |             |            |
>          |    +------+     +------------~-----~-------------+            |
>          |    |            |            |     |        XEN_DOMCTL_memory_mapping
>          |    |            |            +-----~--------------------------+
>          |    |            |            |     |
>          |    |       +----+------------+     |
>  Linux   |    |   SPA |    | /dev/pmem0 |     |     +------+   +------+
>          |    |       +----+------------+     |     | ACPI |   | _DSM |
>          |    |                   ^           |     +------+   +------+
>          |    |                   |           |         |          |
>          |    |               Dom0 Driver     |   hvmloader/xl     |
>  --------|----|-------------------|---------------------|----------|---------------
>          |    +-------------------~---------------------~----------+
>  Xen     |                        |                     |
>          +------------------------~---------------------+
>  ---------------------------------|------------------------------------------------
>                                   +----------------+
>                                                    |
>                                             +-------------+
>  HW                                         |    NVDIMM   |
>                                             +-------------+
> 
> 
>  This design treats host NVDIMM devices as ordinary MMIO devices:

Nice.

But it also means you need Xen to 'share' the ranges of an MMIO device.

That is you may need dom0 _DSM method to access certain ranges
(the AML code may need to poke there) - and the guest may want to access
those as well.

And keep in mind that this NVDIMM management may not need to be always
in initial domain. As in you could have NVDIMM device drivers that would
carve out the ranges to guests.

>  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
>      and drive host NVDIMM devices (implementing block device
>      interface). Namespaces and file systems on host NVDIMM devices
>      are handled by Dom0 Linux as well.
> 
>  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
>      virtual address space (buf).
> 
>  (3) QEMU gets the host physical address of buf, i.e. the host system
>      physical address that is occupied by /dev/pmem0, and calls Xen
>      hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
> 
>  (ACPI part is described in Section 3.3 later)
> 
>  Above (1)(2) have already been done in current QEMU. Only (3) is
>  needed to implement in QEMU. No change is needed in Xen for address
>  mapping in this design.
> 
>  Open: It seems no system call/ioctl is provided by Linux kernel to
>        get the physical address from a virtual address.
>        /proc/<qemu_pid>/pagemap provides information of mapping from
>        VA to PA. Is it an acceptable solution to let QEMU parse this
>        file to get the physical address?
> 
>  Open: For a large pmem, mmap(2) is very possible to not map all SPA
>        occupied by pmem at the beginning, i.e. QEMU may not be able to
>        get all SPA of pmem from buf (in virtual address space) when
>        calling XEN_DOMCTL_memory_mapping.
>        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
>        entire pmem being mmaped?
> 
> 3.2.2 Alternative Design
> 
>  Jan Beulich's comments [7] on my question "why must pmem resource
>  management and partition be done in hypervisor":
>  | Because that's where memory management belongs. And PMEM,
>  | other than PBLK, is just another form of RAM.
>  | ...
>  | The main issue is that this would imo be a layering violation
> 
>  George Dunlap's comments [8]:
>  | This is not the case for PMEM.  The whole point of PMEM (correct me if
>    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
>  | I'm wrong) is to be used for long-term storage that survives over
>  | reboot.  It matters very much that a guest be given the same PRAM
>  | after the host is rebooted that it was given before.  It doesn't make
>  | any sense to manage it the way Xen currently manages RAM (i.e., that
>  | you request a page and get whatever Xen happens to give you).
>  |
>  | So if Xen is going to use PMEM, it will have to invent an entirely new
>  | interface for guests, and it will have to keep track of those
>  | resources across host reboots.  In other words, it will have to
>  | duplicate all the work that Linux already does.  What do we gain from
>  | that duplication?  Why not just leverage what's already implemented in
>  | dom0?
>  and [9]:
>  | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
>  | then you're right -- it is just another form of RAM, that should be
>  | treated no differently than say, lowmem: a fungible resource that can be
>  | requested by setting a flag.
> 
>  However, pmem is used more as persistent storage than fungible ram,
>  and my design is for the former usage. I would like to leave the
>  detection, driver and partition (either through namespace or file
>  systems) of NVDIMM in Dom0 Linux kernel.
> 
>  I notice that the current XEN_DOMCTL_memory_mapping does not make a sanity
>  check of the physical address and size passed from the caller
>  (QEMU). Can QEMU always be trusted? If not, we would need to make Xen
>  aware of the SPA ranges of pmem so that it can refuse to map physical
>  addresses that are in neither normal ram nor pmem.

/me nods.
> 
>  Instead of duplicating the detection code (parsing NFIT and
>  evaluating _FIT) in Dom0 Linux kernel, we decide to patch Dom0 Linux
>  kernel to pass parameters of host pmem NVDIMM devices to Xen
>  hypervisor:
>  (1) Add a global
>        struct rangeset pmem_rangeset
>      in Xen hypervisor to record all SPA ranges of detected pmem devices.
>      Each range in pmem_rangeset corresponds to a pmem device.
> 
>  (2) Add a hypercall
>        XEN_SYSCTL_add_pmem_range
>      (should it be a sysctl or a platform op?)
>      that receives a pair of parameters (addr: starting SPA of pmem
>      region, len: size of pmem region) and add a range (addr, addr +
>      len - 1) in nvdimm_rangset.
> 
>  (3) Add a hypercall
>        XEN_DOMCTL_pmem_mapping
>      that takes the same parameters as XEN_DOMCTL_memory_mapping and
>      maps a given host pmem range to guest. It checks whether the
>      given host pmem range is in the pmem_rangeset before making the
>      actual mapping.
> 
>  (4) Patch Linux NVDIMM driver to call XEN_SYSCTL_add_pmem_range
>      whenever it detects a pmem device.
> 
>  (5) Patch QEMU to use XEN_DOMCTL_pmem_mapping for mapping host pmem
>      devices.

That is nice - as you can instrument this on existing hardware and
create 'fake' starting SPA for real memory - which Xen may not see
due to being booted with 'mem=X'.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-02 17:11 ` Stefano Stabellini
@ 2016-02-03  7:00   ` Haozhong Zhang
  2016-02-03  9:13     ` Jan Beulich
  2016-02-03 12:02     ` Stefano Stabellini
  0 siblings, 2 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-03  7:00 UTC (permalink / raw)
  To: Stefano Stabellini, Jan Beulich, Andrew Cooper, Keir Fraser
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell, George Dunlap,
	Ian Jackson, xen-devel, Jun Nakajima, Xiao Guangrong

On 02/02/16 17:11, Stefano Stabellini wrote:
> Haozhong, thanks for your work!
> 
> On Mon, 1 Feb 2016, Haozhong Zhang wrote:
> > 3.2 Address Mapping
> > 
> > 3.2.1 My Design
> > 
> >  The overview of this design is shown in the following figure.
> > 
> >                  Dom0                         |               DomU
> >                                               |
> >                                               |
> >  QEMU                                         |
> >      +...+--------------------+...+-----+     |
> >   VA |   | Label Storage Area |   | buf |     |
> >      +...+--------------------+...+-----+     |
> >                      ^            ^     ^     |
> >                      |            |     |     |
> >                      V            |     |     |
> >      +-------+   +-------+        mmap(2)     |
> >      | vACPI |   | v_DSM |        |     |     |        +----+------------+
> >      +-------+   +-------+        |     |     |   SPA  |    | /dev/pmem0 |
> >          ^           ^     +------+     |     |        +----+------------+
> >  --------|-----------|-----|------------|--   |             ^            ^
> >          |           |     |            |     |             |            |
> >          |    +------+     +------------~-----~-------------+            |
> >          |    |            |            |     |        XEN_DOMCTL_memory_mapping
> >          |    |            |            +-----~--------------------------+
> >          |    |            |            |     |
> >          |    |       +----+------------+     |
> >  Linux   |    |   SPA |    | /dev/pmem0 |     |     +------+   +------+
> >          |    |       +----+------------+     |     | ACPI |   | _DSM |
> >          |    |                   ^           |     +------+   +------+
> >          |    |                   |           |         |          |
> >          |    |               Dom0 Driver     |   hvmloader/xl     |
> >  --------|----|-------------------|---------------------|----------|---------------
> >          |    +-------------------~---------------------~----------+
> >  Xen     |                        |                     |
> >          +------------------------~---------------------+
> >  ---------------------------------|------------------------------------------------
> >                                   +----------------+
> >                                                    |
> >                                             +-------------+
> >  HW                                         |    NVDIMM   |
> >                                             +-------------+
> > 
> > 
> >  This design treats host NVDIMM devices as ordinary MMIO devices:
> >  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
> >      and drive host NVDIMM devices (implementing block device
> >      interface). Namespaces and file systems on host NVDIMM devices
> >      are handled by Dom0 Linux as well.
> > 
> >  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
> >      virtual address space (buf).
> > 
> >  (3) QEMU gets the host physical address of buf, i.e. the host system
> >      physical address that is occupied by /dev/pmem0, and calls Xen
> >      hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
> 
> How is this going to work from a security perspective? Is it going to
> require running QEMU as root in Dom0, which will prevent NVDIMM from
> working by default on Xen? If so, what's the plan?
>

Oh, I forgot to address the non-root QEMU issues in this design ...

The default user:group of /dev/pmem0 is root:disk, and its permission
is rw-rw----. We could lift the 'others' permission to rw, so that a
non-root QEMU can mmap /dev/pmem0, but that looks too risky.

Or, we can make a file system on /dev/pmem0, create files on it, set
the owner of those files to xen-qemuuser-domid$domid, and then pass
those files to QEMU. In this way, non-root QEMU should be able to
mmap those files.
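
A rough sketch of that second option (the path, size and the
xen-qemuuser-domid$domid naming below are just for illustration):

    /* Sketch only: create a backing file on a filesystem mounted from
     * /dev/pmem0 and hand it to the per-domain QEMU user, so a non-root
     * QEMU can open and mmap it.  Names and sizes are illustrative. */
    #include <fcntl.h>
    #include <pwd.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int prepare_vnvdimm_backing(const char *path, off_t size, int domid)
    {
        char user[64];
        struct passwd *pw;
        int fd = open(path, O_RDWR | O_CREAT, 0600); /* e.g. /mnt/pmem0/d1.img */

        if (fd < 0 || ftruncate(fd, size) < 0)
            return -1;

        snprintf(user, sizeof(user), "xen-qemuuser-domid%d", domid);
        pw = getpwnam(user);
        if (!pw || fchown(fd, pw->pw_uid, pw->pw_gid) < 0)
            return -1;

        return close(fd);
    }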

> 
> 
> >  (ACPI part is described in Section 3.3 later)
> > 
> >  Above (1)(2) have already been done in current QEMU. Only (3) is
> >  needed to implement in QEMU. No change is needed in Xen for address
> >  mapping in this design.
> > 
> >  Open: It seems no system call/ioctl is provided by Linux kernel to
> >        get the physical address from a virtual address.
> >        /proc/<qemu_pid>/pagemap provides information of mapping from
> >        VA to PA. Is it an acceptable solution to let QEMU parse this
> >        file to get the physical address?
> 
> Does it work in a non-root scenario?
>

Seemingly not, according to Documentation/vm/pagemap.txt in the Linux kernel:
| Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
| In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
| 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
| Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
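
For reference, the lookup itself is simple when the process does have
CAP_SYS_ADMIN; a minimal sketch (error handling omitted) of translating
a virtual address inside buf into a host page frame number:

    /* Each 64-bit pagemap entry holds the PFN in bits 0-54 and a
     * "present" flag in bit 63; see Documentation/vm/pagemap.txt.
     * Returns 0 if the page is not present or the PFN is hidden. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    static uint64_t va_to_pfn(const void *va)
    {
        uint64_t entry = 0;
        long psize = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);

        pread(fd, &entry, sizeof(entry),
              ((uintptr_t)va / psize) * sizeof(entry));
        close(fd);

        return (entry & (1ULL << 63)) ? (entry & ((1ULL << 55) - 1)) : 0;
    }

But, as quoted above, a non-root QEMU would only ever read back zeroed
PFNs from this file.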

A possible alternative is to add a new hypercall similar to
XEN_DOMCTL_memory_mapping but receiving a virtual address as the address
parameter and translating it to a machine address in the hypervisor.

> 
> >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> >        occupied by pmem at the beginning, i.e. QEMU may not be able to
> >        get all SPA of pmem from buf (in virtual address space) when
> >        calling XEN_DOMCTL_memory_mapping.
> >        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> >        entire pmem being mmaped?
> 
> Ditto
>

No. If I take the above alternative for the first open, maybe the new
hypercall can inject page faults into dom0 for the unmapped virtual
addresses so as to force dom0 Linux to create the page mappings, as
sketched below.
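
To make that alternative concrete, the interface could look roughly like
the following; the name and field layout are purely illustrative and no
such hypercall exists today:

    /* Hypothetical, for discussion only -- not an existing Xen interface.
     * Like XEN_DOMCTL_memory_mapping, but the source is a dom0 virtual
     * address: Xen walks (and, if necessary, faults in) dom0's mapping of
     * [va, va + nr_frames pages) and maps the backing machine frames at
     * first_gfn in the target domain. */
    #include <stdint.h>

    struct xen_domctl_va_mapping {
        uint64_t first_gfn;   /* guest physical start frame               */
        uint64_t va;          /* dom0 virtual address of the mmapped area */
        uint64_t nr_frames;   /* number of 4K frames to map               */
        uint32_t add_mapping; /* 1 = add, 0 = remove                      */
    };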

> 
> > 3.2.2 Alternative Design
> > 
> >  Jan Beulich's comments [7] on my question "why must pmem resource
> >  management and partition be done in hypervisor":
> >  | Because that's where memory management belongs. And PMEM,
> >  | other than PBLK, is just another form of RAM.
> >  | ...
> >  | The main issue is that this would imo be a layering violation
> > 
> >  George Dunlap's comments [8]:
> >  | This is not the case for PMEM.  The whole point of PMEM (correct me if
> >    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
> >  | I'm wrong) is to be used for long-term storage that survives over
> >  | reboot.  It matters very much that a guest be given the same PRAM
> >  | after the host is rebooted that it was given before.  It doesn't make
> >  | any sense to manage it the way Xen currently manages RAM (i.e., that
> >  | you request a page and get whatever Xen happens to give you).
> >  |
> >  | So if Xen is going to use PMEM, it will have to invent an entirely new
> >  | interface for guests, and it will have to keep track of those
> >  | resources across host reboots.  In other words, it will have to
> >  | duplicate all the work that Linux already does.  What do we gain from
> >  | that duplication?  Why not just leverage what's already implemented in
> >  | dom0?
> >  and [9]:
> >  | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
> >  | then you're right -- it is just another form of RAM, that should be
> >  | treated no differently than say, lowmem: a fungible resource that can be
> >  | requested by setting a flag.
> > 
> >  However, pmem is used more as persistent storage than fungible ram,
> >  and my design is for the former usage. I would like to leave the
> >  detection, driver and partition (either through namespace or file
> >  systems) of NVDIMM in Dom0 Linux kernel.
> > 
> >  I notice that the current XEN_DOMCTL_memory_mapping does not make a sanity
> >  check of the physical address and size passed from the caller
> >  (QEMU). Can QEMU always be trusted? If not, we would need to make Xen
> >  aware of the SPA ranges of pmem so that it can refuse to map physical
> >  addresses that are in neither normal ram nor pmem.
> 
> Indeed
> 
> 
[...]
> > 
> > 
> > 3.3 Guest ACPI Emulation
> > 
> > 3.3.1 My Design
> > 
> >  Guest ACPI emulation is composed of two parts: building guest NFIT
> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
> >  emulating guest _DSM.
> > 
> >  (1) Building Guest ACPI Tables
> > 
> >   This design reuses and extends hvmloader's existing mechanism that
> >   loads passthrough ACPI tables from binary files to load NFIT and
> >   SSDT tables built by QEMU:
> >   1) Because the current QEMU does not build any ACPI tables when
> >      it runs as the Xen device model, this design needs to patch QEMU
> >      to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
> > 
> >   2) QEMU copies NFIT and SSDT to the end of guest memory below
> >      4G. The guest address and size of those tables are written into
> >      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
> > 
> >   3) hvmloader is patched to probe and load device model passthrough
> >      ACPI tables from above xenstore keys. The detected ACPI tables
> >      are then appended to the end of existing guest ACPI tables just
> >      like what current construct_passthrough_tables() does.
> > 
> >   Reasons for this design are listed below:
> >   - NFIT and SSDT in question are quite self-contained, i.e. they do
> >     not refer to other ACPI tables and not conflict with existing
> >     guest ACPI tables in Xen. Therefore, it is safe to copy them from
> >     QEMU and append to existing guest ACPI tables.
> > 
> >   - A primary portion of current and future vNVDIMM implementation is
> >     about building ACPI tables. And this design also leaves the
> >     emulation of _DSM to QEMU, which needs to stay consistent with the
> >     NFIT and SSDT it builds. Therefore, reusing NFIT and SSDT from
> >     QEMU can ease the maintenance.
> > 
> >   - Anthony's work to pass ACPI tables from the toolstack to hvmloader
> >     does not move building SSDT (and NFIT) to toolstack, so this
> >     design can still put them in hvmloader.
> 
> If we start asking QEMU to build ACPI tables, why should we stop at NFIT
> and SSDT?

To ease my development of vNVDIMM support in Xen ... I mean NFIT and
SSDT are the only two tables needed for this purpose, and I'm afraid
of breaking existing guests if I completely switch to QEMU for guest
ACPI tables.

> Once upon a time somebody made the decision that ACPI tables
> on Xen should be static and included in hvmloader. That might have been
> a bad decision but at least it was coherent. Loading only *some* tables
> from QEMU, but not others, it feels like an incomplete design to me.
>
> For example, QEMU is currently in charge of emulating the PCI bus, why
> shouldn't it be QEMU that generates the PRT and MCFG?
>

To Keir, Jan and Andrew:

Is there anything related to ACPI that must be done (or is better
done) in hvmloader?

> 
> >  (2) Emulating Guest _DSM
> > 
> >   Because the same NFIT and SSDT are used, we can leave the emulation
> >   of guest _DSM to QEMU. Just as what it does with KVM, QEMU registers
> >   the _DSM buffer as MMIO region with Xen and then all guest
> >   evaluations of _DSM are trapped and emulated by QEMU.
> > 
> > 3.3.2 Alternative Design 1: switching to QEMU
> > 
> >  Stefano Stabellini's comments [10]:
> >  | I don't think it is wise to have two components which both think are
> >  | in control of generating ACPI tables, hvmloader (soon to be the
> >  | toolstack with Anthony's work) and QEMU. From an architectural
> >  | perspective, it doesn't look robust to me.
> >  |
> >  | Could we take this opportunity to switch to QEMU generating the whole
> >  | set of ACPI tables?
> > 
> >  So an alternative design could be switching to QEMU to generate the
> >  whole set of guest ACPI tables. In this way, no controversy would
> >  happen between multiple agents QEMU and hvmloader. (is this what
> >  Stefano Stabellini mean by 'robust'?)
> 
> Right
>

I understand that switching to QEMU for all guest ACPI tables would
probably ease the maintenance of guest ACPI tables. However, in
general, is it true that we currently need to fix regressions
introduced by qemu-xen whenever we upgrade it to upstream? If so,
is it reasonable to turn to QEMU for only the NVDIMM ACPI tables and
fix regressions/conflicts whenever an upgraded qemu-xen changes the
NVDIMM ACPI tables?

> 
> >  However, looking at the code building ACPI tables in QEMU and
> >  hvmloader, they are quite different. As ACPI tables are important for
> >  OS to boot and operate device, it's critical to ensure ACPI tables
> >  built by QEMU would not break existing guests on Xen. Though I
> >  believe it could be done after a thorough investigation and
> >  adjustment, it may take quite a lot of work and tests and should be
> >  another project besides enabling vNVDIMM in Xen.
> >
> > 3.3.3 Alternative Design 2: keeping in Xen
> > 
> >  Alternative to switching to QEMU, another design would be building
> >  NFIT and SSDT in hvmloader or toolstack.
> > 
> >  The amount and parameters of sub-structures in guest NFIT vary
> >  according to different vNVDIMM configurations that can not be decided
> >  at compile-time. In contrast, current hvmloader and toolstack can
> >  only build static ACPI tables, i.e. their contents are decided
> >  statically at compile-time and independent from the guest
> >  configuration. In order to build guest NFIT at runtime, this design
> >  may take following steps:
> >  (1) xl converts NVDIMM configurations in xl.cfg to corresponding QEMU
> >      options,
> > 
> >  (2) QEMU accepts above options, figures out the start SPA range
> >      address/size/NVDIMM device handles/..., and writes them in
> >      xenstore. No ACPI table is built by QEMU.
> > 
> >  (3) Either xl or hvmloader reads above parameters from xenstore and
> >      builds the NFIT table.
> > 
> >  For guest SSDT, it would take more work. The ACPI namespace devices
> >  are defined in SSDT by AML, so an AML builder would be needed to
> >  generate those definitions at runtime.
> > 
> >  This alternative design still needs more work than the first design.
> 
> I prefer switching to QEMU building all ACPI tables for devices that it
> is emulating. However this alternative is good too because it is
> coherent with the current design.
>

I would prefer this one if the final conclusion is that only one
agent should be allowed to build guest ACPI tables. As I said above,
switching to QEMU for all ACPI tables looks like a big change and I'm
afraid it would break some existing guests.

Thanks,
Haozhong

> 
> > References:
> > [1] ACPI Specification v6,
> >     http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> > [2] NVDIMM Namespace Specification,
> >     http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> > [3] NVDIMM Block Window Driver Writer's Guide,
> >     http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> > [4] NVDIMM DSM Interface Example,
> >     http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> > [5] UEFI Specification v2.6,
> >     http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_6.pdf
> > [6] Intel Architecture Instruction Set Extensions Programming Reference,
> >     https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf
> > [7] http://www.gossamer-threads.com/lists/xen/devel/414945#414945
> > [8] http://www.gossamer-threads.com/lists/xen/devel/415658#415658
> > [9] http://www.gossamer-threads.com/lists/xen/devel/415681#415681
> > [10] http://lists.xenproject.org/archives/html/xen-devel/2016-01/msg00271.html
> > 


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-02 19:15 ` Konrad Rzeszutek Wilk
@ 2016-02-03  8:28   ` Haozhong Zhang
  2016-02-03  9:18     ` Jan Beulich
  2016-02-15  8:43   ` Haozhong Zhang
  1 sibling, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-03  8:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
> > 3. Design of vNVDIMM in Xen
> 
> Thank you for this design!
> 
> > 
> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
> >  three parts:
> >  (1) Guest clwb/clflushopt/pcommit enabling,
> >  (2) Memory mapping, and
> >  (3) Guest ACPI emulation.
> 
> 
> .. MCE? and vMCE?
>

The specifications at hand seem not to mention much about MCE for NVDIMM,
but I remember that the NVDIMM driver in the Linux kernel does have MCE
code. I'll have a look at that code and add this part later.

> > 
> >  The rest of this section present the design of each part
> >  respectively. The basic design principle to reuse existing code in
> >  Linux NVDIMM driver and QEMU as much as possible. As recent
> >  discussions in the both Xen and QEMU mailing lists for the v1 patch
> >  series, alternative designs are also listed below.
> > 
> > 
> > 3.1 Guest clwb/clflushopt/pcommit Enabling
> > 
> >  The instruction enabling is simple and we do the same work as in KVM/QEMU.
> >  - All three instructions are exposed to guest via guest cpuid.
> >  - L1 guest pcommit is never intercepted by Xen.
> 
> I wish there were some watermarks like the PLE has.
> 
> My fear is that an unfriendly guest can issue sfence all day long,
> flushing out other guests' MMC queues (the writes followed by pcommits).
> Which means that a guest may have degraded performance as its
> memory writes are being flushed out immediately, as if they were
> being written to UC instead of WB memory.
>

pcommit takes no parameters and it seems hard to solve this problem
in hardware for now. Also, the current VMX does not provide a mechanism
to limit the rate of pcommit the way PLE does for pause.

> In other words - the NVDIMM resource does not provide any resource
> isolation. However, this may not be any different from what we have
> nowadays with CPU caches.
>

Does Xen have any mechanism to isolate multiple guests' operations on
CPU caches?

> 
> >  - L1 hypervisor is allowed to intercept L2 guest pcommit.
> 
> clwb?
>

VMX is not capable of intercepting clwb. Any reason to intercept it?

> > 
> > 
> > 3.2 Address Mapping
> > 
> > 3.2.1 My Design
> > 
> >  The overview of this design is shown in the following figure.
> > 
> >                  Dom0                         |               DomU
> >                                               |
> >                                               |
> >  QEMU                                         |
> >      +...+--------------------+...+-----+     |
> >   VA |   | Label Storage Area |   | buf |     |
> >      +...+--------------------+...+-----+     |
> >                      ^            ^     ^     |
> >                      |            |     |     |
> >                      V            |     |     |
> >      +-------+   +-------+        mmap(2)     |
> >      | vACPI |   | v_DSM |        |     |     |        +----+------------+
> >      +-------+   +-------+        |     |     |   SPA  |    | /dev/pmem0 |
> >          ^           ^     +------+     |     |        +----+------------+
> >  --------|-----------|-----|------------|--   |             ^            ^
> >          |           |     |            |     |             |            |
> >          |    +------+     +------------~-----~-------------+            |
> >          |    |            |            |     |        XEN_DOMCTL_memory_mapping
> >          |    |            |            +-----~--------------------------+
> >          |    |            |            |     |
> >          |    |       +----+------------+     |
> >  Linux   |    |   SPA |    | /dev/pmem0 |     |     +------+   +------+
> >          |    |       +----+------------+     |     | ACPI |   | _DSM |
> >          |    |                   ^           |     +------+   +------+
> >          |    |                   |           |         |          |
> >          |    |               Dom0 Driver     |   hvmloader/xl     |
> >  --------|----|-------------------|---------------------|----------|---------------
> >          |    +-------------------~---------------------~----------+
> >  Xen     |                        |                     |
> >          +------------------------~---------------------+
> >  ---------------------------------|------------------------------------------------
> >                                   +----------------+
> >                                                    |
> >                                             +-------------+
> >  HW                                         |    NVDIMM   |
> >                                             +-------------+
> > 
> > 
> >  This design treats host NVDIMM devices as ordinary MMIO devices:
> 
> Nice.
> 
> But it also means you need Xen to 'share' the ranges of an MMIO device.
> 
> That is you may need dom0 _DSM method to access certain ranges
> (the AML code may need to poke there) - and the guest may want to access
> those as well.
>

Currently, we are going to support the _DSM functions that query the set
of supported _DSM commands and access the vNVDIMM's label storage area.
Both are emulated by QEMU and never applied to the host NVDIMM.
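
To illustrate what that emulation amounts to, here is a rough sketch of
a "get namespace label data" handler operating on a device-model-private
buffer; the struct layout, sizes and status codes are assumptions for
the example, not QEMU's actual code:

    /* Sketch only: the label storage area lives in a buffer owned by the
     * device model (in practice it would be backed by a file), so the
     * emulated _DSM never touches the host NVDIMM. */
    #include <stdint.h>
    #include <string.h>

    #define LABEL_AREA_SIZE (128 * 1024)        /* assumed label area size */

    static uint8_t label_area[LABEL_AREA_SIZE];

    struct get_label_in  { uint32_t offset; uint32_t length; };
    struct get_label_out { uint32_t status; uint8_t data[4096]; };

    static void dsm_get_label_data(const struct get_label_in *in,
                                   struct get_label_out *out)
    {
        if ((uint64_t)in->offset + in->length > LABEL_AREA_SIZE ||
            in->length > sizeof(out->data)) {
            out->status = 3;                    /* invalid input parameters */
            return;
        }
        memcpy(out->data, label_area + in->offset, in->length);
        out->status = 0;                        /* success */
    }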

> And keep in mind that this NVDIMM management may not need to be always
> in initial domain.

I guess you mean it can be in a dedicated driver domain,

> As in you could have NVDIMM device drivers that would
> carve out the ranges to guests.

but I don't get what you mean here. More hints?

[...] 
> > 3.2.2 Alternative Design
> > 
> >  Jan Beulich's comments [7] on my question "why must pmem resource
> >  management and partition be done in hypervisor":
> >  | Because that's where memory management belongs. And PMEM,
> >  | other than PBLK, is just another form of RAM.
> >  | ...
> >  | The main issue is that this would imo be a layering violation
> > 
> >  George Dunlap's comments [8]:
> >  | This is not the case for PMEM.  The whole point of PMEM (correct me if
> >    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
> >  | I'm wrong) is to be used for long-term storage that survives over
> >  | reboot.  It matters very much that a guest be given the same PRAM
> >  | after the host is rebooted that it was given before.  It doesn't make
> >  | any sense to manage it the way Xen currently manages RAM (i.e., that
> >  | you request a page and get whatever Xen happens to give you).
> >  |
> >  | So if Xen is going to use PMEM, it will have to invent an entirely new
> >  | interface for guests, and it will have to keep track of those
> >  | resources across host reboots.  In other words, it will have to
> >  | duplicate all the work that Linux already does.  What do we gain from
> >  | that duplication?  Why not just leverage what's already implemented in
> >  | dom0?
> >  and [9]:
> >  | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
> >  | then you're right -- it is just another form of RAM, that should be
> >  | treated no differently than say, lowmem: a fungible resource that can be
> >  | requested by setting a flag.
> > 
> >  However, pmem is used more as persistent storage than fungible ram,
> >  and my design is for the former usage. I would like to leave the
> >  detection, driver and partition (either through namespace or file
> >  systems) of NVDIMM in Dom0 Linux kernel.
> > 
> >  I notice that the current XEN_DOMCTL_memory_mapping does not make a sanity
> >  check of the physical address and size passed from the caller
> >  (QEMU). Can QEMU always be trusted? If not, we would need to make Xen
> >  aware of the SPA ranges of pmem so that it can refuse to map physical
> >  addresses that are in neither normal ram nor pmem.
> 
> /me nods.
> > 
> >  Instead of duplicating the detection code (parsing NFIT and
> >  evaluating _FIT) in Dom0 Linux kernel, we decide to patch Dom0 Linux
> >  kernel to pass parameters of host pmem NVDIMM devices to Xen
> >  hypervisor:
> >  (1) Add a global
> >        struct rangeset pmem_rangeset
> >      in Xen hypervisor to record all SPA ranges of detected pmem devices.
> >      Each range in pmem_rangeset corresponds to a pmem device.
> > 
> >  (2) Add a hypercall
> >        XEN_SYSCTL_add_pmem_range
> >      (should it be a sysctl or a platform op?)
> >      that receives a pair of parameters (addr: starting SPA of pmem
> >      region, len: size of pmem region) and add a range (addr, addr +
> >      len - 1) in nvdimm_rangset.
> > 
> >  (3) Add a hypercall
> >        XEN_DOMCTL_pmem_mapping
> >      that takes the same parameters as XEN_DOMCTL_memory_mapping and
> >      maps a given host pmem range to guest. It checks whether the
> >      given host pmem range is in the pmem_rangeset before making the
> >      actual mapping.
> > 
> >  (4) Patch Linux NVDIMM driver to call XEN_SYSCTL_add_pmem_range
> >      whenever it detects a pmem device.
> > 
> >  (5) Patch QEMU to use XEN_DOMCTL_pmem_mapping for mapping host pmem
> >      devices.
> 
> That is nice - as you can instrument this on existing hardware and
> create 'fake' starting SPA for real memory - which Xen may not see
> due to being booted with 'mem=X'.
>

'mem=X' only limits the maximum address of normal ram. Are NVDIMM or
other MMIO devices limited by it as well?

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03  7:00   ` Haozhong Zhang
@ 2016-02-03  9:13     ` Jan Beulich
  2016-02-03 14:09       ` Andrew Cooper
  2016-02-03 12:02     ` Stefano Stabellini
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-02-03  9:13 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 03.02.16 at 08:00, <haozhong.zhang@intel.com> wrote:
> On 02/02/16 17:11, Stefano Stabellini wrote:
>> Once upon a time somebody made the decision that ACPI tables
>> on Xen should be static and included in hvmloader. That might have been
>> a bad decision but at least it was coherent. Loading only *some* tables
>> from QEMU, but not others, it feels like an incomplete design to me.
>>
>> For example, QEMU is currently in charge of emulating the PCI bus, why
>> shouldn't it be QEMU that generates the PRT and MCFG?
>>
> 
> To Keir, Jan and Andrew:
> 
> Are there anything related to ACPI that must be done (or are better to
> be done) in hvmloader?

Some of the static tables (FADT and HPET come to mind) likely would
better continue to live in hvmloader. MCFG (for example) coming from
qemu, otoh, would be quite natural (and would finally allow MMCFG
support for guests in the first place). I.e. ...

>> I prefer switching to QEMU building all ACPI tables for devices that it
>> is emulating. However this alternative is good too because it is
>> coherent with the current design.
> 
> I would prefer to this one if the final conclusion is that only one
> agent should be allowed to build guest ACPI. As I said above, it looks
> like a big change to switch to QEMU for all ACPI tables and I'm afraid
> it would break some existing guests. 

... I indeed think that tables should come from qemu for components
living in qemu, and from hvmloader for components coming from Xen.

Jan


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03  8:28   ` Haozhong Zhang
@ 2016-02-03  9:18     ` Jan Beulich
  2016-02-03 12:22       ` Haozhong Zhang
  2016-02-03 14:30       ` Andrew Cooper
  0 siblings, 2 replies; 121+ messages in thread
From: Jan Beulich @ 2016-02-03  9:18 UTC (permalink / raw)
  To: Haozhong Zhang, Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 03.02.16 at 09:28, <haozhong.zhang@intel.com> wrote:
> On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
>> > 3.1 Guest clwb/clflushopt/pcommit Enabling
>> > 
>> >  The instruction enabling is simple and we do the same work as in KVM/QEMU.
>> >  - All three instructions are exposed to guest via guest cpuid.
>> >  - L1 guest pcommit is never intercepted by Xen.
>> 
>> I wish there were some watermarks like PLE has.
>> 
>> My fear is that an unfriendly guest can issue sfence all day long,
>> flushing out other guests' MMC queue (the writes followed by pcommits).
>> Which means that a guest may have degraded performance, as its
>> memory writes are being flushed out immediately as if they were
>> being written to UC instead of WB memory.
> 
> pcommit takes no parameters and it seems hard to solve this problem
> in hardware for now. Also, the current VMX does not provide a mechanism
> to limit the commit rate of pcommit the way PLE does for pause.
> 
>> In other words - the NVDIMM resource does not provide any resource
> >> isolation. However this may not be any different from what we have
> >> nowadays with CPU caches.
>>
> 
> Does Xen have any mechanism to isolate multiple guests' operations on
> CPU caches?

No. All it does is disallow wbinvd for guests not controlling any
actual hardware. Perhaps pcommit should at least be limited in
a similar way?

>> >  - L1 hypervisor is allowed to intercept L2 guest pcommit.
>> 
>> clwb?
> 
> VMX is not capable of intercepting clwb. Is there any reason to intercept it?

I don't think so - otherwise normal memory writes might also need
intercepting. Bus bandwidth simply is shared (and CLWB operates
on a guest virtual address, so can only cause bus traffic for cache
lines the guest has managed to dirty).

Jan


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03  7:00   ` Haozhong Zhang
  2016-02-03  9:13     ` Jan Beulich
@ 2016-02-03 12:02     ` Stefano Stabellini
  2016-02-03 13:11       ` Haozhong Zhang
                         ` (2 more replies)
  1 sibling, 3 replies; 121+ messages in thread
From: Stefano Stabellini @ 2016-02-03 12:02 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Keir Fraser, Ian Campbell,
	George Dunlap, Andrew Cooper, Stefano Stabellini, Ian Jackson,
	Xiao Guangrong, xen-devel, Jan Beulich, Jun Nakajima, Wei Liu

On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> On 02/02/16 17:11, Stefano Stabellini wrote:
> > Haozhong, thanks for your work!
> > 
> > On Mon, 1 Feb 2016, Haozhong Zhang wrote:
> > > 3.2 Address Mapping
> > > 
> > > 3.2.1 My Design
> > > 
> > >  The overview of this design is shown in the following figure.
> > > 
> > >                  Dom0                         |               DomU
> > >                                               |
> > >                                               |
> > >  QEMU                                         |
> > >      +...+--------------------+...+-----+     |
> > >   VA |   | Label Storage Area |   | buf |     |
> > >      +...+--------------------+...+-----+     |
> > >                      ^            ^     ^     |
> > >                      |            |     |     |
> > >                      V            |     |     |
> > >      +-------+   +-------+        mmap(2)     |
> > >      | vACPI |   | v_DSM |        |     |     |        +----+------------+
> > >      +-------+   +-------+        |     |     |   SPA  |    | /dev/pmem0 |
> > >          ^           ^     +------+     |     |        +----+------------+
> > >  --------|-----------|-----|------------|--   |             ^            ^
> > >          |           |     |            |     |             |            |
> > >          |    +------+     +------------~-----~-------------+            |
> > >          |    |            |            |     |        XEN_DOMCTL_memory_mapping
> > >          |    |            |            +-----~--------------------------+
> > >          |    |            |            |     |
> > >          |    |       +----+------------+     |
> > >  Linux   |    |   SPA |    | /dev/pmem0 |     |     +------+   +------+
> > >          |    |       +----+------------+     |     | ACPI |   | _DSM |
> > >          |    |                   ^           |     +------+   +------+
> > >          |    |                   |           |         |          |
> > >          |    |               Dom0 Driver     |   hvmloader/xl     |
> > >  --------|----|-------------------|---------------------|----------|---------------
> > >          |    +-------------------~---------------------~----------+
> > >  Xen     |                        |                     |
> > >          +------------------------~---------------------+
> > >  ---------------------------------|------------------------------------------------
> > >                                   +----------------+
> > >                                                    |
> > >                                             +-------------+
> > >  HW                                         |    NVDIMM   |
> > >                                             +-------------+
> > > 
> > > 
> > >  This design treats host NVDIMM devices as ordinary MMIO devices:
> > >  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
> > >      and drive host NVDIMM devices (implementing block device
> > >      interface). Namespaces and file systems on host NVDIMM devices
> > >      are handled by Dom0 Linux as well.
> > > 
> > >  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
> > >      virtual address space (buf).
> > > 
> > >  (3) QEMU gets the host physical address of buf, i.e. the host system
> > >      physical address that is occupied by /dev/pmem0, and calls Xen
> > >      hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
> > 
> > How is this going to work from a security perspective? Is it going to
> > require running QEMU as root in Dom0, which will prevent NVDIMM from
> > working by default on Xen? If so, what's the plan?
> >
> 
> Oh, I forgot to address the non-root qemu issues in this design ...
> 
> The default user:group of /dev/pmem0 is root:disk, and its permission
> is rw-rw----. We could lift the others permission to rw, so that
> non-root QEMU can mmap /dev/pmem0. But it looks too risky.

Yep, too risky.


> Or, we can make a file system on /dev/pmem0, create files on it, set
> the owner of those files to xen-qemuuser-domid$domid, and then pass
> those files to QEMU. In this way, non-root QEMU should be able to
> mmap those files.

Maybe that would work. Worth adding it to the design, I would like to
read more details on it.

Also note that QEMU initially runs as root but drops privileges to
xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> *could* mmap /dev/pmem0 while it is still running as root, but then it
wouldn't work for any devices that need to be mmap'ed at run time
(hotplug scenario).


> > >  (ACPI part is described in Section 3.3 later)
> > > 
> > >  Above (1)(2) have already been done in current QEMU. Only (3) is
> > >  needed to implement in QEMU. No change is needed in Xen for address
> > >  mapping in this design.
> > > 
> > >  Open: It seems no system call/ioctl is provided by Linux kernel to
> > >        get the physical address from a virtual address.
> > >        /proc/<qemu_pid>/pagemap provides information of mapping from
> > >        VA to PA. Is it an acceptable solution to let QEMU parse this
> > >        file to get the physical address?
> > 
> > Does it work in a non-root scenario?
> >
> 
> Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
> | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> | Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
>
> A possible alternative is to add a new hypercall similar to
> XEN_DOMCTL_memory_mapping but receiving virtual address as the address
> parameter and translating to machine address in the hypervisor.

That might work.


> > >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> > >        occupied by pmem at the beginning, i.e. QEMU may not be able to
> > >        get all SPA of pmem from buf (in virtual address space) when
> > >        calling XEN_DOMCTL_memory_mapping.
> > >        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> > >        entire pmem being mmaped?
> > 
> > Ditto
> >
> 
> No. If I take the above alternative for the first open, maybe the new
> hypercall above can inject page faults into dom0 for the unmapped
> virtual address so as to enforce dom0 Linux to create the page
> mapping.

Otherwise you need to use something like the mapcache in QEMU
(xen-mapcache.c), which admittedly, given its complexity, would be best
to avoid.


> > > 3.2.2 Alternative Design
> > > 
> > >  Jan Beulich's comments [7] on my question "why must pmem resource
> > >  management and partition be done in hypervisor":
> > >  | Because that's where memory management belongs. And PMEM,
> > >  | other than PBLK, is just another form of RAM.
> > >  | ...
> > >  | The main issue is that this would imo be a layering violation
> > > 
> > >  George Dunlap's comments [8]:
> > >  | This is not the case for PMEM.  The whole point of PMEM (correct me if
> > >    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
> > >  | I'm wrong) is to be used for long-term storage that survives over
> > >  | reboot.  It matters very much that a guest be given the same PRAM
> > >  | after the host is rebooted that it was given before.  It doesn't make
> > >  | any sense to manage it the way Xen currently manages RAM (i.e., that
> > >  | you request a page and get whatever Xen happens to give you).
> > >  |
> > >  | So if Xen is going to use PMEM, it will have to invent an entirely new
> > >  | interface for guests, and it will have to keep track of those
> > >  | resources across host reboots.  In other words, it will have to
> > >  | duplicate all the work that Linux already does.  What do we gain from
> > >  | that duplication?  Why not just leverage what's already implemented in
> > >  | dom0?
> > >  and [9]:
> > >  | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
> > >  | then you're right -- it is just another form of RAM, that should be
> > >  | treated no differently than say, lowmem: a fungible resource that can be
> > >  | requested by setting a flag.
> > > 
> > >  However, pmem is used more as persistent storage than fungible ram,
> > >  and my design is for the former usage. I would like to leave the
> > >  detection, driver and partition (either through namespace or file
> > >  systems) of NVDIMM in Dom0 Linux kernel.
> > > 
> > >  I notice that the current XEN_DOMCTL_memory_mapping does not make a
> > >  sanity check of the physical address and size passed from the caller
> > >  (QEMU). Can QEMU always be trusted? If not, we would need to make Xen
> > >  aware of the SPA ranges of pmem so that it can refuse to map physical
> > >  addresses that are in neither normal ram nor pmem.
> > 
> > Indeed
> > 
> > 
> [...]
> > > 
> > > 
> > > 3.3 Guest ACPI Emulation
> > > 
> > > 3.3.1 My Design
> > > 
> > >  Guest ACPI emulation is composed of two parts: building guest NFIT
> > >  and SSDT that defines ACPI namespace devices for NVDIMM, and
> > >  emulating guest _DSM.
> > > 
> > >  (1) Building Guest ACPI Tables
> > > 
> > >   This design reuses and extends hvmloader's existing mechanism that
> > >   loads passthrough ACPI tables from binary files to load NFIT and
> > >   SSDT tables built by QEMU:
> > >   1) Because the current QEMU does not build any ACPI tables when
> > >      it runs as the Xen device model, this design needs to patch QEMU
> > >      to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
> > > 
> > >   2) QEMU copies NFIT and SSDT to the end of guest memory below
> > >      4G. The guest address and size of those tables are written into
> > >      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
> > > 
> > >   3) hvmloader is patched to probe and load device model passthrough
> > >      ACPI tables from above xenstore keys. The detected ACPI tables
> > >      are then appended to the end of existing guest ACPI tables just
> > >      like what current construct_passthrough_tables() does.
> > > 
> > >   Reasons for this design are listed below:
> > >   - NFIT and SSDT in question are quite self-contained, i.e. they do
> > >     not refer to other ACPI tables and do not conflict with existing
> > >     guest ACPI tables in Xen. Therefore, it is safe to copy them from
> > >     QEMU and append to existing guest ACPI tables.
> > > 
> > >   - A primary portion of the current and future vNVDIMM implementation
> > >     is about building ACPI tables. This design also leaves the
> > >     emulation of _DSM to QEMU, which needs to stay consistent with the
> > >     NFIT and SSDT it builds. Therefore, reusing NFIT and SSDT from
> > >     QEMU can ease the maintenance.
> > > 
> > >   - Anthony's work to pass ACPI tables from the toolstack to hvmloader
> > >     does not move building SSDT (and NFIT) to toolstack, so this
> > >     design can still put them in hvmloader.
> > 
> > If we start asking QEMU to build ACPI tables, why should we stop at NFIT
> > and SSDT?
> 
> for easing my development of supporting vNVDIMM in Xen ... I mean
> NFIT and SSDT are the only two tables needed for this purpose and I'm
> afraid of breaking existing guests if I completely switch to QEMU for
> guest ACPI tables.

I realize that my words have been a bit confusing. Not /all/ ACPI
tables, just all the tables regarding devices for which QEMU is in
charge (the PCI bus and all devices behind it). Anything related to cpus
and memory (FADT, MADT, etc) would still be left to hvmloader.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03  9:18     ` Jan Beulich
@ 2016-02-03 12:22       ` Haozhong Zhang
  2016-02-03 12:38         ` Jan Beulich
  2016-02-03 14:30       ` Andrew Cooper
  1 sibling, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-03 12:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell, George Dunlap,
	Andrew Cooper, Stefano Stabellini, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/03/16 02:18, Jan Beulich wrote:
> >>> On 03.02.16 at 09:28, <haozhong.zhang@intel.com> wrote:
> > On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
> >> > 3.1 Guest clwb/clflushopt/pcommit Enabling
> >> > 
> >> >  The instruction enabling is simple and we do the same work as in KVM/QEMU.
> >> >  - All three instructions are exposed to guest via guest cpuid.
> >> >  - L1 guest pcommit is never intercepted by Xen.
> >> 
> >> I wish there was some watermarks like the PLE has.
> >> 
> >> My fear is that an unfriendly guest can issue sfence all day long
> >> flushing out other guests MMC queue (the writes followed by pcommits).
> >> Which means that an guest may have degraded performance as their
> >> memory writes are being flushed out immediately as if they were
> >> being written to UC instead of WB memory. 
> > 
> > pcommit takes no parameter and it seems hard to solve this problem
> > from hardware for now. And the current VMX does not provide mechanism
> > to limit the commit rate of pcommit like PLE for pause.
> > 
> >> In other words - the NVDIMM resource does not provide any resource
> >> isolation. However this may not be any different than what we had
> >> nowadays with CPU caches.
> >>
> > 
> > Does Xen have any mechanism to isolate multiple guests' operations on
> > CPU caches?
> 
> No. All it does is disallow wbinvd for guests not controlling any
> actual hardware. Perhaps pcommit should at least be limited in
> a similar way?
>

But pcommit is required to make writes persistent on pmem. I'll
look at how guest wbinvd is limited in Xen. Any functions you would
suggest looking at, e.g. vmx_wbinvd_intercept()?

Thanks,
Haozhong

> >> >  - L1 hypervisor is allowed to intercept L2 guest pcommit.
> >> 
> >> clwb?
> > 
> > VMX is not capable to intercept clwb. Any reason to intercept it?
> 
> I don't think so - otherwise normal memory writes might also need
> intercepting. Bus bandwidth simply is shared (and CLWB operates
> on a guest virtual address, so can only cause bus traffic for cache
> lines the guest has managed to dirty).
> 
> Jan
> 


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 12:22       ` Haozhong Zhang
@ 2016-02-03 12:38         ` Jan Beulich
  2016-02-03 12:49           ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-02-03 12:38 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 03.02.16 at 13:22, <haozhong.zhang@intel.com> wrote:
> On 02/03/16 02:18, Jan Beulich wrote:
>> >>> On 03.02.16 at 09:28, <haozhong.zhang@intel.com> wrote:
>> > On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
>> >> > 3.1 Guest clwb/clflushopt/pcommit Enabling
>> >> > 
>> >> >  The instruction enabling is simple and we do the same work as in 
> KVM/QEMU.
>> >> >  - All three instructions are exposed to guest via guest cpuid.
>> >> >  - L1 guest pcommit is never intercepted by Xen.
>> >> 
>> >> I wish there was some watermarks like the PLE has.
>> >> 
>> >> My fear is that an unfriendly guest can issue sfence all day long
>> >> flushing out other guests MMC queue (the writes followed by pcommits).
>> >> Which means that an guest may have degraded performance as their
>> >> memory writes are being flushed out immediately as if they were
>> >> being written to UC instead of WB memory. 
>> > 
>> > pcommit takes no parameter and it seems hard to solve this problem
>> > from hardware for now. And the current VMX does not provide mechanism
>> > to limit the commit rate of pcommit like PLE for pause.
>> > 
>> >> In other words - the NVDIMM resource does not provide any resource
>> >> isolation. However this may not be any different than what we had
>> >> nowadays with CPU caches.
>> >>
>> > 
>> > Does Xen have any mechanism to isolate multiple guests' operations on
>> > CPU caches?
>> 
>> No. All it does is disallow wbinvd for guests not controlling any
>> actual hardware. Perhaps pcommit should at least be limited in
>> a similar way?
>>
> 
> But pcommit is a must that makes writes be persistent on pmem. I'll
> look at how guest wbinvd is limited in Xen.

But we could intercept it on guests _not_ supposed to use it, in order
to simply drop it on the floor.

> Any functions suggested, vmx_wbinvd_intercept()?

A good example, yes.

Jan


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 12:38         ` Jan Beulich
@ 2016-02-03 12:49           ` Haozhong Zhang
  0 siblings, 0 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-03 12:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/03/16 05:38, Jan Beulich wrote:
> >>> On 03.02.16 at 13:22, <haozhong.zhang@intel.com> wrote:
> > On 02/03/16 02:18, Jan Beulich wrote:
> >> >>> On 03.02.16 at 09:28, <haozhong.zhang@intel.com> wrote:
> >> > On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
> >> >> > 3.1 Guest clwb/clflushopt/pcommit Enabling
> >> >> > 
> >> >> >  The instruction enabling is simple and we do the same work as in 
> > KVM/QEMU.
> >> >> >  - All three instructions are exposed to guest via guest cpuid.
> >> >> >  - L1 guest pcommit is never intercepted by Xen.
> >> >> 
> >> >> I wish there was some watermarks like the PLE has.
> >> >> 
> >> >> My fear is that an unfriendly guest can issue sfence all day long
> >> >> flushing out other guests MMC queue (the writes followed by pcommits).
> >> >> Which means that an guest may have degraded performance as their
> >> >> memory writes are being flushed out immediately as if they were
> >> >> being written to UC instead of WB memory. 
> >> > 
> >> > pcommit takes no parameter and it seems hard to solve this problem
> >> > from hardware for now. And the current VMX does not provide mechanism
> >> > to limit the commit rate of pcommit like PLE for pause.
> >> > 
> >> >> In other words - the NVDIMM resource does not provide any resource
> >> >> isolation. However this may not be any different than what we had
> >> >> nowadays with CPU caches.
> >> >>
> >> > 
> >> > Does Xen have any mechanism to isolate multiple guests' operations on
> >> > CPU caches?
> >> 
> >> No. All it does is disallow wbinvd for guests not controlling any
> >> actual hardware. Perhaps pcommit should at least be limited in
> >> a similar way?
> >>
> > 
> > But pcommit is a must that makes writes be persistent on pmem. I'll
> > look at how guest wbinvd is limited in Xen.
> 
> But we could intercept it on guests _not_ supposed to use it, in order
> to simply drop in on the floor.
>

Oh yes! We can drop pcommit from domains that have no access to host
NVDIMM, just like vmx_wbinvd_intercept() drops wbinvd from domains
with no access to host iomem and ioports.
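
Roughly, that could mirror the wbinvd case with a handler along these
lines (a sketch only: the function name vmx_pcommit_intercept is made up,
the VMX plumbing that would enable the exit for such domains is omitted,
and update_guest_eip() is meant to be the helper vmx.c already uses to
skip an instruction):

/* Sketch: installed only for domains with no access to host NVDIMM
 * (for domains that do have pmem mapped, the pcommit intercept would
 * simply not be enabled).  Dropping the instruction is safe here:
 * advance RIP past pcommit and do nothing, just as wbinvd is dropped
 * for domains without cache-flush rights. */
static void vmx_pcommit_intercept(void)
{
    update_guest_eip();
}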

> > Any functions suggested, vmx_wbinvd_intercept()?
> 
> A good example, yes.
> 
> Jan
> 

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 12:02     ` Stefano Stabellini
@ 2016-02-03 13:11       ` Haozhong Zhang
  2016-02-03 14:20         ` Andrew Cooper
  2016-02-03 15:16       ` George Dunlap
  2016-02-03 15:47       ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-03 13:11 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Juergen Gross, Kevin Tian, Keir Fraser, Ian Campbell,
	George Dunlap, Andrew Cooper, Ian Jackson, Xiao Guangrong,
	xen-devel, Jan Beulich, Jun Nakajima, Wei Liu

On 02/03/16 12:02, Stefano Stabellini wrote:
> On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > On 02/02/16 17:11, Stefano Stabellini wrote:
> > > On Mon, 1 Feb 2016, Haozhong Zhang wrote:
[...]
> > > >  This design treats host NVDIMM devices as ordinary MMIO devices:
> > > >  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
> > > >      and drive host NVDIMM devices (implementing block device
> > > >      interface). Namespaces and file systems on host NVDIMM devices
> > > >      are handled by Dom0 Linux as well.
> > > > 
> > > >  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
> > > >      virtual address space (buf).
> > > > 
> > > >  (3) QEMU gets the host physical address of buf, i.e. the host system
> > > >      physical address that is occupied by /dev/pmem0, and calls Xen
> > > >      hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
> > > 
> > > How is this going to work from a security perspective? Is it going to
> > > require running QEMU as root in Dom0, which will prevent NVDIMM from
> > > working by default on Xen? If so, what's the plan?
> > >
> > 
> > Oh, I forgot to address the non-root qemu issues in this design ...
> > 
> > The default user:group of /dev/pmem0 is root:disk, and its permission
> > is rw-rw----. We could lift the others permission to rw, so that
> > non-root QEMU can mmap /dev/pmem0. But it looks too risky.
> 
> Yep, too risky.
> 
> 
> > Or, we can make a file system on /dev/pmem0, create files on it, set
> > the owner of those files to xen-qemuuser-domid$domid, and then pass
> > those files to QEMU. In this way, non-root QEMU should be able to
> > mmap those files.
> 
> Maybe that would work. Worth adding it to the design, I would like to
> read more details on it.
> 
> Also note that QEMU initially runs as root but drops privileges to
> xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> *could* mmap /dev/pmem0 while is still running as root, but then it
> wouldn't work for any devices that need to be mmap'ed at run time
> (hotplug scenario).
>

Thanks for this information. I'll test some experimental code and then
post a design to address the non-root qemu issue.

> 
> > > >  (ACPI part is described in Section 3.3 later)
> > > > 
> > > >  Above (1)(2) have already been done in current QEMU. Only (3) is
> > > >  needed to implement in QEMU. No change is needed in Xen for address
> > > >  mapping in this design.
> > > > 
> > > >  Open: It seems no system call/ioctl is provided by Linux kernel to
> > > >        get the physical address from a virtual address.
> > > >        /proc/<qemu_pid>/pagemap provides information of mapping from
> > > >        VA to PA. Is it an acceptable solution to let QEMU parse this
> > > >        file to get the physical address?
> > > 
> > > Does it work in a non-root scenario?
> > >
> > 
> > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
> > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> > | Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
> >
> > A possible alternative is to add a new hypercall similar to
> > XEN_DOMCTL_memory_mapping but receiving virtual address as the address
> > parameter and translating to machine address in the hypervisor.
> 
> That might work.
> 
> 
> > > >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> > > >        occupied by pmem at the beginning, i.e. QEMU may not be able to
> > > >        get all SPA of pmem from buf (in virtual address space) when
> > > >        calling XEN_DOMCTL_memory_mapping.
> > > >        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> > > >        entire pmem being mmaped?
> > > 
> > > Ditto
> > >
> > 
> > No. If I take the above alternative for the first open, maybe the new
> > hypercall above can inject page faults into dom0 for the unmapped
> > virtual address so as to enforce dom0 Linux to create the page
> > mapping.
> 
> Otherwise you need to use something like the mapcache in QEMU
> (xen-mapcache.c), which admittedly, given its complexity, would be best
> to avoid.
>

Definitely not mapcache-like things. What I want is something similar to
what emulate_gva_to_mfn() in Xen does.

[...]
> > > If we start asking QEMU to build ACPI tables, why should we stop at NFIT
> > > and SSDT?
> > 
> > for easing my development of supporting vNVDIMM in Xen ... I mean
> > NFIT and SSDT are the only two tables needed for this purpose and I'm
> > afraid to break exiting guests if I completely switch to QEMU for
> > guest ACPI tables.
> 
> I realize that my words have been a bit confusing. Not /all/ ACPI
> tables, just all the tables regarding devices for which QEMU is in
> charge (the PCI bus and all devices behind it). Anything related to cpus
> and memory (FADT, MADT, etc) would still be left to hvmloader.

OK, then it's clear to me. From Jan's reply, at least MCFG is from
QEMU. I'll look at whether other PCI-related tables also come from QEMU
or are similar to those in QEMU. If so, it looks reasonable to let
QEMU generate them.

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03  9:13     ` Jan Beulich
@ 2016-02-03 14:09       ` Andrew Cooper
  2016-02-03 14:23         ` Haozhong Zhang
  2016-02-05 14:40         ` Ross Philipson
  0 siblings, 2 replies; 121+ messages in thread
From: Andrew Cooper @ 2016-02-03 14:09 UTC (permalink / raw)
  To: Jan Beulich, Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 03/02/16 09:13, Jan Beulich wrote:
>>>> On 03.02.16 at 08:00, <haozhong.zhang@intel.com> wrote:
>> On 02/02/16 17:11, Stefano Stabellini wrote:
>>> Once upon a time somebody made the decision that ACPI tables
>>> on Xen should be static and included in hvmloader. That might have been
>>> a bad decision but at least it was coherent. Loading only *some* tables
>>> from QEMU, but not others, it feels like an incomplete design to me.
>>>
>>> For example, QEMU is currently in charge of emulating the PCI bus, why
>>> shouldn't it be QEMU that generates the PRT and MCFG?
>>>
>> To Keir, Jan and Andrew:
>>
>> Are there anything related to ACPI that must be done (or are better to
>> be done) in hvmloader?
> Some of the static tables (FADT and HPET come to mind) likely would
> better continue to live in hvmloader. MCFG (for example) coming from
> qemu, otoh, would be quite natural (and would finally allow MMCFG
> support for guests in the first place). I.e. ...
>
>>> I prefer switching to QEMU building all ACPI tables for devices that it
>>> is emulating. However this alternative is good too because it is
>>> coherent with the current design.
>> I would prefer to this one if the final conclusion is that only one
>> agent should be allowed to build guest ACPI. As I said above, it looks
>> like a big change to switch to QEMU for all ACPI tables and I'm afraid
>> it would break some existing guests. 
> ... I indeed think that tables should come from qemu for components
> living in qemu, and from hvmloader for components coming from Xen.

I agree.

There has to be a single entity responsible for collating the eventual
ACPI handed to the guest, and this is definitely HVMLoader.

However, it is correct that Qemu create the ACPI tables for the devices
it emulates for the guest.

We need to agree on a mechanism whereby each entity can provide their
own subset of the ACPI tables to HVMLoader, and have HVMLoader present
the final set properly to the VM.

There is an existing use case of passing the host SLIC table to a VM, for
OEM versions of Windows.  I believe this is achieved with
HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little
inflexible and could probably do with being made a little more generic.

~Andrew


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 13:11       ` Haozhong Zhang
@ 2016-02-03 14:20         ` Andrew Cooper
  2016-02-04  3:10           ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Andrew Cooper @ 2016-02-03 14:20 UTC (permalink / raw)
  To: Stefano Stabellini, Jan Beulich, Keir Fraser, xen-devel,
	Konrad Rzeszutek Wilk, George Dunlap, Ian Jackson, Ian Campbell,
	Juergen Gross, Wei Liu, Kevin Tian, Xiao Guangrong, Jun Nakajima

On 03/02/16 13:11, Haozhong Zhang wrote:
> On 02/03/16 12:02, Stefano Stabellini wrote:
>> On Wed, 3 Feb 2016, Haozhong Zhang wrote:
>>> On 02/02/16 17:11, Stefano Stabellini wrote:
>>>> On Mon, 1 Feb 2016, Haozhong Zhang wrote:
> [...]
>>>>>  This design treats host NVDIMM devices as ordinary MMIO devices:
>>>>>  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
>>>>>      and drive host NVDIMM devices (implementing block device
>>>>>      interface). Namespaces and file systems on host NVDIMM devices
>>>>>      are handled by Dom0 Linux as well.
>>>>>
>>>>>  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
>>>>>      virtual address space (buf).
>>>>>
>>>>>  (3) QEMU gets the host physical address of buf, i.e. the host system
>>>>>      physical address that is occupied by /dev/pmem0, and calls Xen
>>>>>      hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
>>>> How is this going to work from a security perspective? Is it going to
>>>> require running QEMU as root in Dom0, which will prevent NVDIMM from
>>>> working by default on Xen? If so, what's the plan?
>>>>
>>> Oh, I forgot to address the non-root qemu issues in this design ...
>>>
>>> The default user:group of /dev/pmem0 is root:disk, and its permission
>>> is rw-rw----. We could lift the others permission to rw, so that
>>> non-root QEMU can mmap /dev/pmem0. But it looks too risky.
>> Yep, too risky.
>>
>>
>>> Or, we can make a file system on /dev/pmem0, create files on it, set
>>> the owner of those files to xen-qemuuser-domid$domid, and then pass
>>> those files to QEMU. In this way, non-root QEMU should be able to
>>> mmap those files.
>> Maybe that would work. Worth adding it to the design, I would like to
>> read more details on it.
>>
>> Also note that QEMU initially runs as root but drops privileges to
>> xen-qemuuser-domid$domid before the guest is started. Initially QEMU
>> *could* mmap /dev/pmem0 while is still running as root, but then it
>> wouldn't work for any devices that need to be mmap'ed at run time
>> (hotplug scenario).
>>
> Thanks for this information. I'll test some experimental code and then
> post a design to address the non-root qemu issue.
>
>>>>>  (ACPI part is described in Section 3.3 later)
>>>>>
>>>>>  Above (1)(2) have already been done in current QEMU. Only (3) is
>>>>>  needed to implement in QEMU. No change is needed in Xen for address
>>>>>  mapping in this design.
>>>>>
>>>>>  Open: It seems no system call/ioctl is provided by Linux kernel to
>>>>>        get the physical address from a virtual address.
>>>>>        /proc/<qemu_pid>/pagemap provides information of mapping from
>>>>>        VA to PA. Is it an acceptable solution to let QEMU parse this
>>>>>        file to get the physical address?
>>>> Does it work in a non-root scenario?
>>>>
>>> Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
>>> | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
>>> | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
>>> | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
>>> | Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
>>>
>>> A possible alternative is to add a new hypercall similar to
>>> XEN_DOMCTL_memory_mapping but receiving virtual address as the address
>>> parameter and translating to machine address in the hypervisor.
>> That might work.
>>
>>
>>>>>  Open: For a large pmem, mmap(2) is very possible to not map all SPA
>>>>>        occupied by pmem at the beginning, i.e. QEMU may not be able to
>>>>>        get all SPA of pmem from buf (in virtual address space) when
>>>>>        calling XEN_DOMCTL_memory_mapping.
>>>>>        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
>>>>>        entire pmem being mmaped?
>>>> Ditto
>>>>
>>> No. If I take the above alternative for the first open, maybe the new
>>> hypercall above can inject page faults into dom0 for the unmapped
>>> virtual address so as to enforce dom0 Linux to create the page
>>> mapping.
>> Otherwise you need to use something like the mapcache in QEMU
>> (xen-mapcache.c), which admittedly, given its complexity, would be best
>> to avoid.
>>
> Definitely not mapcache like things. What I want is something similar to
> what emulate_gva_to_mfn() in Xen does.

Please not quite like that.  It would restrict this to only working in a
PV dom0.

MFNs are an implementation detail.  Interfaces should take GFNs which
have a consistent logical meaning between PV and HVM domains.

As an introduction,
http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/xen/mm.h;h=a795dd6001eff7c5dd942bbaf153e3efa5202318;hb=refs/heads/staging#l8

We also need to consider the Xen side security.  Currently a domain may
be given privilege to map an MMIO range.  IIRC, this allows the emulator
domain to make mappings for the guest, and for the guest to make
mappings itself.  With PMEM, we can't allow a domain to make mappings
itself because it could end up mapping resources which belong to another
domain.  We probably need an intermediate level which only permits an
emulator to make the mappings.

>
> [...]
>>>> If we start asking QEMU to build ACPI tables, why should we stop at NFIT
>>>> and SSDT?
>>> for easing my development of supporting vNVDIMM in Xen ... I mean
>>> NFIT and SSDT are the only two tables needed for this purpose and I'm
>>> afraid to break exiting guests if I completely switch to QEMU for
>>> guest ACPI tables.
>> I realize that my words have been a bit confusing. Not /all/ ACPI
>> tables, just all the tables regarding devices for which QEMU is in
>> charge (the PCI bus and all devices behind it). Anything related to cpus
>> and memory (FADT, MADT, etc) would still be left to hvmloader.
> OK, then it's clear for me. From Jan's reply, at least MCFG is from
> QEMU. I'll look at whether other PCI related tables are also from QEMU
> or similar to those in QEMU. If yes, then it looks reasonable to let
> QEMU generate them.

It is entirely likely that the current split of sources of ACPI tables
is incorrect.  We should also see what can be done about fixing that.

~Andrew


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 14:09       ` Andrew Cooper
@ 2016-02-03 14:23         ` Haozhong Zhang
  2016-02-05 14:40         ` Ross Philipson
  1 sibling, 0 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-03 14:23 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jun Nakajima, Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Ian Jackson, xen-devel,
	Jan Beulich, Xiao Guangrong, Keir Fraser

On 02/03/16 14:09, Andrew Cooper wrote:
> On 03/02/16 09:13, Jan Beulich wrote:
> >>>> On 03.02.16 at 08:00, <haozhong.zhang@intel.com> wrote:
> >> On 02/02/16 17:11, Stefano Stabellini wrote:
> >>> Once upon a time somebody made the decision that ACPI tables
> >>> on Xen should be static and included in hvmloader. That might have been
> >>> a bad decision but at least it was coherent. Loading only *some* tables
> >>> from QEMU, but not others, it feels like an incomplete design to me.
> >>>
> >>> For example, QEMU is currently in charge of emulating the PCI bus, why
> >>> shouldn't it be QEMU that generates the PRT and MCFG?
> >>>
> >> To Keir, Jan and Andrew:
> >>
> >> Are there anything related to ACPI that must be done (or are better to
> >> be done) in hvmloader?
> > Some of the static tables (FADT and HPET come to mind) likely would
> > better continue to live in hvmloader. MCFG (for example) coming from
> > qemu, otoh, would be quite natural (and would finally allow MMCFG
> > support for guests in the first place). I.e. ...
> >
> >>> I prefer switching to QEMU building all ACPI tables for devices that it
> >>> is emulating. However this alternative is good too because it is
> >>> coherent with the current design.
> >> I would prefer to this one if the final conclusion is that only one
> >> agent should be allowed to build guest ACPI. As I said above, it looks
> >> like a big change to switch to QEMU for all ACPI tables and I'm afraid
> >> it would break some existing guests. 
> > ... I indeed think that tables should come from qemu for components
> > living in qemu, and from hvmloader for components coming from Xen.
> 
> I agree.
> 
> There has to be a single entity responsible for collating the eventual
> ACPI handed to the guest, and this is definitely HVMLoader.
> 
> However, it is correct that Qemu create the ACPI tables for the devices
> it emulates for the guest.
> 
> We need to agree on a mechanism whereby each entity can provide their
> own subset of the ACPI tables to HVMLoader, and have HVMLoader present
> the final set properly to the VM.
> 
> There is an existing usecase of passing the Host SLIC table to a VM, for
> OEM Versions of Windows.  I believe this is achieved with
> HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little
> inflexible and could probably do with being made a little more generic.
>

Yes, that is what one of my v1 patches does
([PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu).

It extends the existing construct_passthrough_tables() to get the
address and size of acpi tables from its parameters (a pair of
xenstore keys) rather than the hardcoded ones.
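
Roughly, the hvmloader side of that flow would look like the sketch below
(the xenstore paths follow the design above, but the helper names and
exact signatures here are approximations, not the actual patch):

/* Probe xenstore for ACPI tables placed in guest memory by the device
 * model, and append them like construct_passthrough_tables() does. */
static void construct_dm_passthrough_tables(unsigned long *table_ptrs,
                                            int nr_tables)
{
    const char *s;
    uint32_t addr, length;

    s = xenstore_read("hvmloader/dm-acpi/address", NULL);
    if ( s == NULL )
        return;                 /* no device-model ACPI tables present */
    addr = strtoul(s, NULL, 0);

    s = xenstore_read("hvmloader/dm-acpi/length", NULL);
    if ( s == NULL )
        return;
    length = strtoul(s, NULL, 0);

    /* Walk the blob at [addr, addr + length) and append each table. */
    append_dm_tables(table_ptrs, nr_tables, addr, length); /* hypothetical */
}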


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03  9:18     ` Jan Beulich
  2016-02-03 12:22       ` Haozhong Zhang
@ 2016-02-03 14:30       ` Andrew Cooper
  2016-02-03 14:39         ` Jan Beulich
  1 sibling, 1 reply; 121+ messages in thread
From: Andrew Cooper @ 2016-02-03 14:30 UTC (permalink / raw)
  To: Jan Beulich, Haozhong Zhang, Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 03/02/16 09:18, Jan Beulich wrote:
>>
>>> In other words - the NVDIMM resource does not provide any resource
>>> isolation. However this may not be any different than what we had
>>> nowadays with CPU caches.
>>>
>> Does Xen have any mechanism to isolate multiple guests' operations on
>> CPU caches?
> No.

PSR Cache Allocation is supported in Xen 4.6 on supporting hardware, so
the administrator can partition guests if necessary.

~Andrew


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 14:30       ` Andrew Cooper
@ 2016-02-03 14:39         ` Jan Beulich
  0 siblings, 0 replies; 121+ messages in thread
From: Jan Beulich @ 2016-02-03 14:39 UTC (permalink / raw)
  To: Andrew Cooper, Haozhong Zhang, Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 03.02.16 at 15:30, <andrew.cooper3@citrix.com> wrote:
> On 03/02/16 09:18, Jan Beulich wrote:
>>>
>>>> In other words - the NVDIMM resource does not provide any resource
>>>> isolation. However this may not be any different than what we had
>>>> nowadays with CPU caches.
>>>>
>>> Does Xen have any mechanism to isolate multiple guests' operations on
>>> CPU caches?
>> No.
> 
> PSR Cache Allocation is supported in Xen 4.6 on supporting hardware, so
> the administrator can partition guests if necessary.

And if the hardware supports it (which for a while might be more the
exception than the rule).

Jan


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 12:02     ` Stefano Stabellini
  2016-02-03 13:11       ` Haozhong Zhang
@ 2016-02-03 15:16       ` George Dunlap
  2016-02-03 15:22         ` Stefano Stabellini
  2016-02-03 15:47       ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 121+ messages in thread
From: George Dunlap @ 2016-02-03 15:16 UTC (permalink / raw)
  To: Stefano Stabellini, Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Keir Fraser, Ian Campbell,
	George Dunlap, Andrew Cooper, Ian Jackson, Xiao Guangrong,
	xen-devel, Jan Beulich, Jun Nakajima, Wei Liu

On 03/02/16 12:02, Stefano Stabellini wrote:
> On Wed, 3 Feb 2016, Haozhong Zhang wrote:
>> Or, we can make a file system on /dev/pmem0, create files on it, set
>> the owner of those files to xen-qemuuser-domid$domid, and then pass
>> those files to QEMU. In this way, non-root QEMU should be able to
>> mmap those files.
> 
> Maybe that would work. Worth adding it to the design, I would like to
> read more details on it.
> 
> Also note that QEMU initially runs as root but drops privileges to
> xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> *could* mmap /dev/pmem0 while is still running as root, but then it
> wouldn't work for any devices that need to be mmap'ed at run time
> (hotplug scenario).

This is basically the same problem we have for a bunch of other things,
right?  Having xl open a file and then pass it via qmp to qemu should
work in theory, right?

 -George
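
The generic mechanism such an "open in xl, hand to QEMU" flow relies on is
SCM_RIGHTS file-descriptor passing over a Unix domain socket; a minimal
sketch of the sending side (independent of the xl/QMP specifics, which are
not shown here):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send an already-open fd (e.g. a file on a DAX filesystem on top of
 * /dev/pmem0, opened by the privileged toolstack) to an unprivileged
 * process over a connected Unix domain socket 'sock'. */
static int send_fd(int sock, int fd)
{
    struct msghdr msg;
    struct iovec iov;
    char dummy = 'F';
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(cbuf, 0, sizeof(cbuf));
    iov.iov_base = &dummy;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}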


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 15:16       ` George Dunlap
@ 2016-02-03 15:22         ` Stefano Stabellini
  2016-02-03 15:35           ` Konrad Rzeszutek Wilk
                             ` (2 more replies)
  0 siblings, 3 replies; 121+ messages in thread
From: Stefano Stabellini @ 2016-02-03 15:22 UTC (permalink / raw)
  To: George Dunlap
  Cc: Juergen Gross, Haozhong Zhang, Kevin Tian, Keir Fraser,
	Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper,
	Ian Jackson, Xiao Guangrong, xen-devel, Jan Beulich,
	Jun Nakajima, Wei Liu

On Wed, 3 Feb 2016, George Dunlap wrote:
> On 03/02/16 12:02, Stefano Stabellini wrote:
> > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> >> Or, we can make a file system on /dev/pmem0, create files on it, set
> >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> >> those files to QEMU. In this way, non-root QEMU should be able to
> >> mmap those files.
> > 
> > Maybe that would work. Worth adding it to the design, I would like to
> > read more details on it.
> > 
> > Also note that QEMU initially runs as root but drops privileges to
> > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > *could* mmap /dev/pmem0 while is still running as root, but then it
> > wouldn't work for any devices that need to be mmap'ed at run time
> > (hotplug scenario).
> 
> This is basically the same problem we have for a bunch of other things,
> right?  Having xl open a file and then pass it via qmp to qemu should
> work in theory, right?

Is there one /dev/pmem? per assignable region? Otherwise it wouldn't be
safe.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 15:22         ` Stefano Stabellini
@ 2016-02-03 15:35           ` Konrad Rzeszutek Wilk
  2016-02-03 15:35           ` George Dunlap
  2016-02-04  2:55           ` Haozhong Zhang
  2 siblings, 0 replies; 121+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-02-03 15:35 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Juergen Gross, Haozhong Zhang, Kevin Tian, Keir Fraser,
	Ian Campbell, George Dunlap, Andrew Cooper, Xiao Guangrong,
	Ian Jackson, George Dunlap, xen-devel, Jan Beulich, Jun Nakajima,
	Wei Liu

On Wed, Feb 03, 2016 at 03:22:59PM +0000, Stefano Stabellini wrote:
> On Wed, 3 Feb 2016, George Dunlap wrote:
> > On 03/02/16 12:02, Stefano Stabellini wrote:
> > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> > >> those files to QEMU. In this way, non-root QEMU should be able to
> > >> mmap those files.
> > > 
> > > Maybe that would work. Worth adding it to the design, I would like to
> > > read more details on it.
> > > 
> > > Also note that QEMU initially runs as root but drops privileges to
> > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > > *could* mmap /dev/pmem0 while is still running as root, but then it
> > > wouldn't work for any devices that need to be mmap'ed at run time
> > > (hotplug scenario).
> > 
> > This is basically the same problem we have for a bunch of other things,
> > right?  Having xl open a file and then pass it via qmp to qemu should
> > work in theory, right?
> 
> Is there one /dev/pmem? per assignable region? Otherwise it wouldn't be
> safe.

Can be - which may be interleaved on multiple NVDIMMs. But we would operate
on files (on the /dev/pmem which has a DAX-enabled filesystem).


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 15:22         ` Stefano Stabellini
  2016-02-03 15:35           ` Konrad Rzeszutek Wilk
@ 2016-02-03 15:35           ` George Dunlap
  2016-02-04  2:55           ` Haozhong Zhang
  2 siblings, 0 replies; 121+ messages in thread
From: George Dunlap @ 2016-02-03 15:35 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Juergen Gross, Haozhong Zhang, Kevin Tian, Keir Fraser,
	Ian Campbell, George Dunlap, Andrew Cooper, Ian Jackson,
	Xiao Guangrong, xen-devel, Jan Beulich, Jun Nakajima, Wei Liu

On 03/02/16 15:22, Stefano Stabellini wrote:
> On Wed, 3 Feb 2016, George Dunlap wrote:
>> On 03/02/16 12:02, Stefano Stabellini wrote:
>>> On Wed, 3 Feb 2016, Haozhong Zhang wrote:
>>>> Or, we can make a file system on /dev/pmem0, create files on it, set
>>>> the owner of those files to xen-qemuuser-domid$domid, and then pass
>>>> those files to QEMU. In this way, non-root QEMU should be able to
>>>> mmap those files.
>>>
>>> Maybe that would work. Worth adding it to the design, I would like to
>>> read more details on it.
>>>
>>> Also note that QEMU initially runs as root but drops privileges to
>>> xen-qemuuser-domid$domid before the guest is started. Initially QEMU
>>> *could* mmap /dev/pmem0 while is still running as root, but then it
>>> wouldn't work for any devices that need to be mmap'ed at run time
>>> (hotplug scenario).
>>
>> This is basically the same problem we have for a bunch of other things,
>> right?  Having xl open a file and then pass it via qmp to qemu should
>> work in theory, right?
> 
> Is there one /dev/pmem? per assignable region? Otherwise it wouldn't be
> safe.

If I understood Haozhong's description right, you'd be passing through
the entirety of one thing that Linux gave you.  At the moment that's one
/dev/pmemX, which currently corresponds to one region as specified
in the ACPI tables.  I understood his design going forward to mean that
it would rely on Linux to do any further partitioning within regions if
that was desired; in which case there would again be a single file that
qemu would have access to.

 -George


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 12:02     ` Stefano Stabellini
  2016-02-03 13:11       ` Haozhong Zhang
  2016-02-03 15:16       ` George Dunlap
@ 2016-02-03 15:47       ` Konrad Rzeszutek Wilk
  2016-02-04  2:36         ` Haozhong Zhang
  2016-02-15  9:04         ` Zhang, Haozhong
  2 siblings, 2 replies; 121+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-02-03 15:47 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Juergen Gross, Haozhong Zhang, Kevin Tian, Keir Fraser,
	Ian Campbell, George Dunlap, Andrew Cooper, Ian Jackson,
	Xiao Guangrong, xen-devel, Jan Beulich, Jun Nakajima, Wei Liu

> > > >  Open: It seems no system call/ioctl is provided by Linux kernel to
> > > >        get the physical address from a virtual address.
> > > >        /proc/<qemu_pid>/pagemap provides information of mapping from
> > > >        VA to PA. Is it an acceptable solution to let QEMU parse this
> > > >        file to get the physical address?
> > > 
> > > Does it work in a non-root scenario?
> > >
> > 
> > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
> > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> > | Reason: information about PFNs helps in exploiting Rowhammer vulnerability.

Ah right.
> >
> > A possible alternative is to add a new hypercall similar to
> > XEN_DOMCTL_memory_mapping but receiving virtual address as the address
> > parameter and translating to machine address in the hypervisor.
> 
> That might work.

That won't work.

This is a userspace VMA - which means that once the ioctl is done we swap
to kernel virtual addresses. Now we may know that the prior cr3 has the
userspace virtual address and walk it down - but what if the domain
that is doing this is PVH? (or HVM) - the cr3 of userspace is tucked somewhere
inside the kernel.

Which means this hypercall would need to know the Linux kernel task structure
to find this.

May I propose another solution - a stacking driver (similar to loop). You
set it up (ioctl /dev/pmem0/guest.img, get some /dev/mapper/guest.img created).
Then mmap the /dev/mapper/guest.img - all of the operations are the same - except
it may have an extra ioctl - get_pfns - which would provide the data in a similar
form to pagemap.txt.
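
A sketch of what the extra ioctl on such a (so far purely hypothetical)
stacking driver could expose -- everything here is invented for
illustration:

#include <linux/ioctl.h>
#include <linux/types.h>

/* Translate a range of the mapped file into host PFNs, pagemap-style,
 * but gated by the driver itself instead of by CAP_SYS_ADMIN. */
struct pmem_get_pfns {
    __u64 offset;    /* byte offset into the backing file */
    __u64 nr_pages;  /* number of pages to translate */
    __u64 pfns;      /* userspace pointer to a __u64[nr_pages] array */
};

#define PMEM_IOC_GET_PFNS _IOWR('P', 0x01, struct pmem_get_pfns)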

But folks will then ask - why don't you just use pagemap? Could the pagemap
have an extra security capability check? One that can be set for
QEMU?
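
For reference, the pagemap lookup being discussed is roughly the following
(a sketch; as quoted above, since Linux 4.0 the PFN field is only filled
in for CAP_SYS_ADMIN, which is exactly the non-root problem):

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Translate a virtual address of the calling process into a PFN via
 * /proc/self/pagemap.  Each entry is 8 bytes: bits 0-54 hold the PFN
 * and bit 63 is the "page present" flag. */
static int64_t va_to_pfn(const void *va)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t entry;
    off_t off = ((uintptr_t)va / page_size) * sizeof(entry);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    int64_t pfn = -1;

    if ( fd < 0 )
        return -1;
    if ( pread(fd, &entry, sizeof(entry), off) == sizeof(entry) &&
         (entry & (1ULL << 63)) )
        pfn = entry & ((1ULL << 55) - 1);
    close(fd);
    return pfn;
}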

> 
> 
> > > >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> > > >        occupied by pmem at the beginning, i.e. QEMU may not be able to
> > > >        get all SPA of pmem from buf (in virtual address space) when
> > > >        calling XEN_DOMCTL_memory_mapping.
> > > >        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> > > >        entire pmem being mmaped?
> > > 
> > > Ditto
> > >
> > 
> > No. If I take the above alternative for the first open, maybe the new
> > hypercall above can inject page faults into dom0 for the unmapped
> > virtual address so as to enforce dom0 Linux to create the page
> > mapping.

Ugh. That sounds hacky. And you wouldn't necessarily be safe.
Imagine that the system admin decides to defrag the /dev/pmem filesystem.
Or move the files (disk images) around. If they do that - we may
still have the guest mapped to system addresses which may contain filesystem
metadata now, or a different guest image. We MUST mlock or lock the file
for the lifetime of the guest.
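
To make the pattern concrete, the QEMU-side flow being discussed would be
along these lines (a sketch only: error-path cleanup is omitted, and
va_to_spa() is a placeholder for whatever VA-to-SPA mechanism is
eventually agreed on; xc_domain_memory_mapping() is the existing libxc
wrapper for XEN_DOMCTL_memory_mapping):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>
#include <xenctrl.h>

#define PAGE_SHIFT 12

/* Placeholder: translate a dom0 virtual address to the host SPA backing
 * it -- the open question in this thread. */
uint64_t va_to_spa(void *va);

/* Map a host pmem device/file, pin the mapping, and ask Xen to map the
 * backing frames into the guest at start_gfn. */
static int map_pmem_into_guest(xc_interface *xch, uint32_t domid,
                               const char *path, uint64_t start_gfn,
                               uint64_t nr_pages)
{
    size_t len = nr_pages << PAGE_SHIFT;
    int fd = open(path, O_RDWR);
    void *buf;
    uint64_t first_mfn;

    if ( fd < 0 )
        return -1;

    buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if ( buf == MAP_FAILED )
        return -1;

    /* Keep the whole range mapped/pinned for the lifetime of the guest,
     * per the concern above about the backing file moving underneath us. */
    if ( mlock(buf, len) )
        return -1;

    first_mfn = va_to_spa(buf) >> PAGE_SHIFT;   /* placeholder step */

    return xc_domain_memory_mapping(xch, domid, start_gfn, first_mfn,
                                    nr_pages, DPCI_ADD_MAPPING);
}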


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 15:47       ` Konrad Rzeszutek Wilk
@ 2016-02-04  2:36         ` Haozhong Zhang
  2016-02-15  9:04         ` Zhang, Haozhong
  1 sibling, 0 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-04  2:36 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Wei Liu, Kevin Tian, Keir Fraser, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Jun Nakajima, Xiao Guangrong

On 02/03/16 10:47, Konrad Rzeszutek Wilk wrote:
> > > > >  Open: It seems no system call/ioctl is provided by Linux kernel to
> > > > >        get the physical address from a virtual address.
> > > > >        /proc/<qemu_pid>/pagemap provides information of mapping from
> > > > >        VA to PA. Is it an acceptable solution to let QEMU parse this
> > > > >        file to get the physical address?
> > > >
> > > > Does it work in a non-root scenario?
> > > >
> > >
> > > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> > > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
> > > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> > > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> > > | Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
>
> Ah right.
> > >
> > > A possible alternative is to add a new hypercall similar to
> > > XEN_DOMCTL_memory_mapping but receiving virtual address as the address
> > > parameter and translating to machine address in the hypervisor.
> >
> > That might work.
>
> That won't work.
>
> This is a userspace VMA - which means the once the ioctl is done we swap
> to kernel virtual addresses. Now we may know that the prior cr3 has the
> userspace virtual address and walk it down - but what if the domain
> that is doing this is PVH? (or HVM) - the cr3 of userspace is tucked somewhere
> inside the kernel.
>
> Which means this hypercall would need to know the Linux kernel task structure
> to find this.
>

Thanks for pointing this out. It's really not a workable solution.

> May I propose another solution - an stacking driver (similar to loop). You
> setup it up (ioctl /dev/pmem0/guest.img, get some /dev/mapper/guest.img created).
> Then mmap the /dev/mapper/guest.img - all of the operations are the same - except
> it may have an extra ioctl - get_pfns - which would provide the data in similar
> form to pagemap.txt.
>

I'll have a look at this, thanks!

> But folks will then ask - why don't you just use pagemap? Could the pagemap
> have an extra security capability check? One that can be set for
> QEMU?
>

Basically because of the concern about whether non-root QEMU could work,
as raised in Stefano's comments.

> >
> >
> > > > >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> > > > >        occupied by pmem at the beginning, i.e. QEMU may not be able to
> > > > >        get all SPA of pmem from buf (in virtual address space) when
> > > > >        calling XEN_DOMCTL_memory_mapping.
> > > > >        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> > > > >        entire pmem being mmaped?
> > > >
> > > > Ditto
> > > >
> > >
> > > No. If I take the above alternative for the first open, maybe the new
> > > hypercall above can inject page faults into dom0 for the unmapped
> > > virtual address so as to enforce dom0 Linux to create the page
> > > mapping.
>
> Ugh. That sounds hacky. And you wouldn't neccessarily be safe.
> Imagine that the system admin decides to defrag the /dev/pmem filesystem.
> Or move the files (disk images) around. If they do that - we may
> still have the guest mapped to system addresses which may contain filesystem
> metadata now, or a different guest image. We MUST mlock or lock the file
> during the duration of the guest.
>
>

So mlocking or locking the mmaped file, or some other way to 'pin' the
mmaped file on pmem, is a necessity.
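
For example, a minimal sketch of the mapping side of this (an illustrative
sketch only, assuming fd refers to the file on /dev/pmem0 and length is its
size) could be:

    /* Sketch: map the pmem file and pin the mapping for the guest's
     * lifetime. MAP_LOCKED/mlock() pin the pages; whether that also pins
     * the file's block allocation on a DAX filesystem is exactly the open
     * question above. */
    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    static void *map_and_pin_pmem(int fd, size_t length)
    {
        void *buf = mmap(NULL, length, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_LOCKED, fd, 0);

        if (buf == MAP_FAILED)
            return NULL;
        if (mlock(buf, length) < 0) {   /* may need RLIMIT_MEMLOCK raised */
            munmap(buf, length);
            return NULL;
        }
        return buf;
    }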

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 15:22         ` Stefano Stabellini
  2016-02-03 15:35           ` Konrad Rzeszutek Wilk
  2016-02-03 15:35           ` George Dunlap
@ 2016-02-04  2:55           ` Haozhong Zhang
  2016-02-04 12:24             ` Stefano Stabellini
  2 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-04  2:55 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Juergen Gross, Kevin Tian, Keir Fraser, Ian Campbell,
	George Dunlap, Andrew Cooper, Xiao Guangrong, Ian Jackson,
	George Dunlap, xen-devel, Jan Beulich, Jun Nakajima, Wei Liu

On 02/03/16 15:22, Stefano Stabellini wrote:
> On Wed, 3 Feb 2016, George Dunlap wrote:
> > On 03/02/16 12:02, Stefano Stabellini wrote:
> > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> > >> those files to QEMU. In this way, non-root QEMU should be able to
> > >> mmap those files.
> > >
> > > Maybe that would work. Worth adding it to the design, I would like to
> > > read more details on it.
> > >
> > > Also note that QEMU initially runs as root but drops privileges to
> > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > > *could* mmap /dev/pmem0 while is still running as root, but then it
> > > wouldn't work for any devices that need to be mmap'ed at run time
> > > (hotplug scenario).
> >
> > This is basically the same problem we have for a bunch of other things,
> > right?  Having xl open a file and then pass it via qmp to qemu should
> > work in theory, right?
>
> Is there one /dev/pmem? per assignable region?

Yes.

BTW, I'm wondering whether and how non-root qemu works with an xl disk
configuration that accesses a host block device, e.g.
     disk = [ '/dev/sdb,,hda' ]
If that works with non-root qemu, I may take a similar solution for
pmem.

Thanks,
Haozhong

> Otherwise it wouldn't be safe.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 14:20         ` Andrew Cooper
@ 2016-02-04  3:10           ` Haozhong Zhang
  0 siblings, 0 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-04  3:10 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Juergen Gross, Kevin Tian, Keir Fraser, Ian Campbell,
	George Dunlap, Stefano Stabellini, Ian Jackson, Xiao Guangrong,
	xen-devel, Jan Beulich, Jun Nakajima, Wei Liu

On 02/03/16 14:20, Andrew Cooper wrote:
> >>>>>  (ACPI part is described in Section 3.3 later)
> >>>>>
> >>>>>  Above (1)(2) have already been done in current QEMU. Only (3) is
> >>>>>  needed to implement in QEMU. No change is needed in Xen for address
> >>>>>  mapping in this design.
> >>>>>
> >>>>>  Open: It seems no system call/ioctl is provided by Linux kernel to
> >>>>>        get the physical address from a virtual address.
> >>>>>        /proc/<qemu_pid>/pagemap provides information of mapping from
> >>>>>        VA to PA. Is it an acceptable solution to let QEMU parse this
> >>>>>        file to get the physical address?
> >>>> Does it work in a non-root scenario?
> >>>>
> >>> Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> >>> | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
> >>> | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> >>> | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> >>> | Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
> >>>
> >>> A possible alternative is to add a new hypercall similar to
> >>> XEN_DOMCTL_memory_mapping but receiving virtual address as the address
> >>> parameter and translating to machine address in the hypervisor.
> >> That might work.
> >>
> >>
> >>>>>  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> >>>>>        occupied by pmem at the beginning, i.e. QEMU may not be able to
> >>>>>        get all SPA of pmem from buf (in virtual address space) when
> >>>>>        calling XEN_DOMCTL_memory_mapping.
> >>>>>        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> >>>>>        entire pmem being mmaped?
> >>>> Ditto
> >>>>
> >>> No. If I take the above alternative for the first open, maybe the new
> >>> hypercall above can inject page faults into dom0 for the unmapped
> >>> virtual address so as to enforce dom0 Linux to create the page
> >>> mapping.
> >> Otherwise you need to use something like the mapcache in QEMU
> >> (xen-mapcache.c), which admittedly, given its complexity, would be best
> >> to avoid.
> >>
> > Definitely not mapcache like things. What I want is something similar to
> > what emulate_gva_to_mfn() in Xen does.
>
> Please not quite like that.  It would restrict this to only working in a
> PV dom0.
>
> MFNs are an implementation detail.

I don't get this point.
What do you mean by 'implementation detail'? Architectural differences?

> Interfaces should take GFNs which
> are consistent logical meaning between PV and HVM domains.
>
> As an introduction,
> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/xen/mm.h;h=a795dd6001eff7c5dd942bbaf153e3efa5202318;hb=refs/heads/staging#l8
>
> We also need to consider the Xen side security.  Currently a domain may
> be given privilege to map an MMIO range.  IIRC, this allows the emulator
> domain to make mappings for the guest, and for the guest to make
> mappings itself.  With PMEM, we can't allow a domain to make mappings
> itself because it could end up mapping resources which belong to another
> domain.  We probably need an intermediate level which only permits an
> emulator to make the mappings.
>

Agreed, this hypercall should not be callable by arbitrary domains. Is
there any existing mechanism in Xen to restrict the callers of a hypercall?

> >
> > [...]
> >>>> If we start asking QEMU to build ACPI tables, why should we stop at NFIT
> >>>> and SSDT?
> >>> for easing my development of supporting vNVDIMM in Xen ... I mean
> >>> NFIT and SSDT are the only two tables needed for this purpose and I'm
> >>> afraid to break exiting guests if I completely switch to QEMU for
> >>> guest ACPI tables.
> >> I realize that my words have been a bit confusing. Not /all/ ACPI
> >> tables, just all the tables regarding devices for which QEMU is in
> >> charge (the PCI bus and all devices behind it). Anything related to cpus
> >> and memory (FADT, MADT, etc) would still be left to hvmloader.
> > OK, then it's clear for me. From Jan's reply, at least MCFG is from
> > QEMU. I'll look at whether other PCI related tables are also from QEMU
> > or similar to those in QEMU. If yes, then it looks reasonable to let
> > QEMU generate them.
>
> It is entirely likely that the current split of sources of APCI tables
> is incorrect.  We should also see what can be done about fixing that.
>

How about Jan's comment:
| tables should come from qemu for components living in qemu, and from
| hvmloader for components coming from Xen

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-04  2:55           ` Haozhong Zhang
@ 2016-02-04 12:24             ` Stefano Stabellini
  2016-02-15  3:16               ` Zhang, Haozhong
  0 siblings, 1 reply; 121+ messages in thread
From: Stefano Stabellini @ 2016-02-04 12:24 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Keir Fraser, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	George Dunlap, Xiao Guangrong, xen-devel, Jan Beulich,
	Jun Nakajima, Wei Liu

On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> On 02/03/16 15:22, Stefano Stabellini wrote:
> > On Wed, 3 Feb 2016, George Dunlap wrote:
> > > On 03/02/16 12:02, Stefano Stabellini wrote:
> > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> > > >> those files to QEMU. In this way, non-root QEMU should be able to
> > > >> mmap those files.
> > > >
> > > > Maybe that would work. Worth adding it to the design, I would like to
> > > > read more details on it.
> > > >
> > > > Also note that QEMU initially runs as root but drops privileges to
> > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > > > *could* mmap /dev/pmem0 while is still running as root, but then it
> > > > wouldn't work for any devices that need to be mmap'ed at run time
> > > > (hotplug scenario).
> > >
> > > This is basically the same problem we have for a bunch of other things,
> > > right?  Having xl open a file and then pass it via qmp to qemu should
> > > work in theory, right?
> >
> > Is there one /dev/pmem? per assignable region?
> 
> Yes.
> 
> BTW, I'm wondering whether and how non-root qemu works with xl disk
> configuration that is going to access a host block device, e.g.
>      disk = [ '/dev/sdb,,hda' ]
> If that works with non-root qemu, I may take the similar solution for
> pmem.
 
Today the user is required to give the correct ownership and access mode
to the block device, so that non-root QEMU can open it. However, in the
case of PCI passthrough, QEMU needs to mmap /dev/mem; as a consequence,
the feature doesn't work at all with non-root QEMU
(http://marc.info/?l=xen-devel&m=145261763600528).

If there is one /dev/pmem device per assignable region, then it would be
conceivable to change its ownership so that non-root QEMU can open it.
Or, better, the file descriptor could be passed by the toolstack via
qmp.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 14:09       ` Andrew Cooper
  2016-02-03 14:23         ` Haozhong Zhang
@ 2016-02-05 14:40         ` Ross Philipson
  2016-02-06  1:43           ` Haozhong Zhang
  1 sibling, 1 reply; 121+ messages in thread
From: Ross Philipson @ 2016-02-05 14:40 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/03/2016 09:09 AM, Andrew Cooper wrote:
> On 03/02/16 09:13, Jan Beulich wrote:
>>>>> On 03.02.16 at 08:00, <haozhong.zhang@intel.com> wrote:
>>> On 02/02/16 17:11, Stefano Stabellini wrote:
>>>> Once upon a time somebody made the decision that ACPI tables
>>>> on Xen should be static and included in hvmloader. That might have been
>>>> a bad decision but at least it was coherent. Loading only *some* tables
>>>> from QEMU, but not others, it feels like an incomplete design to me.
>>>>
>>>> For example, QEMU is currently in charge of emulating the PCI bus, why
>>>> shouldn't it be QEMU that generates the PRT and MCFG?
>>>>
>>> To Keir, Jan and Andrew:
>>>
>>> Are there anything related to ACPI that must be done (or are better to
>>> be done) in hvmloader?
>> Some of the static tables (FADT and HPET come to mind) likely would
>> better continue to live in hvmloader. MCFG (for example) coming from
>> qemu, otoh, would be quite natural (and would finally allow MMCFG
>> support for guests in the first place). I.e. ...
>>
>>>> I prefer switching to QEMU building all ACPI tables for devices that it
>>>> is emulating. However this alternative is good too because it is
>>>> coherent with the current design.
>>> I would prefer to this one if the final conclusion is that only one
>>> agent should be allowed to build guest ACPI. As I said above, it looks
>>> like a big change to switch to QEMU for all ACPI tables and I'm afraid
>>> it would break some existing guests.
>> ... I indeed think that tables should come from qemu for components
>> living in qemu, and from hvmloader for components coming from Xen.
>
> I agree.
>
> There has to be a single entity responsible for collating the eventual
> ACPI handed to the guest, and this is definitely HVMLoader.
>
> However, it is correct that Qemu create the ACPI tables for the devices
> it emulates for the guest.
>
> We need to agree on a mechanism whereby each entity can provide their
> own subset of the ACPI tables to HVMLoader, and have HVMLoader present
> the final set properly to the VM.
>
> There is an existing usecase of passing the Host SLIC table to a VM, for
> OEM Versions of Windows.  I believe this is achieved with
> HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little
> inflexible and could probably do with being made a little more generic.

A while back I added a generic mechanism to load extra ACPI tables into 
a guest, configurable at runtime. It looks like the functionality is 
still present. That might be an option.

Also, following the thread, it wasn't clear if some of the tables, like
the SSDT for the NVDIMM device and its _FIT/_DSM methods, were something
that could be statically created at build time. If it is something that
needs to be generated at runtime (e.g. platform specific), I have a
library that can generate any AML on the fly and create SSDTs.

Anyway just FYI in case this is helpful.

>
> ~Andrew
>


-- 
Ross Philipson


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-05 14:40         ` Ross Philipson
@ 2016-02-06  1:43           ` Haozhong Zhang
  2016-02-06 16:17             ` Ross Philipson
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-06  1:43 UTC (permalink / raw)
  To: Ross Philipson
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On 02/05/16 09:40, Ross Philipson wrote:
> On 02/03/2016 09:09 AM, Andrew Cooper wrote:
[...]
> >I agree.
> >
> >There has to be a single entity responsible for collating the eventual
> >ACPI handed to the guest, and this is definitely HVMLoader.
> >
> >However, it is correct that Qemu create the ACPI tables for the devices
> >it emulates for the guest.
> >
> >We need to agree on a mechanism whereby each entity can provide their
> >own subset of the ACPI tables to HVMLoader, and have HVMLoader present
> >the final set properly to the VM.
> >
> >There is an existing usecase of passing the Host SLIC table to a VM, for
> >OEM Versions of Windows.  I believe this is achieved with
> >HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little
> >inflexible and could probably do with being made a little more generic.
>
> A while back I added a generic mechanism to load extra ACPI tables into a
> guest, configurable at runtime. It looks like the functionality is still
> present. That might be an option.
>
> Also, following the thread, it wasn't clear if some of the tables like the
> SSDT for the NVDIMM device and it's _FIT/_DSM methods were something that
> could be statically created at build time. If it is something that needs to
> be generated at runtime (e.g. platform specific), I have a library that can
> generate any AML on the fly and create SSDTs.
>
> Anyway just FYI in case this is helpful.
>

Hi Ross,

Thanks for the information!

The SSDT for NVDIMM devices cannot be created statically, because the
number of some items in it cannot be determined at build time. For
example, the number of NVDIMM ACPI namespace devices (_DSM is under each
of them) defined in the SSDT is determined by the number of vNVDIMM devices
in the domain configuration. FYI, a sample SSDT for NVDIMM looks like

  Scope (\_SB){
      Device (NVDR) // NVDIMM Root device
      {
          Name (_HID, "ACPI0012")
          Method (_STA) {...}
          Method (_FIT) {...}
          Method (_DSM, ...) {
              ...
          }
      }

      Device (NVD0) // 1st NVDIMM Device
      {
          Name(_ADR, h0)
          Method (_DSM, ...) {
              ...
          }
      }

      Device (NVD1) // 2nd NVDIMM Device
      {
          Name(_ADR, h1)
          Method (_DSM, ...) {
              ...
          }
      }

      ...
  }
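
Since the number of NVDn devices is only known from the domain
configuration, that part of the SSDT has to be generated programmatically
at domain creation time. A rough sketch of the loop, using QEMU's AML
build helpers (helper names as in hw/acpi/aml-build.c; exact signatures
differ between QEMU versions, so treat this as an assumption rather than
the actual vNVDIMM code):

    /* Sketch: emit one NVDn ACPI namespace device per configured vNVDIMM
     * slot under the \_SB scope. The _DSM bodies are omitted here; they
     * forward requests to the emulation code. */
    static void sketch_build_nvdimm_devices(Aml *sb_scope, unsigned nr_vnvdimm)
    {
        unsigned i;

        for (i = 0; i < nr_vnvdimm; i++) {
            Aml *dev = aml_device("NVD%u", i);

            /* _ADR carries the NFIT device handle of this vNVDIMM */
            aml_append(dev, aml_name_decl("_ADR", aml_int(i + 1)));
            aml_append(sb_scope, dev);
        }
    }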

I had ported QEMU's AML builder code as well as its NVDIMM ACPI building
code to hvmloader and it did work, but that resulted in too much code
duplicated between QEMU and hvmloader for vNVDIMM. Therefore, I prefer to
let QEMU, which emulates the vNVDIMM devices, build those tables, as in
Andrew and Jan's replies.

Thanks,
Haozhong



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-06  1:43           ` Haozhong Zhang
@ 2016-02-06 16:17             ` Ross Philipson
  0 siblings, 0 replies; 121+ messages in thread
From: Ross Philipson @ 2016-02-06 16:17 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, Juergen Gross, Kevin Tian, Wei Liu,
	Ian Campbell, Stefano Stabellini, George Dunlap, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/05/2016 08:43 PM, Haozhong Zhang wrote:
> On 02/05/16 09:40, Ross Philipson wrote:
>> On 02/03/2016 09:09 AM, Andrew Cooper wrote:
> [...]
>>> I agree.
>>>
>>> There has to be a single entity responsible for collating the eventual
>>> ACPI handed to the guest, and this is definitely HVMLoader.
>>>
>>> However, it is correct that Qemu create the ACPI tables for the devices
>>> it emulates for the guest.
>>>
>>> We need to agree on a mechanism whereby each entity can provide their
>>> own subset of the ACPI tables to HVMLoader, and have HVMLoader present
>>> the final set properly to the VM.
>>>
>>> There is an existing usecase of passing the Host SLIC table to a VM, for
>>> OEM Versions of Windows.  I believe this is achieved with
>>> HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little
>>> inflexible and could probably do with being made a little more generic.
>>
>> A while back I added a generic mechanism to load extra ACPI tables into a
>> guest, configurable at runtime. It looks like the functionality is still
>> present. That might be an option.
>>
>> Also, following the thread, it wasn't clear if some of the tables like the
>> SSDT for the NVDIMM device and it's _FIT/_DSM methods were something that
>> could be statically created at build time. If it is something that needs to
>> be generated at runtime (e.g. platform specific), I have a library that can
>> generate any AML on the fly and create SSDTs.
>>
>> Anyway just FYI in case this is helpful.
>>
> 
> Hi Ross,
> 
> Thanks for the information!
> 
> SSDT for NVDIMM devices can not be created statically, because the
> number of some items in it can not be determined at build time. For
> example, the number of NVDIMM ACPI namespace devices (_DSM is under it)
> defined in SSDT is determined by the number of vNVDIMM devices in domain
> configuration. FYI, a sample SSDT for NVDIMM looks like
> 
>   Scope (\_SB){
>       Device (NVDR) // NVDIMM Root device
>       {
>           Name (_HID, “ACPI0012”)
>           Method (_STA) {...}
>           Method (_FIT) {...}
>           Method (_DSM, ...) {
>               ...
>           }
>       }
> 
>       Device (NVD0) // 1st NVDIMM Device
>       {
>           Name(_ADR, h0)
>           Method (_DSM, ...) {
>               ...
>           }
>       }
> 
>       Device (NVD1) // 2nd NVDIMM Device
>       {
>           Name(_ADR, h1)
>           Method (_DSM, ...) {
>               ...
>           }
>       }
> 
>       ...
>   }

Makes sense.

> 
> I had ported QEMU's AML builder code as well as NVDIMM ACPI building
> code to hvmloader and it did work, but then there was too much
> duplicated code for vNVDIMM between QEMU and hvmloader for vNVDIMM.
> Therefore, I prefer to let QEMU that emulates vNVDIMM devices
> to build those tables, as in Andrew and Jan's replies.

Yeah, it looks like QEMU's AML generating code is quite complete nowadays.
Back when I wrote my library there wasn't really much out there. Anyway,
this is where it is, in case there is something missing that my library
can generate:

https://github.com/OpenXT/xctools/tree/master/libxenacpi


> 
> Thanks,
> Haozhong
> 


-- 
Ross Philipson


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-04 12:24             ` Stefano Stabellini
@ 2016-02-15  3:16               ` Zhang, Haozhong
  2016-02-16 11:14                 ` Stefano Stabellini
  0 siblings, 1 reply; 121+ messages in thread
From: Zhang, Haozhong @ 2016-02-15  3:16 UTC (permalink / raw)
  To: Stefano Stabellini, Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Tian, Kevin, Keir Fraser, Ian Campbell,
	George Dunlap, Andrew Cooper, Xiao Guangrong, Ian Jackson,
	George Dunlap, xen-devel, Jan Beulich, Nakajima, Jun, Wei Liu

On 02/04/16 20:24, Stefano Stabellini wrote:
> On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> > On 02/03/16 15:22, Stefano Stabellini wrote:
> > > On Wed, 3 Feb 2016, George Dunlap wrote:
> > > > On 03/02/16 12:02, Stefano Stabellini wrote:
> > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > > > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> > > > >> those files to QEMU. In this way, non-root QEMU should be able to
> > > > >> mmap those files.
> > > > >
> > > > > Maybe that would work. Worth adding it to the design, I would like to
> > > > > read more details on it.
> > > > >
> > > > > Also note that QEMU initially runs as root but drops privileges to
> > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > > > > *could* mmap /dev/pmem0 while is still running as root, but then it
> > > > > wouldn't work for any devices that need to be mmap'ed at run time
> > > > > (hotplug scenario).
> > > >
> > > > This is basically the same problem we have for a bunch of other things,
> > > > right?  Having xl open a file and then pass it via qmp to qemu should
> > > > work in theory, right?
> > >
> > > Is there one /dev/pmem? per assignable region?
> > 
> > Yes.
> > 
> > BTW, I'm wondering whether and how non-root qemu works with xl disk
> > configuration that is going to access a host block device, e.g.
> >      disk = [ '/dev/sdb,,hda' ]
> > If that works with non-root qemu, I may take the similar solution for
> > pmem.
>  
> Today the user is required to give the correct ownership and access mode
> to the block device, so that non-root QEMU can open it. However in the
> case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
> the feature doesn't work at all with non-root QEMU
> (http://marc.info/?l=xen-devel&m=145261763600528).
> 
> If there is one /dev/pmem device per assignable region, then it would be
> conceivable to change its ownership so that non-root QEMU can open it.
> Or, better, the file descriptor could be passed by the toolstack via
> qmp.

Passing a file descriptor via qmp is not enough.

Let me clarify where the requirement for root/privileged permissions
comes from. The primary workflow in my design that maps a host pmem
region, or files in a host pmem region, to a guest is as below:
 (1) QEMU in Dom0 mmaps the host pmem (the host /dev/pmem0 or files on
     /dev/pmem0) to its virtual address space, i.e. the guest virtual
     address space.
 (2) QEMU asks the Xen hypervisor to map the host physical address, i.e.
     the SPA occupied by the host pmem, to a DomU. This step requires the
     translation from the guest virtual address (where the host pmem is
     mmaped in (1)) to the host physical address. The translation can be
     done by either
    (a) QEMU, which parses its own /proc/self/pagemap,
     or
    (b) the Xen hypervisor, which does the translation by itself [1] (though
        this choice does not look doable according to Konrad's comments [2]).

[1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html
[2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html

For 2-a, reading /proc/self/pagemap requires the CAP_SYS_ADMIN capability
since Linux kernel 4.0. Furthermore, if we don't mlock the mapped host
pmem (by adding the MAP_LOCKED flag to mmap or calling mlock after mmap),
pagemap will not contain all mappings. However, mlock may require
privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
mlock operates on memory, the permission to open(2) the host pmem files
does not solve the problem, and therefore passing a file descriptor via qmp
does not help.
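
For reference, the translation itself in 2-a is simple once those
conditions hold; a minimal sketch (assuming CAP_SYS_ADMIN and that the
page is present, e.g. because the mapping is mlocked) looks like:

    /* Sketch: translate one virtual address into a PFN via
     * /proc/self/pagemap. Each 64-bit entry holds the PFN in bits 0-54 and
     * the "present" flag in bit 63; the entry for virtual page N is at
     * file offset N * 8. Returns 0 on any failure. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    static uint64_t va_to_pfn(uintptr_t va)
    {
        long psize = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        uint64_t entry = 0, pfn = 0;

        if (fd < 0)
            return 0;
        if (pread(fd, &entry, sizeof(entry), (va / psize) * 8) == sizeof(entry)
            && (entry & (1ULL << 63)))            /* page present */
            pfn = entry & ((1ULL << 55) - 1);     /* bits 0-54: PFN */
        close(fd);
        return pfn;
    }

The PFNs obtained this way for each page of the mmaped region would then
be passed to Xen through the existing XEN_DOMCTL_memory_mapping path
(xc_domain_memory_mapping() in libxc).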

For 2-b, according to Konrad's comments [2], mlock is also required, and
privileged permission may consequently be required as well.

Note that the mapping and the address translation are done before QEMU
drops privileged permissions, so non-root QEMU should be able to work
with the above design until we start considering vNVDIMM hotplug (which is
not supported by the current vNVDIMM implementation in QEMU). In
the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
running with root permissions.

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-02 19:15 ` Konrad Rzeszutek Wilk
  2016-02-03  8:28   ` Haozhong Zhang
@ 2016-02-15  8:43   ` Haozhong Zhang
  2016-02-15 11:07     ` Jan Beulich
  1 sibling, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-15  8:43 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Tian, Kevin, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Nakajima, Jun, Xiao Guangrong,
	Keir Fraser

On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote:
> > 3. Design of vNVDIMM in Xen
> 
> Thank you for this design!
> 
> > 
> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
> >  three parts:
> >  (1) Guest clwb/clflushopt/pcommit enabling,
> >  (2) Memory mapping, and
> >  (3) Guest ACPI emulation.
> 
> 
> .. MCE? and vMCE?
> 

NVDIMM can generate UCR errors like normal ram. Xen may handle them in a
way similar to what mc_memerr_dhandler() does, with some differences in
the data structure and the broken-page offlining parts:

Broken NVDIMM pages should be marked as "offlined" so that the Xen
hypervisor can refuse further requests that map them to a DomU.

The real problem here is what data structure will be used to record
information about NVDIMM pages. Because the size of NVDIMM is usually much
larger than normal ram, using struct page_info for NVDIMM pages would
occupy too much memory.

Alternatively, we may use a range set to represent NVDIMM pages:

    struct nvdimm_pages
    {
        unsigned long mfn; /* starting MFN of a range of NVDIMM pages */
        unsigned long gfn; /* starting GFN where this range is mapped,
                              initially INVALID_GFN */
        unsigned long len; /* length of this range in bytes */

        int broken;        /* 0: initial value,
                              1: this range of NVDIMM pages is broken and offlined */

        struct domain *d;  /* NULL: initial value,
                              not NULL: the domain this range is mapped to */

        /*
         * Every nvdimm_pages structure is linked in the global
         * xen_nvdimm_pages_list (via global_list, embedded so that no
         * separate list-node allocation is needed).
         *
         * If it is mapped to a domain d, it is also linked in
         * d->arch.nvdimm_pages_list (via domain_list).
         */
        struct list_head domain_list;
        struct list_head global_list;
    };

    struct list_head xen_nvdimm_pages_list;

    /* in asm-x86/domain.h */
    struct arch_domain
    {
        ...
        struct list_head nvdimm_pages_list;
    };

(1) Initially, the Xen hypervisor creates an nvdimm_pages structure for each
    pmem region (starting SPA and size reported by the Dom0 NVDIMM driver)
    and links all nvdimm_pages structures into xen_nvdimm_pages_list.

(2) If the Xen hypervisor is then requested to map a range of NVDIMM pages
    [start_mfn, end_mfn] to gfn of domain d, it will

   (a) Check whether the GFN range [gfn, gfn + (end_mfn - start_mfn)]
       of domain d has been occupied (e.g. by normal ram, I/O or another
       vNVDIMM).

   (b) Search xen_nvdimm_pages_list for the one or multiple nvdimm_pages
       that [start_mfn, end_mfn] fits in.

       If an nvdimm_pages structure is entirely covered by [start_mfn,
       end_mfn], then link that nvdimm_pages structure to
       d->arch.nvdimm_pages_list.

       If only a portion of an nvdimm_pages structure is covered by
       [start_mfn, end_mfn], then split that nvdimm_pages structure
       into multiple ones (the one entirely covered and at most two not
       covered), link the covered one to d->arch.nvdimm_pages_list, and
       keep all of them on xen_nvdimm_pages_list as well.

       The gfn and d fields of the nvdimm_pages structures linked to
       d->arch.nvdimm_pages_list are also set accordingly.

(3) When a domain d is shut down/destroyed, merge its nvdimm_pages
    structures (i.e. those in d->arch.nvdimm_pages_list) back into
    xen_nvdimm_pages_list.

(4) When an MCE for a host NVDIMM SPA range [start_mfn, end_mfn] happens,
  (a) search xen_nvdimm_pages_list for the affected nvdimm_pages structures
      (see the lookup sketch after this list),
  (b) for each affected nvdimm_pages, if it belongs to a domain d and
      its broken field is already set, shut down the domain d to
      prevent a malicious guest from accessing the broken page (similarly
      to what offline_page() does),
  (c) for each affected nvdimm_pages, set its broken field to 1, and
  (d) for each affected nvdimm_pages, inject into domain d a vMCE that
      covers its GFN range, if that nvdimm_pages belongs to a domain d.
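
As a sketch of the lookup in (4)(a), assuming the nvdimm_pages structures
proposed above (locking and the actual splitting omitted; this is not
existing Xen code):

    /* Sketch for step (4)(a): find the tracked pmem range, if any, that
     * contains a broken host page. */
    static struct nvdimm_pages *nvdimm_find_range(unsigned long bad_mfn)
    {
        struct nvdimm_pages *np;

        list_for_each_entry ( np, &xen_nvdimm_pages_list, global_list )
        {
            unsigned long nr_pages = np->len >> PAGE_SHIFT;

            if ( bad_mfn >= np->mfn && bad_mfn < np->mfn + nr_pages )
                return np;
        }

        return NULL;
    }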

Comments, pls.

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-03 15:47       ` Konrad Rzeszutek Wilk
  2016-02-04  2:36         ` Haozhong Zhang
@ 2016-02-15  9:04         ` Zhang, Haozhong
  1 sibling, 0 replies; 121+ messages in thread
From: Zhang, Haozhong @ 2016-02-15  9:04 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Wei Liu, Tian, Kevin, Keir Fraser, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Nakajima, Jun, Xiao Guangrong

On 02/03/16 23:47, Konrad Rzeszutek Wilk wrote:
> > > > >  Open: It seems no system call/ioctl is provided by Linux kernel to
> > > > >        get the physical address from a virtual address.
> > > > >        /proc/<qemu_pid>/pagemap provides information of mapping from
> > > > >        VA to PA. Is it an acceptable solution to let QEMU parse this
> > > > >        file to get the physical address?
> > > > 
> > > > Does it work in a non-root scenario?
> > > >
> > > 
> > > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> > > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
> > > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> > > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> > > | Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
> 
> Ah right.
> > >
> > > A possible alternative is to add a new hypercall similar to
> > > XEN_DOMCTL_memory_mapping but receiving virtual address as the address
> > > parameter and translating to machine address in the hypervisor.
> > 
> > That might work.
> 
> That won't work.
> 
> This is a userspace VMA - which means the once the ioctl is done we swap
> to kernel virtual addresses. Now we may know that the prior cr3 has the
> userspace virtual address and walk it down - but what if the domain
> that is doing this is PVH? (or HVM) - the cr3 of userspace is tucked somewhere
> inside the kernel.
> 
> Which means this hypercall would need to know the Linux kernel task structure
> to find this.
> 
> May I propose another solution - an stacking driver (similar to loop). You
> setup it up (ioctl /dev/pmem0/guest.img, get some /dev/mapper/guest.img created).
> Then mmap the /dev/mapper/guest.img - all of the operations are the same - except
> it may have an extra ioctl - get_pfns - which would provide the data in similar
> form to pagemap.txt.
>

This stacking driver approach seems to still need privileged permissions
and more modifications in the kernel, so ...

> But folks will then ask - why don't you just use pagemap? Could the pagemap
> have an extra security capability check? One that can be set for
> QEMU?
>

... I would like to use pagemap and mlock.

Haozhong

> > 
> > 
> > > > >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> > > > >        occupied by pmem at the beginning, i.e. QEMU may not be able to
> > > > >        get all SPA of pmem from buf (in virtual address space) when
> > > > >        calling XEN_DOMCTL_memory_mapping.
> > > > >        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> > > > >        entire pmem being mmaped?
> > > > 
> > > > Ditto
> > > >
> > > 
> > > No. If I take the above alternative for the first open, maybe the new
> > > hypercall above can inject page faults into dom0 for the unmapped
> > > virtual address so as to enforce dom0 Linux to create the page
> > > mapping.
> 
> Ugh. That sounds hacky. And you wouldn't neccessarily be safe.
> Imagine that the system admin decides to defrag the /dev/pmem filesystem.
> Or move the files (disk images) around. If they do that - we may
> still have the guest mapped to system addresses which may contain filesystem
> metadata now, or a different guest image. We MUST mlock or lock the file
> during the duration of the guest.
> 
> 


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-15  8:43   ` Haozhong Zhang
@ 2016-02-15 11:07     ` Jan Beulich
  2016-02-17  9:01       ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-02-15 11:07 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 15.02.16 at 09:43, <haozhong.zhang@intel.com> wrote:
> On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote:
>> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
>> >  three parts:
>> >  (1) Guest clwb/clflushopt/pcommit enabling,
>> >  (2) Memory mapping, and
>> >  (3) Guest ACPI emulation.
>> 
>> 
>> .. MCE? and vMCE?
>> 
> 
> NVDIMM can generate UCR errors like normal ram. Xen may handle them in a
> way similar to what mc_memerr_dhandler() does, with some differences in
> the data structure and the broken page offline parts:
> 
> Broken NVDIMM pages should be marked as "offlined" so that Xen
> hypervisor can refuse further requests that map them to DomU.
> 
> The real problem here is what data structure will be used to record
> information of NVDIMM pages. Because the size of NVDIMM is usually much
> larger than normal ram, using struct page_info for NVDIMM pages would
> occupy too much memory.

I don't see how your alternative below would be less memory
hungry: Since guests have at least partial control of their GFN
space, a malicious guest could punch holes into the contiguous
GFN range that you appear to be thinking about, thus causing
arbitrary splitting of the control structure.

Also - see how you all of a sudden came to think of using
struct page_info here (implying hypervisor control of these
NVDIMM ranges)?

> (4) When a MCE for host NVDIMM SPA range [start_mfn, end_mfn] happens,
>   (a) search xen_nvdimm_pages_list for affected nvdimm_pages structures,
>   (b) for each affected nvdimm_pages, if it belongs to a domain d and
>       its broken field is already set, the domain d will be shutdown to
>       prevent malicious guest accessing broken page (similarly to what
>       offline_page() does).
>   (c) for each affected nvdimm_pages, set its broken field to 1, and
>   (d) for each affected nvdimm_pages, inject to domain d a vMCE that
>       covers its GFN range if that nvdimm_pages belongs to domain d.

I don't see why you'd want to mark the entire range bad: All
that's known to be broken is a single page. Hence this would be
another source of splits of the proposed control structures.

Jan


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-15  3:16               ` Zhang, Haozhong
@ 2016-02-16 11:14                 ` Stefano Stabellini
  2016-02-16 12:55                   ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Stefano Stabellini @ 2016-02-16 11:14 UTC (permalink / raw)
  To: Zhang, Haozhong
  Cc: Juergen Gross, Tian, Kevin, Keir Fraser, Ian Campbell,
	George Dunlap, Andrew Cooper, Stefano Stabellini, Ian Jackson,
	George Dunlap, Xiao Guangrong, xen-devel, Jan Beulich, Nakajima,
	Jun, Wei Liu

On Mon, 15 Feb 2016, Zhang, Haozhong wrote:
> On 02/04/16 20:24, Stefano Stabellini wrote:
> > On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> > > On 02/03/16 15:22, Stefano Stabellini wrote:
> > > > On Wed, 3 Feb 2016, George Dunlap wrote:
> > > > > On 03/02/16 12:02, Stefano Stabellini wrote:
> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > > > > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> > > > > >> those files to QEMU. In this way, non-root QEMU should be able to
> > > > > >> mmap those files.
> > > > > >
> > > > > > Maybe that would work. Worth adding it to the design, I would like to
> > > > > > read more details on it.
> > > > > >
> > > > > > Also note that QEMU initially runs as root but drops privileges to
> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > > > > > *could* mmap /dev/pmem0 while is still running as root, but then it
> > > > > > wouldn't work for any devices that need to be mmap'ed at run time
> > > > > > (hotplug scenario).
> > > > >
> > > > > This is basically the same problem we have for a bunch of other things,
> > > > > right?  Having xl open a file and then pass it via qmp to qemu should
> > > > > work in theory, right?
> > > >
> > > > Is there one /dev/pmem? per assignable region?
> > > 
> > > Yes.
> > > 
> > > BTW, I'm wondering whether and how non-root qemu works with xl disk
> > > configuration that is going to access a host block device, e.g.
> > >      disk = [ '/dev/sdb,,hda' ]
> > > If that works with non-root qemu, I may take the similar solution for
> > > pmem.
> >  
> > Today the user is required to give the correct ownership and access mode
> > to the block device, so that non-root QEMU can open it. However in the
> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
> > the feature doesn't work at all with non-root QEMU
> > (http://marc.info/?l=xen-devel&m=145261763600528).
> > 
> > If there is one /dev/pmem device per assignable region, then it would be
> > conceivable to change its ownership so that non-root QEMU can open it.
> > Or, better, the file descriptor could be passed by the toolstack via
> > qmp.
> 
> Passing file descriptor via qmp is not enough.
> 
> Let me clarify where the requirement for root/privileged permissions
> comes from. The primary workflow in my design that maps a host pmem
> region or files in host pmem region to guest is shown as below:
>  (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on
>      /dev/pmem0) to its virtual address space, i.e. the guest virtual
>      address space.
>  (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA
>      occupied by the host pmem to a DomU. This step requires the
>      translation from the guest virtual address (where the host pmem is
>      mmaped in (1)) to the host physical address. The translation can be
>      done by either
>     (a) QEMU that parses its own /proc/self/pagemap,
>      or
>     (b) Xen hypervisor that does the translation by itself [1] (though
>         this choice is not quite doable from Konrad's comments [2]).
> 
> [1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html
> [2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html
> 
> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability
> since linux kernel 4.0. Furthermore, if we don't mlock the mapped host
> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap),
> pagemap will not contain all mappings. However, mlock may require
> privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
> mlock operates on memory, the permission to open(2) the host pmem files
> does not solve the problem and therefore passing file descriptor via qmp
> does not help.
> 
> For 2-b, from Konrad's comments [2], mlock is also required and
> privileged permission may be required consequently.
> 
> Note that the mapping and the address translation are done before QEMU
> dropping privileged permissions, so non-root QEMU should be able to work
> with above design until we start considering vNVDIMM hotplug (which has
> not been supported by the current vNVDIMM implementation in QEMU). In
> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
> running with root permissions.

Are we all good with the fact that vNVDIMM hotplug won't work (unless
the user explicitly asks for it at domain creation time, which is
very unlikely, as otherwise she could just use coldplug)?

If so, the design is OK for me.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-16 11:14                 ` Stefano Stabellini
@ 2016-02-16 12:55                   ` Jan Beulich
  2016-02-17  9:03                     ` Haozhong Zhang
  2016-03-04  7:30                     ` Haozhong Zhang
  0 siblings, 2 replies; 121+ messages in thread
From: Jan Beulich @ 2016-02-16 12:55 UTC (permalink / raw)
  To: Stefano Stabellini, Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell, George Dunlap,
	Andrew Cooper, IanJackson, George Dunlap, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 16.02.16 at 12:14, <stefano.stabellini@eu.citrix.com> wrote:
> On Mon, 15 Feb 2016, Zhang, Haozhong wrote:
>> On 02/04/16 20:24, Stefano Stabellini wrote:
>> > On Thu, 4 Feb 2016, Haozhong Zhang wrote:
>> > > On 02/03/16 15:22, Stefano Stabellini wrote:
>> > > > On Wed, 3 Feb 2016, George Dunlap wrote:
>> > > > > On 03/02/16 12:02, Stefano Stabellini wrote:
>> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
>> > > > > >> Or, we can make a file system on /dev/pmem0, create files on it, set
>> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
>> > > > > >> those files to QEMU. In this way, non-root QEMU should be able to
>> > > > > >> mmap those files.
>> > > > > >
>> > > > > > Maybe that would work. Worth adding it to the design, I would like to
>> > > > > > read more details on it.
>> > > > > >
>> > > > > > Also note that QEMU initially runs as root but drops privileges to
>> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
>> > > > > > *could* mmap /dev/pmem0 while is still running as root, but then it
>> > > > > > wouldn't work for any devices that need to be mmap'ed at run time
>> > > > > > (hotplug scenario).
>> > > > >
>> > > > > This is basically the same problem we have for a bunch of other things,
>> > > > > right?  Having xl open a file and then pass it via qmp to qemu should
>> > > > > work in theory, right?
>> > > >
>> > > > Is there one /dev/pmem? per assignable region?
>> > > 
>> > > Yes.
>> > > 
>> > > BTW, I'm wondering whether and how non-root qemu works with xl disk
>> > > configuration that is going to access a host block device, e.g.
>> > >      disk = [ '/dev/sdb,,hda' ]
>> > > If that works with non-root qemu, I may take the similar solution for
>> > > pmem.
>> >  
>> > Today the user is required to give the correct ownership and access mode
>> > to the block device, so that non-root QEMU can open it. However in the
>> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
>> > the feature doesn't work at all with non-root QEMU
>> > (http://marc.info/?l=xen-devel&m=145261763600528).
>> > 
>> > If there is one /dev/pmem device per assignable region, then it would be
>> > conceivable to change its ownership so that non-root QEMU can open it.
>> > Or, better, the file descriptor could be passed by the toolstack via
>> > qmp.
>> 
>> Passing file descriptor via qmp is not enough.
>> 
>> Let me clarify where the requirement for root/privileged permissions
>> comes from. The primary workflow in my design that maps a host pmem
>> region or files in host pmem region to guest is shown as below:
>>  (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on
>>      /dev/pmem0) to its virtual address space, i.e. the guest virtual
>>      address space.
>>  (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA
>>      occupied by the host pmem to a DomU. This step requires the
>>      translation from the guest virtual address (where the host pmem is
>>      mmaped in (1)) to the host physical address. The translation can be
>>      done by either
>>     (a) QEMU that parses its own /proc/self/pagemap,
>>      or
>>     (b) Xen hypervisor that does the translation by itself [1] (though
>>         this choice is not quite doable from Konrad's comments [2]).
>> 
>> [1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html 
>> [2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html 
>> 
>> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability
>> since linux kernel 4.0. Furthermore, if we don't mlock the mapped host
>> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap),
>> pagemap will not contain all mappings. However, mlock may require
>> privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
>> mlock operates on memory, the permission to open(2) the host pmem files
>> does not solve the problem and therefore passing file descriptor via qmp
>> does not help.
>> 
>> For 2-b, from Konrad's comments [2], mlock is also required and
>> privileged permission may be required consequently.
>> 
>> Note that the mapping and the address translation are done before QEMU
>> dropping privileged permissions, so non-root QEMU should be able to work
>> with above design until we start considering vNVDIMM hotplug (which has
>> not been supported by the current vNVDIMM implementation in QEMU). In
>> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
>> running with root permissions.
> 
> Are we all good with the fact that vNVDIMM hotplug won't work (unless
> the user explicitly asks for it at domain creation time, which is
> very unlikely otherwise she could use coldplug)?

No, at least there needs to be a road towards hotplug, even if
initially this may not be supported/implemented.

Jan


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-15 11:07     ` Jan Beulich
@ 2016-02-17  9:01       ` Haozhong Zhang
  2016-02-17  9:08         ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-17  9:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/15/16 04:07, Jan Beulich wrote:
> >>> On 15.02.16 at 09:43, <haozhong.zhang@intel.com> wrote:
> > On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote:
> >> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
> >> >  three parts:
> >> >  (1) Guest clwb/clflushopt/pcommit enabling,
> >> >  (2) Memory mapping, and
> >> >  (3) Guest ACPI emulation.
> >> 
> >> 
> >> .. MCE? and vMCE?
> >> 
> > 
> > NVDIMM can generate UCR errors like normal ram. Xen may handle them in a
> > way similar to what mc_memerr_dhandler() does, with some differences in
> > the data structure and the broken page offline parts:
> > 
> > Broken NVDIMM pages should be marked as "offlined" so that Xen
> > hypervisor can refuse further requests that map them to DomU.
> > 
> > The real problem here is what data structure will be used to record
> > information of NVDIMM pages. Because the size of NVDIMM is usually much
> > larger than normal ram, using struct page_info for NVDIMM pages would
> > occupy too much memory.
> 
> I don't see how your alternative below would be less memory
> hungry: Since guests have at least partial control of their GFN
> space, a malicious guest could punch holes into the contiguous
> GFN range that you appear to be thinking about, thus causing
> arbitrary splitting of the control structure.
>

QEMU would always place the vNVDIMM at GFNs above guest normal ram and the
I/O holes. It would attempt to search that space for a contiguous range
that is large enough for the vNVDIMM devices. Is the guest able to
punch holes in such a GFN range?

> Also - see how you all of the sudden came to think of using
> struct page_info here (implying hypervisor control of these
> NVDIMM ranges)?
>
> > (4) When a MCE for host NVDIMM SPA range [start_mfn, end_mfn] happens,
> >   (a) search xen_nvdimm_pages_list for affected nvdimm_pages structures,
> >   (b) for each affected nvdimm_pages, if it belongs to a domain d and
> >       its broken field is already set, the domain d will be shutdown to
> >       prevent malicious guest accessing broken page (similarly to what
> >       offline_page() does).
> >   (c) for each affected nvdimm_pages, set its broken field to 1, and
> >   (d) for each affected nvdimm_pages, inject to domain d a vMCE that
> >       covers its GFN range if that nvdimm_pages belongs to domain d.
> 
> I don't see why you'd want to mark the entire range bad: All
> that's known to be broken is a single page. Hence this would be
> another source of splits of the proposed control structures.
>

Oh yes, I should split the range here rather than mark it entirely bad.
Such splits are caused by hardware errors; unless the host NVDIMM is
terribly broken, there should not be a large number of splits.

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-16 12:55                   ` Jan Beulich
@ 2016-02-17  9:03                     ` Haozhong Zhang
  2016-03-04  7:30                     ` Haozhong Zhang
  1 sibling, 0 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-17  9:03 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, Jan Beulich, Jun Nakajima, Keir Fraser

On 02/16/16 05:55, Jan Beulich wrote:
> >>> On 16.02.16 at 12:14, <stefano.stabellini@eu.citrix.com> wrote:
> > On Mon, 15 Feb 2016, Zhang, Haozhong wrote:
> >> On 02/04/16 20:24, Stefano Stabellini wrote:
> >> > On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> >> > > On 02/03/16 15:22, Stefano Stabellini wrote:
> >> > > > On Wed, 3 Feb 2016, George Dunlap wrote:
> >> > > > > On 03/02/16 12:02, Stefano Stabellini wrote:
> >> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> >> > > > > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> >> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> >> > > > > >> those files to QEMU. In this way, non-root QEMU should be able to
> >> > > > > >> mmap those files.
> >> > > > > >
> >> > > > > > Maybe that would work. Worth adding it to the design, I would like to
> >> > > > > > read more details on it.
> >> > > > > >
> >> > > > > > Also note that QEMU initially runs as root but drops privileges to
> >> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> >> > > > > > *could* mmap /dev/pmem0 while is still running as root, but then it
> >> > > > > > wouldn't work for any devices that need to be mmap'ed at run time
> >> > > > > > (hotplug scenario).
> >> > > > >
> >> > > > > This is basically the same problem we have for a bunch of other things,
> >> > > > > right?  Having xl open a file and then pass it via qmp to qemu should
> >> > > > > work in theory, right?
> >> > > >
> >> > > > Is there one /dev/pmem? per assignable region?
> >> > > 
> >> > > Yes.
> >> > > 
> >> > > BTW, I'm wondering whether and how non-root qemu works with xl disk
> >> > > configuration that is going to access a host block device, e.g.
> >> > >      disk = [ '/dev/sdb,,hda' ]
> >> > > If that works with non-root qemu, I may take the similar solution for
> >> > > pmem.
> >> >  
> >> > Today the user is required to give the correct ownership and access mode
> >> > to the block device, so that non-root QEMU can open it. However in the
> >> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
> >> > the feature doesn't work at all with non-root QEMU
> >> > (http://marc.info/?l=xen-devel&m=145261763600528).
> >> > 
> >> > If there is one /dev/pmem device per assignable region, then it would be
> >> > conceivable to change its ownership so that non-root QEMU can open it.
> >> > Or, better, the file descriptor could be passed by the toolstack via
> >> > qmp.
> >> 
> >> Passing file descriptor via qmp is not enough.
> >> 
> >> Let me clarify where the requirement for root/privileged permissions
> >> comes from. The primary workflow in my design that maps a host pmem
> >> region or files in host pmem region to guest is shown as below:
> >>  (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on
> >>      /dev/pmem0) to its virtual address space, i.e. the guest virtual
> >>      address space.
> >>  (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA
> >>      occupied by the host pmem to a DomU. This step requires the
> >>      translation from the guest virtual address (where the host pmem is
> >>      mmaped in (1)) to the host physical address. The translation can be
> >>      done by either
> >>     (a) QEMU that parses its own /proc/self/pagemap,
> >>      or
> >>     (b) Xen hypervisor that does the translation by itself [1] (though
> >>         this choice is not quite doable from Konrad's comments [2]).
> >> 
> >> [1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html 
> >> [2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html 
> >> 
> >> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability
> >> since linux kernel 4.0. Furthermore, if we don't mlock the mapped host
> >> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap),
> >> pagemap will not contain all mappings. However, mlock may require
> >> privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
> >> mlock operates on memory, the permission to open(2) the host pmem files
> >> does not solve the problem and therefore passing file descriptor via qmp
> >> does not help.
> >> 
> >> For 2-b, from Konrad's comments [2], mlock is also required and
> >> privileged permission may be required consequently.
> >> 
> >> Note that the mapping and the address translation are done before QEMU
> >> dropping privileged permissions, so non-root QEMU should be able to work
> >> with above design until we start considering vNVDIMM hotplug (which has
> >> not been supported by the current vNVDIMM implementation in QEMU). In
> >> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
> >> running with root permissions.
> > 
> > Are we all good with the fact that vNVDIMM hotplug won't work (unless
> > the user explicitly asks for it at domain creation time, which is
> > very unlikely otherwise she could use coldplug)?
> 
> No, at least there needs to be a road towards hotplug, even if
> initially this may not be supported/implemented.

Guangrong: any plan or design for vNVDIMM hotplug in QEMU?

Haozhong

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-17  9:01       ` Haozhong Zhang
@ 2016-02-17  9:08         ` Jan Beulich
  2016-02-18  7:42           ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-02-17  9:08 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 17.02.16 at 10:01, <haozhong.zhang@intel.com> wrote:
> On 02/15/16 04:07, Jan Beulich wrote:
>> >>> On 15.02.16 at 09:43, <haozhong.zhang@intel.com> wrote:
>> > On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote:
>> >> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
>> >> >  three parts:
>> >> >  (1) Guest clwb/clflushopt/pcommit enabling,
>> >> >  (2) Memory mapping, and
>> >> >  (3) Guest ACPI emulation.
>> >> 
>> >> 
>> >> .. MCE? and vMCE?
>> >> 
>> > 
>> > NVDIMM can generate UCR errors like normal ram. Xen may handle them in a
>> > way similar to what mc_memerr_dhandler() does, with some differences in
>> > the data structure and the broken page offline parts:
>> > 
>> > Broken NVDIMM pages should be marked as "offlined" so that Xen
>> > hypervisor can refuse further requests that map them to DomU.
>> > 
>> > The real problem here is what data structure will be used to record
>> > information of NVDIMM pages. Because the size of NVDIMM is usually much
>> > larger than normal ram, using struct page_info for NVDIMM pages would
>> > occupy too much memory.
>> 
>> I don't see how your alternative below would be less memory
>> hungry: Since guests have at least partial control of their GFN
>> space, a malicious guest could punch holes into the contiguous
>> GFN range that you appear to be thinking about, thus causing
>> arbitrary splitting of the control structure.
>>
> 
> QEMU would always use MFN above guest normal ram and I/O holes for
> vNVDIMM. It would attempt to search in that space for a contiguous range
> that is large enough for that that vNVDIMM devices. Is guest able to
> punch holes in such GFN space?

See XENMAPSPACE_* and their uses.

Jan

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-17  9:08         ` Jan Beulich
@ 2016-02-18  7:42           ` Haozhong Zhang
  2016-02-19  2:14             ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-18  7:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/17/16 02:08, Jan Beulich wrote:
> >>> On 17.02.16 at 10:01, <haozhong.zhang@intel.com> wrote:
> > On 02/15/16 04:07, Jan Beulich wrote:
> >> >>> On 15.02.16 at 09:43, <haozhong.zhang@intel.com> wrote:
> >> > On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote:
> >> >> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
> >> >> >  three parts:
> >> >> >  (1) Guest clwb/clflushopt/pcommit enabling,
> >> >> >  (2) Memory mapping, and
> >> >> >  (3) Guest ACPI emulation.
> >> >> 
> >> >> 
> >> >> .. MCE? and vMCE?
> >> >> 
> >> > 
> >> > NVDIMM can generate UCR errors like normal ram. Xen may handle them in a
> >> > way similar to what mc_memerr_dhandler() does, with some differences in
> >> > the data structure and the broken page offline parts:
> >> > 
> >> > Broken NVDIMM pages should be marked as "offlined" so that Xen
> >> > hypervisor can refuse further requests that map them to DomU.
> >> > 
> >> > The real problem here is what data structure will be used to record
> >> > information of NVDIMM pages. Because the size of NVDIMM is usually much
> >> > larger than normal ram, using struct page_info for NVDIMM pages would
> >> > occupy too much memory.
> >> 
> >> I don't see how your alternative below would be less memory
> >> hungry: Since guests have at least partial control of their GFN
> >> space, a malicious guest could punch holes into the contiguous
> >> GFN range that you appear to be thinking about, thus causing
> >> arbitrary splitting of the control structure.
> >>
> > 
> > QEMU would always use MFN above guest normal ram and I/O holes for
> > vNVDIMM. It would attempt to search in that space for a contiguous range
> > that is large enough for that that vNVDIMM devices. Is guest able to
> > punch holes in such GFN space?
> 
> See XENMAPSPACE_* and their uses.
> 

I think we can add the following restrictions to prevent uses of
XENMAPSPACE_* from punching holes in the GFNs occupied by vNVDIMM (see
the sketch after the list):

(1) For XENMAPSPACE_shared_info and _grant_table, never map idx in them
    to GFNs occupied by vNVDIMM.

(2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
   (a) never map idx in them to GFNs occupied by vNVDIMM, and
   (b) never map idx corresponding to GFNs occupied by vNVDIMM.
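
A minimal sketch of such a guard is below. gfn_is_vnvdimm() is a
hypothetical helper backed by whatever structure Xen ends up using to
track vNVDIMM GFN ranges; apart from the XENMAPSPACE_* constants this is
not existing Xen code, just an illustration of the two checks:

#include <xen/errno.h>
#include <xen/sched.h>          /* struct domain */
#include <public/memory.h>      /* XENMAPSPACE_* */

/* Hypothetical helper -- not an existing Xen function. */
static bool gfn_is_vnvdimm(const struct domain *d, unsigned long gfn);

static int vnvdimm_check_add_to_physmap(struct domain *d, unsigned int space,
                                        unsigned long idx, unsigned long gfn)
{
    /* (1) and (2)(a): never map anything over a GFN occupied by vNVDIMM. */
    if ( gfn_is_vnvdimm(d, gfn) )
        return -EINVAL;

    switch ( space )
    {
    case XENMAPSPACE_gmfn:
    case XENMAPSPACE_gmfn_range:
        /* (2)(b): idx names an existing GFN whose backing page would be
         * remapped; refuse if that GFN belongs to vNVDIMM. */
        if ( gfn_is_vnvdimm(d, idx) )
            return -EINVAL;
        break;

    case XENMAPSPACE_gmfn_foreign:
        /* Here idx is a GFN of the foreign domain, so the same check
         * would be applied against that domain instead of d. */
        break;
    }

    return 0;
}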


Haozhong

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-01  5:44 [RFC Design Doc] Add vNVDIMM support for Xen Haozhong Zhang
                   ` (3 preceding siblings ...)
  2016-02-02 19:15 ` Konrad Rzeszutek Wilk
@ 2016-02-18 17:17 ` Jan Beulich
  2016-02-24 13:28   ` Haozhong Zhang
  4 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-02-18 17:17 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 01.02.16 at 06:44, <haozhong.zhang@intel.com> wrote:
>  This design treats host NVDIMM devices as ordinary MMIO devices:

Wrt the cachability note earlier on, I assume you're aware that with
the XSA-154 changes we disallow any cachable mappings of MMIO
by default.

>  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
>      and drive host NVDIMM devices (implementing block device
>      interface). Namespaces and file systems on host NVDIMM devices
>      are handled by Dom0 Linux as well.
> 
>  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
>      virtual address space (buf).
> 
>  (3) QEMU gets the host physical address of buf, i.e. the host system
>      physical address that is occupied by /dev/pmem0, and calls Xen
>      hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
> 
>  (ACPI part is described in Section 3.3 later)
> 
>  Above (1)(2) have already been done in current QEMU. Only (3) is
>  needed to implement in QEMU. No change is needed in Xen for address
>  mapping in this design.
> 
>  Open: It seems no system call/ioctl is provided by Linux kernel to
>        get the physical address from a virtual address.
>        /proc/<qemu_pid>/pagemap provides information of mapping from
>        VA to PA. Is it an acceptable solution to let QEMU parse this
>        file to get the physical address?
> 
>  Open: For a large pmem, mmap(2) is very possible to not map all SPA
>        occupied by pmem at the beginning, i.e. QEMU may not be able to
>        get all SPA of pmem from buf (in virtual address space) when
>        calling XEN_DOMCTL_memory_mapping.
>        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
>        entire pmem being mmaped?

A fundamental question I have here is: Why does qemu need to
map this at all? It shouldn't itself need to access those ranges,
since the guest is given direct access. It would seem quite a bit
more natural if qemu simply inquired to underlying GFN range(s)
and handed those to Xen for translation to MFNs and mapping
into guest space.

>  I notice that current XEN_DOMCTL_memory_mapping does not make santiy
>  check for the physical address and size passed from caller
>  (QEMU). Can QEMU be always trusted? If not, we would need to make Xen
>  aware of the SPA range of pmem so that it can refuse map physical
>  address in neither the normal ram nor pmem.

I'm not sure what missing sanity checks this is about: The handling
involves two iomem_access_permitted() calls.

> 3.3 Guest ACPI Emulation
> 
> 3.3.1 My Design
> 
>  Guest ACPI emulation is composed of two parts: building guest NFIT
>  and SSDT that defines ACPI namespace devices for NVDIMM, and
>  emulating guest _DSM.
> 
>  (1) Building Guest ACPI Tables
> 
>   This design reuses and extends hvmloader's existing mechanism that
>   loads passthrough ACPI tables from binary files to load NFIT and
>   SSDT tables built by QEMU:
>   1) Because the current QEMU does not building any ACPI tables when
>      it runs as the Xen device model, this design needs to patch QEMU
>      to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
> 
>   2) QEMU copies NFIT and SSDT to the end of guest memory below
>      4G. The guest address and size of those tables are written into
>      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
> 
>   3) hvmloader is patched to probe and load device model passthrough
>      ACPI tables from above xenstore keys. The detected ACPI tables
>      are then appended to the end of existing guest ACPI tables just
>      like what current construct_passthrough_tables() does.
> 
>   Reasons for this design are listed below:
>   - NFIT and SSDT in question are quite self-contained, i.e. they do
>     not refer to other ACPI tables and not conflict with existing
>     guest ACPI tables in Xen. Therefore, it is safe to copy them from
>     QEMU and append to existing guest ACPI tables.

How is this not conflicting being guaranteed? In particular I don't
see how tables containing AML code and coming from different
sources won't possibly cause ACPI name space collisions.

> 3.3.3 Alternative Design 2: keeping in Xen
> 
>  Alternative to switching to QEMU, another design would be building
>  NFIT and SSDT in hvmloader or toolstack.
> 
>  The amount and parameters of sub-structures in guest NFIT vary
>  according to different vNVDIMM configurations that can not be decided
>  at compile-time. In contrast, current hvmloader and toolstack can
>  only build static ACPI tables, i.e. their contents are decided
>  statically at compile-time and independent from the guest
>  configuration. In order to build guest NFIT at runtime, this design
>  may take following steps:
>  (1) xl converts NVDIMM configurations in xl.cfg to corresponding QEMU
>      options,
> 
>  (2) QEMU accepts above options, figures out the start SPA range
>      address/size/NVDIMM device handles/..., and writes them in
>      xenstore. No ACPI table is built by QEMU.
> 
>  (3) Either xl or hvmloader reads above parameters from xenstore and
>      builds the NFIT table.
> 
>  For guest SSDT, it would take more work. The ACPI namespace devices
>  are defined in SSDT by AML, so an AML builder would be needed to
>  generate those definitions at runtime.

I'm not sure this last half sentence is true: We do some dynamic
initialization of the pre-generated DSDT already, using the runtime
populated block at ACPI_INFO_PHYSICAL_ADDRESS.

Jan

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-18  7:42           ` Haozhong Zhang
@ 2016-02-19  2:14             ` Konrad Rzeszutek Wilk
  2016-03-01  7:39               ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-02-19  2:14 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, Ian Campbell, Wei Liu, George Dunlap,
	Ian Jackson, Stefano Stabellini, Jun Nakajima, Kevin Tian,
	Xiao Guangrong, xen-devel, Juergen Gross, Keir Fraser

> > > QEMU would always use MFN above guest normal ram and I/O holes for
> > > vNVDIMM. It would attempt to search in that space for a contiguous range
> > > that is large enough for that that vNVDIMM devices. Is guest able to
> > > punch holes in such GFN space?
> > 
> > See XENMAPSPACE_* and their uses.
> > 
> 
> I think we can add following restrictions to avoid uses of XENMAPSPACE_*
> punching holes in GFNs of vNVDIMM:
> 
> (1) For XENMAPSPACE_shared_info and _grant_table, never map idx in them
>     to GFNs occupied by vNVDIMM.

OK, that sounds correct.
>     
> (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
>    (a) never map idx in them to GFNs occupied by vNVDIMM, and
>    (b) never map idx corresponding to GFNs occupied by vNVDIMM

Would that mean that guest xen-blkback or xen-netback wouldn't
be able to fetch data from those GFNs? As in, what if the HVM guest
that has the NVDIMM also serves as a device domain - that is, it
has xen-blkback running to service other guests?

> 
> 
> Haozhong

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-18 17:17 ` Jan Beulich
@ 2016-02-24 13:28   ` Haozhong Zhang
  2016-02-24 14:00     ` Ross Philipson
  2016-02-24 14:24     ` Jan Beulich
  0 siblings, 2 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-24 13:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/18/16 10:17, Jan Beulich wrote:
> >>> On 01.02.16 at 06:44, <haozhong.zhang@intel.com> wrote:
> >  This design treats host NVDIMM devices as ordinary MMIO devices:
> 
> Wrt the cachability note earlier on, I assume you're aware that with
> the XSA-154 changes we disallow any cachable mappings of MMIO
> by default.
>

EPT entries that map the host NVDIMM SPAs to the guest will be the only
mapping used for the NVDIMM. If the memory type in the last-level entries
is always set to the same type reported by the NFIT and the ipat bit is
always set as well, I think it would not suffer from the cache-type
inconsistency problem in XSA-154?
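
To make that concrete, here is a small sketch based only on the
architectural EPT leaf-entry layout (bits 2:0 R/W/X, bits 5:3 memory
type, bit 6 ignore-PAT); it deliberately avoids Xen's actual ept_entry_t
code and is only meant to illustrate the idea:

#include <stdint.h>

#define EPT_MT_SHIFT   3           /* bits 5:3 -- EPT memory type */
#define EPT_IPAT       (1ULL << 6) /* bit 6 -- ignore guest PAT */

/* Sketch: compose the low bits of an EPT leaf entry for a vNVDIMM page.
 * nfit_type is the memory type the host NFIT reports for this SPA range
 * (e.g. 6 for WB); with IPAT set, the guest PAT cannot override it, so
 * the effective type is always the NFIT-reported one. */
static inline uint64_t vnvdimm_ept_low_bits(uint64_t nfit_type)
{
    return 0x7ULL /* R|W|X */ | (nfit_type << EPT_MT_SHIFT) | EPT_IPAT;
}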

> >  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
> >      and drive host NVDIMM devices (implementing block device
> >      interface). Namespaces and file systems on host NVDIMM devices
> >      are handled by Dom0 Linux as well.
> > 
> >  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
> >      virtual address space (buf).
> > 
> >  (3) QEMU gets the host physical address of buf, i.e. the host system
> >      physical address that is occupied by /dev/pmem0, and calls Xen
> >      hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
> > 
> >  (ACPI part is described in Section 3.3 later)
> > 
> >  Above (1)(2) have already been done in current QEMU. Only (3) is
> >  needed to implement in QEMU. No change is needed in Xen for address
> >  mapping in this design.
> > 
> >  Open: It seems no system call/ioctl is provided by Linux kernel to
> >        get the physical address from a virtual address.
> >        /proc/<qemu_pid>/pagemap provides information of mapping from
> >        VA to PA. Is it an acceptable solution to let QEMU parse this
> >        file to get the physical address?
> > 
> >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> >        occupied by pmem at the beginning, i.e. QEMU may not be able to
> >        get all SPA of pmem from buf (in virtual address space) when
> >        calling XEN_DOMCTL_memory_mapping.
> >        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> >        entire pmem being mmaped?
> 
> A fundamental question I have here is: Why does qemu need to
> map this at all? It shouldn't itself need to access those ranges,
> since the guest is given direct access. It would seem quite a bit
> more natural if qemu simply inquired to underlying GFN range(s)
> and handed those to Xen for translation to MFNs and mapping
> into guest space.
>

The above design is more like a hack on the existing QEMU
implementation for KVM that avoids modifying the Dom0 kernel.

Maybe it's better to let QEMU get the SPAs from the Dom0 kernel (NVDIMM
driver) and then pass them to Xen for the address mapping (a rough
sketch of this flow follows the list):
(1) QEMU passes the fd of /dev/pmemN, or of a file on /dev/pmemN, to the
    Dom0 kernel.
(2) The Dom0 kernel finds and returns all SPAs occupied by /dev/pmemN, or
    by the portions of /dev/pmemN where the file sits.
(3) QEMU passes the above SPAs, together with the GMFNs where they should
    be mapped, to Xen, which then maps the SPAs to those GMFNs.
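
A rough QEMU-side sketch of this flow: the ioctl (PMEM_GET_EXTENTS,
struct pmem_extent) and the wrapper xen_map_spa_to_gmfn() are purely
hypothetical placeholders for interfaces that would still have to be
defined; only the shape of the flow is the point here:

#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical step (1)/(2) interface: ask the Dom0 NVDIMM driver for
 * the SPA extents backing the fd.  Neither the ioctl number nor the
 * struct exists today. */
struct pmem_extent {
    uint64_t spa;   /* host system physical address of the extent */
    uint64_t len;   /* extent length in bytes */
};
#define PMEM_GET_EXTENTS 0        /* placeholder ioctl number */

/* Hypothetical wrapper around the eventual mapping hypercall
 * (e.g. XEN_DOMCTL_memory_mapping); frame numbers are 4K frames. */
int xen_map_spa_to_gmfn(uint32_t domid, uint64_t gmfn,
                        uint64_t mfn, uint64_t nr_frames);

static int map_pmem_to_guest(int pmem_fd, uint32_t domid, uint64_t gmfn)
{
    struct pmem_extent ext[16];
    int nr, i, rc;

    nr = ioctl(pmem_fd, PMEM_GET_EXTENTS, ext);     /* step (2) */
    if (nr < 0)
        return nr;

    for (i = 0; i < nr; i++) {                      /* step (3) */
        rc = xen_map_spa_to_gmfn(domid, gmfn,
                                 ext[i].spa >> 12, ext[i].len >> 12);
        if (rc < 0)
            return rc;
        gmfn += ext[i].len >> 12;
    }
    return 0;
}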

> >  I notice that current XEN_DOMCTL_memory_mapping does not make santiy
> >  check for the physical address and size passed from caller
> >  (QEMU). Can QEMU be always trusted? If not, we would need to make Xen
> >  aware of the SPA range of pmem so that it can refuse map physical
> >  address in neither the normal ram nor pmem.
> 
> I'm not sure what missing sanity checks this is about: The handling
> involves two iomem_access_permitted() calls.
>

Oh, I missed iomem_access_permitted(). It seems to be what I need,
provided the SPA ranges of the host NVDIMMs are added to Dom0's
iomem_caps initially.
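
For reference, the toolstack-side sequence used for other iomem
passthrough today looks roughly like the sketch below (libxc calls; the
frame numbers are placeholders, and this is not the final vNVDIMM code
path):

#include <xenctrl.h>

/* Sketch: grant the target domain access to the pmem MFN range and
 * establish the p2m mapping, mirroring what is done for iomem
 * passthrough today.  nvdimm_mfn, nr_mfns and gfn are placeholders that
 * would come from the Dom0 driver / domain builder. */
static int grant_and_map_pmem(xc_interface *xch, uint32_t domid,
                              unsigned long gfn, unsigned long nvdimm_mfn,
                              unsigned long nr_mfns)
{
    int rc = xc_domain_iomem_permission(xch, domid, nvdimm_mfn, nr_mfns, 1);

    if (rc)
        return rc;

    /* Backed by XEN_DOMCTL_memory_mapping, which performs the
     * iomem_access_permitted() checks mentioned above; 1 == add mapping. */
    return xc_domain_memory_mapping(xch, domid, gfn, nvdimm_mfn, nr_mfns, 1);
}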

> > 3.3 Guest ACPI Emulation
> > 
> > 3.3.1 My Design
> > 
> >  Guest ACPI emulation is composed of two parts: building guest NFIT
> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
> >  emulating guest _DSM.
> > 
> >  (1) Building Guest ACPI Tables
> > 
> >   This design reuses and extends hvmloader's existing mechanism that
> >   loads passthrough ACPI tables from binary files to load NFIT and
> >   SSDT tables built by QEMU:
> >   1) Because the current QEMU does not building any ACPI tables when
> >      it runs as the Xen device model, this design needs to patch QEMU
> >      to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
> > 
> >   2) QEMU copies NFIT and SSDT to the end of guest memory below
> >      4G. The guest address and size of those tables are written into
> >      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
> > 
> >   3) hvmloader is patched to probe and load device model passthrough
> >      ACPI tables from above xenstore keys. The detected ACPI tables
> >      are then appended to the end of existing guest ACPI tables just
> >      like what current construct_passthrough_tables() does.
> > 
> >   Reasons for this design are listed below:
> >   - NFIT and SSDT in question are quite self-contained, i.e. they do
> >     not refer to other ACPI tables and not conflict with existing
> >     guest ACPI tables in Xen. Therefore, it is safe to copy them from
> >     QEMU and append to existing guest ACPI tables.
> 
> How is this not conflicting being guaranteed? In particular I don't
> see how tables containing AML code and coming from different
> sources won't possibly cause ACPI name space collisions.
>

Really there is no effective mechanism to avoid ACPI name space
collisions (and other kinds of conflicts) between ACPI tables loaded
from QEMU and ACPI tables built by hvmloader. Because which ACPI tables
are loaded is determined by developers, IMO it's developers'
responsibility to avoid any collisions and conflicts with existing ACPI
tables.

> > 3.3.3 Alternative Design 2: keeping in Xen
> > 
> >  Alternative to switching to QEMU, another design would be building
> >  NFIT and SSDT in hvmloader or toolstack.
> >
> >  The amount and parameters of sub-structures in guest NFIT vary
> >  according to different vNVDIMM configurations that can not be decided
> >  at compile-time. In contrast, current hvmloader and toolstack can
> >  only build static ACPI tables, i.e. their contents are decided
> >  statically at compile-time and independent from the guest
> >  configuration. In order to build guest NFIT at runtime, this design
> >  may take following steps:
> >  (1) xl converts NVDIMM configurations in xl.cfg to corresponding QEMU
> >      options,
> > 
> >  (2) QEMU accepts above options, figures out the start SPA range
> >      address/size/NVDIMM device handles/..., and writes them in
> >      xenstore. No ACPI table is built by QEMU.
> > 
> >  (3) Either xl or hvmloader reads above parameters from xenstore and
> >      builds the NFIT table.
> > 
> >  For guest SSDT, it would take more work. The ACPI namespace devices
> >  are defined in SSDT by AML, so an AML builder would be needed to
> >  generate those definitions at runtime.
> 
> I'm not sure this last half sentence is true: We do some dynamic
> initialization of the pre-generated DSDT already, using the runtime
> populated block at ACPI_INFO_PHYSICAL_ADDRESS.
>

IIUC, if an ACPI namespace device in the guest DSDT needs to use a
parameter that is not available at build time, it can refer to the
corresponding field in the memory block at ACPI_INFO_PHYSICAL_ADDRESS
that is filled at runtime.

But it does not work for vNVDIMM. An example of NVDIMM AML code looks
like:

Scope (\_SB){
    Device (NVDR) // Root device for all NVDIMM devices
    {
        Name (_HID, "ACPI0012")
        Method (_STA) {...}
        Method (_FIT) {...}
        Method (_DSM, ...) {
            ...
        }

        Device (NVD0)     // 1st NVDIMM device
        {
            Name(_ADR, h) //where h is NFIT Device Handle for this NVDIMM
            Method (_DSM, ...) {
                ...
            }
        }

        Device (NVD1)     // 2nd NVDIMM device
        {
            Name(_ADR, h) //where h is NFIT Device Handle for this NVDIMM
            Method (_DSM, ...) {
                ...
            }
        }

        ...
   }
}

For each NVDIMM device defined in the NFIT, there is an ACPI namespace
device (e.g. NVD0, NVD1) under the root NVDIMM device (NVDR). Because
the number of vNVDIMM devices is unknown at build time, we cannot
determine whether and how many NVDIMM ACPI namespace devices should be
defined in the pre-generated SSDT.

Haozhong


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-24 13:28   ` Haozhong Zhang
@ 2016-02-24 14:00     ` Ross Philipson
  2016-02-24 16:42       ` Haozhong Zhang
  2016-02-24 14:24     ` Jan Beulich
  1 sibling, 1 reply; 121+ messages in thread
From: Ross Philipson @ 2016-02-24 14:00 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, Ian Campbell, Wei Liu, George Dunlap,
	Ian Jackson, Stefano Stabellini, Jun Nakajima, Kevin Tian,
	Xiao Guangrong, xen-devel, Konrad Rzeszutek Wilk, Juergen Gross,
	Keir Fraser

On 02/24/2016 08:28 AM, Haozhong Zhang wrote:
> On 02/18/16 10:17, Jan Beulich wrote:
>>>>> On 01.02.16 at 06:44, <haozhong.zhang@intel.com> wrote:
>>>  This design treats host NVDIMM devices as ordinary MMIO devices:
>>
>> Wrt the cachability note earlier on, I assume you're aware that with
>> the XSA-154 changes we disallow any cachable mappings of MMIO
>> by default.
>>
> 
> EPT entries that map the host NVDIMM SPAs to guest will be the only
> mapping used for NVDIMM. If the memory type in the last level entries is
> always set to the same type reported by NFIT and the ipat bit is always
> set as well, I think it would not suffer from the cache-type
> inconsistency problem in XSA-154?
> 
>>>  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
>>>      and drive host NVDIMM devices (implementing block device
>>>      interface). Namespaces and file systems on host NVDIMM devices
>>>      are handled by Dom0 Linux as well.
>>>
>>>  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
>>>      virtual address space (buf).
>>>
>>>  (3) QEMU gets the host physical address of buf, i.e. the host system
>>>      physical address that is occupied by /dev/pmem0, and calls Xen
>>>      hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
>>>
>>>  (ACPI part is described in Section 3.3 later)
>>>
>>>  Above (1)(2) have already been done in current QEMU. Only (3) is
>>>  needed to implement in QEMU. No change is needed in Xen for address
>>>  mapping in this design.
>>>
>>>  Open: It seems no system call/ioctl is provided by Linux kernel to
>>>        get the physical address from a virtual address.
>>>        /proc/<qemu_pid>/pagemap provides information of mapping from
>>>        VA to PA. Is it an acceptable solution to let QEMU parse this
>>>        file to get the physical address?
>>>
>>>  Open: For a large pmem, mmap(2) is very possible to not map all SPA
>>>        occupied by pmem at the beginning, i.e. QEMU may not be able to
>>>        get all SPA of pmem from buf (in virtual address space) when
>>>        calling XEN_DOMCTL_memory_mapping.
>>>        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
>>>        entire pmem being mmaped?
>>
>> A fundamental question I have here is: Why does qemu need to
>> map this at all? It shouldn't itself need to access those ranges,
>> since the guest is given direct access. It would seem quite a bit
>> more natural if qemu simply inquired to underlying GFN range(s)
>> and handed those to Xen for translation to MFNs and mapping
>> into guest space.
>>
> 
> The above design is more like a hack on the existing QEMU
> implementation for KVM without modifying the Dom0 kernel.
> 
> Maybe it's better to let QEMU to get SPAs from Dom0 kernel (NVDIMM
> driver) and then pass them to Xen for the address mapping:
> (1) QEMU passes fd of /dev/pmemN or file on /dev/pmemN to Dom0 kernel.
> (2) Dom0 kernel finds and returns all SPAs occupied by /dev/pmemN or
>     portions of /dev/pmemN where the file sits.
> (3) QEMU passes above SPAs, and GMFN where they will be mapped to Xen
>     which maps SPAs to GMFN.
> 
>>>  I notice that current XEN_DOMCTL_memory_mapping does not make santiy
>>>  check for the physical address and size passed from caller
>>>  (QEMU). Can QEMU be always trusted? If not, we would need to make Xen
>>>  aware of the SPA range of pmem so that it can refuse map physical
>>>  address in neither the normal ram nor pmem.
>>
>> I'm not sure what missing sanity checks this is about: The handling
>> involves two iomem_access_permitted() calls.
>>
> 
> Oh, I missed iomem_access_permitted(). Seemingly it is what I need if
> the SPA ranges of host NVDIMMs are added to Dom0's iomem_cap initially.
> 
>>> 3.3 Guest ACPI Emulation
>>>
>>> 3.3.1 My Design
>>>
>>>  Guest ACPI emulation is composed of two parts: building guest NFIT
>>>  and SSDT that defines ACPI namespace devices for NVDIMM, and
>>>  emulating guest _DSM.
>>>
>>>  (1) Building Guest ACPI Tables
>>>
>>>   This design reuses and extends hvmloader's existing mechanism that
>>>   loads passthrough ACPI tables from binary files to load NFIT and
>>>   SSDT tables built by QEMU:
>>>   1) Because the current QEMU does not building any ACPI tables when
>>>      it runs as the Xen device model, this design needs to patch QEMU
>>>      to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
>>>
>>>   2) QEMU copies NFIT and SSDT to the end of guest memory below
>>>      4G. The guest address and size of those tables are written into
>>>      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
>>>
>>>   3) hvmloader is patched to probe and load device model passthrough
>>>      ACPI tables from above xenstore keys. The detected ACPI tables
>>>      are then appended to the end of existing guest ACPI tables just
>>>      like what current construct_passthrough_tables() does.
>>>
>>>   Reasons for this design are listed below:
>>>   - NFIT and SSDT in question are quite self-contained, i.e. they do
>>>     not refer to other ACPI tables and not conflict with existing
>>>     guest ACPI tables in Xen. Therefore, it is safe to copy them from
>>>     QEMU and append to existing guest ACPI tables.
>>
>> How is this not conflicting being guaranteed? In particular I don't
>> see how tables containing AML code and coming from different
>> sources won't possibly cause ACPI name space collisions.
>>
> 
> Really there is no effective mechanism to avoid ACPI name space
> collisions (and other kinds of conflicts) between ACPI tables loaded
> from QEMU and ACPI tables built by hvmloader. Because which ACPI tables
> are loaded is determined by developers, IMO it's developers'
> responsibility to avoid any collisions and conflicts with existing ACPI
> tables.
> 
>>> 3.3.3 Alternative Design 2: keeping in Xen
>>>
>>>  Alternative to switching to QEMU, another design would be building
>>>  NFIT and SSDT in hvmloader or toolstack.
>>>
>>>  The amount and parameters of sub-structures in guest NFIT vary
>>>  according to different vNVDIMM configurations that can not be decided
>>>  at compile-time. In contrast, current hvmloader and toolstack can
>>>  only build static ACPI tables, i.e. their contents are decided
>>>  statically at compile-time and independent from the guest
>>>  configuration. In order to build guest NFIT at runtime, this design
>>>  may take following steps:
>>>  (1) xl converts NVDIMM configurations in xl.cfg to corresponding QEMU
>>>      options,
>>>
>>>  (2) QEMU accepts above options, figures out the start SPA range
>>>      address/size/NVDIMM device handles/..., and writes them in
>>>      xenstore. No ACPI table is built by QEMU.
>>>
>>>  (3) Either xl or hvmloader reads above parameters from xenstore and
>>>      builds the NFIT table.
>>>
>>>  For guest SSDT, it would take more work. The ACPI namespace devices
>>>  are defined in SSDT by AML, so an AML builder would be needed to
>>>  generate those definitions at runtime.
>>
>> I'm not sure this last half sentence is true: We do some dynamic
>> initialization of the pre-generated DSDT already, using the runtime
>> populated block at ACPI_INFO_PHYSICAL_ADDRESS.
>>
> 
> IIUC, if an ACPI namespace device in guest DSDT needs to use an
> parameter that is not available at build time, it can refer to the
> corresponding field in the memory block at ACPI_INFO_PHYSICAL_ADDRESS
> that is filled at runtime.
> 
> But it does not work for vNVDIMM. An example of NVDIMM AML code looks
> like:
> 
> Scope (\_SB){
>     Device (NVDR) // Root device for all NVDIMM devices
>     {
>         Name (_HID, "ACPI0012")
>         Method (_STA) {...}
>         Method (_FIT) {...}
>         Method (_DSM, ...) {
>             ...
>         }
> 
>         Device (NVD0)     // 1st NVDIMM device
>         {
>             Name(_ADR, h) //where h is NFIT Device Handle for this NVDIMM
>             Method (_DSM, ...) {
>                 ...
>             }
>         }
> 
>         Device (NVD1)     // 2nd NVDIMM device
>         {
>             Name(_ADR, h) //where h is NFIT Device Handle for this NVDIMM
>             Method (_DSM, ...) {
>                 ...
>             }
>         }
> 
>         ...
>    }
> }
> 
> For each NVDIMM device defined in NFIT, there is a ACPI namespace device
> (e.g. NVD0, NVD1) under the root NVDIMM device (NVDR). Because the
> number of vNVDIMM devices are unknown at build time, we can not
> determine whether and how many NVDIMM ACPI namespace devices should be
> defined in the pre-generated SSDT.

We have dealt with the exact same issue in the past (though it concerned
WMI ACPI devices). The layout and number of those devices and their
methods were very platform-dependent, as this seems to be. Thus we
generated an SSDT at runtime to reflect the underlying platform. At that
time we had to write an AML generator, but as noted earlier, SeaBIOS now
seems to have a fairly complete one. This would allow you to create the
proper number of NVDx devices and set the desired addresses at runtime.

If you have to do it statically at build time you will have to pick a
max number of possible NVDx devices and make some of them report they
are not there (e.g. w/ the _STA methods possibly). In this case you
could extend ACPI_INFO_PHYSICAL_ADDRESS or create your own IO/Memory
OpRegion that you handle at runtime. I would personally go for the first
option.

> 
> Haozhong
> 
> 


-- 
Ross Philipson

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-24 13:28   ` Haozhong Zhang
  2016-02-24 14:00     ` Ross Philipson
@ 2016-02-24 14:24     ` Jan Beulich
  2016-02-24 15:48       ` Haozhong Zhang
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-02-24 14:24 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 24.02.16 at 14:28, <haozhong.zhang@intel.com> wrote:
> On 02/18/16 10:17, Jan Beulich wrote:
>> >>> On 01.02.16 at 06:44, <haozhong.zhang@intel.com> wrote:
>> >  This design treats host NVDIMM devices as ordinary MMIO devices:
>> 
>> Wrt the cachability note earlier on, I assume you're aware that with
>> the XSA-154 changes we disallow any cachable mappings of MMIO
>> by default.
>>
> 
> EPT entries that map the host NVDIMM SPAs to guest will be the only
> mapping used for NVDIMM. If the memory type in the last level entries is
> always set to the same type reported by NFIT and the ipat bit is always
> set as well, I think it would not suffer from the cache-type
> inconsistency problem in XSA-154?

This assumes Xen knows the NVDIMM address ranges, which so
far you meant to keep out of Xen iirc. But yes, things surely can
be made to work; I simply wanted to point out that there are some
caveats.

>> >  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
>> >      and drive host NVDIMM devices (implementing block device
>> >      interface). Namespaces and file systems on host NVDIMM devices
>> >      are handled by Dom0 Linux as well.
>> > 
>> >  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
>> >      virtual address space (buf).
>> > 
>> >  (3) QEMU gets the host physical address of buf, i.e. the host system
>> >      physical address that is occupied by /dev/pmem0, and calls Xen
>> >      hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
>> > 
>> >  (ACPI part is described in Section 3.3 later)
>> > 
>> >  Above (1)(2) have already been done in current QEMU. Only (3) is
>> >  needed to implement in QEMU. No change is needed in Xen for address
>> >  mapping in this design.
>> > 
>> >  Open: It seems no system call/ioctl is provided by Linux kernel to
>> >        get the physical address from a virtual address.
>> >        /proc/<qemu_pid>/pagemap provides information of mapping from
>> >        VA to PA. Is it an acceptable solution to let QEMU parse this
>> >        file to get the physical address?
>> > 
>> >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
>> >        occupied by pmem at the beginning, i.e. QEMU may not be able to
>> >        get all SPA of pmem from buf (in virtual address space) when
>> >        calling XEN_DOMCTL_memory_mapping.
>> >        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
>> >        entire pmem being mmaped?
>> 
>> A fundamental question I have here is: Why does qemu need to
>> map this at all? It shouldn't itself need to access those ranges,
>> since the guest is given direct access. It would seem quite a bit
>> more natural if qemu simply inquired to underlying GFN range(s)
>> and handed those to Xen for translation to MFNs and mapping
>> into guest space.
>>
> 
> The above design is more like a hack on the existing QEMU
> implementation for KVM without modifying the Dom0 kernel.
> 
> Maybe it's better to let QEMU to get SPAs from Dom0 kernel (NVDIMM
> driver) and then pass them to Xen for the address mapping:
> (1) QEMU passes fd of /dev/pmemN or file on /dev/pmemN to Dom0 kernel.
> (2) Dom0 kernel finds and returns all SPAs occupied by /dev/pmemN or
>     portions of /dev/pmemN where the file sits.
> (3) QEMU passes above SPAs, and GMFN where they will be mapped to Xen
>     which maps SPAs to GMFN.

Indeed, and that would also eliminate the second of your Opens
above.

>> > 3.3 Guest ACPI Emulation
>> > 
>> > 3.3.1 My Design
>> > 
>> >  Guest ACPI emulation is composed of two parts: building guest NFIT
>> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
>> >  emulating guest _DSM.
>> > 
>> >  (1) Building Guest ACPI Tables
>> > 
>> >   This design reuses and extends hvmloader's existing mechanism that
>> >   loads passthrough ACPI tables from binary files to load NFIT and
>> >   SSDT tables built by QEMU:
>> >   1) Because the current QEMU does not building any ACPI tables when
>> >      it runs as the Xen device model, this design needs to patch QEMU
>> >      to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
>> > 
>> >   2) QEMU copies NFIT and SSDT to the end of guest memory below
>> >      4G. The guest address and size of those tables are written into
>> >      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
>> > 
>> >   3) hvmloader is patched to probe and load device model passthrough
>> >      ACPI tables from above xenstore keys. The detected ACPI tables
>> >      are then appended to the end of existing guest ACPI tables just
>> >      like what current construct_passthrough_tables() does.
>> > 
>> >   Reasons for this design are listed below:
>> >   - NFIT and SSDT in question are quite self-contained, i.e. they do
>> >     not refer to other ACPI tables and not conflict with existing
>> >     guest ACPI tables in Xen. Therefore, it is safe to copy them from
>> >     QEMU and append to existing guest ACPI tables.
>> 
>> How is this not conflicting being guaranteed? In particular I don't
>> see how tables containing AML code and coming from different
>> sources won't possibly cause ACPI name space collisions.
>>
> 
> Really there is no effective mechanism to avoid ACPI name space
> collisions (and other kinds of conflicts) between ACPI tables loaded
> from QEMU and ACPI tables built by hvmloader. Because which ACPI tables
> are loaded is determined by developers, IMO it's developers'
> responsibility to avoid any collisions and conflicts with existing ACPI
> tables.

Right, but this needs to be spelled out and settled on at design
time (i.e. now), rather than leaving things unspecified, awaiting the
first clash.

Jan

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-24 14:24     ` Jan Beulich
@ 2016-02-24 15:48       ` Haozhong Zhang
  2016-02-24 16:54         ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-24 15:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/24/16 07:24, Jan Beulich wrote:
> >>> On 24.02.16 at 14:28, <haozhong.zhang@intel.com> wrote:
> > On 02/18/16 10:17, Jan Beulich wrote:
> >> >>> On 01.02.16 at 06:44, <haozhong.zhang@intel.com> wrote:
> >> >  This design treats host NVDIMM devices as ordinary MMIO devices:
> >> 
> >> Wrt the cachability note earlier on, I assume you're aware that with
> >> the XSA-154 changes we disallow any cachable mappings of MMIO
> >> by default.
> >>
> > 
> > EPT entries that map the host NVDIMM SPAs to guest will be the only
> > mapping used for NVDIMM. If the memory type in the last level entries is
> > always set to the same type reported by NFIT and the ipat bit is always
> > set as well, I think it would not suffer from the cache-type
> > inconsistency problem in XSA-154?
> 
> This assumes Xen knows the NVDIMM address ranges, which so
> far you meant to keep out of Xen iirc.

The original design did not consider some cases, such as XSA-154 and
MCE handling, and fails for them. For those two cases, Xen really should
track the mappings for NVDIMM. (Yes, I changed my mind.)

> But yes, things surely can
> be made work, I simply wanted to point out that there are some
> caveats.
>
> >> >  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
> >> >      and drive host NVDIMM devices (implementing block device
> >> >      interface). Namespaces and file systems on host NVDIMM devices
> >> >      are handled by Dom0 Linux as well.
> >> > 
> >> >  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
> >> >      virtual address space (buf).
> >> > 
> >> >  (3) QEMU gets the host physical address of buf, i.e. the host system
> >> >      physical address that is occupied by /dev/pmem0, and calls Xen
> >> >      hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
> >> > 
> >> >  (ACPI part is described in Section 3.3 later)
> >> > 
> >> >  Above (1)(2) have already been done in current QEMU. Only (3) is
> >> >  needed to implement in QEMU. No change is needed in Xen for address
> >> >  mapping in this design.
> >> > 
> >> >  Open: It seems no system call/ioctl is provided by Linux kernel to
> >> >        get the physical address from a virtual address.
> >> >        /proc/<qemu_pid>/pagemap provides information of mapping from
> >> >        VA to PA. Is it an acceptable solution to let QEMU parse this
> >> >        file to get the physical address?
> >> > 
> >> >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> >> >        occupied by pmem at the beginning, i.e. QEMU may not be able to
> >> >        get all SPA of pmem from buf (in virtual address space) when
> >> >        calling XEN_DOMCTL_memory_mapping.
> >> >        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> >> >        entire pmem being mmaped?
> >> 
> >> A fundamental question I have here is: Why does qemu need to
> >> map this at all? It shouldn't itself need to access those ranges,
> >> since the guest is given direct access. It would seem quite a bit
> >> more natural if qemu simply inquired to underlying GFN range(s)
> >> and handed those to Xen for translation to MFNs and mapping
> >> into guest space.
> >>
> > 
> > The above design is more like a hack on the existing QEMU
> > implementation for KVM without modifying the Dom0 kernel.
> > 
> > Maybe it's better to let QEMU to get SPAs from Dom0 kernel (NVDIMM
> > driver) and then pass them to Xen for the address mapping:
> > (1) QEMU passes fd of /dev/pmemN or file on /dev/pmemN to Dom0 kernel.
> > (2) Dom0 kernel finds and returns all SPAs occupied by /dev/pmemN or
> >     portions of /dev/pmemN where the file sits.
> > (3) QEMU passes above SPAs, and GMFN where they will be mapped to Xen
> >     which maps SPAs to GMFN.
> 
> Indeed, and that would also eliminate the second of your Opens
> above.
>

Yes.

> >> > 3.3 Guest ACPI Emulation
> >> > 
> >> > 3.3.1 My Design
> >> > 
> >> >  Guest ACPI emulation is composed of two parts: building guest NFIT
> >> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
> >> >  emulating guest _DSM.
> >> > 
> >> >  (1) Building Guest ACPI Tables
> >> > 
> >> >   This design reuses and extends hvmloader's existing mechanism that
> >> >   loads passthrough ACPI tables from binary files to load NFIT and
> >> >   SSDT tables built by QEMU:
> >> >   1) Because the current QEMU does not building any ACPI tables when
> >> >      it runs as the Xen device model, this design needs to patch QEMU
> >> >      to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
> >> > 
> >> >   2) QEMU copies NFIT and SSDT to the end of guest memory below
> >> >      4G. The guest address and size of those tables are written into
> >> >      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
> >> > 
> >> >   3) hvmloader is patched to probe and load device model passthrough
> >> >      ACPI tables from above xenstore keys. The detected ACPI tables
> >> >      are then appended to the end of existing guest ACPI tables just
> >> >      like what current construct_passthrough_tables() does.
> >> > 
> >> >   Reasons for this design are listed below:
> >> >   - NFIT and SSDT in question are quite self-contained, i.e. they do
> >> >     not refer to other ACPI tables and not conflict with existing
> >> >     guest ACPI tables in Xen. Therefore, it is safe to copy them from
> >> >     QEMU and append to existing guest ACPI tables.
> >> 
> >> How is this not conflicting being guaranteed? In particular I don't
> >> see how tables containing AML code and coming from different
> >> sources won't possibly cause ACPI name space collisions.
> >>
> > 
> > Really there is no effective mechanism to avoid ACPI name space
> > collisions (and other kinds of conflicts) between ACPI tables loaded
> > from QEMU and ACPI tables built by hvmloader. Because which ACPI tables
> > are loaded is determined by developers, IMO it's developers'
> > responsibility to avoid any collisions and conflicts with existing ACPI
> > tables.
> 
> Right, but this needs to be spelled out and settled on at design
> time (i.e. now), rather leaving things unspecified, awaiting the
> first clash.
>

So that means if no collision-proof mechanism is introduced, Xen should not
trust any passed-in ACPI tables and should build them by itself?

Haozhong

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-24 14:00     ` Ross Philipson
@ 2016-02-24 16:42       ` Haozhong Zhang
  2016-02-24 17:50         ` Ross Philipson
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-24 16:42 UTC (permalink / raw)
  To: Ross Philipson
  Cc: Jun Nakajima, Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Xiao Guangrong, Keir Fraser

On 02/24/16 09:00, Ross Philipson wrote:
[...]
> > For each NVDIMM device defined in NFIT, there is a ACPI namespace device
> > (e.g. NVD0, NVD1) under the root NVDIMM device (NVDR). Because the
> > number of vNVDIMM devices are unknown at build time, we can not
> > determine whether and how many NVDIMM ACPI namespace devices should be
> > defined in the pre-generated SSDT.
> 
> We have dealt with the exact same issue in the past (though it concerned
> WMI ACPI devices). The layout and number of these devices and their
> methods was very platform dependent as this seems to be. Thus we
> generated an SSDT at runtime to reflect the underlying platform. At that
> time we had to write an AML generator but as noted earlier, SeaBIOS now
> seems to have a fairly complete one. This would allow you to do create
> the proper number of NVDx devices and set the desired addresses at runtime.
>

Thanks for the information. I just did a quick grep on SeaBIOS code and
found some AML related code in scripts/acpi_extract.py. Is this the AML
generator you mean?

> If you have to do it statically at build time you will have to pick a
> max number of possible NVDx devices and make some of them report they
> are not there (e.g. w/ the _STA methods possibly). In this case you
> could extend ACPI_INFO_PHYSICAL_ADDRESS or create your own IO/Memory
> OpRegion that you handle at runtime. I would personally go for the first
> option.
>

_STA does not work, because the individual NVDIMM ACPI namespace device
(NVDx) does not provide an _STA method. In addition, the ACPI spec does
not define any mechanism to check the presence of an individual NVDx.

Haozhong

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-24 15:48       ` Haozhong Zhang
@ 2016-02-24 16:54         ` Jan Beulich
  2016-02-28 14:48           ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-02-24 16:54 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 24.02.16 at 16:48, <haozhong.zhang@intel.com> wrote:
> On 02/24/16 07:24, Jan Beulich wrote:
>> >>> On 24.02.16 at 14:28, <haozhong.zhang@intel.com> wrote:
>> > On 02/18/16 10:17, Jan Beulich wrote:
>> >> >>> On 01.02.16 at 06:44, <haozhong.zhang@intel.com> wrote:
>> >> > 3.3 Guest ACPI Emulation
>> >> > 
>> >> > 3.3.1 My Design
>> >> > 
>> >> >  Guest ACPI emulation is composed of two parts: building guest NFIT
>> >> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
>> >> >  emulating guest _DSM.
>> >> > 
>> >> >  (1) Building Guest ACPI Tables
>> >> > 
>> >> >   This design reuses and extends hvmloader's existing mechanism that
>> >> >   loads passthrough ACPI tables from binary files to load NFIT and
>> >> >   SSDT tables built by QEMU:
>> >> >   1) Because the current QEMU does not building any ACPI tables when
>> >> >      it runs as the Xen device model, this design needs to patch QEMU
>> >> >      to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
>> >> > 
>> >> >   2) QEMU copies NFIT and SSDT to the end of guest memory below
>> >> >      4G. The guest address and size of those tables are written into
>> >> >      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
>> >> > 
>> >> >   3) hvmloader is patched to probe and load device model passthrough
>> >> >      ACPI tables from above xenstore keys. The detected ACPI tables
>> >> >      are then appended to the end of existing guest ACPI tables just
>> >> >      like what current construct_passthrough_tables() does.
>> >> > 
>> >> >   Reasons for this design are listed below:
>> >> >   - NFIT and SSDT in question are quite self-contained, i.e. they do
>> >> >     not refer to other ACPI tables and not conflict with existing
>> >> >     guest ACPI tables in Xen. Therefore, it is safe to copy them from
>> >> >     QEMU and append to existing guest ACPI tables.
>> >> 
>> >> How is this not conflicting being guaranteed? In particular I don't
>> >> see how tables containing AML code and coming from different
>> >> sources won't possibly cause ACPI name space collisions.
>> >>
>> > 
>> > Really there is no effective mechanism to avoid ACPI name space
>> > collisions (and other kinds of conflicts) between ACPI tables loaded
>> > from QEMU and ACPI tables built by hvmloader. Because which ACPI tables
>> > are loaded is determined by developers, IMO it's developers'
>> > responsibility to avoid any collisions and conflicts with existing ACPI
>> > tables.
>> 
>> Right, but this needs to be spelled out and settled on at design
>> time (i.e. now), rather leaving things unspecified, awaiting the
>> first clash.
> 
> So that means if no collision-proof mechanism is introduced, Xen should not
> trust any passed-in ACPI tables and should build them by itself?

Basically yes, albeit collision-proof may be too much to demand.
Simply separating name spaces (for hvmloader and qemu to have
their own sub-spaces) would be sufficient imo. We should trust
ourselves to play by such a specification.

Jan

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-24 16:42       ` Haozhong Zhang
@ 2016-02-24 17:50         ` Ross Philipson
  0 siblings, 0 replies; 121+ messages in thread
From: Ross Philipson @ 2016-02-24 17:50 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, Ian Campbell, Wei Liu, George Dunlap,
	Ian Jackson, Stefano Stabellini, Jun Nakajima, Kevin Tian,
	Xiao Guangrong, xen-devel, Konrad Rzeszutek Wilk, Juergen Gross,
	Keir Fraser

On 02/24/2016 11:42 AM, Haozhong Zhang wrote:
> On 02/24/16 09:00, Ross Philipson wrote:
> [...]
>>> For each NVDIMM device defined in NFIT, there is a ACPI namespace device
>>> (e.g. NVD0, NVD1) under the root NVDIMM device (NVDR). Because the
>>> number of vNVDIMM devices are unknown at build time, we can not
>>> determine whether and how many NVDIMM ACPI namespace devices should be
>>> defined in the pre-generated SSDT.
>>
>> We have dealt with the exact same issue in the past (though it concerned
>> WMI ACPI devices). The layout and number of these devices and their
>> methods was very platform dependent as this seems to be. Thus we
>> generated an SSDT at runtime to reflect the underlying platform. At that
>> time we had to write an AML generator but as noted earlier, SeaBIOS now
>> seems to have a fairly complete one. This would allow you to do create
>> the proper number of NVDx devices and set the desired addresses at runtime.
>>
> 
> Thanks for the information. I just did a quick grep on SeaBIOS code and
> found some AML related code in scripts/acpi_extract.py. Is this the AML
> generator you mean?

Oh, terribly sorry - I am talking about QEMU. The code in
hw/acpi/aml-build.c looks like it does a lot of what my library does. I
don't know how complete it is; for example, I had to do a lot of work to
build resource packages, and I have not found that there yet. But it is a
good starting place.
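
For illustration, an untested sketch of what using those helpers for the
NVDx devices might look like (exact helper signatures vary across QEMU
versions; the _STA/_FIT/_DSM bodies are omitted, and nr_vnvdimms and
handles[] are placeholders for whatever per-device state the real
implementation keeps):

#include "qemu/osdep.h"
#include "hw/acpi/aml-build.h"

/* Untested sketch: emit \_SB.NVDR plus one NVxx child per vNVDIMM at
 * run time using QEMU's AML builder. */
static void build_vnvdimm_devices(Aml *ssdt, int nr_vnvdimms,
                                  const uint32_t *handles)
{
    Aml *sb = aml_scope("\\_SB");
    Aml *nvdr = aml_device("NVDR");
    int i;

    aml_append(nvdr, aml_name_decl("_HID", aml_string("ACPI0012")));
    /* _STA, _FIT and _DSM methods would be appended here as well. */

    for (i = 0; i < nr_vnvdimms; i++) {
        Aml *dev = aml_device("NV%02X", i);  /* aml_device() is printf-like */

        aml_append(dev, aml_name_decl("_ADR", aml_int(handles[i])));
        aml_append(nvdr, dev);
    }

    aml_append(sb, nvdr);
    aml_append(ssdt, sb);
}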

> 
>> If you have to do it statically at build time you will have to pick a
>> max number of possible NVDx devices and make some of them report they
>> are not there (e.g. w/ the _STA methods possibly). In this case you
>> could extend ACPI_INFO_PHYSICAL_ADDRESS or create your own IO/Memory
>> OpRegion that you handle at runtime. I would personally go for the first
>> option.
>>
> 
> _STA does not work, because the individual NVDIMM ACPI namespace device
> (NVDx) does no provide _STA method. In addition, ACPI spec does not
> define any mechanism to check the presence of an individual NVDx.
> 
> Haozhong
> 


-- 
Ross Philipson

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-24 16:54         ` Jan Beulich
@ 2016-02-28 14:48           ` Haozhong Zhang
  2016-02-29  9:01             ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-28 14:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/24/16 09:54, Jan Beulich wrote:
> >>> On 24.02.16 at 16:48, <haozhong.zhang@intel.com> wrote:
> > On 02/24/16 07:24, Jan Beulich wrote:
> >> >>> On 24.02.16 at 14:28, <haozhong.zhang@intel.com> wrote:
> >> > On 02/18/16 10:17, Jan Beulich wrote:
> >> >> >>> On 01.02.16 at 06:44, <haozhong.zhang@intel.com> wrote:
> >> >> > 3.3 Guest ACPI Emulation
> >> >> > 
> >> >> > 3.3.1 My Design
> >> >> > 
> >> >> >  Guest ACPI emulation is composed of two parts: building guest NFIT
> >> >> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
> >> >> >  emulating guest _DSM.
> >> >> > 
> >> >> >  (1) Building Guest ACPI Tables
> >> >> > 
> >> >> >   This design reuses and extends hvmloader's existing mechanism that
> >> >> >   loads passthrough ACPI tables from binary files to load NFIT and
> >> >> >   SSDT tables built by QEMU:
> >> >> >   1) Because the current QEMU does not building any ACPI tables when
> >> >> >      it runs as the Xen device model, this design needs to patch QEMU
> >> >> >      to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
> >> >> > 
> >> >> >   2) QEMU copies NFIT and SSDT to the end of guest memory below
> >> >> >      4G. The guest address and size of those tables are written into
> >> >> >      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
> >> >> > 
> >> >> >   3) hvmloader is patched to probe and load device model passthrough
> >> >> >      ACPI tables from above xenstore keys. The detected ACPI tables
> >> >> >      are then appended to the end of existing guest ACPI tables just
> >> >> >      like what current construct_passthrough_tables() does.
> >> >> > 
> >> >> >   Reasons for this design are listed below:
> >> >> >   - NFIT and SSDT in question are quite self-contained, i.e. they do
> >> >> >     not refer to other ACPI tables and not conflict with existing
> >> >> >     guest ACPI tables in Xen. Therefore, it is safe to copy them from
> >> >> >     QEMU and append to existing guest ACPI tables.
> >> >> 
> >> >> How is this not conflicting being guaranteed? In particular I don't
> >> >> see how tables containing AML code and coming from different
> >> >> sources won't possibly cause ACPI name space collisions.
> >> >>
> >> > 
> >> > Really there is no effective mechanism to avoid ACPI name space
> >> > collisions (and other kinds of conflicts) between ACPI tables loaded
> >> > from QEMU and ACPI tables built by hvmloader. Because which ACPI tables
> >> > are loaded is determined by developers, IMO it's developers'
> >> > responsibility to avoid any collisions and conflicts with existing ACPI
> >> > tables.
> >> 
> >> Right, but this needs to be spelled out and settled on at design
> >> time (i.e. now), rather leaving things unspecified, awaiting the
> >> first clash.
> > 
> > So that means if no collision-proof mechanism is introduced, Xen should not
> > trust any passed-in ACPI tables and should build them by itself?
> 
> Basically yes, albeit collision-proof may be too much to demand.
> Simply separating name spaces (for hvmloader and qemu to have
> their own sub-spaces) would be sufficient imo. We should trust
> ourselves to play by such a specification.
>

I don't quite understand 'separating name spaces'. Do you mean, for
example, that if both hvmloader and qemu want to put a namespace device
under \_SB, they should place them in different sub-scopes under \_SB?
But that does not work for Linux, at least.

Anyway, we may avoid some conflicts between ACPI tables/objects by
restricting which tables and objects can be passed from QEMU to Xen:
(1) For ACPI tables, Xen does not accept tables of types it already
    builds itself, e.g. DSDT and SSDT.
(2) Xen does not accept ACPI tables for devices that are not attached to
    a domain, e.g. NFIT cannot be passed if a domain does not have
    vNVDIMM.
(3) For ACPI objects, Xen only accepts namespace devices and requires
    that their names do not conflict with existing ones provided by Xen.

In the implementation, QEMU could put the passed-in ACPI tables and
objects in a series of blobs in the following format:
  +------+------+--....--+
  | type | size |  data  |
  +------+------+--....--+
where
(1) 'type' indicates what kind of data is stored in this blob:
        0 for an ACPI table,
        1 for an ACPI namespace device;
(2) 'size' indicates this blob's size in bytes. The next blob (if it
    exists) can be found by adding 'size' to the base address of the
    current blob;
(3) 'data' is of variable length and stores the actual passed content:
    (a) for a type 0 blob (ACPI table), a complete ACPI table including
        the table header is stored in 'data';
    (b) for a type 1 blob (ACPI namespace device), 'data' begins with a
        4-byte device name, followed by the AML code within that device,
        e.g. for a device
          Device (DEV0) {
              Name (_HID, "ACPI1234")
              Method (_DSM) { ... }
          }
       "DEV0" is stored at the beginning of 'data', followed by the AML
       code of
          Name (_HID, "ACPI1234")
          Method (_DSM) { ... }
     
The number and the starting guest address of all blobs are passed to Xen
via xenstore. The Xen toolstack/hvmloader can then enumerate those blobs
and check the signatures or names in 'data' according to their types, as
sketched below.
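
For illustration only, a minimal sketch of how hvmloader could walk such
blobs follows. The structure name, the type values and the bounds checks
are my own assumptions for this sketch, not an existing Xen/QEMU
interface:

  /* Sketch only: 'struct dm_acpi_blob' and the type values are assumed. */
  #include <stdint.h>
  #include <string.h>

  #define DM_ACPI_TYPE_TABLE  0   /* 'data' holds a complete ACPI table */
  #define DM_ACPI_TYPE_NSDEV  1   /* 'data' holds a 4-byte name + AML   */

  struct dm_acpi_blob {
      uint8_t  type;
      uint32_t size;              /* total blob size in bytes */
      uint8_t  data[];
  } __attribute__((packed));

  /* Walk all blobs in the guest memory region advertised via xenstore. */
  static void enumerate_dm_acpi_blobs(uint8_t *base, uint32_t total_len)
  {
      uint32_t off = 0;

      while ( off + sizeof(struct dm_acpi_blob) <= total_len )
      {
          struct dm_acpi_blob *b = (struct dm_acpi_blob *)(base + off);

          if ( b->size < sizeof(*b) || b->size > total_len - off )
              break;                    /* malformed blob, stop */

          if ( b->type == DM_ACPI_TYPE_TABLE )
          {
              /* 'data' starts with a standard ACPI table header whose
               * signature can be checked against the tables Xen builds. */
          }
          else if ( b->type == DM_ACPI_TYPE_NSDEV )
          {
              char name[5];

              memcpy(name, b->data, 4); /* device name, e.g. "DEV0" */
              name[4] = '\0';
              /* the remaining b->size - sizeof(*b) - 4 bytes are AML */
          }

          off += b->size;               /* 'size' locates the next blob */
      }
  }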

The acceptable ACPI tables will be appended to the existing tables built
by Xen. For acceptable ACPI namespace devices, hvmloader will recreate
AML devices for them and put those AML devices in a new SSDT. Only a
small piece of code to build just the AML device entry, rather than a
comprehensive AML builder, is needed then.

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-28 14:48           ` Haozhong Zhang
@ 2016-02-29  9:01             ` Jan Beulich
  2016-02-29  9:45               ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-02-29  9:01 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 28.02.16 at 15:48, <haozhong.zhang@intel.com> wrote:
> On 02/24/16 09:54, Jan Beulich wrote:
>> >>> On 24.02.16 at 16:48, <haozhong.zhang@intel.com> wrote:
>> > On 02/24/16 07:24, Jan Beulich wrote:
>> >> >>> On 24.02.16 at 14:28, <haozhong.zhang@intel.com> wrote:
>> >> > On 02/18/16 10:17, Jan Beulich wrote:
>> >> >> >>> On 01.02.16 at 06:44, <haozhong.zhang@intel.com> wrote:
>> >> >> > 3.3 Guest ACPI Emulation
>> >> >> > 
>> >> >> > 3.3.1 My Design
>> >> >> > 
>> >> >> >  Guest ACPI emulation is composed of two parts: building guest NFIT
>> >> >> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
>> >> >> >  emulating guest _DSM.
>> >> >> > 
>> >> >> >  (1) Building Guest ACPI Tables
>> >> >> > 
>> >> >> >   This design reuses and extends hvmloader's existing mechanism that
>> >> >> >   loads passthrough ACPI tables from binary files to load NFIT and
>> >> >> >   SSDT tables built by QEMU:
>> >> >> >   1) Because the current QEMU does not building any ACPI tables when
>> >> >> >      it runs as the Xen device model, this design needs to patch QEMU
>> >> >> >      to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
>> >> >> > 
>> >> >> >   2) QEMU copies NFIT and SSDT to the end of guest memory below
>> >> >> >      4G. The guest address and size of those tables are written into
>> >> >> >      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
>> >> >> > 
>> >> >> >   3) hvmloader is patched to probe and load device model passthrough
>> >> >> >      ACPI tables from above xenstore keys. The detected ACPI tables
>> >> >> >      are then appended to the end of existing guest ACPI tables just
>> >> >> >      like what current construct_passthrough_tables() does.
>> >> >> > 
>> >> >> >   Reasons for this design are listed below:
>> >> >> >   - NFIT and SSDT in question are quite self-contained, i.e. they do
>> >> >> >     not refer to other ACPI tables and not conflict with existing
>> >> >> >     guest ACPI tables in Xen. Therefore, it is safe to copy them from
>> >> >> >     QEMU and append to existing guest ACPI tables.
>> >> >> 
>> >> >> How is this not conflicting being guaranteed? In particular I don't
>> >> >> see how tables containing AML code and coming from different
>> >> >> sources won't possibly cause ACPI name space collisions.
>> >> >>
>> >> > 
>> >> > Really there is no effective mechanism to avoid ACPI name space
>> >> > collisions (and other kinds of conflicts) between ACPI tables loaded
>> >> > from QEMU and ACPI tables built by hvmloader. Because which ACPI tables
>> >> > are loaded is determined by developers, IMO it's developers'
>> >> > responsibility to avoid any collisions and conflicts with existing ACPI
>> >> > tables.
>> >> 
>> >> Right, but this needs to be spelled out and settled on at design
>> >> time (i.e. now), rather leaving things unspecified, awaiting the
>> >> first clash.
>> > 
>> > So that means if no collision-proof mechanism is introduced, Xen should not
>> > trust any passed-in ACPI tables and should build them by itself?
>> 
>> Basically yes, albeit collision-proof may be too much to demand.
>> Simply separating name spaces (for hvmloader and qemu to have
>> their own sub-spaces) would be sufficient imo. We should trust
>> ourselves to play by such a specification.
>>
> 
> I don't quite understand 'separating name spaces'. Do you mean, for
> example, if both hvmloader and qemu want to put a namespace device under
> \_SB, they could be put in different sub-scopes under \_SB? But it does
> not work for Linux at least.

Aiui just the leaf names matter for sufficient separation, i.e.
recurring sub-scopes should not be a problem.

> Anyway, we may avoid some conflicts between ACPI tables/objects by
> restricting which tables and objects can be passed from QEMU to Xen:
> (1) For ACPI tables, xen does not accept those built by itself,
>     e.g. DSDT and SSDT.
> (2) xen does not accept ACPI tables for devices that are not attached to
>     a domain, e.g. if NFIT cannot be passed if a domain does not have
>     vNVDIMM.
> (3) For ACPI objects, xen only accepts namespace devices and requires
>     their names does not conflict with existing ones provided by Xen.

And how do you imagine enforcing this without parsing the
handed-over AML? (Remember there's no AML parser in hvmloader.)

Jan


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-29  9:01             ` Jan Beulich
@ 2016-02-29  9:45               ` Haozhong Zhang
  2016-02-29 10:12                 ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-29  9:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/29/16 02:01, Jan Beulich wrote:
> >>> On 28.02.16 at 15:48, <haozhong.zhang@intel.com> wrote:
> > On 02/24/16 09:54, Jan Beulich wrote:
> >> >>> On 24.02.16 at 16:48, <haozhong.zhang@intel.com> wrote:
> >> > On 02/24/16 07:24, Jan Beulich wrote:
> >> >> >>> On 24.02.16 at 14:28, <haozhong.zhang@intel.com> wrote:
> >> >> > On 02/18/16 10:17, Jan Beulich wrote:
> >> >> >> >>> On 01.02.16 at 06:44, <haozhong.zhang@intel.com> wrote:
> >> >> >> > 3.3 Guest ACPI Emulation
> >> >> >> > 
> >> >> >> > 3.3.1 My Design
> >> >> >> > 
> >> >> >> >  Guest ACPI emulation is composed of two parts: building guest NFIT
> >> >> >> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
> >> >> >> >  emulating guest _DSM.
> >> >> >> > 
> >> >> >> >  (1) Building Guest ACPI Tables
> >> >> >> > 
> >> >> >> >   This design reuses and extends hvmloader's existing mechanism that
> >> >> >> >   loads passthrough ACPI tables from binary files to load NFIT and
> >> >> >> >   SSDT tables built by QEMU:
> >> >> >> >   1) Because the current QEMU does not building any ACPI tables when
> >> >> >> >      it runs as the Xen device model, this design needs to patch QEMU
> >> >> >> >      to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
> >> >> >> > 
> >> >> >> >   2) QEMU copies NFIT and SSDT to the end of guest memory below
> >> >> >> >      4G. The guest address and size of those tables are written into
> >> >> >> >      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
> >> >> >> > 
> >> >> >> >   3) hvmloader is patched to probe and load device model passthrough
> >> >> >> >      ACPI tables from above xenstore keys. The detected ACPI tables
> >> >> >> >      are then appended to the end of existing guest ACPI tables just
> >> >> >> >      like what current construct_passthrough_tables() does.
> >> >> >> > 
> >> >> >> >   Reasons for this design are listed below:
> >> >> >> >   - NFIT and SSDT in question are quite self-contained, i.e. they do
> >> >> >> >     not refer to other ACPI tables and not conflict with existing
> >> >> >> >     guest ACPI tables in Xen. Therefore, it is safe to copy them from
> >> >> >> >     QEMU and append to existing guest ACPI tables.
> >> >> >> 
> >> >> >> How is this not conflicting being guaranteed? In particular I don't
> >> >> >> see how tables containing AML code and coming from different
> >> >> >> sources won't possibly cause ACPI name space collisions.
> >> >> >>
> >> >> > 
> >> >> > Really there is no effective mechanism to avoid ACPI name space
> >> >> > collisions (and other kinds of conflicts) between ACPI tables loaded
> >> >> > from QEMU and ACPI tables built by hvmloader. Because which ACPI tables
> >> >> > are loaded is determined by developers, IMO it's developers'
> >> >> > responsibility to avoid any collisions and conflicts with existing ACPI
> >> >> > tables.
> >> >> 
> >> >> Right, but this needs to be spelled out and settled on at design
> >> >> time (i.e. now), rather leaving things unspecified, awaiting the
> >> >> first clash.
> >> > 
> >> > So that means if no collision-proof mechanism is introduced, Xen should not
> >> > trust any passed-in ACPI tables and should build them by itself?
> >> 
> >> Basically yes, albeit collision-proof may be too much to demand.
> >> Simply separating name spaces (for hvmloader and qemu to have
> >> their own sub-spaces) would be sufficient imo. We should trust
> >> ourselves to play by such a specification.
> >>
> > 
> > I don't quite understand 'separating name spaces'. Do you mean, for
> > example, if both hvmloader and qemu want to put a namespace device under
> > \_SB, they could be put in different sub-scopes under \_SB? But it does
> > not work for Linux at least.
> 
> Aiui just the leaf names matter for sufficient separation, i.e.
> recurring sub-scopes should not be a problem.
>
> > Anyway, we may avoid some conflicts between ACPI tables/objects by
> > restricting which tables and objects can be passed from QEMU to Xen:
> > (1) For ACPI tables, xen does not accept those built by itself,
> >     e.g. DSDT and SSDT.
> > (2) xen does not accept ACPI tables for devices that are not attached to
> >     a domain, e.g. if NFIT cannot be passed if a domain does not have
> >     vNVDIMM.
> > (3) For ACPI objects, xen only accepts namespace devices and requires
> >     their names does not conflict with existing ones provided by Xen.
> 
> And how do you imagine to enforce this without parsing the
> handed AML? (Remember there's no AML parser in hvmloader.)
>

As I proposed in my last reply, instead of passing an entire ACPI object,
QEMU passes the device name and the AML code under the AML device entry
separately. Because the name is given explicitly, no AML parser is
needed in hvmloader.

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-29  9:45               ` Haozhong Zhang
@ 2016-02-29 10:12                 ` Jan Beulich
  2016-02-29 11:52                   ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-02-29 10:12 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 29.02.16 at 10:45, <haozhong.zhang@intel.com> wrote:
> On 02/29/16 02:01, Jan Beulich wrote:
>> >>> On 28.02.16 at 15:48, <haozhong.zhang@intel.com> wrote:
>> > Anyway, we may avoid some conflicts between ACPI tables/objects by
>> > restricting which tables and objects can be passed from QEMU to Xen:
>> > (1) For ACPI tables, xen does not accept those built by itself,
>> >     e.g. DSDT and SSDT.
>> > (2) xen does not accept ACPI tables for devices that are not attached to
>> >     a domain, e.g. if NFIT cannot be passed if a domain does not have
>> >     vNVDIMM.
>> > (3) For ACPI objects, xen only accepts namespace devices and requires
>> >     their names does not conflict with existing ones provided by Xen.
>> 
>> And how do you imagine to enforce this without parsing the
>> handed AML? (Remember there's no AML parser in hvmloader.)
> 
> As I proposed in last reply, instead of passing an entire ACPI object,
> QEMU passes the device name and the AML code under the AML device entry
> separately. Because the name is explicitly given, no AML parser is
> needed in hvmloader.

I must not only have missed that proposal, but I also don't see
how you mean this to work: are you suggesting that hvmloader
construct valid AML from the passed-in blob? Or are you instead
considering passing redundant information (the name given once
explicitly and once embedded in the AML blob), allowing the two
to get out of sync?

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-29 10:12                 ` Jan Beulich
@ 2016-02-29 11:52                   ` Haozhong Zhang
  2016-02-29 12:04                     ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-29 11:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/29/16 03:12, Jan Beulich wrote:
> >>> On 29.02.16 at 10:45, <haozhong.zhang@intel.com> wrote:
> > On 02/29/16 02:01, Jan Beulich wrote:
> >> >>> On 28.02.16 at 15:48, <haozhong.zhang@intel.com> wrote:
> >> > Anyway, we may avoid some conflicts between ACPI tables/objects by
> >> > restricting which tables and objects can be passed from QEMU to Xen:
> >> > (1) For ACPI tables, xen does not accept those built by itself,
> >> >     e.g. DSDT and SSDT.
> >> > (2) xen does not accept ACPI tables for devices that are not attached to
> >> >     a domain, e.g. if NFIT cannot be passed if a domain does not have
> >> >     vNVDIMM.
> >> > (3) For ACPI objects, xen only accepts namespace devices and requires
> >> >     their names does not conflict with existing ones provided by Xen.
> >> 
> >> And how do you imagine to enforce this without parsing the
> >> handed AML? (Remember there's no AML parser in hvmloader.)
> > 
> > As I proposed in last reply, instead of passing an entire ACPI object,
> > QEMU passes the device name and the AML code under the AML device entry
> > separately. Because the name is explicitly given, no AML parser is
> > needed in hvmloader.
> 
> I must not only have missed that proposal, but I also don't see
> how you mean this to work: Are you suggesting for hvmloader to
> construct valid AML from the passed in blob? Or are you instead
> considering to pass redundant information (name once given
> explicitly and once embedded in the AML blob), allowing the two
> to be out of sync?
>

I mean the former.

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-29 11:52                   ` Haozhong Zhang
@ 2016-02-29 12:04                     ` Jan Beulich
  2016-02-29 12:22                       ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-02-29 12:04 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 29.02.16 at 12:52, <haozhong.zhang@intel.com> wrote:
> On 02/29/16 03:12, Jan Beulich wrote:
>> >>> On 29.02.16 at 10:45, <haozhong.zhang@intel.com> wrote:
>> > On 02/29/16 02:01, Jan Beulich wrote:
>> >> >>> On 28.02.16 at 15:48, <haozhong.zhang@intel.com> wrote:
>> >> > Anyway, we may avoid some conflicts between ACPI tables/objects by
>> >> > restricting which tables and objects can be passed from QEMU to Xen:
>> >> > (1) For ACPI tables, xen does not accept those built by itself,
>> >> >     e.g. DSDT and SSDT.
>> >> > (2) xen does not accept ACPI tables for devices that are not attached to
>> >> >     a domain, e.g. if NFIT cannot be passed if a domain does not have
>> >> >     vNVDIMM.
>> >> > (3) For ACPI objects, xen only accepts namespace devices and requires
>> >> >     their names does not conflict with existing ones provided by Xen.
>> >> 
>> >> And how do you imagine to enforce this without parsing the
>> >> handed AML? (Remember there's no AML parser in hvmloader.)
>> > 
>> > As I proposed in last reply, instead of passing an entire ACPI object,
>> > QEMU passes the device name and the AML code under the AML device entry
>> > separately. Because the name is explicitly given, no AML parser is
>> > needed in hvmloader.
>> 
>> I must not only have missed that proposal, but I also don't see
>> how you mean this to work: Are you suggesting for hvmloader to
>> construct valid AML from the passed in blob? Or are you instead
>> considering to pass redundant information (name once given
>> explicitly and once embedded in the AML blob), allowing the two
>> to be out of sync?
> 
> I mean the former one.

Which will involve adding how much new code to it?

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-29 12:04                     ` Jan Beulich
@ 2016-02-29 12:22                       ` Haozhong Zhang
  2016-03-01 13:51                         ` Ian Jackson
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-02-29 12:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 02/29/16 05:04, Jan Beulich wrote:
> >>> On 29.02.16 at 12:52, <haozhong.zhang@intel.com> wrote:
> > On 02/29/16 03:12, Jan Beulich wrote:
> >> >>> On 29.02.16 at 10:45, <haozhong.zhang@intel.com> wrote:
> >> > On 02/29/16 02:01, Jan Beulich wrote:
> >> >> >>> On 28.02.16 at 15:48, <haozhong.zhang@intel.com> wrote:
> >> >> > Anyway, we may avoid some conflicts between ACPI tables/objects by
> >> >> > restricting which tables and objects can be passed from QEMU to Xen:
> >> >> > (1) For ACPI tables, xen does not accept those built by itself,
> >> >> >     e.g. DSDT and SSDT.
> >> >> > (2) xen does not accept ACPI tables for devices that are not attached to
> >> >> >     a domain, e.g. if NFIT cannot be passed if a domain does not have
> >> >> >     vNVDIMM.
> >> >> > (3) For ACPI objects, xen only accepts namespace devices and requires
> >> >> >     their names does not conflict with existing ones provided by Xen.
> >> >> 
> >> >> And how do you imagine to enforce this without parsing the
> >> >> handed AML? (Remember there's no AML parser in hvmloader.)
> >> > 
> >> > As I proposed in last reply, instead of passing an entire ACPI object,
> >> > QEMU passes the device name and the AML code under the AML device entry
> >> > separately. Because the name is explicitly given, no AML parser is
> >> > needed in hvmloader.
> >> 
> >> I must not only have missed that proposal, but I also don't see
> >> how you mean this to work: Are you suggesting for hvmloader to
> >> construct valid AML from the passed in blob? Or are you instead
> >> considering to pass redundant information (name once given
> >> explicitly and once embedded in the AML blob), allowing the two
> >> to be out of sync?
> > 
> > I mean the former one.
> 
> Which will involve adding how much new code to it?
>

Because hvmloader only accepts AML devices rather than arbitrary objects,
only code that builds the outermost part of an AML device is needed. In
the ACPI spec, an AML device is defined as
    DefDevice := DeviceOp PkgLength NameString ObjectList
hvmloader only needs to build the first three terms, while the last one
is passed from QEMU.
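
Roughly, that new code could look like the following sketch. This is only
an illustration of the idea, not actual hvmloader code: the function names
are made up, and the PkgLength encoding follows the "Definition Block
Encoding" rules in the ACPI spec (the encoded length covers the PkgLength
bytes themselves plus the NameString and the ObjectList):

  #include <stdint.h>
  #include <string.h>

  /* Encode a PkgLength for 'content_len' bytes of following content.
   * Returns the number of PkgLength bytes written (1..4). */
  static unsigned int build_pkg_length(uint8_t *out, unsigned int content_len)
  {
      unsigned int nr, total, i;

      /* Smallest encoding that also covers the PkgLength bytes themselves. */
      for ( nr = 1; nr < 4; nr++ )
      {
          unsigned int max = (nr == 1) ? 0x3F
                                       : (1u << (4 + 8 * (nr - 1))) - 1;
          if ( content_len + nr <= max )
              break;
      }
      total = content_len + nr;

      if ( nr == 1 )
          out[0] = total;                        /* bits 5:0 = length */
      else
      {
          out[0] = ((nr - 1) << 6) | (total & 0xF);
          for ( i = 1; i < nr; i++ )
              out[i] = (total >> (4 + 8 * (i - 1))) & 0xFF;
      }

      return nr;
  }

  /* Wrap a 4-character device name and the AML body passed from QEMU in
   * DeviceOp PkgLength NameString ObjectList.  Returns bytes written. */
  static unsigned int build_aml_device(uint8_t *buf, const char name[4],
                                       const uint8_t *body,
                                       unsigned int body_len)
  {
      uint8_t pkg[4];
      unsigned int pkg_len, pos = 0;

      pkg_len = build_pkg_length(pkg, 4 + body_len);

      buf[pos++] = 0x5B;                  /* ExtOpPrefix */
      buf[pos++] = 0x82;                  /* DeviceOp    */
      memcpy(buf + pos, pkg, pkg_len);    /* PkgLength   */
      pos += pkg_len;
      memcpy(buf + pos, name, 4);         /* NameString, e.g. "DEV0" */
      pos += 4;
      memcpy(buf + pos, body, body_len);  /* ObjectList from QEMU    */
      pos += body_len;

      return pos;
  }

The resulting byte sequence would then be placed in the new SSDT built by
hvmloader.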

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-19  2:14             ` Konrad Rzeszutek Wilk
@ 2016-03-01  7:39               ` Haozhong Zhang
  2016-03-01 18:33                 ` Ian Jackson
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-01  7:39 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jun Nakajima, Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Xiao Guangrong, Keir Fraser

On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote:
> > > > QEMU would always use MFN above guest normal ram and I/O holes for
> > > > vNVDIMM. It would attempt to search in that space for a contiguous range
> > > > that is large enough for that that vNVDIMM devices. Is guest able to
> > > > punch holes in such GFN space?
> > > 
> > > See XENMAPSPACE_* and their uses.
> > > 
> > 
> > I think we can add following restrictions to avoid uses of XENMAPSPACE_*
> > punching holes in GFNs of vNVDIMM:
> > 
> > (1) For XENMAPSPACE_shared_info and _grant_table, never map idx in them
> >     to GFNs occupied by vNVDIMM.
> 
> OK, that sounds correct.
> >     
> > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
> >    (a) never map idx in them to GFNs occupied by vNVDIMM, and
> >    (b) never map idx corresponding to GFNs occupied by vNVDIMM
> 
> Would that mean that guest xen-blkback or xen-netback wouldn't
> be able to fetch data from the GFNs? As in, what if the HVM guest
> that has the NVDIMM also serves as a device domain - that is it
> has xen-blkback running to service other guests?
> 

I'm not familiar with xen-blkback and xen-netback, so the following
statements may be wrong.

In my understanding, xen-blkback/-netback in a device domain maps the
pages from other domains into its own domain, and copies data between
those pages and vNVDIMM. The access to vNVDIMM is performed by the NVDIMM
driver in the device domain. In which step of this procedure does
xen-blkback/-netback need to map the GFNs of vNVDIMM?

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-29 12:22                       ` Haozhong Zhang
@ 2016-03-01 13:51                         ` Ian Jackson
  2016-03-01 15:04                           ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Ian Jackson @ 2016-03-01 13:51 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Jun Nakajima, Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, xen-devel,
	Jan Beulich, Xiao Guangrong, Keir Fraser

Haozhong Zhang writes ("Re: [RFC Design Doc] Add vNVDIMM support for Xen"):
> On 02/29/16 05:04, Jan Beulich wrote:
> > Which will involve adding how much new code to it?
> 
> Because hvmloader only accepts AML device rather than arbitrary objects,
> only code that builds the outmost part of AML device is needed. In ACPI
> spec, an AML device is defined as
>     DefDevice := DeviceOp PkgLength NameString ObjectList
> hvmloader only needs to build the first 3 terms, while the last one is
> passed from qemu.

Jan, is this a satisfactory answer ?

Ian.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-01 13:51                         ` Ian Jackson
@ 2016-03-01 15:04                           ` Jan Beulich
  0 siblings, 0 replies; 121+ messages in thread
From: Jan Beulich @ 2016-03-01 15:04 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Juergen Gross, Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 01.03.16 at 14:51, <Ian.Jackson@eu.citrix.com> wrote:
> Haozhong Zhang writes ("Re: [RFC Design Doc] Add vNVDIMM support for Xen"):
>> On 02/29/16 05:04, Jan Beulich wrote:
>> > Which will involve adding how much new code to it?
>> 
>> Because hvmloader only accepts AML device rather than arbitrary objects,
>> only code that builds the outmost part of AML device is needed. In ACPI
>> spec, an AML device is defined as
>>     DefDevice := DeviceOp PkgLength NameString ObjectList
>> hvmloader only needs to build the first 3 terms, while the last one is
>> passed from qemu.
> 
> Jan, is this a satisfactory answer ?

Well, sort of yes, but subject to me seeing the actual code this
converts to.

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-01  7:39               ` Haozhong Zhang
@ 2016-03-01 18:33                 ` Ian Jackson
  2016-03-01 18:49                   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 121+ messages in thread
From: Ian Jackson @ 2016-03-01 18:33 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Jun Nakajima, Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	George Dunlap, Andrew Cooper, Stefano Stabellini, xen-devel,
	Jan Beulich, Xiao Guangrong, Keir Fraser

Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote:
> > [someone:]
> > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
> > >    (a) never map idx in them to GFNs occupied by vNVDIMM, and
> > >    (b) never map idx corresponding to GFNs occupied by vNVDIMM
> > 
> > Would that mean that guest xen-blkback or xen-netback wouldn't
> > be able to fetch data from the GFNs? As in, what if the HVM guest
> > that has the NVDIMM also serves as a device domain - that is it
> > has xen-blkback running to service other guests?
> 
> I'm not familiar with xen-blkback and xen-netback, so following
> statements maybe wrong.
> 
> In my understanding, xen-blkback/-netback in a device domain maps the
> pages from other domains into its own domain, and copies data between
> those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM
> driver in device domain. In which steps of this procedure that
> xen-blkback/-netback needs to map into GFNs of vNVDIMM?

I think I agree with what you are saying.  I don't understand exactly
what you are proposing above in XENMAPSPACE_gmfn but I don't see how
anything about this would interfere with blkback.

blkback when talking to an nvdimm will just go through the block layer
front door, and do a copy, I presume.

I don't see how netback comes into it at all.

But maybe I am just confused or ignorant!  Please do explain :-).

Thanks,
Ian.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-01 18:33                 ` Ian Jackson
@ 2016-03-01 18:49                   ` Konrad Rzeszutek Wilk
  2016-03-02  7:14                     ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-03-01 18:49 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Jun Nakajima, Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross,
	xen-devel, Jan Beulich, Xiao Guangrong, Keir Fraser

On Tue, Mar 01, 2016 at 06:33:32PM +0000, Ian Jackson wrote:
> Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote:
> > > [someone:]
> > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
> > > >    (a) never map idx in them to GFNs occupied by vNVDIMM, and
> > > >    (b) never map idx corresponding to GFNs occupied by vNVDIMM
> > > 
> > > Would that mean that guest xen-blkback or xen-netback wouldn't
> > > be able to fetch data from the GFNs? As in, what if the HVM guest
> > > that has the NVDIMM also serves as a device domain - that is it
> > > has xen-blkback running to service other guests?
> > 
> > I'm not familiar with xen-blkback and xen-netback, so following
> > statements maybe wrong.
> > 
> > In my understanding, xen-blkback/-netback in a device domain maps the
> > pages from other domains into its own domain, and copies data between
> > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM
> > driver in device domain. In which steps of this procedure that
> > xen-blkback/-netback needs to map into GFNs of vNVDIMM?
> 
> I think I agree with what you are saying.  I don't understand exactly
> what you are proposing above in XENMAPSPACE_gmfn but I don't see how
> anything about this would interfere with blkback.
> 
> blkback when talking to an nvdimm will just go through the block layer
> front door, and do a copy, I presume.

I believe you are right. The block layer, and then the fs would copy in.
> 
> I don't see how netback comes into it at all.
> 
> But maybe I am just confused or ignorant!  Please do explain :-).

s/back/frontend/  

My fear was refcounting.

Specifically where we do not do copying. For example, you could
be sending data from the NVDIMM GFNs (scp?) to some other location
(another host?). It would go over the xen-netback (in the dom0)
- which would then grant map it (dom0 would).

In effect, in Xen there are two guests (dom0 and domU) pointing in the
P2M to the same GPFN. And that would mean:

> > > >    (b) never map idx corresponding to GFNs occupied by vNVDIMM

Granted the XENMAPSPACE_gmfn happens _before_ the grant mapping is done
so perhaps this is not an issue?

The other situation I was envisioning is where the driver domain has
the NVDIMM passed in, as well as an SR-IOV network card, and functions
as an iSCSI target. That should work OK, as we just need the IOMMU
to have the NVDIMM GPFNs programmed in.

> 
> Thanks,
> Ian.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-01 18:49                   ` Konrad Rzeszutek Wilk
@ 2016-03-02  7:14                     ` Haozhong Zhang
  2016-03-02 13:03                       ` Jan Beulich
  2016-03-07 20:53                       ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-02  7:14 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On 03/01/16 13:49, Konrad Rzeszutek Wilk wrote:
> On Tue, Mar 01, 2016 at 06:33:32PM +0000, Ian Jackson wrote:
> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> > > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote:
> > > > [someone:]
> > > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
> > > > >    (a) never map idx in them to GFNs occupied by vNVDIMM, and
> > > > >    (b) never map idx corresponding to GFNs occupied by vNVDIMM
> > > > 
> > > > Would that mean that guest xen-blkback or xen-netback wouldn't
> > > > be able to fetch data from the GFNs? As in, what if the HVM guest
> > > > that has the NVDIMM also serves as a device domain - that is it
> > > > has xen-blkback running to service other guests?
> > > 
> > > I'm not familiar with xen-blkback and xen-netback, so following
> > > statements maybe wrong.
> > > 
> > > In my understanding, xen-blkback/-netback in a device domain maps the
> > > pages from other domains into its own domain, and copies data between
> > > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM
> > > driver in device domain. In which steps of this procedure that
> > > xen-blkback/-netback needs to map into GFNs of vNVDIMM?
> > 
> > I think I agree with what you are saying.  I don't understand exactly
> > what you are proposing above in XENMAPSPACE_gmfn but I don't see how
> > anything about this would interfere with blkback.
> > 
> > blkback when talking to an nvdimm will just go through the block layer
> > front door, and do a copy, I presume.
> 
> I believe you are right. The block layer, and then the fs would copy in.
> > 
> > I don't see how netback comes into it at all.
> > 
> > But maybe I am just confused or ignorant!  Please do explain :-).
> 
> s/back/frontend/  
> 
> My fear was refcounting.
> 
> Specifically where we do not do copying. For example, you could
> be sending data from the NVDIMM GFNs (scp?) to some other location
> (another host?). It would go over the xen-netback (in the dom0)
> - which would then grant map it (dom0 would).
>

Thanks for the explanation!

It means the NVDIMM is very likely mapped at page granularity, and the
hypervisor needs per-page data structures like page_info (rather than the
range-set style nvdimm_pages) to manage those mappings.

Then we will face the problem that the potentially huge number of
per-page data structures may not fit in normal RAM. Linux kernel
developers came across the same problem, and their solution is to
reserve an area of the NVDIMM and put the page structures in the reserved
area (https://lwn.net/Articles/672457/). I think we may take a similar
solution:
(1) The Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
    (besides the one used by the Linux kernel itself) and reports the
    address and size to the Xen hypervisor.

    Reasons to choose the Linux kernel to make the reservation include:
    (a) only the Dom0 Linux kernel has the NVDIMM driver,
    (b) it keeps the Dom0 Linux kernel flexible to handle all
        reservations (for itself and for Xen).

(2) The Xen hypervisor then builds the page structures for NVDIMM pages
    and stores them in the above reserved areas.

(3) The reserved area is treated as volatile, i.e. the above two steps
    must be repeated on every host boot. (A rough size estimate is
    sketched below.)
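
Just to give a feel for the size of such a reservation (the numbers below
are assumptions for illustration only; the real sizeof(struct page_info)
depends on the build):

  #include <stdio.h>

  int main(void)
  {
      unsigned long long nvdimm_bytes = 512ULL << 30;  /* e.g. a 512 GB NVDIMM  */
      unsigned long long page_size    = 4096;          /* 4 KB pages            */
      unsigned long long info_size    = 32;            /* assumed per-page cost */

      unsigned long long pages    = nvdimm_bytes / page_size;
      unsigned long long reserved = pages * info_size;

      /* ~134 million pages -> ~4 GB of page structures (~0.8% of capacity),
       * which is why keeping them in normal RAM quickly becomes a problem. */
      printf("%llu pages need %llu MB of page structures\n",
             pages, reserved >> 20);
      return 0;
  }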

> In effect Xen there are two guests (dom0 and domU) pointing in the
> P2M to the same GPFN. And that would mean:
> 
> > > > >    (b) never map idx corresponding to GFNs occupied by vNVDIMM
> 
> Granted the XENMAPSPACE_gmfn happens _before_ the grant mapping is done
> so perhaps this is not an issue?
> 
> The other situation I was envisioning - where the driver domain has
> the NVDIMM passed in, and as well SR-IOV network card and functions
> as an iSCSI target. That should work OK as we just need the IOMMU
> to have the NVDIMM GPFNs programmed in.
>

For this IOMMU usage example and the above granted-pages example, there
remains one question: who is responsible for performing the NVDIMM flush
(clwb/clflushopt/pcommit)?

For the granted-page example, if an NVDIMM page is granted to
xen-netback, does the hypervisor need to tell xen-netback it's an NVDIMM
page, so that xen-netback can perform a proper flush when it writes to
that page? Or should we keep the NVDIMM transparent to xen-netback, and
let Xen perform the flush when xen-netback gives up the granted NVDIMM
page?

For the IOMMU example, my understanding is that there is a piece of
software in the driver domain that handles SCSI commands received from
the network card and drives the network card to read/write certain areas
of the NVDIMM. That software should then be aware of the existence of the
NVDIMM and perform the flush properly. Is that right?
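
For reference, by "flush" I mean a sequence like the sketch below. This is
only an illustration: it assumes the range is mapped write-back and that
the compiler provides the clwb/pcommit intrinsics of this time frame
(e.g. GCC with -mclwb -mpcommit); pcommit may of course change or go away
in future hardware:

  #include <immintrin.h>
  #include <stddef.h>
  #include <stdint.h>

  #define CACHELINE 64

  static void pmem_persist(const void *addr, size_t len)
  {
      uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
      uintptr_t end = (uintptr_t)addr + len;

      for ( ; p < end; p += CACHELINE )
          _mm_clwb((void *)p);   /* write dirty lines back to the NVDIMM */

      _mm_sfence();              /* order the cache-line flushes ...     */
      _mm_pcommit();             /* ... then commit to the NVDIMM        */
      _mm_sfence();              /* and order pcommit vs. later stores   */
  }

The open question above is which component (the granted-page user, Xen, or
the driver domain software) should issue this sequence.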

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-02  7:14                     ` Haozhong Zhang
@ 2016-03-02 13:03                       ` Jan Beulich
  2016-03-04  2:20                         ` Haozhong Zhang
  2016-03-07 20:53                       ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-03-02 13:03 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 02.03.16 at 08:14, <haozhong.zhang@intel.com> wrote:
> It means NVDIMM is very possibly mapped in page granularity, and
> hypervisor needs per-page data structures like page_info (rather than the
> range set style nvdimm_pages) to manage those mappings.
> 
> Then we will face the problem that the potentially huge number of
> per-page data structures may not fit in the normal ram. Linux kernel
> developers came across the same problem, and their solution is to
> reserve an area of NVDIMM and put the page structures in the reserved
> area (https://lwn.net/Articles/672457/). I think we may take the similar
> solution:
> (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
>     (besides the one used by Linux kernel itself) and reports the address
>     and size to Xen hypervisor.
> 
>     Reasons to choose Linux kernel to make the reservation include:
>     (a) only Dom0 Linux kernel has the NVDIMM driver,
>     (b) make it flexible for Dom0 Linux kernel to handle all
>         reservations (for itself and Xen).
> 
> (2) Then Xen hypervisor builds the page structures for NVDIMM pages and
>     stores them in above reserved areas.

Another argument against this being primarily Dom0-managed,
I would say. Furthermore - why would Dom0 waste space
creating per-page control structures for regions which are
meant to be handed to guests anyway?

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-02 13:03                       ` Jan Beulich
@ 2016-03-04  2:20                         ` Haozhong Zhang
  2016-03-08  9:15                           ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-04  2:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 03/02/16 06:03, Jan Beulich wrote:
> >>> On 02.03.16 at 08:14, <haozhong.zhang@intel.com> wrote:
> > It means NVDIMM is very possibly mapped in page granularity, and
> > hypervisor needs per-page data structures like page_info (rather than the
> > range set style nvdimm_pages) to manage those mappings.
> > 
> > Then we will face the problem that the potentially huge number of
> > per-page data structures may not fit in the normal ram. Linux kernel
> > developers came across the same problem, and their solution is to
> > reserve an area of NVDIMM and put the page structures in the reserved
> > area (https://lwn.net/Articles/672457/). I think we may take the similar
> > solution:
> > (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
> >     (besides the one used by Linux kernel itself) and reports the address
> >     and size to Xen hypervisor.
> > 
> >     Reasons to choose Linux kernel to make the reservation include:
> >     (a) only Dom0 Linux kernel has the NVDIMM driver,
> >     (b) make it flexible for Dom0 Linux kernel to handle all
> >         reservations (for itself and Xen).
> > 
> > (2) Then Xen hypervisor builds the page structures for NVDIMM pages and
> >     stores them in above reserved areas.
> 
> Another argument against this being primarily Dom0-managed,
> I would say.

Yes, Xen should, at least, manage all address mappings for NVDIMM. Dom0
Linux and QEMU then provide a user-friendly interface to configure
NVDIMM and vNVDIMM, such as providing files (instead of addresses) as the
abstraction of SPA ranges of NVDIMM.

> Furthermore - why would Dom0 waste space
> creating per-page control structures for regions which are
> meant to be handed to guests anyway?
> 

I found my description was not accurate after consulting with our driver
developers. By default the Linux kernel does not create page structures
for NVDIMM, which the kernel calls "raw mode". We could force the Dom0
kernel to keep the NVDIMM in "raw mode" so as to avoid the waste.

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-02-16 12:55                   ` Jan Beulich
  2016-02-17  9:03                     ` Haozhong Zhang
@ 2016-03-04  7:30                     ` Haozhong Zhang
  2016-03-16 12:55                       ` Haozhong Zhang
  1 sibling, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-04  7:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On 02/16/16 05:55, Jan Beulich wrote:
> >>> On 16.02.16 at 12:14, <stefano.stabellini@eu.citrix.com> wrote:
> > On Mon, 15 Feb 2016, Zhang, Haozhong wrote:
> >> On 02/04/16 20:24, Stefano Stabellini wrote:
> >> > On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> >> > > On 02/03/16 15:22, Stefano Stabellini wrote:
> >> > > > On Wed, 3 Feb 2016, George Dunlap wrote:
> >> > > > > On 03/02/16 12:02, Stefano Stabellini wrote:
> >> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> >> > > > > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> >> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> >> > > > > >> those files to QEMU. In this way, non-root QEMU should be able to
> >> > > > > >> mmap those files.
> >> > > > > >
> >> > > > > > Maybe that would work. Worth adding it to the design, I would like to
> >> > > > > > read more details on it.
> >> > > > > >
> >> > > > > > Also note that QEMU initially runs as root but drops privileges to
> >> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> >> > > > > > *could* mmap /dev/pmem0 while is still running as root, but then it
> >> > > > > > wouldn't work for any devices that need to be mmap'ed at run time
> >> > > > > > (hotplug scenario).
> >> > > > >
> >> > > > > This is basically the same problem we have for a bunch of other things,
> >> > > > > right?  Having xl open a file and then pass it via qmp to qemu should
> >> > > > > work in theory, right?
> >> > > >
> >> > > > Is there one /dev/pmem? per assignable region?
> >> > > 
> >> > > Yes.
> >> > > 
> >> > > BTW, I'm wondering whether and how non-root qemu works with xl disk
> >> > > configuration that is going to access a host block device, e.g.
> >> > >      disk = [ '/dev/sdb,,hda' ]
> >> > > If that works with non-root qemu, I may take the similar solution for
> >> > > pmem.
> >> >  
> >> > Today the user is required to give the correct ownership and access mode
> >> > to the block device, so that non-root QEMU can open it. However in the
> >> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
> >> > the feature doesn't work at all with non-root QEMU
> >> > (http://marc.info/?l=xen-devel&m=145261763600528).
> >> > 
> >> > If there is one /dev/pmem device per assignable region, then it would be
> >> > conceivable to change its ownership so that non-root QEMU can open it.
> >> > Or, better, the file descriptor could be passed by the toolstack via
> >> > qmp.
> >> 
> >> Passing file descriptor via qmp is not enough.
> >> 
> >> Let me clarify where the requirement for root/privileged permissions
> >> comes from. The primary workflow in my design that maps a host pmem
> >> region or files in host pmem region to guest is shown as below:
> >>  (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on
> >>      /dev/pmem0) to its virtual address space, i.e. the guest virtual
> >>      address space.
> >>  (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA
> >>      occupied by the host pmem to a DomU. This step requires the
> >>      translation from the guest virtual address (where the host pmem is
> >>      mmaped in (1)) to the host physical address. The translation can be
> >>      done by either
> >>     (a) QEMU that parses its own /proc/self/pagemap,
> >>      or
> >>     (b) Xen hypervisor that does the translation by itself [1] (though
> >>         this choice is not quite doable from Konrad's comments [2]).
> >> 
> >> [1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html 
> >> [2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html 
> >> 
> >> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability
> >> since linux kernel 4.0. Furthermore, if we don't mlock the mapped host
> >> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap),
> >> pagemap will not contain all mappings. However, mlock may require
> >> privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
> >> mlock operates on memory, the permission to open(2) the host pmem files
> >> does not solve the problem and therefore passing file descriptor via qmp
> >> does not help.
> >> 
> >> For 2-b, from Konrad's comments [2], mlock is also required and
> >> privileged permission may be required consequently.
> >> 
> >> Note that the mapping and the address translation are done before QEMU
> >> dropping privileged permissions, so non-root QEMU should be able to work
> >> with above design until we start considering vNVDIMM hotplug (which has
> >> not been supported by the current vNVDIMM implementation in QEMU). In
> >> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
> >> running with root permissions.
> > 
> > Are we all good with the fact that vNVDIMM hotplug won't work (unless
> > the user explicitly asks for it at domain creation time, which is
> > very unlikely otherwise she could use coldplug)?
> 
> No, at least there needs to be a road towards hotplug, even if
> initially this may not be supported/implemented.
> 

I suddenly realized it's unnecessary to let QEMU get the SPA ranges of
the NVDIMM or of files on the NVDIMM. We can move that work to the
toolstack and pass the SPA ranges obtained by the toolstack to QEMU. In
this way, no privileged operations (mmap/mlock/...) are needed in QEMU,
and non-root QEMU should be able to work even with vNVDIMM hotplug in
the future.

Haozhong




* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-02  7:14                     ` Haozhong Zhang
  2016-03-02 13:03                       ` Jan Beulich
@ 2016-03-07 20:53                       ` Konrad Rzeszutek Wilk
  2016-03-08  5:50                         ` Haozhong Zhang
  1 sibling, 1 reply; 121+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-03-07 20:53 UTC (permalink / raw)
  To: Ian Jackson, Jun Nakajima, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross,
	xen-devel, Jan Beulich, Xiao Guangrong, Keir Fraser

On Wed, Mar 02, 2016 at 03:14:52PM +0800, Haozhong Zhang wrote:
> On 03/01/16 13:49, Konrad Rzeszutek Wilk wrote:
> > On Tue, Mar 01, 2016 at 06:33:32PM +0000, Ian Jackson wrote:
> > > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> > > > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote:
> > > > > [someone:]
> > > > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
> > > > > >    (a) never map idx in them to GFNs occupied by vNVDIMM, and
> > > > > >    (b) never map idx corresponding to GFNs occupied by vNVDIMM
> > > > > 
> > > > > Would that mean that guest xen-blkback or xen-netback wouldn't
> > > > > be able to fetch data from the GFNs? As in, what if the HVM guest
> > > > > that has the NVDIMM also serves as a device domain - that is it
> > > > > has xen-blkback running to service other guests?
> > > > 
> > > > I'm not familiar with xen-blkback and xen-netback, so following
> > > > statements maybe wrong.
> > > > 
> > > > In my understanding, xen-blkback/-netback in a device domain maps the
> > > > pages from other domains into its own domain, and copies data between
> > > > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM
> > > > driver in device domain. In which steps of this procedure that
> > > > xen-blkback/-netback needs to map into GFNs of vNVDIMM?
> > > 
> > > I think I agree with what you are saying.  I don't understand exactly
> > > what you are proposing above in XENMAPSPACE_gmfn but I don't see how
> > > anything about this would interfere with blkback.
> > > 
> > > blkback when talking to an nvdimm will just go through the block layer
> > > front door, and do a copy, I presume.
> > 
> > I believe you are right. The block layer, and then the fs would copy in.
> > > 
> > > I don't see how netback comes into it at all.
> > > 
> > > But maybe I am just confused or ignorant!  Please do explain :-).
> > 
> > s/back/frontend/  
> > 
> > My fear was refcounting.
> > 
> > Specifically where we do not do copying. For example, you could
> > be sending data from the NVDIMM GFNs (scp?) to some other location
> > (another host?). It would go over the xen-netback (in the dom0)
> > - which would then grant map it (dom0 would).
> >
> 
> Thanks for the explanation!
> 
> It means NVDIMM is very possibly mapped in page granularity, and
> hypervisor needs per-page data structures like page_info (rather than the
> range set style nvdimm_pages) to manage those mappings.

I do not know. I figured you need some accounting in the hypervisor
as the pages can be grant mapped but I don't know the intricate details
of the P2M code to tell you for certain.

[edit: Your later email seems to imply that you do not need all this
information? Just ranges?]
> 
> Then we will face the problem that the potentially huge number of
> per-page data structures may not fit in the normal ram. Linux kernel
> developers came across the same problem, and their solution is to
> reserve an area of NVDIMM and put the page structures in the reserved
> area (https://lwn.net/Articles/672457/). I think we may take the similar
> solution:
> (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
>     (besides the one used by Linux kernel itself) and reports the address
>     and size to Xen hypervisor.
> 
>     Reasons to choose Linux kernel to make the reservation include:
>     (a) only Dom0 Linux kernel has the NVDIMM driver,
>     (b) make it flexible for Dom0 Linux kernel to handle all
>         reservations (for itself and Xen).
> 
> (2) Then Xen hypervisor builds the page structures for NVDIMM pages and
>     stores them in above reserved areas.
> 
> (3) The reserved area is used as volatile, i.e. above two steps must be
>     done for every host boot.
> 
> > In effect Xen there are two guests (dom0 and domU) pointing in the
> > P2M to the same GPFN. And that would mean:
> > 
> > > > > >    (b) never map idx corresponding to GFNs occupied by vNVDIMM
> > 
> > Granted the XENMAPSPACE_gmfn happens _before_ the grant mapping is done
> > so perhaps this is not an issue?
> > 
> > The other situation I was envisioning - where the driver domain has
> > the NVDIMM passed in, and as well SR-IOV network card and functions
> > as an iSCSI target. That should work OK as we just need the IOMMU
> > to have the NVDIMM GPFNs programmed in.
> >
> 
> For this IOMMU usage example and above granted pages example, there
> remains one question: who is responsible to perform NVDIMM flush
> (clwb/clflushopt/pcommit)?


> 
> For the granted page example, if a NVDIMM page is granted to
> xen-netback, does the hypervisor need to tell xen-netback it's a NVDIMM
> page so that xen-netback can perform proper flush when it writes to that
> page? Or we may keep the NVDIMM transparent to xen-netback, and let Xen
> perform the flush when xen-netback gives up the granted NVDIMM page?
> 
> For the IOMMU example, my understanding is that there is a piece of
> software in the driver domain that handles SCSI commands received from
> network card and drives the network card to read/write certain areas of
> NVDIMM. Then that software should be aware of the existence of NVDIMM
> and perform the flush properly. Is that right?

I would imagine it is the same as any write on NVDIMM. The "owner"
of the NVDIMM would perform the pcommit. ?
> 
> Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-07 20:53                       ` Konrad Rzeszutek Wilk
@ 2016-03-08  5:50                         ` Haozhong Zhang
  0 siblings, 0 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-08  5:50 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On 03/07/16 15:53, Konrad Rzeszutek Wilk wrote:
> On Wed, Mar 02, 2016 at 03:14:52PM +0800, Haozhong Zhang wrote:
> > On 03/01/16 13:49, Konrad Rzeszutek Wilk wrote:
> > > On Tue, Mar 01, 2016 at 06:33:32PM +0000, Ian Jackson wrote:
> > > > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> > > > > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote:
> > > > > > [someone:]
> > > > > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
> > > > > > >    (a) never map idx in them to GFNs occupied by vNVDIMM, and
> > > > > > >    (b) never map idx corresponding to GFNs occupied by vNVDIMM
> > > > > > 
> > > > > > Would that mean that guest xen-blkback or xen-netback wouldn't
> > > > > > be able to fetch data from the GFNs? As in, what if the HVM guest
> > > > > > that has the NVDIMM also serves as a device domain - that is it
> > > > > > has xen-blkback running to service other guests?
> > > > > 
> > > > > I'm not familiar with xen-blkback and xen-netback, so following
> > > > > statements maybe wrong.
> > > > > 
> > > > > In my understanding, xen-blkback/-netback in a device domain maps the
> > > > > pages from other domains into its own domain, and copies data between
> > > > > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM
> > > > > driver in device domain. In which steps of this procedure that
> > > > > xen-blkback/-netback needs to map into GFNs of vNVDIMM?
> > > > 
> > > > I think I agree with what you are saying.  I don't understand exactly
> > > > what you are proposing above in XENMAPSPACE_gmfn but I don't see how
> > > > anything about this would interfere with blkback.
> > > > 
> > > > blkback when talking to an nvdimm will just go through the block layer
> > > > front door, and do a copy, I presume.
> > > 
> > > I believe you are right. The block layer, and then the fs would copy in.
> > > > 
> > > > I don't see how netback comes into it at all.
> > > > 
> > > > But maybe I am just confused or ignorant!  Please do explain :-).
> > > 
> > > s/back/frontend/  
> > > 
> > > My fear was refcounting.
> > > 
> > > Specifically where we do not do copying. For example, you could
> > > be sending data from the NVDIMM GFNs (scp?) to some other location
> > > (another host?). It would go over the xen-netback (in the dom0)
> > > - which would then grant map it (dom0 would).
> > >
> > 
> > Thanks for the explanation!
> > 
> > It means NVDIMM is very possibly mapped in page granularity, and
> > hypervisor needs per-page data structures like page_info (rather than the
> > range set style nvdimm_pages) to manage those mappings.
> 
> I do not know. I figured you need some accounting in the hypervisor
> as the pages can be grant mapped but I don't know the intricate details
> of the P2M code to tell you for certain.
> 
> [edit: Your later email seems to imply that you do not need all this
> information? Just ranges?]

Not quite sure which one you mean. But at least in this example,
NVDIMM can be granted at page granularity, so I think Xen still needs
a per-page data structure to track this mapping information, and a range
structure is not enough.

> > 
> > Then we will face the problem that the potentially huge number of
> > per-page data structures may not fit in the normal ram. Linux kernel
> > developers came across the same problem, and their solution is to
> > reserve an area of NVDIMM and put the page structures in the reserved
> > area (https://lwn.net/Articles/672457/). I think we may take the similar
> > solution:
> > (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
> >     (besides the one used by Linux kernel itself) and reports the address
> >     and size to Xen hypervisor.
> > 
> >     Reasons to choose Linux kernel to make the reservation include:
> >     (a) only Dom0 Linux kernel has the NVDIMM driver,
> >     (b) make it flexible for Dom0 Linux kernel to handle all
> >         reservations (for itself and Xen).
> > 
> > (2) Then Xen hypervisor builds the page structures for NVDIMM pages and
> >     stores them in above reserved areas.
> > 
> > (3) The reserved area is used as volatile, i.e. above two steps must be
> >     done for every host boot.
> > 
> > > In effect Xen there are two guests (dom0 and domU) pointing in the
> > > P2M to the same GPFN. And that would mean:
> > > 
> > > > > > >    (b) never map idx corresponding to GFNs occupied by vNVDIMM
> > > 
> > > Granted the XENMAPSPACE_gmfn happens _before_ the grant mapping is done
> > > so perhaps this is not an issue?
> > > 
> > > The other situation I was envisioning - where the driver domain has
> > > the NVDIMM passed in, and as well SR-IOV network card and functions
> > > as an iSCSI target. That should work OK as we just need the IOMMU
> > > to have the NVDIMM GPFNs programmed in.
> > >
> > 
> > For this IOMMU usage example and above granted pages example, there
> > remains one question: who is responsible to perform NVDIMM flush
> > (clwb/clflushopt/pcommit)?
> 
> 
> > 
> > For the granted page example, if a NVDIMM page is granted to
> > xen-netback, does the hypervisor need to tell xen-netback it's a NVDIMM
> > page so that xen-netback can perform proper flush when it writes to that
> > page? Or we may keep the NVDIMM transparent to xen-netback, and let Xen
> > perform the flush when xen-netback gives up the granted NVDIMM page?
> > 
> > For the IOMMU example, my understanding is that there is a piece of
> > software in the driver domain that handles SCSI commands received from
> > network card and drives the network card to read/write certain areas of
> > NVDIMM. Then that software should be aware of the existence of NVDIMM
> > and perform the flush properly. Is that right?
> 
> I would imagine it is the same as any write on NVDIMM. The "owner"
> of the NVDIMM would perform the pcommit. ?

Agreed, software accessing NVDIMM is responsible for performing the proper flushes.

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-04  2:20                         ` Haozhong Zhang
@ 2016-03-08  9:15                           ` Haozhong Zhang
  2016-03-08  9:27                             ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-08  9:15 UTC (permalink / raw)
  To: Jan Beulich, Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 03/04/16 10:20, Haozhong Zhang wrote:
> On 03/02/16 06:03, Jan Beulich wrote:
> > >>> On 02.03.16 at 08:14, <haozhong.zhang@intel.com> wrote:
> > > It means NVDIMM is very possibly mapped in page granularity, and
> > > hypervisor needs per-page data structures like page_info (rather than the
> > > range set style nvdimm_pages) to manage those mappings.
> > > 
> > > Then we will face the problem that the potentially huge number of
> > > per-page data structures may not fit in the normal ram. Linux kernel
> > > developers came across the same problem, and their solution is to
> > > reserve an area of NVDIMM and put the page structures in the reserved
> > > area (https://lwn.net/Articles/672457/). I think we may take the similar
> > > solution:
> > > (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
> > >     (besides the one used by Linux kernel itself) and reports the address
> > >     and size to Xen hypervisor.
> > > 
> > >     Reasons to choose Linux kernel to make the reservation include:
> > >     (a) only Dom0 Linux kernel has the NVDIMM driver,
> > >     (b) make it flexible for Dom0 Linux kernel to handle all
> > >         reservations (for itself and Xen).
> > > 
> > > (2) Then Xen hypervisor builds the page structures for NVDIMM pages and
> > >     stores them in above reserved areas.
> > 
[...]
> > Furthermore - why would Dom0 waste space
> > creating per-page control structures for regions which are
> > meant to be handed to guests anyway?
> > 
> 
> I found my description was not accurate after consulting with our driver
> developers. By default the linux kernel does not create page structures
> for NVDIMM which is called by kernel the "raw mode". We could enforce
> the Dom0 kernel to pin NVDIMM in "raw mode" so as to avoid waste.
> 

More thoughts on reserving NVDIMM space for per-page structures

Currently, a per-page struct for managing mappings of NVDIMM pages may
include the following fields:

struct nvdimm_page
{
    uint64_t mfn;        /* MFN of SPA of this NVDIMM page */
    uint64_t gfn;        /* GFN where this NVDIMM page is mapped */
    domid_t  domain_id;  /* which domain is this NVDIMM page mapped to */
    int      is_broken;  /* Is this NVDIMM page broken? (for MCE) */
}

Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM, that
is 512 M pages and hence 12 GB of nvdimm_page structures, which can
hardly fit in the normal ram on a small-memory host. However, for smaller
NVDIMMs and/or hosts with large ram, those structures may still fit in
the normal ram. In the latter case, nvdimm_page structures could be
stored in the normal ram, so they can be accessed more quickly.

So we may add a boot parameter for Xen to allow users to configure which
place, the normal ram or the nvdimm, is used to store those structures.
When the normal ram is used, Xen could manage nvdimm_page structures
more quickly (and hence start a domain with NVDIMM more quickly), but it
leaves less normal ram for VMs. When the nvdimm is used, Xen would take
more time to manage nvdimm_page structures, but it leaves more normal
ram for VMs.
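
Below is a minimal sketch of how such a boot parameter could steer the
placement. It is only an illustration: the parameter name, the
xen_map_nvdimm_reserved() helper and the way the reserved area is handed
in are assumptions of this sketch, not an existing interface.

    /* Hypothetical boot parameter: keep nvdimm_page arrays in normal ram. */
    static bool_t __initdata nvdimm_struct_in_ram;
    boolean_param("nvdimm_struct_in_ram", nvdimm_struct_in_ram);

    /*
     * Allocate the nvdimm_page array covering one NVDIMM of 'size' bytes;
     * rsv_base/rsv_size describe the area Dom0 reserved on that NVDIMM.
     */
    static struct nvdimm_page *__init alloc_nvdimm_pages(uint64_t size,
                                                         uint64_t rsv_base,
                                                         uint64_t rsv_size)
    {
        unsigned long nr = size >> PAGE_SHIFT;

        if ( nvdimm_struct_in_ram )
            /* Faster accesses, but eats into the ram left for VMs. */
            return xmalloc_array(struct nvdimm_page, nr);

        if ( rsv_size < nr * sizeof(struct nvdimm_page) )
            return NULL;

        /* Map the reserved NVDIMM area into Xen and place the array there;
         * xen_map_nvdimm_reserved() stands for whatever mechanism does that. */
        return xen_map_nvdimm_reserved(rsv_base, rsv_size);
    }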

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-08  9:15                           ` Haozhong Zhang
@ 2016-03-08  9:27                             ` Jan Beulich
  2016-03-09 12:22                               ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-03-08  9:27 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 08.03.16 at 10:15, <haozhong.zhang@intel.com> wrote:
> More thoughts on reserving NVDIMM space for per-page structures
> 
> Currently, a per-page struct for managing mapping of NVDIMM pages may
> include following fields:
> 
> struct nvdimm_page
> {
>     uint64_t mfn;        /* MFN of SPA of this NVDIMM page */
>     uint64_t gfn;        /* GFN where this NVDIMM page is mapped */
>     domid_t  domain_id;  /* which domain is this NVDIMM page mapped to */
>     int      is_broken;  /* Is this NVDIMM page broken? (for MCE) */
> }
> 
> Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
> nvdimm_page structures would occupy 12 GB space, which is too hard to
> fit in the normal ram on a small memory host. However, for smaller
> NVDIMMs and/or hosts with large ram, those structures may still be able
> to fit in the normal ram. In the latter circumstance, nvdimm_page
> structures are stored in the normal ram, so they can be accessed more
> quickly.

Not sure how you came to the above structure - it's the first time
I see it, yet figuring out what information it needs to hold is what
this design process should be about. For example, I don't see why
it would need to duplicate M2P / P2M information. Nor do I see why
per-page data needs to hold the address of a page (struct
page_info also doesn't). And whether storing a domain ID (rather
than a pointer to struct domain, as in struct page_info) is the
correct thing is also to be determined (rather than just stated).

Otoh you make no provisions at all for any kind of ref counting.
What if a guest wants to put page tables into NVDIMM space?

Since all of your calculations are based upon that fixed assumption
on the structure layout, I'm afraid they're not very meaningful
without first settling on what data needs tracking in the first place.

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-08  9:27                             ` Jan Beulich
@ 2016-03-09 12:22                               ` Haozhong Zhang
  2016-03-09 16:17                                 ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-09 12:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 03/08/16 02:27, Jan Beulich wrote:
> >>> On 08.03.16 at 10:15, <haozhong.zhang@intel.com> wrote:
> > More thoughts on reserving NVDIMM space for per-page structures
> > 
> > Currently, a per-page struct for managing mapping of NVDIMM pages may
> > include following fields:
> > 
> > struct nvdimm_page
> > {
> >     uint64_t mfn;        /* MFN of SPA of this NVDIMM page */
> >     uint64_t gfn;        /* GFN where this NVDIMM page is mapped */
> >     domid_t  domain_id;  /* which domain is this NVDIMM page mapped to */
> >     int      is_broken;  /* Is this NVDIMM page broken? (for MCE) */
> > }
> > 
> > Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
> > nvdimm_page structures would occupy 12 GB space, which is too hard to
> > fit in the normal ram on a small memory host. However, for smaller
> > NVDIMMs and/or hosts with large ram, those structures may still be able
> > to fit in the normal ram. In the latter circumstance, nvdimm_page
> > structures are stored in the normal ram, so they can be accessed more
> > quickly.
> 
> Not sure how you came to the above structure - it's the first time
> I see it, yet figuring out what information it needs to hold is what
> this design process should be about. For example, I don't see why
> it would need to duplicate M2P / P2M information. Nor do I see why
> per-page data needs to hold the address of a page (struct
> page_info also doesn't). And whether storing a domain ID (rather
> than a pointer to struct domain, as in struct page_info) is the
> correct think is also to be determined (rather than just stated).
> 
> Otoh you make no provisions at all for any kind of ref counting.
> What if a guest wants to put page tables into NVDIMM space?
> 
> Since all of your calculations are based upon that fixed assumption
> on the structure layout, I'm afraid they're not very meaningful
> without first settling on what data needs tracking in the first place.
> 
> Jan
> 

Let me re-explain the choice of data structures and where to put them.

For handling MCE for NVDIMM, we need to track the following data:
(1) SPA ranges of host NVDIMMs (one range per pmem interleave set), which are
    used to check whether an MCE is for NVDIMM.
(2) the GFN to which an NVDIMM page is mapped, which is used to determine the
    address put in the vMCE.
(3) the domain to which an NVDIMM page is mapped, which is used to
    determine whether a vMCE needs to be injected and where it will be
    injected.
(4) a flag to mark whether an NVDIMM page is broken, which is used to
    avoid mapping broken pages to guests.

For granting NVDIMM pages (e.g. to xen-blkback/netback),
(5) a reference counter is needed for each NVDIMM page.

The above data can be organized as below (a sketch of how these pieces
fit together for MCE handling follows at the end of this list):

* For (1) SPA ranges, we can record them in a global data structure,
  e.g. a list

    struct list_head nvdimm_iset_list;

    struct nvdimm_iset
    {
         uint64_t           base;  /* starting SPA of this interleave set */
         uint64_t           size;  /* size of this interleave set */
         struct nvdimm_page *pages;/* information for individual pages in this interleave set */
         struct list_head   list;
    };

* For (2) GFN, an intuitive place to get this information is from the M2P
  table machine_to_phys_mapping[].  However, the address of NVDIMM is
  not required to be contiguous with normal ram, so, if NVDIMM starts
  from an address that is much higher than the end address of normal
  ram, it may result in an M2P table that may be too large to fit in the
  normal ram. Therefore, we choose not to put GFNs of NVDIMM in the M2P
  table.

  Another possible solution is to extend page_info to include the GFN for
  NVDIMM and use frame_table. A benefit of this solution is that the other
  data (3)-(5) can be obtained from page_info as well. However, for the
  same reason as with machine_to_phys_mapping[], and because of the concern
  that the large number of page_info structures required for large NVDIMMs
  may consume lots of ram, page_info and frame_table do not seem to be a
  good place either.

* In the end, we choose to introduce a new data structure for the above
  per-page data (2)-(5):

    struct nvdimm_page
    {
        struct domain *domain;    /* for (3) */
        uint64_t      gfn;        /* for (2) */
        unsigned long count_info; /* for (4) and (5), same as page_info->count_info */
        /* other fields if needed, e.g. lock */
    }

  (the MFN is indeed not needed)

  On each NVDIMM interleave set, we could reserve an area to place an
  array of nvdimm_page structures for pages in that interleave set. In
  addition, the corresponding global nvdimm_iset structure is set to
  point to this array via its 'pages' field.

* One disadvantage of the above solution is that accessing NVDIMM is slower
  than normal ram, so some usage scenarios that require frequent
  accesses to nvdimm_page structures may suffer poor
  performance. Therefore, we may add a boot parameter to allow users to
  choose normal ram for the above nvdimm_page arrays if their hosts have
  plenty of ram.

  One thing I have no idea about is what percentage of ram used/reserved
  by Xen itself is considered acceptable. If such a limit exists and a
  boot parameter is given, we could let Xen choose the faster ram when
  the percentage has not been reached.
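
As mentioned above, here is a sketch of how the interleave set list and
the per-page data (2)-(4) might be used together when handling an MCE.
It is only an illustration: locking, the exact MCE hook and the
NVDIMM_PAGE_BROKEN flag are assumptions of this sketch.

    /*
     * Sketch: classify an MCE at machine address 'maddr' and, if it hits
     * an NVDIMM page mapped to a guest, report where to inject the vMCE.
     */
    static int nvdimm_mce_target(paddr_t maddr,
                                 struct domain **d, uint64_t *gfn)
    {
        struct nvdimm_iset *iset;

        list_for_each_entry ( iset, &nvdimm_iset_list, list )
        {
            struct nvdimm_page *npg;

            if ( maddr < iset->base || maddr >= iset->base + iset->size )
                continue;                          /* not this interleave set */

            npg = &iset->pages[(maddr - iset->base) >> PAGE_SHIFT];
            npg->count_info |= NVDIMM_PAGE_BROKEN; /* (4): never map it again */

            if ( npg->domain == NULL )
                return 0;                          /* unmapped: no vMCE needed */

            *d = npg->domain;                      /* (3): whom to inject into */
            *gfn = npg->gfn;                       /* (2): address in the vMCE */
            return 1;
        }

        return -1;              /* not an NVDIMM address: normal ram MCE path */
    }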


Any comments?

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-09 12:22                               ` Haozhong Zhang
@ 2016-03-09 16:17                                 ` Jan Beulich
  2016-03-10  3:27                                   ` Haozhong Zhang
  2016-03-17 11:05                                   ` Ian Jackson
  0 siblings, 2 replies; 121+ messages in thread
From: Jan Beulich @ 2016-03-09 16:17 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 09.03.16 at 13:22, <haozhong.zhang@intel.com> wrote:
> On 03/08/16 02:27, Jan Beulich wrote:
>> >>> On 08.03.16 at 10:15, <haozhong.zhang@intel.com> wrote:
>> > More thoughts on reserving NVDIMM space for per-page structures
>> > 
>> > Currently, a per-page struct for managing mapping of NVDIMM pages may
>> > include following fields:
>> > 
>> > struct nvdimm_page
>> > {
>> >     uint64_t mfn;        /* MFN of SPA of this NVDIMM page */
>> >     uint64_t gfn;        /* GFN where this NVDIMM page is mapped */
>> >     domid_t  domain_id;  /* which domain is this NVDIMM page mapped to */
>> >     int      is_broken;  /* Is this NVDIMM page broken? (for MCE) */
>> > }
>> > 
>> > Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
>> > nvdimm_page structures would occupy 12 GB space, which is too hard to
>> > fit in the normal ram on a small memory host. However, for smaller
>> > NVDIMMs and/or hosts with large ram, those structures may still be able
>> > to fit in the normal ram. In the latter circumstance, nvdimm_page
>> > structures are stored in the normal ram, so they can be accessed more
>> > quickly.
>> 
>> Not sure how you came to the above structure - it's the first time
>> I see it, yet figuring out what information it needs to hold is what
>> this design process should be about. For example, I don't see why
>> it would need to duplicate M2P / P2M information. Nor do I see why
>> per-page data needs to hold the address of a page (struct
>> page_info also doesn't). And whether storing a domain ID (rather
>> than a pointer to struct domain, as in struct page_info) is the
>> correct think is also to be determined (rather than just stated).
>> 
>> Otoh you make no provisions at all for any kind of ref counting.
>> What if a guest wants to put page tables into NVDIMM space?
>> 
>> Since all of your calculations are based upon that fixed assumption
>> on the structure layout, I'm afraid they're not very meaningful
>> without first settling on what data needs tracking in the first place.
> 
> I should reexplain the choice of data structures and where to put them.
> 
> For handling MCE for NVDIMM, we need to track following data:
> (1) SPA ranges of host NVDIMMs (one range per pmem interleave set), which are
>     used to check whether a MCE is for NVDIMM.
> (2) GFN to which a NVDIMM page is mapped, which is used to determine the
>     address put in vMCE.
> (3) the domain to which a NVDIMM page is mapped, which is used to
>     determine whether a vMCE needs to be injected and where it will be
>     injected.
> (4) a flag to mark whether a NVDIMM page is broken, which is used to
>     avoid mapping broken page to guests.
> 
> For granting NVDIMM pages (e.g. xen-blkback/netback),
> (5) a reference counter is needed for each NVDIMM page
> 
> Above data can be organized as below:
> 
> * For (1) SPA ranges, we can record them in a global data structure,
>   e.g. a list
> 
>     struct list_head nvdimm_iset_list;
> 
>     struct nvdimm_iset
>     {
>          uint64_t           base;  /* starting SPA of this interleave set */
>          uint64_t           size;  /* size of this interleave set */
>          struct nvdimm_page *pages;/* information for individual pages in this interleave set */
>          struct list_head   list;
>     };
> 
> * For (2) GFN, an intuitive place to get this information is from M2P
>   table machine_to_phys_mapping[].  However, the address of NVDIMM is
>   not required to be contiguous with normal ram, so, if NVDIMM starts
>   from an address that is much higher than the end address of normal
>   ram, it may result in a M2P table that maybe too large to fit in the
>   normal ram. Therefore, we choose to not put GFNs of NVDIMM in M2P
>   table.

Any page that _may_ be used by a guest as a normal RAM page
must have its mach->phys translation entered in the M2P. That's
because an r/o variant of that table is part of the hypervisor ABI
for PV guests. Size considerations simply don't apply here - the
table may be sparse (guests are required to deal with accesses
potentially faulting), and the 256 GB of virtual address space set
aside for it covers all memory up to the 47-bit boundary (there's
room for doubling this). Memory at addresses with bit 47 (or
higher) set would need a complete overhaul of that mechanism,
and whatever new mechanism we may pick would mean old
guests won't be able to benefit.
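
(As a quick sanity check of that figure, assuming 4k pages and 8-byte
M2P entries: 2^47 bytes / 2^12 bytes per page = 2^35 entries, times 8
bytes per entry = 2^38 bytes = 256 GB of virtual address space.)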

>   Another possible solution is to extend page_info to include GFN for
>   NVDIMM and use frame_table. A benefit of this solution is that other
>   data (3)-(5) can be got from page_info as well. However, due to the
>   same reason for machine_to_phys_mapping[] and the concern that the
>   large number of page_info structures required for large NVDIMMs may
>   consume lots of ram, page_info and frame_table seems not a good place
>   either.

For this particular item struct page_info is the wrong place
anyway, due to what I've said above. Also, suggestions to extend
struct page_info are quite problematic, as any such extension
implies a measurable increase in the memory overhead the
hypervisor incurs. Plus the structure right now is (with the
exception of the bigmem configuration) carefully arranged to be
a power of two in size.

> * At the end, we choose to introduce a new data structure for above
>   per-page data (2)-(5)
> 
>     struct nvdimm_page
>     {
>         struct domain *domain;    /* for (3) */
>         uint64_t      gfn;        /* for (2) */
>         unsigned long count_info; /* for (4) and (5), same as page_info->count_info */
>         /* other fields if needed, e.g. lock */
>     }

So that again leaves unaddressed the question of what you
intend to do when a guest elects to use such a page as a page
table. I'm afraid any attempt of yours to invent something that
is not struct page_info will not be suitable for all possible needs.

>   On each NVDIMM interleave set, we could reserve an area to place an
>   array of nvdimm_page structures for pages in that interleave set. In
>   addition, the corresponding global nvdimm_iset structure is set to
>   point to this array via its 'pages' field.

And I see no problem doing exactly that, just for an array of
struct page_info.

>   One thing I have no idea is what percentage of ram used/reserved by
>   Xen itself is considered as acceptable. If it exists and a boot
>   parameter is given, we could let Xen choose the faster ram when
>   the percentage has not been reached.

I think a conservative default would be to always place the
control structures in NVDIMM space, unless requested to be
put in RAM via a command line option.

Jan


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-09 16:17                                 ` Jan Beulich
@ 2016-03-10  3:27                                   ` Haozhong Zhang
  2016-03-17 11:05                                   ` Ian Jackson
  1 sibling, 0 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-10  3:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 03/09/16 09:17, Jan Beulich wrote:
> >>> On 09.03.16 at 13:22, <haozhong.zhang@intel.com> wrote:
> > On 03/08/16 02:27, Jan Beulich wrote:
> >> >>> On 08.03.16 at 10:15, <haozhong.zhang@intel.com> wrote:
[...]
> > I should reexplain the choice of data structures and where to put them.
> > 
> > For handling MCE for NVDIMM, we need to track following data:
> > (1) SPA ranges of host NVDIMMs (one range per pmem interleave set), which are
> >     used to check whether a MCE is for NVDIMM.
> > (2) GFN to which a NVDIMM page is mapped, which is used to determine the
> >     address put in vMCE.
> > (3) the domain to which a NVDIMM page is mapped, which is used to
> >     determine whether a vMCE needs to be injected and where it will be
> >     injected.
> > (4) a flag to mark whether a NVDIMM page is broken, which is used to
> >     avoid mapping broken page to guests.
> > 
> > For granting NVDIMM pages (e.g. xen-blkback/netback),
> > (5) a reference counter is needed for each NVDIMM page
> > 
> > Above data can be organized as below:
> > 
> > * For (1) SPA ranges, we can record them in a global data structure,
> >   e.g. a list
> > 
> >     struct list_head nvdimm_iset_list;
> > 
> >     struct nvdimm_iset
> >     {
> >          uint64_t           base;  /* starting SPA of this interleave set */
> >          uint64_t           size;  /* size of this interleave set */
> >          struct nvdimm_page *pages;/* information for individual pages in this interleave set */
> >          struct list_head   list;
> >     };
> > 
> > * For (2) GFN, an intuitive place to get this information is from M2P
> >   table machine_to_phys_mapping[].  However, the address of NVDIMM is
> >   not required to be contiguous with normal ram, so, if NVDIMM starts
> >   from an address that is much higher than the end address of normal
> >   ram, it may result in a M2P table that maybe too large to fit in the
> >   normal ram. Therefore, we choose to not put GFNs of NVDIMM in M2P
> >   table.
> 
> Any page that _may_ be used by a guest as normal RAM page
> must have its mach->phys translation entered in the M2P. That's
> because a r/o variant of that table is part of the hypervisor ABI
> for PV guests. Size considerations simply don't apply here - the
> table may be sparse (guests are required to deal with accesses
> potentially faulting), and the 256Gb of virtual address space set
> aside for it cover all memory up to the 47-bit boundary (there's
> room for doubling this). Memory at addresses with bit 47 (or
> higher) set would need a complete overhaul of that mechanism,
> and whatever new mechanism we may pick would mean old
> guests won't be able to benefit.
>

OK, then we can use M2P to get PFNs of NVDIMM pages. And ...

> >   Another possible solution is to extend page_info to include GFN for
> >   NVDIMM and use frame_table. A benefit of this solution is that other
> >   data (3)-(5) can be got from page_info as well. However, due to the
> >   same reason for machine_to_phys_mapping[] and the concern that the
> >   large number of page_info structures required for large NVDIMMs may
> >   consume lots of ram, page_info and frame_table seems not a good place
> >   either.
> 
> For this particular item struct page_info is the wrong place
> anyway, due to what I've said above. Also extension
> suggestions of struct page_info are quite problematic, as any
> such implies a measurable increase on the memory overhead
> the hypervisor incurs. Plus the structure right now is (with the
> exception of the bigmem configuration) a carefully arranged
> for power of two in size.
> 
> > * At the end, we choose to introduce a new data structure for above
> >   per-page data (2)-(5)
> > 
> >     struct nvdimm_page
> >     {
> >         struct domain *domain;    /* for (3) */
> >         uint64_t      gfn;        /* for (2) */
> >         unsigned long count_info; /* for (4) and (5), same as page_info->count_info */
> >         /* other fields if needed, e.g. lock */
> >     }
> 
> So that again leaves unaddressed the question of what you
> imply to do when a guest elects to use such a page as page
> table. I'm afraid any attempt of yours to invent something that
> is not struct page_info will not be suitable for all possible needs.
>

... we can use the page_info struct rather than the nvdimm_page struct for
NVDIMM pages, and will be able to benefit from whatever has been done
with page_info.

> >   On each NVDIMM interleave set, we could reserve an area to place an
> >   array of nvdimm_page structures for pages in that interleave set. In
> >   addition, the corresponding global nvdimm_iset structure is set to
> >   point to this array via its 'pages' field.
> 
> And I see no problem doing exactly that, just for an array of
> struct page_info.
>

Yes, page_info arrays.

Because page_info structs for NVDIMM may be put in NVDIMM, existing code
that gets page_info from frame_table needs to be adjusted for NVDIMM
pages to use nvdimm_iset structs instead, including __mfn_to_page,
__page_to_mfn, page_to_spage, spage_to_page, page_to_pdx, pdx_to_page,
etc.
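
To illustrate the kind of adjustment (the mfn-to-page direction is
symmetric), below is a sketch of the page-to-mfn lookup. It assumes, as
discussed above, that each nvdimm_iset's 'pages' field now points at an
array of struct page_info placed in its reserved area; the function name
is made up for illustration.

    /*
     * Sketch: recover the MFN of a page_info that lives in one of the
     * per-interleave-set arrays rather than in frame_table.
     */
    static unsigned long nvdimm_page_to_mfn(const struct page_info *pg)
    {
        struct nvdimm_iset *iset;

        list_for_each_entry ( iset, &nvdimm_iset_list, list )
        {
            unsigned long nr = iset->size >> PAGE_SHIFT;

            if ( pg >= iset->pages && pg < iset->pages + nr )
                return paddr_to_pfn(iset->base) + (pg - iset->pages);
        }

        /* Not an NVDIMM page: fall back to the frame_table arithmetic. */
        return __page_to_mfn(pg);
    }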

> >   One thing I have no idea is what percentage of ram used/reserved by
> >   Xen itself is considered as acceptable. If it exists and a boot
> >   parameter is given, we could let Xen choose the faster ram when
> >   the percentage has not been reached.
> 
> I think a conservative default would be to always place the
> control structures in NVDIMM space, unless requested to be
> put in RAM via command line option.
>

For the first version, I plan to implement the above, putting all page_info
structs for NVDIMM either completely in NVDIMM or completely in normal
ram. In the future, we may introduce some software cache mechanism that
caches the most frequently used NVDIMM page_info structs in the normal ram.

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-04  7:30                     ` Haozhong Zhang
@ 2016-03-16 12:55                       ` Haozhong Zhang
  2016-03-16 13:13                         ` Konrad Rzeszutek Wilk
  2016-03-16 13:16                         ` Jan Beulich
  0 siblings, 2 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-16 12:55 UTC (permalink / raw)
  To: Jan Beulich, Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

Hi Jan and Konrad,

On 03/04/16 15:30, Haozhong Zhang wrote:
> Suddenly realize it's unnecessary to let QEMU get SPA ranges of NVDIMM
> or files on NVDIMM. We can move that work to toolstack and pass SPA
> ranges got by toolstack to qemu. In this way, no privileged operations
> (mmap/mlock/...) are needed in QEMU and non-root QEMU should be able to
> work even with vNVDIMM hotplug in future.
> 

As said above, I'm going to let the toolstack get the NVDIMM SPA ranges.
This can be done via dom0 kernel interfaces and xen hypercalls, and can
be implemented in different ways. I'm wondering which of the following
ones is preferred by xen.

1. Given
    * a file descriptor of either an NVDIMM device or a file on NVDIMM, and
    * the domain id and the guest MFN where vNVDIMM is going to be,
   the xen toolstack (1) gets its SPA ranges via dom0 kernel interfaces
   (e.g. sysfs and ioctl FIEMAP), and (2) calls a hypercall to map the
   above SPA ranges to the given guest MFN of the given domain.

2. Or, given the same inputs, we may combine the above two steps into a new
   dom0 system call that (1) gets the SPA ranges, (2) calls a xen
   hypercall to map the SPA ranges, and, one step further, (3) returns the
   SPA ranges to userspace (because QEMU needs these addresses to build ACPI).

The first way does not need any modification to the dom0 linux kernel,
while the second requires a new system call. I'm not sure whether the xen
toolstack, as a userspace program, is considered safe to pass host physical
addresses to the hypervisor. If not, maybe the second one is better?
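
For reference, a rough sketch of the Dom0 side of option 1 is below. It
assumes the file sits on a DAX-capable filesystem, that FIEMAP's
fe_physical offsets are relative to the underlying NVDIMM block device,
and that the device's starting SPA (dev_spa_base, e.g. read from sysfs)
is supplied by the caller; error handling is trimmed.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    /* Print the SPA ranges backing 'fd' (a file on NVDIMM) of length 'len'. */
    static int print_spa_ranges(int fd, uint64_t dev_spa_base, uint64_t len)
    {
        unsigned int i, max_extents = 32;     /* enough for a sketch */
        struct fiemap *fm = calloc(1, sizeof(*fm) +
                                   max_extents * sizeof(struct fiemap_extent));

        if (!fm)
            return -1;

        fm->fm_start = 0;
        fm->fm_length = len;
        fm->fm_flags = FIEMAP_FLAG_SYNC;
        fm->fm_extent_count = max_extents;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            free(fm);
            return -1;
        }

        for (i = 0; i < fm->fm_mapped_extents; i++)
            printf("SPA 0x%llx, size 0x%llx\n",
                   (unsigned long long)(dev_spa_base +
                                        fm->fm_extents[i].fe_physical),
                   (unsigned long long)fm->fm_extents[i].fe_length);

        free(fm);
        return 0;
    }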

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-16 12:55                       ` Haozhong Zhang
@ 2016-03-16 13:13                         ` Konrad Rzeszutek Wilk
  2016-03-16 13:16                         ` Jan Beulich
  1 sibling, 0 replies; 121+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-03-16 13:13 UTC (permalink / raw)
  To: Jan Beulich, Stefano Stabellini, Juergen Gross, Kevin Tian,
	Wei Liu, Ian Campbell, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On Wed, Mar 16, 2016 at 08:55:08PM +0800, Haozhong Zhang wrote:
> Hi Jan and Konrad,
> 
> On 03/04/16 15:30, Haozhong Zhang wrote:
> > Suddenly realize it's unnecessary to let QEMU get SPA ranges of NVDIMM
> > or files on NVDIMM. We can move that work to toolstack and pass SPA
> > ranges got by toolstack to qemu. In this way, no privileged operations
> > (mmap/mlock/...) are needed in QEMU and non-root QEMU should be able to
> > work even with vNVDIMM hotplug in future.
> > 
> 
> As I'm going to let toolstack to get NVDIMM SPA ranges. This can be
> done via dom0 kernel interface and xen hypercalls, and can be
> implemented in different ways. I'm wondering which of the following
> ones is preferred by xen.
> 
> 1. Given
>     * a file descriptor of either a NVDIMM device or a file on NVDIMM, and
>     * domain id and guest MFN where vNVDIMM is going to be.
>    xen toolstack (1) gets it SPA ranges via dom0 kernel interface
>    (e.g. sysfs and ioctl FIEMAP), and (2) calls a hypercall to map
>    above SPA ranges to the given guest MFN of the given domain.
> 
> 2. Or, given the same inputs, we may combine above two steps into a new
>    dom0 system call that (1) gets the SPA ranges, (2) calls xen
>    hypercall to map SPA ranges, and, one step further, (3) returns SPA
>    ranges to userspace (because QEMU needs these addresses to build ACPI).
> 
> The first way does not need to modify dom0 linux kernel, while the
> second requires a new system call. I'm not sure whether xen toolstack
> as a userspace program is considered to be safe to pass the host physical
> address to hypervisor. If not, maybe the second one is better?

Well, the toolstack already does it (for MMIO ranges of PCIe devices and
such).

I would prefer 1) as it means less kernel code.
> 
> Thanks,
> Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-16 12:55                       ` Haozhong Zhang
  2016-03-16 13:13                         ` Konrad Rzeszutek Wilk
@ 2016-03-16 13:16                         ` Jan Beulich
  2016-03-16 13:55                           ` Haozhong Zhang
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-03-16 13:16 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

>>> On 16.03.16 at 13:55, <haozhong.zhang@intel.com> wrote:
> Hi Jan and Konrad,
> 
> On 03/04/16 15:30, Haozhong Zhang wrote:
>> Suddenly realize it's unnecessary to let QEMU get SPA ranges of NVDIMM
>> or files on NVDIMM. We can move that work to toolstack and pass SPA
>> ranges got by toolstack to qemu. In this way, no privileged operations
>> (mmap/mlock/...) are needed in QEMU and non-root QEMU should be able to
>> work even with vNVDIMM hotplug in future.
>> 
> 
> As I'm going to let toolstack to get NVDIMM SPA ranges. This can be
> done via dom0 kernel interface and xen hypercalls, and can be
> implemented in different ways. I'm wondering which of the following
> ones is preferred by xen.
> 
> 1. Given
>     * a file descriptor of either a NVDIMM device or a file on NVDIMM, and
>     * domain id and guest MFN where vNVDIMM is going to be.
>    xen toolstack (1) gets it SPA ranges via dom0 kernel interface
>    (e.g. sysfs and ioctl FIEMAP), and (2) calls a hypercall to map
>    above SPA ranges to the given guest MFN of the given domain.
> 
> 2. Or, given the same inputs, we may combine above two steps into a new
>    dom0 system call that (1) gets the SPA ranges, (2) calls xen
>    hypercall to map SPA ranges, and, one step further, (3) returns SPA
>    ranges to userspace (because QEMU needs these addresses to build ACPI).

DYM GPA here? Qemu should hardly have a need for SPA when
wanting to build ACPI tables for the guest.

> The first way does not need to modify dom0 linux kernel, while the
> second requires a new system call. I'm not sure whether xen toolstack
> as a userspace program is considered to be safe to pass the host physical
> address to hypervisor. If not, maybe the second one is better?

As long as the passing of physical addresses follows the model
of MMIO for passed-through PCI devices, I don't think there's a
problem with the tool stack bypassing the Dom0 kernel. So it
really all depends on how you make sure that the guest won't
get to see memory it has no permission to access.

Which reminds me: When considering a file on NVDIMM, how
are you making sure the mapping of the file to disk (i.e.
memory) blocks doesn't change while the guest has access
to it, e.g. due to some defragmentation going on? And
talking of fragmentation - how do you mean to track guest
permissions for an unbounded number of address ranges?

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-16 13:16                         ` Jan Beulich
@ 2016-03-16 13:55                           ` Haozhong Zhang
  2016-03-16 14:23                             ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-16 13:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On 03/16/16 07:16, Jan Beulich wrote:
> >>> On 16.03.16 at 13:55, <haozhong.zhang@intel.com> wrote:
> > Hi Jan and Konrad,
> > 
> > On 03/04/16 15:30, Haozhong Zhang wrote:
> >> Suddenly realize it's unnecessary to let QEMU get SPA ranges of NVDIMM
> >> or files on NVDIMM. We can move that work to toolstack and pass SPA
> >> ranges got by toolstack to qemu. In this way, no privileged operations
> >> (mmap/mlock/...) are needed in QEMU and non-root QEMU should be able to
> >> work even with vNVDIMM hotplug in future.
> >> 
> > 
> > As I'm going to let toolstack to get NVDIMM SPA ranges. This can be
> > done via dom0 kernel interface and xen hypercalls, and can be
> > implemented in different ways. I'm wondering which of the following
> > ones is preferred by xen.
> > 
> > 1. Given
> >     * a file descriptor of either a NVDIMM device or a file on NVDIMM, and
> >     * domain id and guest MFN where vNVDIMM is going to be.
> >    xen toolstack (1) gets it SPA ranges via dom0 kernel interface
> >    (e.g. sysfs and ioctl FIEMAP), and (2) calls a hypercall to map
> >    above SPA ranges to the given guest MFN of the given domain.
> > 
> > 2. Or, given the same inputs, we may combine above two steps into a new
> >    dom0 system call that (1) gets the SPA ranges, (2) calls xen
> >    hypercall to map SPA ranges, and, one step further, (3) returns SPA
> >    ranges to userspace (because QEMU needs these addresses to build ACPI).
> 
> DYM GPA here? Qemu should hardly have a need for SPA when
> wanting to build ACPI tables for the guest.
>

Oh, it should be GPA for QEMU and (3) is not needed.

> > The first way does not need to modify dom0 linux kernel, while the
> > second requires a new system call. I'm not sure whether xen toolstack
> > as a userspace program is considered to be safe to pass the host physical
> > address to hypervisor. If not, maybe the second one is better?
> 
> As long as the passing of physical addresses follows to model
> of MMIO for passed through PCI devices, I don't think there's
> problem with the tool stack bypassing the Dom0 kernel. So it
> really all depends on how you make sure that the guest won't
> get to see memory it has no permission to access.
>

So the toolstack should first use XEN_DOMCTL_iomem_permission to grant
permissions to the guest and then call XEN_DOMCTL_memory_mapping for
the mapping.
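
Roughly, the libxc side of that sequence would look like the sketch
below, similar to what is already done for passed-through MMIO ranges;
the frame numbers are of course placeholders, and whether extra
NVDIMM-specific checks are needed is exactly the open question here.

    #include <xenctrl.h>

    /* Map nr frames of host NVDIMM starting at 'mfn' at guest frame 'gfn'. */
    static int map_nvdimm_to_guest(xc_interface *xch, uint32_t domid,
                                   unsigned long gfn, unsigned long mfn,
                                   unsigned long nr)
    {
        /* 1. XEN_DOMCTL_iomem_permission: let the domain access the frames. */
        int rc = xc_domain_iomem_permission(xch, domid, mfn, nr, 1);

        if (rc)
            return rc;

        /* 2. XEN_DOMCTL_memory_mapping: establish the gfn -> mfn mapping. */
        return xc_domain_memory_mapping(xch, domid, gfn, mfn, nr,
                                        DPCI_ADD_MAPPING);
    }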

> Which reminds me: When considering a file on NVDIMM, how
> are you making sure the mapping of the file to disk (i.e.
> memory) blocks doesn't change while the guest has access
> to it, e.g. due to some defragmentation going on?

The current Linux kernel 4.5 has an experimental "raw device dax
support" (enabled by removing "depends on BROKEN" from "config
BLK_DEV_DAX") which can guarantee a consistent mapping. The driver
developers are going to make it non-broken in Linux kernel 4.6.

> And
> talking of fragmentation - how do you mean to track guest
> permissions for an unbounded number of address ranges?
>

In this case the range structs in iomem_caps for NVDIMMs may consume a lot
of memory, so I think they are another candidate that should be put in
the reserved area on NVDIMM. If we only allow access permissions to
NVDIMM to be granted page by page (rather than byte by byte), the number
of range structs for each NVDIMM in the worst case is still bounded.

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-16 13:55                           ` Haozhong Zhang
@ 2016-03-16 14:23                             ` Jan Beulich
  2016-03-16 14:55                               ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-03-16 14:23 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

>>> On 16.03.16 at 14:55, <haozhong.zhang@intel.com> wrote:
> On 03/16/16 07:16, Jan Beulich wrote:
>> Which reminds me: When considering a file on NVDIMM, how
>> are you making sure the mapping of the file to disk (i.e.
>> memory) blocks doesn't change while the guest has access
>> to it, e.g. due to some defragmentation going on?
> 
> The current linux kernel 4.5 has an experimental "raw device dax
> support" (enabled by removing "depends on BROKEN" from "config
> BLK_DEV_DAX") which can guarantee the consistent mapping. The driver
> developers are going to make it non-broken in linux kernel 4.6.

But there you talk about full devices, whereas my question was
for files.

>> And
>> talking of fragmentation - how do you mean to track guest
>> permissions for an unbounded number of address ranges?
>>
> 
> In this case range structs in iomem_caps for NVDIMMs may consume a lot
> of memory, so I think they are another candidate that should be put in
> the reserved area on NVDIMM. If we only allow to grant access
> permissions to NVDIMM page by page (rather than byte), the number of
> range structs for each NVDIMM in the worst case is still decidable.

Of course the permission granularity is going to be by pages, not
bytes (or else we couldn't allow the pages to be mapped into
guest address space). And the limit on the per-domain range
sets isn't going to be allowed to be bumped significantly, at
least not for any of the existing ones (or else you'd have to
prove such bumping can't be abused). Putting such control
structures on NVDIMM is a nice idea, but following our isolation
model for normal memory, any such memory used by Xen
would then need to be (made) inaccessible to Dom0.

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-16 14:23                             ` Jan Beulich
@ 2016-03-16 14:55                               ` Haozhong Zhang
  2016-03-16 15:23                                 ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-16 14:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On 03/16/16 08:23, Jan Beulich wrote:
> >>> On 16.03.16 at 14:55, <haozhong.zhang@intel.com> wrote:
> > On 03/16/16 07:16, Jan Beulich wrote:
> >> Which reminds me: When considering a file on NVDIMM, how
> >> are you making sure the mapping of the file to disk (i.e.
> >> memory) blocks doesn't change while the guest has access
> >> to it, e.g. due to some defragmentation going on?
> > 
> > The current linux kernel 4.5 has an experimental "raw device dax
> > support" (enabled by removing "depends on BROKEN" from "config
> > BLK_DEV_DAX") which can guarantee the consistent mapping. The driver
> > developers are going to make it non-broken in linux kernel 4.6.
> 
> But there you talk about full devices, whereas my question was
> for files.
>

the raw device dax support is for files on NVDIMM.

> >> And
> >> talking of fragmentation - how do you mean to track guest
> >> permissions for an unbounded number of address ranges?
> >>
> > 
> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
> > of memory, so I think they are another candidate that should be put in
> > the reserved area on NVDIMM. If we only allow to grant access
> > permissions to NVDIMM page by page (rather than byte), the number of
> > range structs for each NVDIMM in the worst case is still decidable.
> 
> Of course the permission granularity is going to by pages, not
> bytes (or else we couldn't allow the pages to be mapped into
> guest address space). And the limit on the per-domain range
> sets isn't going to be allowed to be bumped significantly, at
> least not for any of the existing ones (or else you'd have to
> prove such bumping can't be abused).

What is that limit? The total number of range structs in the per-domain
range sets? I must have missed something when looking through 'case
XEN_DOMCTL_iomem_permission' of do_domctl(), as I didn't find that
limit, unless it means alloc_range() will fail when there are lots of
range structs.

> Putting such control
> structures on NVDIMM is a nice idea, but following our isolation
> model for normal memory, any such memory used by Xen
> would then need to be (made) inaccessible to Dom0.
>

I'm not clear how this is done. By marking those inaccessible pages as
not present in dom0's page tables? Or is there any example I can follow?

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-16 14:55                               ` Haozhong Zhang
@ 2016-03-16 15:23                                 ` Jan Beulich
  2016-03-17  8:58                                   ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-03-16 15:23 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

>>> On 16.03.16 at 15:55, <haozhong.zhang@intel.com> wrote:
> On 03/16/16 08:23, Jan Beulich wrote:
>> >>> On 16.03.16 at 14:55, <haozhong.zhang@intel.com> wrote:
>> > On 03/16/16 07:16, Jan Beulich wrote:
>> >> Which reminds me: When considering a file on NVDIMM, how
>> >> are you making sure the mapping of the file to disk (i.e.
>> >> memory) blocks doesn't change while the guest has access
>> >> to it, e.g. due to some defragmentation going on?
>> > 
>> > The current linux kernel 4.5 has an experimental "raw device dax
>> > support" (enabled by removing "depends on BROKEN" from "config
>> > BLK_DEV_DAX") which can guarantee the consistent mapping. The driver
>> > developers are going to make it non-broken in linux kernel 4.6.
>> 
>> But there you talk about full devices, whereas my question was
>> for files.
>>
> 
> the raw device dax support is for files on NVDIMM.

Okay, I can only trust you here. I thought FS_DAX is the file level
thing.

>> >> And
>> >> talking of fragmentation - how do you mean to track guest
>> >> permissions for an unbounded number of address ranges?
>> >>
>> > 
>> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
>> > of memory, so I think they are another candidate that should be put in
>> > the reserved area on NVDIMM. If we only allow to grant access
>> > permissions to NVDIMM page by page (rather than byte), the number of
>> > range structs for each NVDIMM in the worst case is still decidable.
>> 
>> Of course the permission granularity is going to by pages, not
>> bytes (or else we couldn't allow the pages to be mapped into
>> guest address space). And the limit on the per-domain range
>> sets isn't going to be allowed to be bumped significantly, at
>> least not for any of the existing ones (or else you'd have to
>> prove such bumping can't be abused).
> 
> What is that limit? the total number of range structs in per-domain
> range sets? I must miss something when looking through 'case
> XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
> limit, unless it means alloc_range() will fail when there are lots of
> range structs.

Oh, I'm sorry, that was a different set of range sets I was
thinking about. But note that excessive creation of ranges
through XEN_DOMCTL_iomem_permission is not a security issue
only because of XSA-77, i.e. we'd still not knowingly allow a
severe increase here.

>> Putting such control
>> structures on NVDIMM is a nice idea, but following our isolation
>> model for normal memory, any such memory used by Xen
>> would then need to be (made) inaccessible to Dom0.
> 
> I'm not clear how this is done. By marking those inaccessible pages as
> unpresent in dom0's page table? Or any example I can follow?

That's the problem - so far we had no need to do so since Dom0
was only ever allowed access to memory Xen didn't use for itself
or knows it wants to share. Whereas now you want such a
resource controlled first by Dom0, and only then handed to Xen.
So yes, Dom0 would need to zap any mappings of these pages
(and Xen would need to verify that, which would come mostly
without new code as long as struct page_info gets properly
used for all this memory) before Xen could use it. Much like
ballooning out a normal RAM page.

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-16 15:23                                 ` Jan Beulich
@ 2016-03-17  8:58                                   ` Haozhong Zhang
  2016-03-17 11:04                                     ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-17  8:58 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, IanJackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On 03/16/16 09:23, Jan Beulich wrote:
> >>> On 16.03.16 at 15:55, <haozhong.zhang@intel.com> wrote:
> > On 03/16/16 08:23, Jan Beulich wrote:
> >> >>> On 16.03.16 at 14:55, <haozhong.zhang@intel.com> wrote:
> >> > On 03/16/16 07:16, Jan Beulich wrote:
> >> >> Which reminds me: When considering a file on NVDIMM, how
> >> >> are you making sure the mapping of the file to disk (i.e.
> >> >> memory) blocks doesn't change while the guest has access
> >> >> to it, e.g. due to some defragmentation going on?
> >> > 
> >> > The current linux kernel 4.5 has an experimental "raw device dax
> >> > support" (enabled by removing "depends on BROKEN" from "config
> >> > BLK_DEV_DAX") which can guarantee the consistent mapping. The driver
> >> > developers are going to make it non-broken in linux kernel 4.6.
> >> 
> >> But there you talk about full devices, whereas my question was
> >> for files.
> >>
> > 
> > the raw device dax support is for files on NVDIMM.
> 
> Okay, I can only trust you here. I thought FS_DAX is the file level
> thing.
> 
> >> >> And
> >> >> talking of fragmentation - how do you mean to track guest
> >> >> permissions for an unbounded number of address ranges?
> >> >>
> >> > 
> >> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
> >> > of memory, so I think they are another candidate that should be put in
> >> > the reserved area on NVDIMM. If we only allow to grant access
> >> > permissions to NVDIMM page by page (rather than byte), the number of
> >> > range structs for each NVDIMM in the worst case is still decidable.
> >> 
> >> Of course the permission granularity is going to by pages, not
> >> bytes (or else we couldn't allow the pages to be mapped into
> >> guest address space). And the limit on the per-domain range
> >> sets isn't going to be allowed to be bumped significantly, at
> >> least not for any of the existing ones (or else you'd have to
> >> prove such bumping can't be abused).
> > 
> > What is that limit? the total number of range structs in per-domain
> > range sets? I must miss something when looking through 'case
> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
> > limit, unless it means alloc_range() will fail when there are lots of
> > range structs.
> 
> Oh, I'm sorry, that was a different set of range sets I was
> thinking about. But note that excessive creation of ranges
> through XEN_DOMCTL_iomem_permission is not a security issue
> just because of XSA-77, i.e. we'd still not knowingly allow a
> severe increase here.
>

I hadn't noticed that multiple domains can all have access permission
to an iomem range, i.e. there can be multiple range structs for a
single iomem range. If range structs for NVDIMM are put on NVDIMM,
there could still be a huge number of them on NVDIMM in the worst
case (maximum number of domains * number of NVDIMM pages).

A workaround is to allow a range of NVDIMM pages to be accessed by
only a single domain at a time. Whenever we grant access permission
for NVDIMM pages to a domain, we also revoke the permission from the
current grantee. In this way, we only need to put 'number of NVDIMM
pages' range structs on NVDIMM in the worst case.
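
A rough sketch of this revoke-then-grant idea, assuming Xen's existing
iomem_permit_access()/iomem_deny_access() helpers; the struct and
function below are invented for illustration only:

  /* One grantee per NVDIMM range; the tracking struct could itself
   * live in the reserved NVDIMM area.  Simplified: no locking or
   * domain refcounting shown. */
  struct nvdimm_range {
      unsigned long first_mfn, nr_mfns;
      struct domain *owner;          /* NULL if not currently granted */
  };

  static int nvdimm_grant_range(struct nvdimm_range *r, struct domain *d)
  {
      int rc;

      if ( r->owner )
      {
          rc = iomem_deny_access(r->owner, r->first_mfn,
                                 r->first_mfn + r->nr_mfns - 1);
          if ( rc )
              return rc;
      }
      rc = iomem_permit_access(d, r->first_mfn,
                               r->first_mfn + r->nr_mfns - 1);
      if ( !rc )
          r->owner = d;
      return rc;
  }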

> >> Putting such control
> >> structures on NVDIMM is a nice idea, but following our isolation
> >> model for normal memory, any such memory used by Xen
> >> would then need to be (made) inaccessible to Dom0.
> > 
> > I'm not clear how this is done. By marking those inaccessible pages as
> > unpresent in dom0's page table? Or any example I can follow?
> 
> That's the problem - so far we had no need to do so since Dom0
> was only ever allowed access to memory Xen didn't use for itself
> or knows it wants to share. Whereas now you want such a
> resource controlled first by Dom0, and only then handed to Xen.
> So yes, Dom0 would need to zap any mappings of these pages
> (and Xen would need to verify that, which would come mostly
> without new code as long as struct page_info gets properly
> used for all this memory) before Xen could use it. Much like
> ballooning out a normal RAM page.
> 

Thanks, I'll look into this balloon approach.

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17  8:58                                   ` Haozhong Zhang
@ 2016-03-17 11:04                                     ` Jan Beulich
  2016-03-17 12:44                                       ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-03-17 11:04 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

>>> On 17.03.16 at 09:58, <haozhong.zhang@intel.com> wrote:
> On 03/16/16 09:23, Jan Beulich wrote:
>> >>> On 16.03.16 at 15:55, <haozhong.zhang@intel.com> wrote:
>> > On 03/16/16 08:23, Jan Beulich wrote:
>> >> >>> On 16.03.16 at 14:55, <haozhong.zhang@intel.com> wrote:
>> >> > On 03/16/16 07:16, Jan Beulich wrote:
>> >> >> And
>> >> >> talking of fragmentation - how do you mean to track guest
>> >> >> permissions for an unbounded number of address ranges?
>> >> >>
>> >> > 
>> >> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
>> >> > of memory, so I think they are another candidate that should be put in
>> >> > the reserved area on NVDIMM. If we only allow to grant access
>> >> > permissions to NVDIMM page by page (rather than byte), the number of
>> >> > range structs for each NVDIMM in the worst case is still decidable.
>> >> 
>> >> Of course the permission granularity is going to by pages, not
>> >> bytes (or else we couldn't allow the pages to be mapped into
>> >> guest address space). And the limit on the per-domain range
>> >> sets isn't going to be allowed to be bumped significantly, at
>> >> least not for any of the existing ones (or else you'd have to
>> >> prove such bumping can't be abused).
>> > 
>> > What is that limit? the total number of range structs in per-domain
>> > range sets? I must miss something when looking through 'case
>> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
>> > limit, unless it means alloc_range() will fail when there are lots of
>> > range structs.
>> 
>> Oh, I'm sorry, that was a different set of range sets I was
>> thinking about. But note that excessive creation of ranges
>> through XEN_DOMCTL_iomem_permission is not a security issue
>> just because of XSA-77, i.e. we'd still not knowingly allow a
>> severe increase here.
>>
> 
> I didn't notice that multiple domains can all have access permission
> to an iomem range, i.e. there can be multiple range structs for a
> single iomem range. If range structs for NVDIMM are put on NVDIMM,
> then there would be still a huge amount of them on NVDIMM in the worst
> case (maximum number of domains * number of NVDIMM pages).
> 
> A workaround is to only allow a range of NVDIMM pages be accessed by a
> single domain. Whenever we add the access permission of NVDIMM pages
> to a domain, we also remove the permission from its current
> grantee. In this way, we only need to put 'number of NVDIMM pages'
> range structs on NVDIMM in the worst case.

But will this work? There's a reason multiple domains are permitted
access: The domain running qemu for the guest, for example,
needs to be able to access guest memory.

No matter how much you and others are opposed to this, I can't
help myself thinking that PMEM regions should be treated like RAM
(and hence be under full control of Xen), whereas PBLK regions
could indeed be treated like MMIO (and hence partly be under the
control of Dom0).

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-09 16:17                                 ` Jan Beulich
  2016-03-10  3:27                                   ` Haozhong Zhang
@ 2016-03-17 11:05                                   ` Ian Jackson
  2016-03-17 13:37                                     ` Haozhong Zhang
  1 sibling, 1 reply; 121+ messages in thread
From: Ian Jackson @ 2016-03-17 11:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

Jan Beulich writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> So that again leaves unaddressed the question of what you
> imply to do when a guest elects to use such a page as page
> table. I'm afraid any attempt of yours to invent something that
> is not struct page_info will not be suitable for all possible needs.

It is not clear to me whether this is a realistic thing for a guest to
want to do.  Haozhong, maybe you want to consider this aspect.

If you can come up with an argument why it is OK to simply not permit
this, then maybe the recordkeeping requirements can be relaxed ?

Ian.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 11:04                                     ` Jan Beulich
@ 2016-03-17 12:44                                       ` Haozhong Zhang
  2016-03-17 12:59                                         ` Jan Beulich
  2016-03-17 13:32                                         ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-17 12:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On 03/17/16 05:04, Jan Beulich wrote:
> >>> On 17.03.16 at 09:58, <haozhong.zhang@intel.com> wrote:
> > On 03/16/16 09:23, Jan Beulich wrote:
> >> >>> On 16.03.16 at 15:55, <haozhong.zhang@intel.com> wrote:
> >> > On 03/16/16 08:23, Jan Beulich wrote:
> >> >> >>> On 16.03.16 at 14:55, <haozhong.zhang@intel.com> wrote:
> >> >> > On 03/16/16 07:16, Jan Beulich wrote:
> >> >> >> And
> >> >> >> talking of fragmentation - how do you mean to track guest
> >> >> >> permissions for an unbounded number of address ranges?
> >> >> >>
> >> >> > 
> >> >> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
> >> >> > of memory, so I think they are another candidate that should be put in
> >> >> > the reserved area on NVDIMM. If we only allow to grant access
> >> >> > permissions to NVDIMM page by page (rather than byte), the number of
> >> >> > range structs for each NVDIMM in the worst case is still decidable.
> >> >> 
> >> >> Of course the permission granularity is going to by pages, not
> >> >> bytes (or else we couldn't allow the pages to be mapped into
> >> >> guest address space). And the limit on the per-domain range
> >> >> sets isn't going to be allowed to be bumped significantly, at
> >> >> least not for any of the existing ones (or else you'd have to
> >> >> prove such bumping can't be abused).
> >> > 
> >> > What is that limit? the total number of range structs in per-domain
> >> > range sets? I must miss something when looking through 'case
> >> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
> >> > limit, unless it means alloc_range() will fail when there are lots of
> >> > range structs.
> >> 
> >> Oh, I'm sorry, that was a different set of range sets I was
> >> thinking about. But note that excessive creation of ranges
> >> through XEN_DOMCTL_iomem_permission is not a security issue
> >> just because of XSA-77, i.e. we'd still not knowingly allow a
> >> severe increase here.
> >>
> > 
> > I didn't notice that multiple domains can all have access permission
> > to an iomem range, i.e. there can be multiple range structs for a
> > single iomem range. If range structs for NVDIMM are put on NVDIMM,
> > then there would be still a huge amount of them on NVDIMM in the worst
> > case (maximum number of domains * number of NVDIMM pages).
> > 
> > A workaround is to only allow a range of NVDIMM pages be accessed by a
> > single domain. Whenever we add the access permission of NVDIMM pages
> > to a domain, we also remove the permission from its current
> > grantee. In this way, we only need to put 'number of NVDIMM pages'
> > range structs on NVDIMM in the worst case.
> 
> But will this work? There's a reason multiple domains are permitted
> access: The domain running qemu for the guest, for example,
> needs to be able to access guest memory.
>

QEMU now only maintains the ACPI tables and emulates _DSM for vNVDIMM,
neither of which needs to access the NVDIMM pages mapped to the guest.

> No matter how much you and others are opposed to this, I can't
> help myself thinking that PMEM regions should be treated like RAM
> (and hence be under full control of Xen), whereas PBLK regions
> could indeed be treated like MMIO (and hence partly be under the
> control of Dom0).
>

Hmm, giving Xen full control could at least make reserving space
on NVDIMM easier. I guess full control does not include manipulating
file systems on NVDIMM, which can still be left to dom0?

Then there is another problem (which also exists in the current
design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM
functions that access the label storage area (for namespaces) as an
example:

The way Linux reserves space on a pmem-mode NVDIMM is to leave the
reserved space at the beginning of the NVDIMM and create a pmem
namespace which starts at the end of the reserved space. Because the
reservation information is recorded in the namespace in the NVDIMM
label storage area, any OS that follows the namespace spec will not
mistakenly write files in the reserved area. I prefer the same
approach if Xen is going to do the reservation. We definitely don't
want dom0 to break the label storage area, so Xen seemingly needs to
emulate the corresponding _DSM functions for dom0? If so, which part,
the hypervisor or the toolstack, should do the emulation?
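
A tiny sketch of the layout described above (struct and field names
are invented, only to pin down the offsets involved):

  #include <stdint.h>

  /* Rough picture, not an on-media format:
   *
   *   region_base_spa              region_base_spa + reserved_bytes
   *        |--- reserved area (Xen) ---|------ pmem namespace ------|
   *
   * The namespace start offset is what gets recorded in the NVDIMM
   * label storage area, so namespace-aware OSes keep out of the
   * reserved area. */
  struct pmem_reservation {
      uint64_t region_base_spa;   /* start of the pmem region */
      uint64_t reserved_bytes;    /* space kept back at the start */
  };
  /* namespace start = region_base_spa + reserved_bytes */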

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 12:44                                       ` Haozhong Zhang
@ 2016-03-17 12:59                                         ` Jan Beulich
  2016-03-17 13:29                                           ` Haozhong Zhang
  2016-03-17 13:32                                         ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-03-17 12:59 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

>>> On 17.03.16 at 13:44, <haozhong.zhang@intel.com> wrote:
> On 03/17/16 05:04, Jan Beulich wrote:
>> >>> On 17.03.16 at 09:58, <haozhong.zhang@intel.com> wrote:
>> > On 03/16/16 09:23, Jan Beulich wrote:
>> >> >>> On 16.03.16 at 15:55, <haozhong.zhang@intel.com> wrote:
>> >> > On 03/16/16 08:23, Jan Beulich wrote:
>> >> >> >>> On 16.03.16 at 14:55, <haozhong.zhang@intel.com> wrote:
>> >> >> > On 03/16/16 07:16, Jan Beulich wrote:
>> >> >> >> And
>> >> >> >> talking of fragmentation - how do you mean to track guest
>> >> >> >> permissions for an unbounded number of address ranges?
>> >> >> >>
>> >> >> > 
>> >> >> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
>> >> >> > of memory, so I think they are another candidate that should be put in
>> >> >> > the reserved area on NVDIMM. If we only allow to grant access
>> >> >> > permissions to NVDIMM page by page (rather than byte), the number of
>> >> >> > range structs for each NVDIMM in the worst case is still decidable.
>> >> >> 
>> >> >> Of course the permission granularity is going to by pages, not
>> >> >> bytes (or else we couldn't allow the pages to be mapped into
>> >> >> guest address space). And the limit on the per-domain range
>> >> >> sets isn't going to be allowed to be bumped significantly, at
>> >> >> least not for any of the existing ones (or else you'd have to
>> >> >> prove such bumping can't be abused).
>> >> > 
>> >> > What is that limit? the total number of range structs in per-domain
>> >> > range sets? I must miss something when looking through 'case
>> >> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
>> >> > limit, unless it means alloc_range() will fail when there are lots of
>> >> > range structs.
>> >> 
>> >> Oh, I'm sorry, that was a different set of range sets I was
>> >> thinking about. But note that excessive creation of ranges
>> >> through XEN_DOMCTL_iomem_permission is not a security issue
>> >> just because of XSA-77, i.e. we'd still not knowingly allow a
>> >> severe increase here.
>> >>
>> > 
>> > I didn't notice that multiple domains can all have access permission
>> > to an iomem range, i.e. there can be multiple range structs for a
>> > single iomem range. If range structs for NVDIMM are put on NVDIMM,
>> > then there would be still a huge amount of them on NVDIMM in the worst
>> > case (maximum number of domains * number of NVDIMM pages).
>> > 
>> > A workaround is to only allow a range of NVDIMM pages be accessed by a
>> > single domain. Whenever we add the access permission of NVDIMM pages
>> > to a domain, we also remove the permission from its current
>> > grantee. In this way, we only need to put 'number of NVDIMM pages'
>> > range structs on NVDIMM in the worst case.
>> 
>> But will this work? There's a reason multiple domains are permitted
>> access: The domain running qemu for the guest, for example,
>> needs to be able to access guest memory.
>>
> 
> QEMU now only maintains ACPI tables and emulates _DSM for vNVDIMM
> which both do not need to access NVDIMM pages mapped to guest.

For one - this was only an example. And then - iirc qemu keeps
mappings of certain guest RAM ranges. If I'm remembering this
right, then why would it be excluded that it also may need
mappings of guest NVDIMM?

>> No matter how much you and others are opposed to this, I can't
>> help myself thinking that PMEM regions should be treated like RAM
>> (and hence be under full control of Xen), whereas PBLK regions
>> could indeed be treated like MMIO (and hence partly be under the
>> control of Dom0).
>>
> 
> Hmm, making Xen has full control could at least make reserving space
> on NVDIMM easier. I guess full control does not include manipulating
> file systems on NVDIMM which can be still left to dom0?
> 
> Then there is another problem (which also exists in the current
> design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM
> that access label storage area (for namespace) for example:
> 
> The way Linux reserving space on pmem mode NVDIMM is to leave the
> reserved space at the beginning of pmem mode NVDIMM and create a pmem
> namespace which starts from the end of the reserved space. Because the
> reservation information is written in the namespace in the NVDIMM
> label storage area, every OS that follows the namespace spec would not
> mistakenly write files in the reserved area. I prefer to the same way
> if Xen is going to do the reservation. We definitely don't want dom0
> to break the label storage area, so Xen seemingly needs to emulate the
> corresponding _DSM functions for dom0? If so, which part, the
> hypervisor or the toolstack, should do the emulation?

I don't think I can answer all but the very last point: Of course this
can't be done in the tool stack, since afaict the Dom0 kernel will
want to evaluate _DSM before the tool stack even runs.

Jan


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 12:59                                         ` Jan Beulich
@ 2016-03-17 13:29                                           ` Haozhong Zhang
  2016-03-17 13:52                                             ` Jan Beulich
  2016-03-17 14:00                                             ` Ian Jackson
  0 siblings, 2 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-17 13:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On 03/17/16 06:59, Jan Beulich wrote:
> >>> On 17.03.16 at 13:44, <haozhong.zhang@intel.com> wrote:
> > On 03/17/16 05:04, Jan Beulich wrote:
> >> >>> On 17.03.16 at 09:58, <haozhong.zhang@intel.com> wrote:
> >> > On 03/16/16 09:23, Jan Beulich wrote:
> >> >> >>> On 16.03.16 at 15:55, <haozhong.zhang@intel.com> wrote:
> >> >> > On 03/16/16 08:23, Jan Beulich wrote:
> >> >> >> >>> On 16.03.16 at 14:55, <haozhong.zhang@intel.com> wrote:
> >> >> >> > On 03/16/16 07:16, Jan Beulich wrote:
> >> >> >> >> And
> >> >> >> >> talking of fragmentation - how do you mean to track guest
> >> >> >> >> permissions for an unbounded number of address ranges?
> >> >> >> >>
> >> >> >> > 
> >> >> >> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
> >> >> >> > of memory, so I think they are another candidate that should be put in
> >> >> >> > the reserved area on NVDIMM. If we only allow to grant access
> >> >> >> > permissions to NVDIMM page by page (rather than byte), the number of
> >> >> >> > range structs for each NVDIMM in the worst case is still decidable.
> >> >> >> 
> >> >> >> Of course the permission granularity is going to by pages, not
> >> >> >> bytes (or else we couldn't allow the pages to be mapped into
> >> >> >> guest address space). And the limit on the per-domain range
> >> >> >> sets isn't going to be allowed to be bumped significantly, at
> >> >> >> least not for any of the existing ones (or else you'd have to
> >> >> >> prove such bumping can't be abused).
> >> >> > 
> >> >> > What is that limit? the total number of range structs in per-domain
> >> >> > range sets? I must miss something when looking through 'case
> >> >> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
> >> >> > limit, unless it means alloc_range() will fail when there are lots of
> >> >> > range structs.
> >> >> 
> >> >> Oh, I'm sorry, that was a different set of range sets I was
> >> >> thinking about. But note that excessive creation of ranges
> >> >> through XEN_DOMCTL_iomem_permission is not a security issue
> >> >> just because of XSA-77, i.e. we'd still not knowingly allow a
> >> >> severe increase here.
> >> >>
> >> > 
> >> > I didn't notice that multiple domains can all have access permission
> >> > to an iomem range, i.e. there can be multiple range structs for a
> >> > single iomem range. If range structs for NVDIMM are put on NVDIMM,
> >> > then there would be still a huge amount of them on NVDIMM in the worst
> >> > case (maximum number of domains * number of NVDIMM pages).
> >> > 
> >> > A workaround is to only allow a range of NVDIMM pages be accessed by a
> >> > single domain. Whenever we add the access permission of NVDIMM pages
> >> > to a domain, we also remove the permission from its current
> >> > grantee. In this way, we only need to put 'number of NVDIMM pages'
> >> > range structs on NVDIMM in the worst case.
> >> 
> >> But will this work? There's a reason multiple domains are permitted
> >> access: The domain running qemu for the guest, for example,
> >> needs to be able to access guest memory.
> >>
> > 
> > QEMU now only maintains ACPI tables and emulates _DSM for vNVDIMM
> > which both do not need to access NVDIMM pages mapped to guest.
> 
> For one - this was only an example. And then - iirc qemu keeps
> mappings of certain guest RAM ranges. If I'm remembering this
> right, then why would it be excluded that it also may need
> mappings of guest NVDIMM?
>

QEMU keeps mappings of guest memory because (1) those mappings are
created by QEMU itself, and/or (2) certain device emulation needs to
access the guest memory. But for vNVDIMM, I'm going to move the
creation of its mappings out of QEMU into the toolstack, and the
vNVDIMM code in QEMU does not access the vNVDIMM pages mapped to the
guest, so it's not necessary to let QEMU keep vNVDIMM mappings.
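
As a feel for what "the toolstack creates the mapping" could look
like, here is a rough sketch reusing the existing libxc MMIO-mapping
call; it only illustrates the kind of interface meant, not the
proposed design, and the GFN/MFN arguments are placeholders:

  #include <xenctrl.h>

  /* Map nr_pages of host pmem starting at mfn into the guest's physmap
   * at gfn.  The guest must already have been granted iomem access to
   * [mfn, mfn + nr_pages) for the underlying DOMCTL to succeed. */
  static int map_pmem_to_guest(xc_interface *xch, uint32_t domid,
                               unsigned long gfn, unsigned long mfn,
                               unsigned long nr_pages)
  {
      return xc_domain_memory_mapping(xch, domid, gfn, mfn, nr_pages,
                                      DPCI_ADD_MAPPING);
  }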

> >> No matter how much you and others are opposed to this, I can't
> >> help myself thinking that PMEM regions should be treated like RAM
> >> (and hence be under full control of Xen), whereas PBLK regions
> >> could indeed be treated like MMIO (and hence partly be under the
> >> control of Dom0).
> >>
> > 
> > Hmm, making Xen has full control could at least make reserving space
> > on NVDIMM easier. I guess full control does not include manipulating
> > file systems on NVDIMM which can be still left to dom0?
> > 
> > Then there is another problem (which also exists in the current
> > design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM
> > that access label storage area (for namespace) for example:
> > 
> > The way Linux reserving space on pmem mode NVDIMM is to leave the
> > reserved space at the beginning of pmem mode NVDIMM and create a pmem
> > namespace which starts from the end of the reserved space. Because the
> > reservation information is written in the namespace in the NVDIMM
> > label storage area, every OS that follows the namespace spec would not
> > mistakenly write files in the reserved area. I prefer to the same way
> > if Xen is going to do the reservation. We definitely don't want dom0
> > to break the label storage area, so Xen seemingly needs to emulate the
> > corresponding _DSM functions for dom0? If so, which part, the
> > hypervisor or the toolstack, should do the emulation?
> 
> I don't think I can answer all but the very last point: Of course this
> can't be done in the tool stack, since afaict the Dom0 kernel will
> want to evaluate _DSM before the tool stack even runs.
>

Or, we could modify the dom0 kernel to just use the label storage area
as is and never modify it. Can the Xen hypervisor trust the dom0 kernel
in this respect?

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 12:44                                       ` Haozhong Zhang
  2016-03-17 12:59                                         ` Jan Beulich
@ 2016-03-17 13:32                                         ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 121+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-03-17 13:32 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, George Dunlap, Ian Campbell, Wei Liu,
	George Dunlap, Ian Jackson, Stefano Stabellini, Jun Nakajima,
	Kevin Tian, Xiao Guangrong, xen-devel, Juergen Gross,
	Keir Fraser

> Then there is another problem (which also exists in the current
> design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM
> that access label storage area (for namespace) for example:

No. And it really can't, as each vendor's _DSM is different - and there
is no ACPI AML interpreter inside the Xen hypervisor.
> 
> The way Linux reserving space on pmem mode NVDIMM is to leave the
> reserved space at the beginning of pmem mode NVDIMM and create a pmem
> namespace which starts from the end of the reserved space. Because the
> reservation information is written in the namespace in the NVDIMM
> label storage area, every OS that follows the namespace spec would not
> mistakenly write files in the reserved area. I prefer to the same way
> if Xen is going to do the reservation. We definitely don't want dom0
> to break the label storage area, so Xen seemingly needs to emulate the
> corresponding _DSM functions for dom0? If so, which part, the
> hypervisor or the toolstack, should do the emulation?

But we do not want Xen to do the reservation. The control guest (Dom0)
is the one that will mount the NVDIMM, extract the system ranges from
the files on the NVDIMM - and glue them to a guest.

It is also the job of Dom0 to actually partition the NVDIMM as it sees
fit. Actually, let me step back. It is the job of whichever guest has
the full NVDIMM assigned to it. At boot that is Dom0 - but you can very
well 'unplug' the NVDIMM from Dom0 and assign it wholesale to a guest.

Granted, at that point the _DSM operations have to go through QEMU,
which ends up calling the dom0 ioctls on PMEM to do the operation
(like getting the SMART data).
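
One concrete way for Dom0 to extract a file's ranges, as described
above, is the FIEMAP ioctl; a rough userspace sketch follows (the
reported offsets are relative to the backing device and would still
need translating to SPAs, which is not shown):

  #include <stdio.h>
  #include <stdlib.h>
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>
  #include <linux/fiemap.h>

  int main(int argc, char **argv)
  {
      struct fiemap *fm;
      unsigned int i, max_extents = 32;   /* assume a small file here */
      int fd;

      if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
          return 1;

      fm = calloc(1, sizeof(*fm) + max_extents * sizeof(struct fiemap_extent));
      if (!fm)
          return 1;
      fm->fm_length = ~0ULL;              /* query the whole file */
      fm->fm_extent_count = max_extents;

      if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
          perror("FS_IOC_FIEMAP");
          return 1;
      }
      for (i = 0; i < fm->fm_mapped_extents; i++)
          printf("extent %u: device offset 0x%llx, length 0x%llx\n", i,
                 (unsigned long long)fm->fm_extents[i].fe_physical,
                 (unsigned long long)fm->fm_extents[i].fe_length);
      return 0;
  }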
> 
> Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 11:05                                   ` Ian Jackson
@ 2016-03-17 13:37                                     ` Haozhong Zhang
  2016-03-17 13:56                                       ` Jan Beulich
  2016-03-17 14:12                                       ` Xu, Quan
  0 siblings, 2 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-17 13:37 UTC (permalink / raw)
  To: Ian Jackson, Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 03/17/16 11:05, Ian Jackson wrote:
> Jan Beulich writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> > So that again leaves unaddressed the question of what you
> > imply to do when a guest elects to use such a page as page
> > table. I'm afraid any attempt of yours to invent something that
> > is not struct page_info will not be suitable for all possible needs.
> 
> It is not clear to me whether this is a realistic thing for a guest to
> want to do.  Haozhong, maybe you want to consider this aspect.
>

For HVM guests, it is their own responsibility not to grant (e.g. in
xen-blk/net drivers) a vNVDIMM page containing page tables to others.

For PV guests (if we add vNVDIMM support for them in the future), as
I'm going to use the page_info struct for it, I suppose the current
mechanism in Xen can handle this case. I'm not familiar with PV memory
management and have to admit I didn't find the exact code that handles
the case where a memory page contains a guest page table. Jan, could
you point me to the code I should follow to understand what Xen does
in this case?

Thanks,
Haozhong

> If you can come up with an argument why it is OK to simply not permit
> this, then maybe the recordkeeping requirements can be relaxed ?
> 
> Ian.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 13:29                                           ` Haozhong Zhang
@ 2016-03-17 13:52                                             ` Jan Beulich
  2016-03-17 14:00                                             ` Ian Jackson
  1 sibling, 0 replies; 121+ messages in thread
From: Jan Beulich @ 2016-03-17 13:52 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	George Dunlap, xen-devel, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

>>> On 17.03.16 at 14:29, <haozhong.zhang@intel.com> wrote:
> On 03/17/16 06:59, Jan Beulich wrote:
>> >>> On 17.03.16 at 13:44, <haozhong.zhang@intel.com> wrote:
>> > Hmm, making Xen has full control could at least make reserving space
>> > on NVDIMM easier. I guess full control does not include manipulating
>> > file systems on NVDIMM which can be still left to dom0?
>> > 
>> > Then there is another problem (which also exists in the current
>> > design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM
>> > that access label storage area (for namespace) for example:
>> > 
>> > The way Linux reserving space on pmem mode NVDIMM is to leave the
>> > reserved space at the beginning of pmem mode NVDIMM and create a pmem
>> > namespace which starts from the end of the reserved space. Because the
>> > reservation information is written in the namespace in the NVDIMM
>> > label storage area, every OS that follows the namespace spec would not
>> > mistakenly write files in the reserved area. I prefer to the same way
>> > if Xen is going to do the reservation. We definitely don't want dom0
>> > to break the label storage area, so Xen seemingly needs to emulate the
>> > corresponding _DSM functions for dom0? If so, which part, the
>> > hypervisor or the toolstack, should do the emulation?
>> 
>> I don't think I can answer all but the very last point: Of course this
>> can't be done in the tool stack, since afaict the Dom0 kernel will
>> want to evaluate _DSM before the tool stack even runs.
> 
> Or, we could modify dom0 kernel to just use the label storage area as is
> and does not modify it. Can xen hypervisor trust dom0 kernel in this aspect?

I think so, yes.

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 13:37                                     ` Haozhong Zhang
@ 2016-03-17 13:56                                       ` Jan Beulich
  2016-03-17 14:22                                         ` Haozhong Zhang
  2016-03-17 14:12                                       ` Xu, Quan
  1 sibling, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-03-17 13:56 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 17.03.16 at 14:37, <haozhong.zhang@intel.com> wrote:
> On 03/17/16 11:05, Ian Jackson wrote:
>> Jan Beulich writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for 
> Xen"):
>> > So that again leaves unaddressed the question of what you
>> > imply to do when a guest elects to use such a page as page
>> > table. I'm afraid any attempt of yours to invent something that
>> > is not struct page_info will not be suitable for all possible needs.
>> 
>> It is not clear to me whether this is a realistic thing for a guest to
>> want to do.  Haozhong, maybe you want to consider this aspect.
>>
> 
> For HVM guests, it's themselves responsibility to not grant (e.g. in
> xen-blk/net drivers) a vNVDIMM page containing page tables to others.
> 
> For PV guests (if we add vNVDIMM support for them in future), as I'm
> going to use page_info struct for it, I suppose the current mechanism
> in Xen can handle this case. I'm not familiar with PV memory
> management and have to admit I didn't find the exact code that handles
> the case that a memory page contains the guest page table. Jan, could
> you indicate the code that I can follow to understand what xen does in
> this case?

xen/arch/x86/mm.c has functions like __get_page_type(),
alloc_page_type(), alloc_l[1234]_table(), and mod_l[1234]_entry()
which all participate in this.

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 13:29                                           ` Haozhong Zhang
  2016-03-17 13:52                                             ` Jan Beulich
@ 2016-03-17 14:00                                             ` Ian Jackson
  2016-03-17 14:21                                               ` Haozhong Zhang
  1 sibling, 1 reply; 121+ messages in thread
From: Ian Jackson @ 2016-03-17 14:00 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Jun Nakajima, Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, George Dunlap,
	xen-devel, Jan Beulich, Xiao Guangrong, Keir Fraser

Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> QEMU keeps mappings of guest memory because (1) that mapping is
> created by itself, and/or (2) certain device emulation needs to access
> the guest memory. But for vNVDIMM, I'm going to move the creation of
> its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
> access vNVDIMM pages mapped to guest, so it's not necessary to let
> qemu keeps vNVDIMM mappings.

I'm confused by this.

Suppose a guest uses an emulated device (or backend) provided by qemu,
to do DMA to a vNVDIMM.  Then qemu will need to map the real NVDIMM
pages into its own address space, so that it can write to the memory
(ie, do the virtual DMA).

That virtual DMA might well involve a direct mapping in the kernel
underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
directly to the NVDIMM, and with luck the actual device backing the
virtual device will be able to DMA to the NVDIMM.

All of this seems to me to mean that qemu needs to be able to map
its guest's parts of NVDIMMs.

There are probably other examples: memory inspection systems used by
virus scanners etc.; debuggers used to inspect a guest from outside;
etc.

I haven't even got started on save/restore...

Ian.


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 13:37                                     ` Haozhong Zhang
  2016-03-17 13:56                                       ` Jan Beulich
@ 2016-03-17 14:12                                       ` Xu, Quan
  2016-03-17 14:22                                         ` Zhang, Haozhong
  1 sibling, 1 reply; 121+ messages in thread
From: Xu, Quan @ 2016-03-17 14:12 UTC (permalink / raw)
  To: Zhang, Haozhong
  Cc: Juergen Gross, Tian, Kevin, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Nakajima, Jun, Xiao Guangrong,
	Keir Fraser

On March 17, 2016 9:37pm, Haozhong Zhang <haozhong.zhang@intel.com> wrote:
> For PV guests (if we add vNVDIMM support for them in future), as I'm going to
> use page_info struct for it, I suppose the current mechanism in Xen can handle
> this case. I'm not familiar with PV memory management 

The web page below may be helpful:
http://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management

:)
Quan

> and have to admit I
> didn't find the exact code that handles the case that a memory page contains
> the guest page table. Jan, could you indicate the code that I can follow to
> understand what xen does in this case?


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 14:00                                             ` Ian Jackson
@ 2016-03-17 14:21                                               ` Haozhong Zhang
  2016-03-29  8:47                                                 ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-17 14:21 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Jun Nakajima, Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, George Dunlap,
	xen-devel, Jan Beulich, Xiao Guangrong, Keir Fraser

On 03/17/16 14:00, Ian Jackson wrote:
> Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> > QEMU keeps mappings of guest memory because (1) that mapping is
> > created by itself, and/or (2) certain device emulation needs to access
> > the guest memory. But for vNVDIMM, I'm going to move the creation of
> > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
> > access vNVDIMM pages mapped to guest, so it's not necessary to let
> > qemu keeps vNVDIMM mappings.
> 
> I'm confused by this.
> 
> Suppose a guest uses an emulated device (or backend) provided by qemu,
> to do DMA to an vNVDIMM.  Then qemu will need to map the real NVDIMM
> pages into its own address space, so that it can write to the memory
> (ie, do the virtual DMA).
> 
> That virtual DMA might well involve a direct mapping in the kernel
> underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
> directly to the NVDIMM, and with luck the actual device backing the
> virtual device will be able to DMA to the NVDIMM.
> 
> All of this seems to me to mean that qemu needs to be able to map
> its guest's parts of NVDIMMs
> 
> There are probably other example: memory inspection systems used by
> virus scanners etc.; debuggers used to inspect a guest from outside;
> etc.
> 
> I haven't even got started on save/restore...
> 

Oops, so many cases I missed. Thanks Ian for pointing out all these!
Now I need to reconsider how to manage guest permissions for NVDIMM pages.

Haozhong

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 14:12                                       ` Xu, Quan
@ 2016-03-17 14:22                                         ` Zhang, Haozhong
  0 siblings, 0 replies; 121+ messages in thread
From: Zhang, Haozhong @ 2016-03-17 14:22 UTC (permalink / raw)
  To: Xu, Quan
  Cc: Juergen Gross, Tian, Kevin, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Nakajima, Jun, Xiao Guangrong,
	Keir Fraser

On 03/17/16 22:12, Xu, Quan wrote:
> On March 17, 2016 9:37pm, Haozhong Zhang <haozhong.zhang@intel.com> wrote:
> > For PV guests (if we add vNVDIMM support for them in future), as I'm going to
> > use page_info struct for it, I suppose the current mechanism in Xen can handle
> > this case. I'm not familiar with PV memory management 
> 
> The below web may be helpful:
> http://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management
> 
> :)
> Quan
> 

Thanks!

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 13:56                                       ` Jan Beulich
@ 2016-03-17 14:22                                         ` Haozhong Zhang
  0 siblings, 0 replies; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-17 14:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 03/17/16 07:56, Jan Beulich wrote:
> >>> On 17.03.16 at 14:37, <haozhong.zhang@intel.com> wrote:
> > On 03/17/16 11:05, Ian Jackson wrote:
> >> Jan Beulich writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for 
> > Xen"):
> >> > So that again leaves unaddressed the question of what you
> >> > imply to do when a guest elects to use such a page as page
> >> > table. I'm afraid any attempt of yours to invent something that
> >> > is not struct page_info will not be suitable for all possible needs.
> >> 
> >> It is not clear to me whether this is a realistic thing for a guest to
> >> want to do.  Haozhong, maybe you want to consider this aspect.
> >>
> > 
> > For HVM guests, it's themselves responsibility to not grant (e.g. in
> > xen-blk/net drivers) a vNVDIMM page containing page tables to others.
> > 
> > For PV guests (if we add vNVDIMM support for them in future), as I'm
> > going to use page_info struct for it, I suppose the current mechanism
> > in Xen can handle this case. I'm not familiar with PV memory
> > management and have to admit I didn't find the exact code that handles
> > the case that a memory page contains the guest page table. Jan, could
> > you indicate the code that I can follow to understand what xen does in
> > this case?
> 
> xen/arch/x86/mm.c has functions like __get_page_type(),
> alloc_page_type(), alloc_l[1234]_table(), and mod_l[1234]_entry()
> which all participate in this.
>

Thanks!

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-17 14:21                                               ` Haozhong Zhang
@ 2016-03-29  8:47                                                 ` Haozhong Zhang
  2016-03-29  9:11                                                   ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-29  8:47 UTC (permalink / raw)
  To: Ian Jackson, Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 03/17/16 22:21, Haozhong Zhang wrote:
> On 03/17/16 14:00, Ian Jackson wrote:
> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> > > QEMU keeps mappings of guest memory because (1) that mapping is
> > > created by itself, and/or (2) certain device emulation needs to access
> > > the guest memory. But for vNVDIMM, I'm going to move the creation of
> > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
> > > access vNVDIMM pages mapped to guest, so it's not necessary to let
> > > qemu keeps vNVDIMM mappings.
> > 
> > I'm confused by this.
> > 
> > Suppose a guest uses an emulated device (or backend) provided by qemu,
> > to do DMA to an vNVDIMM.  Then qemu will need to map the real NVDIMM
> > pages into its own address space, so that it can write to the memory
> > (ie, do the virtual DMA).
> > 
> > That virtual DMA might well involve a direct mapping in the kernel
> > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
> > directly to the NVDIMM, and with luck the actual device backing the
> > virtual device will be able to DMA to the NVDIMM.
> > 
> > All of this seems to me to mean that qemu needs to be able to map
> > its guest's parts of NVDIMMs
> > 
> > There are probably other example: memory inspection systems used by
> > virus scanners etc.; debuggers used to inspect a guest from outside;
> > etc.
> > 
> > I haven't even got started on save/restore...
> > 
> 
> Oops, so many cases I missed. Thanks Ian for pointing out all these!
> Now I need to reconsider how to manage guest permissions for NVDIMM pages.
> 

I still cannot find a neat approach to managing guest permissions for
NVDIMM pages. A possible one is to use a per-domain bitmap to track
permissions, with each bit corresponding to an NVDIMM page. The bitmap
can save a lot of space and can even be stored in normal RAM, but
operating on it for a large NVDIMM range, especially a contiguous one,
is slower than a rangeset.
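
To put rough numbers on the bitmap idea (illustrative figures only): a
1 TiB pmem region has 1 TiB / 4 KiB = 2^28 pages, so the per-domain
bitmap costs 2^28 bits = 32 MiB, independent of how fragmented the
granted ranges are, whereas a rangeset costs one struct range per
discontiguous range.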

BTW, if I take the other approach to mapping NVDIMM pages to the guest
(http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html)
| 2. Or, given the same inputs, we may combine above two steps into a new
|    dom0 system call that (1) gets the SPA ranges, (2) calls xen
|    hypercall to map SPA ranges
and treat NVDIMM as normal RAM, will Xen then not need a rangeset or
the above bitmap to track guest permissions for NVDIMM? But looking at
how QEMU currently populates guest memory via XENMEM_populate_physmap,
and at other hypercalls like XENMEM_[in|de]crease_reservation, it looks
like mapping a _dedicated_ piece of host RAM to a guest is not allowed
outside the hypervisor (and not allowed even in the dom0 kernel)? Is
that for security reasons, e.g. avoiding a malfunctioning dom0 leaking
guest memory?

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-29  8:47                                                 ` Haozhong Zhang
@ 2016-03-29  9:11                                                   ` Jan Beulich
  2016-03-29 10:10                                                     ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-03-29  9:11 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 29.03.16 at 10:47, <haozhong.zhang@intel.com> wrote:
> On 03/17/16 22:21, Haozhong Zhang wrote:
>> On 03/17/16 14:00, Ian Jackson wrote:
>> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
>> > > QEMU keeps mappings of guest memory because (1) that mapping is
>> > > created by itself, and/or (2) certain device emulation needs to access
>> > > the guest memory. But for vNVDIMM, I'm going to move the creation of
>> > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
>> > > access vNVDIMM pages mapped to guest, so it's not necessary to let
>> > > qemu keeps vNVDIMM mappings.
>> > 
>> > I'm confused by this.
>> > 
>> > Suppose a guest uses an emulated device (or backend) provided by qemu,
>> > to do DMA to an vNVDIMM.  Then qemu will need to map the real NVDIMM
>> > pages into its own address space, so that it can write to the memory
>> > (ie, do the virtual DMA).
>> > 
>> > That virtual DMA might well involve a direct mapping in the kernel
>> > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
>> > directly to the NVDIMM, and with luck the actual device backing the
>> > virtual device will be able to DMA to the NVDIMM.
>> > 
>> > All of this seems to me to mean that qemu needs to be able to map
>> > its guest's parts of NVDIMMs
>> > 
>> > There are probably other example: memory inspection systems used by
>> > virus scanners etc.; debuggers used to inspect a guest from outside;
>> > etc.
>> > 
>> > I haven't even got started on save/restore...
>> > 
>> 
>> Oops, so many cases I missed. Thanks Ian for pointing out all these!
>> Now I need to reconsider how to manage guest permissions for NVDIMM pages.
>> 
> 
> I still cannot find a neat approach to manage guest permissions for
> nvdimm pages. A possible one is to use a per-domain bitmap to track
> permissions: each bit corresponding to an nvdimm page. The bitmap can
> save lots of spaces and even be stored in the normal ram, but
> operating it for a large nvdimm range, especially for a contiguous
> one, is slower than rangeset.

I don't follow: What would a single bit in that bitmap mean? Any
guest may access the page? That surely wouldn't be what we
need.

> BTW, if I take the other way to map nvdimm pages to guest
> (http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html)
> | 2. Or, given the same inputs, we may combine above two steps into a new
> |    dom0 system call that (1) gets the SPA ranges, (2) calls xen
> |    hypercall to map SPA ranges
> and treat nvdimm as normal ram, then xen will not need to use rangeset
> or above bitmap to track guest permissions for nvdimm? But looking at
> how qemu currently populates guest memory via XENMEM_populate_physmap
> , and other hypercalls like XENMEM_[in|de]crease_reservation, it looks
> like that mapping a _dedicated_ piece of host ram to guest is not
> allowed out of the hypervisor (and not allowed even in dom0 kernel)?
> Is it for security concerns, e.g. avoiding a malfunctioned dom0 leaking
> guest memory?

Well, it's simply because RAM is a resource managed through
allocation/freeing, instead of via reserving chunks for special
purposes.

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-29  9:11                                                   ` Jan Beulich
@ 2016-03-29 10:10                                                     ` Haozhong Zhang
  2016-03-29 10:49                                                       ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-03-29 10:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 03/29/16 03:11, Jan Beulich wrote:
> >>> On 29.03.16 at 10:47, <haozhong.zhang@intel.com> wrote:
> > On 03/17/16 22:21, Haozhong Zhang wrote:
> >> On 03/17/16 14:00, Ian Jackson wrote:
> >> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> >> > > QEMU keeps mappings of guest memory because (1) that mapping is
> >> > > created by itself, and/or (2) certain device emulation needs to access
> >> > > the guest memory. But for vNVDIMM, I'm going to move the creation of
> >> > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
> >> > > access vNVDIMM pages mapped to guest, so it's not necessary to let
> >> > > qemu keeps vNVDIMM mappings.
> >> > 
> >> > I'm confused by this.
> >> > 
> >> > Suppose a guest uses an emulated device (or backend) provided by qemu,
> >> > to do DMA to an vNVDIMM.  Then qemu will need to map the real NVDIMM
> >> > pages into its own address space, so that it can write to the memory
> >> > (ie, do the virtual DMA).
> >> > 
> >> > That virtual DMA might well involve a direct mapping in the kernel
> >> > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
> >> > directly to the NVDIMM, and with luck the actual device backing the
> >> > virtual device will be able to DMA to the NVDIMM.
> >> > 
> >> > All of this seems to me to mean that qemu needs to be able to map
> >> > its guest's parts of NVDIMMs
> >> > 
> >> > There are probably other example: memory inspection systems used by
> >> > virus scanners etc.; debuggers used to inspect a guest from outside;
> >> > etc.
> >> > 
> >> > I haven't even got started on save/restore...
> >> > 
> >> 
> >> Oops, so many cases I missed. Thanks Ian for pointing out all these!
> >> Now I need to reconsider how to manage guest permissions for NVDIMM pages.
> >> 
> > 
> > I still cannot find a neat approach to manage guest permissions for
> > nvdimm pages. A possible one is to use a per-domain bitmap to track
> > permissions: each bit corresponding to an nvdimm page. The bitmap can
> > save lots of spaces and even be stored in the normal ram, but
> > operating it for a large nvdimm range, especially for a contiguous
> > one, is slower than rangeset.
> 
> I don't follow: What would a single bit in that bitmap mean? Any
> guest may access the page? That surely wouldn't be what we
> need.
>

For a host with N pages of NVDIMM, each domain would have an N-bit
bitmap. If the m-th bit of a domain's bitmap is set, then that domain
has permission to access the m-th host NVDIMM page.
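
A minimal sketch of such a per-domain bitmap (plain C, for
illustration only; all names are made up):

  #include <limits.h>
  #include <stdbool.h>

  /* One bit per host NVDIMM page; bit m set => this domain may access
   * host NVDIMM page m. */
  #define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

  static inline void nvdimm_perm_set(unsigned long *bitmap, unsigned long m)
  {
      bitmap[m / BITS_PER_WORD] |= 1UL << (m % BITS_PER_WORD);
  }

  static inline void nvdimm_perm_clear(unsigned long *bitmap, unsigned long m)
  {
      bitmap[m / BITS_PER_WORD] &= ~(1UL << (m % BITS_PER_WORD));
  }

  static inline bool nvdimm_perm_test(const unsigned long *bitmap,
                                      unsigned long m)
  {
      return bitmap[m / BITS_PER_WORD] & (1UL << (m % BITS_PER_WORD));
  }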

> > BTW, if I take the other way to map nvdimm pages to guest
> > (http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html)
> > | 2. Or, given the same inputs, we may combine above two steps into a new
> > |    dom0 system call that (1) gets the SPA ranges, (2) calls xen
> > |    hypercall to map SPA ranges
> > and treat nvdimm as normal ram, then xen will not need to use rangeset
> > or above bitmap to track guest permissions for nvdimm? But looking at
> > how qemu currently populates guest memory via XENMEM_populate_physmap
> > , and other hypercalls like XENMEM_[in|de]crease_reservation, it looks
> > like that mapping a _dedicated_ piece of host ram to guest is not
> > allowed out of the hypervisor (and not allowed even in dom0 kernel)?
> > Is it for security concerns, e.g. avoiding a malfunctioned dom0 leaking
> > guest memory?
> 
> Well, it's simply because RAM is a resource managed through
> allocation/freeing, instead of via reserving chunks for special
> purposes.
> 

So that means Xen can always ensure the RAM assigned to a guest is
RAM the guest is permitted to access, so no data structure like
iomem_caps is needed for RAM. If I have to introduce a hypercall that
maps dedicated host RAM/NVDIMM to a guest, then explicit permission
management is still needed, regardless of who (dom0 kernel, QEMU or
toolstack) will use it. Right?

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-29 10:10                                                     ` Haozhong Zhang
@ 2016-03-29 10:49                                                       ` Jan Beulich
  2016-04-08  5:02                                                         ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-03-29 10:49 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 29.03.16 at 12:10, <haozhong.zhang@intel.com> wrote:
> On 03/29/16 03:11, Jan Beulich wrote:
>> >>> On 29.03.16 at 10:47, <haozhong.zhang@intel.com> wrote:
>> > On 03/17/16 22:21, Haozhong Zhang wrote:
>> >> On 03/17/16 14:00, Ian Jackson wrote:
>> >> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM 
> support for Xen"):
>> >> > > QEMU keeps mappings of guest memory because (1) that mapping is
>> >> > > created by itself, and/or (2) certain device emulation needs to access
>> >> > > the guest memory. But for vNVDIMM, I'm going to move the creation of
>> >> > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
>> >> > > access vNVDIMM pages mapped to guest, so it's not necessary to let
>> >> > > qemu keeps vNVDIMM mappings.
>> >> > 
>> >> > I'm confused by this.
>> >> > 
>> >> > Suppose a guest uses an emulated device (or backend) provided by qemu,
>> >> > to do DMA to an vNVDIMM.  Then qemu will need to map the real NVDIMM
>> >> > pages into its own address space, so that it can write to the memory
>> >> > (ie, do the virtual DMA).
>> >> > 
>> >> > That virtual DMA might well involve a direct mapping in the kernel
>> >> > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
>> >> > directly to the NVDIMM, and with luck the actual device backing the
>> >> > virtual device will be able to DMA to the NVDIMM.
>> >> > 
>> >> > All of this seems to me to mean that qemu needs to be able to map
>> >> > its guest's parts of NVDIMMs
>> >> > 
>> >> > There are probably other example: memory inspection systems used by
>> >> > virus scanners etc.; debuggers used to inspect a guest from outside;
>> >> > etc.
>> >> > 
>> >> > I haven't even got started on save/restore...
>> >> > 
>> >> 
>> >> Oops, so many cases I missed. Thanks Ian for pointing out all these!
>> >> Now I need to reconsider how to manage guest permissions for NVDIMM pages.
>> >> 
>> > 
>> > I still cannot find a neat approach to manage guest permissions for
>> > nvdimm pages. A possible one is to use a per-domain bitmap to track
>> > permissions: each bit corresponding to an nvdimm page. The bitmap can
>> > save lots of spaces and even be stored in the normal ram, but
>> > operating it for a large nvdimm range, especially for a contiguous
>> > one, is slower than rangeset.
>> 
>> I don't follow: What would a single bit in that bitmap mean? Any
>> guest may access the page? That surely wouldn't be what we
>> need.
>>
> 
> For a host having a N pages of nvdimm, each domain will have a N bits
> bitmap. If the m'th bit of a domain's bitmap is set, then that domain
> has the permission to access the m'th host nvdimm page.

Which will be more overhead as soon as there are enough such
domains in a system.

>> > BTW, if I take the other way to map nvdimm pages to guest
>> > (http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html)
>> > | 2. Or, given the same inputs, we may combine above two steps into a new
>> > |    dom0 system call that (1) gets the SPA ranges, (2) calls xen
>> > |    hypercall to map SPA ranges
>> > and treat nvdimm as normal ram, then xen will not need to use rangeset
>> > or above bitmap to track guest permissions for nvdimm? But looking at
>> > how qemu currently populates guest memory via XENMEM_populate_physmap
>> > , and other hypercalls like XENMEM_[in|de]crease_reservation, it looks
>> > like that mapping a _dedicated_ piece of host ram to guest is not
>> > allowed out of the hypervisor (and not allowed even in dom0 kernel)?
>> > Is it for security concerns, e.g. avoiding a malfunctioned dom0 leaking
>> > guest memory?
>> 
>> Well, it's simply because RAM is a resource managed through
>> allocation/freeing, instead of via reserving chunks for special
>> purposes.
>> 
> 
> So that means xen can always ensure the ram assigned to a guest is
> what the guest is permitted to access, so no data structures like
> iomem_caps is needed for ram. If I have to introduce a hypercall that
> maps the dedicated host ram/nvdimm to guest, then the explicit
> permission management is still needed, regardless of who (dom0 kernel,
> qemu or toolstack) will use it. Right?

Yes (if you really mean to go that route).

Jan


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-03-29 10:49                                                       ` Jan Beulich
@ 2016-04-08  5:02                                                         ` Haozhong Zhang
  2016-04-08 15:52                                                           ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-04-08  5:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 03/29/16 04:49, Jan Beulich wrote:
> >>> On 29.03.16 at 12:10, <haozhong.zhang@intel.com> wrote:
> > On 03/29/16 03:11, Jan Beulich wrote:
> >> >>> On 29.03.16 at 10:47, <haozhong.zhang@intel.com> wrote:
[..]
> >> > I still cannot find a neat approach to manage guest permissions for
> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track
> >> > permissions: each bit corresponding to an nvdimm page. The bitmap can
> >> > save lots of spaces and even be stored in the normal ram, but
> >> > operating it for a large nvdimm range, especially for a contiguous
> >> > one, is slower than rangeset.
> >> 
> >> I don't follow: What would a single bit in that bitmap mean? Any
> >> guest may access the page? That surely wouldn't be what we
> >> need.
> >>
> > 
> > For a host having a N pages of nvdimm, each domain will have a N bits
> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain
> > has the permission to access the m'th host nvdimm page.
> 
> Which will be more overhead as soon as there are enough such
> domains in a system.
>

Sorry for the late reply.

I think we can make some optimizations to reduce the space consumed by
the bitmap.

A per-domain bitmap covering the entire host NVDIMM address range is
wasteful, especially if the actually used ranges are congregated. We
may take the following approaches to reduce its space.

1) Split the per-domain bitmap into multiple sub-bitmaps, each of
   which covers a smaller, contiguous sub-range of the host NVDIMM
   address space. In the beginning, no sub-bitmap is allocated for the
   domain. When a domain is granted access permission to a host NVDIMM
   page in a sub-range, only the sub-bitmap for that sub-range is
   allocated for the domain. If access permissions to all host NVDIMM
   pages in a sub-range are removed from a domain, the corresponding
   sub-bitmap can be freed.

2) If a domain has access permissions to all host NVDIMM pages in a
   sub-range, the corresponding sub-bitmap is replaced by a range
   struct. If adjacent sub-ranges are tracked by range structs, they
   are merged into one range struct. If access permissions to some
   pages in such a sub-range are removed from a domain, the range
   struct is converted back to bitmap segment(s).

3) Because there might be lots of the above bitmap segments and range
   structs per domain, we can organize them in a balanced interval
   tree to quickly search/add/remove an individual structure.

In the worst case, where each sub-range has non-contiguous pages
assigned to a domain, the above solution will use all sub-bitmaps and
consume more space than a single bitmap because of the extra space for
organization. I assume the sysadmin should be responsible for keeping
the host nvdimm ranges assigned to each domain as contiguous and
congregated as possible in order to avoid the worst case. However, if
the worst case does happen, the xen hypervisor should refuse to assign
nvdimm to a guest when it runs out of memory.
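
To make the structure concrete, a rough sketch of what one per-domain
tracking node could look like (all names are made up for illustration;
none of this is existing Xen code):

    /* Hypothetical per-domain tracking node covering one sub-range. */
    struct nvdimm_perm_node {
        unsigned long first_mfn;  /* first host pmem page of this sub-range */
        unsigned long last_mfn;   /* last host pmem page of this sub-range */
        unsigned long *bitmap;    /* NULL: whole sub-range permitted (case 2);
                                   * otherwise one bit per page (case 1) */
        /* balanced interval-tree linkage (case 3) omitted for brevity */
    };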

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-04-08  5:02                                                         ` Haozhong Zhang
@ 2016-04-08 15:52                                                           ` Jan Beulich
  2016-04-12  8:45                                                             ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-04-08 15:52 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 08.04.16 at 07:02, <haozhong.zhang@intel.com> wrote:
> On 03/29/16 04:49, Jan Beulich wrote:
>> >>> On 29.03.16 at 12:10, <haozhong.zhang@intel.com> wrote:
>> > On 03/29/16 03:11, Jan Beulich wrote:
>> >> >>> On 29.03.16 at 10:47, <haozhong.zhang@intel.com> wrote:
> [..]
>> >> > I still cannot find a neat approach to manage guest permissions for
>> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track
>> >> > permissions: each bit corresponding to an nvdimm page. The bitmap can
>> >> > save lots of spaces and even be stored in the normal ram, but
>> >> > operating it for a large nvdimm range, especially for a contiguous
>> >> > one, is slower than rangeset.
>> >> 
>> >> I don't follow: What would a single bit in that bitmap mean? Any
>> >> guest may access the page? That surely wouldn't be what we
>> >> need.
>> >>
>> > 
>> > For a host having a N pages of nvdimm, each domain will have a N bits
>> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain
>> > has the permission to access the m'th host nvdimm page.
>> 
>> Which will be more overhead as soon as there are enough such
>> domains in a system.
>>
> 
> Sorry for the late reply.
> 
> I think we can make some optimization to reduce the space consumed by
> the bitmap.
> 
> A per-domain bitmap covering the entire host NVDIMM address range is
> wasteful especially if the actual used ranges are congregated. We may
> take following ways to reduce its space.
> 
> 1) Split the per-domain bitmap into multiple sub-bitmap and each
>    sub-bitmap covers a smaller and contiguous sub host NVDIMM address
>    range. In the beginning, no sub-bitmap is allocated for the
>    domain. If the access permission to a host NVDIMM page in a sub
>    host address range is added to a domain, only the sub-bitmap for
>    that address range is allocated for the domain. If access
>    permissions to all host NVDIMM pages in a sub range are removed
>    from a domain, the corresponding sub-bitmap can be freed.
> 
> 2) If a domain has access permissions to all host NVDIMM pages in a
>    sub range, the corresponding sub-bitmap will be replaced by a range
>    struct. If range structs are used to track adjacent ranges, they
>    will be merged into one range struct. If access permissions to some
>    pages in that sub range are removed from a domain, the range struct
>    should be converted back to bitmap segment(s).
> 
> 3) Because there might be lots of above bitmap segments and range
>    structs per-domain, we can organize them in a balanced interval
>    tree to quickly search/add/remove an individual structure.
> 
> In the worst case that each sub range has non-contiguous pages
> assigned to a domain, above solution will use all sub-bitmaps and
> consume more space than a single bitmap because of the extra space for
> organization. I assume that the sysadmin should be responsible to
> ensure the host nvdimm ranges assigned to each domain as contiguous
> and congregated as possible in order to avoid the worst case. However,
> if the worst case does happen, xen hypervisor should refuse to assign
> nvdimm to guest when it runs out of memory.

To be honest, this all sounds pretty unconvincing wrt not using
existing code paths - a lot of special treatment, and hence a lot
of things that can go (slightly) wrong.

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-04-08 15:52                                                           ` Jan Beulich
@ 2016-04-12  8:45                                                             ` Haozhong Zhang
  2016-04-21  5:09                                                               ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-04-12  8:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 04/08/16 09:52, Jan Beulich wrote:
> >>> On 08.04.16 at 07:02, <haozhong.zhang@intel.com> wrote:
> > On 03/29/16 04:49, Jan Beulich wrote:
> >> >>> On 29.03.16 at 12:10, <haozhong.zhang@intel.com> wrote:
> >> > On 03/29/16 03:11, Jan Beulich wrote:
> >> >> >>> On 29.03.16 at 10:47, <haozhong.zhang@intel.com> wrote:
> > [..]
> >> >> > I still cannot find a neat approach to manage guest permissions for
> >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track
> >> >> > permissions: each bit corresponding to an nvdimm page. The bitmap can
> >> >> > save lots of spaces and even be stored in the normal ram, but
> >> >> > operating it for a large nvdimm range, especially for a contiguous
> >> >> > one, is slower than rangeset.
> >> >> 
> >> >> I don't follow: What would a single bit in that bitmap mean? Any
> >> >> guest may access the page? That surely wouldn't be what we
> >> >> need.
> >> >>
> >> > 
> >> > For a host having a N pages of nvdimm, each domain will have a N bits
> >> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain
> >> > has the permission to access the m'th host nvdimm page.
> >> 
> >> Which will be more overhead as soon as there are enough such
> >> domains in a system.
> >>
> > 
> > Sorry for the late reply.
> > 
> > I think we can make some optimization to reduce the space consumed by
> > the bitmap.
> > 
> > A per-domain bitmap covering the entire host NVDIMM address range is
> > wasteful especially if the actual used ranges are congregated. We may
> > take following ways to reduce its space.
> > 
> > 1) Split the per-domain bitmap into multiple sub-bitmap and each
> >    sub-bitmap covers a smaller and contiguous sub host NVDIMM address
> >    range. In the beginning, no sub-bitmap is allocated for the
> >    domain. If the access permission to a host NVDIMM page in a sub
> >    host address range is added to a domain, only the sub-bitmap for
> >    that address range is allocated for the domain. If access
> >    permissions to all host NVDIMM pages in a sub range are removed
> >    from a domain, the corresponding sub-bitmap can be freed.
> > 
> > 2) If a domain has access permissions to all host NVDIMM pages in a
> >    sub range, the corresponding sub-bitmap will be replaced by a range
> >    struct. If range structs are used to track adjacent ranges, they
> >    will be merged into one range struct. If access permissions to some
> >    pages in that sub range are removed from a domain, the range struct
> >    should be converted back to bitmap segment(s).
> > 
> > 3) Because there might be lots of above bitmap segments and range
> >    structs per-domain, we can organize them in a balanced interval
> >    tree to quickly search/add/remove an individual structure.
> > 
> > In the worst case that each sub range has non-contiguous pages
> > assigned to a domain, above solution will use all sub-bitmaps and
> > consume more space than a single bitmap because of the extra space for
> > organization. I assume that the sysadmin should be responsible to
> > ensure the host nvdimm ranges assigned to each domain as contiguous
> > and congregated as possible in order to avoid the worst case. However,
> > if the worst case does happen, xen hypervisor should refuse to assign
> > nvdimm to guest when it runs out of memory.
> 
> To be honest, this all sounds pretty unconvincing wrt not using
> existing code paths - a lot of special treatment, and hence a lot
> of things that can go (slightly) wrong.
> 

Well, using the existing range structs to manage guest access
permissions to nvdimm could consume too much space, which might not
fit in either memory or nvdimm. If the above solution looks really
error-prone, perhaps we can still come back to the existing one and
restrict the number of range structs each domain may have for nvdimm
(e.g. reserve one 4K page per domain for them) to make it work for
nvdimm, though that may reject nvdimm mappings that are terribly
fragmented.
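
Just to give a rough feel for such a limit (assuming a range struct of
about 32 bytes, i.e. a list head plus start/end, which is an estimate
rather than a measured figure): one reserved 4K page per domain would
hold roughly 128 range structs, i.e. up to about 128 discontiguous
pmem extents per domain; anything more fragmented than that would have
to be rejected.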

Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-04-12  8:45                                                             ` Haozhong Zhang
@ 2016-04-21  5:09                                                               ` Haozhong Zhang
  2016-04-21  7:04                                                                 ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-04-21  5:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 04/12/16 16:45, Haozhong Zhang wrote:
> On 04/08/16 09:52, Jan Beulich wrote:
> > >>> On 08.04.16 at 07:02, <haozhong.zhang@intel.com> wrote:
> > > On 03/29/16 04:49, Jan Beulich wrote:
> > >> >>> On 29.03.16 at 12:10, <haozhong.zhang@intel.com> wrote:
> > >> > On 03/29/16 03:11, Jan Beulich wrote:
> > >> >> >>> On 29.03.16 at 10:47, <haozhong.zhang@intel.com> wrote:
> > > [..]
> > >> >> > I still cannot find a neat approach to manage guest permissions for
> > >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track
> > >> >> > permissions: each bit corresponding to an nvdimm page. The bitmap can
> > >> >> > save lots of spaces and even be stored in the normal ram, but
> > >> >> > operating it for a large nvdimm range, especially for a contiguous
> > >> >> > one, is slower than rangeset.
> > >> >> 
> > >> >> I don't follow: What would a single bit in that bitmap mean? Any
> > >> >> guest may access the page? That surely wouldn't be what we
> > >> >> need.
> > >> >>
> > >> > 
> > >> > For a host having a N pages of nvdimm, each domain will have a N bits
> > >> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain
> > >> > has the permission to access the m'th host nvdimm page.
> > >> 
> > >> Which will be more overhead as soon as there are enough such
> > >> domains in a system.
> > >>
> > > 
> > > Sorry for the late reply.
> > > 
> > > I think we can make some optimization to reduce the space consumed by
> > > the bitmap.
> > > 
> > > A per-domain bitmap covering the entire host NVDIMM address range is
> > > wasteful especially if the actual used ranges are congregated. We may
> > > take following ways to reduce its space.
> > > 
> > > 1) Split the per-domain bitmap into multiple sub-bitmap and each
> > >    sub-bitmap covers a smaller and contiguous sub host NVDIMM address
> > >    range. In the beginning, no sub-bitmap is allocated for the
> > >    domain. If the access permission to a host NVDIMM page in a sub
> > >    host address range is added to a domain, only the sub-bitmap for
> > >    that address range is allocated for the domain. If access
> > >    permissions to all host NVDIMM pages in a sub range are removed
> > >    from a domain, the corresponding sub-bitmap can be freed.
> > > 
> > > 2) If a domain has access permissions to all host NVDIMM pages in a
> > >    sub range, the corresponding sub-bitmap will be replaced by a range
> > >    struct. If range structs are used to track adjacent ranges, they
> > >    will be merged into one range struct. If access permissions to some
> > >    pages in that sub range are removed from a domain, the range struct
> > >    should be converted back to bitmap segment(s).
> > > 
> > > 3) Because there might be lots of above bitmap segments and range
> > >    structs per-domain, we can organize them in a balanced interval
> > >    tree to quickly search/add/remove an individual structure.
> > > 
> > > In the worst case that each sub range has non-contiguous pages
> > > assigned to a domain, above solution will use all sub-bitmaps and
> > > consume more space than a single bitmap because of the extra space for
> > > organization. I assume that the sysadmin should be responsible to
> > > ensure the host nvdimm ranges assigned to each domain as contiguous
> > > and congregated as possible in order to avoid the worst case. However,
> > > if the worst case does happen, xen hypervisor should refuse to assign
> > > nvdimm to guest when it runs out of memory.
> > 
> > To be honest, this all sounds pretty unconvincing wrt not using
> > existing code paths - a lot of special treatment, and hence a lot
> > of things that can go (slightly) wrong.
> > 
> 
> Well, using existing range struct to manage guest access permissions
> to nvdimm could consume too much space which could not fit in either
> memory or nvdimm. If the above solution looks really error-prone,
> perhaps we can still come back to the existing one and restrict the
> number of range structs each domain could have for nvdimm
> (e.g. reserve one 4K-page per-domain for them) to make it work for
> nvdimm, though it may reject nvdimm mapping that is terribly
> fragmented.

Hi Jan,

Any comments on this?

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-04-21  5:09                                                               ` Haozhong Zhang
@ 2016-04-21  7:04                                                                 ` Jan Beulich
  2016-04-22  2:36                                                                   ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-04-21  7:04 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 21.04.16 at 07:09, <haozhong.zhang@intel.com> wrote:
> On 04/12/16 16:45, Haozhong Zhang wrote:
>> On 04/08/16 09:52, Jan Beulich wrote:
>> > >>> On 08.04.16 at 07:02, <haozhong.zhang@intel.com> wrote:
>> > > On 03/29/16 04:49, Jan Beulich wrote:
>> > >> >>> On 29.03.16 at 12:10, <haozhong.zhang@intel.com> wrote:
>> > >> > On 03/29/16 03:11, Jan Beulich wrote:
>> > >> >> >>> On 29.03.16 at 10:47, <haozhong.zhang@intel.com> wrote:
>> > > [..]
>> > >> >> > I still cannot find a neat approach to manage guest permissions for
>> > >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track
>> > >> >> > permissions: each bit corresponding to an nvdimm page. The bitmap can
>> > >> >> > save lots of spaces and even be stored in the normal ram, but
>> > >> >> > operating it for a large nvdimm range, especially for a contiguous
>> > >> >> > one, is slower than rangeset.
>> > >> >> 
>> > >> >> I don't follow: What would a single bit in that bitmap mean? Any
>> > >> >> guest may access the page? That surely wouldn't be what we
>> > >> >> need.
>> > >> >>
>> > >> > 
>> > >> > For a host having a N pages of nvdimm, each domain will have a N bits
>> > >> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain
>> > >> > has the permission to access the m'th host nvdimm page.
>> > >> 
>> > >> Which will be more overhead as soon as there are enough such
>> > >> domains in a system.
>> > >>
>> > > 
>> > > Sorry for the late reply.
>> > > 
>> > > I think we can make some optimization to reduce the space consumed by
>> > > the bitmap.
>> > > 
>> > > A per-domain bitmap covering the entire host NVDIMM address range is
>> > > wasteful especially if the actual used ranges are congregated. We may
>> > > take following ways to reduce its space.
>> > > 
>> > > 1) Split the per-domain bitmap into multiple sub-bitmap and each
>> > >    sub-bitmap covers a smaller and contiguous sub host NVDIMM address
>> > >    range. In the beginning, no sub-bitmap is allocated for the
>> > >    domain. If the access permission to a host NVDIMM page in a sub
>> > >    host address range is added to a domain, only the sub-bitmap for
>> > >    that address range is allocated for the domain. If access
>> > >    permissions to all host NVDIMM pages in a sub range are removed
>> > >    from a domain, the corresponding sub-bitmap can be freed.
>> > > 
>> > > 2) If a domain has access permissions to all host NVDIMM pages in a
>> > >    sub range, the corresponding sub-bitmap will be replaced by a range
>> > >    struct. If range structs are used to track adjacent ranges, they
>> > >    will be merged into one range struct. If access permissions to some
>> > >    pages in that sub range are removed from a domain, the range struct
>> > >    should be converted back to bitmap segment(s).
>> > > 
>> > > 3) Because there might be lots of above bitmap segments and range
>> > >    structs per-domain, we can organize them in a balanced interval
>> > >    tree to quickly search/add/remove an individual structure.
>> > > 
>> > > In the worst case that each sub range has non-contiguous pages
>> > > assigned to a domain, above solution will use all sub-bitmaps and
>> > > consume more space than a single bitmap because of the extra space for
>> > > organization. I assume that the sysadmin should be responsible to
>> > > ensure the host nvdimm ranges assigned to each domain as contiguous
>> > > and congregated as possible in order to avoid the worst case. However,
>> > > if the worst case does happen, xen hypervisor should refuse to assign
>> > > nvdimm to guest when it runs out of memory.
>> > 
>> > To be honest, this all sounds pretty unconvincing wrt not using
>> > existing code paths - a lot of special treatment, and hence a lot
>> > of things that can go (slightly) wrong.
>> > 
>> 
>> Well, using existing range struct to manage guest access permissions
>> to nvdimm could consume too much space which could not fit in either
>> memory or nvdimm. If the above solution looks really error-prone,
>> perhaps we can still come back to the existing one and restrict the
>> number of range structs each domain could have for nvdimm
>> (e.g. reserve one 4K-page per-domain for them) to make it work for
>> nvdimm, though it may reject nvdimm mapping that is terribly
>> fragmented.
> 
> Hi Jan,
> 
> Any comments for this?

Well, nothing new, i.e. my previous opinion on the old proposal didn't
change. I'm really opposed to any artificial limitations here, as I am to
any secondary (and hence error-prone) code paths. IOW I continue
to think that there's no reasonable alternative to re-using the existing
memory management infrastructure for at least the PMEM case. The
only open question that remains is where to place the control structures,
and I think the thresholding proposal of yours was quite sensible.

Jan


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-04-21  7:04                                                                 ` Jan Beulich
@ 2016-04-22  2:36                                                                   ` Haozhong Zhang
  2016-04-22  8:24                                                                     ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-04-22  2:36 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 04/21/16 01:04, Jan Beulich wrote:
> >>> On 21.04.16 at 07:09, <haozhong.zhang@intel.com> wrote:
> > On 04/12/16 16:45, Haozhong Zhang wrote:
> >> On 04/08/16 09:52, Jan Beulich wrote:
> >> > >>> On 08.04.16 at 07:02, <haozhong.zhang@intel.com> wrote:
> >> > > On 03/29/16 04:49, Jan Beulich wrote:
> >> > >> >>> On 29.03.16 at 12:10, <haozhong.zhang@intel.com> wrote:
> >> > >> > On 03/29/16 03:11, Jan Beulich wrote:
> >> > >> >> >>> On 29.03.16 at 10:47, <haozhong.zhang@intel.com> wrote:
> >> > > [..]
> >> > >> >> > I still cannot find a neat approach to manage guest permissions for
> >> > >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track
> >> > >> >> > permissions: each bit corresponding to an nvdimm page. The bitmap can
> >> > >> >> > save lots of spaces and even be stored in the normal ram, but
> >> > >> >> > operating it for a large nvdimm range, especially for a contiguous
> >> > >> >> > one, is slower than rangeset.
> >> > >> >> 
> >> > >> >> I don't follow: What would a single bit in that bitmap mean? Any
> >> > >> >> guest may access the page? That surely wouldn't be what we
> >> > >> >> need.
> >> > >> >>
> >> > >> > 
> >> > >> > For a host having a N pages of nvdimm, each domain will have a N bits
> >> > >> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain
> >> > >> > has the permission to access the m'th host nvdimm page.
> >> > >> 
> >> > >> Which will be more overhead as soon as there are enough such
> >> > >> domains in a system.
> >> > >>
> >> > > 
> >> > > Sorry for the late reply.
> >> > > 
> >> > > I think we can make some optimization to reduce the space consumed by
> >> > > the bitmap.
> >> > > 
> >> > > A per-domain bitmap covering the entire host NVDIMM address range is
> >> > > wasteful especially if the actual used ranges are congregated. We may
> >> > > take following ways to reduce its space.
> >> > > 
> >> > > 1) Split the per-domain bitmap into multiple sub-bitmap and each
> >> > >    sub-bitmap covers a smaller and contiguous sub host NVDIMM address
> >> > >    range. In the beginning, no sub-bitmap is allocated for the
> >> > >    domain. If the access permission to a host NVDIMM page in a sub
> >> > >    host address range is added to a domain, only the sub-bitmap for
> >> > >    that address range is allocated for the domain. If access
> >> > >    permissions to all host NVDIMM pages in a sub range are removed
> >> > >    from a domain, the corresponding sub-bitmap can be freed.
> >> > > 
> >> > > 2) If a domain has access permissions to all host NVDIMM pages in a
> >> > >    sub range, the corresponding sub-bitmap will be replaced by a range
> >> > >    struct. If range structs are used to track adjacent ranges, they
> >> > >    will be merged into one range struct. If access permissions to some
> >> > >    pages in that sub range are removed from a domain, the range struct
> >> > >    should be converted back to bitmap segment(s).
> >> > > 
> >> > > 3) Because there might be lots of above bitmap segments and range
> >> > >    structs per-domain, we can organize them in a balanced interval
> >> > >    tree to quickly search/add/remove an individual structure.
> >> > > 
> >> > > In the worst case that each sub range has non-contiguous pages
> >> > > assigned to a domain, above solution will use all sub-bitmaps and
> >> > > consume more space than a single bitmap because of the extra space for
> >> > > organization. I assume that the sysadmin should be responsible to
> >> > > ensure the host nvdimm ranges assigned to each domain as contiguous
> >> > > and congregated as possible in order to avoid the worst case. However,
> >> > > if the worst case does happen, xen hypervisor should refuse to assign
> >> > > nvdimm to guest when it runs out of memory.
> >> > 
> >> > To be honest, this all sounds pretty unconvincing wrt not using
> >> > existing code paths - a lot of special treatment, and hence a lot
> >> > of things that can go (slightly) wrong.
> >> > 
> >> 
> >> Well, using existing range struct to manage guest access permissions
> >> to nvdimm could consume too much space which could not fit in either
> >> memory or nvdimm. If the above solution looks really error-prone,
> >> perhaps we can still come back to the existing one and restrict the
> >> number of range structs each domain could have for nvdimm
> >> (e.g. reserve one 4K-page per-domain for them) to make it work for
> >> nvdimm, though it may reject nvdimm mapping that is terribly
> >> fragmented.
> > 
> > Hi Jan,
> > 
> > Any comments for this?
> 
> Well, nothing new, i.e. my previous opinion on the old proposal didn't
> change. I'm really opposed to any artificial limitations here, as I am to
> any secondary (and hence error prone) code paths. IOW I continue
> to think that there's no reasonable alternative to re-using the existing
> memory management infrastructure for at least the PMEM case.

By re-using the existing memory management infrastructure, do you mean
re-using the existing model of MMIO for passthrough PCI devices to
handle the permissions for pmem?

> The
> only open question remains to be where to place the control structures,
> and I think the thresholding proposal of yours was quite sensible.

I'm a little confused here. Is 'restrict the number of range structs' in
my previous reply the 'thresholding proposal' you mean? Or is it one of
the 'artificial limitations'?

Thanks,
Haozhong


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-04-22  2:36                                                                   ` Haozhong Zhang
@ 2016-04-22  8:24                                                                     ` Jan Beulich
  2016-04-22 10:16                                                                       ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-04-22  8:24 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 22.04.16 at 04:36, <haozhong.zhang@intel.com> wrote:
> On 04/21/16 01:04, Jan Beulich wrote:
>> >>> On 21.04.16 at 07:09, <haozhong.zhang@intel.com> wrote:
>> > On 04/12/16 16:45, Haozhong Zhang wrote:
>> >> On 04/08/16 09:52, Jan Beulich wrote:
>> >> > >>> On 08.04.16 at 07:02, <haozhong.zhang@intel.com> wrote:
>> >> > > On 03/29/16 04:49, Jan Beulich wrote:
>> >> > >> >>> On 29.03.16 at 12:10, <haozhong.zhang@intel.com> wrote:
>> >> > >> > On 03/29/16 03:11, Jan Beulich wrote:
>> >> > >> >> >>> On 29.03.16 at 10:47, <haozhong.zhang@intel.com> wrote:
>> >> > > [..]
>> >> > >> >> > I still cannot find a neat approach to manage guest permissions for
>> >> > >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track
>> >> > >> >> > permissions: each bit corresponding to an nvdimm page. The bitmap can
>> >> > >> >> > save lots of spaces and even be stored in the normal ram, but
>> >> > >> >> > operating it for a large nvdimm range, especially for a contiguous
>> >> > >> >> > one, is slower than rangeset.
>> >> > >> >> 
>> >> > >> >> I don't follow: What would a single bit in that bitmap mean? Any
>> >> > >> >> guest may access the page? That surely wouldn't be what we
>> >> > >> >> need.
>> >> > >> >>
>> >> > >> > 
>> >> > >> > For a host having a N pages of nvdimm, each domain will have a N bits
>> >> > >> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain
>> >> > >> > has the permission to access the m'th host nvdimm page.
>> >> > >> 
>> >> > >> Which will be more overhead as soon as there are enough such
>> >> > >> domains in a system.
>> >> > >>
>> >> > > 
>> >> > > Sorry for the late reply.
>> >> > > 
>> >> > > I think we can make some optimization to reduce the space consumed by
>> >> > > the bitmap.
>> >> > > 
>> >> > > A per-domain bitmap covering the entire host NVDIMM address range is
>> >> > > wasteful especially if the actual used ranges are congregated. We may
>> >> > > take following ways to reduce its space.
>> >> > > 
>> >> > > 1) Split the per-domain bitmap into multiple sub-bitmap and each
>> >> > >    sub-bitmap covers a smaller and contiguous sub host NVDIMM address
>> >> > >    range. In the beginning, no sub-bitmap is allocated for the
>> >> > >    domain. If the access permission to a host NVDIMM page in a sub
>> >> > >    host address range is added to a domain, only the sub-bitmap for
>> >> > >    that address range is allocated for the domain. If access
>> >> > >    permissions to all host NVDIMM pages in a sub range are removed
>> >> > >    from a domain, the corresponding sub-bitmap can be freed.
>> >> > > 
>> >> > > 2) If a domain has access permissions to all host NVDIMM pages in a
>> >> > >    sub range, the corresponding sub-bitmap will be replaced by a range
>> >> > >    struct. If range structs are used to track adjacent ranges, they
>> >> > >    will be merged into one range struct. If access permissions to some
>> >> > >    pages in that sub range are removed from a domain, the range struct
>> >> > >    should be converted back to bitmap segment(s).
>> >> > > 
>> >> > > 3) Because there might be lots of above bitmap segments and range
>> >> > >    structs per-domain, we can organize them in a balanced interval
>> >> > >    tree to quickly search/add/remove an individual structure.
>> >> > > 
>> >> > > In the worst case that each sub range has non-contiguous pages
>> >> > > assigned to a domain, above solution will use all sub-bitmaps and
>> >> > > consume more space than a single bitmap because of the extra space for
>> >> > > organization. I assume that the sysadmin should be responsible to
>> >> > > ensure the host nvdimm ranges assigned to each domain as contiguous
>> >> > > and congregated as possible in order to avoid the worst case. However,
>> >> > > if the worst case does happen, xen hypervisor should refuse to assign
>> >> > > nvdimm to guest when it runs out of memory.
>> >> > 
>> >> > To be honest, this all sounds pretty unconvincing wrt not using
>> >> > existing code paths - a lot of special treatment, and hence a lot
>> >> > of things that can go (slightly) wrong.
>> >> > 
>> >> 
>> >> Well, using existing range struct to manage guest access permissions
>> >> to nvdimm could consume too much space which could not fit in either
>> >> memory or nvdimm. If the above solution looks really error-prone,
>> >> perhaps we can still come back to the existing one and restrict the
>> >> number of range structs each domain could have for nvdimm
>> >> (e.g. reserve one 4K-page per-domain for them) to make it work for
>> >> nvdimm, though it may reject nvdimm mapping that is terribly
>> >> fragmented.
>> > 
>> > Hi Jan,
>> > 
>> > Any comments for this?
>> 
>> Well, nothing new, i.e. my previous opinion on the old proposal didn't
>> change. I'm really opposed to any artificial limitations here, as I am to
>> any secondary (and hence error prone) code paths. IOW I continue
>> to think that there's no reasonable alternative to re-using the existing
>> memory management infrastructure for at least the PMEM case.
> 
> By re-using the existing memory management infrastructure, do you mean
> re-using the existing model of MMIO for passthrough PCI devices to
> handle the permission of pmem?

No, re-using struct page_info.

>> The
>> only open question remains to be where to place the control structures,
>> and I think the thresholding proposal of yours was quite sensible.
> 
> I'm little confused here. Is 'restrict the number of range structs' in
> my previous reply the 'thresholding proposal' you mean? Or it's one of
> 'artificial limitations'?

Neither. It's the decision on where to place the struct page_info
arrays needed to manage the PMEM ranges.

Jan


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-04-22  8:24                                                                     ` Jan Beulich
@ 2016-04-22 10:16                                                                       ` Haozhong Zhang
  2016-04-22 10:53                                                                         ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-04-22 10:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 04/22/16 02:24, Jan Beulich wrote:
[..]
> >> >> Well, using existing range struct to manage guest access permissions
> >> >> to nvdimm could consume too much space which could not fit in either
> >> >> memory or nvdimm. If the above solution looks really error-prone,
> >> >> perhaps we can still come back to the existing one and restrict the
> >> >> number of range structs each domain could have for nvdimm
> >> >> (e.g. reserve one 4K-page per-domain for them) to make it work for
> >> >> nvdimm, though it may reject nvdimm mapping that is terribly
> >> >> fragmented.
> >> > 
> >> > Hi Jan,
> >> > 
> >> > Any comments for this?
> >> 
> >> Well, nothing new, i.e. my previous opinion on the old proposal didn't
> >> change. I'm really opposed to any artificial limitations here, as I am to
> >> any secondary (and hence error prone) code paths. IOW I continue
> >> to think that there's no reasonable alternative to re-using the existing
> >> memory management infrastructure for at least the PMEM case.
> > 
> > By re-using the existing memory management infrastructure, do you mean
> > re-using the existing model of MMIO for passthrough PCI devices to
> > handle the permission of pmem?
> 
> No, re-using struct page_info.
> 
> >> The
> >> only open question remains to be where to place the control structures,
> >> and I think the thresholding proposal of yours was quite sensible.
> > 
> > I'm little confused here. Is 'restrict the number of range structs' in
> > my previous reply the 'thresholding proposal' you mean? Or it's one of
> > 'artificial limitations'?
> 
> Neither. It's the decision on where to place the struct page_info
> arrays needed to manage the PMEM ranges.
>

In [1][2], we have agreed to use struct page_info to manage mappings
for pmem and to place them in a reserved area on pmem.
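
As a rough illustration of why the placement matters (assuming roughly
32 bytes per struct page_info, which is an estimate): a 1 TB pmem
region has 256M 4K pages, so its page_info array alone would be about
8 GB, which is why keeping it in a reserved area on the pmem itself is
attractive.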

But I think the discussion in this thread is to decide the data
structure that will be used to track access permissions to host pmem.
The discussion started from my question in [3]:
| I'm not sure whether xen toolstack as a userspace program is
| considered to be safe to pass the host physical address (of host
| NVDIMM) to hypervisor.
In reply [4], you mentioned:
| As long as the passing of physical addresses follows to model of
| MMIO for passed through PCI devices, I don't think there's problem
| with the tool stack bypassing the Dom0 kernel. So it really all
| depends on how you make sure that the guest won't get to see memory
| it has no permission to access.

I interpreted this as meaning that the same access permission control
mechanism used for MMIO of passthrough PCI devices (built around range
structs) should be used for pmem as well, so that we can safely allow
the toolstack to pass the host physical address of nvdimm to the
hypervisor. Was my understanding wrong from the beginning?

Thanks,
Haozhong

[1] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01161.html
[2] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01201.html
[3] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html
[4] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01981.html


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-04-22 10:16                                                                       ` Haozhong Zhang
@ 2016-04-22 10:53                                                                         ` Jan Beulich
  2016-04-22 12:26                                                                           ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-04-22 10:53 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 22.04.16 at 12:16, <haozhong.zhang@intel.com> wrote:
> On 04/22/16 02:24, Jan Beulich wrote:
> [..]
>> >> >> Well, using existing range struct to manage guest access permissions
>> >> >> to nvdimm could consume too much space which could not fit in either
>> >> >> memory or nvdimm. If the above solution looks really error-prone,
>> >> >> perhaps we can still come back to the existing one and restrict the
>> >> >> number of range structs each domain could have for nvdimm
>> >> >> (e.g. reserve one 4K-page per-domain for them) to make it work for
>> >> >> nvdimm, though it may reject nvdimm mapping that is terribly
>> >> >> fragmented.
>> >> > 
>> >> > Hi Jan,
>> >> > 
>> >> > Any comments for this?
>> >> 
>> >> Well, nothing new, i.e. my previous opinion on the old proposal didn't
>> >> change. I'm really opposed to any artificial limitations here, as I am to
>> >> any secondary (and hence error prone) code paths. IOW I continue
>> >> to think that there's no reasonable alternative to re-using the existing
>> >> memory management infrastructure for at least the PMEM case.
>> > 
>> > By re-using the existing memory management infrastructure, do you mean
>> > re-using the existing model of MMIO for passthrough PCI devices to
>> > handle the permission of pmem?
>> 
>> No, re-using struct page_info.
>> 
>> >> The
>> >> only open question remains to be where to place the control structures,
>> >> and I think the thresholding proposal of yours was quite sensible.
>> > 
>> > I'm little confused here. Is 'restrict the number of range structs' in
>> > my previous reply the 'thresholding proposal' you mean? Or it's one of
>> > 'artificial limitations'?
>> 
>> Neither. It's the decision on where to place the struct page_info
>> arrays needed to manage the PMEM ranges.
>>
> 
> In [1][2], we have agreed to use struct page_info to manage mappings
> for pmem and place them in reserved area on pmem.
> 
> But I think the discussion in this thread is to decide the data
> structure which will be used to track access permission to host pmem.
> The discussion started from my question in [3]:
> | I'm not sure whether xen toolstack as a userspace program is
> | considered to be safe to pass the host physical address (of host
> | NVDIMM) to hypervisor.
> In reply [4], you mentioned:
> | As long as the passing of physical addresses follows to model of
> | MMIO for passed through PCI devices, I don't think there's problem
> | with the tool stack bypassing the Dom0 kernel. So it really all
> | depends on how you make sure that the guest won't get to see memory
> | it has no permission to access.
> 
> I interpreted it as the same access permission control mechanism used
> for MMIO of passthrough pci devices (built around range struct) should
> be used for pmem as well, so that we can safely allow toolstack to
> pass the host physical address of nvdimm to hypervisor.
> Was my understanding wrong from the beginning?

Perhaps I have got confused by the back and forth. If we're to
use struct page_info, then everything should be following a
similar flow to what happens for normal RAM, i.e. normal page
allocation, and normal assignment of pages to guests.

Jan



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-04-22 10:53                                                                         ` Jan Beulich
@ 2016-04-22 12:26                                                                           ` Haozhong Zhang
  2016-04-22 12:36                                                                             ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-04-22 12:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 04/22/16 04:53, Jan Beulich wrote:
> >>> On 22.04.16 at 12:16, <haozhong.zhang@intel.com> wrote:
> > On 04/22/16 02:24, Jan Beulich wrote:
> > [..]
> >> >> >> Well, using existing range struct to manage guest access permissions
> >> >> >> to nvdimm could consume too much space which could not fit in either
> >> >> >> memory or nvdimm. If the above solution looks really error-prone,
> >> >> >> perhaps we can still come back to the existing one and restrict the
> >> >> >> number of range structs each domain could have for nvdimm
> >> >> >> (e.g. reserve one 4K-page per-domain for them) to make it work for
> >> >> >> nvdimm, though it may reject nvdimm mapping that is terribly
> >> >> >> fragmented.
> >> >> > 
> >> >> > Hi Jan,
> >> >> > 
> >> >> > Any comments for this?
> >> >> 
> >> >> Well, nothing new, i.e. my previous opinion on the old proposal didn't
> >> >> change. I'm really opposed to any artificial limitations here, as I am to
> >> >> any secondary (and hence error prone) code paths. IOW I continue
> >> >> to think that there's no reasonable alternative to re-using the existing
> >> >> memory management infrastructure for at least the PMEM case.
> >> > 
> >> > By re-using the existing memory management infrastructure, do you mean
> >> > re-using the existing model of MMIO for passthrough PCI devices to
> >> > handle the permission of pmem?
> >> 
> >> No, re-using struct page_info.
> >> 
> >> >> The
> >> >> only open question remains to be where to place the control structures,
> >> >> and I think the thresholding proposal of yours was quite sensible.
> >> > 
> >> > I'm little confused here. Is 'restrict the number of range structs' in
> >> > my previous reply the 'thresholding proposal' you mean? Or it's one of
> >> > 'artificial limitations'?
> >> 
> >> Neither. It's the decision on where to place the struct page_info
> >> arrays needed to manage the PMEM ranges.
> >>
> > 
> > In [1][2], we have agreed to use struct page_info to manage mappings
> > for pmem and place them in reserved area on pmem.
> > 
> > But I think the discussion in this thread is to decide the data
> > structure which will be used to track access permission to host pmem.
> > The discussion started from my question in [3]:
> > | I'm not sure whether xen toolstack as a userspace program is
> > | considered to be safe to pass the host physical address (of host
> > | NVDIMM) to hypervisor.
> > In reply [4], you mentioned:
> > | As long as the passing of physical addresses follows to model of
> > | MMIO for passed through PCI devices, I don't think there's problem
> > | with the tool stack bypassing the Dom0 kernel. So it really all
> > | depends on how you make sure that the guest won't get to see memory
> > | it has no permission to access.
> > 
> > I interpreted it as the same access permission control mechanism used
> > for MMIO of passthrough pci devices (built around range struct) should
> > be used for pmem as well, so that we can safely allow toolstack to
> > pass the host physical address of nvdimm to hypervisor.
> > Was my understanding wrong from the beginning?
> 
> Perhaps I have got confused by the back and forth. If we're to
> use struct page_info, then everything should be following a
> similar flow to what happens for normal RAM, i.e. normal page
> allocation, and normal assignment of pages to guests.
>

I'll follow the normal assignment of pages to guests for pmem, but not
the normal page allocation, because with allocation it is difficult to
always get the same pmem area for the same guest every time. It still
needs input from others (e.g. the toolstack) that can provide the
exact address.

Because the address is now not decided by the xen hypervisor, certain
permission tracking is needed. For this part, we will re-use the
existing one for MMIO. Directly using the existing range structs for
pmem may consume too much space, so I proposed to choose different
data structures or to put a limitation on the existing range structs
to avoid or mitigate this problem. The data structure change will be
applied only to pmem, and only the code that manipulates the range
structs (rangeset_*) will be changed for pmem. So the permission
tracking part will still follow the existing model.
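
As a purely illustrative sketch of the toolstack side of that flow
(reusing today's libxc calls for MMIO-style permission and mapping;
the helper name and parameters are made up, and a real flow would
obtain the host SPA from the dom0 NVDIMM driver):

    #include <xenctrl.h>

    /* Hypothetical helper: grant a guest access to a host pmem range
     * and map it, the same way passed-through MMIO is handled today. */
    static int grant_and_map_pmem(xc_interface *xch, uint32_t domid,
                                  unsigned long spa_mfn, /* host pmem start page */
                                  unsigned long gfn,     /* guest physical start page */
                                  unsigned long nr_mfns) /* number of pages */
    {
        /* Record the permission in the domain's iomem rangeset ... */
        int rc = xc_domain_iomem_permission(xch, domid, spa_mfn, nr_mfns, 1);

        if ( rc )
            return rc;

        /* ... and establish the guest-physical -> host-physical mapping. */
        return xc_domain_memory_mapping(xch, domid, gfn, spa_mfn, nr_mfns,
                                        1 /* add mapping */);
    }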

Haozhong



* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-04-22 12:26                                                                           ` Haozhong Zhang
@ 2016-04-22 12:36                                                                             ` Jan Beulich
  2016-04-22 12:54                                                                               ` Haozhong Zhang
  0 siblings, 1 reply; 121+ messages in thread
From: Jan Beulich @ 2016-04-22 12:36 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 22.04.16 at 14:26, <haozhong.zhang@intel.com> wrote:
> On 04/22/16 04:53, Jan Beulich wrote:
>> Perhaps I have got confused by the back and forth. If we're to
>> use struct page_info, then everything should be following a
>> similar flow to what happens for normal RAM, i.e. normal page
>> allocation, and normal assignment of pages to guests.
>>
> 
> I'll follow the normal assignment of pages to guests for pmem, but not
> the normal page allocation. Because allocation is difficult to always
> get the same pmem area for the same guest every time. It still needs
> input from others (e.g. toolstack) that can provide the exact address.

Understood.

> Because the address is now not decided by xen hypervisor, certain
> permission track is needed. For this part, we will re-use the existing
> one for MMIO. Directly using existing range struct for pmem may
> consume too much space, so I proposed to choose different data
> structures or put limitation on exiting range struct to avoid or
> mitigate this problem.

Why would these consume too much space? I'd expect there to be
just one or very few chunks, just as is the case for MMIO ranges
on devices.

Jan

> The data structure change will be applied only
> to pmem, and only the code that manipulate the range structs
> (rangeset_*) will be changed for pmem. So for the permission tracking
> part, it will still follow the exiting one.
> 
> Haozhong





* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-04-22 12:36                                                                             ` Jan Beulich
@ 2016-04-22 12:54                                                                               ` Haozhong Zhang
  2016-04-22 13:22                                                                                 ` Jan Beulich
  0 siblings, 1 reply; 121+ messages in thread
From: Haozhong Zhang @ 2016-04-22 12:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On 04/22/16 06:36, Jan Beulich wrote:
> >>> On 22.04.16 at 14:26, <haozhong.zhang@intel.com> wrote:
> > On 04/22/16 04:53, Jan Beulich wrote:
> >> Perhaps I have got confused by the back and forth. If we're to
> >> use struct page_info, then everything should be following a
> >> similar flow to what happens for normal RAM, i.e. normal page
> >> allocation, and normal assignment of pages to guests.
> >>
> > 
> > I'll follow the normal assignment of pages to guests for pmem, but not
> > the normal page allocation, because allocation cannot be relied on to
> > always return the same pmem area for the same guest every time. It
> > still needs input from others (e.g. the toolstack) that can provide
> > the exact address.
> 
> Understood.
> 
> > Because the address is now not decided by the Xen hypervisor, certain
> > permission tracking is needed. For this part, we will re-use the
> > existing one for MMIO. Directly using the existing range struct for
> > pmem may consume too much space, so I proposed to choose different
> > data structures or to put limitations on the existing range struct to
> > avoid or mitigate this problem.
> 
> Why would these consume too much space? I'd expect there to be
> just one or very few chunks, just as is the case for MMIO ranges
> on devices.
> 

As Ian Jackson indicated [1], there are several cases in which a pmem
page can be accessed by more than one domain. Every domain involved
then needs a range struct to track its access permission to that pmem
page. In the worst case, e.g. when the first of every two contiguous
pages on a pmem device is assigned to one domain and shared with all
other domains, no adjacent ranges can be merged; even though the range
structs for a single domain may take an acceptable amount of space,
the total across all domains will still be very large.
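
For a rough feel of the numbers, here is a small stand-alone estimate
of that worst case; the 128 GiB region size, the 16 domains and the
48 bytes per range node are illustrative assumptions, not measurements
of Xen's actual range struct:

  /* Back-of-the-envelope estimate of the worst case above: every
   * second 4 KiB page of a pmem region is granted to each of D
   * domains, so no two ranges can be merged and each rangeset ends up
   * with one node per granted page.  All constants are assumptions. */
  #include <stdio.h>

  int main(void)
  {
      const unsigned long long pmem_bytes = 128ULL << 30; /* 128 GiB */
      const unsigned long long page_size  = 4096;         /* 4 KiB */
      const unsigned long long node_bytes = 48;   /* per range node */
      const unsigned long long domains    = 16;

      unsigned long long ranges  = (pmem_bytes / page_size) / 2;
      unsigned long long per_dom = ranges * node_bytes;

      printf("ranges per domain : %llu\n", ranges);
      printf("MiB per domain    : %llu\n", per_dom >> 20);
      printf("MiB for %llu doms : %llu\n", domains,
             (per_dom * domains) >> 20);
      return 0;
  }

With those assumptions this already comes to about 16 million ranges
and roughly 768 MiB of tracking data per domain, which is where the
concern about the total size comes from.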

Haozhong

[1] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg02309.html


* Re: [RFC Design Doc] Add vNVDIMM support for Xen
  2016-04-22 12:54                                                                               ` Haozhong Zhang
@ 2016-04-22 13:22                                                                                 ` Jan Beulich
  0 siblings, 0 replies; 121+ messages in thread
From: Jan Beulich @ 2016-04-22 13:22 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, George Dunlap,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 22.04.16 at 14:54, <haozhong.zhang@intel.com> wrote:
> On 04/22/16 06:36, Jan Beulich wrote:
>> >>> On 22.04.16 at 14:26, <haozhong.zhang@intel.com> wrote:
>> > On 04/22/16 04:53, Jan Beulich wrote:
>> >> Perhaps I have got confused by the back and forth. If we're to
>> >> use struct page_info, then everything should be following a
>> >> similar flow to what happens for normal RAM, i.e. normal page
>> >> allocation, and normal assignment of pages to guests.
>> >>
>> > 
>> > I'll follow the normal assignment of pages to guests for pmem, but not
>> > the normal page allocation, because allocation cannot be relied on to
>> > always return the same pmem area for the same guest every time. It
>> > still needs input from others (e.g. the toolstack) that can provide
>> > the exact address.
>> 
>> Understood.
>> 
>> > Because the address is now not decided by the Xen hypervisor, certain
>> > permission tracking is needed. For this part, we will re-use the
>> > existing one for MMIO. Directly using the existing range struct for
>> > pmem may consume too much space, so I proposed to choose different
>> > data structures or to put limitations on the existing range struct to
>> > avoid or mitigate this problem.
>> 
>> Why would these consume too much space? I'd expect there to be
>> just one or very few chunks, just as is the case for MMIO ranges
>> on devices.
> 
> As Ian Jackson indicated [1], there are several cases in which a pmem
> page can be accessed by more than one domain. Every domain involved
> then needs a range struct to track its access permission to that pmem
> page. In the worst case, e.g. when the first of every two contiguous
> pages on a pmem device is assigned to one domain and shared with all
> other domains, no adjacent ranges can be merged; even though the range
> structs for a single domain may take an acceptable amount of space,
> the total across all domains will still be very large.

Everything Ian has mentioned there is what normal RAM pages can also
get used for, yet, as you have yourself said (still visible in the
context above), you mean to do only the allocation differently. Hence
the permission tracking you talk of should be necessary only for the
owning domain (to be validated during allocation); everything else
should follow the normal life cycle of a RAM page.
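
A sketch of that flow might look like the following;
pmem_access_permitted() is the hypothetical owner-side check discussed
earlier in the thread, and assign_pages()/guest_physmap_add_page() are
used here as simplified stand-ins for Xen's real page-assignment path
rather than exact prototypes:

  /* Sketch: validate the toolstack-provided host pmem range against
   * the owning domain's permissions once, at assignment time, then
   * let the pages follow the normal RAM life cycle. */
  static int pmem_assign_to_guest(struct domain *d,
                                  unsigned long gfn, /* guest frame */
                                  unsigned long mfn, /* host pmem MFN */
                                  unsigned long nr_frames)
  {
      unsigned long i;
      int rc;

      /* Permission check needed only for the owning domain. */
      if ( !pmem_access_permitted(d, mfn, mfn + nr_frames - 1) )
          return -EPERM;

      for ( i = 0; i < nr_frames; i++ )
      {
          /* From here on the page is treated like normal RAM: assign
           * it to the domain and enter it into the guest's p2m. */
          rc = assign_pages(d, mfn_to_page(mfn + i), 0, 0);
          if ( rc )
              return rc;

          rc = guest_physmap_add_page(d, gfn + i, mfn + i, 0);
          if ( rc )
              return rc;
      }

      return 0;
  }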

Jan



Thread overview: 121+ messages
2016-02-01  5:44 [RFC Design Doc] Add vNVDIMM support for Xen Haozhong Zhang
2016-02-01 18:25 ` Andrew Cooper
2016-02-02  3:27   ` Tian, Kevin
2016-02-02  3:44   ` Haozhong Zhang
2016-02-02 11:09     ` Andrew Cooper
2016-02-02  6:33 ` Tian, Kevin
2016-02-02  7:39   ` Zhang, Haozhong
2016-02-02  7:48     ` Tian, Kevin
2016-02-02  7:53       ` Zhang, Haozhong
2016-02-02  8:03         ` Tian, Kevin
2016-02-02  8:49           ` Zhang, Haozhong
2016-02-02 19:01   ` Konrad Rzeszutek Wilk
2016-02-02 17:11 ` Stefano Stabellini
2016-02-03  7:00   ` Haozhong Zhang
2016-02-03  9:13     ` Jan Beulich
2016-02-03 14:09       ` Andrew Cooper
2016-02-03 14:23         ` Haozhong Zhang
2016-02-05 14:40         ` Ross Philipson
2016-02-06  1:43           ` Haozhong Zhang
2016-02-06 16:17             ` Ross Philipson
2016-02-03 12:02     ` Stefano Stabellini
2016-02-03 13:11       ` Haozhong Zhang
2016-02-03 14:20         ` Andrew Cooper
2016-02-04  3:10           ` Haozhong Zhang
2016-02-03 15:16       ` George Dunlap
2016-02-03 15:22         ` Stefano Stabellini
2016-02-03 15:35           ` Konrad Rzeszutek Wilk
2016-02-03 15:35           ` George Dunlap
2016-02-04  2:55           ` Haozhong Zhang
2016-02-04 12:24             ` Stefano Stabellini
2016-02-15  3:16               ` Zhang, Haozhong
2016-02-16 11:14                 ` Stefano Stabellini
2016-02-16 12:55                   ` Jan Beulich
2016-02-17  9:03                     ` Haozhong Zhang
2016-03-04  7:30                     ` Haozhong Zhang
2016-03-16 12:55                       ` Haozhong Zhang
2016-03-16 13:13                         ` Konrad Rzeszutek Wilk
2016-03-16 13:16                         ` Jan Beulich
2016-03-16 13:55                           ` Haozhong Zhang
2016-03-16 14:23                             ` Jan Beulich
2016-03-16 14:55                               ` Haozhong Zhang
2016-03-16 15:23                                 ` Jan Beulich
2016-03-17  8:58                                   ` Haozhong Zhang
2016-03-17 11:04                                     ` Jan Beulich
2016-03-17 12:44                                       ` Haozhong Zhang
2016-03-17 12:59                                         ` Jan Beulich
2016-03-17 13:29                                           ` Haozhong Zhang
2016-03-17 13:52                                             ` Jan Beulich
2016-03-17 14:00                                             ` Ian Jackson
2016-03-17 14:21                                               ` Haozhong Zhang
2016-03-29  8:47                                                 ` Haozhong Zhang
2016-03-29  9:11                                                   ` Jan Beulich
2016-03-29 10:10                                                     ` Haozhong Zhang
2016-03-29 10:49                                                       ` Jan Beulich
2016-04-08  5:02                                                         ` Haozhong Zhang
2016-04-08 15:52                                                           ` Jan Beulich
2016-04-12  8:45                                                             ` Haozhong Zhang
2016-04-21  5:09                                                               ` Haozhong Zhang
2016-04-21  7:04                                                                 ` Jan Beulich
2016-04-22  2:36                                                                   ` Haozhong Zhang
2016-04-22  8:24                                                                     ` Jan Beulich
2016-04-22 10:16                                                                       ` Haozhong Zhang
2016-04-22 10:53                                                                         ` Jan Beulich
2016-04-22 12:26                                                                           ` Haozhong Zhang
2016-04-22 12:36                                                                             ` Jan Beulich
2016-04-22 12:54                                                                               ` Haozhong Zhang
2016-04-22 13:22                                                                                 ` Jan Beulich
2016-03-17 13:32                                         ` Konrad Rzeszutek Wilk
2016-02-03 15:47       ` Konrad Rzeszutek Wilk
2016-02-04  2:36         ` Haozhong Zhang
2016-02-15  9:04         ` Zhang, Haozhong
2016-02-02 19:15 ` Konrad Rzeszutek Wilk
2016-02-03  8:28   ` Haozhong Zhang
2016-02-03  9:18     ` Jan Beulich
2016-02-03 12:22       ` Haozhong Zhang
2016-02-03 12:38         ` Jan Beulich
2016-02-03 12:49           ` Haozhong Zhang
2016-02-03 14:30       ` Andrew Cooper
2016-02-03 14:39         ` Jan Beulich
2016-02-15  8:43   ` Haozhong Zhang
2016-02-15 11:07     ` Jan Beulich
2016-02-17  9:01       ` Haozhong Zhang
2016-02-17  9:08         ` Jan Beulich
2016-02-18  7:42           ` Haozhong Zhang
2016-02-19  2:14             ` Konrad Rzeszutek Wilk
2016-03-01  7:39               ` Haozhong Zhang
2016-03-01 18:33                 ` Ian Jackson
2016-03-01 18:49                   ` Konrad Rzeszutek Wilk
2016-03-02  7:14                     ` Haozhong Zhang
2016-03-02 13:03                       ` Jan Beulich
2016-03-04  2:20                         ` Haozhong Zhang
2016-03-08  9:15                           ` Haozhong Zhang
2016-03-08  9:27                             ` Jan Beulich
2016-03-09 12:22                               ` Haozhong Zhang
2016-03-09 16:17                                 ` Jan Beulich
2016-03-10  3:27                                   ` Haozhong Zhang
2016-03-17 11:05                                   ` Ian Jackson
2016-03-17 13:37                                     ` Haozhong Zhang
2016-03-17 13:56                                       ` Jan Beulich
2016-03-17 14:22                                         ` Haozhong Zhang
2016-03-17 14:12                                       ` Xu, Quan
2016-03-17 14:22                                         ` Zhang, Haozhong
2016-03-07 20:53                       ` Konrad Rzeszutek Wilk
2016-03-08  5:50                         ` Haozhong Zhang
2016-02-18 17:17 ` Jan Beulich
2016-02-24 13:28   ` Haozhong Zhang
2016-02-24 14:00     ` Ross Philipson
2016-02-24 16:42       ` Haozhong Zhang
2016-02-24 17:50         ` Ross Philipson
2016-02-24 14:24     ` Jan Beulich
2016-02-24 15:48       ` Haozhong Zhang
2016-02-24 16:54         ` Jan Beulich
2016-02-28 14:48           ` Haozhong Zhang
2016-02-29  9:01             ` Jan Beulich
2016-02-29  9:45               ` Haozhong Zhang
2016-02-29 10:12                 ` Jan Beulich
2016-02-29 11:52                   ` Haozhong Zhang
2016-02-29 12:04                     ` Jan Beulich
2016-02-29 12:22                       ` Haozhong Zhang
2016-03-01 13:51                         ` Ian Jackson
2016-03-01 15:04                           ` Jan Beulich
