From: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
To: Haozhong Zhang <haozhong.zhang@intel.com>
Cc: Juergen Gross <jgross@suse.com>,
	Kevin Tian <kevin.tian@intel.com>, Wei Liu <wei.liu2@citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	xen-devel@lists.xen.org, Jan Beulich <jbeulich@suse.com>,
	Jun Nakajima <jun.nakajima@intel.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Keir Fraser <keir@xen.org>
Subject: Re: [RFC Design Doc] Add vNVDIMM support for Xen
Date: Tue, 2 Feb 2016 17:11:49 +0000
Message-ID: <alpine.DEB.2.02.1602021645210.29714@kaball.uk.xensource.com>
In-Reply-To: <20160201054414.GA25211@hz-desktop.sh.intel.com>

Haozhong, thanks for your work!

On Mon, 1 Feb 2016, Haozhong Zhang wrote:
> 3.2 Address Mapping
> 
> 3.2.1 My Design
> 
>  The overview of this design is shown in the following figure.
> 
>                  Dom0                         |               DomU
>                                               |
>                                               |
>  QEMU                                         |
>      +...+--------------------+...+-----+     |
>   VA |   | Label Storage Area |   | buf |     |
>      +...+--------------------+...+-----+     |
>                      ^            ^     ^     |
>                      |            |     |     |
>                      V            |     |     |
>      +-------+   +-------+        mmap(2)     |
>      | vACPI |   | v_DSM |        |     |     |        +----+------------+
>      +-------+   +-------+        |     |     |   SPA  |    | /dev/pmem0 |
>          ^           ^     +------+     |     |        +----+------------+
>  --------|-----------|-----|------------|--   |             ^            ^
>          |           |     |            |     |             |            |
>          |    +------+     +------------~-----~-------------+            |
>          |    |            |            |     |        XEN_DOMCTL_memory_mapping
>          |    |            |            +-----~--------------------------+
>          |    |            |            |     |
>          |    |       +----+------------+     |
>  Linux   |    |   SPA |    | /dev/pmem0 |     |     +------+   +------+
>          |    |       +----+------------+     |     | ACPI |   | _DSM |
>          |    |                   ^           |     +------+   +------+
>          |    |                   |           |         |          |
>          |    |               Dom0 Driver     |   hvmloader/xl     |
>  --------|----|-------------------|---------------------|----------|---------------
>          |    +-------------------~---------------------~----------+
>  Xen     |                        |                     |
>          +------------------------~---------------------+
>  ---------------------------------|------------------------------------------------
>                                   +----------------+
>                                                    |
>                                             +-------------+
>  HW                                         |    NVDIMM   |
>                                             +-------------+
> 
> 
>  This design treats host NVDIMM devices as ordinary MMIO devices:
>  (1) The Dom0 Linux NVDIMM driver is responsible for detecting
>      (through NFIT) and driving host NVDIMM devices (implementing the
>      block device interface). Namespaces and file systems on host
>      NVDIMM devices are handled by Dom0 Linux as well.
> 
>  (2) QEMU mmap(2)s the pmem NVDIMM device (/dev/pmem0) into its
>      virtual address space (buf).
> 
>  (3) QEMU gets the host physical address of buf, i.e. the host system
>      physical address that is occupied by /dev/pmem0, and calls the Xen
>      hypercall XEN_DOMCTL_memory_mapping to map it into a DomU.
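> 
>  A minimal sketch of steps (2) and (3), assuming libxc's
>  xc_domain_memory_mapping() wrapper for XEN_DOMCTL_memory_mapping and
>  a hypothetical helper virt_to_spa() for the VA-to-SPA lookup raised
>  in the open question below; guest_gfn is wherever the toolstack
>  decides to place the vNVDIMM in the guest physmap:
> 
>      #include <fcntl.h>
>      #include <sys/mman.h>
>      #include <unistd.h>
>      #include <xenctrl.h>
> 
>      static int map_pmem_to_domu(xc_interface *xch, uint32_t domid,
>                                  unsigned long guest_gfn, size_t len)
>      {
>          int fd = open("/dev/pmem0", O_RDWR);
>          if (fd < 0)
>              return -1;
> 
>          /* (2) Map the pmem device into QEMU's virtual address space. */
>          void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                           MAP_SHARED, fd, 0);
>          if (buf == MAP_FAILED)
>              return -1;
> 
>          /* Resolve buf to its SPA -- hypothetical helper; how to
>           * actually obtain this is the open question below. */
>          unsigned long spa = virt_to_spa(buf);
> 
>          /* (3) Map the SPA range into the guest physical address space. */
>          return xc_domain_memory_mapping(xch, domid, guest_gfn,
>                                          spa >> XC_PAGE_SHIFT,
>                                          len >> XC_PAGE_SHIFT,
>                                          DPCI_ADD_MAPPING);
>      }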

How is this going to work from a security perspective? Is it going to
require running QEMU as root in Dom0, which will prevent NVDIMM from
working by default on Xen? If so, what's the plan?



>  (ACPI part is described in Section 3.3 later)
> 
>  Steps (1) and (2) above are already done in current QEMU; only (3)
>  needs to be implemented in QEMU. No change is needed in Xen for
>  address mapping in this design.
> 
>  Open: It seems no system call/ioctl is provided by the Linux kernel
>        to get the physical address backing a virtual address.
>        /proc/<qemu_pid>/pagemap provides the mapping information from
>        VA to PA. Is it an acceptable solution to let QEMU parse this
>        file to get the physical address?
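> 
>  For reference, a sketch of such parsing, relying only on the
>  documented pagemap format (one 64-bit entry per virtual page, PFN in
>  bits 0-54, "present" in bit 63). Note that since Linux 4.0 the PFN
>  field reads as zero without CAP_SYS_ADMIN, which bears directly on
>  the non-root question:
> 
>      #include <fcntl.h>
>      #include <stdint.h>
>      #include <unistd.h>
> 
>      #define PAGE_SHIFT 12
> 
>      /* Translate a virtual address of the calling process to a
>       * physical address via /proc/self/pagemap; returns 0 on success. */
>      static int va_to_pa(void *va, uint64_t *pa)
>      {
>          uint64_t entry;
>          int fd = open("/proc/self/pagemap", O_RDONLY);
>          if (fd < 0)
>              return -1;
> 
>          off_t off = ((uintptr_t)va >> PAGE_SHIFT) * sizeof(entry);
>          ssize_t n = pread(fd, &entry, sizeof(entry), off);
>          close(fd);
> 
>          if (n != sizeof(entry) || !(entry & (1ULL << 63)))
>              return -1;              /* unreadable or page not present */
> 
>          uint64_t pfn = entry & ((1ULL << 55) - 1);
>          if (pfn == 0)
>              return -1;              /* PFN hidden: no CAP_SYS_ADMIN */
> 
>          *pa = (pfn << PAGE_SHIFT) |
>                ((uintptr_t)va & ((1UL << PAGE_SHIFT) - 1));
>          return 0;
>      }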

Does it work in a non-root scenario?


>  Open: For a large pmem device, mmap(2) may well not map all of the
>        SPA occupied by pmem at the beginning, i.e. QEMU may not be
>        able to get all SPAs of pmem from buf (in its virtual address
>        space) when calling XEN_DOMCTL_memory_mapping.
>        Can the mmap flag MAP_LOCKED or mlock(2) be used to force the
>        entire pmem device to be mapped?
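> 
>  A sketch of the idea; whether MAP_LOCKED/mlock(2) actually guarantee
>  that every page of the device is populated and stays mapped is
>  exactly the open question:
> 
>      void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                       MAP_SHARED | MAP_LOCKED, fd, 0);
>      if (buf == MAP_FAILED)
>          return -1;
>      /* MAP_LOCKED may silently fail to populate the mapping, so
>       * follow up with mlock(2), which does fault in and pin pages. */
>      if (mlock(buf, len) != 0)
>          return -1;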

Ditto


> 3.2.2 Alternative Design
> 
>  Jan Beulich's comments [7] on my question "why must pmem resource
>  management and partition be done in hypervisor":
>  | Because that's where memory management belongs. And PMEM,
>  | other than PBLK, is just another form of RAM.
>  | ...
>  | The main issue is that this would imo be a layering violation
> 
>  George Dunlap's comments [8]:
>  | This is not the case for PMEM.  The whole point of PMEM (correct me if
>    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
>  | I'm wrong) is to be used for long-term storage that survives over
>  | reboot.  It matters very much that a guest be given the same PRAM
>  | after the host is rebooted that it was given before.  It doesn't make
>  | any sense to manage it the way Xen currently manages RAM (i.e., that
>  | you request a page and get whatever Xen happens to give you).
>  |
>  | So if Xen is going to use PMEM, it will have to invent an entirely new
>  | interface for guests, and it will have to keep track of those
>  | resources across host reboots.  In other words, it will have to
>  | duplicate all the work that Linux already does.  What do we gain from
>  | that duplication?  Why not just leverage what's already implemented in
>  | dom0?
>  and [9]:
>  | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
>  | then you're right -- it is just another form of RAM, that should be
>  | treated no differently than say, lowmem: a fungible resource that can be
>  | requested by setting a flag.
> 
>  However, pmem is used more as persistent storage than as fungible
>  RAM, and my design is for the former usage. I would like to leave the
>  detection, driving and partitioning (either through namespaces or
>  file systems) of NVDIMM to the Dom0 Linux kernel.
> 
>  I notice that the current XEN_DOMCTL_memory_mapping does not sanity
>  check the physical address and size passed from the caller
>  (QEMU). Can QEMU always be trusted? If not, we would need to make Xen
>  aware of the SPA ranges of pmem so that it can refuse to map physical
>  addresses that are in neither normal RAM nor pmem.

Indeed


>  Instead of duplicating in Xen the detection code (parsing NFIT and
>  evaluating _FIT) already in the Dom0 Linux kernel, we decide to patch
>  the Dom0 Linux kernel to pass the parameters of host pmem NVDIMM
>  devices to the Xen hypervisor:
>  (1) Add a global
>        struct rangeset pmem_rangeset
>      in Xen hypervisor to record all SPA ranges of detected pmem devices.
>      Each range in pmem_rangeset corresponds to a pmem device.
> 
>  (2) Add a hypercall
>        XEN_SYSCTL_add_pmem_range
>      (should it be a sysctl or a platform op?)
>      that receives a pair of parameters (addr: starting SPA of the
>      pmem region, len: size of the pmem region) and adds the range
>      (addr, addr + len - 1) to pmem_rangeset.
> 
>  (3) Add a hypercall
>        XEN_DOMCTL_pmem_mapping
>      that takes the same parameters as XEN_DOMCTL_memory_mapping and
>      maps a given host pmem range to a guest. It checks whether the
>      given host pmem range is in pmem_rangeset before making the
>      actual mapping (see the sketch after this list).
> 
>  (4) Patch Linux NVDIMM driver to call XEN_SYSCTL_add_pmem_range
>      whenever it detects a pmem device.
> 
>  (5) Patch QEMU to use XEN_DOMCTL_pmem_mapping for mapping host pmem
>      devices.
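> 
>  A sketch of the Xen-side pieces (1)-(3), assuming Xen's existing
>  rangeset API; the hypercall plumbing, locking and unit choice (frame
>  numbers vs. byte addresses) are illustrative only:
> 
>      /* (1) Global rangeset of all detected pmem SPA ranges, created
>       * with rangeset_new() at boot. */
>      static struct rangeset *pmem_rangeset;
> 
>      /* (2) XEN_SYSCTL_add_pmem_range: record a pmem device reported
>       * by the Dom0 NVDIMM driver at [addr, addr + len). */
>      int pmem_add_range(unsigned long addr, unsigned long len)
>      {
>          return rangeset_add_range(pmem_rangeset,
>                                    paddr_to_pfn(addr),
>                                    paddr_to_pfn(addr + len - 1));
>      }
> 
>      /* (3) XEN_DOMCTL_pmem_mapping: refuse any range not backed by a
>       * known pmem device before doing the actual mapping. */
>      int pmem_map_to_guest(struct domain *d, unsigned long gfn,
>                            unsigned long mfn, unsigned long nr_mfns)
>      {
>          if ( !rangeset_contains_range(pmem_rangeset, mfn,
>                                        mfn + nr_mfns - 1) )
>              return -EINVAL;
>          /* ... then the same work as XEN_DOMCTL_memory_mapping ... */
>          return 0;
>      }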
> 
> 
> 3.3 Guest ACPI Emulation
> 
> 3.3.1 My Design
> 
>  Guest ACPI emulation is composed of two parts: building guest NFIT
>  and SSDT that defines ACPI namespace devices for NVDIMM, and
>  emulating guest _DSM.
> 
>  (1) Building Guest ACPI Tables
> 
>   This design reuses and extends hvmloader's existing mechanism that
>   loads passthrough ACPI tables from binary files to load NFIT and
>   SSDT tables built by QEMU:
>   1) Because current QEMU does not build any ACPI tables when it runs
>      as the Xen device model, this design needs to patch QEMU to build
>      NFIT and SSDT (so far only NFIT and SSDT) in this case.
> 
>   2) QEMU copies NFIT and SSDT to the end of guest memory below
>      4G. The guest address and size of those tables are written into
>      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length});
>      see the sketch after this list.
> 
>   3) hvmloader is patched to probe and load device model passthrough
>      ACPI tables from the above xenstore keys. The detected ACPI
>      tables are then appended to the end of the existing guest ACPI
>      tables, just as the current construct_passthrough_tables() does.
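> 
>   As an illustration of step 2), a sketch using libxenstore's
>   xs_write(); the key layout under hvmloader/dm-acpi is this design's
>   proposal, not an established interface:
> 
>       #include <inttypes.h>
>       #include <stdbool.h>
>       #include <stdio.h>
>       #include <string.h>
>       #include <xenstore.h>
> 
>       /* Publish the guest address/size of the QEMU-built tables for
>        * hvmloader to pick up. */
>       static bool publish_dm_acpi(struct xs_handle *xsh, int domid,
>                                   uint64_t gaddr, uint64_t len)
>       {
>           char path[64], val[32];
> 
>           snprintf(path, sizeof(path),
>                    "/local/domain/%d/hvmloader/dm-acpi/address", domid);
>           snprintf(val, sizeof(val), "%" PRIu64, gaddr);
>           if (!xs_write(xsh, XBT_NULL, path, val, strlen(val)))
>               return false;
> 
>           snprintf(path, sizeof(path),
>                    "/local/domain/%d/hvmloader/dm-acpi/length", domid);
>           snprintf(val, sizeof(val), "%" PRIu64, len);
>           return xs_write(xsh, XBT_NULL, path, val, strlen(val));
>       }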
> 
>   Reasons for this design are listed below:
>   - The NFIT and SSDT in question are quite self-contained, i.e. they
>     do not refer to other ACPI tables and do not conflict with the
>     existing guest ACPI tables in Xen. Therefore, it is safe to copy
>     them from QEMU and append them to the existing guest ACPI tables.
> 
>   - A primary portion of the current and future vNVDIMM implementation
>     is about building ACPI tables. This design also leaves the
>     emulation of _DSM to QEMU, which needs to stay consistent with the
>     NFIT and SSDT it builds itself. Therefore, reusing the NFIT and
>     SSDT from QEMU eases maintenance.
> 
>   - Anthony's work to pass ACPI tables from the toolstack to hvmloader
>     does not move the building of the SSDT (and NFIT) to the
>     toolstack, so this design can still put them in hvmloader.

If we start asking QEMU to build ACPI tables, why should we stop at NFIT
and SSDT? Once upon a time somebody made the decision that ACPI tables
on Xen should be static and included in hvmloader. That might have been
a bad decision, but at least it was coherent. Loading only *some* tables
from QEMU, but not others, feels like an incomplete design to me.

For example, QEMU is currently in charge of emulating the PCI bus; why
shouldn't it be QEMU that generates the PRT and MCFG?


>  (2) Emulating Guest _DSM
> 
>   Because the same NFIT and SSDT are used, we can leave the emulation
>   of guest _DSM to QEMU. Just as it does with KVM, QEMU registers the
>   _DSM buffer as an MMIO region with Xen, and all guest evaluations of
>   _DSM are then trapped and emulated by QEMU.
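> 
>   A sketch of the QEMU side, using the generic memory_region_init_io()
>   API that the KVM vNVDIMM code also builds on; the callbacks, size
>   and base address here are placeholders, not the actual vNVDIMM
>   buffer layout:
> 
>       #include "qemu/osdep.h"
>       #include "exec/memory.h"
> 
>       /* Placeholder callbacks: a real implementation decodes the _DSM
>        * input buffer here and writes the result back. */
>       static uint64_t dsm_read(void *opaque, hwaddr addr, unsigned size)
>       {
>           return 0;
>       }
> 
>       static void dsm_write(void *opaque, hwaddr addr,
>                             uint64_t val, unsigned size)
>       {
>           /* decode and service the guest's _DSM request */
>       }
> 
>       static const MemoryRegionOps dsm_ops = {
>           .read = dsm_read,
>           .write = dsm_write,
>           .endianness = DEVICE_LITTLE_ENDIAN,
>       };
> 
>       /* Register the _DSM buffer as MMIO; guest accesses then trap to
>        * QEMU on Xen just as they do on KVM. */
>       static void dsm_register(MemoryRegion *sysmem, hwaddr base)
>       {
>           static MemoryRegion mr;
>           memory_region_init_io(&mr, NULL, &dsm_ops, NULL,
>                                 "nvdimm-dsm", 4096);
>           memory_region_add_subregion(sysmem, base, &mr);
>       }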
> 
> 3.3.2 Alternative Design 1: switching to QEMU
> 
>  Stefano Stabellini's comments [10]:
>  | I don't think it is wise to have two components which both think
>  | they are in control of generating ACPI tables, hvmloader (soon to be the
>  | toolstack with Anthony's work) and QEMU. From an architectural
>  | perspective, it doesn't look robust to me.
>  |
>  | Could we take this opportunity to switch to QEMU generating the whole
>  | set of ACPI tables?
> 
>  So an alternative design could be switching to QEMU to generate the
>  whole set of guest ACPI tables. In this way, no conflict would arise
>  between the two agents, QEMU and hvmloader. (Is this what Stefano
>  Stabellini means by 'robust'?)

Right


>  However, looking at the code that builds ACPI tables in QEMU and in
>  hvmloader, they are quite different. As ACPI tables are essential for
>  the OS to boot and operate devices, it is critical to ensure that
>  ACPI tables built by QEMU do not break existing guests on Xen. Though
>  I believe it could be done after a thorough investigation and
>  adjustment, it may take quite a lot of work and testing, and should
>  be a separate project from enabling vNVDIMM in Xen.
>
> 3.3.3 Alternative Design 2: keeping in Xen
> 
>  Alternative to switching to QEMU, another design would be building
>  NFIT and SSDT in hvmloader or toolstack.
> 
>  The number and parameters of sub-structures in the guest NFIT vary
>  with the vNVDIMM configuration and cannot be decided at compile time.
>  In contrast, the current hvmloader and toolstack can only build
>  static ACPI tables, i.e. tables whose contents are decided statically
>  at compile time, independent of the guest configuration. In order to
>  build the guest NFIT at runtime, this design may take the following
>  steps:
>  (1) xl converts NVDIMM configurations in xl.cfg to corresponding QEMU
>      options,
> 
>  (2) QEMU accepts the above options, figures out the start SPA range
>      address/size/NVDIMM device handles/..., and writes them into
>      xenstore. No ACPI table is built by QEMU.
> 
>  (3) Either xl or hvmloader reads the above parameters from xenstore
>      and builds the NFIT table.
> 
>  For the guest SSDT, it would take more work. The ACPI namespace
>  devices are defined in the SSDT in AML, so an AML builder would be
>  needed to generate those definitions at runtime; a sketch of QEMU's
>  existing one follows.
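> 
>  For comparison, QEMU already carries such an AML builder
>  (hw/acpi/aml-build.c). A sketch of defining a minimal ACPI namespace
>  device with it; the device name and _HID are illustrative, not the
>  actual vNVDIMM SSDT content:
> 
>      #include "hw/acpi/aml-build.h"
> 
>      /* Build an SSDT fragment defining one namespace device -- the
>       * kind of runtime AML generation hvmloader/xl currently lacks. */
>      static void build_nvdimm_ssdt(GArray *table_data)
>      {
>          Aml *ssdt = init_aml_allocator();
>          Aml *dev = aml_device("NVDR");
> 
>          aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0012")));
>          /* ... _DSM method and per-DIMM child devices would follow ... */
>          aml_append(ssdt, dev);
> 
>          g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
>          free_aml_allocator();
>      }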
> 
>  This alternative design still needs more work than the first design.

I prefer switching to QEMU building all ACPI tables for the devices it
emulates. However, this alternative is good too, because it is coherent
with the current design.


> References:
> [1] ACPI Specification v6,
>     http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> [2] NVDIMM Namespace Specification,
>     http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> [3] NVDIMM Block Window Driver Writer's Guide,
>     http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> [4] NVDIMM DSM Interface Example,
>     http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> [5] UEFI Specification v2.6,
>     http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_6.pdf
> [6] Intel Architecture Instruction Set Extensions Programming Reference,
>     https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf
> [7] http://www.gossamer-threads.com/lists/xen/devel/414945#414945
> [8] http://www.gossamer-threads.com/lists/xen/devel/415658#415658
> [9] http://www.gossamer-threads.com/lists/xen/devel/415681#415681
> [10] http://lists.xenproject.org/archives/html/xen-devel/2016-01/msg00271.html
> 
