From: Haozhong Zhang <haozhong.zhang@intel.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Juergen Gross <jgross@suse.com>,
	Kevin Tian <kevin.tian@intel.com>, Wei Liu <wei.liu2@citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	xen-devel@lists.xen.org, Jan Beulich <jbeulich@suse.com>,
	Jun Nakajima <jun.nakajima@intel.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Keir Fraser <keir@xen.org>
Subject: Re: [RFC Design Doc] Add vNVDIMM support for Xen
Date: Wed, 3 Feb 2016 16:28:31 +0800	[thread overview]
Message-ID: <20160203082831.GB4248@hz-desktop.sh.intel.com> (raw)
In-Reply-To: <20160202191519.GB21656@char.us.oracle.com>

On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
> > 3. Design of vNVDIMM in Xen
> 
> Thank you for this design!
> 
> > 
> >  As in KVM/QEMU, enabling vNVDIMM in Xen is composed of three
> >  parts:
> >  (1) Guest clwb/clflushopt/pcommit enabling,
> >  (2) Memory mapping, and
> >  (3) Guest ACPI emulation.
> 
> 
> .. MCE? and vMCE?
>

The specifications at hand do not seem to say much about MCE for
NVDIMM, but I remember that the NVDIMM driver in the Linux kernel does
have MCE handling code. I'll have a look at that code and add this
part later.

> > 
> >  The rest of this section presents the design of each part
> >  respectively. The basic design principle is to reuse existing code
> >  in the Linux NVDIMM driver and QEMU as much as possible. Following
> >  recent discussions on both the Xen and QEMU mailing lists for the
> >  v1 patch series, alternative designs are also listed below.
> > 
> > 
> > 3.1 Guest clwb/clflushopt/pcommit Enabling
> > 
> >  The instruction enabling is simple and we do the same work as in KVM/QEMU.
> >  - All three instructions are exposed to guest via guest cpuid.
> >  - L1 guest pcommit is never intercepted by Xen.
> 
> I wish there was some watermarks like the PLE has.
> 
> My fear is that an unfriendly guest can issue sfence all day long,
> flushing out other guests' MMC queue (the writes followed by
> pcommits). Which means that a guest may have degraded performance as
> its memory writes are being flushed out immediately, as if they were
> being written to UC instead of WB memory.
>

pcommit takes no parameters, and it seems hard to solve this problem
in hardware for now. The current VMX also provides no mechanism to
limit the rate of pcommit, unlike PLE for pause.
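
For context, a minimal sketch of the persistent-write cycle such a
guest would run in a tight loop (a sketch only: it assumes 64-byte
cache lines, and pcommit is encoded as raw bytes since older
assemblers lack the mnemonic):

    #include <stddef.h>

    /* sketch: one flush-and-commit cycle; an unfriendly guest could
     * run this back to back all day long */
    static inline void pmem_persist(void *addr, size_t len)
    {
        char *p = addr;
        size_t i;

        for ( i = 0; i < len; i += 64 )          /* 64-byte lines */
            asm volatile ( "clwb %0" :: "m" (p[i]) );
        asm volatile ( "sfence" ::: "memory" );  /* order the flushes */
        /* pcommit (66 0F AE F8): ask the platform to drain to media */
        asm volatile ( ".byte 0x66, 0x0f, 0xae, 0xf8" ::: "memory" );
        asm volatile ( "sfence" ::: "memory" );
    }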

> In other words - the NVDIMM resource does not provide any resource
> isolation. However this may not be any different than what we had
> nowadays with CPU caches.
>

Does Xen have any mechanism to isolate multiple guests' operations on
CPU caches?

> 
> >  - L1 hypervisor is allowed to intercept L2 guest pcommit.
> 
> clwb?
>

VMX is not capable of intercepting clwb. Is there any reason to
intercept it?
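
As for the cpuid part of 3.1, the enabling boils down to setting three
feature bits; a minimal sketch (the helper name is hypothetical, the
bit positions are from the SDM, CPUID.(EAX=07H,ECX=0):EBX):

    #include <stdint.h>

    #define FEAT_PCOMMIT     (1u << 22)  /* CPUID.7.0:EBX[22] */
    #define FEAT_CLFLUSHOPT  (1u << 23)  /* CPUID.7.0:EBX[23] */
    #define FEAT_CLWB        (1u << 24)  /* CPUID.7.0:EBX[24] */

    /* sketch: OR the three bits into the EBX value returned for
     * leaf 7, subleaf 0 of the guest's cpuid policy */
    static void expose_nvdimm_insns(uint32_t *ebx)
    {
        *ebx |= FEAT_PCOMMIT | FEAT_CLFLUSHOPT | FEAT_CLWB;
    }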

> > 
> > 
> > 3.2 Address Mapping
> > 
> > 3.2.1 My Design
> > 
> >  The overview of this design is shown in the following figure.
> > 
> >                  Dom0                         |               DomU
> >                                               |
> >                                               |
> >  QEMU                                         |
> >      +...+--------------------+...+-----+     |
> >   VA |   | Label Storage Area |   | buf |     |
> >      +...+--------------------+...+-----+     |
> >                      ^            ^     ^     |
> >                      |            |     |     |
> >                      V            |     |     |
> >      +-------+   +-------+        mmap(2)     |
> >      | vACPI |   | v_DSM |        |     |     |        +----+------------+
> >      +-------+   +-------+        |     |     |   SPA  |    | /dev/pmem0 |
> >          ^           ^     +------+     |     |        +----+------------+
> >  --------|-----------|-----|------------|--   |             ^            ^
> >          |           |     |            |     |             |            |
> >          |    +------+     +------------~-----~-------------+            |
> >          |    |            |            |     |        XEN_DOMCTL_memory_mapping
> >          |    |            |            +-----~--------------------------+
> >          |    |            |            |     |
> >          |    |       +----+------------+     |
> >  Linux   |    |   SPA |    | /dev/pmem0 |     |     +------+   +------+
> >          |    |       +----+------------+     |     | ACPI |   | _DSM |
> >          |    |                   ^           |     +------+   +------+
> >          |    |                   |           |         |          |
> >          |    |               Dom0 Driver     |   hvmloader/xl     |
> >  --------|----|-------------------|---------------------|----------|---------------
> >          |    +-------------------~---------------------~----------+
> >  Xen     |                        |                     |
> >          +------------------------~---------------------+
> >  ---------------------------------|------------------------------------------------
> >                                   +----------------+
> >                                                    |
> >                                             +-------------+
> >  HW                                         |    NVDIMM   |
> >                                             +-------------+
> > 
> > 
> >  This design treats host NVDIMM devices as ordinary MMIO devices:
> 
> Nice.
> 
> But it also means you need Xen to 'share' the ranges of an MMIO device.
> 
> That is, you may need the dom0 _DSM method to access certain ranges
> (the AML code may need to poke there), and the guest may want to
> access those as well.
>

Currently, we are going to support a _DSM that queries the supported
_DSM commands and accesses the vNVDIMM's label storage area. Both are
emulated by QEMU and never touch the host NVDIMM.
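
A minimal sketch of the QEMU-side dispatch for those two paths
(the function numbers follow the Intel NVDIMM _DSM interface example
spec; the struct and function names here are hypothetical):

    #include <stdint.h>
    #include <string.h>

    enum {
        DSM_FUNC_QUERY          = 0,  /* bitmask of supported funcs */
        DSM_GET_LABEL_DATA_SIZE = 4,
        DSM_GET_LABEL_DATA      = 5,
        DSM_SET_LABEL_DATA      = 6,
    };

    struct vnvdimm {
        uint8_t  *label_area;  /* QEMU-allocated, not host NVDIMM */
        uint32_t  label_size;
    };

    static uint32_t vnvdimm_dsm(struct vnvdimm *nd, uint32_t func,
                                uint8_t *buf, uint32_t len)
    {
        uint32_t n = len < nd->label_size ? len : nd->label_size;

        switch ( func )
        {
        case DSM_FUNC_QUERY:
            return (1u << DSM_FUNC_QUERY) |
                   (1u << DSM_GET_LABEL_DATA_SIZE) |
                   (1u << DSM_GET_LABEL_DATA) |
                   (1u << DSM_SET_LABEL_DATA);
        case DSM_GET_LABEL_DATA_SIZE:
            return nd->label_size;
        case DSM_GET_LABEL_DATA:
            memcpy(buf, nd->label_area, n);   /* label read */
            return 0;
        case DSM_SET_LABEL_DATA:
            memcpy(nd->label_area, buf, n);   /* label write */
            return 0;
        default:
            return ~0u;                       /* not supported */
        }
    }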

> And keep in mind that this NVDIMM management may not need to be always
> in initial domain.

I guess you mean it can be in a dedicated driver domain,

> As in you could have NVDIMM device drivers that would
> carve out the ranges to guests.

but I don't get what you mean here. More hints?
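
For reference, the QEMU side of 3.2.1 is essentially an mmap(2) of
/dev/pmem0 plus the mapping hypercall. A rough sketch, with error
paths elided and the SPA discovery hand-waved (in reality the SPA
comes from the Dom0 driver); xc_domain_memory_mapping() is the
existing libxc wrapper around XEN_DOMCTL_memory_mapping:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <xenctrl.h>

    static int map_pmem_to_guest(xc_interface *xch, uint32_t domid,
                                 unsigned long gpfn,
                                 unsigned long spa_pfn,
                                 unsigned long nr_pfns)
    {
        int fd = open("/dev/pmem0", O_RDWR);
        void *va;

        if ( fd < 0 )
            return -1;

        /* VA mapping for QEMU's own accesses (e.g. the label area) */
        va = mmap(NULL, nr_pfns << XC_PAGE_SHIFT,
                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if ( va == MAP_FAILED )
            return -1;

        /* expose the SPA range in the guest physical address space */
        return xc_domain_memory_mapping(xch, domid, gpfn, spa_pfn,
                                        nr_pfns, 1 /* add mapping */);
    }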

[...] 
> > 3.2.2 Alternative Design
> > 
> >  Jan Beulich's comments [7] on my question "why must pmem resource
> >  management and partition be done in hypervisor":
> >  | Because that's where memory management belongs. And PMEM,
> >  | other than PBLK, is just another form of RAM.
> >  | ...
> >  | The main issue is that this would imo be a layering violation
> > 
> >  George Dunlap's comments [8]:
> >  | This is not the case for PMEM.  The whole point of PMEM (correct me if
> >    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
> >  | I'm wrong) is to be used for long-term storage that survives over
> >  | reboot.  It matters very much that a guest be given the same PRAM
> >  | after the host is rebooted that it was given before.  It doesn't make
> >  | any sense to manage it the way Xen currently manages RAM (i.e., that
> >  | you request a page and get whatever Xen happens to give you).
> >  |
> >  | So if Xen is going to use PMEM, it will have to invent an entirely new
> >  | interface for guests, and it will have to keep track of those
> >  | resources across host reboots.  In other words, it will have to
> >  | duplicate all the work that Linux already does.  What do we gain from
> >  | that duplication?  Why not just leverage what's already implemented in
> >  | dom0?
> >  and [9]:
> >  | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
> >  | then you're right -- it is just another form of RAM, that should be
> >  | treated no differently than say, lowmem: a fungible resource that can be
> >  | requested by setting a flag.
> > 
> >  However, pmem is used more as persistent storage than fungible
> >  ram, and my design is for the former usage. I would like to leave
> >  the detection, driver, and partitioning (either through namespaces
> >  or file systems) of NVDIMM in the Dom0 Linux kernel.
> > 
> >  I notice that the current XEN_DOMCTL_memory_mapping does not make
> >  any sanity check of the physical address and size passed from the
> >  caller (QEMU). Can QEMU always be trusted? If not, we would need
> >  to make Xen aware of the SPA ranges of pmem so that it can refuse
> >  to map physical addresses that are in neither normal ram nor pmem.
> 
> /me nods.
> > 
> >  Instead of duplicating in Xen the detection code (parsing NFIT
> >  and evaluating _FIT) that is already in the Dom0 Linux kernel, we
> >  decide to patch the Dom0 Linux kernel to pass the parameters of
> >  host pmem NVDIMM devices to the Xen hypervisor:
> >  (1) Add a global
> >        struct rangeset pmem_rangeset
> >      in Xen hypervisor to record all SPA ranges of detected pmem devices.
> >      Each range in pmem_rangeset corresponds to a pmem device.
> > 
> >  (2) Add a hypercall
> >        XEN_SYSCTL_add_pmem_range
> >      (should it be a sysctl or a platform op?)
> >      that receives a pair of parameters (addr: starting SPA of the
> >      pmem region, len: size of the pmem region) and adds the range
> >      [addr, addr + len - 1] to pmem_rangeset.
> > 
> >  (3) Add a hypercall
> >        XEN_DOMCTL_pmem_mapping
> >      that takes the same parameters as XEN_DOMCTL_memory_mapping
> >      and maps a given host pmem range to the guest. It checks
> >      whether the given host pmem range is in pmem_rangeset before
> >      making the actual mapping.
> > 
> >  (4) Patch Linux NVDIMM driver to call XEN_SYSCTL_add_pmem_range
> >      whenever it detects a pmem device.
> > 
> >  (5) Patch QEMU to use XEN_DOMCTL_pmem_mapping for mapping host pmem
> >      devices.
> 
> That is nice - as you can instrument this on existing hardware and
> create a 'fake' starting SPA for real memory - which Xen may not see
> due to being booted with 'mem=X'.
>

'mem=X' only limits the maximum address of normal ram. Are NVDIMM and
other MMIO devices limited by it as well?
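
To make step (3) above concrete, here is a minimal sketch of the
check I have in mind, reusing Xen's existing rangeset interface (the
map_mmio_regions() call and the exact signatures are only indicative):

    #include <xen/rangeset.h>

    /* filled via XEN_SYSCTL_add_pmem_range, one range per device */
    static struct rangeset *pmem_rangeset;

    static int pmem_do_mapping(struct domain *d, unsigned long gfn,
                               unsigned long mfn, unsigned long nr_mfns)
    {
        /* refuse anything not previously registered as pmem */
        if ( !rangeset_contains_range(pmem_rangeset, mfn,
                                      mfn + nr_mfns - 1) )
            return -EINVAL;

        return map_mmio_regions(d, gfn, nr_mfns, mfn);
    }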

Thanks,
Haozhong
