From: Haozhong Zhang <haozhong.zhang@intel.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Juergen Gross <JGross@suse.com>,
	Kevin Tian <kevin.tian@intel.com>, Wei Liu <wei.liu2@citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	IanJackson <ian.jackson@eu.citrix.com>,
	George Dunlap <george.dunlap@citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jun Nakajima <jun.nakajima@intel.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Keir Fraser <keir@xen.org>
Subject: Re: [RFC Design Doc] Add vNVDIMM support for Xen
Date: Fri, 4 Mar 2016 15:30:30 +0800
Message-ID: <20160304073030.GC6267@hz-desktop.sh.intel.com>
In-Reply-To: <56C32A6302000078000D2A1C@prv-mh.provo.novell.com>

On 02/16/16 05:55, Jan Beulich wrote:
> >>> On 16.02.16 at 12:14, <stefano.stabellini@eu.citrix.com> wrote:
> > On Mon, 15 Feb 2016, Zhang, Haozhong wrote:
> >> On 02/04/16 20:24, Stefano Stabellini wrote:
> >> > On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> >> > > On 02/03/16 15:22, Stefano Stabellini wrote:
> >> > > > On Wed, 3 Feb 2016, George Dunlap wrote:
> >> > > > > On 03/02/16 12:02, Stefano Stabellini wrote:
> >> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> >> > > > > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> >> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> >> > > > > >> those files to QEMU. In this way, non-root QEMU should be able to
> >> > > > > >> mmap those files.
> >> > > > > >
> >> > > > > > Maybe that would work. Worth adding it to the design, I would like to
> >> > > > > > read more details on it.
> >> > > > > >
> >> > > > > > Also note that QEMU initially runs as root but drops privileges to
> >> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> >> > > > > > *could* mmap /dev/pmem0 while it is still running as root, but then it
> >> > > > > > wouldn't work for any devices that need to be mmap'ed at run time
> >> > > > > > (hotplug scenario).
> >> > > > >
> >> > > > > This is basically the same problem we have for a bunch of other things,
> >> > > > > right?  Having xl open a file and then pass it via qmp to qemu should
> >> > > > > work in theory, right?
> >> > > >
> >> > > > Is there one /dev/pmem? per assignable region?
> >> > > 
> >> > > Yes.
> >> > > 
> >> > > BTW, I'm wondering whether and how non-root qemu works with xl disk
> >> > > configuration that is going to access a host block device, e.g.
> >> > >      disk = [ '/dev/sdb,,hda' ]
> >> > > If that works with non-root qemu, I may take the similar solution for
> >> > > pmem.
> >> >  
> >> > Today the user is required to give the correct ownership and access mode
> >> > to the block device, so that non-root QEMU can open it. However in the
> >> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
> >> > the feature doesn't work at all with non-root QEMU
> >> > (http://marc.info/?l=xen-devel&m=145261763600528).
> >> > 
> >> > If there is one /dev/pmem device per assignable region, then it would be
> >> > conceivable to change its ownership so that non-root QEMU can open it.
> >> > Or, better, the file descriptor could be passed by the toolstack via
> >> > qmp.
> >> 
> >> Passing file descriptor via qmp is not enough.
> >> 
> >> Let me clarify where the requirement for root/privileged permissions
> >> comes from. The primary workflow in my design that maps a host pmem
> >> region or files in a host pmem region to a guest is as follows:
> >>  (1) QEMU in Dom0 mmaps the host pmem (the host /dev/pmem0 or files on
> >>      /dev/pmem0) to its virtual address space, i.e. the guest virtual
> >>      address space.
> >>  (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA
> >>      occupied by the host pmem to a DomU. This step requires the
> >>      translation from the guest virtual address (where the host pmem is
> >>      mmaped in (1)) to the host physical address. The translation can be
> >>      done by either
> >>     (a) QEMU that parses its own /proc/self/pagemap,
> >>      or
> >>     (b) Xen hypervisor that does the translation by itself [1] (though
> >>         this choice is not quite doable from Konrad's comments [2]).
> >> 
> >> [1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html 
> >> [2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html 
> >> 
> >> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability
> >> since linux kernel 4.0. Furthermore, if we don't mlock the mapped host
> >> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap),
> >> pagemap will not contain all mappings. However, mlock may require
> >> privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
> >> mlock operates on memory, the permission to open(2) the host pmem files
> >> does not solve the problem and therefore passing file descriptor via qmp
> >> does not help.
> >> 
> >> For 2-b, from Konrad's comments [2], mlock is also required and
> >> privileged permission may be required consequently.
> >> 
> >> Note that the mapping and the address translation are done before QEMU
> >> drops its privileges, so non-root QEMU should be able to work with the
> >> above design until we start considering vNVDIMM hotplug (which is not
> >> yet supported by the current vNVDIMM implementation in QEMU). In
> >> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
> >> running with root permissions.
> > 
> > Are we all good with the fact that vNVDIMM hotplug won't work (unless
> > the user explicitly asks for it at domain creation time, which is
> > very unlikely otherwise she could use coldplug)?
> 
> No, at least there needs to be a road towards hotplug, even if
> initially this may not be supported/implemented.
> 

I just realized that it is unnecessary for QEMU to get the SPA ranges of
an NVDIMM or of files on an NVDIMM. We can move that work to the
toolstack and pass the SPA ranges obtained by the toolstack to QEMU.
This way, no privileged operations (mmap/mlock/...) are needed in QEMU,
and non-root QEMU should be able to work even with vNVDIMM hotplug in
the future.
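
As a rough illustration of the translation in 2-a quoted above (which the
toolstack, rather than QEMU, would then perform), here is a minimal
Python sketch of looking up one virtual address in /proc/self/pagemap.
The function names are made up for this example and are not part of any
Xen or QEMU code. It assumes the documented Linux pagemap layout (one
64-bit entry per page, bit 63 = present, bits 0-54 = PFN), and it treats
a zero PFN as "hidden from an unprivileged caller", which is a heuristic
rather than something the kernel reports explicitly.

```python
import os
import struct

PAGEMAP_ENTRY_SIZE = 8        # one 64-bit entry per virtual page
PFN_MASK = (1 << 55) - 1      # bits 0-54: page frame number (if present)
PRESENT_BIT = 1 << 63         # bit 63: page is present in memory


def decode_pagemap_entry(entry, page_size=4096):
    """Return the physical base address of a page, or None if unknown."""
    if not entry & PRESENT_BIT:
        # Not resident; this is why the mapping has to be locked first.
        return None
    pfn = entry & PFN_MASK
    if pfn == 0:
        # Heuristic (assumption): the kernel zeroes the PFN field for
        # callers that lack CAP_SYS_ADMIN.
        return None
    return pfn * page_size


def virt_to_phys(vaddr, pagemap_path="/proc/self/pagemap"):
    """Translate one virtual address of this process (hypothetical helper)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    # The pagemap file is indexed by virtual page number.
    offset = (vaddr // page_size) * PAGEMAP_ENTRY_SIZE
    with open(pagemap_path, "rb") as f:
        f.seek(offset)
        raw = f.read(PAGEMAP_ENTRY_SIZE)
    if len(raw) != PAGEMAP_ENTRY_SIZE:
        return None
    (entry,) = struct.unpack("=Q", raw)   # entries use native byte order
    base = decode_pagemap_entry(entry, page_size)
    return None if base is None else base + vaddr % page_size
```

A caller would walk the mmap'ed range page by page, call virt_to_phys()
on each page, and merge adjacent results into SPA ranges. A None result
means either that the page is not resident or that the caller lacks the
privilege to see PFNs.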

Haozhong



