xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
From: "Zhang, Haozhong" <haozhong.zhang@intel.com>
To: Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Juergen Gross <jgross@suse.com>,
	"Tian, Kevin" <kevin.tian@intel.com>, Keir Fraser <keir@xen.org>,
	Ian Campbell <ian.campbell@citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	George Dunlap <george.dunlap@citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <jbeulich@suse.com>,
	"Nakajima, Jun" <jun.nakajima@intel.com>,
	Wei Liu <wei.liu2@citrix.com>
Subject: Re: [RFC Design Doc] Add vNVDIMM support for Xen
Date: Mon, 15 Feb 2016 11:16:57 +0800	[thread overview]
Message-ID: <20160215031657.GA8938@hz-desktop.sh.intel.com> (raw)
In-Reply-To: <alpine.DEB.2.02.1602041214590.16539@kaball.uk.xensource.com>

On 02/04/16 20:24, Stefano Stabellini wrote:
> On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> > On 02/03/16 15:22, Stefano Stabellini wrote:
> > > On Wed, 3 Feb 2016, George Dunlap wrote:
> > > > On 03/02/16 12:02, Stefano Stabellini wrote:
> > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > > > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> > > > >> those files to QEMU. In this way, non-root QEMU should be able to
> > > > >> mmap those files.
> > > > >
> > > > > Maybe that would work. Worth adding it to the design, I would like to
> > > > > read more details on it.
> > > > >
> > > > > Also note that QEMU initially runs as root but drops privileges to
> > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > > > > *could* mmap /dev/pmem0 while is still running as root, but then it
> > > > > wouldn't work for any devices that need to be mmap'ed at run time
> > > > > (hotplug scenario).
> > > >
> > > > This is basically the same problem we have for a bunch of other things,
> > > > right?  Having xl open a file and then pass it via qmp to qemu should
> > > > work in theory, right?
> > >
> > > Is there one /dev/pmem? per assignable region?
> > 
> > Yes.
> > 
> > BTW, I'm wondering whether and how non-root qemu works with xl disk
> > configuration that is going to access a host block device, e.g.
> >      disk = [ '/dev/sdb,,hda' ]
> > If that works with non-root qemu, I may take the similar solution for
> > pmem.
>  
> Today the user is required to give the correct ownership and access mode
> to the block device, so that non-root QEMU can open it. However in the
> case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
> the feature doesn't work at all with non-root QEMU
> (http://marc.info/?l=xen-devel&m=145261763600528).
> 
> If there is one /dev/pmem device per assignable region, then it would be
> conceivable to change its ownership so that non-root QEMU can open it.
> Or, better, the file descriptor could be passed by the toolstack via
> qmp.

Passing file descriptor via qmp is not enough.

Let me clarify where the requirement for root/privileged permissions
comes from. The primary workflow in my design that maps a host pmem
region or files in host pmem region to guest is shown as below:
 (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on
     /dev/pmem0) to its virtual address space, i.e. the guest virtual
     address space.
 (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA
     occupied by the host pmem to a DomU. This step requires the
     translation from the guest virtual address (where the host pmem is
     mmaped in (1)) to the host physical address. The translation can be
     done by either
    (a) QEMU that parses its own /proc/self/pagemap,
     or
    (b) Xen hypervisor that does the translation by itself [1] (though
        this choice is not quite doable from Konrad's comments [2]).

[1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html
[2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html

For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability
since linux kernel 4.0. Furthermore, if we don't mlock the mapped host
pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap),
pagemap will not contain all mappings. However, mlock may require
privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
mlock operates on memory, the permission to open(2) the host pmem files
does not solve the problem and therefore passing file descriptor via qmp
does not help.

For 2-b, from Konrad's comments [2], mlock is also required and
privileged permission may be required consequently.

Note that the mapping and the address translation are done before QEMU
dropping privileged permissions, so non-root QEMU should be able to work
with above design until we start considering vNVDIMM hotplug (which has
not been supported by the current vNVDIMM implementation in QEMU). In
the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
running with root permissions.

Haozhong

  reply	other threads:[~2016-02-15  3:16 UTC|newest]

Thread overview: 121+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-02-01  5:44 [RFC Design Doc] Add vNVDIMM support for Xen Haozhong Zhang
2016-02-01 18:25 ` Andrew Cooper
2016-02-02  3:27   ` Tian, Kevin
2016-02-02  3:44   ` Haozhong Zhang
2016-02-02 11:09     ` Andrew Cooper
2016-02-02  6:33 ` Tian, Kevin
2016-02-02  7:39   ` Zhang, Haozhong
2016-02-02  7:48     ` Tian, Kevin
2016-02-02  7:53       ` Zhang, Haozhong
2016-02-02  8:03         ` Tian, Kevin
2016-02-02  8:49           ` Zhang, Haozhong
2016-02-02 19:01   ` Konrad Rzeszutek Wilk
2016-02-02 17:11 ` Stefano Stabellini
2016-02-03  7:00   ` Haozhong Zhang
2016-02-03  9:13     ` Jan Beulich
2016-02-03 14:09       ` Andrew Cooper
2016-02-03 14:23         ` Haozhong Zhang
2016-02-05 14:40         ` Ross Philipson
2016-02-06  1:43           ` Haozhong Zhang
2016-02-06 16:17             ` Ross Philipson
2016-02-03 12:02     ` Stefano Stabellini
2016-02-03 13:11       ` Haozhong Zhang
2016-02-03 14:20         ` Andrew Cooper
2016-02-04  3:10           ` Haozhong Zhang
2016-02-03 15:16       ` George Dunlap
2016-02-03 15:22         ` Stefano Stabellini
2016-02-03 15:35           ` Konrad Rzeszutek Wilk
2016-02-03 15:35           ` George Dunlap
2016-02-04  2:55           ` Haozhong Zhang
2016-02-04 12:24             ` Stefano Stabellini
2016-02-15  3:16               ` Zhang, Haozhong [this message]
2016-02-16 11:14                 ` Stefano Stabellini
2016-02-16 12:55                   ` Jan Beulich
2016-02-17  9:03                     ` Haozhong Zhang
2016-03-04  7:30                     ` Haozhong Zhang
2016-03-16 12:55                       ` Haozhong Zhang
2016-03-16 13:13                         ` Konrad Rzeszutek Wilk
2016-03-16 13:16                         ` Jan Beulich
2016-03-16 13:55                           ` Haozhong Zhang
2016-03-16 14:23                             ` Jan Beulich
2016-03-16 14:55                               ` Haozhong Zhang
2016-03-16 15:23                                 ` Jan Beulich
2016-03-17  8:58                                   ` Haozhong Zhang
2016-03-17 11:04                                     ` Jan Beulich
2016-03-17 12:44                                       ` Haozhong Zhang
2016-03-17 12:59                                         ` Jan Beulich
2016-03-17 13:29                                           ` Haozhong Zhang
2016-03-17 13:52                                             ` Jan Beulich
2016-03-17 14:00                                             ` Ian Jackson
2016-03-17 14:21                                               ` Haozhong Zhang
2016-03-29  8:47                                                 ` Haozhong Zhang
2016-03-29  9:11                                                   ` Jan Beulich
2016-03-29 10:10                                                     ` Haozhong Zhang
2016-03-29 10:49                                                       ` Jan Beulich
2016-04-08  5:02                                                         ` Haozhong Zhang
2016-04-08 15:52                                                           ` Jan Beulich
2016-04-12  8:45                                                             ` Haozhong Zhang
2016-04-21  5:09                                                               ` Haozhong Zhang
2016-04-21  7:04                                                                 ` Jan Beulich
2016-04-22  2:36                                                                   ` Haozhong Zhang
2016-04-22  8:24                                                                     ` Jan Beulich
2016-04-22 10:16                                                                       ` Haozhong Zhang
2016-04-22 10:53                                                                         ` Jan Beulich
2016-04-22 12:26                                                                           ` Haozhong Zhang
2016-04-22 12:36                                                                             ` Jan Beulich
2016-04-22 12:54                                                                               ` Haozhong Zhang
2016-04-22 13:22                                                                                 ` Jan Beulich
2016-03-17 13:32                                         ` Konrad Rzeszutek Wilk
2016-02-03 15:47       ` Konrad Rzeszutek Wilk
2016-02-04  2:36         ` Haozhong Zhang
2016-02-15  9:04         ` Zhang, Haozhong
2016-02-02 19:15 ` Konrad Rzeszutek Wilk
2016-02-03  8:28   ` Haozhong Zhang
2016-02-03  9:18     ` Jan Beulich
2016-02-03 12:22       ` Haozhong Zhang
2016-02-03 12:38         ` Jan Beulich
2016-02-03 12:49           ` Haozhong Zhang
2016-02-03 14:30       ` Andrew Cooper
2016-02-03 14:39         ` Jan Beulich
2016-02-15  8:43   ` Haozhong Zhang
2016-02-15 11:07     ` Jan Beulich
2016-02-17  9:01       ` Haozhong Zhang
2016-02-17  9:08         ` Jan Beulich
2016-02-18  7:42           ` Haozhong Zhang
2016-02-19  2:14             ` Konrad Rzeszutek Wilk
2016-03-01  7:39               ` Haozhong Zhang
2016-03-01 18:33                 ` Ian Jackson
2016-03-01 18:49                   ` Konrad Rzeszutek Wilk
2016-03-02  7:14                     ` Haozhong Zhang
2016-03-02 13:03                       ` Jan Beulich
2016-03-04  2:20                         ` Haozhong Zhang
2016-03-08  9:15                           ` Haozhong Zhang
2016-03-08  9:27                             ` Jan Beulich
2016-03-09 12:22                               ` Haozhong Zhang
2016-03-09 16:17                                 ` Jan Beulich
2016-03-10  3:27                                   ` Haozhong Zhang
2016-03-17 11:05                                   ` Ian Jackson
2016-03-17 13:37                                     ` Haozhong Zhang
2016-03-17 13:56                                       ` Jan Beulich
2016-03-17 14:22                                         ` Haozhong Zhang
2016-03-17 14:12                                       ` Xu, Quan
2016-03-17 14:22                                         ` Zhang, Haozhong
2016-03-07 20:53                       ` Konrad Rzeszutek Wilk
2016-03-08  5:50                         ` Haozhong Zhang
2016-02-18 17:17 ` Jan Beulich
2016-02-24 13:28   ` Haozhong Zhang
2016-02-24 14:00     ` Ross Philipson
2016-02-24 16:42       ` Haozhong Zhang
2016-02-24 17:50         ` Ross Philipson
2016-02-24 14:24     ` Jan Beulich
2016-02-24 15:48       ` Haozhong Zhang
2016-02-24 16:54         ` Jan Beulich
2016-02-28 14:48           ` Haozhong Zhang
2016-02-29  9:01             ` Jan Beulich
2016-02-29  9:45               ` Haozhong Zhang
2016-02-29 10:12                 ` Jan Beulich
2016-02-29 11:52                   ` Haozhong Zhang
2016-02-29 12:04                     ` Jan Beulich
2016-02-29 12:22                       ` Haozhong Zhang
2016-03-01 13:51                         ` Ian Jackson
2016-03-01 15:04                           ` Jan Beulich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160215031657.GA8938@hz-desktop.sh.intel.com \
    --to=haozhong.zhang@intel.com \
    --cc=George.Dunlap@eu.citrix.com \
    --cc=andrew.cooper3@citrix.com \
    --cc=george.dunlap@citrix.com \
    --cc=guangrong.xiao@linux.intel.com \
    --cc=ian.campbell@citrix.com \
    --cc=ian.jackson@eu.citrix.com \
    --cc=jbeulich@suse.com \
    --cc=jgross@suse.com \
    --cc=jun.nakajima@intel.com \
    --cc=keir@xen.org \
    --cc=kevin.tian@intel.com \
    --cc=konrad.wilk@oracle.com \
    --cc=stefano.stabellini@eu.citrix.com \
    --cc=wei.liu2@citrix.com \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).