From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	Jan Beulich <jbeulich@suse.com>, Keir Fraser <keir@xen.org>,
	xen-devel@lists.xen.org,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	Juergen Gross <jgross@suse.com>, Wei Liu <wei.liu2@citrix.com>,
	Kevin Tian <kevin.tian@intel.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Jun Nakajima <jun.nakajima@intel.com>
Subject: Re: [RFC Design Doc] Add vNVDIMM support for Xen
Date: Wed, 3 Feb 2016 14:20:05 +0000
Message-ID: <56B20C95.5090503@citrix.com>
In-Reply-To: <20160203131111.GB15605@hz-desktop.sh.intel.com>

On 03/02/16 13:11, Haozhong Zhang wrote:
> On 02/03/16 12:02, Stefano Stabellini wrote:
>> On Wed, 3 Feb 2016, Haozhong Zhang wrote:
>>> On 02/02/16 17:11, Stefano Stabellini wrote:
>>>> On Mon, 1 Feb 2016, Haozhong Zhang wrote:
> [...]
>>>>>  This design treats host NVDIMM devices as ordinary MMIO devices:
>>>>>  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
>>>>>      and drive host NVDIMM devices (implementing block device
>>>>>      interface). Namespaces and file systems on host NVDIMM devices
>>>>>      are handled by Dom0 Linux as well.
>>>>>
>>>>>  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
>>>>>      virtual address space (buf).
>>>>>
>>>>>  (3) QEMU gets the host physical address of buf, i.e. the host system
>>>>>      physical address that is occupied by /dev/pmem0, and calls Xen
>>>>>      hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
>>>> How is this going to work from a security perspective? Is it going to
>>>> require running QEMU as root in Dom0, which will prevent NVDIMM from
>>>> working by default on Xen? If so, what's the plan?
>>>>
>>> Oh, I forgot to address the non-root qemu issues in this design ...
>>>
>>> The default user:group of /dev/pmem0 is root:disk, and its permission
>>> is rw-rw----. We could lift the others permission to rw, so that
>>> non-root QEMU can mmap /dev/pmem0. But it looks too risky.
>> Yep, too risky.
>>
>>
>>> Or, we can make a file system on /dev/pmem0, create files on it, set
>>> the owner of those files to xen-qemuuser-domid$domid, and then pass
>>> those files to QEMU. In this way, non-root QEMU should be able to
>>> mmap those files.
>> Maybe that would work. Worth adding it to the design, I would like to
>> read more details on it.
>>
>> Also note that QEMU initially runs as root but drops privileges to
>> xen-qemuuser-domid$domid before the guest is started. Initially QEMU
>> *could* mmap /dev/pmem0 while it is still running as root, but then it
>> wouldn't work for any devices that need to be mmap'ed at run time
>> (hotplug scenario).
>>
> Thanks for this information. I'll test some experimental code and then
> post a design to address the non-root qemu issue.
>
>>>>>  (ACPI part is described in Section 3.3 later)
>>>>>
>>>>>  Above (1)(2) have already been done in current QEMU. Only (3) is
>>>>>  needed to implement in QEMU. No change is needed in Xen for address
>>>>>  mapping in this design.
>>>>>
>>>>>  Open: It seems no system call/ioctl is provided by Linux kernel to
>>>>>        get the physical address from a virtual address.
>>>>>        /proc/<qemu_pid>/pagemap provides information of mapping from
>>>>>        VA to PA. Is it an acceptable solution to let QEMU parse this
>>>>>        file to get the physical address?
>>>> Does it work in a non-root scenario?
>>>>
>>> Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
>>> | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
>>> | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
>>> | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
>>> | Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
>>>
>>> A possible alternative is to add a new hypercall similar to
>>> XEN_DOMCTL_memory_mapping but receiving virtual address as the address
>>> parameter and translating to machine address in the hypervisor.
>> That might work.
>>
>>
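The pagemap restriction quoted above can be illustrated with a minimal sketch. It assumes the entry layout described in Documentation/vm/pagemap.txt (64-bit entries, PFN in bits 0-54, "present" in bit 63). The helper names decode_pagemap_entry and lookup_pfn are made up for this example and are not part of QEMU or Xen.

```python
import os
import struct

PAGEMAP_ENTRY_BYTES = 8
PFN_MASK = (1 << 55) - 1   # bits 0-54: PFN, valid only when the page is present
PRESENT_BIT = 1 << 63      # bit 63: page is present in RAM


def decode_pagemap_entry(entry):
    """Split one 64-bit pagemap entry into (present, pfn)."""
    present = bool(entry & PRESENT_BIT)
    pfn = entry & PFN_MASK if present else 0
    return present, pfn


def lookup_pfn(vaddr, pid="self"):
    """Read the pagemap entry covering vaddr in the given process.

    Without CAP_SYS_ADMIN, Linux 4.0/4.1 refuse the open and 4.2+ return
    a zeroed PFN field, so an unprivileged caller sees pfn == 0 even for
    a present page.
    """
    page_size = os.sysconf("SC_PAGE_SIZE")
    offset = (vaddr // page_size) * PAGEMAP_ENTRY_BYTES
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(offset)
        raw = f.read(PAGEMAP_ENTRY_BYTES)
    if len(raw) != PAGEMAP_ENTRY_BYTES:
        raise OSError("short read from pagemap")
    (entry,) = struct.unpack("=Q", raw)  # entries are in native byte order
    return decode_pagemap_entry(entry)
```

A result of (True, 0) from lookup_pfn is what a non-root QEMU would get on a 4.2+ kernel, which is why parsing this file does not help in the non-root case.
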
>>>>>  Open: For a large pmem, mmap(2) may well not map all SPA
>>>>>        occupied by pmem at the beginning, i.e. QEMU may not be able to
>>>>>        get all SPA of pmem from buf (in virtual address space) when
>>>>>        calling XEN_DOMCTL_memory_mapping.
>>>>>        Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
>>>>>        entire pmem being mmaped?
>>>> Ditto
>>>>
>>> No. If I take the above alternative for the first open, maybe the new
>>> hypercall above can inject page faults into dom0 for the unmapped
>>> virtual address so as to enforce dom0 Linux to create the page
>>> mapping.
>> Otherwise you need to use something like the mapcache in QEMU
>> (xen-mapcache.c), which admittedly, given its complexity, would be best
>> to avoid.
>>
> Definitely not mapcache like things. What I want is something similar to
> what emulate_gva_to_mfn() in Xen does.

Please not quite like that.  It would restrict this to only working in a
PV dom0.

MFNs are an implementation detail.  Interfaces should take GFNs, which
have a consistent logical meaning across PV and HVM domains.

As an introduction, see
http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/xen/mm.h;h=a795dd6001eff7c5dd942bbaf153e3efa5202318;hb=refs/heads/staging#l8
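
As a rough illustration of why a GFN-based interface works for both guest types, here is a toy model; it is not Xen code.  The names Domain, translated, p2m and gfn_to_mfn are hypothetical.  The sketch assumes, as I understand the mm.h comment linked above, that a non-translated PV guest has gfn == mfn while a translated (HVM) guest goes through a per-domain lookup.

```python
from typing import Dict, NewType, Optional

GFN = NewType("GFN", int)  # frame number as the guest names it
MFN = NewType("MFN", int)  # machine frame number, private to the hypervisor


class Domain:
    """Toy domain: only models how a GFN resolves to an MFN."""

    def __init__(self, domid: int, translated: bool,
                 p2m: Optional[Dict[GFN, MFN]] = None):
        self.domid = domid
        self.translated = translated      # True for HVM-like guests
        self.p2m: Dict[GFN, MFN] = dict(p2m or {})

    def gfn_to_mfn(self, gfn: GFN) -> Optional[MFN]:
        # Assumption: non-translated PV guests use machine frames directly.
        if not self.translated:
            return MFN(int(gfn))
        # Translated guests need a lookup; None means "not mapped".
        return self.p2m.get(gfn)


def map_frame(caller: Domain, gfn: GFN) -> Optional[MFN]:
    """Hypothetical interface that takes a GFN, never an MFN."""
    return caller.gfn_to_mfn(gfn)
```

The caller passes the same kind of value in both cases, and only the hypervisor-side translation differs.  An interface taking MFNs directly would only make sense for the non-translated case.
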

We also need to consider the Xen-side security.  Currently a domain may
be given privilege to map an MMIO range.  IIRC, this allows the emulator
domain to make mappings for the guest, and for the guest to make
mappings itself.  With PMEM, we can't allow a domain to make mappings
itself because it could end up mapping resources which belong to another
domain.  We probably need an intermediate level which only permits an
emulator to make the mappings.
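
A minimal sketch of what such an intermediate level could look like, under stated assumptions: FrameRange, PmemPolicy, grant and may_map are invented for illustration and do not correspond to any existing Xen interface.  The only rules modelled are that a guest can never map PMEM for itself, and that an emulator can map only ranges explicitly granted for that specific guest.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FrameRange:
    start: int   # first frame number in the range
    count: int   # number of frames

    def contains(self, other: "FrameRange") -> bool:
        return (self.start <= other.start and
                other.start + other.count <= self.start + self.count)


@dataclass
class PmemPolicy:
    # (emulator domid, target domid) -> ranges the emulator may map
    grants: Dict[Tuple[int, int], List[FrameRange]] = field(default_factory=dict)

    def grant(self, emulator: int, target: int, rng: FrameRange) -> None:
        self.grants.setdefault((emulator, target), []).append(rng)

    def may_map(self, caller: int, target: int, rng: FrameRange) -> bool:
        # A guest must not create PMEM mappings for itself.
        if caller == target:
            return False
        # The emulator needs a grant covering the whole requested range.
        allowed = self.grants.get((caller, target), [])
        return any(g.contains(rng) for g in allowed)
```

For example, after policy.grant(emulator=0, target=5, rng=FrameRange(0x1000, 0x100)), a request from domain 0 on behalf of domain 5 inside that range is accepted, while the same request issued by domain 5 itself, or on behalf of domain 6, is refused.
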

>
> [...]
>>>> If we start asking QEMU to build ACPI tables, why should we stop at NFIT
>>>> and SSDT?
>>> for easing my development of supporting vNVDIMM in Xen ... I mean
>>> NFIT and SSDT are the only two tables needed for this purpose and I'm
>>> afraid to break existing guests if I completely switch to QEMU for
>>> guest ACPI tables.
>> I realize that my words have been a bit confusing. Not /all/ ACPI
>> tables, just all the tables regarding devices for which QEMU is in
>> charge (the PCI bus and all devices behind it). Anything related to cpus
>> and memory (FADT, MADT, etc) would still be left to hvmloader.
> OK, then it's clear for me. From Jan's reply, at least MCFG is from
> QEMU. I'll look at whether other PCI related tables are also from QEMU
> or similar to those in QEMU. If yes, then it looks reasonable to let
> QEMU generate them.

It is entirely likely that the current split of sources of ACPI tables
is incorrect.  We should also see what can be done about fixing that.

~Andrew
