From: Haozhong Zhang <haozhong.zhang@intel.com>
To: Bob Liu <bob.liu@oracle.com>
Cc: Juergen Gross <jgross@suse.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Wei Liu <wei.liu2@citrix.com>,
	"Nakajima, Jun" <jun.nakajima@intel.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <jbeulich@suse.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>
Subject: Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
Date: Tue, 19 Jul 2016 10:40:12 +0800
Message-ID: <20160719024012.tpnh7qv5zs5lx347@hz-desktop>
In-Reply-To: <578D8911.7070503@oracle.com>

On 07/19/16 09:57, Bob Liu wrote:
> Hey Haozhong,
> 
> On 07/18/2016 08:29 AM, Haozhong Zhang wrote:
> > Hi,
> > 
> > Following is version 2 of the design doc for supporting vNVDIMM in
> 
> This version is really good: very clear, and it includes almost everything I'd like to know.
> 
> > Xen. It's basically the summary of discussion on previous v1 design
> > (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html).
> > Any comments are welcome. The corresponding patches are WIP.
> > 
> 
> So are you (or Intel) going to write all the patches? Is there any task for the community to take part in?
>

For the first version, I think so. Currently there are some
dependencies among the multiple parts of my patches (Xen/Linux/QEMU),
and I have to adjust them from time to time during development. Once
I can provide a working first version, I'll be glad to work with the
community on further development.

> [..snip..]
> > 3. Usage Example of vNVDIMM in Xen
> > 
> >  Our design is to provide virtual pmem devices to HVM domains. The
> >  virtual pmem devices are backed by host pmem devices.
> > 
> >  Dom0 Linux kernel can detect the host pmem devices and create
> >  /dev/pmemXX for each detected device. Users in Dom0 can then create
> >  a DAX file system on /dev/pmemXX and create several pre-allocated
> >  files in that DAX file system.
> > 
> >  After setting up the file system on the host pmem, users can add
> >  the following lines to the xl configuration file to assign host
> >  pmem regions to domains:
> >      vnvdimm = [ 'file=/dev/pmem0' ]
> >  or
> >      vnvdimm = [ 'file=/mnt/dax/pre_allocated_file' ]
> > 
> 
> Could you please also consider the case when a driver domain gets involved?
> E.g. vnvdimm = [ 'file=/dev/pmem0', backend='xxx' ]?
>

I will consider that in the design, but for the first version of the
patches I would like to keep things simple.

> >   The first type of configuration assigns the entire pmem device
> >   (/dev/pmem0) to the domain, while the second assigns the space
> >   allocated to /mnt/dax/pre_allocated_file on the host pmem device to
> >   the domain.
> > 
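To make the pre-allocated file case a bit more concrete, such a file
would typically be created in Dom0 along the following lines (an
illustrative sketch only; the path and size are just examples, and
error handling is minimal):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Example path on a DAX-mounted pmem file system and an
         * example size; both are placeholders. */
        const char *path = "/mnt/dax/pre_allocated_file";
        off_t size = 1ULL << 30;    /* 1 GiB */

        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Reserve all blocks up front so the file's extents on the
         * host pmem device are fixed before it is assigned to a
         * guest. */
        int err = posix_fallocate(fd, 0, size);
        if (err) {
            fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
            close(fd);
            return 1;
        }

        close(fd);
        return 0;
    }
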
> ..[snip..]
> > 
> > 4.2.2 Detection of Host pmem Devices
> > 
> >  The detection and initialization of host pmem devices require a
> >  non-trivial driver to interact with the corresponding ACPI namespace
> >  devices, parse namespace labels and take necessary recovery actions.
> >  Instead of duplicating the comprehensive Linux pmem driver in the
> >  Xen hypervisor, our design leaves this to Dom0 Linux and lets Dom0
> >  Linux report the detected host pmem devices to the Xen hypervisor.
> > 
> >  Our design takes the following steps to detect host pmem devices
> >  when Xen boots.
> >  (1) As when booting on bare metal, host pmem devices are detected
> >      by the Dom0 Linux NVDIMM driver.
> > 
> >  (2) Our design extends the Linux NVDIMM driver to report the SPAs
> >      and sizes of the pmem devices and their reserved areas to the
> >      Xen hypervisor via a new hypercall (see the sketch interleaved
> >      after this list).
> > 
> >  (3) The Xen hypervisor then checks
> >      - whether the SPA range of the newly reported pmem device
> >        overlaps with that of any previously reported pmem device;
> >      - whether the reserved area fits in the pmem device and is
> >        large enough to hold page_info structs for the whole device.
> > 
> >      If any check fails, the reported pmem device will be ignored by
> >      the Xen hypervisor and hence will not be used by any
> >      guest. Otherwise, the Xen hypervisor will record the reported
> >      parameters and create page_info structs in the reserved area.
> > 
> >  (4) Because the reserved area is now used by the Xen hypervisor, it
> >      should not be accessible by Dom0 any more. Therefore, if a host
> >      pmem device is recorded by the Xen hypervisor, Xen will unmap its
> >      reserved area from Dom0. Our design also needs to extend the
> >      Linux NVDIMM driver to "balloon out" the reserved area after it
> >      successfully reports a pmem device to the Xen hypervisor.
> > 
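To make step (2) above a bit more concrete, the information reported
to Xen per pmem region would be roughly the following (just a sketch;
the structure and field names are placeholders, not the final
hypercall interface):

    /* Placeholder sketch of the per-device information passed to Xen
     * by the new hypercall in step (2); names are not final. */
    struct xen_pmem_report {
        uint64_t spa;            /* start SPA of the host pmem region */
        uint64_t size;           /* size of the region in bytes */
        uint64_t reserved_spa;   /* start SPA of the reserved area used
                                    by Xen for page_info structs */
        uint64_t reserved_size;  /* size of the reserved area in bytes */
    };
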
> > 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> > 
> >  Before a pmem file is assigned to a domain, we need to know the host
> >  SPA ranges that are allocated to this file. We do this work in xl.
> > 
> >  If a pmem device /dev/pmem0 is given, xl will read
> >  /sys/block/pmem0/device/{resource,size} respectively for the start
> >  SPA and size of the pmem device.
> > 
> >  If a pre-allocated file /mnt/dax/file is given,
> >  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
> >      it uses the method above to get the start SPA of the host pmem
> >      device.
> >  (2) xl then uses the fiemap ioctl to get the extent mappings of
> >      /mnt/dax/file, and adds the physical offset and length of each
> >      mapping entry to the above start SPA to get the SPA ranges
> >      pre-allocated for this file.
> > 
> 
> Looks like pmem can't be passed through to a driver domain directly like, e.g., PCI devices.
>

pmem is not a PCI device.

I'm not familiar with driver domains. If only PCI devices can be
passed through to a driver domain, then it may not be possible to pass
a pmem device through to a driver domain.

> So suppose a driver domain is created with vnvdimm = [ 'file=/dev/pmem0' ], and a DAX file system is made in the driver domain.
> 
> Then new guests are created with vnvdimm = [ 'file=dax file in driver domain', backend = 'driver domain' ].
> Is this going to work? In my understanding, fiemap can only get the GPFN instead of the real SPA of the pmem in this case.
>

fiemap returns the offsets of the extents. They are added to the start
SPA of the corresponding /dev/pmem0 (obtained via
/sys/block/pmem0/device/resource). For Dom0, we can get the host
physical addresses in this way.
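
Roughly, the xl-side computation looks like the sketch below (untested
and with error handling omitted; the file and sysfs paths are just
examples). It only illustrates how the FIEMAP extents are combined
with the device's start SPA:

    #include <fcntl.h>
    #include <inttypes.h>
    #include <linux/fiemap.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Start SPA of the backing pmem device (e.g. /dev/pmem0). */
        uint64_t dev_spa = 0;
        FILE *res = fopen("/sys/block/pmem0/device/resource", "r");
        fscanf(res, "%" SCNx64, &dev_spa);
        fclose(res);

        int fd = open("/mnt/dax/pre_allocated_file", O_RDONLY);

        /* First FIEMAP call (fm_extent_count == 0) only returns the
         * number of extents; the second call fetches them. */
        struct fiemap probe;
        memset(&probe, 0, sizeof(probe));
        probe.fm_length = FIEMAP_MAX_OFFSET;
        ioctl(fd, FS_IOC_FIEMAP, &probe);

        struct fiemap *fm = calloc(1, sizeof(*fm) +
            probe.fm_mapped_extents * sizeof(struct fiemap_extent));
        fm->fm_length = FIEMAP_MAX_OFFSET;
        fm->fm_extent_count = probe.fm_mapped_extents;
        ioctl(fd, FS_IOC_FIEMAP, fm);

        /* fe_physical is the extent's offset within /dev/pmem0, so
         * adding the device's start SPA gives the host SPA range
         * backing that extent of the file. */
        for (uint32_t i = 0; i < fm->fm_mapped_extents; i++) {
            const struct fiemap_extent *e = &fm->fm_extents[i];
            uint64_t start = dev_spa + e->fe_physical;
            uint64_t end = start + e->fe_length;
            printf("SPA 0x%" PRIx64 " - 0x%" PRIx64 "\n", start, end);
        }

        free(fm);
        close(fd);
        return 0;
    }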

I'm not sure whether a pmem device can be passed through to a driver
domain, and (if it can) whether the host SPA would be visible to the
driver domain. If the answer to either is no, pmem would not work with
a driver domain in the above way.

> 
> >  The resulting host SPA ranges will be passed to QEMU, which
> >  allocates guest address space for the vNVDIMM device and calls the
> >  Xen hypervisor to map the guest addresses to the host SPA ranges.
> > 
> 
> Can Dom0 still access the same SPA range when Xen decides to assign it to a new domU?
> I assume the range will be unmapped automatically from Dom0 in the hypercall?
>

Yes, it will be unmapped from Dom0.

Thanks,
Haozhong

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
