From: Ross Philipson <ross.philipson@gmail.com>
To: Jan Beulich <JBeulich@suse.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	Wei Liu <wei.liu2@citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	Jun Nakajima <jun.nakajima@intel.com>,
	Kevin Tian <kevin.tian@intel.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	xen-devel@lists.xen.org,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Juergen Gross <JGross@suse.com>, Keir Fraser <keir@xen.org>
Subject: Re: [RFC Design Doc] Add vNVDIMM support for Xen
Date: Wed, 24 Feb 2016 09:00:21 -0500	[thread overview]
Message-ID: <56CDB775.4020109@gmail.com> (raw)
In-Reply-To: <20160224132855.GA2627@hz-desktop.sh.intel.com>

On 02/24/2016 08:28 AM, Haozhong Zhang wrote:
> On 02/18/16 10:17, Jan Beulich wrote:
>>>>> On 01.02.16 at 06:44, <haozhong.zhang@intel.com> wrote:
>>>  This design treats host NVDIMM devices as ordinary MMIO devices:
>>
>> Wrt the cachability note earlier on, I assume you're aware that with
>> the XSA-154 changes we disallow any cachable mappings of MMIO
>> by default.
>>
> 
> EPT entries that map the host NVDIMM SPAs to the guest will be the only
> mapping used for NVDIMM. If the memory type in the last-level entries is
> always set to the same type reported by the NFIT, and the ipat bit is
> always set as well, I think it would not suffer from the cache-type
> inconsistency problem addressed by XSA-154, would it?
> 
>>>  (1) The Dom0 Linux NVDIMM driver is responsible for detecting
>>>      (through NFIT) and driving host NVDIMM devices (implementing a
>>>      block device interface). Namespaces and file systems on host
>>>      NVDIMM devices are handled by Dom0 Linux as well.
>>>
>>>  (2) QEMU mmap(2)s the pmem NVDIMM device (/dev/pmem0) into its
>>>      virtual address space (buf).
>>>
>>>  (3) QEMU gets the host physical address of buf, i.e. the host system
>>>      physical address that is occupied by /dev/pmem0, and calls the
>>>      Xen hypercall XEN_DOMCTL_memory_mapping to map it into a DomU.
>>>
>>>  (ACPI part is described in Section 3.3 later)
>>>
>>>  Steps (1) and (2) are already done in current QEMU; only (3) needs
>>>  to be implemented in QEMU. No change is needed in Xen for address
>>>  mapping in this design.
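
For reference, step (3) above essentially boils down to a single libxc
call; a minimal sketch (function name, MFN/GFN values are placeholders
of mine, error handling is trimmed):

  #include <xenctrl.h>

  /* Sketch: map the machine frame range backing /dev/pmem0 into a guest.
   * Issues XEN_DOMCTL_memory_mapping under the hood. */
  static int map_pmem_to_guest(uint32_t domid, unsigned long guest_gfn,
                               unsigned long pmem_mfn, unsigned long nr_mfns)
  {
      xc_interface *xch = xc_interface_open(NULL, NULL, 0);
      int rc;

      if (!xch)
          return -1;
      rc = xc_domain_memory_mapping(xch, domid, guest_gfn, pmem_mfn,
                                    nr_mfns, DPCI_ADD_MAPPING);
      xc_interface_close(xch);
      return rc;
  }
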
>>>
>>>  Open: It seems no system call/ioctl is provided by the Linux kernel
>>>        to get the physical address corresponding to a virtual
>>>        address. /proc/<qemu_pid>/pagemap provides the VA-to-PA
>>>        mapping information. Is it an acceptable solution to let QEMU
>>>        parse this file to get the physical address?
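
For what it's worth, the pagemap lookup itself is small; a minimal
sketch (the function name is mine, error handling is trimmed, and note
that recent kernels hide the PFN field from unprivileged readers):

  #include <stdint.h>
  #include <unistd.h>
  #include <fcntl.h>

  /* Translate a virtual address of the calling process to a physical
   * address via /proc/self/pagemap: one 64-bit entry per page, bit 63 =
   * page present, bits 0-54 = PFN.  Returns 0 if the page is absent. */
  static uint64_t va_to_pa(void *va)
  {
      long pgsz = sysconf(_SC_PAGESIZE);
      uint64_t ent, vfn = (uintptr_t)va / pgsz;
      int fd = open("/proc/self/pagemap", O_RDONLY);

      if (fd < 0)
          return 0;
      if (pread(fd, &ent, sizeof(ent), vfn * sizeof(ent)) != sizeof(ent) ||
          !(ent & (1ULL << 63)))
          ent = 0;
      close(fd);
      return ent ? (ent & ((1ULL << 55) - 1)) * pgsz + (uintptr_t)va % pgsz
                 : 0;
  }
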
>>>
>>>  Open: For a large pmem, mmap(2) may well not map all SPAs occupied
>>>        by the pmem up front, i.e. QEMU may not be able to get all
>>>        SPAs of the pmem from buf (in its virtual address space) when
>>>        calling XEN_DOMCTL_memory_mapping.
>>>        Can the mmap flag MAP_LOCKED or mlock(2) be used to force the
>>>        entire pmem to be mapped?
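
For illustration, the mlock(2) variant of that idea would look roughly
like the following (names are mine; mmap(2) documents that MAP_LOCKED
may fail to populate the range without reporting it, hence the explicit
mlock(); whether this behaves as intended on a DAX mapping is exactly
the open question):

  #include <stddef.h>
  #include <sys/mman.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* Map the whole pmem device and pin it, so that every page has been
   * faulted in before its SPAs are looked up. */
  static void *map_pmem_locked(const char *path, size_t len)
  {
      int fd = open(path, O_RDWR);
      void *buf;

      if (fd < 0)
          return NULL;
      buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      close(fd);
      if (buf == MAP_FAILED)
          return NULL;
      if (mlock(buf, len)) {          /* reported, unlike MAP_LOCKED */
          munmap(buf, len);
          return NULL;
      }
      return buf;
  }
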
>>
>> A fundamental question I have here is: Why does qemu need to
>> map this at all? It shouldn't itself need to access those ranges,
>> since the guest is given direct access. It would seem quite a bit
>> more natural if qemu simply inquired about the underlying GFN range(s)
>> and handed those to Xen for translation to MFNs and mapping
>> into guest space.
>>
> 
> The above design is more of a hack on the existing QEMU
> implementation for KVM that avoids modifying the Dom0 kernel.
> 
> Maybe it's better to let QEMU get SPAs from the Dom0 kernel (NVDIMM
> driver) and then pass them to Xen for the address mapping:
> (1) QEMU passes the fd of /dev/pmemN, or of a file on /dev/pmemN, to
>     the Dom0 kernel.
> (2) The Dom0 kernel finds and returns all SPAs occupied by /dev/pmemN,
>     or by the portions of /dev/pmemN where the file sits.
> (3) QEMU passes the above SPAs, and the GMFNs where they should be
>     mapped, to Xen, which then maps the SPAs to those GMFNs.
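
To make that flow concrete, a rough sketch of QEMU's side is below.
The ioctl name and extent structure are purely hypothetical (no such
Dom0 interface exists today); the mapping in step (3) would reuse
xc_domain_memory_mapping() as in the earlier sketch:

  #include <stdint.h>
  #include <sys/ioctl.h>

  /* Hypothetical Dom0 interface: given an fd for /dev/pmemN (or a file
   * on it), return the backing SPA extent.  This only illustrates steps
   * (1)-(3) above; a real interface would return a list of extents. */
  struct pmem_extent {
      uint64_t spa;
      uint64_t len;
  };
  #define PMEM_IOC_GET_EXTENT  _IOR('P', 0x10, struct pmem_extent)

  static int get_pmem_extent(int pmem_fd, struct pmem_extent *ext)
  {
      /* Steps (1)+(2): ask the Dom0 NVDIMM driver for the SPA range;
       * each extent is then fed to xc_domain_memory_mapping() (step 3). */
      return ioctl(pmem_fd, PMEM_IOC_GET_EXTENT, ext);
  }
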
> 
>>>  I notice that the current XEN_DOMCTL_memory_mapping does not
>>>  sanity-check the physical address and size passed from the caller
>>>  (QEMU). Can QEMU always be trusted? If not, we would need to make
>>>  Xen aware of the SPA ranges of pmem so that it can refuse to map
>>>  physical addresses that are in neither normal RAM nor pmem.
>>
>> I'm not sure what missing sanity checks this is about: The handling
>> involves two iomem_access_permitted() calls.
>>
> 
> Oh, I missed iomem_access_permitted(). It seems to be what I need,
> provided the SPA ranges of host NVDIMMs are added to Dom0's iomem_caps
> initially.
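
For the guest side, the toolstack would normally seed that permission
explicitly before any mapping is set up; roughly the following (a
sketch only, the function name and frame numbers are placeholders):

  #include <xenctrl.h>

  /* Grant the target domain access to the pmem machine frame range, so
   * that the iomem_access_permitted() checks on the mapping path pass.
   * Issues XEN_DOMCTL_iomem_permission under the hood. */
  static int allow_pmem_iomem(xc_interface *xch, uint32_t domid,
                              unsigned long first_mfn, unsigned long nr_mfns)
  {
      return xc_domain_iomem_permission(xch, domid, first_mfn, nr_mfns, 1);
  }
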
> 
>>> 3.3 Guest ACPI Emulation
>>>
>>> 3.3.1 My Design
>>>
>>>  Guest ACPI emulation is composed of two parts: building the guest
>>>  NFIT and the SSDT that defines the ACPI namespace devices for
>>>  NVDIMM, and emulating the guest _DSM.
>>>
>>>  (1) Building Guest ACPI Tables
>>>
>>>   This design reuses and extends hvmloader's existing mechanism that
>>>   loads passthrough ACPI tables from binary files to load NFIT and
>>>   SSDT tables built by QEMU:
>>>   1) Because current QEMU does not build any ACPI tables when it
>>>      runs as the Xen device model, this design needs to patch QEMU
>>>      to build the NFIT and SSDT (so far only those two) in this case.
>>>
>>>   2) QEMU copies NFIT and SSDT to the end of guest memory below
>>>      4G. The guest address and size of those tables are written into
>>>      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
>>>
>>>   3) hvmloader is patched to probe and load device model passthrough
>>>      ACPI tables from the above xenstore keys. The detected ACPI
>>>      tables are then appended to the end of the existing guest ACPI
>>>      tables, just as the current construct_passthrough_tables() does.
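
As an illustration of step 2), the xenstore side in QEMU could be as
small as this (a sketch using libxenstore directly; the function name,
buffer sizes and permission handling are simplified):

  #include <stdbool.h>
  #include <stdint.h>
  #include <inttypes.h>
  #include <stdio.h>
  #include <string.h>
  #include <xenstore.h>

  /* Publish the guest address/length of the copied tables under
   * /local/domain/<domid>/hvmloader/dm-acpi/ so hvmloader can find them. */
  static bool publish_dm_acpi(int domid, uint64_t addr, uint64_t len)
  {
      struct xs_handle *xsh = xs_open(0);
      char path[64], val[32];
      bool ok;

      if (!xsh)
          return false;
      snprintf(path, sizeof(path),
               "/local/domain/%d/hvmloader/dm-acpi/address", domid);
      snprintf(val, sizeof(val), "%" PRIu64, addr);
      ok = xs_write(xsh, XBT_NULL, path, val, strlen(val));
      snprintf(path, sizeof(path),
               "/local/domain/%d/hvmloader/dm-acpi/length", domid);
      snprintf(val, sizeof(val), "%" PRIu64, len);
      ok = ok && xs_write(xsh, XBT_NULL, path, val, strlen(val));
      xs_close(xsh);
      return ok;
  }
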
>>>
>>>   Reasons for this design are listed below:
>>>   - The NFIT and SSDT in question are quite self-contained, i.e. they
>>>     do not refer to other ACPI tables and do not conflict with the
>>>     existing guest ACPI tables in Xen. Therefore, it is safe to copy
>>>     them from QEMU and append them to the existing guest ACPI tables.
>>
>> How is this absence of conflicts guaranteed? In particular I don't
>> see how tables containing AML code and coming from different
>> sources won't possibly cause ACPI namespace collisions.
>>
> 
> There really is no effective mechanism to avoid ACPI namespace
> collisions (and other kinds of conflicts) between ACPI tables loaded
> from QEMU and ACPI tables built by hvmloader. Because which ACPI tables
> are loaded is determined by developers, IMO it is the developers'
> responsibility to avoid any collisions and conflicts with existing ACPI
> tables.
> 
>>> 3.3.3 Alternative Design 2: keeping in Xen
>>>
>>>  As an alternative to switching to QEMU, another design would be to
>>>  build the NFIT and SSDT in hvmloader or the toolstack.
>>>
>>>  The number and parameters of the sub-structures in the guest NFIT
>>>  vary with the vNVDIMM configuration and cannot be decided at compile
>>>  time. In contrast, the current hvmloader and toolstack can only
>>>  build static ACPI tables, i.e. their contents are decided statically
>>>  at compile time, independent of the guest configuration. In order to
>>>  build the guest NFIT at runtime, this design may take the following
>>>  steps:
>>>  (1) xl converts NVDIMM configurations in xl.cfg to corresponding QEMU
>>>      options,
>>>
>>>  (2) QEMU accepts the above options, figures out the SPA range start
>>>      address/size/NVDIMM device handles/..., and writes them to
>>>      xenstore. No ACPI table is built by QEMU.
>>>
>>>  (3) Either xl or hvmloader reads the above parameters from xenstore
>>>      and builds the NFIT table.
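
To give a sense of what xl or hvmloader would have to emit, the main
per-configuration piece is the NFIT SPA Range Structure; roughly the
following layout per ACPI 6.0 (shown as a packed C struct purely for
illustration, not code taken from any tree):

  #include <stdint.h>

  /* NFIT System Physical Address (SPA) Range Structure, one per
   * vNVDIMM region; 56 bytes per ACPI 6.0. */
  struct nfit_spa_range {
      uint16_t type;                  /* 0 = SPA Range Structure */
      uint16_t length;                /* 56 */
      uint16_t range_index;
      uint16_t flags;
      uint32_t reserved;
      uint32_t proximity_domain;
      uint8_t  range_type_guid[16];   /* e.g. the persistent memory GUID */
      uint64_t spa_base;              /* where the vNVDIMM appears to the guest */
      uint64_t spa_length;
      uint64_t memory_mapping_attribute;
  } __attribute__((packed));
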
>>>
>>>  For the guest SSDT, it would take more work. The ACPI namespace
>>>  devices are defined in the SSDT in AML, so an AML builder would be
>>>  needed to generate those definitions at runtime.
>>
>> I'm not sure this last half sentence is true: We do some dynamic
>> initialization of the pre-generated DSDT already, using the
>> runtime-populated block at ACPI_INFO_PHYSICAL_ADDRESS.
>>
> 
> IIUC, if an ACPI namespace device in the guest DSDT needs to use a
> parameter that is not available at build time, it can refer to the
> corresponding field in the memory block at ACPI_INFO_PHYSICAL_ADDRESS,
> which is filled at runtime.
> 
> But it does not work for vNVDIMM. An example of NVDIMM AML code looks
> like:
> 
> Scope (\_SB){
>     Device (NVDR) // Root device for all NVDIMM devices
>     {
>         Name (_HID, "ACPI0012")
>         Method (_STA) {...}
>         Method (_FIT) {...}
>         Method (_DSM, ...) {
>             ...
>         }
> 
>         Device (NVD0)     // 1st NVDIMM device
>         {
>             Name(_ADR, h) //where h is NFIT Device Handle for this NVDIMM
>             Method (_DSM, ...) {
>                 ...
>             }
>         }
> 
>         Device (NVD1)     // 2nd NVDIMM device
>         {
>             Name(_ADR, h) //where h is NFIT Device Handle for this NVDIMM
>             Method (_DSM, ...) {
>                 ...
>             }
>         }
> 
>         ...
>    }
> }
> 
> For each NVDIMM device defined in the NFIT, there is an ACPI namespace
> device (e.g. NVD0, NVD1) under the root NVDIMM device (NVDR). Because
> the number of vNVDIMM devices is unknown at build time, we cannot
> determine whether, and how many, NVDIMM ACPI namespace devices should
> be defined in the pre-generated SSDT.

We have dealt with the exact same issue in the past (though it concerned
WMI ACPI devices). The layout and number of those devices and their
methods was very platform dependent, as this seems to be. Thus we
generated an SSDT at runtime to reflect the underlying platform. At that
time we had to write an AML generator, but as noted earlier, SeaBIOS now
seems to have a fairly complete one. This would allow you to create the
proper number of NVDx devices and set the desired addresses at runtime.
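
For example, with an AML builder like the one QEMU carries in
hw/acpi/aml-build.c, emitting the per-DIMM devices is roughly a loop
like this (a sketch only: nfit_handle_of() is a made-up placeholder for
wherever the per-DIMM NFIT device handle comes from, and the _DSM/_FIT
methods are omitted):

  #include "qemu/osdep.h"
  #include "hw/acpi/aml-build.h"

  /* Emit one NVxx namespace device per vNVDIMM under the root device
   * (NVDR) at run time. */
  static void build_nvdimm_devices(Aml *root_dev, unsigned nr_dimms)
  {
      unsigned i;

      for (i = 0; i < nr_dimms; i++) {
          uint32_t handle = nfit_handle_of(i);   /* placeholder */
          Aml *dev = aml_device("NV%02X", i);

          aml_append(dev, aml_name_decl("_ADR", aml_int(handle)));
          aml_append(root_dev, dev);
      }
  }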

If you have to do it statically at build time, you will have to pick a
maximum number of possible NVDx devices and make some of them report
that they are not present (e.g. possibly via their _STA methods). In
that case you could extend the block at ACPI_INFO_PHYSICAL_ADDRESS or
create your own I/O or memory OpRegion that you handle at runtime. I
would personally go for the first option.

> 
> Haozhong
> 


-- 
Ross Philipson
