From: Haozhong Zhang <haozhong.zhang@intel.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Juergen Gross <JGross@suse.com>,
	Kevin Tian <kevin.tian@intel.com>, Wei Liu <wei.liu2@citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jun Nakajima <jun.nakajima@intel.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Keir Fraser <keir@xen.org>
Subject: Re: [RFC Design Doc] Add vNVDIMM support for Xen
Date: Thu, 10 Mar 2016 11:27:01 +0800
Message-ID: <20160310032700.GA3963@hz-desktop.sh.intel.com>
In-Reply-To: <56E05A9A02000078000DAEC4@prv-mh.provo.novell.com>

On 03/09/16 09:17, Jan Beulich wrote:
> >>> On 09.03.16 at 13:22, <haozhong.zhang@intel.com> wrote:
> > On 03/08/16 02:27, Jan Beulich wrote:
> >> >>> On 08.03.16 at 10:15, <haozhong.zhang@intel.com> wrote:
[...]
> > I should reexplain the choice of data structures and where to put them.
> > 
> > For handling MCE for NVDIMM, we need to track the following data:
> > (1) SPA ranges of host NVDIMMs (one range per pmem interleave set), which are
> >     used to check whether an MCE is for NVDIMM.
> > (2) the GFN to which an NVDIMM page is mapped, which is used to determine the
> >     address put in the vMCE.
> > (3) the domain to which an NVDIMM page is mapped, which is used to
> >     determine whether a vMCE needs to be injected and where it will be
> >     injected.
> > (4) a flag to mark whether an NVDIMM page is broken, which is used to
> >     avoid mapping broken pages to guests.
> > 
> > For granting NVDIMM pages (e.g. xen-blkback/netback),
> > (5) a reference counter is needed for each NVDIMM page
> > 
> > The above data can be organized as below:
> > 
> > * For (1) SPA ranges, we can record them in a global data structure,
> >   e.g. a list
> > 
> >     struct list_head nvdimm_iset_list;
> > 
> >     struct nvdimm_iset
> >     {
> >          uint64_t           base;  /* starting SPA of this interleave set */
> >          uint64_t           size;  /* size of this interleave set */
> >          struct nvdimm_page *pages;/* information for individual pages in this interleave set */
> >          struct list_head   list;
> >     };
> > 
> > * For (2) GFN, an intuitive place to get this information is the M2P
> >   table machine_to_phys_mapping[].  However, the address of NVDIMM is
> >   not required to be contiguous with normal ram, so, if NVDIMM starts
> >   from an address that is much higher than the end address of normal
> >   ram, it may result in an M2P table that may be too large to fit in
> >   normal ram. Therefore, we choose not to put GFNs of NVDIMM in the M2P
> >   table.
> 
> Any page that _may_ be used by a guest as normal RAM page
> must have its mach->phys translation entered in the M2P. That's
> because a r/o variant of that table is part of the hypervisor ABI
> for PV guests. Size considerations simply don't apply here - the
> table may be sparse (guests are required to deal with accesses
> potentially faulting), and the 256GB of virtual address space set
> aside for it covers all memory up to the 47-bit boundary (there's
> room for doubling this). Memory at addresses with bit 47 (or
> higher) set would need a complete overhaul of that mechanism,
> and whatever new mechanism we may pick would mean old
> guests won't be able to benefit.
>

OK, then we can use the M2P table to get the GFNs of NVDIMM pages. And ...
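Roughly, the MCE handling side would then look something like below (just
a sketch to illustrate the idea, assuming the M2P entries of NVDIMM pages
are maintained via the normal set_gpfn_from_mfn()/get_gpfn_from_mfn()
accessors; the function name is made up and error handling is omitted):

    /* Sketch: translate the machine frame reported by an MCE on NVDIMM
     * into the guest frame to be put into the vMCE. */
    static bool_t nvdimm_mce_mfn_to_gfn(unsigned long mfn, unsigned long *gfn)
    {
        *gfn = get_gpfn_from_mfn(mfn);

        /* Not mapped to any guest => no vMCE needs to be injected. */
        return *gfn != INVALID_M2P_ENTRY;
    }
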

> >   Another possible solution is to extend page_info to include the GFN for
> >   NVDIMM and use frame_table. A benefit of this solution is that the other
> >   data (3)-(5) can be obtained from page_info as well. However, due to the
> >   same reason as for machine_to_phys_mapping[] and the concern that the
> >   large number of page_info structures required for large NVDIMMs may
> >   consume lots of ram, page_info and frame_table do not seem a good place
> >   either.
> 
> For this particular item struct page_info is the wrong place
> anyway, due to what I've said above. Also, suggestions to extend
> struct page_info are quite problematic, as any such implies a
> measurable increase in the memory overhead the hypervisor
> incurs. Plus the structure right now is (with the exception of
> the bigmem configuration) carefully arranged to be a power of
> two in size.
> 
> > * At the end, we choose to introduce a new data structure for the above
> >   per-page data (2)-(5)
> > 
> >     struct nvdimm_page
> >     {
> >         struct domain *domain;    /* for (3) */
> >         uint64_t      gfn;        /* for (2) */
> >         unsigned long count_info; /* for (4) and (5), same as page_info->count_info */
> >         /* other fields if needed, e.g. lock */
> >     }
> 
> So that again leaves unaddressed the question of what you
> intend to do when a guest elects to use such a page as a page
> table. I'm afraid any attempt of yours to invent something that
> is not struct page_info will not be suitable for all possible needs.
>

... we can use struct page_info rather than a new nvdimm_page struct for
NVDIMM pages, and benefit from whatever has already been done with
page_info.

> >   On each NVDIMM interleave set, we could reserve an area to place an
> >   array of nvdimm_page structures for pages in that interleave set. In
> >   addition, the corresponding global nvdimm_iset structure is set to
> >   point to this array via its 'pages' field.
> 
> And I see no problem doing exactly that, just for an array of
> struct page_info.
>

Yes, page_info arrays.

Because the page_info structs for NVDIMM may themselves be placed in
NVDIMM, existing code that gets page_info from frame_table needs to be
adjusted for NVDIMM pages to go through the nvdimm_iset structs instead,
including __mfn_to_page, __page_to_mfn, page_to_spage, spage_to_page,
page_to_pdx, pdx_to_page, etc.
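
For example, __mfn_to_page() could first check whether the MFN falls into
a registered NVDIMM interleave set and, if so, index into that set's
page_info array instead of frame_table. A rough sketch of the idea
(nvdimm_mfn_to_page() and mfn_is_nvdimm() are made-up names, and the
nvdimm_iset 'pages' field is assumed to have become a struct page_info *
as agreed above):

    /* Sketch: look up the page_info of an NVDIMM MFN in the per-
     * interleave-set array instead of frame_table. */
    static struct page_info *nvdimm_mfn_to_page(unsigned long mfn)
    {
        struct nvdimm_iset *iset;
        paddr_t maddr = pfn_to_paddr(mfn);

        list_for_each_entry ( iset, &nvdimm_iset_list, list )
            if ( maddr >= iset->base && maddr < iset->base + iset->size )
                return iset->pages + (mfn - paddr_to_pfn(iset->base));

        return NULL; /* not an NVDIMM page */
    }

    /* __mfn_to_page(mfn) would then roughly become:
     *     mfn_is_nvdimm(mfn) ? nvdimm_mfn_to_page(mfn)
     *                        : frame_table + pfn_to_pdx(mfn)
     */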

> >   One thing I have no idea about is what percentage of ram used/reserved
> >   by Xen itself is considered acceptable. If such a limit exists and a
> >   boot parameter is given, we could let Xen choose the faster ram as
> >   long as the percentage has not been reached.
> 
> I think a conservative default would be to always place the
> control structures in NVDIMM space, unless requested to be
> put in RAM via command line option.
>

For the first version, I plan to implement the above, i.e. put all
page_info structs for NVDIMM either completely in NVDIMM or completely
in normal ram. In the future, we may introduce a software cache
mechanism that keeps the most frequently used NVDIMM page_info structs
in normal ram.
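
Following your suggestion, the default would be to place them in NVDIMM
and only use normal ram when requested on the command line. Just a rough
sketch of how that could look (the "nvdimm_pginfo_ram" option name and
map_nvdimm_reserved() are made up for illustration, not existing
interfaces):

    /* Sketch: place the page_info array of one interleave set either in
     * an area reserved within the interleave set itself (default) or,
     * if "nvdimm_pginfo_ram" is given, in normal ram. */
    static bool_t __initdata opt_nvdimm_pginfo_ram;
    boolean_param("nvdimm_pginfo_ram", opt_nvdimm_pginfo_ram);

    static int __init nvdimm_iset_init_pages(struct nvdimm_iset *iset)
    {
        unsigned long nr_pages = paddr_to_pfn(iset->size);

        if ( opt_nvdimm_pginfo_ram )
            iset->pages = xzalloc_array(struct page_info, nr_pages);
        else
            /* map_nvdimm_reserved() is hypothetical: map (and zero) the
             * reserved area of this interleave set. */
            iset->pages = map_nvdimm_reserved(iset,
                                              nr_pages * sizeof(*iset->pages));

        return iset->pages ? 0 : -ENOMEM;
    }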

Thanks,
Haozhong
