From mboxrd@z Thu Jan  1 00:00:00 1970
From: Haozhong Zhang <haozhong.zhang@intel.com>
Subject: Re: [RFC Design Doc] Add vNVDIMM support for Xen
Date: Wed, 17 Feb 2016 17:01:05 +0800
Message-ID: <20160217090105.GD5459@hz-desktop.sh.intel.com>
References: <20160201054414.GA25211@hz-desktop.sh.intel.com>
	<20160202191519.GB21656@char.us.oracle.com>
	<20160215084352.GB8938@hz-desktop.sh.intel.com>
	<56C1BF9302000078000D202D@prv-mh.provo.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
Content-Disposition: inline
In-Reply-To: <56C1BF9302000078000D202D@prv-mh.provo.novell.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Juergen Gross <JGross@suse.com>, Kevin Tian <kevin.tian@intel.com>, Wei Liu <wei.liu2@citrix.com>, Ian Campbell <ian.campbell@citrix.com>, Stefano Stabellini <stefano.stabellini@eu.citrix.com>, George Dunlap <George.Dunlap@eu.citrix.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Ian Jackson <ian.jackson@eu.citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Jun Nakajima <jun.nakajima@intel.com>, Xiao Guangrong <guangrong.xiao@linux.intel.com>, Keir Fraser <keir@xen.org>
List-Id: xen-devel@lists.xenproject.org

On 02/15/16 04:07, Jan Beulich wrote:
> >>> On 15.02.16 at 09:43, <haozhong.zhang@intel.com> wrote:
> > On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote:
> >> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
> >> >  three parts:
> >> >  (1) Guest clwb/clflushopt/pcommit enabling,
> >> >  (2) Memory mapping, and
> >> >  (3) Guest ACPI emulation.
> >> 
> >> 
> >> .. MCE? and vMCE?
> >> 
> > 
> > NVDIMM can generate UCR errors like normal ram. Xen may handle them in a
> > way similar to what mc_memerr_dhandler() does, with some differences in
> > the data structure and the broken page offline parts:
> > 
> > Broken NVDIMM pages should be marked as "offlined" so that Xen
> > hypervisor can refuse further requests that map them to DomU.
> > 
> > The real problem here is what data structure will be used to record
> > information of NVDIMM pages. Because the size of NVDIMM is usually much
> > larger than normal ram, using struct page_info for NVDIMM pages would
> > occupy too much memory.
> 
> I don't see how your alternative below would be less memory
> hungry: Since guests have at least partial control of their GFN
> space, a malicious guest could punch holes into the contiguous
> GFN range that you appear to be thinking about, thus causing
> arbitrary splitting of the control structure.
>

QEMU would always place vNVDIMM at GFNs above guest normal RAM and the
I/O holes. It would search that space for a contiguous range that is
large enough for the vNVDIMM devices. Is a guest able to punch holes in
such GFN space?
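
To make the placement concrete, here is a rough sketch of the kind of
search I mean. It is not QEMU code: struct gfn_range, find_free_gfns(),
NO_GFN and the floor/limit parameters are all made up for illustration,
and it assumes the occupied ranges are sorted and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>

#define NO_GFN (~0UL)   /* hypothetical "nothing fits" marker */

/* Hypothetical description of an occupied GFN range, bounds inclusive. */
struct gfn_range {
    unsigned long start, end;
};

/*
 * Return the lowest GFN >= floor at which nr contiguous GFNs are free,
 * or NO_GFN if no such run ends below limit (exclusive).
 * used[] must be sorted by start and must not overlap.
 * floor would be the first GFN above guest RAM and the I/O holes.
 */
static unsigned long find_free_gfns(const struct gfn_range *used, size_t n,
                                    unsigned long floor, unsigned long limit,
                                    unsigned long nr)
{
    unsigned long cand = floor;
    size_t i;

    assert(nr > 0 && floor < limit);

    for (i = 0; i < n; i++) {
        if (used[i].end < cand)
            continue;                /* range lies entirely below cand */
        if (used[i].start > cand && used[i].start - cand >= nr)
            break;                   /* gap in front of it is big enough */
        if (used[i].end >= limit - 1)
            return NO_GFN;           /* nothing left above this range */
        cand = used[i].end + 1;      /* retry just past this range */
    }

    return (limit - cand >= nr) ? cand : NO_GFN;
}
```

With such a search the vNVDIMM range is chosen once, as one contiguous
block, so any later fragmentation would have to come from the guest.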

> Also - see how you all of the sudden came to think of using
> struct page_info here (implying hypervisor control of these
> NVDIMM ranges)?
>
> > (4) When a MCE for host NVDIMM SPA range [start_mfn, end_mfn] happens,
> >   (a) search xen_nvdimm_pages_list for affected nvdimm_pages structures,
> >   (b) for each affected nvdimm_pages, if it belongs to a domain d and
> >       its broken field is already set, the domain d will be shutdown to
> >       prevent malicious guest accessing broken page (similarly to what
> >       offline_page() does).
> >   (c) for each affected nvdimm_pages, set its broken field to 1, and
> >   (d) for each affected nvdimm_pages, inject to domain d a vMCE that
> >       covers its GFN range if that nvdimm_pages belongs to domain d.
> 
> I don't see why you'd want to mark the entire range bad: All
> that's known to be broken is a single page. Hence this would be
> another source of splits of the proposed control structures.
>

Oh yes, I should split the whole range here. Such splits are caused by
hardware errors, so unless the host NVDIMM is terribly broken, there
should not be a large number of them.
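
A minimal sketch of the split, to show how little it costs per error.
The layout of struct nvdimm_pages here (inclusive start_mfn/end_mfn, a
broken flag, a singly linked list) is only an assumption for
illustration, np_alloc() and np_mark_broken() are hypothetical helpers,
and malloc() stands in for whatever allocator the hypervisor would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical record for one contiguous run of NVDIMM pages.
 * Both MFN bounds are inclusive. */
struct nvdimm_pages {
    unsigned long start_mfn, end_mfn;
    bool broken;
    struct nvdimm_pages *next;
};

/* Allocate a record covering [start, end]; returns NULL on failure. */
static struct nvdimm_pages *np_alloc(unsigned long start, unsigned long end,
                                     bool broken, struct nvdimm_pages *next)
{
    struct nvdimm_pages *np = malloc(sizeof(*np));

    if (np) {
        np->start_mfn = start;
        np->end_mfn = end;
        np->broken = broken;
        np->next = next;
    }
    return np;
}

/*
 * Mark the single page mfn inside np as broken, splitting np into at
 * most three records: [start, mfn-1], [mfn, mfn], [mfn+1, end].
 * Returns 0 on success, -1 on allocation failure (list left unchanged).
 */
static int np_mark_broken(struct nvdimm_pages *np, unsigned long mfn)
{
    struct nvdimm_pages *tail = NULL, *mid;

    assert(np->start_mfn <= mfn && mfn <= np->end_mfn);
    if (np->broken)
        return 0;                    /* already recorded as broken */

    if (mfn < np->end_mfn) {         /* healthy pages after the bad one */
        tail = np_alloc(mfn + 1, np->end_mfn, false, np->next);
        if (!tail)
            return -1;
    }

    if (mfn > np->start_mfn) {       /* healthy pages before the bad one */
        mid = np_alloc(mfn, mfn, true, tail ? tail : np->next);
        if (!mid) {
            free(tail);
            return -1;
        }
        np->end_mfn = mfn - 1;
        np->next = mid;
    } else {                         /* bad page is the first page of np */
        np->end_mfn = mfn;
        np->broken = true;
        if (tail)
            np->next = tail;
    }
    return 0;
}
```

Each broken page adds at most two records, so the number of records
grows with the number of hardware errors and not with the NVDIMM size.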

Haozhong