From: Haozhong Zhang <haozhong.zhang@intel.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Juergen Gross <JGross@suse.com>,
	Kevin Tian <kevin.tian@intel.com>, Wei Liu <wei.liu2@citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jun Nakajima <jun.nakajima@intel.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	Keir Fraser <keir@xen.org>
Subject: Re: [RFC Design Doc] Add vNVDIMM support for Xen
Date: Wed, 9 Mar 2016 20:22:59 +0800	[thread overview]
Message-ID: <20160309122259.GA6310@hz-desktop.sh.intel.com> (raw)
In-Reply-To: <56DEA91202000078000DA449@prv-mh.provo.novell.com>

On 03/08/16 02:27, Jan Beulich wrote:
> >>> On 08.03.16 at 10:15, <haozhong.zhang@intel.com> wrote:
> > More thoughts on reserving NVDIMM space for per-page structures
> > 
> > Currently, a per-page struct for managing mapping of NVDIMM pages may
> > include following fields:
> > 
> > struct nvdimm_page
> > {
> >     uint64_t mfn;        /* MFN of SPA of this NVDIMM page */
> >     uint64_t gfn;        /* GFN where this NVDIMM page is mapped */
> >     domid_t  domain_id;  /* which domain is this NVDIMM page mapped to */
> >     int      is_broken;  /* Is this NVDIMM page broken? (for MCE) */
> > }
> > 
> > Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
> > nvdimm_page structures would occupy 12 GB of space, which is hard to
> > fit in the normal ram of a small-memory host. However, for smaller
> > NVDIMMs and/or hosts with large ram, those structures may still be able
> > to fit in the normal ram. In the latter circumstance, nvdimm_page
> > structures are stored in the normal ram, so they can be accessed more
> > quickly.
> 
> Not sure how you came to the above structure - it's the first time
> I see it, yet figuring out what information it needs to hold is what
> this design process should be about. For example, I don't see why
> it would need to duplicate M2P / P2M information. Nor do I see why
> per-page data needs to hold the address of a page (struct
> page_info also doesn't). And whether storing a domain ID (rather
> than a pointer to struct domain, as in struct page_info) is the
> correct thing is also to be determined (rather than just stated).
> 
> Otoh you make no provisions at all for any kind of ref counting.
> What if a guest wants to put page tables into NVDIMM space?
> 
> Since all of your calculations are based upon that fixed assumption
> on the structure layout, I'm afraid they're not very meaningful
> without first settling on what data needs tracking in the first place.
> 
> Jan
> 

Let me re-explain the choice of data structures and where to put them.

For handling MCE for NVDIMM, we need to track the following data:
(1) SPA ranges of host NVDIMMs (one range per pmem interleave set), which
    are used to check whether an MCE hits NVDIMM.
(2) the GFN to which an NVDIMM page is mapped, which is used to determine
    the address put in the vMCE.
(3) the domain to which an NVDIMM page is mapped, which is used to
    determine whether a vMCE needs to be injected and where it will be
    injected.
(4) a flag marking whether an NVDIMM page is broken, which is used to
    avoid mapping broken pages to guests.

For granting NVDIMM pages (e.g. to xen-blkback/netback),
(5) a reference counter is needed for each NVDIMM page.

The above data can be organized as below:

* For (1), the SPA ranges can be recorded in a global data structure,
  e.g. a list:

    struct list_head nvdimm_iset_list;

    struct nvdimm_iset
    {
        uint64_t           base;   /* starting SPA of this interleave set */
        uint64_t           size;   /* size of this interleave set */
        struct nvdimm_page *pages; /* per-page information for this interleave set */
        struct list_head   list;
    };
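
  As a sketch of how (1) would be used in the MCE handler (assuming
  nvdimm_iset_list is populated when NVDIMMs are discovered at boot; the
  function name is made up), checking whether a faulting SPA belongs to
  an NVDIMM is a simple list walk:

    /*
     * Return the interleave set covering @spa, or NULL if @spa is not
     * on any NVDIMM (i.e. the MCE is for normal ram or mmio).
     */
    static struct nvdimm_iset *nvdimm_iset_lookup(uint64_t spa)
    {
        struct nvdimm_iset *iset;

        list_for_each_entry ( iset, &nvdimm_iset_list, list )
            if ( spa >= iset->base && spa < iset->base + iset->size )
                return iset;

        return NULL;
    }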

* For (2), an intuitive place to get the GFN is the M2P table
  machine_to_phys_mapping[].  However, the SPA of an NVDIMM is not
  required to be contiguous with normal ram, so if the NVDIMM starts at
  an address much higher than the end of normal ram, the resulting M2P
  table may be too large to fit in normal ram. Therefore, we choose not
  to put the GFNs of NVDIMM pages in the M2P table.
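
  To make the size concern concrete (illustrative numbers only): with
  4 KB frames and 8-byte M2P entries, an NVDIMM whose SPA range starts
  at 1 TB on a host whose normal ram ends at 16 GB would require
  machine_to_phys_mapping[] to span 2^28 frames just to reach the
  NVDIMM's first page, i.e. (1 TB / 4 KB) * 8 B = 2 GB of table, almost
  all of it covering the unused hole.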

  Another possible solution is to extend struct page_info to include the
  GFN for NVDIMM pages and to use frame_table. A benefit of this
  solution is that the other data (3)-(5) can be obtained from page_info
  as well. However, for the same reason as with
  machine_to_phys_mapping[], and because the large number of page_info
  structures required for large NVDIMMs may consume a lot of ram,
  page_info and frame_table do not seem to be a good place either.

* In the end, we choose to introduce a new data structure for the above
  per-page data (2)-(5):

    struct nvdimm_page
    {
        struct domain *domain;    /* for (3) */
        uint64_t      gfn;        /* for (2) */
        unsigned long count_info; /* for (4) and (5), same as page_info->count_info */
        /* other fields if needed, e.g. a lock */
    };

  (Note that the MFN is not actually needed: it can be derived from the
  structure's position in the per-interleave-set array described below.)

  On each NVDIMM interleave set, we could reserve an area to place an
  array of nvdimm_page structures for the pages in that interleave set.
  The corresponding global nvdimm_iset structure then points to this
  array via its 'pages' field.
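
  Combining the two, the per-page lookup then reduces to index
  arithmetic; a sketch, reusing the hypothetical nvdimm_iset_lookup()
  above:

    /*
     * Return the nvdimm_page structure for @spa, or NULL if @spa is
     * not on any NVDIMM.  PAGE_SHIFT is the usual 4 KB page shift.
     */
    static struct nvdimm_page *nvdimm_page_lookup(uint64_t spa)
    {
        struct nvdimm_iset *iset = nvdimm_iset_lookup(spa);

        return iset ? &iset->pages[(spa - iset->base) >> PAGE_SHIFT]
                    : NULL;
    }

  On an MCE hitting @spa, the handler would read ->domain and ->gfn from
  the returned structure to decide whether and where to inject a vMCE,
  and set the broken flag in ->count_info so the page is not mapped to a
  guest again.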

* One disadvantage of the above solution is that accessing NVDIMM is
  slower than accessing normal ram, so usage scenarios that require
  frequent accesses to nvdimm_page structures may suffer poor
  performance. Therefore, we may add a boot parameter that allows users
  to place the above nvdimm_page arrays in normal ram if their hosts
  have plenty of ram.
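
  The parameter itself would be trivial; a sketch, with a made-up name,
  using Xen's boolean_param():

    /*
     * Hypothetical option: when set, allocate the nvdimm_page arrays
     * from normal ram instead of reserving space on the NVDIMM.
     */
    static bool_t __read_mostly opt_nvdimm_pages_in_ram;
    boolean_param("nvdimm_pages_in_ram", opt_nvdimm_pages_in_ram);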

  One thing I am not sure about is what percentage of ram used/reserved
  by Xen itself is considered acceptable. If such a limit exists and the
  boot parameter is given, we could let Xen choose the faster ram as
  long as the limit has not been reached.


Any comments?

Thanks,
Haozhong
