From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ian Campbell <Ian.Campbell@citrix.com>
Subject: Re: Re: xc: error: xc_machphys_mfn_list: 83 != 129
	when suspending 32GB PV DomU
Date: Mon, 14 Mar 2011 10:20:09 +0000
Message-ID: <1300098009.17339.2110.camel@zakaz.uk.xensource.com>
References: <C9A026BF.14A37%keir.xen@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <C9A026BF.14A37%keir.xen@gmail.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Keir Fraser <keir.xen@gmail.com>
Cc: Tim Deegan <Tim.Deegan@eu.citrix.com>, Keir, Fraser <keir.fraser@xen.org>, Xen Devel <xen-devel@lists.xensource.com>, Gianni Tedesco <gianni.tedesco@citrix.com>
List-Id: xen-devel@lists.xenproject.org

On Fri, 2011-03-11 at 19:21 +0000, Keir Fraser wrote:
> On 11/03/2011 18:52, "Gianni Tedesco" <gianni.tedesco@citrix.com> wrote:
> 
> > Further debugging reveals the variables are set as such:
> >  (XEN) compat_machine_to_phys_mapping = 18446606377058041856
> >  (XEN) max_page = 67272704
> >  (XEN) MACH2PHYS_COMPAT_NR_ENTRIES(current->domain) = 43515904
> >  (XEN) RDWR_COMPAT_MPT_VIRT_START = 18446606377058041856
> >  (XEN) RDWR_COMPAT_MPT_VIRT_END = 18446606378131783680
> >  (XEN) limit = 18446606377232105472, (1 << L2_PAGETABLE_SHIFT) = 2097152
> > 
> > Could it be that the compat mach-to-phys conversion table size of 1GB is
> > too small?
> 
> It is insufficient to cover all of the system's memory. The reason for the
> limit is that a 1GB M2P table is all that is reasonable to map into a 32-bit
> domain's address space while still leaving space for the guest's own
> mappings.

The compat M2P actually mapped into the guest isn't 1GB; 1GB would be
the entire kernel mapping, with no room for anything else. Also, 1GB of
M2P is enough to cover 1TB of host memory, so I don't think it's too
small at the moment. Is the limit here not MACH2PHYS_COMPAT_NR_ENTRIES?
(In the debug output above, limit == compat_machine_to_phys_mapping +
~160M.)
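
As a rough sanity check of the quoted debug output, here is a small
Python sketch. It assumes 4-byte compat M2P entries and that the two
numbers in the subject line count 2MB (1 << L2_PAGETABLE_SHIFT) chunks
of M2P; the chunks() helper is only for illustration and is not a Xen
or libxc function.

```python
# Values copied from the quoted (XEN) debug output.
COMPAT_M2P_BASE = 18446606377058041856   # compat_machine_to_phys_mapping
LIMIT = 18446606377232105472             # limit
COMPAT_NR_ENTRIES = 43515904             # MACH2PHYS_COMPAT_NR_ENTRIES(current->domain)
MAX_PAGE = 67272704                      # max_page
L2_CHUNK = 2097152                       # 1 << L2_PAGETABLE_SHIFT

ENTRY_SIZE = 4   # assumption: bytes per compat (32-bit) M2P entry


def chunks(nbytes, chunk=L2_CHUNK):
    """Number of chunk-sized pieces needed to cover nbytes (rounded up)."""
    return -(-nbytes // chunk)


window = LIMIT - COMPAT_M2P_BASE   # size of the M2P window the guest can see
print(window, "bytes =", window // 2**20, "MiB")
print("window == NR_ENTRIES * 4:", window == COMPAT_NR_ENTRIES * ENTRY_SIZE)
print("chunks visible to the guest:", chunks(window))
print("chunks needed for max_page: ", chunks(MAX_PAGE * ENTRY_SIZE))
```

Under those assumptions the window comes to 166 MiB (the "~160M"
above), and the two chunk counts are 83 and 129, which appears to match
the "83 != 129" in the subject.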

IIRC the size of the M2P which is mapped into a PAE guest is normally
capped at ~160M (the total size of the hypervisor hole for a PAE guest
running on a PAE hypervisor). Each 4-byte compat M2P entry describes one
4K page, so 160M of M2P covers 160G of host address space, which would
explain why this is seen on a 256GB host but not a 128GB one.
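
The size-to-coverage relationship can be sketched as follows. This is
only illustrative arithmetic, assuming 4K pages and 4-byte compat M2P
entries; host_bytes_covered() is a made-up helper name, not anything in
Xen.

```python
PAGE_SIZE = 4096   # assumption: 4KiB pages
ENTRY_SIZE = 4     # assumption: 4-byte compat M2P entries

MiB, GiB = 2**20, 2**30


def host_bytes_covered(m2p_bytes):
    """Host address space described by an M2P window of m2p_bytes."""
    return (m2p_bytes // ENTRY_SIZE) * PAGE_SIZE


for m2p in (160 * MiB, 256 * MiB, 1024 * MiB):
    covered = host_bytes_covered(m2p)
    print(m2p // MiB, "MiB of M2P ->", covered // GiB, "GiB of host")
```

Each MiB of M2P covers one GiB of host, so a 160M window stops short of
a 256GB host, while 256M would just reach it.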

The limit on the size of the M2P is adjustable, in particular for dom0 I
think it would be reasonable to allow it to expand to, e.g. 256M,
without too much cause for concern.

Obviously this hole eats into the 1GB kernel mapping, so you don't want
it to grow too much bigger, and in the long run something better would
be needed. But this would probably allow you to support 256GB without
too much trouble in the short term, other than slightly reducing the
amount of lowmem the system sees (which might be an issue if you've
chosen dom0_mem on that basis...)

The lower limit of the hole is set by the kernel in its
XEN_ELFNOTE_HV_START_LOW ELF note (set in
arch/x86/kernel/head_32-xen.S), which is picked up in
xen/arch/x86/domain_build.c:construct_dom0(). NB: This might be the
first time this functionality has been used in anger to increase the M2P
space (I think it is actively used to shrink it on hosts with <160G).

Another alternative, which would allow large hosts without needing to
expand the dom0 M2P, would be to provide interfaces that allow the tools
to map specific portions of the host M2P, so the tools can build
themselves a mapcache style thing. The M2P space which needs to be
accessed to migrate an individual guest is likely to be smaller than
the total host RAM, so even using 256M-512M of guest user-mode address
space (allowing for 256GB-512GB of host address space) would likely let
you map the bits you need without excessive churn (aka performance hit)
in the mapping. A given userspace process has 3G of address space to
play with, so it can take the hit of increasing the M2P mapcache size
far more easily than the kernel can. Hrm, maybe you don't even need a
map cache thing -- just a way to allow a userspace process to map more
M2P than the kernel can... (which might be as simple as removing the
limit clamp based on MACH2PHYS_COMPAT_NR_ENTRIES in the compat layer?)
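
To make the mapcache idea concrete, here is a toy sketch in Python
rather than a proposal for the real tools. Everything in it is
hypothetical: no such interface exists today, map_chunk/unmap_chunk
stand in for whatever call would map a portion of the host M2P, and the
2MB chunk size and 4-byte entries are assumptions.

```python
from collections import OrderedDict

CHUNK_BYTES = 2 * 2**20        # assumption: map the M2P in 2MiB chunks
ENTRY_SIZE = 4                 # assumption: 4-byte compat M2P entries
ENTRIES_PER_CHUNK = CHUNK_BYTES // ENTRY_SIZE


class M2PMapCache:
    """Toy LRU cache of mapped M2P chunks (hypothetical interface)."""

    def __init__(self, map_chunk, unmap_chunk, max_chunks):
        self.map_chunk = map_chunk        # hypothetical: idx -> indexable chunk
        self.unmap_chunk = unmap_chunk    # hypothetical: (idx, chunk) -> None
        self.max_chunks = max_chunks      # caps the address space used
        self.chunks = OrderedDict()       # chunk index -> chunk, oldest first

    def lookup(self, mfn):
        """Return the M2P entry for mfn, mapping its chunk on demand."""
        idx, off = divmod(mfn, ENTRIES_PER_CHUNK)
        if idx in self.chunks:
            self.chunks.move_to_end(idx)  # mark as most recently used
        else:
            if len(self.chunks) >= self.max_chunks:
                # Evict the least recently used chunk to stay under the cap.
                old_idx, old = self.chunks.popitem(last=False)
                self.unmap_chunk(old_idx, old)
            self.chunks[idx] = self.map_chunk(idx)
        return self.chunks[idx][off]


# Demo with a fake M2P in which pfn == mfn, so no hypervisor is needed.
unmapped = []
cache = M2PMapCache(
    map_chunk=lambda idx: range(idx * ENTRIES_PER_CHUNK,
                                (idx + 1) * ENTRIES_PER_CHUNK),
    unmap_chunk=lambda idx, chunk: unmapped.append(idx),
    max_chunks=2,
)
for mfn in (5, ENTRIES_PER_CHUNK + 7, 2 * ENTRIES_PER_CHUNK + 9):
    assert cache.lookup(mfn) == mfn
print(unmapped)   # chunk 0 was evicted to make room for chunk 2
```

With 2MB chunks, a 256M cache would be max_chunks=128. How much churn
that causes depends on how scattered a guest's MFNs are, which this
sketch does not try to model.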

Ian.