Date: Thu, 13 Jun 2013 14:44:40 +0100
From: Stefano Stabellini
To: George Dunlap
Cc: Tim Deegan, Yongjie Ren, yanqiangjun@huawei.com, Keir Fraser,
    Ian Campbell, hanweidong@huawei.com, Xudong Hao, Stefano Stabellini,
    luonengjun@huawei.com, qemu-devel@nongnu.org, wangzhenguo@huawei.com,
    xiaowei.yang@huawei.com, arei.gonglei@huawei.com, Jan Beulich,
    Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel@lists.xensource.com
Subject: Re: [Qemu-devel] [Xen-devel] [BUG 1747] Guest couldn't find bootable
    device with memory more than 3600M
In-Reply-To: <51B847E3.5010604@eu.citrix.com>
References: <51B1FF50.90406@eu.citrix.com>
    <403610A45A2B5242BD291EDAE8B37D3010E56731@SHSMSX102.ccr.corp.intel.com>
    <51B83E7A02000078000DD6E9@nat28.tlf.novell.com>
    <51B847E3.5010604@eu.citrix.com>

On Wed, 12 Jun 2013, George Dunlap wrote:
> On 12/06/13 08:25, Jan Beulich wrote:
> > On 11.06.13 at 19:26, Stefano Stabellini wrote:
> > > I went through the code that maps the PCI MMIO regions in hvmloader
> > > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it
> > > already maps the PCI region to high memory if the PCI BAR is 64-bit
> > > and the MMIO region is larger than 512MB.
> > >
> > > Maybe we could just relax this condition and map the device memory to
> > > high memory no matter the size of the MMIO region, if the PCI BAR is
> > > 64-bit?
> >
> > I can only recommend not to: For one, guests not using PAE or PSE-36
> > can't map such space at all (and older OSes may not properly deal with
> > 64-bit BARs at all). And then one would generally expect this
> > allocation to be done top down (to minimize the risk of running into
> > RAM), and doing so is going to present further risks of
> > incompatibilities with guest OSes (Linux, for example, learned only in
> > 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in 3.10-rc5
> > ioremap_pte_range(), while using "u64 pfn", passes the PFN to
> > pfn_pte(), the respective parameter of which is "unsigned long").
> >
> > I think this ought to be done in an iterative process - if all MMIO
> > regions together don't fit below 4G, the biggest one should be moved
> > up beyond 4G first, followed by the next biggest one, etc.
>
> First of all, the proposal to move the PCI BAR up to the 64-bit range is
> a temporary work-around. It should only be done if a device doesn't fit
> in the current MMIO range.
>
> We have four options here:
> 1. Don't do anything.
> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
>    don't fit.
> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
>    is memory).
> 4. Add a mechanism to tell qemu that memory is being relocated.
>
> Number 4 is definitely the right answer long-term, but we just don't
> have time to do that before the 4.3 release. We're not sure yet if #3 is
> possible; even if it is, it may have unpredictable knock-on effects.
>
> Doing #2, it is true that many guests will be unable to access the
> device because of 32-bit limitations. However, in #1, *no* guests will
> be able to access the device. At least in #2, *many* guests will be able
> to do so. In any case, apparently #2 is what KVM does, so having the
> limitation on guests is not without precedent. It's also likely to be a
> somewhat tested configuration (unlike #3, for example).

I would avoid #3, because I don't think it is a good idea to rely on that
behaviour. I would also avoid #4 because, having seen QEMU's code, it
wouldn't be easy and it is certainly not doable in time for 4.3.

So we are left to play with the PCI MMIO region size and location in
hvmloader.

I agree with Jan that we shouldn't unconditionally relocate all the
devices to the region above 4G; I meant to say that we should relocate
only the ones that don't fit. And we shouldn't try to dynamically
increase the PCI hole below 4G, because clearly that doesn't work.

However, we could still increase the default size of the PCI hole below
4G, moving its start from 0xf0000000 to 0xe0000000. How do we know that
is safe? Because in the current configuration hvmloader *already*
increases the PCI hole size by decreasing the start address every time a
device doesn't fit. So it's already common for hvmloader to set
pci_mem_start to 0xe0000000; you just need to assign a device with BARs
big enough.

My proposed solution, sketched in code below, is:

- set 0xe0000000 as the default PCI hole start for everybody, including
  qemu-xen-traditional
- move above 4G everything that doesn't fit below and supports 64-bit
  BARs
- print an error if a device doesn't fit and doesn't support 64-bit BARs
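
To make the last two points concrete, here is a minimal sketch of the
allocation policy in C. It is illustrative only, not actual hvmloader
code: bar_t, place_bars() and the sample devices are invented for the
example, and the real pci_setup() also has to program the BARs and build
the guest memory map.

  #include <stdint.h>
  #include <stdio.h>

  #define PCI_MEM_START 0xe0000000ULL  /* proposed default hole start */
  #define PCI_MEM_END   0xfc000000ULL  /* hole ends below the IOAPIC/BIOS area */

  typedef struct {
      const char *name;
      uint64_t size;   /* BAR size, assumed to be a power of two */
      int is_64bit;    /* device exposes a 64-bit memory BAR */
  } bar_t;

  /* Place each BAR in the low hole if it fits; otherwise relocate it above
   * 4G when the BAR is 64-bit capable, otherwise report an error. */
  static void place_bars(bar_t *bars, int n)
  {
      uint64_t low = PCI_MEM_START, high = 1ULL << 32;

      for (int i = 0; i < n; i++) {
          uint64_t base = (low + bars[i].size - 1) & ~(bars[i].size - 1);

          if (base + bars[i].size <= PCI_MEM_END) {
              printf("%s -> 0x%llx (below 4G)\n",
                     bars[i].name, (unsigned long long)base);
              low = base + bars[i].size;
          } else if (bars[i].is_64bit) {
              base = (high + bars[i].size - 1) & ~(bars[i].size - 1);
              printf("%s -> 0x%llx (above 4G)\n",
                     bars[i].name, (unsigned long long)base);
              high = base + bars[i].size;
          } else {
              printf("error: %s (0x%llx bytes) does not fit and has no "
                     "64-bit BAR\n",
                     bars[i].name, (unsigned long long)bars[i].size);
          }
      }
  }

  int main(void)
  {
      bar_t bars[] = {
          { "emulated VGA",    0x02000000, 0 },  /* 32MB, 32-bit BAR */
          { "assigned device", 0x40000000, 1 },  /* 1GB, 64-bit BAR  */
      };

      place_bars(bars, 2);
      return 0;
  }

With these two example devices the VGA BAR lands at 0xe0000000, while the
1GB BAR, which cannot fit in the remaining low hole, is relocated to
0x100000000.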
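
For comparison, the iterative scheme Jan describes above (move the
biggest region beyond 4G first, then the next biggest, and so on, until
what remains fits) could be expressed roughly as follows, reusing bar_t
and the constants from the previous sketch. Again this is only a sketch
of the selection logic; the fit test on the summed sizes ignores
alignment padding, which a real allocator would have to account for.

  /* Mark place_high[i] = 1 for every BAR that should move above 4G:
   * keep evicting the largest 64-bit-capable BAR until the rest fits
   * in the low hole. */
  static void split_low_high(const bar_t *bars, int n, int *place_high)
  {
      uint64_t total = 0;

      for (int i = 0; i < n; i++) {
          place_high[i] = 0;
          total += bars[i].size;
      }

      while (total > PCI_MEM_END - PCI_MEM_START) {
          int victim = -1;

          for (int i = 0; i < n; i++)
              if (!place_high[i] && bars[i].is_64bit &&
                  (victim < 0 || bars[i].size > bars[victim].size))
                  victim = i;

          if (victim < 0)
              break;                  /* nothing movable is left below 4G */

          place_high[victim] = 1;     /* biggest remaining one goes above 4G */
          total -= bars[victim].size;
      }
  }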