[BUG 1747]Guest could't find bootable device with memory more than 3600M

All of lore.kernel.org
 help / color / mirror / Atom feed

* [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-07  9:20 Xu, YongweiX
  2013-06-07 12:15 ` Stefano Stabellini
  0 siblings, 1 reply; 82+ messages in thread
From: Xu, YongweiX @ 2013-06-07  9:20 UTC (permalink / raw)
  To: stefano.stabellini; +Cc: Ren, Yongjie, xen-devel, Liu, SongtaoX

Hi Stefano Stabellini,

I found a new bug "Guest could't find bootable device with memory more than 3600M".
Attach the link:  http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1857

When booting up guest(include Linux&Windows guest) with memory more than 3600M,the guest will show "No bootable device" and could not boot up. Then the guest will continuous reboot automatically and never found bootable device. But with guest memory less than or equal to 3600M, boot up successfully.

I found this is the qemu(qemu-upstream-unstable) issue, the latest version (commit:4597594c61add43725bd207bb498268a058f9cfb) caused this issue: 
xen: start PCI hole at 0xe0000000 (same as pc_init1 and qemu-xen-traditional) 
author      Stefano Stabellini <stefano.stabellini@eu.citrix.com>    
           Wed, 5 Jun 2013 19:36:10 +0800 (11:36 +0000)
committer   Stefano Stabellini <stefano.stabellini@eu.citrix.com>    
           Wed, 5 Jun 2013 19:36:10 +0800 (11:36 +0000)
commit     4597594c61add43725bd207bb498268a058f9cfb
tree        d6831f75f4a7d4ad7a94bd4e33584ac358808ee6    
parent      25adf763933faddcc6a62bf55e1c52909a9bafbb

Can you debug this issue soon? Thanks.

Best Regards
Yongwei(Terrence)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-07  9:20 [BUG 1747]Guest could't find bootable device with memory more than 3600M Xu, YongweiX
@ 2013-06-07 12:15 ` Stefano Stabellini
  2013-06-07 15:42   ` George Dunlap
  0 siblings, 1 reply; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-07 12:15 UTC (permalink / raw)
  To: Xu, YongweiX
  Cc: Ren, Yongjie, xen-devel, Keir Fraser, stefano.stabellini,
	george.dunlap, Tim Deegan, Liu, SongtaoX

On Fri, 7 Jun 2013, Xu, YongweiX wrote:
> Hi Stefano Stabellini,
> 
> I found a new bug "Guest could't find bootable device with memory more than 3600M".
> Attach the link:  http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1857
> 
> When booting up guest(include Linux&Windows guest) with memory more than 3600M,the guest will show "No bootable device" and could not boot up. Then the guest will continuous reboot automatically and never found bootable device. But with guest memory less than or equal to 3600M, boot up successfully.
> 
> I found this is the qemu(qemu-upstream-unstable) issue, the latest version (commit:4597594c61add43725bd207bb498268a058f9cfb) caused this issue: 
> xen: start PCI hole at 0xe0000000 (same as pc_init1 and qemu-xen-traditional) 
> author      Stefano Stabellini <stefano.stabellini@eu.citrix.com>    
>            Wed, 5 Jun 2013 19:36:10 +0800 (11:36 +0000)
> committer   Stefano Stabellini <stefano.stabellini@eu.citrix.com>    
>            Wed, 5 Jun 2013 19:36:10 +0800 (11:36 +0000)
> commit     4597594c61add43725bd207bb498268a058f9cfb
> tree        d6831f75f4a7d4ad7a94bd4e33584ac358808ee6    
> parent      25adf763933faddcc6a62bf55e1c52909a9bafbb
> 
> Can you debug this issue soon? Thanks.

Thank you very much for testing and bisecting the issue!

The problem is that by default hvmloader sets PCI_MEM_START to
0xf0000000, then dynamically expands it backwards if needed.
It works with qemu-xen-traditional, because it doesn't do any checking
when registering memory regions.
It doesn't work with qemu-xen because it needs to know where the ram and
where the pci hole are when building the emulated machine.
We can't have hvmloader increasing the pci hole in a way that overlaps
with the guest ram (or where qemu thinks that the guest ram is).
It is also worth noting thet seabios has its own view of where the pci
hole starts and at the moment is set to 0xe0000000 at build time.

I can see two ways of fixing this now:

1) have everybody agree about the pci hole starting at 0xe0000000

- set PCI_MEM_START in hvmloader to 0xe0000000 to match qemu's view of
the pci hole and have a bigger pci hole by default

- remove the loop in hvmloader to increase the pci hole

- set HVM_BELOW_4G_RAM_END to 0xe0000000, so that low_mem_pgend is set
accordingly in tools/libxc/xc_hvm_build_x86.c:build_hvm_info


2) have everybody agree about the pci hole starting at 0xf0000000

- revert 4597594c61add43725bd207bb498268a058f9cfb in qemu-xen

- set BUILD_PCIMEM_START to 0xf0000000 in seabios

- remove the loop in hvmloader to increase the pci hole


Given that in both cases we need to remove the loop to increase the pci
hole in hvmloader I would rather go with 1) and have a bigger pci hole
by default to avoid problems with the pci hole being too small.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-07 12:15 ` Stefano Stabellini
@ 2013-06-07 15:42   ` George Dunlap
  2013-06-07 15:56     ` Stefano Stabellini
  0 siblings, 1 reply; 82+ messages in thread
From: George Dunlap @ 2013-06-07 15:42 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Ren, Yongjie, xen-devel, Keir Fraser, Tim Deegan, Xu, YongweiX,
	Liu, SongtaoX

On 06/07/2013 01:15 PM, Stefano Stabellini wrote:
> On Fri, 7 Jun 2013, Xu, YongweiX wrote:
>> Hi Stefano Stabellini,
>>
>> I found a new bug "Guest could't find bootable device with memory more than 3600M".
>> Attach the link:  http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1857
>>
>> When booting up guest(include Linux&Windows guest) with memory more than 3600M,the guest will show "No bootable device" and could not boot up. Then the guest will continuous reboot automatically and never found bootable device. But with guest memory less than or equal to 3600M, boot up successfully.
>>
>> I found this is the qemu(qemu-upstream-unstable) issue, the latest version (commit:4597594c61add43725bd207bb498268a058f9cfb) caused this issue:
>> xen: start PCI hole at 0xe0000000 (same as pc_init1 and qemu-xen-traditional)
>> author      Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>>             Wed, 5 Jun 2013 19:36:10 +0800 (11:36 +0000)
>> committer   Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>>             Wed, 5 Jun 2013 19:36:10 +0800 (11:36 +0000)
>> commit     4597594c61add43725bd207bb498268a058f9cfb
>> tree        d6831f75f4a7d4ad7a94bd4e33584ac358808ee6
>> parent      25adf763933faddcc6a62bf55e1c52909a9bafbb
>>
>> Can you debug this issue soon? Thanks.
>
> Thank you very much for testing and bisecting the issue!
>
> The problem is that by default hvmloader sets PCI_MEM_START to
> 0xf0000000, then dynamically expands it backwards if needed.
> It works with qemu-xen-traditional, because it doesn't do any checking
> when registering memory regions.
> It doesn't work with qemu-xen because it needs to know where the ram and
> where the pci hole are when building the emulated machine.
> We can't have hvmloader increasing the pci hole in a way that overlaps
> with the guest ram (or where qemu thinks that the guest ram is).
> It is also worth noting thet seabios has its own view of where the pci
> hole starts and at the moment is set to 0xe0000000 at build time.
>
> I can see two ways of fixing this now:
>
> 1) have everybody agree about the pci hole starting at 0xe0000000
>
> - set PCI_MEM_START in hvmloader to 0xe0000000 to match qemu's view of
> the pci hole and have a bigger pci hole by default
>
> - remove the loop in hvmloader to increase the pci hole
>
> - set HVM_BELOW_4G_RAM_END to 0xe0000000, so that low_mem_pgend is set
> accordingly in tools/libxc/xc_hvm_build_x86.c:build_hvm_info
>
>
> 2) have everybody agree about the pci hole starting at 0xf0000000
>
> - revert 4597594c61add43725bd207bb498268a058f9cfb in qemu-xen
>
> - set BUILD_PCIMEM_START to 0xf0000000 in seabios
>
> - remove the loop in hvmloader to increase the pci hole
>
>
> Given that in both cases we need to remove the loop to increase the pci
> hole in hvmloader I would rather go with 1) and have a bigger pci hole
> by default to avoid problems with the pci hole being too small.
>

Did we ever figure out what actually happens on real hardware?  Or on 
KVM?  I have a hard time believing that real hardware is hard-coded -- 
who would be the person to ask about that, do you reckon?

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-07 15:42   ` George Dunlap
@ 2013-06-07 15:56     ` Stefano Stabellini
  2013-06-08  7:27       ` Hao, Xudong
  0 siblings, 1 reply; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-07 15:56 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ren, Yongjie, xen-devel, Keir Fraser, Stefano Stabellini,
	Tim Deegan, Paolo Bonzini, Xu, YongweiX, Liu, SongtaoX

On Fri, 7 Jun 2013, George Dunlap wrote:
> On 06/07/2013 01:15 PM, Stefano Stabellini wrote:
> > On Fri, 7 Jun 2013, Xu, YongweiX wrote:
> > > Hi Stefano Stabellini,
> > > 
> > > I found a new bug "Guest could't find bootable device with memory more
> > > than 3600M".
> > > Attach the link:
> > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1857
> > > 
> > > When booting up guest(include Linux&Windows guest) with memory more than
> > > 3600M,the guest will show "No bootable device" and could not boot up. Then
> > > the guest will continuous reboot automatically and never found bootable
> > > device. But with guest memory less than or equal to 3600M, boot up
> > > successfully.
> > > 
> > > I found this is the qemu(qemu-upstream-unstable) issue, the latest version
> > > (commit:4597594c61add43725bd207bb498268a058f9cfb) caused this issue:
> > > xen: start PCI hole at 0xe0000000 (same as pc_init1 and
> > > qemu-xen-traditional)
> > > author      Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > >             Wed, 5 Jun 2013 19:36:10 +0800 (11:36 +0000)
> > > committer   Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > >             Wed, 5 Jun 2013 19:36:10 +0800 (11:36 +0000)
> > > commit     4597594c61add43725bd207bb498268a058f9cfb
> > > tree        d6831f75f4a7d4ad7a94bd4e33584ac358808ee6
> > > parent      25adf763933faddcc6a62bf55e1c52909a9bafbb
> > > 
> > > Can you debug this issue soon? Thanks.
> > 
> > Thank you very much for testing and bisecting the issue!
> > 
> > The problem is that by default hvmloader sets PCI_MEM_START to
> > 0xf0000000, then dynamically expands it backwards if needed.
> > It works with qemu-xen-traditional, because it doesn't do any checking
> > when registering memory regions.
> > It doesn't work with qemu-xen because it needs to know where the ram and
> > where the pci hole are when building the emulated machine.
> > We can't have hvmloader increasing the pci hole in a way that overlaps
> > with the guest ram (or where qemu thinks that the guest ram is).
> > It is also worth noting thet seabios has its own view of where the pci
> > hole starts and at the moment is set to 0xe0000000 at build time.
> > 
> > I can see two ways of fixing this now:
> > 
> > 1) have everybody agree about the pci hole starting at 0xe0000000
> > 
> > - set PCI_MEM_START in hvmloader to 0xe0000000 to match qemu's view of
> > the pci hole and have a bigger pci hole by default
> > 
> > - remove the loop in hvmloader to increase the pci hole
> > 
> > - set HVM_BELOW_4G_RAM_END to 0xe0000000, so that low_mem_pgend is set
> > accordingly in tools/libxc/xc_hvm_build_x86.c:build_hvm_info
> > 
> > 
> > 2) have everybody agree about the pci hole starting at 0xf0000000
> > 
> > - revert 4597594c61add43725bd207bb498268a058f9cfb in qemu-xen
> > 
> > - set BUILD_PCIMEM_START to 0xf0000000 in seabios
> > 
> > - remove the loop in hvmloader to increase the pci hole
> > 
> > 
> > Given that in both cases we need to remove the loop to increase the pci
> > hole in hvmloader I would rather go with 1) and have a bigger pci hole
> > by default to avoid problems with the pci hole being too small.
> > 
> 
> Did we ever figure out what actually happens on real hardware?  Or on KVM?  I
> have a hard time believing that real hardware is hard-coded -- who would be
> the person to ask about that, do you reckon?

For KVM I am CC'ing Paolo, for real hardware I'll let the Intel guys
speak.
To be clear the question is:

What happens on real hardware when the BIOS boots and finds out that the
PCI hole is too small to contain all the MMIO regions of the PCI devices?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-07 15:56     ` Stefano Stabellini
@ 2013-06-08  7:27       ` Hao, Xudong
  2013-06-10 11:49         ` George Dunlap
  0 siblings, 1 reply; 82+ messages in thread
From: Hao, Xudong @ 2013-06-08  7:27 UTC (permalink / raw)
  To: Stefano Stabellini, George Dunlap
  Cc: Ren, Yongjie, xen-devel, Keir Fraser, Tim Deegan, Paolo Bonzini,
	Xu, YongweiX, Liu, SongtaoX

> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org
> [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Stefano Stabellini
> Sent: Friday, June 07, 2013 11:57 PM
> To: George Dunlap
> Cc: Ren, Yongjie; xen-devel@lists.xensource.com; Keir Fraser; Stefano Stabellini;
> Tim Deegan; Paolo Bonzini; Xu, YongweiX; Liu, SongtaoX
> Subject: Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with
> memory more than 3600M
> 
> On Fri, 7 Jun 2013, George Dunlap wrote:
> > On 06/07/2013 01:15 PM, Stefano Stabellini wrote:
> > > On Fri, 7 Jun 2013, Xu, YongweiX wrote:
> > > > Hi Stefano Stabellini,
> > > >
> > > > I found a new bug "Guest could't find bootable device with memory more
> > > > than 3600M".
> > > > Attach the link:
> > > > http://bugzilla-archived.xenproject.org//bugzilla/show_bug.cgi?id=1857
> > > >
> > > > When booting up guest(include Linux&Windows guest) with memory more
> than
> > > > 3600M,the guest will show "No bootable device" and could not boot up.
> Then
> > > > the guest will continuous reboot automatically and never found bootable
> > > > device. But with guest memory less than or equal to 3600M, boot up
> > > > successfully.
> > > >
> > > > I found this is the qemu(qemu-upstream-unstable) issue, the latest
> version
> > > > (commit:4597594c61add43725bd207bb498268a058f9cfb) caused this
> issue:
> > > > xen: start PCI hole at 0xe0000000 (same as pc_init1 and
> > > > qemu-xen-traditional)
> > > > author      Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > > >             Wed, 5 Jun 2013 19:36:10 +0800 (11:36 +0000)
> > > > committer   Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > > >             Wed, 5 Jun 2013 19:36:10 +0800 (11:36 +0000)
> > > > commit     4597594c61add43725bd207bb498268a058f9cfb
> > > > tree        d6831f75f4a7d4ad7a94bd4e33584ac358808ee6
> > > > parent      25adf763933faddcc6a62bf55e1c52909a9bafbb
> > > >
> > > > Can you debug this issue soon? Thanks.
> > >
> > > Thank you very much for testing and bisecting the issue!
> > >
> > > The problem is that by default hvmloader sets PCI_MEM_START to
> > > 0xf0000000, then dynamically expands it backwards if needed.
> > > It works with qemu-xen-traditional, because it doesn't do any checking
> > > when registering memory regions.
> > > It doesn't work with qemu-xen because it needs to know where the ram and
> > > where the pci hole are when building the emulated machine.
> > > We can't have hvmloader increasing the pci hole in a way that overlaps
> > > with the guest ram (or where qemu thinks that the guest ram is).
> > > It is also worth noting thet seabios has its own view of where the pci
> > > hole starts and at the moment is set to 0xe0000000 at build time.
> > >
> > > I can see two ways of fixing this now:
> > >
> > > 1) have everybody agree about the pci hole starting at 0xe0000000
> > >
> > > - set PCI_MEM_START in hvmloader to 0xe0000000 to match qemu's view
> of
> > > the pci hole and have a bigger pci hole by default
> > >
> > > - remove the loop in hvmloader to increase the pci hole
> > >
> > > - set HVM_BELOW_4G_RAM_END to 0xe0000000, so that
> low_mem_pgend is set
> > > accordingly in tools/libxc/xc_hvm_build_x86.c:build_hvm_info
> > >
> > >
> > > 2) have everybody agree about the pci hole starting at 0xf0000000
> > >
> > > - revert 4597594c61add43725bd207bb498268a058f9cfb in qemu-xen
> > >
> > > - set BUILD_PCIMEM_START to 0xf0000000 in seabios
> > >
> > > - remove the loop in hvmloader to increase the pci hole
> > >
> > >
> > > Given that in both cases we need to remove the loop to increase the pci
> > > hole in hvmloader I would rather go with 1) and have a bigger pci hole
> > > by default to avoid problems with the pci hole being too small.
> > >
> >
> > Did we ever figure out what actually happens on real hardware?  Or on KVM?
> I
> > have a hard time believing that real hardware is hard-coded -- who would be
> > the person to ask about that, do you reckon?
> 
> For KVM I am CC'ing Paolo, for real hardware I'll let the Intel guys
> speak.
> To be clear the question is:
> 
> What happens on real hardware when the BIOS boots and finds out that the
> PCI hole is too small to contain all the MMIO regions of the PCI devices?
> 

I do not know what does BIOS do details, but I think it's in 4G above if pci hole is too small. I saw native OS program the pci hole to a very high addr 0x3c0c00000000 if the pci device has large bar size.

For Xen hvmloader, it already policy that is: if PCI hole is too small, another MMIO region will be created starting from the high_mem_pgend.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-08  7:27       ` Hao, Xudong
@ 2013-06-10 11:49         ` George Dunlap
  2013-06-11 17:26             ` Stefano Stabellini
  0 siblings, 1 reply; 82+ messages in thread
From: George Dunlap @ 2013-06-10 11:49 UTC (permalink / raw)
  To: Hao, Xudong
  Cc: Ren, Yongjie, xen-devel, Keir Fraser, Stefano Stabellini,
	Tim Deegan, Xu, YongweiX, Paolo Bonzini, Liu, SongtaoX

On Sat, Jun 8, 2013 at 8:27 AM, Hao, Xudong <xudong.hao@intel.com> wrote:
>>
>> What happens on real hardware when the BIOS boots and finds out that the
>> PCI hole is too small to contain all the MMIO regions of the PCI devices?
>>
>
> I do not know what does BIOS do details, but I think it's in 4G above if pci hole is too small. I saw native OS program the pci hole to a very high addr 0x3c0c00000000 if the pci device has large bar size.
>
> For Xen hvmloader, it already policy that is: if PCI hole is too small, another MMIO region will be created starting from the high_mem_pgend.

Yes, and that, as I understand it, is the problem: that this change of
the MMIO regions is not communicated to qemu.

But it seems like this same problem would occur on real hardware --
i.e., you add a new video card with 2GiB of video ram, and now you
need a 2.5 GiB MMIO hole. It seems like the BIOS would be the logical
place to sort it out.  But I'm not familiar enough with SeaBIOS / qemu
to really know for sure.

 -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-10 11:49         ` George Dunlap
@ 2013-06-11 17:26             ` Stefano Stabellini
  0 siblings, 0 replies; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-11 17:26 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ren, Yongjie, xen-devel, Keir Fraser, Ian Campbell,
	Stefano Stabellini, Tim Deegan, yanqiangjun, Hao, Xudong,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Xu,
	YongweiX, luonengjun, Paolo Bonzini, Liu, SongtaoX, hanweidong

On Mon, 10 Jun 2013, George Dunlap wrote:
> On Sat, Jun 8, 2013 at 8:27 AM, Hao, Xudong <xudong.hao@intel.com> wrote:
> >>
> >> What happens on real hardware when the BIOS boots and finds out that the
> >> PCI hole is too small to contain all the MMIO regions of the PCI devices?
> >>
> >
> > I do not know what does BIOS do details, but I think it's in 4G above if pci hole is too small. I saw native OS program the pci hole to a very high addr 0x3c0c00000000 if the pci device has large bar size.
> >
> > For Xen hvmloader, it already policy that is: if PCI hole is too small, another MMIO region will be created starting from the high_mem_pgend.
> 
> Yes, and that, as I understand it, is the problem: that this change of
> the MMIO regions is not communicated to qemu.
> 
> But it seems like this same problem would occur on real hardware --
> i.e., you add a new video card with 2GiB of video ram, and now you
> need a 2.5 GiB MMIO hole. It seems like the BIOS would be the logical
> place to sort it out.  But I'm not familiar enough with SeaBIOS / qemu
> to really know for sure.

I went through the code that maps the PCI MMIO regions in hvmloader
(tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
region is larger than 512MB.

Maybe we could just relax this condition and map the device memory to
high memory no matter the size of the MMIO region if the PCI bar is
64-bit?
If the Nvidia Quadro that originally Gonglei was trying to assign to the
guest is 64-bit capable, then it would fix the issue in the simplest
possible way.

Are there actually any PCI devices that people might want to assign to
guests that don't have 64-bit bars and don't fit in
0xe0000000-0xfc000000?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-11 17:26             ` Stefano Stabellini
  0 siblings, 0 replies; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-11 17:26 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ren, Yongjie, xen-devel, Keir Fraser, Ian Campbell,
	Stefano Stabellini, Tim Deegan, yanqiangjun, Hao, Xudong,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Xu,
	YongweiX, luonengjun, Paolo Bonzini, Liu, SongtaoX, hanweidong

On Mon, 10 Jun 2013, George Dunlap wrote:
> On Sat, Jun 8, 2013 at 8:27 AM, Hao, Xudong <xudong.hao@intel.com> wrote:
> >>
> >> What happens on real hardware when the BIOS boots and finds out that the
> >> PCI hole is too small to contain all the MMIO regions of the PCI devices?
> >>
> >
> > I do not know what does BIOS do details, but I think it's in 4G above if pci hole is too small. I saw native OS program the pci hole to a very high addr 0x3c0c00000000 if the pci device has large bar size.
> >
> > For Xen hvmloader, it already policy that is: if PCI hole is too small, another MMIO region will be created starting from the high_mem_pgend.
> 
> Yes, and that, as I understand it, is the problem: that this change of
> the MMIO regions is not communicated to qemu.
> 
> But it seems like this same problem would occur on real hardware --
> i.e., you add a new video card with 2GiB of video ram, and now you
> need a 2.5 GiB MMIO hole. It seems like the BIOS would be the logical
> place to sort it out.  But I'm not familiar enough with SeaBIOS / qemu
> to really know for sure.

I went through the code that maps the PCI MMIO regions in hvmloader
(tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
region is larger than 512MB.

Maybe we could just relax this condition and map the device memory to
high memory no matter the size of the MMIO region if the PCI bar is
64-bit?
If the Nvidia Quadro that originally Gonglei was trying to assign to the
guest is 64-bit capable, then it would fix the issue in the simplest
possible way.

Are there actually any PCI devices that people might want to assign to
guests that don't have 64-bit bars and don't fit in
0xe0000000-0xfc000000?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-11 17:26             ` Stefano Stabellini
@ 2013-06-12  7:25               ` Jan Beulich
  -1 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12  7:25 UTC (permalink / raw)
  To: George Dunlap, Stefano Stabellini
  Cc: Tim Deegan, Yongjie Ren, xen-devel, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, yanqiangjun, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Paolo Bonzini,
	YongweiX Xu, SongtaoX Liu

>>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> I went through the code that maps the PCI MMIO regions in hvmloader
> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> region is larger than 512MB.
> 
> Maybe we could just relax this condition and map the device memory to
> high memory no matter the size of the MMIO region if the PCI bar is
> 64-bit?

I can only recommend not to: For one, guests not using PAE or
PSE-36 can't map such space at all (and older OSes may not
properly deal with 64-bit BARs at all). And then one would generally
expect this allocation to be done top down (to minimize risk of
running into RAM), and doing so is going to present further risks of
incompatibilities with guest OSes (Linux for example learned only in
2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
PFN to pfn_pte(), the respective parameter of which is
"unsigned long").

I think this ought to be done in an iterative process - if all MMIO
regions together don't fit below 4G, the biggest one should be
moved up beyond 4G first, followed by the next to biggest one
etc.

And, just like many BIOSes have, there ought to be a guest
(config) controlled option to shrink the RAM portion below 4G
allowing more MMIO blocks to fit.

Finally we shouldn't forget the option of not doing any assignment
at all in the BIOS, allowing/forcing the OS to use suitable address
ranges. Of course any OS is permitted to re-assign resources, but
I think they will frequently prefer to avoid re-assignment if already
done by the BIOS.

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12  7:25               ` Jan Beulich
  0 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12  7:25 UTC (permalink / raw)
  To: George Dunlap, Stefano Stabellini
  Cc: Tim Deegan, Yongjie Ren, xen-devel, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, yanqiangjun, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Paolo Bonzini,
	YongweiX Xu, SongtaoX Liu

>>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> I went through the code that maps the PCI MMIO regions in hvmloader
> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> region is larger than 512MB.
> 
> Maybe we could just relax this condition and map the device memory to
> high memory no matter the size of the MMIO region if the PCI bar is
> 64-bit?

I can only recommend not to: For one, guests not using PAE or
PSE-36 can't map such space at all (and older OSes may not
properly deal with 64-bit BARs at all). And then one would generally
expect this allocation to be done top down (to minimize risk of
running into RAM), and doing so is going to present further risks of
incompatibilities with guest OSes (Linux for example learned only in
2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
PFN to pfn_pte(), the respective parameter of which is
"unsigned long").

I think this ought to be done in an iterative process - if all MMIO
regions together don't fit below 4G, the biggest one should be
moved up beyond 4G first, followed by the next to biggest one
etc.

And, just like many BIOSes have, there ought to be a guest
(config) controlled option to shrink the RAM portion below 4G
allowing more MMIO blocks to fit.

Finally we shouldn't forget the option of not doing any assignment
at all in the BIOS, allowing/forcing the OS to use suitable address
ranges. Of course any OS is permitted to re-assign resources, but
I think they will frequently prefer to avoid re-assignment if already
done by the BIOS.

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12  7:25               ` Jan Beulich
@ 2013-06-12  8:31                 ` Ian Campbell
  -1 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-12  8:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Wed, 2013-06-12 at 08:25 +0100, Jan Beulich wrote:
> >>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> > I went through the code that maps the PCI MMIO regions in hvmloader
> > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> > maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> > region is larger than 512MB.
> > 
> > Maybe we could just relax this condition and map the device memory to
> > high memory no matter the size of the MMIO region if the PCI bar is
> > 64-bit?
> 
> I can only recommend not to: For one, guests not using PAE or
> PSE-36 can't map such space at all (and older OSes may not
> properly deal with 64-bit BARs at all). And then one would generally
> expect this allocation to be done top down (to minimize risk of
> running into RAM), and doing so is going to present further risks of
> incompatibilities with guest OSes (Linux for example learned only in
> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> PFN to pfn_pte(), the respective parameter of which is
> "unsigned long").
> 
> I think this ought to be done in an iterative process - if all MMIO
> regions together don't fit below 4G, the biggest one should be
> moved up beyond 4G first, followed by the next to biggest one
> etc.
> 
> And, just like many BIOSes have, there ought to be a guest
> (config) controlled option to shrink the RAM portion below 4G
> allowing more MMIO blocks to fit.
> 
> Finally we shouldn't forget the option of not doing any assignment
> at all in the BIOS, allowing/forcing the OS to use suitable address
> ranges. Of course any OS is permitted to re-assign resources, but
> I think they will frequently prefer to avoid re-assignment if already
> done by the BIOS.

Is "bios=assign-busses" on the guest command line suitable as a
workaround then? Or possibly "bios=realloc"

Ian.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12  8:31                 ` Ian Campbell
  0 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-12  8:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Wed, 2013-06-12 at 08:25 +0100, Jan Beulich wrote:
> >>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> > I went through the code that maps the PCI MMIO regions in hvmloader
> > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> > maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> > region is larger than 512MB.
> > 
> > Maybe we could just relax this condition and map the device memory to
> > high memory no matter the size of the MMIO region if the PCI bar is
> > 64-bit?
> 
> I can only recommend not to: For one, guests not using PAE or
> PSE-36 can't map such space at all (and older OSes may not
> properly deal with 64-bit BARs at all). And then one would generally
> expect this allocation to be done top down (to minimize risk of
> running into RAM), and doing so is going to present further risks of
> incompatibilities with guest OSes (Linux for example learned only in
> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> PFN to pfn_pte(), the respective parameter of which is
> "unsigned long").
> 
> I think this ought to be done in an iterative process - if all MMIO
> regions together don't fit below 4G, the biggest one should be
> moved up beyond 4G first, followed by the next to biggest one
> etc.
> 
> And, just like many BIOSes have, there ought to be a guest
> (config) controlled option to shrink the RAM portion below 4G
> allowing more MMIO blocks to fit.
> 
> Finally we shouldn't forget the option of not doing any assignment
> at all in the BIOS, allowing/forcing the OS to use suitable address
> ranges. Of course any OS is permitted to re-assign resources, but
> I think they will frequently prefer to avoid re-assignment if already
> done by the BIOS.

Is "bios=assign-busses" on the guest command line suitable as a
workaround then? Or possibly "bios=realloc"

Ian.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12  8:31                 ` Ian Campbell
@ 2013-06-12  9:02                   ` Jan Beulich
  -1 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12  9:02 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

>>> On 12.06.13 at 10:31, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2013-06-12 at 08:25 +0100, Jan Beulich wrote:
>> >>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> 
> wrote:
>> > I went through the code that maps the PCI MMIO regions in hvmloader
>> > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>> > maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>> > region is larger than 512MB.
>> > 
>> > Maybe we could just relax this condition and map the device memory to
>> > high memory no matter the size of the MMIO region if the PCI bar is
>> > 64-bit?
>> 
>> I can only recommend not to: For one, guests not using PAE or
>> PSE-36 can't map such space at all (and older OSes may not
>> properly deal with 64-bit BARs at all). And then one would generally
>> expect this allocation to be done top down (to minimize risk of
>> running into RAM), and doing so is going to present further risks of
>> incompatibilities with guest OSes (Linux for example learned only in
>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>> PFN to pfn_pte(), the respective parameter of which is
>> "unsigned long").
>> 
>> I think this ought to be done in an iterative process - if all MMIO
>> regions together don't fit below 4G, the biggest one should be
>> moved up beyond 4G first, followed by the next to biggest one
>> etc.
>> 
>> And, just like many BIOSes have, there ought to be a guest
>> (config) controlled option to shrink the RAM portion below 4G
>> allowing more MMIO blocks to fit.
>> 
>> Finally we shouldn't forget the option of not doing any assignment
>> at all in the BIOS, allowing/forcing the OS to use suitable address
>> ranges. Of course any OS is permitted to re-assign resources, but
>> I think they will frequently prefer to avoid re-assignment if already
>> done by the BIOS.
> 
> Is "bios=assign-busses" on the guest command line suitable as a
> workaround then? Or possibly "bios=realloc"

Which command line? Getting passed to hvmloader? In that case,
doing the assignment is the default, so an inverse option would be
needed. And not doing any assignment would be wrong too - all
devices involved in booting need (some of) their resources
assigned. That's particularly a potential problem since the graphics
card is the most likely candidate for wanting an extremely large
area, and I'm not sure whether booting with an assigned graphics
card would use that card instead of the emulated one.

As to "realloc" - that can hardly be meant as an option to
hvmloader, so I'm really unsure what command line you think
about here.

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12  9:02                   ` Jan Beulich
  0 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12  9:02 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

>>> On 12.06.13 at 10:31, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2013-06-12 at 08:25 +0100, Jan Beulich wrote:
>> >>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> 
> wrote:
>> > I went through the code that maps the PCI MMIO regions in hvmloader
>> > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>> > maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>> > region is larger than 512MB.
>> > 
>> > Maybe we could just relax this condition and map the device memory to
>> > high memory no matter the size of the MMIO region if the PCI bar is
>> > 64-bit?
>> 
>> I can only recommend not to: For one, guests not using PAE or
>> PSE-36 can't map such space at all (and older OSes may not
>> properly deal with 64-bit BARs at all). And then one would generally
>> expect this allocation to be done top down (to minimize risk of
>> running into RAM), and doing so is going to present further risks of
>> incompatibilities with guest OSes (Linux for example learned only in
>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>> PFN to pfn_pte(), the respective parameter of which is
>> "unsigned long").
>> 
>> I think this ought to be done in an iterative process - if all MMIO
>> regions together don't fit below 4G, the biggest one should be
>> moved up beyond 4G first, followed by the next to biggest one
>> etc.
>> 
>> And, just like many BIOSes have, there ought to be a guest
>> (config) controlled option to shrink the RAM portion below 4G
>> allowing more MMIO blocks to fit.
>> 
>> Finally we shouldn't forget the option of not doing any assignment
>> at all in the BIOS, allowing/forcing the OS to use suitable address
>> ranges. Of course any OS is permitted to re-assign resources, but
>> I think they will frequently prefer to avoid re-assignment if already
>> done by the BIOS.
> 
> Is "bios=assign-busses" on the guest command line suitable as a
> workaround then? Or possibly "bios=realloc"

Which command line? Getting passed to hvmloader? In that case,
doing the assignment is the default, so an inverse option would be
needed. And not doing any assignment would be wrong too - all
devices involved in booting need (some of) their resources
assigned. That's particularly a potential problem since the graphics
card is the most likely candidate for wanting an extremely large
area, and I'm not sure whether booting with an assigned graphics
card would use that card instead of the emulated one.

As to "realloc" - that can hardly be meant as an option to
hvmloader, so I'm really unsure what command line you think
about here.

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12  9:02                   ` Jan Beulich
@ 2013-06-12  9:22                     ` Ian Campbell
  -1 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-12  9:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Wed, 2013-06-12 at 10:02 +0100, Jan Beulich wrote:
> >>> On 12.06.13 at 10:31, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2013-06-12 at 08:25 +0100, Jan Beulich wrote:
> >> >>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> 
> > wrote:
> >> > I went through the code that maps the PCI MMIO regions in hvmloader
> >> > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> >> > maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> >> > region is larger than 512MB.
> >> > 
> >> > Maybe we could just relax this condition and map the device memory to
> >> > high memory no matter the size of the MMIO region if the PCI bar is
> >> > 64-bit?
> >> 
> >> I can only recommend not to: For one, guests not using PAE or
> >> PSE-36 can't map such space at all (and older OSes may not
> >> properly deal with 64-bit BARs at all). And then one would generally
> >> expect this allocation to be done top down (to minimize risk of
> >> running into RAM), and doing so is going to present further risks of
> >> incompatibilities with guest OSes (Linux for example learned only in
> >> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> >> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> >> PFN to pfn_pte(), the respective parameter of which is
> >> "unsigned long").
> >> 
> >> I think this ought to be done in an iterative process - if all MMIO
> >> regions together don't fit below 4G, the biggest one should be
> >> moved up beyond 4G first, followed by the next to biggest one
> >> etc.
> >> 
> >> And, just like many BIOSes have, there ought to be a guest
> >> (config) controlled option to shrink the RAM portion below 4G
> >> allowing more MMIO blocks to fit.
> >> 
> >> Finally we shouldn't forget the option of not doing any assignment
> >> at all in the BIOS, allowing/forcing the OS to use suitable address
> >> ranges. Of course any OS is permitted to re-assign resources, but
> >> I think they will frequently prefer to avoid re-assignment if already
> >> done by the BIOS.
> > 
> > Is "bios=assign-busses" on the guest command line suitable as a
> > workaround then? Or possibly "bios=realloc"
> 
> Which command line? Getting passed to hvmloader?

I meant the guest kernel command line.

>  In that case,
> doing the assignment is the default, so an inverse option would be
> needed. And not doing any assignment would be wrong too - all
> devices involved in booting need (some of) their resources
> assigned. That's particularly a potential problem since the graphics
> card is the most likely candidate for wanting an extremely large
> area, and I'm not sure whether booting with an assigned graphics
> card would use that card instead of the emulated one.
> 
> As to "realloc" - that can hardly be meant as an option to
> hvmloader, so I'm really unsure what command line you think
> about here.
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12  9:22                     ` Ian Campbell
  0 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-12  9:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Wed, 2013-06-12 at 10:02 +0100, Jan Beulich wrote:
> >>> On 12.06.13 at 10:31, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2013-06-12 at 08:25 +0100, Jan Beulich wrote:
> >> >>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> 
> > wrote:
> >> > I went through the code that maps the PCI MMIO regions in hvmloader
> >> > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> >> > maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> >> > region is larger than 512MB.
> >> > 
> >> > Maybe we could just relax this condition and map the device memory to
> >> > high memory no matter the size of the MMIO region if the PCI bar is
> >> > 64-bit?
> >> 
> >> I can only recommend not to: For one, guests not using PAE or
> >> PSE-36 can't map such space at all (and older OSes may not
> >> properly deal with 64-bit BARs at all). And then one would generally
> >> expect this allocation to be done top down (to minimize risk of
> >> running into RAM), and doing so is going to present further risks of
> >> incompatibilities with guest OSes (Linux for example learned only in
> >> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> >> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> >> PFN to pfn_pte(), the respective parameter of which is
> >> "unsigned long").
> >> 
> >> I think this ought to be done in an iterative process - if all MMIO
> >> regions together don't fit below 4G, the biggest one should be
> >> moved up beyond 4G first, followed by the next to biggest one
> >> etc.
> >> 
> >> And, just like many BIOSes have, there ought to be a guest
> >> (config) controlled option to shrink the RAM portion below 4G
> >> allowing more MMIO blocks to fit.
> >> 
> >> Finally we shouldn't forget the option of not doing any assignment
> >> at all in the BIOS, allowing/forcing the OS to use suitable address
> >> ranges. Of course any OS is permitted to re-assign resources, but
> >> I think they will frequently prefer to avoid re-assignment if already
> >> done by the BIOS.
> > 
> > Is "bios=assign-busses" on the guest command line suitable as a
> > workaround then? Or possibly "bios=realloc"
> 
> Which command line? Getting passed to hvmloader?

I meant the guest kernel command line.

>  In that case,
> doing the assignment is the default, so an inverse option would be
> needed. And not doing any assignment would be wrong too - all
> devices involved in booting need (some of) their resources
> assigned. That's particularly a potential problem since the graphics
> card is the most likely candidate for wanting an extremely large
> area, and I'm not sure whether booting with an assigned graphics
> card would use that card instead of the emulated one.
> 
> As to "realloc" - that can hardly be meant as an option to
> hvmloader, so I'm really unsure what command line you think
> about here.
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12  9:22                     ` Ian Campbell
@ 2013-06-12 10:07                       ` Jan Beulich
  -1 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12 10:07 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

>>> On 12.06.13 at 11:22, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2013-06-12 at 10:02 +0100, Jan Beulich wrote:
>> >>> On 12.06.13 at 10:31, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> > On Wed, 2013-06-12 at 08:25 +0100, Jan Beulich wrote:
>> >> >>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> 
>> > wrote:
>> >> > I went through the code that maps the PCI MMIO regions in hvmloader
>> >> > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>> >> > maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>> >> > region is larger than 512MB.
>> >> > 
>> >> > Maybe we could just relax this condition and map the device memory to
>> >> > high memory no matter the size of the MMIO region if the PCI bar is
>> >> > 64-bit?
>> >> 
>> >> I can only recommend not to: For one, guests not using PAE or
>> >> PSE-36 can't map such space at all (and older OSes may not
>> >> properly deal with 64-bit BARs at all). And then one would generally
>> >> expect this allocation to be done top down (to minimize risk of
>> >> running into RAM), and doing so is going to present further risks of
>> >> incompatibilities with guest OSes (Linux for example learned only in
>> >> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>> >> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>> >> PFN to pfn_pte(), the respective parameter of which is
>> >> "unsigned long").
>> >> 
>> >> I think this ought to be done in an iterative process - if all MMIO
>> >> regions together don't fit below 4G, the biggest one should be
>> >> moved up beyond 4G first, followed by the next to biggest one
>> >> etc.
>> >> 
>> >> And, just like many BIOSes have, there ought to be a guest
>> >> (config) controlled option to shrink the RAM portion below 4G
>> >> allowing more MMIO blocks to fit.
>> >> 
>> >> Finally we shouldn't forget the option of not doing any assignment
>> >> at all in the BIOS, allowing/forcing the OS to use suitable address
>> >> ranges. Of course any OS is permitted to re-assign resources, but
>> >> I think they will frequently prefer to avoid re-assignment if already
>> >> done by the BIOS.
>> > 
>> > Is "bios=assign-busses" on the guest command line suitable as a
>> > workaround then? Or possibly "bios=realloc"
>> 
>> Which command line? Getting passed to hvmloader?
> 
> I meant the guest kernel command line.

As there's no accessible guest kernel command for HVM guests,
did you mean to require the guest admin to put something on the
command line manually?

And then - this might cover Linux, but what about other OSes,
namely Windows? Oh, and for Linux you confused me by using
"bios=" instead of "pci="... And "pci=realloc" only exists as of 3.0.

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 10:07                       ` Jan Beulich
  0 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12 10:07 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

>>> On 12.06.13 at 11:22, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2013-06-12 at 10:02 +0100, Jan Beulich wrote:
>> >>> On 12.06.13 at 10:31, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> > On Wed, 2013-06-12 at 08:25 +0100, Jan Beulich wrote:
>> >> >>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> 
>> > wrote:
>> >> > I went through the code that maps the PCI MMIO regions in hvmloader
>> >> > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>> >> > maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>> >> > region is larger than 512MB.
>> >> > 
>> >> > Maybe we could just relax this condition and map the device memory to
>> >> > high memory no matter the size of the MMIO region if the PCI bar is
>> >> > 64-bit?
>> >> 
>> >> I can only recommend not to: For one, guests not using PAE or
>> >> PSE-36 can't map such space at all (and older OSes may not
>> >> properly deal with 64-bit BARs at all). And then one would generally
>> >> expect this allocation to be done top down (to minimize risk of
>> >> running into RAM), and doing so is going to present further risks of
>> >> incompatibilities with guest OSes (Linux for example learned only in
>> >> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>> >> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>> >> PFN to pfn_pte(), the respective parameter of which is
>> >> "unsigned long").
>> >> 
>> >> I think this ought to be done in an iterative process - if all MMIO
>> >> regions together don't fit below 4G, the biggest one should be
>> >> moved up beyond 4G first, followed by the next to biggest one
>> >> etc.
>> >> 
>> >> And, just like many BIOSes have, there ought to be a guest
>> >> (config) controlled option to shrink the RAM portion below 4G
>> >> allowing more MMIO blocks to fit.
>> >> 
>> >> Finally we shouldn't forget the option of not doing any assignment
>> >> at all in the BIOS, allowing/forcing the OS to use suitable address
>> >> ranges. Of course any OS is permitted to re-assign resources, but
>> >> I think they will frequently prefer to avoid re-assignment if already
>> >> done by the BIOS.
>> > 
>> > Is "bios=assign-busses" on the guest command line suitable as a
>> > workaround then? Or possibly "bios=realloc"
>> 
>> Which command line? Getting passed to hvmloader?
> 
> I meant the guest kernel command line.

As there's no accessible guest kernel command for HVM guests,
did you mean to require the guest admin to put something on the
command line manually?

And then - this might cover Linux, but what about other OSes,
namely Windows? Oh, and for Linux you confused me by using
"bios=" instead of "pci="... And "pci=realloc" only exists as of 3.0.

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12 10:07                       ` Jan Beulich
@ 2013-06-12 11:23                         ` Ian Campbell
  -1 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-12 11:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Wed, 2013-06-12 at 11:07 +0100, Jan Beulich wrote:
> >>> On 12.06.13 at 11:22, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2013-06-12 at 10:02 +0100, Jan Beulich wrote:
> >> >>> On 12.06.13 at 10:31, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> >> > On Wed, 2013-06-12 at 08:25 +0100, Jan Beulich wrote:
> >> >> >>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> 
> >> > wrote:
> >> >> > I went through the code that maps the PCI MMIO regions in hvmloader
> >> >> > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> >> >> > maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> >> >> > region is larger than 512MB.
> >> >> > 
> >> >> > Maybe we could just relax this condition and map the device memory to
> >> >> > high memory no matter the size of the MMIO region if the PCI bar is
> >> >> > 64-bit?
> >> >> 
> >> >> I can only recommend not to: For one, guests not using PAE or
> >> >> PSE-36 can't map such space at all (and older OSes may not
> >> >> properly deal with 64-bit BARs at all). And then one would generally
> >> >> expect this allocation to be done top down (to minimize risk of
> >> >> running into RAM), and doing so is going to present further risks of
> >> >> incompatibilities with guest OSes (Linux for example learned only in
> >> >> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> >> >> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> >> >> PFN to pfn_pte(), the respective parameter of which is
> >> >> "unsigned long").
> >> >> 
> >> >> I think this ought to be done in an iterative process - if all MMIO
> >> >> regions together don't fit below 4G, the biggest one should be
> >> >> moved up beyond 4G first, followed by the next to biggest one
> >> >> etc.
> >> >> 
> >> >> And, just like many BIOSes have, there ought to be a guest
> >> >> (config) controlled option to shrink the RAM portion below 4G
> >> >> allowing more MMIO blocks to fit.
> >> >> 
> >> >> Finally we shouldn't forget the option of not doing any assignment
> >> >> at all in the BIOS, allowing/forcing the OS to use suitable address
> >> >> ranges. Of course any OS is permitted to re-assign resources, but
> >> >> I think they will frequently prefer to avoid re-assignment if already
> >> >> done by the BIOS.
> >> > 
> >> > Is "bios=assign-busses" on the guest command line suitable as a
> >> > workaround then? Or possibly "bios=realloc"
> >> 
> >> Which command line? Getting passed to hvmloader?
> > 
> > I meant the guest kernel command line.
> 
> As there's no accessible guest kernel command for HVM guests,
> did you mean to require the guest admin to put something on the
> command line manually?

Yes, as a workaround for this shortcoming of 4.3, not as a long term
solution. It's only people using passthrough with certain devices with
large BARs who will ever trip over this, right?

> And then - this might cover Linux, but what about other OSes,
> namely Windows?

True, I'm not sure if/how this can be done.

The only reference I could find was at
http://windows.microsoft.com/en-id/windows7/using-system-configuration
which says under "Advanced boot options":
        PCI Lock. Prevents Windows from reallocating I/O and IRQ
        resources on the PCI bus. The I/O and memory resources set by
        the BIOS are preserved.

Which seems to suggest the default is to reallocate, but I don't know.

>  Oh, and for Linux you confused me by using
> "bios=" instead of "pci="... And "pci=realloc" only exists as of 3.0.

Oops, sorry.

Ian.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 11:23                         ` Ian Campbell
  0 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-12 11:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Wed, 2013-06-12 at 11:07 +0100, Jan Beulich wrote:
> >>> On 12.06.13 at 11:22, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2013-06-12 at 10:02 +0100, Jan Beulich wrote:
> >> >>> On 12.06.13 at 10:31, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> >> > On Wed, 2013-06-12 at 08:25 +0100, Jan Beulich wrote:
> >> >> >>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> 
> >> > wrote:
> >> >> > I went through the code that maps the PCI MMIO regions in hvmloader
> >> >> > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> >> >> > maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> >> >> > region is larger than 512MB.
> >> >> > 
> >> >> > Maybe we could just relax this condition and map the device memory to
> >> >> > high memory no matter the size of the MMIO region if the PCI bar is
> >> >> > 64-bit?
> >> >> 
> >> >> I can only recommend not to: For one, guests not using PAE or
> >> >> PSE-36 can't map such space at all (and older OSes may not
> >> >> properly deal with 64-bit BARs at all). And then one would generally
> >> >> expect this allocation to be done top down (to minimize risk of
> >> >> running into RAM), and doing so is going to present further risks of
> >> >> incompatibilities with guest OSes (Linux for example learned only in
> >> >> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> >> >> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> >> >> PFN to pfn_pte(), the respective parameter of which is
> >> >> "unsigned long").
> >> >> 
> >> >> I think this ought to be done in an iterative process - if all MMIO
> >> >> regions together don't fit below 4G, the biggest one should be
> >> >> moved up beyond 4G first, followed by the next to biggest one
> >> >> etc.
> >> >> 
> >> >> And, just like many BIOSes have, there ought to be a guest
> >> >> (config) controlled option to shrink the RAM portion below 4G
> >> >> allowing more MMIO blocks to fit.
> >> >> 
> >> >> Finally we shouldn't forget the option of not doing any assignment
> >> >> at all in the BIOS, allowing/forcing the OS to use suitable address
> >> >> ranges. Of course any OS is permitted to re-assign resources, but
> >> >> I think they will frequently prefer to avoid re-assignment if already
> >> >> done by the BIOS.
> >> > 
> >> > Is "bios=assign-busses" on the guest command line suitable as a
> >> > workaround then? Or possibly "bios=realloc"
> >> 
> >> Which command line? Getting passed to hvmloader?
> > 
> > I meant the guest kernel command line.
> 
> As there's no accessible guest kernel command for HVM guests,
> did you mean to require the guest admin to put something on the
> command line manually?

Yes, as a workaround for this shortcoming of 4.3, not as a long term
solution. It's only people using passthrough with certain devices with
large BARs who will ever trip over this, right?

> And then - this might cover Linux, but what about other OSes,
> namely Windows?

True, I'm not sure if/how this can be done.

The only reference I could find was at
http://windows.microsoft.com/en-id/windows7/using-system-configuration
which says under "Advanced boot options":
        PCI Lock. Prevents Windows from reallocating I/O and IRQ
        resources on the PCI bus. The I/O and memory resources set by
        the BIOS are preserved.

Which seems to suggest the default is to reallocate, but I don't know.

>  Oh, and for Linux you confused me by using
> "bios=" instead of "pci="... And "pci=realloc" only exists as of 3.0.

Oops, sorry.

Ian.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12 11:23                         ` Ian Campbell
@ 2013-06-12 11:56                           ` Jan Beulich
  -1 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12 11:56 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

>>> On 12.06.13 at 13:23, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2013-06-12 at 11:07 +0100, Jan Beulich wrote:
>> As there's no accessible guest kernel command for HVM guests,
>> did you mean to require the guest admin to put something on the
>> command line manually?
> 
> Yes, as a workaround for this shortcoming of 4.3, not as a long term
> solution. It's only people using passthrough with certain devices with
> large BARs who will ever trip over this, right?

Yes.

>> And then - this might cover Linux, but what about other OSes,
>> namely Windows?
> 
> True, I'm not sure if/how this can be done.
> 
> The only reference I could find was at
> http://windows.microsoft.com/en-id/windows7/using-system-configuration 
> which says under "Advanced boot options":
>         PCI Lock. Prevents Windows from reallocating I/O and IRQ
>         resources on the PCI bus. The I/O and memory resources set by
>         the BIOS are preserved.
> 
> Which seems to suggest the default is to reallocate, but I don't know.

Without having known of this option, I would take it as "may", not
"will" reallocate.

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 11:56                           ` Jan Beulich
  0 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12 11:56 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

>>> On 12.06.13 at 13:23, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2013-06-12 at 11:07 +0100, Jan Beulich wrote:
>> As there's no accessible guest kernel command for HVM guests,
>> did you mean to require the guest admin to put something on the
>> command line manually?
> 
> Yes, as a workaround for this shortcoming of 4.3, not as a long term
> solution. It's only people using passthrough with certain devices with
> large BARs who will ever trip over this, right?

Yes.

>> And then - this might cover Linux, but what about other OSes,
>> namely Windows?
> 
> True, I'm not sure if/how this can be done.
> 
> The only reference I could find was at
> http://windows.microsoft.com/en-id/windows7/using-system-configuration 
> which says under "Advanced boot options":
>         PCI Lock. Prevents Windows from reallocating I/O and IRQ
>         resources on the PCI bus. The I/O and memory resources set by
>         the BIOS are preserved.
> 
> Which seems to suggest the default is to reallocate, but I don't know.

Without having known of this option, I would take it as "may", not
"will" reallocate.

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12 11:56                           ` Jan Beulich
@ 2013-06-12 11:59                             ` Ian Campbell
  -1 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-12 11:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Wed, 2013-06-12 at 12:56 +0100, Jan Beulich wrote:
> >>> On 12.06.13 at 13:23, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2013-06-12 at 11:07 +0100, Jan Beulich wrote:

> >> And then - this might cover Linux, but what about other OSes,
> >> namely Windows?
> > 
> > True, I'm not sure if/how this can be done.
> > 
> > The only reference I could find was at
> > http://windows.microsoft.com/en-id/windows7/using-system-configuration 
> > which says under "Advanced boot options":
> >         PCI Lock. Prevents Windows from reallocating I/O and IRQ
> >         resources on the PCI bus. The I/O and memory resources set by
> >         the BIOS are preserved.
> > 
> > Which seems to suggest the default is to reallocate, but I don't know.
> 
> Without having known of this option, I would take it as "may", not
> "will" reallocate.

That does seem likely, yes.

Ian.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 11:59                             ` Ian Campbell
  0 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-12 11:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Wed, 2013-06-12 at 12:56 +0100, Jan Beulich wrote:
> >>> On 12.06.13 at 13:23, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2013-06-12 at 11:07 +0100, Jan Beulich wrote:

> >> And then - this might cover Linux, but what about other OSes,
> >> namely Windows?
> > 
> > True, I'm not sure if/how this can be done.
> > 
> > The only reference I could find was at
> > http://windows.microsoft.com/en-id/windows7/using-system-configuration 
> > which says under "Advanced boot options":
> >         PCI Lock. Prevents Windows from reallocating I/O and IRQ
> >         resources on the PCI bus. The I/O and memory resources set by
> >         the BIOS are preserved.
> > 
> > Which seems to suggest the default is to reallocate, but I don't know.
> 
> Without having known of this option, I would take it as "may", not
> "will" reallocate.

That does seem likely, yes.

Ian.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12  7:25               ` Jan Beulich
@ 2013-06-12 10:05                 ` George Dunlap
  -1 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-12 10:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On 12/06/13 08:25, Jan Beulich wrote:
>>>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
>> I went through the code that maps the PCI MMIO regions in hvmloader
>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>> region is larger than 512MB.
>>
>> Maybe we could just relax this condition and map the device memory to
>> high memory no matter the size of the MMIO region if the PCI bar is
>> 64-bit?
> I can only recommend not to: For one, guests not using PAE or
> PSE-36 can't map such space at all (and older OSes may not
> properly deal with 64-bit BARs at all). And then one would generally
> expect this allocation to be done top down (to minimize risk of
> running into RAM), and doing so is going to present further risks of
> incompatibilities with guest OSes (Linux for example learned only in
> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> PFN to pfn_pte(), the respective parameter of which is
> "unsigned long").
>
> I think this ought to be done in an iterative process - if all MMIO
> regions together don't fit below 4G, the biggest one should be
> moved up beyond 4G first, followed by the next to biggest one
> etc.

First of all, the proposal to move the PCI BAR up to the 64-bit range is 
a temporary work-around.  It should only be done if a device doesn't fit 
in the current MMIO range.

We have three options here:
1. Don't do anything
2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they 
don't fit
3. Convince qemu to allow MMIO regions to mask memory (or what it thinks 
is memory).
4. Add a mechanism to tell qemu that memory is being relocated.

Number 4 is definitely the right answer long-term, but we just don't 
have time to do that before the 4.3 release.  We're not sure yet if #3 
is possible; even if it is, it may have unpredictable knock-on effects.

Doing #2, it is true that many guests will be unable to access the 
device because of 32-bit limitations.  However, in #1, *no* guests will 
be able to access the device.  At least in #2, *many* guests will be 
able to do so.  In any case, apparently #2 is what KVM does, so having 
the limitation on guests is not without precedent.  It's also likely to 
be a somewhat tested configuration (unlike #3, for example).

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 10:05                 ` George Dunlap
  0 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-12 10:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On 12/06/13 08:25, Jan Beulich wrote:
>>>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
>> I went through the code that maps the PCI MMIO regions in hvmloader
>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>> region is larger than 512MB.
>>
>> Maybe we could just relax this condition and map the device memory to
>> high memory no matter the size of the MMIO region if the PCI bar is
>> 64-bit?
> I can only recommend not to: For one, guests not using PAE or
> PSE-36 can't map such space at all (and older OSes may not
> properly deal with 64-bit BARs at all). And then one would generally
> expect this allocation to be done top down (to minimize risk of
> running into RAM), and doing so is going to present further risks of
> incompatibilities with guest OSes (Linux for example learned only in
> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> PFN to pfn_pte(), the respective parameter of which is
> "unsigned long").
>
> I think this ought to be done in an iterative process - if all MMIO
> regions together don't fit below 4G, the biggest one should be
> moved up beyond 4G first, followed by the next to biggest one
> etc.

First of all, the proposal to move the PCI BAR up to the 64-bit range is 
a temporary work-around.  It should only be done if a device doesn't fit 
in the current MMIO range.

We have three options here:
1. Don't do anything
2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they 
don't fit
3. Convince qemu to allow MMIO regions to mask memory (or what it thinks 
is memory).
4. Add a mechanism to tell qemu that memory is being relocated.

Number 4 is definitely the right answer long-term, but we just don't 
have time to do that before the 4.3 release.  We're not sure yet if #3 
is possible; even if it is, it may have unpredictable knock-on effects.

Doing #2, it is true that many guests will be unable to access the 
device because of 32-bit limitations.  However, in #1, *no* guests will 
be able to access the device.  At least in #2, *many* guests will be 
able to do so.  In any case, apparently #2 is what KVM does, so having 
the limitation on guests is not without precedent.  It's also likely to 
be a somewhat tested configuration (unlike #3, for example).

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12 10:05                 ` George Dunlap
@ 2013-06-12 10:11                   ` Jan Beulich
  -1 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12 10:11 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

>>> On 12.06.13 at 12:05, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 12/06/13 08:25, Jan Beulich wrote:
>>>>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> 
> wrote:
>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>> region is larger than 512MB.
>>>
>>> Maybe we could just relax this condition and map the device memory to
>>> high memory no matter the size of the MMIO region if the PCI bar is
>>> 64-bit?
>> I can only recommend not to: For one, guests not using PAE or
>> PSE-36 can't map such space at all (and older OSes may not
>> properly deal with 64-bit BARs at all). And then one would generally
>> expect this allocation to be done top down (to minimize risk of
>> running into RAM), and doing so is going to present further risks of
>> incompatibilities with guest OSes (Linux for example learned only in
>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>> PFN to pfn_pte(), the respective parameter of which is
>> "unsigned long").
>>
>> I think this ought to be done in an iterative process - if all MMIO
>> regions together don't fit below 4G, the biggest one should be
>> moved up beyond 4G first, followed by the next to biggest one
>> etc.
> 
> First of all, the proposal to move the PCI BAR up to the 64-bit range is 
> a temporary work-around.  It should only be done if a device doesn't fit 
> in the current MMIO range.
> 
> We have three options here:
> 1. Don't do anything
> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they 
> don't fit
> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks 
> is memory).
> 4. Add a mechanism to tell qemu that memory is being relocated.
> 
> Number 4 is definitely the right answer long-term, but we just don't 
> have time to do that before the 4.3 release.  We're not sure yet if #3 
> is possible; even if it is, it may have unpredictable knock-on effects.
> 
> Doing #2, it is true that many guests will be unable to access the 
> device because of 32-bit limitations.  However, in #1, *no* guests will 
> be able to access the device.  At least in #2, *many* guests will be 
> able to do so.  In any case, apparently #2 is what KVM does, so having 
> the limitation on guests is not without precedent.  It's also likely to 
> be a somewhat tested configuration (unlike #3, for example).

That's all fine with me. My objection was to Stefano's consideration
to assign high addresses to _all_ 64-bit capable BARs up, not just
the biggest one(s).

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 10:11                   ` Jan Beulich
  0 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12 10:11 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

>>> On 12.06.13 at 12:05, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 12/06/13 08:25, Jan Beulich wrote:
>>>>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com> 
> wrote:
>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>> region is larger than 512MB.
>>>
>>> Maybe we could just relax this condition and map the device memory to
>>> high memory no matter the size of the MMIO region if the PCI bar is
>>> 64-bit?
>> I can only recommend not to: For one, guests not using PAE or
>> PSE-36 can't map such space at all (and older OSes may not
>> properly deal with 64-bit BARs at all). And then one would generally
>> expect this allocation to be done top down (to minimize risk of
>> running into RAM), and doing so is going to present further risks of
>> incompatibilities with guest OSes (Linux for example learned only in
>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>> PFN to pfn_pte(), the respective parameter of which is
>> "unsigned long").
>>
>> I think this ought to be done in an iterative process - if all MMIO
>> regions together don't fit below 4G, the biggest one should be
>> moved up beyond 4G first, followed by the next to biggest one
>> etc.
> 
> First of all, the proposal to move the PCI BAR up to the 64-bit range is 
> a temporary work-around.  It should only be done if a device doesn't fit 
> in the current MMIO range.
> 
> We have three options here:
> 1. Don't do anything
> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they 
> don't fit
> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks 
> is memory).
> 4. Add a mechanism to tell qemu that memory is being relocated.
> 
> Number 4 is definitely the right answer long-term, but we just don't 
> have time to do that before the 4.3 release.  We're not sure yet if #3 
> is possible; even if it is, it may have unpredictable knock-on effects.
> 
> Doing #2, it is true that many guests will be unable to access the 
> device because of 32-bit limitations.  However, in #1, *no* guests will 
> be able to access the device.  At least in #2, *many* guests will be 
> able to do so.  In any case, apparently #2 is what KVM does, so having 
> the limitation on guests is not without precedent.  It's also likely to 
> be a somewhat tested configuration (unlike #3, for example).

That's all fine with me. My objection was to Stefano's consideration
to assign high addresses to _all_ 64-bit capable BARs up, not just
the biggest one(s).

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12 10:11                   ` Jan Beulich
@ 2013-06-12 10:15                     ` George Dunlap
  -1 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-12 10:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On 12/06/13 11:11, Jan Beulich wrote:
>>>> On 12.06.13 at 12:05, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> On 12/06/13 08:25, Jan Beulich wrote:
>>>>>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>> wrote:
>>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>>> region is larger than 512MB.
>>>>
>>>> Maybe we could just relax this condition and map the device memory to
>>>> high memory no matter the size of the MMIO region if the PCI bar is
>>>> 64-bit?
>>> I can only recommend not to: For one, guests not using PAE or
>>> PSE-36 can't map such space at all (and older OSes may not
>>> properly deal with 64-bit BARs at all). And then one would generally
>>> expect this allocation to be done top down (to minimize risk of
>>> running into RAM), and doing so is going to present further risks of
>>> incompatibilities with guest OSes (Linux for example learned only in
>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>>> PFN to pfn_pte(), the respective parameter of which is
>>> "unsigned long").
>>>
>>> I think this ought to be done in an iterative process - if all MMIO
>>> regions together don't fit below 4G, the biggest one should be
>>> moved up beyond 4G first, followed by the next to biggest one
>>> etc.
>> First of all, the proposal to move the PCI BAR up to the 64-bit range is
>> a temporary work-around.  It should only be done if a device doesn't fit
>> in the current MMIO range.
>>
>> We have three options here:
>> 1. Don't do anything
>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
>> don't fit
>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
>> is memory).
>> 4. Add a mechanism to tell qemu that memory is being relocated.
>>
>> Number 4 is definitely the right answer long-term, but we just don't
>> have time to do that before the 4.3 release.  We're not sure yet if #3
>> is possible; even if it is, it may have unpredictable knock-on effects.
>>
>> Doing #2, it is true that many guests will be unable to access the
>> device because of 32-bit limitations.  However, in #1, *no* guests will
>> be able to access the device.  At least in #2, *many* guests will be
>> able to do so.  In any case, apparently #2 is what KVM does, so having
>> the limitation on guests is not without precedent.  It's also likely to
>> be a somewhat tested configuration (unlike #3, for example).
> That's all fine with me. My objection was to Stefano's consideration
> to assign high addresses to _all_ 64-bit capable BARs up, not just
> the biggest one(s).

Oh right -- I understood him to mean, "*allow* hvmloader to map the 
device memory to high memory *if necessary* if the BAR is 64-bit". I 
agree, mapping them all at 64-bit even if there's room in the 32-bit 
hole isn't a good idea.

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 10:15                     ` George Dunlap
  0 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-12 10:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On 12/06/13 11:11, Jan Beulich wrote:
>>>> On 12.06.13 at 12:05, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> On 12/06/13 08:25, Jan Beulich wrote:
>>>>>> On 11.06.13 at 19:26, Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>> wrote:
>>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>>> region is larger than 512MB.
>>>>
>>>> Maybe we could just relax this condition and map the device memory to
>>>> high memory no matter the size of the MMIO region if the PCI bar is
>>>> 64-bit?
>>> I can only recommend not to: For one, guests not using PAE or
>>> PSE-36 can't map such space at all (and older OSes may not
>>> properly deal with 64-bit BARs at all). And then one would generally
>>> expect this allocation to be done top down (to minimize risk of
>>> running into RAM), and doing so is going to present further risks of
>>> incompatibilities with guest OSes (Linux for example learned only in
>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>>> PFN to pfn_pte(), the respective parameter of which is
>>> "unsigned long").
>>>
>>> I think this ought to be done in an iterative process - if all MMIO
>>> regions together don't fit below 4G, the biggest one should be
>>> moved up beyond 4G first, followed by the next to biggest one
>>> etc.
>> First of all, the proposal to move the PCI BAR up to the 64-bit range is
>> a temporary work-around.  It should only be done if a device doesn't fit
>> in the current MMIO range.
>>
>> We have three options here:
>> 1. Don't do anything
>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
>> don't fit
>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
>> is memory).
>> 4. Add a mechanism to tell qemu that memory is being relocated.
>>
>> Number 4 is definitely the right answer long-term, but we just don't
>> have time to do that before the 4.3 release.  We're not sure yet if #3
>> is possible; even if it is, it may have unpredictable knock-on effects.
>>
>> Doing #2, it is true that many guests will be unable to access the
>> device because of 32-bit limitations.  However, in #1, *no* guests will
>> be able to access the device.  At least in #2, *many* guests will be
>> able to do so.  In any case, apparently #2 is what KVM does, so having
>> the limitation on guests is not without precedent.  It's also likely to
>> be a somewhat tested configuration (unlike #3, for example).
> That's all fine with me. My objection was to Stefano's consideration
> to assign high addresses to _all_ 64-bit capable BARs up, not just
> the biggest one(s).

Oh right -- I understood him to mean, "*allow* hvmloader to map the 
device memory to high memory *if necessary* if the BAR is 64-bit". I 
agree, mapping them all at 64-bit even if there's room in the 32-bit 
hole isn't a good idea.

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12 10:05                 ` George Dunlap
@ 2013-06-12 13:23                   ` Paolo Bonzini
  -1 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2013-06-12 13:23 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	YongweiX Xu, SongtaoX Liu, xen-devel

Il 12/06/2013 06:05, George Dunlap ha scritto:
> On 12/06/13 08:25, Jan Beulich wrote:
>>>>> On 11.06.13 at 19:26, Stefano Stabellini
>>>>> <stefano.stabellini@eu.citrix.com> wrote:
>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>> region is larger than 512MB.
>>>
>>> Maybe we could just relax this condition and map the device memory to
>>> high memory no matter the size of the MMIO region if the PCI bar is
>>> 64-bit?
>> I can only recommend not to: For one, guests not using PAE or
>> PSE-36 can't map such space at all (and older OSes may not
>> properly deal with 64-bit BARs at all). And then one would generally
>> expect this allocation to be done top down (to minimize risk of
>> running into RAM), and doing so is going to present further risks of
>> incompatibilities with guest OSes (Linux for example learned only in
>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>> PFN to pfn_pte(), the respective parameter of which is
>> "unsigned long").
>>
>> I think this ought to be done in an iterative process - if all MMIO
>> regions together don't fit below 4G, the biggest one should be
>> moved up beyond 4G first, followed by the next to biggest one
>> etc.
> 
> First of all, the proposal to move the PCI BAR up to the 64-bit range is
> a temporary work-around.  It should only be done if a device doesn't fit
> in the current MMIO range.
> 
> We have three options here:
> 1. Don't do anything
> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
> don't fit
> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
> is memory).
> 4. Add a mechanism to tell qemu that memory is being relocated.
> 
> Number 4 is definitely the right answer long-term, but we just don't
> have time to do that before the 4.3 release.  We're not sure yet if #3
> is possible; even if it is, it may have unpredictable knock-on effects.

#3 should be possible or even the default (would need to check), but #4
is probably a bit harder to do.  Perhaps you can use a magic I/O port
for the xen platform PV driver, but if you can simply use two PCI
windows it would be much simpler because that's the same that TCG and
KVM already do.  The code is all there for you to lift in SeaBIOS.

Only Windows XP and older had problems with that because they didn't
like something in the ASL; but the 64-bit window is placed at the end of
RAM, so in principle any PAE-enabled OS can use it.

Paolo

> Doing #2, it is true that many guests will be unable to access the
> device because of 32-bit limitations.  However, in #1, *no* guests will
> be able to access the device.  At least in #2, *many* guests will be
> able to do so.  In any case, apparently #2 is what KVM does, so having
> the limitation on guests is not without precedent.  It's also likely to
> be a somewhat tested configuration (unlike #3, for example).
> 
>  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 13:23                   ` Paolo Bonzini
  0 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2013-06-12 13:23 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	YongweiX Xu, SongtaoX Liu, xen-devel

Il 12/06/2013 06:05, George Dunlap ha scritto:
> On 12/06/13 08:25, Jan Beulich wrote:
>>>>> On 11.06.13 at 19:26, Stefano Stabellini
>>>>> <stefano.stabellini@eu.citrix.com> wrote:
>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>> region is larger than 512MB.
>>>
>>> Maybe we could just relax this condition and map the device memory to
>>> high memory no matter the size of the MMIO region if the PCI bar is
>>> 64-bit?
>> I can only recommend not to: For one, guests not using PAE or
>> PSE-36 can't map such space at all (and older OSes may not
>> properly deal with 64-bit BARs at all). And then one would generally
>> expect this allocation to be done top down (to minimize risk of
>> running into RAM), and doing so is going to present further risks of
>> incompatibilities with guest OSes (Linux for example learned only in
>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>> PFN to pfn_pte(), the respective parameter of which is
>> "unsigned long").
>>
>> I think this ought to be done in an iterative process - if all MMIO
>> regions together don't fit below 4G, the biggest one should be
>> moved up beyond 4G first, followed by the next to biggest one
>> etc.
> 
> First of all, the proposal to move the PCI BAR up to the 64-bit range is
> a temporary work-around.  It should only be done if a device doesn't fit
> in the current MMIO range.
> 
> We have three options here:
> 1. Don't do anything
> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
> don't fit
> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
> is memory).
> 4. Add a mechanism to tell qemu that memory is being relocated.
> 
> Number 4 is definitely the right answer long-term, but we just don't
> have time to do that before the 4.3 release.  We're not sure yet if #3
> is possible; even if it is, it may have unpredictable knock-on effects.

#3 should be possible or even the default (would need to check), but #4
is probably a bit harder to do.  Perhaps you can use a magic I/O port
for the xen platform PV driver, but if you can simply use two PCI
windows it would be much simpler because that's the same that TCG and
KVM already do.  The code is all there for you to lift in SeaBIOS.

Only Windows XP and older had problems with that because they didn't
like something in the ASL; but the 64-bit window is placed at the end of
RAM, so in principle any PAE-enabled OS can use it.

Paolo

> Doing #2, it is true that many guests will be unable to access the
> device because of 32-bit limitations.  However, in #1, *no* guests will
> be able to access the device.  At least in #2, *many* guests will be
> able to do so.  In any case, apparently #2 is what KVM does, so having
> the limitation on guests is not without precedent.  It's also likely to
> be a somewhat tested configuration (unlike #3, for example).
> 
>  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12 13:23                   ` Paolo Bonzini
@ 2013-06-12 13:49                     ` Jan Beulich
  -1 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12 13:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, George Dunlap, Xudong Hao, Stefano Stabellini,
	luonengjun, qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	YongweiX Xu, SongtaoX Liu, xen-devel

>>> On 12.06.13 at 15:23, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 12/06/2013 06:05, George Dunlap ha scritto:
>> On 12/06/13 08:25, Jan Beulich wrote:
>>>>>> On 11.06.13 at 19:26, Stefano Stabellini
>>>>>> <stefano.stabellini@eu.citrix.com> wrote:
>>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>>> region is larger than 512MB.
>>>>
>>>> Maybe we could just relax this condition and map the device memory to
>>>> high memory no matter the size of the MMIO region if the PCI bar is
>>>> 64-bit?
>>> I can only recommend not to: For one, guests not using PAE or
>>> PSE-36 can't map such space at all (and older OSes may not
>>> properly deal with 64-bit BARs at all). And then one would generally
>>> expect this allocation to be done top down (to minimize risk of
>>> running into RAM), and doing so is going to present further risks of
>>> incompatibilities with guest OSes (Linux for example learned only in
>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>>> PFN to pfn_pte(), the respective parameter of which is
>>> "unsigned long").
>>>
>>> I think this ought to be done in an iterative process - if all MMIO
>>> regions together don't fit below 4G, the biggest one should be
>>> moved up beyond 4G first, followed by the next to biggest one
>>> etc.
>> 
>> First of all, the proposal to move the PCI BAR up to the 64-bit range is
>> a temporary work-around.  It should only be done if a device doesn't fit
>> in the current MMIO range.
>> 
>> We have three options here:
>> 1. Don't do anything
>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
>> don't fit
>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
>> is memory).
>> 4. Add a mechanism to tell qemu that memory is being relocated.
>> 
>> Number 4 is definitely the right answer long-term, but we just don't
>> have time to do that before the 4.3 release.  We're not sure yet if #3
>> is possible; even if it is, it may have unpredictable knock-on effects.
> 
> #3 should be possible or even the default (would need to check), but #4
> is probably a bit harder to do.  Perhaps you can use a magic I/O port
> for the xen platform PV driver, but if you can simply use two PCI
> windows it would be much simpler because that's the same that TCG and
> KVM already do.  The code is all there for you to lift in SeaBIOS.

What is the connection here to the platform PV driver?

> Only Windows XP and older had problems with that because they didn't
> like something in the ASL; but the 64-bit window is placed at the end of
> RAM, so in principle any PAE-enabled OS can use it.

At the end of _RAM_???

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 13:49                     ` Jan Beulich
  0 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12 13:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, George Dunlap, Xudong Hao, Stefano Stabellini,
	luonengjun, qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	YongweiX Xu, SongtaoX Liu, xen-devel

>>> On 12.06.13 at 15:23, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 12/06/2013 06:05, George Dunlap ha scritto:
>> On 12/06/13 08:25, Jan Beulich wrote:
>>>>>> On 11.06.13 at 19:26, Stefano Stabellini
>>>>>> <stefano.stabellini@eu.citrix.com> wrote:
>>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>>> region is larger than 512MB.
>>>>
>>>> Maybe we could just relax this condition and map the device memory to
>>>> high memory no matter the size of the MMIO region if the PCI bar is
>>>> 64-bit?
>>> I can only recommend not to: For one, guests not using PAE or
>>> PSE-36 can't map such space at all (and older OSes may not
>>> properly deal with 64-bit BARs at all). And then one would generally
>>> expect this allocation to be done top down (to minimize risk of
>>> running into RAM), and doing so is going to present further risks of
>>> incompatibilities with guest OSes (Linux for example learned only in
>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>>> PFN to pfn_pte(), the respective parameter of which is
>>> "unsigned long").
>>>
>>> I think this ought to be done in an iterative process - if all MMIO
>>> regions together don't fit below 4G, the biggest one should be
>>> moved up beyond 4G first, followed by the next to biggest one
>>> etc.
>> 
>> First of all, the proposal to move the PCI BAR up to the 64-bit range is
>> a temporary work-around.  It should only be done if a device doesn't fit
>> in the current MMIO range.
>> 
>> We have three options here:
>> 1. Don't do anything
>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
>> don't fit
>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
>> is memory).
>> 4. Add a mechanism to tell qemu that memory is being relocated.
>> 
>> Number 4 is definitely the right answer long-term, but we just don't
>> have time to do that before the 4.3 release.  We're not sure yet if #3
>> is possible; even if it is, it may have unpredictable knock-on effects.
> 
> #3 should be possible or even the default (would need to check), but #4
> is probably a bit harder to do.  Perhaps you can use a magic I/O port
> for the xen platform PV driver, but if you can simply use two PCI
> windows it would be much simpler because that's the same that TCG and
> KVM already do.  The code is all there for you to lift in SeaBIOS.

What is the connection here to the platform PV driver?

> Only Windows XP and older had problems with that because they didn't
> like something in the ASL; but the 64-bit window is placed at the end of
> RAM, so in principle any PAE-enabled OS can use it.

At the end of _RAM_???

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12 13:49                     ` Jan Beulich
@ 2013-06-12 14:02                       ` Paolo Bonzini
  -1 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2013-06-12 14:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, George Dunlap, Xudong Hao, Stefano Stabellini,
	luonengjun, qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	YongweiX Xu, SongtaoX Liu, xen-devel

Il 12/06/2013 09:49, Jan Beulich ha scritto:
>> #3 should be possible or even the default (would need to check), but #4
>> is probably a bit harder to do.  Perhaps you can use a magic I/O port
>> for the xen platform PV driver, but if you can simply use two PCI
>> windows it would be much simpler because that's the same that TCG and
>> KVM already do.  The code is all there for you to lift in SeaBIOS.
> 
> What is the connection here to the platform PV driver?

It's just a hook you already have for Xen-specific stuff in QEMU.

>> Only Windows XP and older had problems with that because they didn't
>> like something in the ASL; but the 64-bit window is placed at the end of
>> RAM, so in principle any PAE-enabled OS can use it.
> 
> At the end of _RAM_???

Why the question marks? :)

If you have 4GB of RAM it will end at 0x140000000 (or something like
that) and that's where the 64-bit window starts.  Of course if you have
no RAM above the PCI hole, the 64-bit window will start at 0x100000000.

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 14:02                       ` Paolo Bonzini
  0 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2013-06-12 14:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, George Dunlap, Xudong Hao, Stefano Stabellini,
	luonengjun, qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	YongweiX Xu, SongtaoX Liu, xen-devel

Il 12/06/2013 09:49, Jan Beulich ha scritto:
>> #3 should be possible or even the default (would need to check), but #4
>> is probably a bit harder to do.  Perhaps you can use a magic I/O port
>> for the xen platform PV driver, but if you can simply use two PCI
>> windows it would be much simpler because that's the same that TCG and
>> KVM already do.  The code is all there for you to lift in SeaBIOS.
> 
> What is the connection here to the platform PV driver?

It's just a hook you already have for Xen-specific stuff in QEMU.

>> Only Windows XP and older had problems with that because they didn't
>> like something in the ASL; but the 64-bit window is placed at the end of
>> RAM, so in principle any PAE-enabled OS can use it.
> 
> At the end of _RAM_???

Why the question marks? :)

If you have 4GB of RAM it will end at 0x140000000 (or something like
that) and that's where the 64-bit window starts.  Of course if you have
no RAM above the PCI hole, the 64-bit window will start at 0x100000000.

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12 14:02                       ` Paolo Bonzini
@ 2013-06-12 14:19                         ` Jan Beulich
  -1 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12 14:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, George Dunlap, Xudong Hao, Stefano Stabellini,
	luonengjun, qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	YongweiX Xu, SongtaoX Liu, xen-devel

>>> On 12.06.13 at 16:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 12/06/2013 09:49, Jan Beulich ha scritto:
>>> #3 should be possible or even the default (would need to check), but #4
>>> is probably a bit harder to do.  Perhaps you can use a magic I/O port
>>> for the xen platform PV driver, but if you can simply use two PCI
>>> windows it would be much simpler because that's the same that TCG and
>>> KVM already do.  The code is all there for you to lift in SeaBIOS.
>> 
>> What is the connection here to the platform PV driver?
> 
> It's just a hook you already have for Xen-specific stuff in QEMU.

Oh, sorry, I'm generally taking this term to refer to a Linux
component.

>>> Only Windows XP and older had problems with that because they didn't
>>> like something in the ASL; but the 64-bit window is placed at the end of
>>> RAM, so in principle any PAE-enabled OS can use it.
>> 
>> At the end of _RAM_???
> 
> Why the question marks? :)

Ah, so mean right after RAM. "At the end of RAM" reads like
overlapping (and discarding) the tail of it.

> If you have 4GB of RAM it will end at 0x140000000 (or something like
> that) and that's where the 64-bit window starts.  Of course if you have
> no RAM above the PCI hole, the 64-bit window will start at 0x100000000.

So there's no provision whatsoever for extending the amount of RAM
a guest may see? This is why I'd see any such allocation strategy to
start at the end of physical address space, moving downwards.

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 14:19                         ` Jan Beulich
  0 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-12 14:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, George Dunlap, Xudong Hao, Stefano Stabellini,
	luonengjun, qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	YongweiX Xu, SongtaoX Liu, xen-devel

>>> On 12.06.13 at 16:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 12/06/2013 09:49, Jan Beulich ha scritto:
>>> #3 should be possible or even the default (would need to check), but #4
>>> is probably a bit harder to do.  Perhaps you can use a magic I/O port
>>> for the xen platform PV driver, but if you can simply use two PCI
>>> windows it would be much simpler because that's the same that TCG and
>>> KVM already do.  The code is all there for you to lift in SeaBIOS.
>> 
>> What is the connection here to the platform PV driver?
> 
> It's just a hook you already have for Xen-specific stuff in QEMU.

Oh, sorry, I'm generally taking this term to refer to a Linux
component.

>>> Only Windows XP and older had problems with that because they didn't
>>> like something in the ASL; but the 64-bit window is placed at the end of
>>> RAM, so in principle any PAE-enabled OS can use it.
>> 
>> At the end of _RAM_???
> 
> Why the question marks? :)

Ah, so mean right after RAM. "At the end of RAM" reads like
overlapping (and discarding) the tail of it.

> If you have 4GB of RAM it will end at 0x140000000 (or something like
> that) and that's where the 64-bit window starts.  Of course if you have
> no RAM above the PCI hole, the 64-bit window will start at 0x100000000.

So there's no provision whatsoever for extending the amount of RAM
a guest may see? This is why I'd see any such allocation strategy to
start at the end of physical address space, moving downwards.

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12 14:19                         ` Jan Beulich
@ 2013-06-12 15:25                           ` George Dunlap
  -1 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-12 15:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, YongweiX Xu,
	Paolo Bonzini, SongtaoX Liu, xen-devel

On 12/06/13 15:19, Jan Beulich wrote:
>>>> On 12.06.13 at 16:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 12/06/2013 09:49, Jan Beulich ha scritto:
>>>> #3 should be possible or even the default (would need to check), but #4
>>>> is probably a bit harder to do.  Perhaps you can use a magic I/O port
>>>> for the xen platform PV driver, but if you can simply use two PCI
>>>> windows it would be much simpler because that's the same that TCG and
>>>> KVM already do.  The code is all there for you to lift in SeaBIOS.
>>> What is the connection here to the platform PV driver?
>> It's just a hook you already have for Xen-specific stuff in QEMU.
> Oh, sorry, I'm generally taking this term to refer to a Linux
> component.
>
>>>> Only Windows XP and older had problems with that because they didn't
>>>> like something in the ASL; but the 64-bit window is placed at the end of
>>>> RAM, so in principle any PAE-enabled OS can use it.
>>> At the end of _RAM_???
>> Why the question marks? :)
> Ah, so mean right after RAM. "At the end of RAM" reads like
> overlapping (and discarding) the tail of it.
>
>> If you have 4GB of RAM it will end at 0x140000000 (or something like
>> that) and that's where the 64-bit window starts.  Of course if you have
>> no RAM above the PCI hole, the 64-bit window will start at 0x100000000.
> So there's no provision whatsoever for extending the amount of RAM
> a guest may see? This is why I'd see any such allocation strategy to
> start at the end of physical address space, moving downwards.

Is there a mechanism to do memory hot-plug in qemu at the moment? If 
not, then there's no reason to put it anywhere else.

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 15:25                           ` George Dunlap
  0 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-12 15:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, YongweiX Xu,
	Paolo Bonzini, SongtaoX Liu, xen-devel

On 12/06/13 15:19, Jan Beulich wrote:
>>>> On 12.06.13 at 16:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 12/06/2013 09:49, Jan Beulich ha scritto:
>>>> #3 should be possible or even the default (would need to check), but #4
>>>> is probably a bit harder to do.  Perhaps you can use a magic I/O port
>>>> for the xen platform PV driver, but if you can simply use two PCI
>>>> windows it would be much simpler because that's the same that TCG and
>>>> KVM already do.  The code is all there for you to lift in SeaBIOS.
>>> What is the connection here to the platform PV driver?
>> It's just a hook you already have for Xen-specific stuff in QEMU.
> Oh, sorry, I'm generally taking this term to refer to a Linux
> component.
>
>>>> Only Windows XP and older had problems with that because they didn't
>>>> like something in the ASL; but the 64-bit window is placed at the end of
>>>> RAM, so in principle any PAE-enabled OS can use it.
>>> At the end of _RAM_???
>> Why the question marks? :)
> Ah, so mean right after RAM. "At the end of RAM" reads like
> overlapping (and discarding) the tail of it.
>
>> If you have 4GB of RAM it will end at 0x140000000 (or something like
>> that) and that's where the 64-bit window starts.  Of course if you have
>> no RAM above the PCI hole, the 64-bit window will start at 0x100000000.
> So there's no provision whatsoever for extending the amount of RAM
> a guest may see? This is why I'd see any such allocation strategy to
> start at the end of physical address space, moving downwards.

Is there a mechanism to do memory hot-plug in qemu at the moment? If 
not, then there's no reason to put it anywhere else.

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12 15:25                           ` George Dunlap
@ 2013-06-12 20:13                             ` Paolo Bonzini
  -1 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2013-06-12 20:13 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	YongweiX Xu, SongtaoX Liu, xen-devel

Il 12/06/2013 11:25, George Dunlap ha scritto:
>>> If you have 4GB of RAM it will end at 0x140000000 (or something like
>>> that) and that's where the 64-bit window starts.  Of course if you have
>>> no RAM above the PCI hole, the 64-bit window will start at 0x100000000.
>> So there's no provision whatsoever for extending the amount of RAM
>> a guest may see? This is why I'd see any such allocation strategy to
>> start at the end of physical address space, moving downwards.

That'd work too, I guess.

> Is there a mechanism to do memory hot-plug in qemu at the moment? If
> not, then there's no reason to put it anywhere else.

Not yet, but then memory could also be discontiguous as long as you
describe it correctly in the tables.

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-12 20:13                             ` Paolo Bonzini
  0 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2013-06-12 20:13 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	YongweiX Xu, SongtaoX Liu, xen-devel

Il 12/06/2013 11:25, George Dunlap ha scritto:
>>> If you have 4GB of RAM it will end at 0x140000000 (or something like
>>> that) and that's where the 64-bit window starts.  Of course if you have
>>> no RAM above the PCI hole, the 64-bit window will start at 0x100000000.
>> So there's no provision whatsoever for extending the amount of RAM
>> a guest may see? This is why I'd see any such allocation strategy to
>> start at the end of physical address space, moving downwards.

That'd work too, I guess.

> Is there a mechanism to do memory hot-plug in qemu at the moment? If
> not, then there's no reason to put it anywhere else.

Not yet, but then memory could also be discontiguous as long as you
describe it correctly in the tables.

Paolo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-12 10:05                 ` George Dunlap
@ 2013-06-13 13:44                   ` Stefano Stabellini
  -1 siblings, 0 replies; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-13 13:44 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Wed, 12 Jun 2013, George Dunlap wrote:
> On 12/06/13 08:25, Jan Beulich wrote:
> > > > > On 11.06.13 at 19:26, Stefano Stabellini
> > > > > <stefano.stabellini@eu.citrix.com> wrote:
> > > I went through the code that maps the PCI MMIO regions in hvmloader
> > > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> > > maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> > > region is larger than 512MB.
> > > 
> > > Maybe we could just relax this condition and map the device memory to
> > > high memory no matter the size of the MMIO region if the PCI bar is
> > > 64-bit?
> > I can only recommend not to: For one, guests not using PAE or
> > PSE-36 can't map such space at all (and older OSes may not
> > properly deal with 64-bit BARs at all). And then one would generally
> > expect this allocation to be done top down (to minimize risk of
> > running into RAM), and doing so is going to present further risks of
> > incompatibilities with guest OSes (Linux for example learned only in
> > 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> > 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> > PFN to pfn_pte(), the respective parameter of which is
> > "unsigned long").
> > 
> > I think this ought to be done in an iterative process - if all MMIO
> > regions together don't fit below 4G, the biggest one should be
> > moved up beyond 4G first, followed by the next to biggest one
> > etc.
> 
> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> temporary work-around.  It should only be done if a device doesn't fit in the
> current MMIO range.
> 
> We have three options here:
> 1. Don't do anything
> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
> fit
> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
> memory).
> 4. Add a mechanism to tell qemu that memory is being relocated.
> 
> Number 4 is definitely the right answer long-term, but we just don't have time
> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
> if it is, it may have unpredictable knock-on effects.
> 
> Doing #2, it is true that many guests will be unable to access the device
> because of 32-bit limitations.  However, in #1, *no* guests will be able to
> access the device.  At least in #2, *many* guests will be able to do so.  In
> any case, apparently #2 is what KVM does, so having the limitation on guests
> is not without precedent.  It's also likely to be a somewhat tested
> configuration (unlike #3, for example).

I would avoid #3, because I don't think is a good idea to rely on that
behaviour.
I would also avoid #4, because having seen QEMU's code, it's wouldn't be
easy and certainly not doable in time for 4.3.

So we are left to play with the PCI MMIO region size and location in
hvmloader.

I agree with Jan that we shouldn't relocate unconditionally all the
devices to the region above 4G. I meant to say that we should relocate
only the ones that don't fit. And we shouldn't try to dynamically
increase the PCI hole below 4G because clearly that doesn't work.
However we could still increase the size of the PCI hole below 4G by
default from start at 0xf0000000 to starting at 0xe0000000.
Why do we know that is safe? Because in the current configuration
hvmloader *already* increases the PCI hole size by decreasing the start
address every time a device doesn't fit.
So it's already common for hvmloader to set pci_mem_start to
0xe0000000, you just need to assign a device with a PCI hole size big
enough.


My proposed solution is:

- set 0xe0000000 as the default PCI hole start for everybody, including
qemu-xen-traditional
- move above 4G everything that doesn't fit and support 64-bit bars
- print an error if the device doesn't fit and doesn't support 64-bit
bars

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 13:44                   ` Stefano Stabellini
  0 siblings, 0 replies; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-13 13:44 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Wed, 12 Jun 2013, George Dunlap wrote:
> On 12/06/13 08:25, Jan Beulich wrote:
> > > > > On 11.06.13 at 19:26, Stefano Stabellini
> > > > > <stefano.stabellini@eu.citrix.com> wrote:
> > > I went through the code that maps the PCI MMIO regions in hvmloader
> > > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> > > maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> > > region is larger than 512MB.
> > > 
> > > Maybe we could just relax this condition and map the device memory to
> > > high memory no matter the size of the MMIO region if the PCI bar is
> > > 64-bit?
> > I can only recommend not to: For one, guests not using PAE or
> > PSE-36 can't map such space at all (and older OSes may not
> > properly deal with 64-bit BARs at all). And then one would generally
> > expect this allocation to be done top down (to minimize risk of
> > running into RAM), and doing so is going to present further risks of
> > incompatibilities with guest OSes (Linux for example learned only in
> > 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> > 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> > PFN to pfn_pte(), the respective parameter of which is
> > "unsigned long").
> > 
> > I think this ought to be done in an iterative process - if all MMIO
> > regions together don't fit below 4G, the biggest one should be
> > moved up beyond 4G first, followed by the next to biggest one
> > etc.
> 
> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> temporary work-around.  It should only be done if a device doesn't fit in the
> current MMIO range.
> 
> We have three options here:
> 1. Don't do anything
> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
> fit
> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
> memory).
> 4. Add a mechanism to tell qemu that memory is being relocated.
> 
> Number 4 is definitely the right answer long-term, but we just don't have time
> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
> if it is, it may have unpredictable knock-on effects.
> 
> Doing #2, it is true that many guests will be unable to access the device
> because of 32-bit limitations.  However, in #1, *no* guests will be able to
> access the device.  At least in #2, *many* guests will be able to do so.  In
> any case, apparently #2 is what KVM does, so having the limitation on guests
> is not without precedent.  It's also likely to be a somewhat tested
> configuration (unlike #3, for example).

I would avoid #3, because I don't think is a good idea to rely on that
behaviour.
I would also avoid #4, because having seen QEMU's code, it's wouldn't be
easy and certainly not doable in time for 4.3.

So we are left to play with the PCI MMIO region size and location in
hvmloader.

I agree with Jan that we shouldn't relocate unconditionally all the
devices to the region above 4G. I meant to say that we should relocate
only the ones that don't fit. And we shouldn't try to dynamically
increase the PCI hole below 4G because clearly that doesn't work.
However we could still increase the size of the PCI hole below 4G by
default from start at 0xf0000000 to starting at 0xe0000000.
Why do we know that is safe? Because in the current configuration
hvmloader *already* increases the PCI hole size by decreasing the start
address every time a device doesn't fit.
So it's already common for hvmloader to set pci_mem_start to
0xe0000000, you just need to assign a device with a PCI hole size big
enough.


My proposed solution is:

- set 0xe0000000 as the default PCI hole start for everybody, including
qemu-xen-traditional
- move above 4G everything that doesn't fit and support 64-bit bars
- print an error if the device doesn't fit and doesn't support 64-bit
bars

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 13:44                   ` Stefano Stabellini
@ 2013-06-13 13:54                     ` George Dunlap
  -1 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-13 13:54 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Tim Deegan, Yongjie Ren, xen-devel, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, yanqiangjun, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu

On 13/06/13 14:44, Stefano Stabellini wrote:
> On Wed, 12 Jun 2013, George Dunlap wrote:
>> On 12/06/13 08:25, Jan Beulich wrote:
>>>>>> On 11.06.13 at 19:26, Stefano Stabellini
>>>>>> <stefano.stabellini@eu.citrix.com> wrote:
>>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>>> region is larger than 512MB.
>>>>
>>>> Maybe we could just relax this condition and map the device memory to
>>>> high memory no matter the size of the MMIO region if the PCI bar is
>>>> 64-bit?
>>> I can only recommend not to: For one, guests not using PAE or
>>> PSE-36 can't map such space at all (and older OSes may not
>>> properly deal with 64-bit BARs at all). And then one would generally
>>> expect this allocation to be done top down (to minimize risk of
>>> running into RAM), and doing so is going to present further risks of
>>> incompatibilities with guest OSes (Linux for example learned only in
>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>>> PFN to pfn_pte(), the respective parameter of which is
>>> "unsigned long").
>>>
>>> I think this ought to be done in an iterative process - if all MMIO
>>> regions together don't fit below 4G, the biggest one should be
>>> moved up beyond 4G first, followed by the next to biggest one
>>> etc.
>> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
>> temporary work-around.  It should only be done if a device doesn't fit in the
>> current MMIO range.
>>
>> We have three options here:
>> 1. Don't do anything
>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
>> fit
>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
>> memory).
>> 4. Add a mechanism to tell qemu that memory is being relocated.
>>
>> Number 4 is definitely the right answer long-term, but we just don't have time
>> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
>> if it is, it may have unpredictable knock-on effects.
>>
>> Doing #2, it is true that many guests will be unable to access the device
>> because of 32-bit limitations.  However, in #1, *no* guests will be able to
>> access the device.  At least in #2, *many* guests will be able to do so.  In
>> any case, apparently #2 is what KVM does, so having the limitation on guests
>> is not without precedent.  It's also likely to be a somewhat tested
>> configuration (unlike #3, for example).
> I would avoid #3, because I don't think is a good idea to rely on that
> behaviour.
> I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> easy and certainly not doable in time for 4.3.
>
> So we are left to play with the PCI MMIO region size and location in
> hvmloader.
>
> I agree with Jan that we shouldn't relocate unconditionally all the
> devices to the region above 4G. I meant to say that we should relocate
> only the ones that don't fit. And we shouldn't try to dynamically
> increase the PCI hole below 4G because clearly that doesn't work.
> However we could still increase the size of the PCI hole below 4G by
> default from start at 0xf0000000 to starting at 0xe0000000.
> Why do we know that is safe? Because in the current configuration
> hvmloader *already* increases the PCI hole size by decreasing the start
> address every time a device doesn't fit.
> So it's already common for hvmloader to set pci_mem_start to
> 0xe0000000, you just need to assign a device with a PCI hole size big
> enough.
>
>
> My proposed solution is:
>
> - set 0xe0000000 as the default PCI hole start for everybody, including
> qemu-xen-traditional
> - move above 4G everything that doesn't fit and support 64-bit bars
> - print an error if the device doesn't fit and doesn't support 64-bit
> bars

Also, as I understand it, at the moment:
1. Some operating systems (32-bit XP) won't be able to use relocated devices
2. Some devices (without 64-bit BARs) can't be relocated
3. qemu-traditional is fine with a resized <4GiB MMIO hole.

So if we have #1 or #2, at the moment an option for a work-around is to 
use qemu-traditional.

However, if we add your "print an error if the device doesn't fit", then 
this option will go away -- this will be a regression in functionality 
from 4.2.

I thought that what we had proposed was to have an option in xenstore, 
that libxl would set, which would instruct hvmloader whether to expand 
the MMIO hole and whether to relocate devices above 64-bit?

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 13:54                     ` George Dunlap
  0 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-13 13:54 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Tim Deegan, Yongjie Ren, xen-devel, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, yanqiangjun, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu

On 13/06/13 14:44, Stefano Stabellini wrote:
> On Wed, 12 Jun 2013, George Dunlap wrote:
>> On 12/06/13 08:25, Jan Beulich wrote:
>>>>>> On 11.06.13 at 19:26, Stefano Stabellini
>>>>>> <stefano.stabellini@eu.citrix.com> wrote:
>>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>>> region is larger than 512MB.
>>>>
>>>> Maybe we could just relax this condition and map the device memory to
>>>> high memory no matter the size of the MMIO region if the PCI bar is
>>>> 64-bit?
>>> I can only recommend not to: For one, guests not using PAE or
>>> PSE-36 can't map such space at all (and older OSes may not
>>> properly deal with 64-bit BARs at all). And then one would generally
>>> expect this allocation to be done top down (to minimize risk of
>>> running into RAM), and doing so is going to present further risks of
>>> incompatibilities with guest OSes (Linux for example learned only in
>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>>> PFN to pfn_pte(), the respective parameter of which is
>>> "unsigned long").
>>>
>>> I think this ought to be done in an iterative process - if all MMIO
>>> regions together don't fit below 4G, the biggest one should be
>>> moved up beyond 4G first, followed by the next to biggest one
>>> etc.
>> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
>> temporary work-around.  It should only be done if a device doesn't fit in the
>> current MMIO range.
>>
>> We have three options here:
>> 1. Don't do anything
>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
>> fit
>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
>> memory).
>> 4. Add a mechanism to tell qemu that memory is being relocated.
>>
>> Number 4 is definitely the right answer long-term, but we just don't have time
>> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
>> if it is, it may have unpredictable knock-on effects.
>>
>> Doing #2, it is true that many guests will be unable to access the device
>> because of 32-bit limitations.  However, in #1, *no* guests will be able to
>> access the device.  At least in #2, *many* guests will be able to do so.  In
>> any case, apparently #2 is what KVM does, so having the limitation on guests
>> is not without precedent.  It's also likely to be a somewhat tested
>> configuration (unlike #3, for example).
> I would avoid #3, because I don't think is a good idea to rely on that
> behaviour.
> I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> easy and certainly not doable in time for 4.3.
>
> So we are left to play with the PCI MMIO region size and location in
> hvmloader.
>
> I agree with Jan that we shouldn't relocate unconditionally all the
> devices to the region above 4G. I meant to say that we should relocate
> only the ones that don't fit. And we shouldn't try to dynamically
> increase the PCI hole below 4G because clearly that doesn't work.
> However we could still increase the size of the PCI hole below 4G by
> default from start at 0xf0000000 to starting at 0xe0000000.
> Why do we know that is safe? Because in the current configuration
> hvmloader *already* increases the PCI hole size by decreasing the start
> address every time a device doesn't fit.
> So it's already common for hvmloader to set pci_mem_start to
> 0xe0000000, you just need to assign a device with a PCI hole size big
> enough.
>
>
> My proposed solution is:
>
> - set 0xe0000000 as the default PCI hole start for everybody, including
> qemu-xen-traditional
> - move above 4G everything that doesn't fit and support 64-bit bars
> - print an error if the device doesn't fit and doesn't support 64-bit
> bars

Also, as I understand it, at the moment:
1. Some operating systems (32-bit XP) won't be able to use relocated devices
2. Some devices (without 64-bit BARs) can't be relocated
3. qemu-traditional is fine with a resized <4GiB MMIO hole.

So if we have #1 or #2, at the moment an option for a work-around is to 
use qemu-traditional.

However, if we add your "print an error if the device doesn't fit", then 
this option will go away -- this will be a regression in functionality 
from 4.2.

I thought that what we had proposed was to have an option in xenstore, 
that libxl would set, which would instruct hvmloader whether to expand 
the MMIO hole and whether to relocate devices above 64-bit?

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 13:54                     ` George Dunlap
@ 2013-06-13 14:50                       ` Stefano Stabellini
  -1 siblings, 0 replies; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-13 14:50 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 13 Jun 2013, George Dunlap wrote:
> On 13/06/13 14:44, Stefano Stabellini wrote:
> > On Wed, 12 Jun 2013, George Dunlap wrote:
> > > On 12/06/13 08:25, Jan Beulich wrote:
> > > > > > > On 11.06.13 at 19:26, Stefano Stabellini
> > > > > > > <stefano.stabellini@eu.citrix.com> wrote:
> > > > > I went through the code that maps the PCI MMIO regions in hvmloader
> > > > > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it
> > > > > already
> > > > > maps the PCI region to high memory if the PCI bar is 64-bit and the
> > > > > MMIO
> > > > > region is larger than 512MB.
> > > > > 
> > > > > Maybe we could just relax this condition and map the device memory to
> > > > > high memory no matter the size of the MMIO region if the PCI bar is
> > > > > 64-bit?
> > > > I can only recommend not to: For one, guests not using PAE or
> > > > PSE-36 can't map such space at all (and older OSes may not
> > > > properly deal with 64-bit BARs at all). And then one would generally
> > > > expect this allocation to be done top down (to minimize risk of
> > > > running into RAM), and doing so is going to present further risks of
> > > > incompatibilities with guest OSes (Linux for example learned only in
> > > > 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> > > > 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> > > > PFN to pfn_pte(), the respective parameter of which is
> > > > "unsigned long").
> > > > 
> > > > I think this ought to be done in an iterative process - if all MMIO
> > > > regions together don't fit below 4G, the biggest one should be
> > > > moved up beyond 4G first, followed by the next to biggest one
> > > > etc.
> > > First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> > > temporary work-around.  It should only be done if a device doesn't fit in
> > > the
> > > current MMIO range.
> > > 
> > > We have three options here:
> > > 1. Don't do anything
> > > 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
> > > don't
> > > fit
> > > 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
> > > is
> > > memory).
> > > 4. Add a mechanism to tell qemu that memory is being relocated.
> > > 
> > > Number 4 is definitely the right answer long-term, but we just don't have
> > > time
> > > to do that before the 4.3 release.  We're not sure yet if #3 is possible;
> > > even
> > > if it is, it may have unpredictable knock-on effects.
> > > 
> > > Doing #2, it is true that many guests will be unable to access the device
> > > because of 32-bit limitations.  However, in #1, *no* guests will be able
> > > to
> > > access the device.  At least in #2, *many* guests will be able to do so.
> > > In
> > > any case, apparently #2 is what KVM does, so having the limitation on
> > > guests
> > > is not without precedent.  It's also likely to be a somewhat tested
> > > configuration (unlike #3, for example).
> > I would avoid #3, because I don't think is a good idea to rely on that
> > behaviour.
> > I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> > easy and certainly not doable in time for 4.3.
> > 
> > So we are left to play with the PCI MMIO region size and location in
> > hvmloader.
> > 
> > I agree with Jan that we shouldn't relocate unconditionally all the
> > devices to the region above 4G. I meant to say that we should relocate
> > only the ones that don't fit. And we shouldn't try to dynamically
> > increase the PCI hole below 4G because clearly that doesn't work.
> > However we could still increase the size of the PCI hole below 4G by
> > default from start at 0xf0000000 to starting at 0xe0000000.
> > Why do we know that is safe? Because in the current configuration
> > hvmloader *already* increases the PCI hole size by decreasing the start
> > address every time a device doesn't fit.
> > So it's already common for hvmloader to set pci_mem_start to
> > 0xe0000000, you just need to assign a device with a PCI hole size big
> > enough.
> > 
> > 
> > My proposed solution is:
> > 
> > - set 0xe0000000 as the default PCI hole start for everybody, including
> > qemu-xen-traditional
> > - move above 4G everything that doesn't fit and support 64-bit bars
> > - print an error if the device doesn't fit and doesn't support 64-bit
> > bars
> 
> Also, as I understand it, at the moment:
> 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> 2. Some devices (without 64-bit BARs) can't be relocated
> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> 
> So if we have #1 or #2, at the moment an option for a work-around is to use
> qemu-traditional.
> 
> However, if we add your "print an error if the device doesn't fit", then this
> option will go away -- this will be a regression in functionality from 4.2.

Keep in mind that if we start the pci hole at 0xe0000000, the number of
cases for which any workarounds are needed is going to be dramatically
decreased to the point that I don't think we need a workaround anymore.

The algorithm is going to work like this in details:

- the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
- we calculate the total mmio size, if it's bigger than the pci hole we
raise a 64 bit relocation flag
- if the 64 bit relocation is enabled, we relocate above 4G the first
device that is 64-bit capable and has an MMIO size greater or equal to
512MB
- if the pci hole size is now big enough for the remaining devices we
stop the above 4G relocation, otherwise keep relocating devices that are
64 bit capable and have an MMIO size greater or equal to 512MB
- if one or more devices don't fit we print an error and continue (it's
not a critical failure, one device won't be used)

We could have a xenstore flag somewhere that enables the old behaviour
so that people can revert back to qemu-xen-traditional and make the pci
hole below 4G even bigger than 448MB, but I think that keeping the old
behaviour around is going to make the code more difficult to maintain.

Also it's difficult for people to realize that they need the workaround
because hvmloader logs aren't enabled by default and only go to the Xen
serial console. The value of this workaround pretty low in my view.
Finally it's worth noting that Windows XP is going EOL in less than an
year.


> I thought that what we had proposed was to have an option in xenstore, that
> libxl would set, which would instruct hvmloader whether to expand the MMIO
> hole and whether to relocate devices above 64-bit?

I think it's right to have this discussion in public on the mailing
list, rather than behind closed doors.
Also I don't agree on the need for a workaround, as explained above.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 14:50                       ` Stefano Stabellini
  0 siblings, 0 replies; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-13 14:50 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 13 Jun 2013, George Dunlap wrote:
> On 13/06/13 14:44, Stefano Stabellini wrote:
> > On Wed, 12 Jun 2013, George Dunlap wrote:
> > > On 12/06/13 08:25, Jan Beulich wrote:
> > > > > > > On 11.06.13 at 19:26, Stefano Stabellini
> > > > > > > <stefano.stabellini@eu.citrix.com> wrote:
> > > > > I went through the code that maps the PCI MMIO regions in hvmloader
> > > > > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it
> > > > > already
> > > > > maps the PCI region to high memory if the PCI bar is 64-bit and the
> > > > > MMIO
> > > > > region is larger than 512MB.
> > > > > 
> > > > > Maybe we could just relax this condition and map the device memory to
> > > > > high memory no matter the size of the MMIO region if the PCI bar is
> > > > > 64-bit?
> > > > I can only recommend not to: For one, guests not using PAE or
> > > > PSE-36 can't map such space at all (and older OSes may not
> > > > properly deal with 64-bit BARs at all). And then one would generally
> > > > expect this allocation to be done top down (to minimize risk of
> > > > running into RAM), and doing so is going to present further risks of
> > > > incompatibilities with guest OSes (Linux for example learned only in
> > > > 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> > > > 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> > > > PFN to pfn_pte(), the respective parameter of which is
> > > > "unsigned long").
> > > > 
> > > > I think this ought to be done in an iterative process - if all MMIO
> > > > regions together don't fit below 4G, the biggest one should be
> > > > moved up beyond 4G first, followed by the next to biggest one
> > > > etc.
> > > First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> > > temporary work-around.  It should only be done if a device doesn't fit in
> > > the
> > > current MMIO range.
> > > 
> > > We have three options here:
> > > 1. Don't do anything
> > > 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
> > > don't
> > > fit
> > > 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
> > > is
> > > memory).
> > > 4. Add a mechanism to tell qemu that memory is being relocated.
> > > 
> > > Number 4 is definitely the right answer long-term, but we just don't have
> > > time
> > > to do that before the 4.3 release.  We're not sure yet if #3 is possible;
> > > even
> > > if it is, it may have unpredictable knock-on effects.
> > > 
> > > Doing #2, it is true that many guests will be unable to access the device
> > > because of 32-bit limitations.  However, in #1, *no* guests will be able
> > > to
> > > access the device.  At least in #2, *many* guests will be able to do so.
> > > In
> > > any case, apparently #2 is what KVM does, so having the limitation on
> > > guests
> > > is not without precedent.  It's also likely to be a somewhat tested
> > > configuration (unlike #3, for example).
> > I would avoid #3, because I don't think is a good idea to rely on that
> > behaviour.
> > I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> > easy and certainly not doable in time for 4.3.
> > 
> > So we are left to play with the PCI MMIO region size and location in
> > hvmloader.
> > 
> > I agree with Jan that we shouldn't relocate unconditionally all the
> > devices to the region above 4G. I meant to say that we should relocate
> > only the ones that don't fit. And we shouldn't try to dynamically
> > increase the PCI hole below 4G because clearly that doesn't work.
> > However we could still increase the size of the PCI hole below 4G by
> > default from start at 0xf0000000 to starting at 0xe0000000.
> > Why do we know that is safe? Because in the current configuration
> > hvmloader *already* increases the PCI hole size by decreasing the start
> > address every time a device doesn't fit.
> > So it's already common for hvmloader to set pci_mem_start to
> > 0xe0000000, you just need to assign a device with a PCI hole size big
> > enough.
> > 
> > 
> > My proposed solution is:
> > 
> > - set 0xe0000000 as the default PCI hole start for everybody, including
> > qemu-xen-traditional
> > - move above 4G everything that doesn't fit and support 64-bit bars
> > - print an error if the device doesn't fit and doesn't support 64-bit
> > bars
> 
> Also, as I understand it, at the moment:
> 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> 2. Some devices (without 64-bit BARs) can't be relocated
> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> 
> So if we have #1 or #2, at the moment an option for a work-around is to use
> qemu-traditional.
> 
> However, if we add your "print an error if the device doesn't fit", then this
> option will go away -- this will be a regression in functionality from 4.2.

Keep in mind that if we start the pci hole at 0xe0000000, the number of
cases for which any workarounds are needed is going to be dramatically
decreased to the point that I don't think we need a workaround anymore.

The algorithm is going to work like this in details:

- the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
- we calculate the total mmio size, if it's bigger than the pci hole we
raise a 64 bit relocation flag
- if the 64 bit relocation is enabled, we relocate above 4G the first
device that is 64-bit capable and has an MMIO size greater or equal to
512MB
- if the pci hole size is now big enough for the remaining devices we
stop the above 4G relocation, otherwise keep relocating devices that are
64 bit capable and have an MMIO size greater or equal to 512MB
- if one or more devices don't fit we print an error and continue (it's
not a critical failure, one device won't be used)

We could have a xenstore flag somewhere that enables the old behaviour
so that people can revert back to qemu-xen-traditional and make the pci
hole below 4G even bigger than 448MB, but I think that keeping the old
behaviour around is going to make the code more difficult to maintain.

Also it's difficult for people to realize that they need the workaround
because hvmloader logs aren't enabled by default and only go to the Xen
serial console. The value of this workaround pretty low in my view.
Finally it's worth noting that Windows XP is going EOL in less than an
year.


> I thought that what we had proposed was to have an option in xenstore, that
> libxl would set, which would instruct hvmloader whether to expand the MMIO
> hole and whether to relocate devices above 64-bit?

I think it's right to have this discussion in public on the mailing
list, rather than behind closed doors.
Also I don't agree on the need for a workaround, as explained above.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 14:50                       ` Stefano Stabellini
@ 2013-06-13 15:06                         ` Jan Beulich
  -1 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-13 15:06 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Tim Deegan, Yongjie Ren, xen-devel, Keir Fraser, Ian Campbell,
	hanweidong, George Dunlap, Xudong Hao, yanqiangjun, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu

>>> On 13.06.13 at 16:50, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> The algorithm is going to work like this in details:
> 
> - the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
> - we calculate the total mmio size, if it's bigger than the pci hole we
> raise a 64 bit relocation flag
> - if the 64 bit relocation is enabled, we relocate above 4G the first
> device that is 64-bit capable and has an MMIO size greater or equal to
> 512MB
> - if the pci hole size is now big enough for the remaining devices we
> stop the above 4G relocation, otherwise keep relocating devices that are
> 64 bit capable and have an MMIO size greater or equal to 512MB
> - if one or more devices don't fit we print an error and continue (it's
> not a critical failure, one device won't be used)

Devices with 512Mb BARs won't fit in a 448Mb hole in any case,
so there's no point in trying. Any such BARs need to be relocated.
Then for 256Mb BARs, you could see whether it's just one and fits.
Else relocate it, and all others (except for perhaps one). Then
halve the size again and start over, etc.

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 15:06                         ` Jan Beulich
  0 siblings, 0 replies; 82+ messages in thread
From: Jan Beulich @ 2013-06-13 15:06 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Tim Deegan, Yongjie Ren, xen-devel, Keir Fraser, Ian Campbell,
	hanweidong, George Dunlap, Xudong Hao, yanqiangjun, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu

>>> On 13.06.13 at 16:50, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> The algorithm is going to work like this in details:
> 
> - the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
> - we calculate the total mmio size, if it's bigger than the pci hole we
> raise a 64 bit relocation flag
> - if the 64 bit relocation is enabled, we relocate above 4G the first
> device that is 64-bit capable and has an MMIO size greater or equal to
> 512MB
> - if the pci hole size is now big enough for the remaining devices we
> stop the above 4G relocation, otherwise keep relocating devices that are
> 64 bit capable and have an MMIO size greater or equal to 512MB
> - if one or more devices don't fit we print an error and continue (it's
> not a critical failure, one device won't be used)

Devices with 512Mb BARs won't fit in a 448Mb hole in any case,
so there's no point in trying. Any such BARs need to be relocated.
Then for 256Mb BARs, you could see whether it's just one and fits.
Else relocate it, and all others (except for perhaps one). Then
halve the size again and start over, etc.

Jan

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 14:50                       ` Stefano Stabellini
@ 2013-06-13 15:29                         ` George Dunlap
  -1 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-13 15:29 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Tim Deegan, Yongjie Ren, xen-devel, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, yanqiangjun, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu

On 13/06/13 15:50, Stefano Stabellini wrote:
> Keep in mind that if we start the pci hole at 0xe0000000, the number of
> cases for which any workarounds are needed is going to be dramatically
> decreased to the point that I don't think we need a workaround anymore.

You don't think anyone is going to want to pass through a card with 
1GiB+ of RAM?

>
> The algorithm is going to work like this in details:
>
> - the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
> - we calculate the total mmio size, if it's bigger than the pci hole we
> raise a 64 bit relocation flag
> - if the 64 bit relocation is enabled, we relocate above 4G the first
> device that is 64-bit capable and has an MMIO size greater or equal to
> 512MB
> - if the pci hole size is now big enough for the remaining devices we
> stop the above 4G relocation, otherwise keep relocating devices that are
> 64 bit capable and have an MMIO size greater or equal to 512MB
> - if one or more devices don't fit we print an error and continue (it's
> not a critical failure, one device won't be used)
>
> We could have a xenstore flag somewhere that enables the old behaviour
> so that people can revert back to qemu-xen-traditional and make the pci
> hole below 4G even bigger than 448MB, but I think that keeping the old
> behaviour around is going to make the code more difficult to maintain.

We'll only need to do that for one release, until we have a chance to 
fix it properly.

>
> Also it's difficult for people to realize that they need the workaround
> because hvmloader logs aren't enabled by default and only go to the Xen
> serial console.

Well if key people know about it (Pasi, David Techer, &c), and we put it 
on the wikis related to VGA pass-through, I think information will get 
around.

> The value of this workaround pretty low in my view.
> Finally it's worth noting that Windows XP is going EOL in less than an
> year.

That's 1 year that a configuration with a currently-supported OS won't 
work for Xen 4.3 that worked for 4.2.  Apart from that, one of the 
reasons for doing virtualization in the first place is to be able to run 
older, unsupported OSes on current hardware; so "XP isn't important" 
doesn't really cut it for me. :-)

>
>
>> I thought that what we had proposed was to have an option in xenstore, that
>> libxl would set, which would instruct hvmloader whether to expand the MMIO
>> hole and whether to relocate devices above 64-bit?
> I think it's right to have this discussion in public on the mailing
> list, rather than behind closed doors.
> Also I don't agree on the need for a workaround, as explained above.

I see -- you thought it was a bad idea and so were letting someone else 
bring it up -- or maybe hoping no one would remember to bring it up. :-)

(Obviously the decision needs to be made in public, but sometimes having 
technical solutions hashed out in a face-to-face meeting is more efficient.)

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 15:29                         ` George Dunlap
  0 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-13 15:29 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Tim Deegan, Yongjie Ren, xen-devel, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, yanqiangjun, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu

On 13/06/13 15:50, Stefano Stabellini wrote:
> Keep in mind that if we start the pci hole at 0xe0000000, the number of
> cases for which any workarounds are needed is going to be dramatically
> decreased to the point that I don't think we need a workaround anymore.

You don't think anyone is going to want to pass through a card with 
1GiB+ of RAM?

>
> The algorithm is going to work like this in details:
>
> - the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
> - we calculate the total mmio size, if it's bigger than the pci hole we
> raise a 64 bit relocation flag
> - if the 64 bit relocation is enabled, we relocate above 4G the first
> device that is 64-bit capable and has an MMIO size greater or equal to
> 512MB
> - if the pci hole size is now big enough for the remaining devices we
> stop the above 4G relocation, otherwise keep relocating devices that are
> 64 bit capable and have an MMIO size greater or equal to 512MB
> - if one or more devices don't fit we print an error and continue (it's
> not a critical failure, one device won't be used)
>
> We could have a xenstore flag somewhere that enables the old behaviour
> so that people can revert back to qemu-xen-traditional and make the pci
> hole below 4G even bigger than 448MB, but I think that keeping the old
> behaviour around is going to make the code more difficult to maintain.

We'll only need to do that for one release, until we have a chance to 
fix it properly.

>
> Also it's difficult for people to realize that they need the workaround
> because hvmloader logs aren't enabled by default and only go to the Xen
> serial console.

Well if key people know about it (Pasi, David Techer, &c), and we put it 
on the wikis related to VGA pass-through, I think information will get 
around.

> The value of this workaround pretty low in my view.
> Finally it's worth noting that Windows XP is going EOL in less than an
> year.

That's 1 year that a configuration with a currently-supported OS won't 
work for Xen 4.3 that worked for 4.2.  Apart from that, one of the 
reasons for doing virtualization in the first place is to be able to run 
older, unsupported OSes on current hardware; so "XP isn't important" 
doesn't really cut it for me. :-)

>
>
>> I thought that what we had proposed was to have an option in xenstore, that
>> libxl would set, which would instruct hvmloader whether to expand the MMIO
>> hole and whether to relocate devices above 64-bit?
> I think it's right to have this discussion in public on the mailing
> list, rather than behind closed doors.
> Also I don't agree on the need for a workaround, as explained above.

I see -- you thought it was a bad idea and so were letting someone else 
bring it up -- or maybe hoping no one would remember to bring it up. :-)

(Obviously the decision needs to be made in public, but sometimes having 
technical solutions hashed out in a face-to-face meeting is more efficient.)

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 15:29                         ` George Dunlap
@ 2013-06-13 16:13                           ` Stefano Stabellini
  -1 siblings, 0 replies; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-13 16:13 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 13 Jun 2013, George Dunlap wrote:
> On 13/06/13 15:50, Stefano Stabellini wrote:
> > Keep in mind that if we start the pci hole at 0xe0000000, the number of
> > cases for which any workarounds are needed is going to be dramatically
> > decreased to the point that I don't think we need a workaround anymore.
> 
> You don't think anyone is going to want to pass through a card with 1GiB+ of
> RAM?

Yes, but as Paolo pointed out, those devices are going to be 64-bit
capable so they'll relocate above 4G just fine.


> > The algorithm is going to work like this in details:
> > 
> > - the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
> > - we calculate the total mmio size, if it's bigger than the pci hole we
> > raise a 64 bit relocation flag
> > - if the 64 bit relocation is enabled, we relocate above 4G the first
> > device that is 64-bit capable and has an MMIO size greater or equal to
> > 512MB
> > - if the pci hole size is now big enough for the remaining devices we
> > stop the above 4G relocation, otherwise keep relocating devices that are
> > 64 bit capable and have an MMIO size greater or equal to 512MB
> > - if one or more devices don't fit we print an error and continue (it's
> > not a critical failure, one device won't be used)
> > 
> > We could have a xenstore flag somewhere that enables the old behaviour
> > so that people can revert back to qemu-xen-traditional and make the pci
> > hole below 4G even bigger than 448MB, but I think that keeping the old
> > behaviour around is going to make the code more difficult to maintain.
> 
> We'll only need to do that for one release, until we have a chance to fix it
> properly.

There is nothing more lasting than a "temporary" workaround :-)
Also it's not very clear what the proper solution would be like in this
case.
However keeping the old behaviour is certainly possible. It would just
be a bit harder to also keep the old (smaller) default pci hole around.



> > Also it's difficult for people to realize that they need the workaround
> > because hvmloader logs aren't enabled by default and only go to the Xen
> > serial console.
> 
> Well if key people know about it (Pasi, David Techer, &c), and we put it on
> the wikis related to VGA pass-through, I think information will get around.

It's not that I don't value documentation, but given that the average
user won't see any logs and the error is completely non-informative,
many people are going to be lost in a wild goose chase on google.


> > The value of this workaround pretty low in my view.
> > Finally it's worth noting that Windows XP is going EOL in less than an
> > year.
> 
> That's 1 year that a configuration with a currently-supported OS won't work
> for Xen 4.3 that worked for 4.2.  Apart from that, one of the reasons for
> doing virtualization in the first place is to be able to run older,
> unsupported OSes on current hardware; so "XP isn't important" doesn't really
> cut it for me. :-)

fair enough


> > > I thought that what we had proposed was to have an option in xenstore,
> > > that
> > > libxl would set, which would instruct hvmloader whether to expand the MMIO
> > > hole and whether to relocate devices above 64-bit?
> > I think it's right to have this discussion in public on the mailing
> > list, rather than behind closed doors.
> > Also I don't agree on the need for a workaround, as explained above.
> 
> I see -- you thought it was a bad idea and so were letting someone else bring
> it up -- or maybe hoping no one would remember to bring it up. :-)
 
Nothing that Macchiavellian: I didn't consider all the implications at
the time and I thought I managed to come up with a better plan.


> (Obviously the decision needs to be made in public, but sometimes having
> technical solutions hashed out in a face-to-face meeting is more efficient.)

But it's also easier to overlook something, at least it is easier for
me.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 16:13                           ` Stefano Stabellini
  0 siblings, 0 replies; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-13 16:13 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, Ian Campbell,
	hanweidong, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 13 Jun 2013, George Dunlap wrote:
> On 13/06/13 15:50, Stefano Stabellini wrote:
> > Keep in mind that if we start the pci hole at 0xe0000000, the number of
> > cases for which any workarounds are needed is going to be dramatically
> > decreased to the point that I don't think we need a workaround anymore.
> 
> You don't think anyone is going to want to pass through a card with 1GiB+ of
> RAM?

Yes, but as Paolo pointed out, those devices are going to be 64-bit
capable so they'll relocate above 4G just fine.


> > The algorithm is going to work like this in details:
> > 
> > - the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
> > - we calculate the total mmio size, if it's bigger than the pci hole we
> > raise a 64 bit relocation flag
> > - if the 64 bit relocation is enabled, we relocate above 4G the first
> > device that is 64-bit capable and has an MMIO size greater or equal to
> > 512MB
> > - if the pci hole size is now big enough for the remaining devices we
> > stop the above 4G relocation, otherwise keep relocating devices that are
> > 64 bit capable and have an MMIO size greater or equal to 512MB
> > - if one or more devices don't fit we print an error and continue (it's
> > not a critical failure, one device won't be used)
> > 
> > We could have a xenstore flag somewhere that enables the old behaviour
> > so that people can revert back to qemu-xen-traditional and make the pci
> > hole below 4G even bigger than 448MB, but I think that keeping the old
> > behaviour around is going to make the code more difficult to maintain.
> 
> We'll only need to do that for one release, until we have a chance to fix it
> properly.

There is nothing more lasting than a "temporary" workaround :-)
Also it's not very clear what the proper solution would be like in this
case.
However keeping the old behaviour is certainly possible. It would just
be a bit harder to also keep the old (smaller) default pci hole around.



> > Also it's difficult for people to realize that they need the workaround
> > because hvmloader logs aren't enabled by default and only go to the Xen
> > serial console.
> 
> Well if key people know about it (Pasi, David Techer, &c), and we put it on
> the wikis related to VGA pass-through, I think information will get around.

It's not that I don't value documentation, but given that the average
user won't see any logs and the error is completely non-informative,
many people are going to be lost in a wild goose chase on google.


> > The value of this workaround pretty low in my view.
> > Finally it's worth noting that Windows XP is going EOL in less than an
> > year.
> 
> That's 1 year that a configuration with a currently-supported OS won't work
> for Xen 4.3 that worked for 4.2.  Apart from that, one of the reasons for
> doing virtualization in the first place is to be able to run older,
> unsupported OSes on current hardware; so "XP isn't important" doesn't really
> cut it for me. :-)

fair enough


> > > I thought that what we had proposed was to have an option in xenstore,
> > > that
> > > libxl would set, which would instruct hvmloader whether to expand the MMIO
> > > hole and whether to relocate devices above 64-bit?
> > I think it's right to have this discussion in public on the mailing
> > list, rather than behind closed doors.
> > Also I don't agree on the need for a workaround, as explained above.
> 
> I see -- you thought it was a bad idea and so were letting someone else bring
> it up -- or maybe hoping no one would remember to bring it up. :-)
 
Nothing that Macchiavellian: I didn't consider all the implications at
the time and I thought I managed to come up with a better plan.


> (Obviously the decision needs to be made in public, but sometimes having
> technical solutions hashed out in a face-to-face meeting is more efficient.)

But it's also easier to overlook something, at least it is easier for
me.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 14:50                       ` Stefano Stabellini
@ 2013-06-13 15:34                         ` Ian Campbell
  -1 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-13 15:34 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Tim Deegan, Yongjie Ren, xen-devel, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, yanqiangjun, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu

On Thu, 2013-06-13 at 15:50 +0100, Stefano Stabellini wrote:
> On Thu, 13 Jun 2013, George Dunlap wrote:
> > On 13/06/13 14:44, Stefano Stabellini wrote:
> > > On Wed, 12 Jun 2013, George Dunlap wrote:
> > > > On 12/06/13 08:25, Jan Beulich wrote:
> > > > > > > > On 11.06.13 at 19:26, Stefano Stabellini
> > > > > > > > <stefano.stabellini@eu.citrix.com> wrote:
> > > > > > I went through the code that maps the PCI MMIO regions in hvmloader
> > > > > > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it
> > > > > > already
> > > > > > maps the PCI region to high memory if the PCI bar is 64-bit and the
> > > > > > MMIO
> > > > > > region is larger than 512MB.
> > > > > > 
> > > > > > Maybe we could just relax this condition and map the device memory to
> > > > > > high memory no matter the size of the MMIO region if the PCI bar is
> > > > > > 64-bit?
> > > > > I can only recommend not to: For one, guests not using PAE or
> > > > > PSE-36 can't map such space at all (and older OSes may not
> > > > > properly deal with 64-bit BARs at all). And then one would generally
> > > > > expect this allocation to be done top down (to minimize risk of
> > > > > running into RAM), and doing so is going to present further risks of
> > > > > incompatibilities with guest OSes (Linux for example learned only in
> > > > > 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> > > > > 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> > > > > PFN to pfn_pte(), the respective parameter of which is
> > > > > "unsigned long").
> > > > > 
> > > > > I think this ought to be done in an iterative process - if all MMIO
> > > > > regions together don't fit below 4G, the biggest one should be
> > > > > moved up beyond 4G first, followed by the next to biggest one
> > > > > etc.
> > > > First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> > > > temporary work-around.  It should only be done if a device doesn't fit in
> > > > the
> > > > current MMIO range.
> > > > 
> > > > We have three options here:
> > > > 1. Don't do anything
> > > > 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
> > > > don't
> > > > fit
> > > > 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
> > > > is
> > > > memory).
> > > > 4. Add a mechanism to tell qemu that memory is being relocated.
> > > > 
> > > > Number 4 is definitely the right answer long-term, but we just don't have
> > > > time
> > > > to do that before the 4.3 release.  We're not sure yet if #3 is possible;
> > > > even
> > > > if it is, it may have unpredictable knock-on effects.
> > > > 
> > > > Doing #2, it is true that many guests will be unable to access the device
> > > > because of 32-bit limitations.  However, in #1, *no* guests will be able
> > > > to
> > > > access the device.  At least in #2, *many* guests will be able to do so.
> > > > In
> > > > any case, apparently #2 is what KVM does, so having the limitation on
> > > > guests
> > > > is not without precedent.  It's also likely to be a somewhat tested
> > > > configuration (unlike #3, for example).
> > > I would avoid #3, because I don't think is a good idea to rely on that
> > > behaviour.
> > > I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> > > easy and certainly not doable in time for 4.3.
> > > 
> > > So we are left to play with the PCI MMIO region size and location in
> > > hvmloader.
> > > 
> > > I agree with Jan that we shouldn't relocate unconditionally all the
> > > devices to the region above 4G. I meant to say that we should relocate
> > > only the ones that don't fit. And we shouldn't try to dynamically
> > > increase the PCI hole below 4G because clearly that doesn't work.
> > > However we could still increase the size of the PCI hole below 4G by
> > > default from start at 0xf0000000 to starting at 0xe0000000.
> > > Why do we know that is safe? Because in the current configuration
> > > hvmloader *already* increases the PCI hole size by decreasing the start
> > > address every time a device doesn't fit.
> > > So it's already common for hvmloader to set pci_mem_start to
> > > 0xe0000000, you just need to assign a device with a PCI hole size big
> > > enough.
> > > 
> > > 
> > > My proposed solution is:
> > > 
> > > - set 0xe0000000 as the default PCI hole start for everybody, including
> > > qemu-xen-traditional
> > > - move above 4G everything that doesn't fit and support 64-bit bars
> > > - print an error if the device doesn't fit and doesn't support 64-bit
> > > bars
> > 
> > Also, as I understand it, at the moment:
> > 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> > 2. Some devices (without 64-bit BARs) can't be relocated
> > 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> > 
> > So if we have #1 or #2, at the moment an option for a work-around is to use
> > qemu-traditional.
> > 
> > However, if we add your "print an error if the device doesn't fit", then this
> > option will go away -- this will be a regression in functionality from 4.2.
> 
> Keep in mind that if we start the pci hole at 0xe0000000, the number of
> cases for which any workarounds are needed is going to be dramatically
> decreased to the point that I don't think we need a workaround anymore.

Starting at 0xe0000000 leaves, as you say a 448MB whole, with graphics
cards regularly having 512M+ of RAM on them that suggests the workaround
will be required in many cases.

> The algorithm is going to work like this in details:
> 
> - the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
> - we calculate the total mmio size, if it's bigger than the pci hole we
> raise a 64 bit relocation flag
> - if the 64 bit relocation is enabled, we relocate above 4G the first
> device that is 64-bit capable and has an MMIO size greater or equal to
> 512MB

Don't you mean the device with the largest MMIO size? Otherwise 2 256MB
devices would still break things.

Paulo's comment about large alignments first is also worth considering.

> - if the pci hole size is now big enough for the remaining devices we
> stop the above 4G relocation, otherwise keep relocating devices that are
> 64 bit capable and have an MMIO size greater or equal to 512MB
> - if one or more devices don't fit we print an error and continue (it's
> not a critical failure, one device won't be used)

This can result in a different device being broken to the one which
would previously have been broken, including on qemu-trad I think?

> We could have a xenstore flag somewhere that enables the old behaviour
> so that people can revert back to qemu-xen-traditional and make the pci
> hole below 4G even bigger than 448MB, but I think that keeping the old
> behaviour around is going to make the code more difficult to maintain.

The downside of that is that things which worked with the old scheme may
not work with the new one though. Early in a release cycle when we have
time to discover what has broken then that might be OK, but is post rc4
really the time to be risking it?

> Also it's difficult for people to realize that they need the workaround
> because hvmloader logs aren't enabled by default and only go to the Xen
> serial console. The value of this workaround pretty low in my view.
> Finally it's worth noting that Windows XP is going EOL in less than an
> year.

That's been true for something like 5 years...

Also, apart from XP, doesn't Windows still pick a HAL at install time,
so even a modern guest installed under the old scheme may not get a PAE
capable HAL. If you increase the amount of RAM I think Windows will
"upgrade" the HAL, but is changing the MMIO layout enough to trigger
this? Or maybe modern Windows all use PAE (or even 64 bit) anyway?

There are also performance implications of enabling PAE over 2 level
paging. Not sure how significant they are with HAP though. Made a big
difference with shadow IIRC.

Maybe I'm worrying about nothing but while all of these unknowns might
be OK towards the start of a release cycle rc4 seems awfully late in the
day to be risking it.

Ian.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 15:34                         ` Ian Campbell
  0 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-13 15:34 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Tim Deegan, Yongjie Ren, xen-devel, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, yanqiangjun, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu

On Thu, 2013-06-13 at 15:50 +0100, Stefano Stabellini wrote:
> On Thu, 13 Jun 2013, George Dunlap wrote:
> > On 13/06/13 14:44, Stefano Stabellini wrote:
> > > On Wed, 12 Jun 2013, George Dunlap wrote:
> > > > On 12/06/13 08:25, Jan Beulich wrote:
> > > > > > > > On 11.06.13 at 19:26, Stefano Stabellini
> > > > > > > > <stefano.stabellini@eu.citrix.com> wrote:
> > > > > > I went through the code that maps the PCI MMIO regions in hvmloader
> > > > > > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it
> > > > > > already
> > > > > > maps the PCI region to high memory if the PCI bar is 64-bit and the
> > > > > > MMIO
> > > > > > region is larger than 512MB.
> > > > > > 
> > > > > > Maybe we could just relax this condition and map the device memory to
> > > > > > high memory no matter the size of the MMIO region if the PCI bar is
> > > > > > 64-bit?
> > > > > I can only recommend not to: For one, guests not using PAE or
> > > > > PSE-36 can't map such space at all (and older OSes may not
> > > > > properly deal with 64-bit BARs at all). And then one would generally
> > > > > expect this allocation to be done top down (to minimize risk of
> > > > > running into RAM), and doing so is going to present further risks of
> > > > > incompatibilities with guest OSes (Linux for example learned only in
> > > > > 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> > > > > 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> > > > > PFN to pfn_pte(), the respective parameter of which is
> > > > > "unsigned long").
> > > > > 
> > > > > I think this ought to be done in an iterative process - if all MMIO
> > > > > regions together don't fit below 4G, the biggest one should be
> > > > > moved up beyond 4G first, followed by the next to biggest one
> > > > > etc.
> > > > First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> > > > temporary work-around.  It should only be done if a device doesn't fit in
> > > > the
> > > > current MMIO range.
> > > > 
> > > > We have three options here:
> > > > 1. Don't do anything
> > > > 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
> > > > don't
> > > > fit
> > > > 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
> > > > is
> > > > memory).
> > > > 4. Add a mechanism to tell qemu that memory is being relocated.
> > > > 
> > > > Number 4 is definitely the right answer long-term, but we just don't have
> > > > time
> > > > to do that before the 4.3 release.  We're not sure yet if #3 is possible;
> > > > even
> > > > if it is, it may have unpredictable knock-on effects.
> > > > 
> > > > Doing #2, it is true that many guests will be unable to access the device
> > > > because of 32-bit limitations.  However, in #1, *no* guests will be able
> > > > to
> > > > access the device.  At least in #2, *many* guests will be able to do so.
> > > > In
> > > > any case, apparently #2 is what KVM does, so having the limitation on
> > > > guests
> > > > is not without precedent.  It's also likely to be a somewhat tested
> > > > configuration (unlike #3, for example).
> > > I would avoid #3, because I don't think is a good idea to rely on that
> > > behaviour.
> > > I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> > > easy and certainly not doable in time for 4.3.
> > > 
> > > So we are left to play with the PCI MMIO region size and location in
> > > hvmloader.
> > > 
> > > I agree with Jan that we shouldn't relocate unconditionally all the
> > > devices to the region above 4G. I meant to say that we should relocate
> > > only the ones that don't fit. And we shouldn't try to dynamically
> > > increase the PCI hole below 4G because clearly that doesn't work.
> > > However we could still increase the size of the PCI hole below 4G by
> > > default from start at 0xf0000000 to starting at 0xe0000000.
> > > Why do we know that is safe? Because in the current configuration
> > > hvmloader *already* increases the PCI hole size by decreasing the start
> > > address every time a device doesn't fit.
> > > So it's already common for hvmloader to set pci_mem_start to
> > > 0xe0000000, you just need to assign a device with a PCI hole size big
> > > enough.
> > > 
> > > 
> > > My proposed solution is:
> > > 
> > > - set 0xe0000000 as the default PCI hole start for everybody, including
> > > qemu-xen-traditional
> > > - move above 4G everything that doesn't fit and support 64-bit bars
> > > - print an error if the device doesn't fit and doesn't support 64-bit
> > > bars
> > 
> > Also, as I understand it, at the moment:
> > 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> > 2. Some devices (without 64-bit BARs) can't be relocated
> > 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> > 
> > So if we have #1 or #2, at the moment an option for a work-around is to use
> > qemu-traditional.
> > 
> > However, if we add your "print an error if the device doesn't fit", then this
> > option will go away -- this will be a regression in functionality from 4.2.
> 
> Keep in mind that if we start the pci hole at 0xe0000000, the number of
> cases for which any workarounds are needed is going to be dramatically
> decreased to the point that I don't think we need a workaround anymore.

Starting at 0xe0000000 leaves, as you say a 448MB whole, with graphics
cards regularly having 512M+ of RAM on them that suggests the workaround
will be required in many cases.

> The algorithm is going to work like this in details:
> 
> - the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
> - we calculate the total mmio size, if it's bigger than the pci hole we
> raise a 64 bit relocation flag
> - if the 64 bit relocation is enabled, we relocate above 4G the first
> device that is 64-bit capable and has an MMIO size greater or equal to
> 512MB

Don't you mean the device with the largest MMIO size? Otherwise 2 256MB
devices would still break things.

Paulo's comment about large alignments first is also worth considering.

> - if the pci hole size is now big enough for the remaining devices we
> stop the above 4G relocation, otherwise keep relocating devices that are
> 64 bit capable and have an MMIO size greater or equal to 512MB
> - if one or more devices don't fit we print an error and continue (it's
> not a critical failure, one device won't be used)

This can result in a different device being broken to the one which
would previously have been broken, including on qemu-trad I think?

> We could have a xenstore flag somewhere that enables the old behaviour
> so that people can revert back to qemu-xen-traditional and make the pci
> hole below 4G even bigger than 448MB, but I think that keeping the old
> behaviour around is going to make the code more difficult to maintain.

The downside of that is that things which worked with the old scheme may
not work with the new one though. Early in a release cycle when we have
time to discover what has broken then that might be OK, but is post rc4
really the time to be risking it?

> Also it's difficult for people to realize that they need the workaround
> because hvmloader logs aren't enabled by default and only go to the Xen
> serial console. The value of this workaround pretty low in my view.
> Finally it's worth noting that Windows XP is going EOL in less than an
> year.

That's been true for something like 5 years...

Also, apart from XP, doesn't Windows still pick a HAL at install time,
so even a modern guest installed under the old scheme may not get a PAE
capable HAL. If you increase the amount of RAM I think Windows will
"upgrade" the HAL, but is changing the MMIO layout enough to trigger
this? Or maybe modern Windows all use PAE (or even 64 bit) anyway?

There are also performance implications of enabling PAE over 2 level
paging. Not sure how significant they are with HAP though. Made a big
difference with shadow IIRC.

Maybe I'm worrying about nothing but while all of these unknowns might
be OK towards the start of a release cycle rc4 seems awfully late in the
day to be risking it.

Ian.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 15:34                         ` Ian Campbell
@ 2013-06-13 16:55                           ` Stefano Stabellini
  -1 siblings, 0 replies; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-13 16:55 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 13 Jun 2013, Ian Campbell wrote:
> On Thu, 2013-06-13 at 15:50 +0100, Stefano Stabellini wrote:
> > On Thu, 13 Jun 2013, George Dunlap wrote:
> > > On 13/06/13 14:44, Stefano Stabellini wrote:
> > > > On Wed, 12 Jun 2013, George Dunlap wrote:
> > > > > On 12/06/13 08:25, Jan Beulich wrote:
> > > > > > > > > On 11.06.13 at 19:26, Stefano Stabellini
> > > > > > > > > <stefano.stabellini@eu.citrix.com> wrote:
> > > > > > > I went through the code that maps the PCI MMIO regions in hvmloader
> > > > > > > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it
> > > > > > > already
> > > > > > > maps the PCI region to high memory if the PCI bar is 64-bit and the
> > > > > > > MMIO
> > > > > > > region is larger than 512MB.
> > > > > > > 
> > > > > > > Maybe we could just relax this condition and map the device memory to
> > > > > > > high memory no matter the size of the MMIO region if the PCI bar is
> > > > > > > 64-bit?
> > > > > > I can only recommend not to: For one, guests not using PAE or
> > > > > > PSE-36 can't map such space at all (and older OSes may not
> > > > > > properly deal with 64-bit BARs at all). And then one would generally
> > > > > > expect this allocation to be done top down (to minimize risk of
> > > > > > running into RAM), and doing so is going to present further risks of
> > > > > > incompatibilities with guest OSes (Linux for example learned only in
> > > > > > 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> > > > > > 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> > > > > > PFN to pfn_pte(), the respective parameter of which is
> > > > > > "unsigned long").
> > > > > > 
> > > > > > I think this ought to be done in an iterative process - if all MMIO
> > > > > > regions together don't fit below 4G, the biggest one should be
> > > > > > moved up beyond 4G first, followed by the next to biggest one
> > > > > > etc.
> > > > > First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> > > > > temporary work-around.  It should only be done if a device doesn't fit in
> > > > > the
> > > > > current MMIO range.
> > > > > 
> > > > > We have three options here:
> > > > > 1. Don't do anything
> > > > > 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
> > > > > don't
> > > > > fit
> > > > > 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
> > > > > is
> > > > > memory).
> > > > > 4. Add a mechanism to tell qemu that memory is being relocated.
> > > > > 
> > > > > Number 4 is definitely the right answer long-term, but we just don't have
> > > > > time
> > > > > to do that before the 4.3 release.  We're not sure yet if #3 is possible;
> > > > > even
> > > > > if it is, it may have unpredictable knock-on effects.
> > > > > 
> > > > > Doing #2, it is true that many guests will be unable to access the device
> > > > > because of 32-bit limitations.  However, in #1, *no* guests will be able
> > > > > to
> > > > > access the device.  At least in #2, *many* guests will be able to do so.
> > > > > In
> > > > > any case, apparently #2 is what KVM does, so having the limitation on
> > > > > guests
> > > > > is not without precedent.  It's also likely to be a somewhat tested
> > > > > configuration (unlike #3, for example).
> > > > I would avoid #3, because I don't think is a good idea to rely on that
> > > > behaviour.
> > > > I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> > > > easy and certainly not doable in time for 4.3.
> > > > 
> > > > So we are left to play with the PCI MMIO region size and location in
> > > > hvmloader.
> > > > 
> > > > I agree with Jan that we shouldn't relocate unconditionally all the
> > > > devices to the region above 4G. I meant to say that we should relocate
> > > > only the ones that don't fit. And we shouldn't try to dynamically
> > > > increase the PCI hole below 4G because clearly that doesn't work.
> > > > However we could still increase the size of the PCI hole below 4G by
> > > > default from start at 0xf0000000 to starting at 0xe0000000.
> > > > Why do we know that is safe? Because in the current configuration
> > > > hvmloader *already* increases the PCI hole size by decreasing the start
> > > > address every time a device doesn't fit.
> > > > So it's already common for hvmloader to set pci_mem_start to
> > > > 0xe0000000, you just need to assign a device with a PCI hole size big
> > > > enough.
> > > > 
> > > > 
> > > > My proposed solution is:
> > > > 
> > > > - set 0xe0000000 as the default PCI hole start for everybody, including
> > > > qemu-xen-traditional
> > > > - move above 4G everything that doesn't fit and support 64-bit bars
> > > > - print an error if the device doesn't fit and doesn't support 64-bit
> > > > bars
> > > 
> > > Also, as I understand it, at the moment:
> > > 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> > > 2. Some devices (without 64-bit BARs) can't be relocated
> > > 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> > > 
> > > So if we have #1 or #2, at the moment an option for a work-around is to use
> > > qemu-traditional.
> > > 
> > > However, if we add your "print an error if the device doesn't fit", then this
> > > option will go away -- this will be a regression in functionality from 4.2.
> > 
> > Keep in mind that if we start the pci hole at 0xe0000000, the number of
> > cases for which any workarounds are needed is going to be dramatically
> > decreased to the point that I don't think we need a workaround anymore.
> 
> Starting at 0xe0000000 leaves, as you say a 448MB whole, with graphics
> cards regularly having 512M+ of RAM on them that suggests the workaround
> will be required in many cases.

http://www.nvidia.co.uk/object/graphics_cards_buy_now_uk.html

Actually more than half of the graphic cards sold today have >= 2G of
videoram so they wouldn't fit below 4G even in the old scheme that gives
at most 2G of PCI hole below 4G.

So the resulting configurations would be the same: the devices would be
located above 4G.


> > The algorithm is going to work like this in details:
> > 
> > - the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
> > - we calculate the total mmio size, if it's bigger than the pci hole we
> > raise a 64 bit relocation flag
> > - if the 64 bit relocation is enabled, we relocate above 4G the first
> > device that is 64-bit capable and has an MMIO size greater or equal to
> > 512MB
> 
> Don't you mean the device with the largest MMIO size? Otherwise 2 256MB
> devices would still break things.

You are right that would be much better.
It's worth mentioning that the problem you have just identified exists
even in the current scheme. In fact you could reach a non-configurable
state by passing through 4 graphic cards with 512MB of video ram each.


> Paulo's comment about large alignments first is also worth considering.
>
> > - if the pci hole size is now big enough for the remaining devices we
> > stop the above 4G relocation, otherwise keep relocating devices that are
> > 64 bit capable and have an MMIO size greater or equal to 512MB
> > - if one or more devices don't fit we print an error and continue (it's
> > not a critical failure, one device won't be used)
> 
> This can result in a different device being broken to the one which
> would previously have been broken, including on qemu-trad I think?

Previously the guest would fail to boot with qemu-xen. On
qemu-xen-traditional the configuration would be completely different:
fewer devices would be relocated above 4G and the pci hole would be
bigger. So you are right the devices being broken would be different.


> > We could have a xenstore flag somewhere that enables the old behaviour
> > so that people can revert back to qemu-xen-traditional and make the pci
> > hole below 4G even bigger than 448MB, but I think that keeping the old
> > behaviour around is going to make the code more difficult to maintain.
> 
> The downside of that is that things which worked with the old scheme may
> not work with the new one though. Early in a release cycle when we have
> time to discover what has broken then that might be OK, but is post rc4
> really the time to be risking it?

Yes, you are right: there are some scenarios that would have worked
before that wouldn't work anymore with the new scheme.
Are they important enough to have a workaround, pretty difficult to
identify for a user?


> > Also it's difficult for people to realize that they need the workaround
> > because hvmloader logs aren't enabled by default and only go to the Xen
> > serial console. The value of this workaround pretty low in my view.
> > Finally it's worth noting that Windows XP is going EOL in less than an
> > year.
> 
> That's been true for something like 5 years...
> 
> Also, apart from XP, doesn't Windows still pick a HAL at install time,
> so even a modern guest installed under the old scheme may not get a PAE
> capable HAL. If you increase the amount of RAM I think Windows will
> "upgrade" the HAL, but is changing the MMIO layout enough to trigger
> this? Or maybe modern Windows all use PAE (or even 64 bit) anyway?
> 
> There are also performance implications of enabling PAE over 2 level
> paging. Not sure how significant they are with HAP though. Made a big
> difference with shadow IIRC.
> 
> Maybe I'm worrying about nothing but while all of these unknowns might
> be OK towards the start of a release cycle rc4 seems awfully late in the
> day to be risking it.

Keep in mind that all these configurations are perfectly valid even with
the code that we have out there today. We aren't doing anything new,
just modifying the default.
One just needs to assign a PCI device with more than 190MB to trigger it.
I am trusting the fact that given that we had this behaviour for many
years now, and it's pretty common to assign a device only some of the
times you are booting your guest, any problems would have already come
up.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 16:55                           ` Stefano Stabellini
  0 siblings, 0 replies; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-13 16:55 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 13 Jun 2013, Ian Campbell wrote:
> On Thu, 2013-06-13 at 15:50 +0100, Stefano Stabellini wrote:
> > On Thu, 13 Jun 2013, George Dunlap wrote:
> > > On 13/06/13 14:44, Stefano Stabellini wrote:
> > > > On Wed, 12 Jun 2013, George Dunlap wrote:
> > > > > On 12/06/13 08:25, Jan Beulich wrote:
> > > > > > > > > On 11.06.13 at 19:26, Stefano Stabellini
> > > > > > > > > <stefano.stabellini@eu.citrix.com> wrote:
> > > > > > > I went through the code that maps the PCI MMIO regions in hvmloader
> > > > > > > (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it
> > > > > > > already
> > > > > > > maps the PCI region to high memory if the PCI bar is 64-bit and the
> > > > > > > MMIO
> > > > > > > region is larger than 512MB.
> > > > > > > 
> > > > > > > Maybe we could just relax this condition and map the device memory to
> > > > > > > high memory no matter the size of the MMIO region if the PCI bar is
> > > > > > > 64-bit?
> > > > > > I can only recommend not to: For one, guests not using PAE or
> > > > > > PSE-36 can't map such space at all (and older OSes may not
> > > > > > properly deal with 64-bit BARs at all). And then one would generally
> > > > > > expect this allocation to be done top down (to minimize risk of
> > > > > > running into RAM), and doing so is going to present further risks of
> > > > > > incompatibilities with guest OSes (Linux for example learned only in
> > > > > > 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> > > > > > 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> > > > > > PFN to pfn_pte(), the respective parameter of which is
> > > > > > "unsigned long").
> > > > > > 
> > > > > > I think this ought to be done in an iterative process - if all MMIO
> > > > > > regions together don't fit below 4G, the biggest one should be
> > > > > > moved up beyond 4G first, followed by the next to biggest one
> > > > > > etc.
> > > > > First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> > > > > temporary work-around.  It should only be done if a device doesn't fit in
> > > > > the
> > > > > current MMIO range.
> > > > > 
> > > > > We have three options here:
> > > > > 1. Don't do anything
> > > > > 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they
> > > > > don't
> > > > > fit
> > > > > 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks
> > > > > is
> > > > > memory).
> > > > > 4. Add a mechanism to tell qemu that memory is being relocated.
> > > > > 
> > > > > Number 4 is definitely the right answer long-term, but we just don't have
> > > > > time
> > > > > to do that before the 4.3 release.  We're not sure yet if #3 is possible;
> > > > > even
> > > > > if it is, it may have unpredictable knock-on effects.
> > > > > 
> > > > > Doing #2, it is true that many guests will be unable to access the device
> > > > > because of 32-bit limitations.  However, in #1, *no* guests will be able
> > > > > to
> > > > > access the device.  At least in #2, *many* guests will be able to do so.
> > > > > In
> > > > > any case, apparently #2 is what KVM does, so having the limitation on
> > > > > guests
> > > > > is not without precedent.  It's also likely to be a somewhat tested
> > > > > configuration (unlike #3, for example).
> > > > I would avoid #3, because I don't think is a good idea to rely on that
> > > > behaviour.
> > > > I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> > > > easy and certainly not doable in time for 4.3.
> > > > 
> > > > So we are left to play with the PCI MMIO region size and location in
> > > > hvmloader.
> > > > 
> > > > I agree with Jan that we shouldn't relocate unconditionally all the
> > > > devices to the region above 4G. I meant to say that we should relocate
> > > > only the ones that don't fit. And we shouldn't try to dynamically
> > > > increase the PCI hole below 4G because clearly that doesn't work.
> > > > However we could still increase the size of the PCI hole below 4G by
> > > > default from start at 0xf0000000 to starting at 0xe0000000.
> > > > Why do we know that is safe? Because in the current configuration
> > > > hvmloader *already* increases the PCI hole size by decreasing the start
> > > > address every time a device doesn't fit.
> > > > So it's already common for hvmloader to set pci_mem_start to
> > > > 0xe0000000, you just need to assign a device with a PCI hole size big
> > > > enough.
> > > > 
> > > > 
> > > > My proposed solution is:
> > > > 
> > > > - set 0xe0000000 as the default PCI hole start for everybody, including
> > > > qemu-xen-traditional
> > > > - move above 4G everything that doesn't fit and support 64-bit bars
> > > > - print an error if the device doesn't fit and doesn't support 64-bit
> > > > bars
> > > 
> > > Also, as I understand it, at the moment:
> > > 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> > > 2. Some devices (without 64-bit BARs) can't be relocated
> > > 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> > > 
> > > So if we have #1 or #2, at the moment an option for a work-around is to use
> > > qemu-traditional.
> > > 
> > > However, if we add your "print an error if the device doesn't fit", then this
> > > option will go away -- this will be a regression in functionality from 4.2.
> > 
> > Keep in mind that if we start the pci hole at 0xe0000000, the number of
> > cases for which any workarounds are needed is going to be dramatically
> > decreased to the point that I don't think we need a workaround anymore.
> 
> Starting at 0xe0000000 leaves, as you say a 448MB whole, with graphics
> cards regularly having 512M+ of RAM on them that suggests the workaround
> will be required in many cases.

http://www.nvidia.co.uk/object/graphics_cards_buy_now_uk.html

Actually more than half of the graphic cards sold today have >= 2G of
videoram so they wouldn't fit below 4G even in the old scheme that gives
at most 2G of PCI hole below 4G.

So the resulting configurations would be the same: the devices would be
located above 4G.


> > The algorithm is going to work like this in details:
> > 
> > - the pci hole size is set to 0xfc000000-0xe0000000 = 448MB
> > - we calculate the total mmio size, if it's bigger than the pci hole we
> > raise a 64 bit relocation flag
> > - if the 64 bit relocation is enabled, we relocate above 4G the first
> > device that is 64-bit capable and has an MMIO size greater or equal to
> > 512MB
> 
> Don't you mean the device with the largest MMIO size? Otherwise 2 256MB
> devices would still break things.

You are right that would be much better.
It's worth mentioning that the problem you have just identified exists
even in the current scheme. In fact you could reach a non-configurable
state by passing through 4 graphic cards with 512MB of video ram each.


> Paulo's comment about large alignments first is also worth considering.
>
> > - if the pci hole size is now big enough for the remaining devices we
> > stop the above 4G relocation, otherwise keep relocating devices that are
> > 64 bit capable and have an MMIO size greater or equal to 512MB
> > - if one or more devices don't fit we print an error and continue (it's
> > not a critical failure, one device won't be used)
> 
> This can result in a different device being broken to the one which
> would previously have been broken, including on qemu-trad I think?

Previously the guest would fail to boot with qemu-xen. On
qemu-xen-traditional the configuration would be completely different:
fewer devices would be relocated above 4G and the pci hole would be
bigger. So you are right the devices being broken would be different.


> > We could have a xenstore flag somewhere that enables the old behaviour
> > so that people can revert back to qemu-xen-traditional and make the pci
> > hole below 4G even bigger than 448MB, but I think that keeping the old
> > behaviour around is going to make the code more difficult to maintain.
> 
> The downside of that is that things which worked with the old scheme may
> not work with the new one though. Early in a release cycle when we have
> time to discover what has broken then that might be OK, but is post rc4
> really the time to be risking it?

Yes, you are right: there are some scenarios that would have worked
before that wouldn't work anymore with the new scheme.
Are they important enough to have a workaround, pretty difficult to
identify for a user?


> > Also it's difficult for people to realize that they need the workaround
> > because hvmloader logs aren't enabled by default and only go to the Xen
> > serial console. The value of this workaround pretty low in my view.
> > Finally it's worth noting that Windows XP is going EOL in less than an
> > year.
> 
> That's been true for something like 5 years...
> 
> Also, apart from XP, doesn't Windows still pick a HAL at install time,
> so even a modern guest installed under the old scheme may not get a PAE
> capable HAL. If you increase the amount of RAM I think Windows will
> "upgrade" the HAL, but is changing the MMIO layout enough to trigger
> this? Or maybe modern Windows all use PAE (or even 64 bit) anyway?
> 
> There are also performance implications of enabling PAE over 2 level
> paging. Not sure how significant they are with HAP though. Made a big
> difference with shadow IIRC.
> 
> Maybe I'm worrying about nothing but while all of these unknowns might
> be OK towards the start of a release cycle rc4 seems awfully late in the
> day to be risking it.

Keep in mind that all these configurations are perfectly valid even with
the code that we have out there today. We aren't doing anything new,
just modifying the default.
One just needs to assign a PCI device with more than 190MB to trigger it.
I am trusting the fact that given that we had this behaviour for many
years now, and it's pretty common to assign a device only some of the
times you are booting your guest, any problems would have already come
up.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 16:55                           ` Stefano Stabellini
@ 2013-06-13 17:22                             ` Ian Campbell
  -1 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-13 17:22 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Tim Deegan, Yongjie Ren, xen-devel, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, yanqiangjun, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu

On Thu, 2013-06-13 at 17:55 +0100, Stefano Stabellini wrote:

> > > We could have a xenstore flag somewhere that enables the old behaviour
> > > so that people can revert back to qemu-xen-traditional and make the pci
> > > hole below 4G even bigger than 448MB, but I think that keeping the old
> > > behaviour around is going to make the code more difficult to maintain.
> > 
> > The downside of that is that things which worked with the old scheme may
> > not work with the new one though. Early in a release cycle when we have
> > time to discover what has broken then that might be OK, but is post rc4
> > really the time to be risking it?
> 
> Yes, you are right: there are some scenarios that would have worked
> before that wouldn't work anymore with the new scheme.
> Are they important enough to have a workaround, pretty difficult to
> identify for a user?

That question would be reasonable early in the development cycle. At rc4
the question should be: do we think this problem is so critical that we
want to risk breaking something else which currently works for people.

Remember that we are invalidating whatever passthrough testing people
have already done up to this point of the release.

It is also worth noting that the things which this change ends up
breaking may for all we know be equally difficult for a user to identify
(they are after all approximately the same class of issue).

The problem here is that the risk is difficult to evaluate, we just
don't know what will break with this change, and we don't know therefore
if the cure is worse than the disease. The conservative approach at this
point in the release would be to not change anything, or to change the
minimal possible number of things (which would preclude changes which
impact qemu-trad IMHO).

WRT pretty difficult to identify -- the root of this thread suggests the
guest entered a reboot loop with "No bootable device", that sounds
eminently release notable to me. I also not that it was changing the
size of the PCI hole which caused the issue -- which does somewhat
underscore the risks involved in this sort of change.

> > > Also it's difficult for people to realize that they need the workaround
> > > because hvmloader logs aren't enabled by default and only go to the Xen
> > > serial console. The value of this workaround pretty low in my view.
> > > Finally it's worth noting that Windows XP is going EOL in less than an
> > > year.
> > 
> > That's been true for something like 5 years...
> > 
> > Also, apart from XP, doesn't Windows still pick a HAL at install time,
> > so even a modern guest installed under the old scheme may not get a PAE
> > capable HAL. If you increase the amount of RAM I think Windows will
> > "upgrade" the HAL, but is changing the MMIO layout enough to trigger
> > this? Or maybe modern Windows all use PAE (or even 64 bit) anyway?
> > 
> > There are also performance implications of enabling PAE over 2 level
> > paging. Not sure how significant they are with HAP though. Made a big
> > difference with shadow IIRC.
> > 
> > Maybe I'm worrying about nothing but while all of these unknowns might
> > be OK towards the start of a release cycle rc4 seems awfully late in the
> > day to be risking it.
> 
> Keep in mind that all these configurations are perfectly valid even with
> the code that we have out there today. We aren't doing anything new,
> just modifying the default.

I don't think that is true. We are changing the behaviour, calling it
"just" a default doesn't make it any less worrying or any less of a
change.

> One just needs to assign a PCI device with more than 190MB to trigger it.
> I am trusting the fact that given that we had this behaviour for many
> years now, and it's pretty common to assign a device only some of the
> times you are booting your guest, any problems would have already come
> up.

With qemu-trad perhaps, although that's not completely obvious TBH. In
any case should we really be crossing our fingers and "trusting" that
it'll be ok at rc4?

Ian.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 17:22                             ` Ian Campbell
  0 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-13 17:22 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Tim Deegan, Yongjie Ren, xen-devel, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, yanqiangjun, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu

On Thu, 2013-06-13 at 17:55 +0100, Stefano Stabellini wrote:

> > > We could have a xenstore flag somewhere that enables the old behaviour
> > > so that people can revert back to qemu-xen-traditional and make the pci
> > > hole below 4G even bigger than 448MB, but I think that keeping the old
> > > behaviour around is going to make the code more difficult to maintain.
> > 
> > The downside of that is that things which worked with the old scheme may
> > not work with the new one though. Early in a release cycle when we have
> > time to discover what has broken then that might be OK, but is post rc4
> > really the time to be risking it?
> 
> Yes, you are right: there are some scenarios that would have worked
> before that wouldn't work anymore with the new scheme.
> Are they important enough to have a workaround, pretty difficult to
> identify for a user?

That question would be reasonable early in the development cycle. At rc4
the question should be: do we think this problem is so critical that we
want to risk breaking something else which currently works for people.

Remember that we are invalidating whatever passthrough testing people
have already done up to this point of the release.

It is also worth noting that the things which this change ends up
breaking may for all we know be equally difficult for a user to identify
(they are after all approximately the same class of issue).

The problem here is that the risk is difficult to evaluate, we just
don't know what will break with this change, and we don't know therefore
if the cure is worse than the disease. The conservative approach at this
point in the release would be to not change anything, or to change the
minimal possible number of things (which would preclude changes which
impact qemu-trad IMHO).

WRT pretty difficult to identify -- the root of this thread suggests the
guest entered a reboot loop with "No bootable device", that sounds
eminently release notable to me. I also not that it was changing the
size of the PCI hole which caused the issue -- which does somewhat
underscore the risks involved in this sort of change.

> > > Also it's difficult for people to realize that they need the workaround
> > > because hvmloader logs aren't enabled by default and only go to the Xen
> > > serial console. The value of this workaround pretty low in my view.
> > > Finally it's worth noting that Windows XP is going EOL in less than an
> > > year.
> > 
> > That's been true for something like 5 years...
> > 
> > Also, apart from XP, doesn't Windows still pick a HAL at install time,
> > so even a modern guest installed under the old scheme may not get a PAE
> > capable HAL. If you increase the amount of RAM I think Windows will
> > "upgrade" the HAL, but is changing the MMIO layout enough to trigger
> > this? Or maybe modern Windows all use PAE (or even 64 bit) anyway?
> > 
> > There are also performance implications of enabling PAE over 2 level
> > paging. Not sure how significant they are with HAP though. Made a big
> > difference with shadow IIRC.
> > 
> > Maybe I'm worrying about nothing but while all of these unknowns might
> > be OK towards the start of a release cycle rc4 seems awfully late in the
> > day to be risking it.
> 
> Keep in mind that all these configurations are perfectly valid even with
> the code that we have out there today. We aren't doing anything new,
> just modifying the default.

I don't think that is true. We are changing the behaviour, calling it
"just" a default doesn't make it any less worrying or any less of a
change.

> One just needs to assign a PCI device with more than 190MB to trigger it.
> I am trusting the fact that given that we had this behaviour for many
> years now, and it's pretty common to assign a device only some of the
> times you are booting your guest, any problems would have already come
> up.

With qemu-trad perhaps, although that's not completely obvious TBH. In
any case should we really be crossing our fingers and "trusting" that
it'll be ok at rc4?

Ian.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 17:22                             ` Ian Campbell
@ 2013-06-14 10:53                               ` George Dunlap
  -1 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-14 10:53 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Yongjie Ren, xen-devel, Keir Fraser, Hanweidong, Xudong Hao,
	Stefano Stabellini, Tim Deegan, qemu-devel, Yanqiangjun,
	Wangzhenguo, YangXiaowei, Gonglei (Arei),
	Jan Beulich, YongweiX Xu, Luonengjun, Paolo Bonzini,
	SongtaoX Liu

On Thu, Jun 13, 2013 at 6:22 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Thu, 2013-06-13 at 17:55 +0100, Stefano Stabellini wrote:
>
>> > > We could have a xenstore flag somewhere that enables the old behaviour
>> > > so that people can revert back to qemu-xen-traditional and make the pci
>> > > hole below 4G even bigger than 448MB, but I think that keeping the old
>> > > behaviour around is going to make the code more difficult to maintain.
>> >
>> > The downside of that is that things which worked with the old scheme may
>> > not work with the new one though. Early in a release cycle when we have
>> > time to discover what has broken then that might be OK, but is post rc4
>> > really the time to be risking it?
>>
>> Yes, you are right: there are some scenarios that would have worked
>> before that wouldn't work anymore with the new scheme.
>> Are they important enough to have a workaround, pretty difficult to
>> identify for a user?
>
> That question would be reasonable early in the development cycle. At rc4
> the question should be: do we think this problem is so critical that we
> want to risk breaking something else which currently works for people.
>
> Remember that we are invalidating whatever passthrough testing people
> have already done up to this point of the release.
>
> It is also worth noting that the things which this change ends up
> breaking may for all we know be equally difficult for a user to identify
> (they are after all approximately the same class of issue).
>
> The problem here is that the risk is difficult to evaluate, we just
> don't know what will break with this change, and we don't know therefore
> if the cure is worse than the disease. The conservative approach at this
> point in the release would be to not change anything, or to change the
> minimal possible number of things (which would preclude changes which
> impact qemu-trad IMHO).
>


> WRT pretty difficult to identify -- the root of this thread suggests the
> guest entered a reboot loop with "No bootable device", that sounds
> eminently release notable to me. I also not that it was changing the
> size of the PCI hole which caused the issue -- which does somewhat
> underscore the risks involved in this sort of change.

But that bug was a bug in the first attempt to fix the root problem.
The root problem shows up as qemu crashing at some point because it
tried to access invalid guest gpfn space; see
http://lists.xen.org/archives/html/xen-devel/2013-03/msg00559.html.

Stefano tried to fix it with the above patch, just changing the hole
to start at 0xe; but that was incomplete, as it didn't match with
hvmloader and seabios's view of the world.  That's what this bug
report is about.  This thread is an attempt to find a better fix.

So the root problem is that if we revert this patch, and someone
passes through a pci device using qemu-xen (the default) and the MMIO
hole is resized, at some point in the future qemu will randomly die.

If it's a choice between users experiencing, "My VM randomly crashes"
and experiencing, "I tried to pass through this device but the guest
OS doesn't see it", I'd rather choose the latter.

 -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-14 10:53                               ` George Dunlap
  0 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-14 10:53 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Yongjie Ren, xen-devel, Keir Fraser, Hanweidong, Xudong Hao,
	Stefano Stabellini, Tim Deegan, qemu-devel, Yanqiangjun,
	Wangzhenguo, YangXiaowei, Gonglei (Arei),
	Jan Beulich, YongweiX Xu, Luonengjun, Paolo Bonzini,
	SongtaoX Liu

On Thu, Jun 13, 2013 at 6:22 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Thu, 2013-06-13 at 17:55 +0100, Stefano Stabellini wrote:
>
>> > > We could have a xenstore flag somewhere that enables the old behaviour
>> > > so that people can revert back to qemu-xen-traditional and make the pci
>> > > hole below 4G even bigger than 448MB, but I think that keeping the old
>> > > behaviour around is going to make the code more difficult to maintain.
>> >
>> > The downside of that is that things which worked with the old scheme may
>> > not work with the new one though. Early in a release cycle when we have
>> > time to discover what has broken then that might be OK, but is post rc4
>> > really the time to be risking it?
>>
>> Yes, you are right: there are some scenarios that would have worked
>> before that wouldn't work anymore with the new scheme.
>> Are they important enough to have a workaround, pretty difficult to
>> identify for a user?
>
> That question would be reasonable early in the development cycle. At rc4
> the question should be: do we think this problem is so critical that we
> want to risk breaking something else which currently works for people.
>
> Remember that we are invalidating whatever passthrough testing people
> have already done up to this point of the release.
>
> It is also worth noting that the things which this change ends up
> breaking may for all we know be equally difficult for a user to identify
> (they are after all approximately the same class of issue).
>
> The problem here is that the risk is difficult to evaluate, we just
> don't know what will break with this change, and we don't know therefore
> if the cure is worse than the disease. The conservative approach at this
> point in the release would be to not change anything, or to change the
> minimal possible number of things (which would preclude changes which
> impact qemu-trad IMHO).
>


> WRT pretty difficult to identify -- the root of this thread suggests the
> guest entered a reboot loop with "No bootable device", that sounds
> eminently release notable to me. I also not that it was changing the
> size of the PCI hole which caused the issue -- which does somewhat
> underscore the risks involved in this sort of change.

But that bug was a bug in the first attempt to fix the root problem.
The root problem shows up as qemu crashing at some point because it
tried to access invalid guest gpfn space; see
http://lists.xen.org/archives/html/xen-devel/2013-03/msg00559.html.

Stefano tried to fix it with the above patch, just changing the hole
to start at 0xe; but that was incomplete, as it didn't match with
hvmloader and seabios's view of the world.  That's what this bug
report is about.  This thread is an attempt to find a better fix.

So the root problem is that if we revert this patch, and someone
passes through a pci device using qemu-xen (the default) and the MMIO
hole is resized, at some point in the future qemu will randomly die.

If it's a choice between users experiencing, "My VM randomly crashes"
and experiencing, "I tried to pass through this device but the guest
OS doesn't see it", I'd rather choose the latter.

 -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-14 10:53                               ` George Dunlap
@ 2013-06-14 11:34                                 ` Ian Campbell
  -1 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-14 11:34 UTC (permalink / raw)
  To: George Dunlap
  Cc: Yongjie Ren, xen-devel, Keir Fraser, Hanweidong, Xudong Hao,
	Stefano Stabellini, Tim Deegan, qemu-devel, Yanqiangjun,
	Wangzhenguo, YangXiaowei, Gonglei (Arei),
	Jan Beulich, YongweiX Xu, Luonengjun, Paolo Bonzini,
	SongtaoX Liu

On Fri, 2013-06-14 at 11:53 +0100, George Dunlap wrote:
> On Thu, Jun 13, 2013 at 6:22 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Thu, 2013-06-13 at 17:55 +0100, Stefano Stabellini wrote:
> >
> >> > > We could have a xenstore flag somewhere that enables the old behaviour
> >> > > so that people can revert back to qemu-xen-traditional and make the pci
> >> > > hole below 4G even bigger than 448MB, but I think that keeping the old
> >> > > behaviour around is going to make the code more difficult to maintain.
> >> >
> >> > The downside of that is that things which worked with the old scheme may
> >> > not work with the new one though. Early in a release cycle when we have
> >> > time to discover what has broken then that might be OK, but is post rc4
> >> > really the time to be risking it?
> >>
> >> Yes, you are right: there are some scenarios that would have worked
> >> before that wouldn't work anymore with the new scheme.
> >> Are they important enough to have a workaround, pretty difficult to
> >> identify for a user?
> >
> > That question would be reasonable early in the development cycle. At rc4
> > the question should be: do we think this problem is so critical that we
> > want to risk breaking something else which currently works for people.
> >
> > Remember that we are invalidating whatever passthrough testing people
> > have already done up to this point of the release.
> >
> > It is also worth noting that the things which this change ends up
> > breaking may for all we know be equally difficult for a user to identify
> > (they are after all approximately the same class of issue).
> >
> > The problem here is that the risk is difficult to evaluate, we just
> > don't know what will break with this change, and we don't know therefore
> > if the cure is worse than the disease. The conservative approach at this
> > point in the release would be to not change anything, or to change the
> > minimal possible number of things (which would preclude changes which
> > impact qemu-trad IMHO).
> >
> 
> 
> > WRT pretty difficult to identify -- the root of this thread suggests the
> > guest entered a reboot loop with "No bootable device", that sounds
> > eminently release notable to me. I also not that it was changing the
> > size of the PCI hole which caused the issue -- which does somewhat
> > underscore the risks involved in this sort of change.
> 
> But that bug was a bug in the first attempt to fix the root problem.
> The root problem shows up as qemu crashing at some point because it
> tried to access invalid guest gpfn space; see
> http://lists.xen.org/archives/html/xen-devel/2013-03/msg00559.html.
> 
> Stefano tried to fix it with the above patch, just changing the hole
> to start at 0xe; but that was incomplete, as it didn't match with
> hvmloader and seabios's view of the world.  That's what this bug
> report is about.  This thread is an attempt to find a better fix.
> 
> So the root problem is that if we revert this patch, and someone
> passes through a pci device using qemu-xen (the default) and the MMIO
> hole is resized, at some point in the future qemu will randomly die.

Right, I see, thanks for explaining.

> If it's a choice between users experiencing, "My VM randomly crashes"
> and experiencing, "I tried to pass through this device but the guest
> OS doesn't see it", I'd rather choose the latter.

All other things being equal, obviously we all would. But the point I've
been trying to make is that we don't know the other consequences of
making that fix -- e.g. on existing working configurations. So the
choice is "some VMs randomly crash, but other stuff works fine and we
have had a reasonable amount of user testing" and "those particular VMs
don't crash any more, but we don't know what other stuff no longer works
and the existing test base has been at least partially invalidated".

I think that at post rc4 in a release we ought to be being pretty
conservative about the risks of this sort of change, especially wrt
invalidating testing and the unknowns involved.

Aren't the configurations which might trip over this issue are going to
be in the minority compared to those which we risk breaking?

Ian.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-14 11:34                                 ` Ian Campbell
  0 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-14 11:34 UTC (permalink / raw)
  To: George Dunlap
  Cc: Yongjie Ren, xen-devel, Keir Fraser, Hanweidong, Xudong Hao,
	Stefano Stabellini, Tim Deegan, qemu-devel, Yanqiangjun,
	Wangzhenguo, YangXiaowei, Gonglei (Arei),
	Jan Beulich, YongweiX Xu, Luonengjun, Paolo Bonzini,
	SongtaoX Liu

On Fri, 2013-06-14 at 11:53 +0100, George Dunlap wrote:
> On Thu, Jun 13, 2013 at 6:22 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Thu, 2013-06-13 at 17:55 +0100, Stefano Stabellini wrote:
> >
> >> > > We could have a xenstore flag somewhere that enables the old behaviour
> >> > > so that people can revert back to qemu-xen-traditional and make the pci
> >> > > hole below 4G even bigger than 448MB, but I think that keeping the old
> >> > > behaviour around is going to make the code more difficult to maintain.
> >> >
> >> > The downside of that is that things which worked with the old scheme may
> >> > not work with the new one though. Early in a release cycle when we have
> >> > time to discover what has broken then that might be OK, but is post rc4
> >> > really the time to be risking it?
> >>
> >> Yes, you are right: there are some scenarios that would have worked
> >> before that wouldn't work anymore with the new scheme.
> >> Are they important enough to have a workaround, pretty difficult to
> >> identify for a user?
> >
> > That question would be reasonable early in the development cycle. At rc4
> > the question should be: do we think this problem is so critical that we
> > want to risk breaking something else which currently works for people.
> >
> > Remember that we are invalidating whatever passthrough testing people
> > have already done up to this point of the release.
> >
> > It is also worth noting that the things which this change ends up
> > breaking may for all we know be equally difficult for a user to identify
> > (they are after all approximately the same class of issue).
> >
> > The problem here is that the risk is difficult to evaluate, we just
> > don't know what will break with this change, and we don't know therefore
> > if the cure is worse than the disease. The conservative approach at this
> > point in the release would be to not change anything, or to change the
> > minimal possible number of things (which would preclude changes which
> > impact qemu-trad IMHO).
> >
> 
> 
> > WRT pretty difficult to identify -- the root of this thread suggests the
> > guest entered a reboot loop with "No bootable device", that sounds
> > eminently release notable to me. I also not that it was changing the
> > size of the PCI hole which caused the issue -- which does somewhat
> > underscore the risks involved in this sort of change.
> 
> But that bug was a bug in the first attempt to fix the root problem.
> The root problem shows up as qemu crashing at some point because it
> tried to access invalid guest gpfn space; see
> http://lists.xen.org/archives/html/xen-devel/2013-03/msg00559.html.
> 
> Stefano tried to fix it with the above patch, just changing the hole
> to start at 0xe; but that was incomplete, as it didn't match with
> hvmloader and seabios's view of the world.  That's what this bug
> report is about.  This thread is an attempt to find a better fix.
> 
> So the root problem is that if we revert this patch, and someone
> passes through a pci device using qemu-xen (the default) and the MMIO
> hole is resized, at some point in the future qemu will randomly die.

Right, I see, thanks for explaining.

> If it's a choice between users experiencing, "My VM randomly crashes"
> and experiencing, "I tried to pass through this device but the guest
> OS doesn't see it", I'd rather choose the latter.

All other things being equal, obviously we all would. But the point I've
been trying to make is that we don't know the other consequences of
making that fix -- e.g. on existing working configurations. So the
choice is "some VMs randomly crash, but other stuff works fine and we
have had a reasonable amount of user testing" and "those particular VMs
don't crash any more, but we don't know what other stuff no longer works
and the existing test base has been at least partially invalidated".

I think that at post rc4 in a release we ought to be being pretty
conservative about the risks of this sort of change, especially wrt
invalidating testing and the unknowns involved.

Aren't the configurations which might trip over this issue are going to
be in the minority compared to those which we risk breaking?

Ian.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-14 11:34                                 ` Ian Campbell
@ 2013-06-14 14:14                                   ` George Dunlap
  -1 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-14 14:14 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Yongjie Ren, xen-devel, Keir Fraser, Hanweidong, Xudong Hao,
	Stefano Stabellini, Tim Deegan, qemu-devel, Yanqiangjun,
	Wangzhenguo, YangXiaowei, Gonglei (Arei),
	Jan Beulich, YongweiX Xu, Luonengjun, Paolo Bonzini,
	SongtaoX Liu

On 14/06/13 12:34, Ian Campbell wrote:
> On Fri, 2013-06-14 at 11:53 +0100, George Dunlap wrote:
>> On Thu, Jun 13, 2013 at 6:22 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>> On Thu, 2013-06-13 at 17:55 +0100, Stefano Stabellini wrote:
>>>
>>>>>> We could have a xenstore flag somewhere that enables the old behaviour
>>>>>> so that people can revert back to qemu-xen-traditional and make the pci
>>>>>> hole below 4G even bigger than 448MB, but I think that keeping the old
>>>>>> behaviour around is going to make the code more difficult to maintain.
>>>>> The downside of that is that things which worked with the old scheme may
>>>>> not work with the new one though. Early in a release cycle when we have
>>>>> time to discover what has broken then that might be OK, but is post rc4
>>>>> really the time to be risking it?
>>>> Yes, you are right: there are some scenarios that would have worked
>>>> before that wouldn't work anymore with the new scheme.
>>>> Are they important enough to have a workaround, pretty difficult to
>>>> identify for a user?
>>> That question would be reasonable early in the development cycle. At rc4
>>> the question should be: do we think this problem is so critical that we
>>> want to risk breaking something else which currently works for people.
>>>
>>> Remember that we are invalidating whatever passthrough testing people
>>> have already done up to this point of the release.
>>>
>>> It is also worth noting that the things which this change ends up
>>> breaking may for all we know be equally difficult for a user to identify
>>> (they are after all approximately the same class of issue).
>>>
>>> The problem here is that the risk is difficult to evaluate, we just
>>> don't know what will break with this change, and we don't know therefore
>>> if the cure is worse than the disease. The conservative approach at this
>>> point in the release would be to not change anything, or to change the
>>> minimal possible number of things (which would preclude changes which
>>> impact qemu-trad IMHO).
>>>
>>
>>> WRT pretty difficult to identify -- the root of this thread suggests the
>>> guest entered a reboot loop with "No bootable device", that sounds
>>> eminently release notable to me. I also not that it was changing the
>>> size of the PCI hole which caused the issue -- which does somewhat
>>> underscore the risks involved in this sort of change.
>> But that bug was a bug in the first attempt to fix the root problem.
>> The root problem shows up as qemu crashing at some point because it
>> tried to access invalid guest gpfn space; see
>> http://lists.xen.org/archives/html/xen-devel/2013-03/msg00559.html.
>>
>> Stefano tried to fix it with the above patch, just changing the hole
>> to start at 0xe; but that was incomplete, as it didn't match with
>> hvmloader and seabios's view of the world.  That's what this bug
>> report is about.  This thread is an attempt to find a better fix.
>>
>> So the root problem is that if we revert this patch, and someone
>> passes through a pci device using qemu-xen (the default) and the MMIO
>> hole is resized, at some point in the future qemu will randomly die.
> Right, I see, thanks for explaining.
>
>> If it's a choice between users experiencing, "My VM randomly crashes"
>> and experiencing, "I tried to pass through this device but the guest
>> OS doesn't see it", I'd rather choose the latter.
> All other things being equal, obviously we all would. But the point I've
> been trying to make is that we don't know the other consequences of
> making that fix -- e.g. on existing working configurations. So the
> choice is "some VMs randomly crash, but other stuff works fine and we
> have had a reasonable amount of user testing" and "those particular VMs
> don't crash any more, but we don't know what other stuff no longer works
> and the existing test base has been at least partially invalidated".
>
> I think that at post rc4 in a release we ought to be being pretty
> conservative about the risks of this sort of change, especially wrt
> invalidating testing and the unknowns involved.
>
> Aren't the configurations which might trip over this issue are going to
> be in the minority compared to those which we risk breaking?

So there are the technical proposals we've been discussing, each of 
which has different risks.

1. Set the default MMIO hole size to 0xe0000000.
2. If possible, relocate PCI devices that don't fit in the hole to the 
64-bit hole.
  - Here "if possible" will mean a) the device has a 64-bit BAR, and b) 
this hasn't been disabled by libxl (probably via a xenstore key).
3. If possible, resize the MMIO hole; otherwise refuse to map the device
  - Currently "if possible" is always true; the new thing here would be 
making it possible for libxl to disable this, probably via a xenstore key.

Each of these will have different risks for qemu-traditional and qemu-xen.

Implementing #3 would have no risk for qemu-traditional, because we 
won't be changing the way anything works; what works will still work, 
what is broken (if anything) will still be broken.

Implementing #3 for qemu-xen only changes one kind of failure for 
another.  If you resize the MMIO hole for qemu-xen, then you *will* 
eventually crash.  So this will not break existing working 
configurations -- it will only change the failure from "qemu crashes at 
some point" to "the guest OS cannot see the device".  This is a uniform 
improvement.

So #3 is very low risk, as far as I can tell, and has a solid benefit.  
I think we should definitely implement it.

I think #2 should have no impact on qemu-traditional, because xl should 
disable it by default.

For qemu-xen, the only devices that are relocated are devices that would 
otherwise be disabled by #3; so remember that the alternate for this one 
(assuming we implement #3) is "not visible to OS". There are several 
potential sub-sets here:
  a. Guest OSes that can't access the 64-bit region.  In that case the 
device will not be visible, which is the same failure as not 
implementing this at all.
  b. I think most devices and operating systems will just work; this 
codepath in QEMU and guest OSes is fairly well tested with KVM.
  c. It may be that there are devices that would have worked with 
qemu-traditional and placed in a resized MMIO hole, but that will break 
with qemu-xen and and placed in the 64-bit MMIO hole.  For these 
devices, the failure mode will change from "not visible to guest OS" to 
"fails in some other unforseen way".

The the main risk I see from this one is c.  However, I think from a 
cost-benefits, it's still pretty low -- we get the benefit of most 
people transparently being able to just use pci-passthrough, at the 
potential cost of a handful of people having "weird crash" failures 
instead of "can't use the device" failures.

I suppose that there is a potential risk for b as well, in that we 
haven't tested relocating to the 64-bit MMIO hole *with Xen*.  There may 
be assumptions about 32-bit paddrs baked in somewhere that we don't know 
about.  If we implement this change on top of #3, and there are problems 
with b, then it will change "device not visible to guest OS" failures 
back into "may crash in a weird way" failures.

On the whole I would be inclined to implement #2 if it's not too 
difficult, but I can certainly see the point of saying that it's too 
risky and that we shouldn't do it.

Then there's #1.  This should in theory be low-risk, because in theory 
hvmloader might have chosen 0xe as the start of the MMIO hole anyway.

For qemu-traditional, this change has no benefits.  The benefits for 
qemu-xen depend on what else gets implemented.  If nothing else is 
implemented, this changes some "random qemu crash" failures into 
successes (while leaving other ones to keep crashing).  If #3 is 
implemented, then it changes some "guest OS can't see device" failures 
into successes.  If Both #3 and #2 are implemented, then some of the 
"guest OS can't see 64-bit MMIO hole" failures will change into successes.

However, as you say, the number of times hvmloader *actually* choses 
that at the moment is fairly low.  The vast majority of VMs and 
configurations at the moment will *not* use 0xe as a base; at most only 
a handful of people will have tested that configuration.  So there is a 
fairly significant risk that there *is* some configuration for which, 
*had* hvmloader chosen 0xe, then it *would* have caused a problem.  For 
those configurations, #1 will change "will break but only if you have a 
very large PCI device" to "will break no matter what".

This would be fine if we hadn't started RCs; but we have.  Our most 
important userbase is people who are *not* doing pci pass-through; so I 
think I agree with you, that this introduces a probably unacceptable 
risk for very little gain -- particularly if we implement #3.

So my recommendation is:
* Implement #3
* Consider implementing #2
* Don't implement #1.

Thoughts?

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-14 14:14                                   ` George Dunlap
  0 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-14 14:14 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Yongjie Ren, xen-devel, Keir Fraser, Hanweidong, Xudong Hao,
	Stefano Stabellini, Tim Deegan, qemu-devel, Yanqiangjun,
	Wangzhenguo, YangXiaowei, Gonglei (Arei),
	Jan Beulich, YongweiX Xu, Luonengjun, Paolo Bonzini,
	SongtaoX Liu

On 14/06/13 12:34, Ian Campbell wrote:
> On Fri, 2013-06-14 at 11:53 +0100, George Dunlap wrote:
>> On Thu, Jun 13, 2013 at 6:22 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>> On Thu, 2013-06-13 at 17:55 +0100, Stefano Stabellini wrote:
>>>
>>>>>> We could have a xenstore flag somewhere that enables the old behaviour
>>>>>> so that people can revert back to qemu-xen-traditional and make the pci
>>>>>> hole below 4G even bigger than 448MB, but I think that keeping the old
>>>>>> behaviour around is going to make the code more difficult to maintain.
>>>>> The downside of that is that things which worked with the old scheme may
>>>>> not work with the new one though. Early in a release cycle when we have
>>>>> time to discover what has broken then that might be OK, but is post rc4
>>>>> really the time to be risking it?
>>>> Yes, you are right: there are some scenarios that would have worked
>>>> before that wouldn't work anymore with the new scheme.
>>>> Are they important enough to have a workaround, pretty difficult to
>>>> identify for a user?
>>> That question would be reasonable early in the development cycle. At rc4
>>> the question should be: do we think this problem is so critical that we
>>> want to risk breaking something else which currently works for people.
>>>
>>> Remember that we are invalidating whatever passthrough testing people
>>> have already done up to this point of the release.
>>>
>>> It is also worth noting that the things which this change ends up
>>> breaking may for all we know be equally difficult for a user to identify
>>> (they are after all approximately the same class of issue).
>>>
>>> The problem here is that the risk is difficult to evaluate, we just
>>> don't know what will break with this change, and we don't know therefore
>>> if the cure is worse than the disease. The conservative approach at this
>>> point in the release would be to not change anything, or to change the
>>> minimal possible number of things (which would preclude changes which
>>> impact qemu-trad IMHO).
>>>
>>
>>> WRT pretty difficult to identify -- the root of this thread suggests the
>>> guest entered a reboot loop with "No bootable device", that sounds
>>> eminently release notable to me. I also not that it was changing the
>>> size of the PCI hole which caused the issue -- which does somewhat
>>> underscore the risks involved in this sort of change.
>> But that bug was a bug in the first attempt to fix the root problem.
>> The root problem shows up as qemu crashing at some point because it
>> tried to access invalid guest gpfn space; see
>> http://lists.xen.org/archives/html/xen-devel/2013-03/msg00559.html.
>>
>> Stefano tried to fix it with the above patch, just changing the hole
>> to start at 0xe; but that was incomplete, as it didn't match with
>> hvmloader and seabios's view of the world.  That's what this bug
>> report is about.  This thread is an attempt to find a better fix.
>>
>> So the root problem is that if we revert this patch, and someone
>> passes through a pci device using qemu-xen (the default) and the MMIO
>> hole is resized, at some point in the future qemu will randomly die.
> Right, I see, thanks for explaining.
>
>> If it's a choice between users experiencing, "My VM randomly crashes"
>> and experiencing, "I tried to pass through this device but the guest
>> OS doesn't see it", I'd rather choose the latter.
> All other things being equal, obviously we all would. But the point I've
> been trying to make is that we don't know the other consequences of
> making that fix -- e.g. on existing working configurations. So the
> choice is "some VMs randomly crash, but other stuff works fine and we
> have had a reasonable amount of user testing" and "those particular VMs
> don't crash any more, but we don't know what other stuff no longer works
> and the existing test base has been at least partially invalidated".
>
> I think that at post rc4 in a release we ought to be being pretty
> conservative about the risks of this sort of change, especially wrt
> invalidating testing and the unknowns involved.
>
> Aren't the configurations which might trip over this issue are going to
> be in the minority compared to those which we risk breaking?

So there are the technical proposals we've been discussing, each of 
which has different risks.

1. Set the default MMIO hole size to 0xe0000000.
2. If possible, relocate PCI devices that don't fit in the hole to the 
64-bit hole.
  - Here "if possible" will mean a) the device has a 64-bit BAR, and b) 
this hasn't been disabled by libxl (probably via a xenstore key).
3. If possible, resize the MMIO hole; otherwise refuse to map the device
  - Currently "if possible" is always true; the new thing here would be 
making it possible for libxl to disable this, probably via a xenstore key.

Each of these will have different risks for qemu-traditional and qemu-xen.

Implementing #3 would have no risk for qemu-traditional, because we 
won't be changing the way anything works; what works will still work, 
what is broken (if anything) will still be broken.

Implementing #3 for qemu-xen only changes one kind of failure for 
another.  If you resize the MMIO hole for qemu-xen, then you *will* 
eventually crash.  So this will not break existing working 
configurations -- it will only change the failure from "qemu crashes at 
some point" to "the guest OS cannot see the device".  This is a uniform 
improvement.

So #3 is very low risk, as far as I can tell, and has a solid benefit.  
I think we should definitely implement it.

I think #2 should have no impact on qemu-traditional, because xl should 
disable it by default.

For qemu-xen, the only devices that are relocated are devices that would 
otherwise be disabled by #3; so remember that the alternate for this one 
(assuming we implement #3) is "not visible to OS". There are several 
potential sub-sets here:
  a. Guest OSes that can't access the 64-bit region.  In that case the 
device will not be visible, which is the same failure as not 
implementing this at all.
  b. I think most devices and operating systems will just work; this 
codepath in QEMU and guest OSes is fairly well tested with KVM.
  c. It may be that there are devices that would have worked with 
qemu-traditional and placed in a resized MMIO hole, but that will break 
with qemu-xen and and placed in the 64-bit MMIO hole.  For these 
devices, the failure mode will change from "not visible to guest OS" to 
"fails in some other unforseen way".

The the main risk I see from this one is c.  However, I think from a 
cost-benefits, it's still pretty low -- we get the benefit of most 
people transparently being able to just use pci-passthrough, at the 
potential cost of a handful of people having "weird crash" failures 
instead of "can't use the device" failures.

I suppose that there is a potential risk for b as well, in that we 
haven't tested relocating to the 64-bit MMIO hole *with Xen*.  There may 
be assumptions about 32-bit paddrs baked in somewhere that we don't know 
about.  If we implement this change on top of #3, and there are problems 
with b, then it will change "device not visible to guest OS" failures 
back into "may crash in a weird way" failures.

On the whole I would be inclined to implement #2 if it's not too 
difficult, but I can certainly see the point of saying that it's too 
risky and that we shouldn't do it.

Then there's #1.  This should in theory be low-risk, because in theory 
hvmloader might have chosen 0xe as the start of the MMIO hole anyway.

For qemu-traditional, this change has no benefits.  The benefits for 
qemu-xen depend on what else gets implemented.  If nothing else is 
implemented, this changes some "random qemu crash" failures into 
successes (while leaving other ones to keep crashing).  If #3 is 
implemented, then it changes some "guest OS can't see device" failures 
into successes.  If Both #3 and #2 are implemented, then some of the 
"guest OS can't see 64-bit MMIO hole" failures will change into successes.

However, as you say, the number of times hvmloader *actually* choses 
that at the moment is fairly low.  The vast majority of VMs and 
configurations at the moment will *not* use 0xe as a base; at most only 
a handful of people will have tested that configuration.  So there is a 
fairly significant risk that there *is* some configuration for which, 
*had* hvmloader chosen 0xe, then it *would* have caused a problem.  For 
those configurations, #1 will change "will break but only if you have a 
very large PCI device" to "will break no matter what".

This would be fine if we hadn't started RCs; but we have.  Our most 
important userbase is people who are *not* doing pci pass-through; so I 
think I agree with you, that this introduces a probably unacceptable 
risk for very little gain -- particularly if we implement #3.

So my recommendation is:
* Implement #3
* Consider implementing #2
* Don't implement #1.

Thoughts?

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-14 14:14                                   ` George Dunlap
@ 2013-06-14 14:36                                     ` George Dunlap
  -1 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-14 14:36 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Yongjie Ren, xen-devel, Keir Fraser, Hanweidong, Xudong Hao,
	Stefano Stabellini, Tim Deegan, qemu-devel, Yanqiangjun,
	Wangzhenguo, YangXiaowei, Gonglei (Arei),
	Jan Beulich, YongweiX Xu, Luonengjun, Paolo Bonzini,
	SongtaoX Liu

On 14/06/13 15:14, George Dunlap wrote:
> On 14/06/13 12:34, Ian Campbell wrote:
>> On Fri, 2013-06-14 at 11:53 +0100, George Dunlap wrote:
>>> On Thu, Jun 13, 2013 at 6:22 PM, Ian Campbell 
>>> <Ian.Campbell@citrix.com> wrote:
>>>> On Thu, 2013-06-13 at 17:55 +0100, Stefano Stabellini wrote:
>>>>
>>>>>>> We could have a xenstore flag somewhere that enables the old 
>>>>>>> behaviour
>>>>>>> so that people can revert back to qemu-xen-traditional and make 
>>>>>>> the pci
>>>>>>> hole below 4G even bigger than 448MB, but I think that keeping 
>>>>>>> the old
>>>>>>> behaviour around is going to make the code more difficult to 
>>>>>>> maintain.
>>>>>> The downside of that is that things which worked with the old 
>>>>>> scheme may
>>>>>> not work with the new one though. Early in a release cycle when 
>>>>>> we have
>>>>>> time to discover what has broken then that might be OK, but is 
>>>>>> post rc4
>>>>>> really the time to be risking it?
>>>>> Yes, you are right: there are some scenarios that would have worked
>>>>> before that wouldn't work anymore with the new scheme.
>>>>> Are they important enough to have a workaround, pretty difficult to
>>>>> identify for a user?
>>>> That question would be reasonable early in the development cycle. 
>>>> At rc4
>>>> the question should be: do we think this problem is so critical 
>>>> that we
>>>> want to risk breaking something else which currently works for people.
>>>>
>>>> Remember that we are invalidating whatever passthrough testing people
>>>> have already done up to this point of the release.
>>>>
>>>> It is also worth noting that the things which this change ends up
>>>> breaking may for all we know be equally difficult for a user to 
>>>> identify
>>>> (they are after all approximately the same class of issue).
>>>>
>>>> The problem here is that the risk is difficult to evaluate, we just
>>>> don't know what will break with this change, and we don't know 
>>>> therefore
>>>> if the cure is worse than the disease. The conservative approach at 
>>>> this
>>>> point in the release would be to not change anything, or to change the
>>>> minimal possible number of things (which would preclude changes which
>>>> impact qemu-trad IMHO).
>>>>
>>>
>>>> WRT pretty difficult to identify -- the root of this thread 
>>>> suggests the
>>>> guest entered a reboot loop with "No bootable device", that sounds
>>>> eminently release notable to me. I also not that it was changing the
>>>> size of the PCI hole which caused the issue -- which does somewhat
>>>> underscore the risks involved in this sort of change.
>>> But that bug was a bug in the first attempt to fix the root problem.
>>> The root problem shows up as qemu crashing at some point because it
>>> tried to access invalid guest gpfn space; see
>>> http://lists.xen.org/archives/html/xen-devel/2013-03/msg00559.html.
>>>
>>> Stefano tried to fix it with the above patch, just changing the hole
>>> to start at 0xe; but that was incomplete, as it didn't match with
>>> hvmloader and seabios's view of the world.  That's what this bug
>>> report is about.  This thread is an attempt to find a better fix.
>>>
>>> So the root problem is that if we revert this patch, and someone
>>> passes through a pci device using qemu-xen (the default) and the MMIO
>>> hole is resized, at some point in the future qemu will randomly die.
>> Right, I see, thanks for explaining.
>>
>>> If it's a choice between users experiencing, "My VM randomly crashes"
>>> and experiencing, "I tried to pass through this device but the guest
>>> OS doesn't see it", I'd rather choose the latter.
>> All other things being equal, obviously we all would. But the point I've
>> been trying to make is that we don't know the other consequences of
>> making that fix -- e.g. on existing working configurations. So the
>> choice is "some VMs randomly crash, but other stuff works fine and we
>> have had a reasonable amount of user testing" and "those particular VMs
>> don't crash any more, but we don't know what other stuff no longer works
>> and the existing test base has been at least partially invalidated".
>>
>> I think that at post rc4 in a release we ought to be being pretty
>> conservative about the risks of this sort of change, especially wrt
>> invalidating testing and the unknowns involved.
>>
>> Aren't the configurations which might trip over this issue are going to
>> be in the minority compared to those which we risk breaking?
>
> So there are the technical proposals we've been discussing, each of 
> which has different risks.
>
> 1. Set the default MMIO hole size to 0xe0000000.
> 2. If possible, relocate PCI devices that don't fit in the hole to the 
> 64-bit hole.
>  - Here "if possible" will mean a) the device has a 64-bit BAR, and b) 
> this hasn't been disabled by libxl (probably via a xenstore key).
> 3. If possible, resize the MMIO hole; otherwise refuse to map the device
>  - Currently "if possible" is always true; the new thing here would be 
> making it possible for libxl to disable this, probably via a xenstore 
> key.
>
> Each of these will have different risks for qemu-traditional and 
> qemu-xen.
>
> Implementing #3 would have no risk for qemu-traditional, because we 
> won't be changing the way anything works; what works will still work, 
> what is broken (if anything) will still be broken.
>
> Implementing #3 for qemu-xen only changes one kind of failure for 
> another.  If you resize the MMIO hole for qemu-xen, then you *will* 
> eventually crash.  So this will not break existing working 
> configurations -- it will only change the failure from "qemu crashes 
> at some point" to "the guest OS cannot see the device". This is a 
> uniform improvement.

I suppose this is not strictly true.  If you resize the MMIO hole *such 
that it overlaps what was originally guest memory*, then it will crash.  
If you have a smaller guest with say, only 1 or 2GiB of RAM, then you 
can probably resize the MMIO hole arbitrarily on qemu-xen and have no 
ill effects.  So as stated ("never resize MMIO hole"), this would cause 
some successes into "guest can't see the device" failures.

(Stefano, correct me if I'm wrong here.)

But hvmloader should know whether this is the case, however, because if 
there is memory there it has to relocate it.  So we should change "is 
possible" to mean, "if we don't need to relocate memory, or if 
relocating memory has been enabled by libxl".

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-14 14:36                                     ` George Dunlap
  0 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-14 14:36 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Yongjie Ren, xen-devel, Keir Fraser, Hanweidong, Xudong Hao,
	Stefano Stabellini, Tim Deegan, qemu-devel, Yanqiangjun,
	Wangzhenguo, YangXiaowei, Gonglei (Arei),
	Jan Beulich, YongweiX Xu, Luonengjun, Paolo Bonzini,
	SongtaoX Liu

On 14/06/13 15:14, George Dunlap wrote:
> On 14/06/13 12:34, Ian Campbell wrote:
>> On Fri, 2013-06-14 at 11:53 +0100, George Dunlap wrote:
>>> On Thu, Jun 13, 2013 at 6:22 PM, Ian Campbell 
>>> <Ian.Campbell@citrix.com> wrote:
>>>> On Thu, 2013-06-13 at 17:55 +0100, Stefano Stabellini wrote:
>>>>
>>>>>>> We could have a xenstore flag somewhere that enables the old 
>>>>>>> behaviour
>>>>>>> so that people can revert back to qemu-xen-traditional and make 
>>>>>>> the pci
>>>>>>> hole below 4G even bigger than 448MB, but I think that keeping 
>>>>>>> the old
>>>>>>> behaviour around is going to make the code more difficult to 
>>>>>>> maintain.
>>>>>> The downside of that is that things which worked with the old 
>>>>>> scheme may
>>>>>> not work with the new one though. Early in a release cycle when 
>>>>>> we have
>>>>>> time to discover what has broken then that might be OK, but is 
>>>>>> post rc4
>>>>>> really the time to be risking it?
>>>>> Yes, you are right: there are some scenarios that would have worked
>>>>> before that wouldn't work anymore with the new scheme.
>>>>> Are they important enough to have a workaround, pretty difficult to
>>>>> identify for a user?
>>>> That question would be reasonable early in the development cycle. 
>>>> At rc4
>>>> the question should be: do we think this problem is so critical 
>>>> that we
>>>> want to risk breaking something else which currently works for people.
>>>>
>>>> Remember that we are invalidating whatever passthrough testing people
>>>> have already done up to this point of the release.
>>>>
>>>> It is also worth noting that the things which this change ends up
>>>> breaking may for all we know be equally difficult for a user to 
>>>> identify
>>>> (they are after all approximately the same class of issue).
>>>>
>>>> The problem here is that the risk is difficult to evaluate, we just
>>>> don't know what will break with this change, and we don't know 
>>>> therefore
>>>> if the cure is worse than the disease. The conservative approach at 
>>>> this
>>>> point in the release would be to not change anything, or to change the
>>>> minimal possible number of things (which would preclude changes which
>>>> impact qemu-trad IMHO).
>>>>
>>>
>>>> WRT pretty difficult to identify -- the root of this thread 
>>>> suggests the
>>>> guest entered a reboot loop with "No bootable device", that sounds
>>>> eminently release notable to me. I also not that it was changing the
>>>> size of the PCI hole which caused the issue -- which does somewhat
>>>> underscore the risks involved in this sort of change.
>>> But that bug was a bug in the first attempt to fix the root problem.
>>> The root problem shows up as qemu crashing at some point because it
>>> tried to access invalid guest gpfn space; see
>>> http://lists.xen.org/archives/html/xen-devel/2013-03/msg00559.html.
>>>
>>> Stefano tried to fix it with the above patch, just changing the hole
>>> to start at 0xe; but that was incomplete, as it didn't match with
>>> hvmloader and seabios's view of the world.  That's what this bug
>>> report is about.  This thread is an attempt to find a better fix.
>>>
>>> So the root problem is that if we revert this patch, and someone
>>> passes through a pci device using qemu-xen (the default) and the MMIO
>>> hole is resized, at some point in the future qemu will randomly die.
>> Right, I see, thanks for explaining.
>>
>>> If it's a choice between users experiencing, "My VM randomly crashes"
>>> and experiencing, "I tried to pass through this device but the guest
>>> OS doesn't see it", I'd rather choose the latter.
>> All other things being equal, obviously we all would. But the point I've
>> been trying to make is that we don't know the other consequences of
>> making that fix -- e.g. on existing working configurations. So the
>> choice is "some VMs randomly crash, but other stuff works fine and we
>> have had a reasonable amount of user testing" and "those particular VMs
>> don't crash any more, but we don't know what other stuff no longer works
>> and the existing test base has been at least partially invalidated".
>>
>> I think that at post rc4 in a release we ought to be being pretty
>> conservative about the risks of this sort of change, especially wrt
>> invalidating testing and the unknowns involved.
>>
>> Aren't the configurations which might trip over this issue are going to
>> be in the minority compared to those which we risk breaking?
>
> So there are the technical proposals we've been discussing, each of 
> which has different risks.
>
> 1. Set the default MMIO hole size to 0xe0000000.
> 2. If possible, relocate PCI devices that don't fit in the hole to the 
> 64-bit hole.
>  - Here "if possible" will mean a) the device has a 64-bit BAR, and b) 
> this hasn't been disabled by libxl (probably via a xenstore key).
> 3. If possible, resize the MMIO hole; otherwise refuse to map the device
>  - Currently "if possible" is always true; the new thing here would be 
> making it possible for libxl to disable this, probably via a xenstore 
> key.
>
> Each of these will have different risks for qemu-traditional and 
> qemu-xen.
>
> Implementing #3 would have no risk for qemu-traditional, because we 
> won't be changing the way anything works; what works will still work, 
> what is broken (if anything) will still be broken.
>
> Implementing #3 for qemu-xen only changes one kind of failure for 
> another.  If you resize the MMIO hole for qemu-xen, then you *will* 
> eventually crash.  So this will not break existing working 
> configurations -- it will only change the failure from "qemu crashes 
> at some point" to "the guest OS cannot see the device". This is a 
> uniform improvement.

I suppose this is not strictly true.  If you resize the MMIO hole *such 
that it overlaps what was originally guest memory*, then it will crash.  
If you have a smaller guest with say, only 1 or 2GiB of RAM, then you 
can probably resize the MMIO hole arbitrarily on qemu-xen and have no 
ill effects.  So as stated ("never resize MMIO hole"), this would cause 
some successes into "guest can't see the device" failures.

(Stefano, correct me if I'm wrong here.)

But hvmloader should know whether this is the case, however, because if 
there is memory there it has to relocate it.  So we should change "is 
possible" to mean, "if we don't need to relocate memory, or if 
relocating memory has been enabled by libxl".

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 13:54                     ` George Dunlap
@ 2013-06-13 14:54                       ` Paolo Bonzini
  -1 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2013-06-13 14:54 UTC (permalink / raw)
  To: George Dunlap
  Cc: Yongjie Ren, xen-devel, Keir Fraser, Ian Campbell,
	Stefano Stabellini, Xudong Hao, hanweidong, Tim Deegan,
	qemu-devel, yanqiangjun, wangzhenguo, xiaowei.yang, arei.gonglei,
	Jan Beulich, luonengjun, YongweiX Xu, SongtaoX Liu

Il 13/06/2013 09:54, George Dunlap ha scritto:
> 
> Also, as I understand it, at the moment:
> 1. Some operating systems (32-bit XP) won't be able to use relocated
> devices
> 2. Some devices (without 64-bit BARs) can't be relocated

Are there really devices with huge 32-bit BARs?  I think #1 is the only
real problem, so far though it never was for KVM.

SeaBIOS sorts the BARs from smallest to largest alignment, and then from
smallest to largest size.  Typically only the GPU will be relocated.

Paolo

> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> 
> So if we have #1 or #2, at the moment an option for a work-around is to
> use qemu-traditional.
> 
> However, if we add your "print an error if the device doesn't fit", then
> this option will go away -- this will be a regression in functionality
> from 4.2.
> 
> I thought that what we had proposed was to have an option in xenstore,
> that libxl would set, which would instruct hvmloader whether to expand
> the MMIO hole and whether to relocate devices above 64-bit?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 14:54                       ` Paolo Bonzini
  0 siblings, 0 replies; 82+ messages in thread
From: Paolo Bonzini @ 2013-06-13 14:54 UTC (permalink / raw)
  To: George Dunlap
  Cc: Yongjie Ren, xen-devel, Keir Fraser, Ian Campbell,
	Stefano Stabellini, Xudong Hao, hanweidong, Tim Deegan,
	qemu-devel, yanqiangjun, wangzhenguo, xiaowei.yang, arei.gonglei,
	Jan Beulich, luonengjun, YongweiX Xu, SongtaoX Liu

Il 13/06/2013 09:54, George Dunlap ha scritto:
> 
> Also, as I understand it, at the moment:
> 1. Some operating systems (32-bit XP) won't be able to use relocated
> devices
> 2. Some devices (without 64-bit BARs) can't be relocated

Are there really devices with huge 32-bit BARs?  I think #1 is the only
real problem, so far though it never was for KVM.

SeaBIOS sorts the BARs from smallest to largest alignment, and then from
smallest to largest size.  Typically only the GPU will be relocated.

Paolo

> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> 
> So if we have #1 or #2, at the moment an option for a work-around is to
> use qemu-traditional.
> 
> However, if we add your "print an error if the device doesn't fit", then
> this option will go away -- this will be a regression in functionality
> from 4.2.
> 
> I thought that what we had proposed was to have an option in xenstore,
> that libxl would set, which would instruct hvmloader whether to expand
> the MMIO hole and whether to relocate devices above 64-bit?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 13:54                     ` George Dunlap
@ 2013-06-13 15:16                       ` Ian Campbell
  -1 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-13 15:16 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	Xudong Hao, Stefano Stabellini, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 2013-06-13 at 14:54 +0100, George Dunlap wrote:
> On 13/06/13 14:44, Stefano Stabellini wrote:
> > On Wed, 12 Jun 2013, George Dunlap wrote:
> >> On 12/06/13 08:25, Jan Beulich wrote:
> >>>>>> On 11.06.13 at 19:26, Stefano Stabellini
> >>>>>> <stefano.stabellini@eu.citrix.com> wrote:
> >>>> I went through the code that maps the PCI MMIO regions in hvmloader
> >>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> >>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> >>>> region is larger than 512MB.
> >>>>
> >>>> Maybe we could just relax this condition and map the device memory to
> >>>> high memory no matter the size of the MMIO region if the PCI bar is
> >>>> 64-bit?
> >>> I can only recommend not to: For one, guests not using PAE or
> >>> PSE-36 can't map such space at all (and older OSes may not
> >>> properly deal with 64-bit BARs at all). And then one would generally
> >>> expect this allocation to be done top down (to minimize risk of
> >>> running into RAM), and doing so is going to present further risks of
> >>> incompatibilities with guest OSes (Linux for example learned only in
> >>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> >>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> >>> PFN to pfn_pte(), the respective parameter of which is
> >>> "unsigned long").
> >>>
> >>> I think this ought to be done in an iterative process - if all MMIO
> >>> regions together don't fit below 4G, the biggest one should be
> >>> moved up beyond 4G first, followed by the next to biggest one
> >>> etc.
> >> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> >> temporary work-around.  It should only be done if a device doesn't fit in the
> >> current MMIO range.
> >>
> >> We have three options here:
> >> 1. Don't do anything
> >> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
> >> fit
> >> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
> >> memory).
> >> 4. Add a mechanism to tell qemu that memory is being relocated.
> >>
> >> Number 4 is definitely the right answer long-term, but we just don't have time
> >> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
> >> if it is, it may have unpredictable knock-on effects.
> >>
> >> Doing #2, it is true that many guests will be unable to access the device
> >> because of 32-bit limitations.  However, in #1, *no* guests will be able to
> >> access the device.  At least in #2, *many* guests will be able to do so.  In
> >> any case, apparently #2 is what KVM does, so having the limitation on guests
> >> is not without precedent.  It's also likely to be a somewhat tested
> >> configuration (unlike #3, for example).
> > I would avoid #3, because I don't think is a good idea to rely on that
> > behaviour.
> > I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> > easy and certainly not doable in time for 4.3.
> >
> > So we are left to play with the PCI MMIO region size and location in
> > hvmloader.
> >
> > I agree with Jan that we shouldn't relocate unconditionally all the
> > devices to the region above 4G. I meant to say that we should relocate
> > only the ones that don't fit. And we shouldn't try to dynamically
> > increase the PCI hole below 4G because clearly that doesn't work.
> > However we could still increase the size of the PCI hole below 4G by
> > default from start at 0xf0000000 to starting at 0xe0000000.
> > Why do we know that is safe? Because in the current configuration
> > hvmloader *already* increases the PCI hole size by decreasing the start
> > address every time a device doesn't fit.
> > So it's already common for hvmloader to set pci_mem_start to
> > 0xe0000000, you just need to assign a device with a PCI hole size big
> > enough.

Isn't this the exact case which is broken? And therefore not known safe
at all?

> >
> >
> > My proposed solution is:
> >
> > - set 0xe0000000 as the default PCI hole start for everybody, including
> > qemu-xen-traditional

What is the impact on existing qemu-trad guests?

It does mean that guest which were installed with a bit less than 4GB
RAM may now find a little bit of RAM moves above 4GB to make room for
the bigger whole. If they can dynamically enable PAE that might be ok.

Does this have any impact on Windows activation?

> > - move above 4G everything that doesn't fit and support 64-bit bars
> > - print an error if the device doesn't fit and doesn't support 64-bit
> > bars
> 
> Also, as I understand it, at the moment:
> 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> 2. Some devices (without 64-bit BARs) can't be relocated
> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> 
> So if we have #1 or #2, at the moment an option for a work-around is to 
> use qemu-traditional.
> 
> However, if we add your "print an error if the device doesn't fit", then 
> this option will go away -- this will be a regression in functionality 
> from 4.2.

Only if print an error also involves aborting. It could print an error
(lets call it a warning) and continue, which would leave the workaround
viable.

> I thought that what we had proposed was to have an option in xenstore, 
> that libxl would set, which would instruct hvmloader whether to expand 
> the MMIO hole and whether to relocate devices above 64-bit?
> 
>   -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 15:16                       ` Ian Campbell
  0 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-13 15:16 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	Xudong Hao, Stefano Stabellini, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 2013-06-13 at 14:54 +0100, George Dunlap wrote:
> On 13/06/13 14:44, Stefano Stabellini wrote:
> > On Wed, 12 Jun 2013, George Dunlap wrote:
> >> On 12/06/13 08:25, Jan Beulich wrote:
> >>>>>> On 11.06.13 at 19:26, Stefano Stabellini
> >>>>>> <stefano.stabellini@eu.citrix.com> wrote:
> >>>> I went through the code that maps the PCI MMIO regions in hvmloader
> >>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> >>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> >>>> region is larger than 512MB.
> >>>>
> >>>> Maybe we could just relax this condition and map the device memory to
> >>>> high memory no matter the size of the MMIO region if the PCI bar is
> >>>> 64-bit?
> >>> I can only recommend not to: For one, guests not using PAE or
> >>> PSE-36 can't map such space at all (and older OSes may not
> >>> properly deal with 64-bit BARs at all). And then one would generally
> >>> expect this allocation to be done top down (to minimize risk of
> >>> running into RAM), and doing so is going to present further risks of
> >>> incompatibilities with guest OSes (Linux for example learned only in
> >>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> >>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> >>> PFN to pfn_pte(), the respective parameter of which is
> >>> "unsigned long").
> >>>
> >>> I think this ought to be done in an iterative process - if all MMIO
> >>> regions together don't fit below 4G, the biggest one should be
> >>> moved up beyond 4G first, followed by the next to biggest one
> >>> etc.
> >> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> >> temporary work-around.  It should only be done if a device doesn't fit in the
> >> current MMIO range.
> >>
> >> We have three options here:
> >> 1. Don't do anything
> >> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
> >> fit
> >> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
> >> memory).
> >> 4. Add a mechanism to tell qemu that memory is being relocated.
> >>
> >> Number 4 is definitely the right answer long-term, but we just don't have time
> >> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
> >> if it is, it may have unpredictable knock-on effects.
> >>
> >> Doing #2, it is true that many guests will be unable to access the device
> >> because of 32-bit limitations.  However, in #1, *no* guests will be able to
> >> access the device.  At least in #2, *many* guests will be able to do so.  In
> >> any case, apparently #2 is what KVM does, so having the limitation on guests
> >> is not without precedent.  It's also likely to be a somewhat tested
> >> configuration (unlike #3, for example).
> > I would avoid #3, because I don't think is a good idea to rely on that
> > behaviour.
> > I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> > easy and certainly not doable in time for 4.3.
> >
> > So we are left to play with the PCI MMIO region size and location in
> > hvmloader.
> >
> > I agree with Jan that we shouldn't relocate unconditionally all the
> > devices to the region above 4G. I meant to say that we should relocate
> > only the ones that don't fit. And we shouldn't try to dynamically
> > increase the PCI hole below 4G because clearly that doesn't work.
> > However we could still increase the size of the PCI hole below 4G by
> > default from start at 0xf0000000 to starting at 0xe0000000.
> > Why do we know that is safe? Because in the current configuration
> > hvmloader *already* increases the PCI hole size by decreasing the start
> > address every time a device doesn't fit.
> > So it's already common for hvmloader to set pci_mem_start to
> > 0xe0000000, you just need to assign a device with a PCI hole size big
> > enough.

Isn't this the exact case which is broken? And therefore not known safe
at all?

> >
> >
> > My proposed solution is:
> >
> > - set 0xe0000000 as the default PCI hole start for everybody, including
> > qemu-xen-traditional

What is the impact on existing qemu-trad guests?

It does mean that guest which were installed with a bit less than 4GB
RAM may now find a little bit of RAM moves above 4GB to make room for
the bigger whole. If they can dynamically enable PAE that might be ok.

Does this have any impact on Windows activation?

> > - move above 4G everything that doesn't fit and support 64-bit bars
> > - print an error if the device doesn't fit and doesn't support 64-bit
> > bars
> 
> Also, as I understand it, at the moment:
> 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> 2. Some devices (without 64-bit BARs) can't be relocated
> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> 
> So if we have #1 or #2, at the moment an option for a work-around is to 
> use qemu-traditional.
> 
> However, if we add your "print an error if the device doesn't fit", then 
> this option will go away -- this will be a regression in functionality 
> from 4.2.

Only if print an error also involves aborting. It could print an error
(lets call it a warning) and continue, which would leave the workaround
viable.

> I thought that what we had proposed was to have an option in xenstore, 
> that libxl would set, which would instruct hvmloader whether to expand 
> the MMIO hole and whether to relocate devices above 64-bit?
> 
>   -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 15:16                       ` Ian Campbell
@ 2013-06-13 15:30                         ` George Dunlap
  -1 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-13 15:30 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	Xudong Hao, Stefano Stabellini, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On 13/06/13 16:16, Ian Campbell wrote:
> On Thu, 2013-06-13 at 14:54 +0100, George Dunlap wrote:
>> On 13/06/13 14:44, Stefano Stabellini wrote:
>>> On Wed, 12 Jun 2013, George Dunlap wrote:
>>>> On 12/06/13 08:25, Jan Beulich wrote:
>>>>>>>> On 11.06.13 at 19:26, Stefano Stabellini
>>>>>>>> <stefano.stabellini@eu.citrix.com> wrote:
>>>>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>>>>> region is larger than 512MB.
>>>>>>
>>>>>> Maybe we could just relax this condition and map the device memory to
>>>>>> high memory no matter the size of the MMIO region if the PCI bar is
>>>>>> 64-bit?
>>>>> I can only recommend not to: For one, guests not using PAE or
>>>>> PSE-36 can't map such space at all (and older OSes may not
>>>>> properly deal with 64-bit BARs at all). And then one would generally
>>>>> expect this allocation to be done top down (to minimize risk of
>>>>> running into RAM), and doing so is going to present further risks of
>>>>> incompatibilities with guest OSes (Linux for example learned only in
>>>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>>>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>>>>> PFN to pfn_pte(), the respective parameter of which is
>>>>> "unsigned long").
>>>>>
>>>>> I think this ought to be done in an iterative process - if all MMIO
>>>>> regions together don't fit below 4G, the biggest one should be
>>>>> moved up beyond 4G first, followed by the next to biggest one
>>>>> etc.
>>>> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
>>>> temporary work-around.  It should only be done if a device doesn't fit in the
>>>> current MMIO range.
>>>>
>>>> We have three options here:
>>>> 1. Don't do anything
>>>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
>>>> fit
>>>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
>>>> memory).
>>>> 4. Add a mechanism to tell qemu that memory is being relocated.
>>>>
>>>> Number 4 is definitely the right answer long-term, but we just don't have time
>>>> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
>>>> if it is, it may have unpredictable knock-on effects.
>>>>
>>>> Doing #2, it is true that many guests will be unable to access the device
>>>> because of 32-bit limitations.  However, in #1, *no* guests will be able to
>>>> access the device.  At least in #2, *many* guests will be able to do so.  In
>>>> any case, apparently #2 is what KVM does, so having the limitation on guests
>>>> is not without precedent.  It's also likely to be a somewhat tested
>>>> configuration (unlike #3, for example).
>>> I would avoid #3, because I don't think is a good idea to rely on that
>>> behaviour.
>>> I would also avoid #4, because having seen QEMU's code, it's wouldn't be
>>> easy and certainly not doable in time for 4.3.
>>>
>>> So we are left to play with the PCI MMIO region size and location in
>>> hvmloader.
>>>
>>> I agree with Jan that we shouldn't relocate unconditionally all the
>>> devices to the region above 4G. I meant to say that we should relocate
>>> only the ones that don't fit. And we shouldn't try to dynamically
>>> increase the PCI hole below 4G because clearly that doesn't work.
>>> However we could still increase the size of the PCI hole below 4G by
>>> default from start at 0xf0000000 to starting at 0xe0000000.
>>> Why do we know that is safe? Because in the current configuration
>>> hvmloader *already* increases the PCI hole size by decreasing the start
>>> address every time a device doesn't fit.
>>> So it's already common for hvmloader to set pci_mem_start to
>>> 0xe0000000, you just need to assign a device with a PCI hole size big
>>> enough.
> Isn't this the exact case which is broken? And therefore not known safe
> at all?
>
>>>
>>> My proposed solution is:
>>>
>>> - set 0xe0000000 as the default PCI hole start for everybody, including
>>> qemu-xen-traditional
> What is the impact on existing qemu-trad guests?
>
> It does mean that guest which were installed with a bit less than 4GB
> RAM may now find a little bit of RAM moves above 4GB to make room for
> the bigger whole. If they can dynamically enable PAE that might be ok.
>
> Does this have any impact on Windows activation?
>
>>> - move above 4G everything that doesn't fit and support 64-bit bars
>>> - print an error if the device doesn't fit and doesn't support 64-bit
>>> bars
>> Also, as I understand it, at the moment:
>> 1. Some operating systems (32-bit XP) won't be able to use relocated devices
>> 2. Some devices (without 64-bit BARs) can't be relocated
>> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
>>
>> So if we have #1 or #2, at the moment an option for a work-around is to
>> use qemu-traditional.
>>
>> However, if we add your "print an error if the device doesn't fit", then
>> this option will go away -- this will be a regression in functionality
>> from 4.2.
> Only if print an error also involves aborting. It could print an error
> (lets call it a warning) and continue, which would leave the workaround
> viable.\

No, because if hvmloader doesn't increase the size of the MMIO hole, 
then the device won't actually work.  The guest will boot, but the OS 
will not be able to use it.

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 15:30                         ` George Dunlap
  0 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-13 15:30 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	Xudong Hao, Stefano Stabellini, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On 13/06/13 16:16, Ian Campbell wrote:
> On Thu, 2013-06-13 at 14:54 +0100, George Dunlap wrote:
>> On 13/06/13 14:44, Stefano Stabellini wrote:
>>> On Wed, 12 Jun 2013, George Dunlap wrote:
>>>> On 12/06/13 08:25, Jan Beulich wrote:
>>>>>>>> On 11.06.13 at 19:26, Stefano Stabellini
>>>>>>>> <stefano.stabellini@eu.citrix.com> wrote:
>>>>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>>>>> region is larger than 512MB.
>>>>>>
>>>>>> Maybe we could just relax this condition and map the device memory to
>>>>>> high memory no matter the size of the MMIO region if the PCI bar is
>>>>>> 64-bit?
>>>>> I can only recommend not to: For one, guests not using PAE or
>>>>> PSE-36 can't map such space at all (and older OSes may not
>>>>> properly deal with 64-bit BARs at all). And then one would generally
>>>>> expect this allocation to be done top down (to minimize risk of
>>>>> running into RAM), and doing so is going to present further risks of
>>>>> incompatibilities with guest OSes (Linux for example learned only in
>>>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>>>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>>>>> PFN to pfn_pte(), the respective parameter of which is
>>>>> "unsigned long").
>>>>>
>>>>> I think this ought to be done in an iterative process - if all MMIO
>>>>> regions together don't fit below 4G, the biggest one should be
>>>>> moved up beyond 4G first, followed by the next to biggest one
>>>>> etc.
>>>> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
>>>> temporary work-around.  It should only be done if a device doesn't fit in the
>>>> current MMIO range.
>>>>
>>>> We have three options here:
>>>> 1. Don't do anything
>>>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
>>>> fit
>>>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
>>>> memory).
>>>> 4. Add a mechanism to tell qemu that memory is being relocated.
>>>>
>>>> Number 4 is definitely the right answer long-term, but we just don't have time
>>>> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
>>>> if it is, it may have unpredictable knock-on effects.
>>>>
>>>> Doing #2, it is true that many guests will be unable to access the device
>>>> because of 32-bit limitations.  However, in #1, *no* guests will be able to
>>>> access the device.  At least in #2, *many* guests will be able to do so.  In
>>>> any case, apparently #2 is what KVM does, so having the limitation on guests
>>>> is not without precedent.  It's also likely to be a somewhat tested
>>>> configuration (unlike #3, for example).
>>> I would avoid #3, because I don't think is a good idea to rely on that
>>> behaviour.
>>> I would also avoid #4, because having seen QEMU's code, it's wouldn't be
>>> easy and certainly not doable in time for 4.3.
>>>
>>> So we are left to play with the PCI MMIO region size and location in
>>> hvmloader.
>>>
>>> I agree with Jan that we shouldn't relocate unconditionally all the
>>> devices to the region above 4G. I meant to say that we should relocate
>>> only the ones that don't fit. And we shouldn't try to dynamically
>>> increase the PCI hole below 4G because clearly that doesn't work.
>>> However we could still increase the size of the PCI hole below 4G by
>>> default from start at 0xf0000000 to starting at 0xe0000000.
>>> Why do we know that is safe? Because in the current configuration
>>> hvmloader *already* increases the PCI hole size by decreasing the start
>>> address every time a device doesn't fit.
>>> So it's already common for hvmloader to set pci_mem_start to
>>> 0xe0000000, you just need to assign a device with a PCI hole size big
>>> enough.
> Isn't this the exact case which is broken? And therefore not known safe
> at all?
>
>>>
>>> My proposed solution is:
>>>
>>> - set 0xe0000000 as the default PCI hole start for everybody, including
>>> qemu-xen-traditional
> What is the impact on existing qemu-trad guests?
>
> It does mean that guest which were installed with a bit less than 4GB
> RAM may now find a little bit of RAM moves above 4GB to make room for
> the bigger whole. If they can dynamically enable PAE that might be ok.
>
> Does this have any impact on Windows activation?
>
>>> - move above 4G everything that doesn't fit and support 64-bit bars
>>> - print an error if the device doesn't fit and doesn't support 64-bit
>>> bars
>> Also, as I understand it, at the moment:
>> 1. Some operating systems (32-bit XP) won't be able to use relocated devices
>> 2. Some devices (without 64-bit BARs) can't be relocated
>> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
>>
>> So if we have #1 or #2, at the moment an option for a work-around is to
>> use qemu-traditional.
>>
>> However, if we add your "print an error if the device doesn't fit", then
>> this option will go away -- this will be a regression in functionality
>> from 4.2.
> Only if print an error also involves aborting. It could print an error
> (lets call it a warning) and continue, which would leave the workaround
> viable.\

No, because if hvmloader doesn't increase the size of the MMIO hole, 
then the device won't actually work.  The guest will boot, but the OS 
will not be able to use it.

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 15:30                         ` George Dunlap
@ 2013-06-13 15:36                           ` Ian Campbell
  -1 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-13 15:36 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	Xudong Hao, Stefano Stabellini, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 2013-06-13 at 16:30 +0100, George Dunlap wrote:
> On 13/06/13 16:16, Ian Campbell wrote:
> > On Thu, 2013-06-13 at 14:54 +0100, George Dunlap wrote:
> >> On 13/06/13 14:44, Stefano Stabellini wrote:
> >>> On Wed, 12 Jun 2013, George Dunlap wrote:
> >>>> On 12/06/13 08:25, Jan Beulich wrote:
> >>>>>>>> On 11.06.13 at 19:26, Stefano Stabellini
> >>>>>>>> <stefano.stabellini@eu.citrix.com> wrote:
> >>>>>> I went through the code that maps the PCI MMIO regions in hvmloader
> >>>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> >>>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> >>>>>> region is larger than 512MB.
> >>>>>>
> >>>>>> Maybe we could just relax this condition and map the device memory to
> >>>>>> high memory no matter the size of the MMIO region if the PCI bar is
> >>>>>> 64-bit?
> >>>>> I can only recommend not to: For one, guests not using PAE or
> >>>>> PSE-36 can't map such space at all (and older OSes may not
> >>>>> properly deal with 64-bit BARs at all). And then one would generally
> >>>>> expect this allocation to be done top down (to minimize risk of
> >>>>> running into RAM), and doing so is going to present further risks of
> >>>>> incompatibilities with guest OSes (Linux for example learned only in
> >>>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> >>>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> >>>>> PFN to pfn_pte(), the respective parameter of which is
> >>>>> "unsigned long").
> >>>>>
> >>>>> I think this ought to be done in an iterative process - if all MMIO
> >>>>> regions together don't fit below 4G, the biggest one should be
> >>>>> moved up beyond 4G first, followed by the next to biggest one
> >>>>> etc.
> >>>> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> >>>> temporary work-around.  It should only be done if a device doesn't fit in the
> >>>> current MMIO range.
> >>>>
> >>>> We have three options here:
> >>>> 1. Don't do anything
> >>>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
> >>>> fit
> >>>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
> >>>> memory).
> >>>> 4. Add a mechanism to tell qemu that memory is being relocated.
> >>>>
> >>>> Number 4 is definitely the right answer long-term, but we just don't have time
> >>>> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
> >>>> if it is, it may have unpredictable knock-on effects.
> >>>>
> >>>> Doing #2, it is true that many guests will be unable to access the device
> >>>> because of 32-bit limitations.  However, in #1, *no* guests will be able to
> >>>> access the device.  At least in #2, *many* guests will be able to do so.  In
> >>>> any case, apparently #2 is what KVM does, so having the limitation on guests
> >>>> is not without precedent.  It's also likely to be a somewhat tested
> >>>> configuration (unlike #3, for example).
> >>> I would avoid #3, because I don't think is a good idea to rely on that
> >>> behaviour.
> >>> I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> >>> easy and certainly not doable in time for 4.3.
> >>>
> >>> So we are left to play with the PCI MMIO region size and location in
> >>> hvmloader.
> >>>
> >>> I agree with Jan that we shouldn't relocate unconditionally all the
> >>> devices to the region above 4G. I meant to say that we should relocate
> >>> only the ones that don't fit. And we shouldn't try to dynamically
> >>> increase the PCI hole below 4G because clearly that doesn't work.
> >>> However we could still increase the size of the PCI hole below 4G by
> >>> default from start at 0xf0000000 to starting at 0xe0000000.
> >>> Why do we know that is safe? Because in the current configuration
> >>> hvmloader *already* increases the PCI hole size by decreasing the start
> >>> address every time a device doesn't fit.
> >>> So it's already common for hvmloader to set pci_mem_start to
> >>> 0xe0000000, you just need to assign a device with a PCI hole size big
> >>> enough.
> > Isn't this the exact case which is broken? And therefore not known safe
> > at all?
> >
> >>>
> >>> My proposed solution is:
> >>>
> >>> - set 0xe0000000 as the default PCI hole start for everybody, including
> >>> qemu-xen-traditional
> > What is the impact on existing qemu-trad guests?
> >
> > It does mean that guest which were installed with a bit less than 4GB
> > RAM may now find a little bit of RAM moves above 4GB to make room for
> > the bigger whole. If they can dynamically enable PAE that might be ok.
> >
> > Does this have any impact on Windows activation?
> >
> >>> - move above 4G everything that doesn't fit and support 64-bit bars
> >>> - print an error if the device doesn't fit and doesn't support 64-bit
> >>> bars
> >> Also, as I understand it, at the moment:
> >> 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> >> 2. Some devices (without 64-bit BARs) can't be relocated
> >> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> >>
> >> So if we have #1 or #2, at the moment an option for a work-around is to
> >> use qemu-traditional.
> >>
> >> However, if we add your "print an error if the device doesn't fit", then
> >> this option will go away -- this will be a regression in functionality
> >> from 4.2.
> > Only if print an error also involves aborting. It could print an error
> > (lets call it a warning) and continue, which would leave the workaround
> > viable.\
> 
> No, because if hvmloader doesn't increase the size of the MMIO hole, 
> then the device won't actually work.  The guest will boot, but the OS 
> will not be able to use it.

I meant continue as in increasing the hole too, although rereading the
thread maybe that's not what everyone else was talking about ;-)

> 
>   -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 15:36                           ` Ian Campbell
  0 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-13 15:36 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	Xudong Hao, Stefano Stabellini, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 2013-06-13 at 16:30 +0100, George Dunlap wrote:
> On 13/06/13 16:16, Ian Campbell wrote:
> > On Thu, 2013-06-13 at 14:54 +0100, George Dunlap wrote:
> >> On 13/06/13 14:44, Stefano Stabellini wrote:
> >>> On Wed, 12 Jun 2013, George Dunlap wrote:
> >>>> On 12/06/13 08:25, Jan Beulich wrote:
> >>>>>>>> On 11.06.13 at 19:26, Stefano Stabellini
> >>>>>>>> <stefano.stabellini@eu.citrix.com> wrote:
> >>>>>> I went through the code that maps the PCI MMIO regions in hvmloader
> >>>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> >>>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> >>>>>> region is larger than 512MB.
> >>>>>>
> >>>>>> Maybe we could just relax this condition and map the device memory to
> >>>>>> high memory no matter the size of the MMIO region if the PCI bar is
> >>>>>> 64-bit?
> >>>>> I can only recommend not to: For one, guests not using PAE or
> >>>>> PSE-36 can't map such space at all (and older OSes may not
> >>>>> properly deal with 64-bit BARs at all). And then one would generally
> >>>>> expect this allocation to be done top down (to minimize risk of
> >>>>> running into RAM), and doing so is going to present further risks of
> >>>>> incompatibilities with guest OSes (Linux for example learned only in
> >>>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> >>>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> >>>>> PFN to pfn_pte(), the respective parameter of which is
> >>>>> "unsigned long").
> >>>>>
> >>>>> I think this ought to be done in an iterative process - if all MMIO
> >>>>> regions together don't fit below 4G, the biggest one should be
> >>>>> moved up beyond 4G first, followed by the next to biggest one
> >>>>> etc.
> >>>> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> >>>> temporary work-around.  It should only be done if a device doesn't fit in the
> >>>> current MMIO range.
> >>>>
> >>>> We have three options here:
> >>>> 1. Don't do anything
> >>>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
> >>>> fit
> >>>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
> >>>> memory).
> >>>> 4. Add a mechanism to tell qemu that memory is being relocated.
> >>>>
> >>>> Number 4 is definitely the right answer long-term, but we just don't have time
> >>>> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
> >>>> if it is, it may have unpredictable knock-on effects.
> >>>>
> >>>> Doing #2, it is true that many guests will be unable to access the device
> >>>> because of 32-bit limitations.  However, in #1, *no* guests will be able to
> >>>> access the device.  At least in #2, *many* guests will be able to do so.  In
> >>>> any case, apparently #2 is what KVM does, so having the limitation on guests
> >>>> is not without precedent.  It's also likely to be a somewhat tested
> >>>> configuration (unlike #3, for example).
> >>> I would avoid #3, because I don't think is a good idea to rely on that
> >>> behaviour.
> >>> I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> >>> easy and certainly not doable in time for 4.3.
> >>>
> >>> So we are left to play with the PCI MMIO region size and location in
> >>> hvmloader.
> >>>
> >>> I agree with Jan that we shouldn't relocate unconditionally all the
> >>> devices to the region above 4G. I meant to say that we should relocate
> >>> only the ones that don't fit. And we shouldn't try to dynamically
> >>> increase the PCI hole below 4G because clearly that doesn't work.
> >>> However we could still increase the size of the PCI hole below 4G by
> >>> default from start at 0xf0000000 to starting at 0xe0000000.
> >>> Why do we know that is safe? Because in the current configuration
> >>> hvmloader *already* increases the PCI hole size by decreasing the start
> >>> address every time a device doesn't fit.
> >>> So it's already common for hvmloader to set pci_mem_start to
> >>> 0xe0000000, you just need to assign a device with a PCI hole size big
> >>> enough.
> > Isn't this the exact case which is broken? And therefore not known safe
> > at all?
> >
> >>>
> >>> My proposed solution is:
> >>>
> >>> - set 0xe0000000 as the default PCI hole start for everybody, including
> >>> qemu-xen-traditional
> > What is the impact on existing qemu-trad guests?
> >
> > It does mean that guest which were installed with a bit less than 4GB
> > RAM may now find a little bit of RAM moves above 4GB to make room for
> > the bigger whole. If they can dynamically enable PAE that might be ok.
> >
> > Does this have any impact on Windows activation?
> >
> >>> - move above 4G everything that doesn't fit and support 64-bit bars
> >>> - print an error if the device doesn't fit and doesn't support 64-bit
> >>> bars
> >> Also, as I understand it, at the moment:
> >> 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> >> 2. Some devices (without 64-bit BARs) can't be relocated
> >> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> >>
> >> So if we have #1 or #2, at the moment an option for a work-around is to
> >> use qemu-traditional.
> >>
> >> However, if we add your "print an error if the device doesn't fit", then
> >> this option will go away -- this will be a regression in functionality
> >> from 4.2.
> > Only if print an error also involves aborting. It could print an error
> > (lets call it a warning) and continue, which would leave the workaround
> > viable.\
> 
> No, because if hvmloader doesn't increase the size of the MMIO hole, 
> then the device won't actually work.  The guest will boot, but the OS 
> will not be able to use it.

I meant continue as in increasing the hole too, although rereading the
thread maybe that's not what everyone else was talking about ;-)

> 
>   -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 15:36                           ` Ian Campbell
@ 2013-06-13 15:40                             ` George Dunlap
  -1 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-13 15:40 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	Xudong Hao, Stefano Stabellini, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On 13/06/13 16:36, Ian Campbell wrote:
> On Thu, 2013-06-13 at 16:30 +0100, George Dunlap wrote:
>> On 13/06/13 16:16, Ian Campbell wrote:
>>> On Thu, 2013-06-13 at 14:54 +0100, George Dunlap wrote:
>>>> On 13/06/13 14:44, Stefano Stabellini wrote:
>>>>> On Wed, 12 Jun 2013, George Dunlap wrote:
>>>>>> On 12/06/13 08:25, Jan Beulich wrote:
>>>>>>>>>> On 11.06.13 at 19:26, Stefano Stabellini
>>>>>>>>>> <stefano.stabellini@eu.citrix.com> wrote:
>>>>>>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>>>>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>>>>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>>>>>>> region is larger than 512MB.
>>>>>>>>
>>>>>>>> Maybe we could just relax this condition and map the device memory to
>>>>>>>> high memory no matter the size of the MMIO region if the PCI bar is
>>>>>>>> 64-bit?
>>>>>>> I can only recommend not to: For one, guests not using PAE or
>>>>>>> PSE-36 can't map such space at all (and older OSes may not
>>>>>>> properly deal with 64-bit BARs at all). And then one would generally
>>>>>>> expect this allocation to be done top down (to minimize risk of
>>>>>>> running into RAM), and doing so is going to present further risks of
>>>>>>> incompatibilities with guest OSes (Linux for example learned only in
>>>>>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>>>>>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>>>>>>> PFN to pfn_pte(), the respective parameter of which is
>>>>>>> "unsigned long").
>>>>>>>
>>>>>>> I think this ought to be done in an iterative process - if all MMIO
>>>>>>> regions together don't fit below 4G, the biggest one should be
>>>>>>> moved up beyond 4G first, followed by the next to biggest one
>>>>>>> etc.
>>>>>> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
>>>>>> temporary work-around.  It should only be done if a device doesn't fit in the
>>>>>> current MMIO range.
>>>>>>
>>>>>> We have three options here:
>>>>>> 1. Don't do anything
>>>>>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
>>>>>> fit
>>>>>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
>>>>>> memory).
>>>>>> 4. Add a mechanism to tell qemu that memory is being relocated.
>>>>>>
>>>>>> Number 4 is definitely the right answer long-term, but we just don't have time
>>>>>> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
>>>>>> if it is, it may have unpredictable knock-on effects.
>>>>>>
>>>>>> Doing #2, it is true that many guests will be unable to access the device
>>>>>> because of 32-bit limitations.  However, in #1, *no* guests will be able to
>>>>>> access the device.  At least in #2, *many* guests will be able to do so.  In
>>>>>> any case, apparently #2 is what KVM does, so having the limitation on guests
>>>>>> is not without precedent.  It's also likely to be a somewhat tested
>>>>>> configuration (unlike #3, for example).
>>>>> I would avoid #3, because I don't think is a good idea to rely on that
>>>>> behaviour.
>>>>> I would also avoid #4, because having seen QEMU's code, it's wouldn't be
>>>>> easy and certainly not doable in time for 4.3.
>>>>>
>>>>> So we are left to play with the PCI MMIO region size and location in
>>>>> hvmloader.
>>>>>
>>>>> I agree with Jan that we shouldn't relocate unconditionally all the
>>>>> devices to the region above 4G. I meant to say that we should relocate
>>>>> only the ones that don't fit. And we shouldn't try to dynamically
>>>>> increase the PCI hole below 4G because clearly that doesn't work.
>>>>> However we could still increase the size of the PCI hole below 4G by
>>>>> default from start at 0xf0000000 to starting at 0xe0000000.
>>>>> Why do we know that is safe? Because in the current configuration
>>>>> hvmloader *already* increases the PCI hole size by decreasing the start
>>>>> address every time a device doesn't fit.
>>>>> So it's already common for hvmloader to set pci_mem_start to
>>>>> 0xe0000000, you just need to assign a device with a PCI hole size big
>>>>> enough.
>>> Isn't this the exact case which is broken? And therefore not known safe
>>> at all?
>>>
>>>>> My proposed solution is:
>>>>>
>>>>> - set 0xe0000000 as the default PCI hole start for everybody, including
>>>>> qemu-xen-traditional
>>> What is the impact on existing qemu-trad guests?
>>>
>>> It does mean that guest which were installed with a bit less than 4GB
>>> RAM may now find a little bit of RAM moves above 4GB to make room for
>>> the bigger whole. If they can dynamically enable PAE that might be ok.
>>>
>>> Does this have any impact on Windows activation?
>>>
>>>>> - move above 4G everything that doesn't fit and support 64-bit bars
>>>>> - print an error if the device doesn't fit and doesn't support 64-bit
>>>>> bars
>>>> Also, as I understand it, at the moment:
>>>> 1. Some operating systems (32-bit XP) won't be able to use relocated devices
>>>> 2. Some devices (without 64-bit BARs) can't be relocated
>>>> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
>>>>
>>>> So if we have #1 or #2, at the moment an option for a work-around is to
>>>> use qemu-traditional.
>>>>
>>>> However, if we add your "print an error if the device doesn't fit", then
>>>> this option will go away -- this will be a regression in functionality
>>>> from 4.2.
>>> Only if print an error also involves aborting. It could print an error
>>> (lets call it a warning) and continue, which would leave the workaround
>>> viable.\
>> No, because if hvmloader doesn't increase the size of the MMIO hole,
>> then the device won't actually work.  The guest will boot, but the OS
>> will not be able to use it.
> I meant continue as in increasing the hole too, although rereading the
> thread maybe that's not what everyone else was talking about ;-)

Well if you continue increasing the hole, then it works on 
qemu-traditional but on qemu-xen you have weird crashes and guest hangs 
at some point in the future when qemu tries to map a non-existent guest 
memory address -- that's much worse than the device just not being 
visible to the OS.

That's the point -- current behavior on qemu-xen causes weird hangs; but 
the simple way of preventing those hangs (just not increasing the MMIO 
hole size) removes functionality from both qemu-xen and 
qemu-traditional, even though qemu-traditional doesn't have any problems 
with the resized MMIO hole.

So there's no simple way to avoid random crashes while keeping the 
work-around functional; that's why someone suggested adding a xenstore 
key to tell hvmloader what to do.

At least, that's what I understood the situation to be -- someone 
correct me if I'm wrong. :-)

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 15:40                             ` George Dunlap
  0 siblings, 0 replies; 82+ messages in thread
From: George Dunlap @ 2013-06-13 15:40 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	Xudong Hao, Stefano Stabellini, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On 13/06/13 16:36, Ian Campbell wrote:
> On Thu, 2013-06-13 at 16:30 +0100, George Dunlap wrote:
>> On 13/06/13 16:16, Ian Campbell wrote:
>>> On Thu, 2013-06-13 at 14:54 +0100, George Dunlap wrote:
>>>> On 13/06/13 14:44, Stefano Stabellini wrote:
>>>>> On Wed, 12 Jun 2013, George Dunlap wrote:
>>>>>> On 12/06/13 08:25, Jan Beulich wrote:
>>>>>>>>>> On 11.06.13 at 19:26, Stefano Stabellini
>>>>>>>>>> <stefano.stabellini@eu.citrix.com> wrote:
>>>>>>>> I went through the code that maps the PCI MMIO regions in hvmloader
>>>>>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
>>>>>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
>>>>>>>> region is larger than 512MB.
>>>>>>>>
>>>>>>>> Maybe we could just relax this condition and map the device memory to
>>>>>>>> high memory no matter the size of the MMIO region if the PCI bar is
>>>>>>>> 64-bit?
>>>>>>> I can only recommend not to: For one, guests not using PAE or
>>>>>>> PSE-36 can't map such space at all (and older OSes may not
>>>>>>> properly deal with 64-bit BARs at all). And then one would generally
>>>>>>> expect this allocation to be done top down (to minimize risk of
>>>>>>> running into RAM), and doing so is going to present further risks of
>>>>>>> incompatibilities with guest OSes (Linux for example learned only in
>>>>>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
>>>>>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
>>>>>>> PFN to pfn_pte(), the respective parameter of which is
>>>>>>> "unsigned long").
>>>>>>>
>>>>>>> I think this ought to be done in an iterative process - if all MMIO
>>>>>>> regions together don't fit below 4G, the biggest one should be
>>>>>>> moved up beyond 4G first, followed by the next to biggest one
>>>>>>> etc.
>>>>>> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
>>>>>> temporary work-around.  It should only be done if a device doesn't fit in the
>>>>>> current MMIO range.
>>>>>>
>>>>>> We have three options here:
>>>>>> 1. Don't do anything
>>>>>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
>>>>>> fit
>>>>>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
>>>>>> memory).
>>>>>> 4. Add a mechanism to tell qemu that memory is being relocated.
>>>>>>
>>>>>> Number 4 is definitely the right answer long-term, but we just don't have time
>>>>>> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
>>>>>> if it is, it may have unpredictable knock-on effects.
>>>>>>
>>>>>> Doing #2, it is true that many guests will be unable to access the device
>>>>>> because of 32-bit limitations.  However, in #1, *no* guests will be able to
>>>>>> access the device.  At least in #2, *many* guests will be able to do so.  In
>>>>>> any case, apparently #2 is what KVM does, so having the limitation on guests
>>>>>> is not without precedent.  It's also likely to be a somewhat tested
>>>>>> configuration (unlike #3, for example).
>>>>> I would avoid #3, because I don't think is a good idea to rely on that
>>>>> behaviour.
>>>>> I would also avoid #4, because having seen QEMU's code, it's wouldn't be
>>>>> easy and certainly not doable in time for 4.3.
>>>>>
>>>>> So we are left to play with the PCI MMIO region size and location in
>>>>> hvmloader.
>>>>>
>>>>> I agree with Jan that we shouldn't relocate unconditionally all the
>>>>> devices to the region above 4G. I meant to say that we should relocate
>>>>> only the ones that don't fit. And we shouldn't try to dynamically
>>>>> increase the PCI hole below 4G because clearly that doesn't work.
>>>>> However we could still increase the size of the PCI hole below 4G by
>>>>> default from start at 0xf0000000 to starting at 0xe0000000.
>>>>> Why do we know that is safe? Because in the current configuration
>>>>> hvmloader *already* increases the PCI hole size by decreasing the start
>>>>> address every time a device doesn't fit.
>>>>> So it's already common for hvmloader to set pci_mem_start to
>>>>> 0xe0000000, you just need to assign a device with a PCI hole size big
>>>>> enough.
>>> Isn't this the exact case which is broken? And therefore not known safe
>>> at all?
>>>
>>>>> My proposed solution is:
>>>>>
>>>>> - set 0xe0000000 as the default PCI hole start for everybody, including
>>>>> qemu-xen-traditional
>>> What is the impact on existing qemu-trad guests?
>>>
>>> It does mean that guest which were installed with a bit less than 4GB
>>> RAM may now find a little bit of RAM moves above 4GB to make room for
>>> the bigger whole. If they can dynamically enable PAE that might be ok.
>>>
>>> Does this have any impact on Windows activation?
>>>
>>>>> - move above 4G everything that doesn't fit and support 64-bit bars
>>>>> - print an error if the device doesn't fit and doesn't support 64-bit
>>>>> bars
>>>> Also, as I understand it, at the moment:
>>>> 1. Some operating systems (32-bit XP) won't be able to use relocated devices
>>>> 2. Some devices (without 64-bit BARs) can't be relocated
>>>> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
>>>>
>>>> So if we have #1 or #2, at the moment an option for a work-around is to
>>>> use qemu-traditional.
>>>>
>>>> However, if we add your "print an error if the device doesn't fit", then
>>>> this option will go away -- this will be a regression in functionality
>>>> from 4.2.
>>> Only if print an error also involves aborting. It could print an error
>>> (lets call it a warning) and continue, which would leave the workaround
>>> viable.\
>> No, because if hvmloader doesn't increase the size of the MMIO hole,
>> then the device won't actually work.  The guest will boot, but the OS
>> will not be able to use it.
> I meant continue as in increasing the hole too, although rereading the
> thread maybe that's not what everyone else was talking about ;-)

Well if you continue increasing the hole, then it works on 
qemu-traditional but on qemu-xen you have weird crashes and guest hangs 
at some point in the future when qemu tries to map a non-existent guest 
memory address -- that's much worse than the device just not being 
visible to the OS.

That's the point -- current behavior on qemu-xen causes weird hangs; but 
the simple way of preventing those hangs (just not increasing the MMIO 
hole size) removes functionality from both qemu-xen and 
qemu-traditional, even though qemu-traditional doesn't have any problems 
with the resized MMIO hole.

So there's no simple way to avoid random crashes while keeping the 
work-around functional; that's why someone suggested adding a xenstore 
key to tell hvmloader what to do.

At least, that's what I understood the situation to be -- someone 
correct me if I'm wrong. :-)

  -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 15:40                             ` George Dunlap
@ 2013-06-13 15:42                               ` Ian Campbell
  -1 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-13 15:42 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	Xudong Hao, Stefano Stabellini, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 2013-06-13 at 16:40 +0100, George Dunlap wrote:
> On 13/06/13 16:36, Ian Campbell wrote:
> > On Thu, 2013-06-13 at 16:30 +0100, George Dunlap wrote:
> >> On 13/06/13 16:16, Ian Campbell wrote:
> >>> On Thu, 2013-06-13 at 14:54 +0100, George Dunlap wrote:
> >>>> On 13/06/13 14:44, Stefano Stabellini wrote:
> >>>>> On Wed, 12 Jun 2013, George Dunlap wrote:
> >>>>>> On 12/06/13 08:25, Jan Beulich wrote:
> >>>>>>>>>> On 11.06.13 at 19:26, Stefano Stabellini
> >>>>>>>>>> <stefano.stabellini@eu.citrix.com> wrote:
> >>>>>>>> I went through the code that maps the PCI MMIO regions in hvmloader
> >>>>>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> >>>>>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> >>>>>>>> region is larger than 512MB.
> >>>>>>>>
> >>>>>>>> Maybe we could just relax this condition and map the device memory to
> >>>>>>>> high memory no matter the size of the MMIO region if the PCI bar is
> >>>>>>>> 64-bit?
> >>>>>>> I can only recommend not to: For one, guests not using PAE or
> >>>>>>> PSE-36 can't map such space at all (and older OSes may not
> >>>>>>> properly deal with 64-bit BARs at all). And then one would generally
> >>>>>>> expect this allocation to be done top down (to minimize risk of
> >>>>>>> running into RAM), and doing so is going to present further risks of
> >>>>>>> incompatibilities with guest OSes (Linux for example learned only in
> >>>>>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> >>>>>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> >>>>>>> PFN to pfn_pte(), the respective parameter of which is
> >>>>>>> "unsigned long").
> >>>>>>>
> >>>>>>> I think this ought to be done in an iterative process - if all MMIO
> >>>>>>> regions together don't fit below 4G, the biggest one should be
> >>>>>>> moved up beyond 4G first, followed by the next to biggest one
> >>>>>>> etc.
> >>>>>> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> >>>>>> temporary work-around.  It should only be done if a device doesn't fit in the
> >>>>>> current MMIO range.
> >>>>>>
> >>>>>> We have three options here:
> >>>>>> 1. Don't do anything
> >>>>>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
> >>>>>> fit
> >>>>>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
> >>>>>> memory).
> >>>>>> 4. Add a mechanism to tell qemu that memory is being relocated.
> >>>>>>
> >>>>>> Number 4 is definitely the right answer long-term, but we just don't have time
> >>>>>> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
> >>>>>> if it is, it may have unpredictable knock-on effects.
> >>>>>>
> >>>>>> Doing #2, it is true that many guests will be unable to access the device
> >>>>>> because of 32-bit limitations.  However, in #1, *no* guests will be able to
> >>>>>> access the device.  At least in #2, *many* guests will be able to do so.  In
> >>>>>> any case, apparently #2 is what KVM does, so having the limitation on guests
> >>>>>> is not without precedent.  It's also likely to be a somewhat tested
> >>>>>> configuration (unlike #3, for example).
> >>>>> I would avoid #3, because I don't think is a good idea to rely on that
> >>>>> behaviour.
> >>>>> I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> >>>>> easy and certainly not doable in time for 4.3.
> >>>>>
> >>>>> So we are left to play with the PCI MMIO region size and location in
> >>>>> hvmloader.
> >>>>>
> >>>>> I agree with Jan that we shouldn't relocate unconditionally all the
> >>>>> devices to the region above 4G. I meant to say that we should relocate
> >>>>> only the ones that don't fit. And we shouldn't try to dynamically
> >>>>> increase the PCI hole below 4G because clearly that doesn't work.
> >>>>> However we could still increase the size of the PCI hole below 4G by
> >>>>> default from start at 0xf0000000 to starting at 0xe0000000.
> >>>>> Why do we know that is safe? Because in the current configuration
> >>>>> hvmloader *already* increases the PCI hole size by decreasing the start
> >>>>> address every time a device doesn't fit.
> >>>>> So it's already common for hvmloader to set pci_mem_start to
> >>>>> 0xe0000000, you just need to assign a device with a PCI hole size big
> >>>>> enough.
> >>> Isn't this the exact case which is broken? And therefore not known safe
> >>> at all?
> >>>
> >>>>> My proposed solution is:
> >>>>>
> >>>>> - set 0xe0000000 as the default PCI hole start for everybody, including
> >>>>> qemu-xen-traditional
> >>> What is the impact on existing qemu-trad guests?
> >>>
> >>> It does mean that guest which were installed with a bit less than 4GB
> >>> RAM may now find a little bit of RAM moves above 4GB to make room for
> >>> the bigger whole. If they can dynamically enable PAE that might be ok.
> >>>
> >>> Does this have any impact on Windows activation?
> >>>
> >>>>> - move above 4G everything that doesn't fit and support 64-bit bars
> >>>>> - print an error if the device doesn't fit and doesn't support 64-bit
> >>>>> bars
> >>>> Also, as I understand it, at the moment:
> >>>> 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> >>>> 2. Some devices (without 64-bit BARs) can't be relocated
> >>>> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> >>>>
> >>>> So if we have #1 or #2, at the moment an option for a work-around is to
> >>>> use qemu-traditional.
> >>>>
> >>>> However, if we add your "print an error if the device doesn't fit", then
> >>>> this option will go away -- this will be a regression in functionality
> >>>> from 4.2.
> >>> Only if print an error also involves aborting. It could print an error
> >>> (lets call it a warning) and continue, which would leave the workaround
> >>> viable.\
> >> No, because if hvmloader doesn't increase the size of the MMIO hole,
> >> then the device won't actually work.  The guest will boot, but the OS
> >> will not be able to use it.
> > I meant continue as in increasing the hole too, although rereading the
> > thread maybe that's not what everyone else was talking about ;-)
> 
> Well if you continue increasing the hole, then it works on 
> qemu-traditional but on qemu-xen you have weird crashes and guest hangs 
> at some point in the future when qemu tries to map a non-existent guest 
> memory address -- that's much worse than the device just not being 
> visible to the OS.

I thought the point of the print was simply to give us something to spot
in the logs in this latter case.

> That's the point -- current behavior on qemu-xen causes weird hangs; but 
> the simple way of preventing those hangs (just not increasing the MMIO 
> hole size) removes functionality from both qemu-xen and 
> qemu-traditional, even though qemu-traditional doesn't have any problems 
> with the resized MMIO hole.
> 
> So there's no simple way to avoid random crashes while keeping the 
> work-around functional; that's why someone suggested adding a xenstore 
> key to tell hvmloader what to do.
> 
> At least, that's what I understood the situation to be -- someone 
> correct me if I'm wrong. :-)
> 
>   -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 15:42                               ` Ian Campbell
  0 siblings, 0 replies; 82+ messages in thread
From: Ian Campbell @ 2013-06-13 15:42 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	Xudong Hao, Stefano Stabellini, luonengjun, qemu-devel,
	wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 2013-06-13 at 16:40 +0100, George Dunlap wrote:
> On 13/06/13 16:36, Ian Campbell wrote:
> > On Thu, 2013-06-13 at 16:30 +0100, George Dunlap wrote:
> >> On 13/06/13 16:16, Ian Campbell wrote:
> >>> On Thu, 2013-06-13 at 14:54 +0100, George Dunlap wrote:
> >>>> On 13/06/13 14:44, Stefano Stabellini wrote:
> >>>>> On Wed, 12 Jun 2013, George Dunlap wrote:
> >>>>>> On 12/06/13 08:25, Jan Beulich wrote:
> >>>>>>>>>> On 11.06.13 at 19:26, Stefano Stabellini
> >>>>>>>>>> <stefano.stabellini@eu.citrix.com> wrote:
> >>>>>>>> I went through the code that maps the PCI MMIO regions in hvmloader
> >>>>>>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> >>>>>>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> >>>>>>>> region is larger than 512MB.
> >>>>>>>>
> >>>>>>>> Maybe we could just relax this condition and map the device memory to
> >>>>>>>> high memory no matter the size of the MMIO region if the PCI bar is
> >>>>>>>> 64-bit?
> >>>>>>> I can only recommend not to: For one, guests not using PAE or
> >>>>>>> PSE-36 can't map such space at all (and older OSes may not
> >>>>>>> properly deal with 64-bit BARs at all). And then one would generally
> >>>>>>> expect this allocation to be done top down (to minimize risk of
> >>>>>>> running into RAM), and doing so is going to present further risks of
> >>>>>>> incompatibilities with guest OSes (Linux for example learned only in
> >>>>>>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> >>>>>>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> >>>>>>> PFN to pfn_pte(), the respective parameter of which is
> >>>>>>> "unsigned long").
> >>>>>>>
> >>>>>>> I think this ought to be done in an iterative process - if all MMIO
> >>>>>>> regions together don't fit below 4G, the biggest one should be
> >>>>>>> moved up beyond 4G first, followed by the next to biggest one
> >>>>>>> etc.
> >>>>>> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> >>>>>> temporary work-around.  It should only be done if a device doesn't fit in the
> >>>>>> current MMIO range.
> >>>>>>
> >>>>>> We have three options here:
> >>>>>> 1. Don't do anything
> >>>>>> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
> >>>>>> fit
> >>>>>> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
> >>>>>> memory).
> >>>>>> 4. Add a mechanism to tell qemu that memory is being relocated.
> >>>>>>
> >>>>>> Number 4 is definitely the right answer long-term, but we just don't have time
> >>>>>> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
> >>>>>> if it is, it may have unpredictable knock-on effects.
> >>>>>>
> >>>>>> Doing #2, it is true that many guests will be unable to access the device
> >>>>>> because of 32-bit limitations.  However, in #1, *no* guests will be able to
> >>>>>> access the device.  At least in #2, *many* guests will be able to do so.  In
> >>>>>> any case, apparently #2 is what KVM does, so having the limitation on guests
> >>>>>> is not without precedent.  It's also likely to be a somewhat tested
> >>>>>> configuration (unlike #3, for example).
> >>>>> I would avoid #3, because I don't think is a good idea to rely on that
> >>>>> behaviour.
> >>>>> I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> >>>>> easy and certainly not doable in time for 4.3.
> >>>>>
> >>>>> So we are left to play with the PCI MMIO region size and location in
> >>>>> hvmloader.
> >>>>>
> >>>>> I agree with Jan that we shouldn't relocate unconditionally all the
> >>>>> devices to the region above 4G. I meant to say that we should relocate
> >>>>> only the ones that don't fit. And we shouldn't try to dynamically
> >>>>> increase the PCI hole below 4G because clearly that doesn't work.
> >>>>> However we could still increase the size of the PCI hole below 4G by
> >>>>> default from start at 0xf0000000 to starting at 0xe0000000.
> >>>>> Why do we know that is safe? Because in the current configuration
> >>>>> hvmloader *already* increases the PCI hole size by decreasing the start
> >>>>> address every time a device doesn't fit.
> >>>>> So it's already common for hvmloader to set pci_mem_start to
> >>>>> 0xe0000000, you just need to assign a device with a PCI hole size big
> >>>>> enough.
> >>> Isn't this the exact case which is broken? And therefore not known safe
> >>> at all?
> >>>
> >>>>> My proposed solution is:
> >>>>>
> >>>>> - set 0xe0000000 as the default PCI hole start for everybody, including
> >>>>> qemu-xen-traditional
> >>> What is the impact on existing qemu-trad guests?
> >>>
> >>> It does mean that guest which were installed with a bit less than 4GB
> >>> RAM may now find a little bit of RAM moves above 4GB to make room for
> >>> the bigger whole. If they can dynamically enable PAE that might be ok.
> >>>
> >>> Does this have any impact on Windows activation?
> >>>
> >>>>> - move above 4G everything that doesn't fit and support 64-bit bars
> >>>>> - print an error if the device doesn't fit and doesn't support 64-bit
> >>>>> bars
> >>>> Also, as I understand it, at the moment:
> >>>> 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> >>>> 2. Some devices (without 64-bit BARs) can't be relocated
> >>>> 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> >>>>
> >>>> So if we have #1 or #2, at the moment an option for a work-around is to
> >>>> use qemu-traditional.
> >>>>
> >>>> However, if we add your "print an error if the device doesn't fit", then
> >>>> this option will go away -- this will be a regression in functionality
> >>>> from 4.2.
> >>> Only if print an error also involves aborting. It could print an error
> >>> (lets call it a warning) and continue, which would leave the workaround
> >>> viable.\
> >> No, because if hvmloader doesn't increase the size of the MMIO hole,
> >> then the device won't actually work.  The guest will boot, but the OS
> >> will not be able to use it.
> > I meant continue as in increasing the hole too, although rereading the
> > thread maybe that's not what everyone else was talking about ;-)
> 
> Well if you continue increasing the hole, then it works on 
> qemu-traditional but on qemu-xen you have weird crashes and guest hangs 
> at some point in the future when qemu tries to map a non-existent guest 
> memory address -- that's much worse than the device just not being 
> visible to the OS.

I thought the point of the print was simply to give us something to spot
in the logs in this latter case.

> That's the point -- current behavior on qemu-xen causes weird hangs; but 
> the simple way of preventing those hangs (just not increasing the MMIO 
> hole size) removes functionality from both qemu-xen and 
> qemu-traditional, even though qemu-traditional doesn't have any problems 
> with the resized MMIO hole.
> 
> So there's no simple way to avoid random crashes while keeping the 
> work-around functional; that's why someone suggested adding a xenstore 
> key to tell hvmloader what to do.
> 
> At least, that's what I understood the situation to be -- someone 
> correct me if I'm wrong. :-)
> 
>   -George

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
  2013-06-13 15:16                       ` Ian Campbell
@ 2013-06-13 15:40                         ` Stefano Stabellini
  -1 siblings, 0 replies; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-13 15:40 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 13 Jun 2013, Ian Campbell wrote:
> On Thu, 2013-06-13 at 14:54 +0100, George Dunlap wrote:
> > On 13/06/13 14:44, Stefano Stabellini wrote:
> > > On Wed, 12 Jun 2013, George Dunlap wrote:
> > >> On 12/06/13 08:25, Jan Beulich wrote:
> > >>>>>> On 11.06.13 at 19:26, Stefano Stabellini
> > >>>>>> <stefano.stabellini@eu.citrix.com> wrote:
> > >>>> I went through the code that maps the PCI MMIO regions in hvmloader
> > >>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> > >>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> > >>>> region is larger than 512MB.
> > >>>>
> > >>>> Maybe we could just relax this condition and map the device memory to
> > >>>> high memory no matter the size of the MMIO region if the PCI bar is
> > >>>> 64-bit?
> > >>> I can only recommend not to: For one, guests not using PAE or
> > >>> PSE-36 can't map such space at all (and older OSes may not
> > >>> properly deal with 64-bit BARs at all). And then one would generally
> > >>> expect this allocation to be done top down (to minimize risk of
> > >>> running into RAM), and doing so is going to present further risks of
> > >>> incompatibilities with guest OSes (Linux for example learned only in
> > >>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> > >>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> > >>> PFN to pfn_pte(), the respective parameter of which is
> > >>> "unsigned long").
> > >>>
> > >>> I think this ought to be done in an iterative process - if all MMIO
> > >>> regions together don't fit below 4G, the biggest one should be
> > >>> moved up beyond 4G first, followed by the next to biggest one
> > >>> etc.
> > >> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> > >> temporary work-around.  It should only be done if a device doesn't fit in the
> > >> current MMIO range.
> > >>
> > >> We have three options here:
> > >> 1. Don't do anything
> > >> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
> > >> fit
> > >> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
> > >> memory).
> > >> 4. Add a mechanism to tell qemu that memory is being relocated.
> > >>
> > >> Number 4 is definitely the right answer long-term, but we just don't have time
> > >> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
> > >> if it is, it may have unpredictable knock-on effects.
> > >>
> > >> Doing #2, it is true that many guests will be unable to access the device
> > >> because of 32-bit limitations.  However, in #1, *no* guests will be able to
> > >> access the device.  At least in #2, *many* guests will be able to do so.  In
> > >> any case, apparently #2 is what KVM does, so having the limitation on guests
> > >> is not without precedent.  It's also likely to be a somewhat tested
> > >> configuration (unlike #3, for example).
> > > I would avoid #3, because I don't think is a good idea to rely on that
> > > behaviour.
> > > I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> > > easy and certainly not doable in time for 4.3.
> > >
> > > So we are left to play with the PCI MMIO region size and location in
> > > hvmloader.
> > >
> > > I agree with Jan that we shouldn't relocate unconditionally all the
> > > devices to the region above 4G. I meant to say that we should relocate
> > > only the ones that don't fit. And we shouldn't try to dynamically
> > > increase the PCI hole below 4G because clearly that doesn't work.
> > > However we could still increase the size of the PCI hole below 4G by
> > > default from start at 0xf0000000 to starting at 0xe0000000.
> > > Why do we know that is safe? Because in the current configuration
> > > hvmloader *already* increases the PCI hole size by decreasing the start
> > > address every time a device doesn't fit.
> > > So it's already common for hvmloader to set pci_mem_start to
> > > 0xe0000000, you just need to assign a device with a PCI hole size big
> > > enough.
> 
> Isn't this the exact case which is broken? And therefore not known safe
> at all?

hvmloader sets pci_mem_start to 0xe0000000 and works with
qemu-xen-traditional but it doesn't with qemu-xen (before the patch that
increases the default pci hole size in QEMU).
What I was trying to say is: "it's already common for hvmloader to set
pci_mem_start to 0xe0000000 with qemu-xen-traditional".


> > > My proposed solution is:
> > >
> > > - set 0xe0000000 as the default PCI hole start for everybody, including
> > > qemu-xen-traditional
> 
> What is the impact on existing qemu-trad guests?
> 
> It does mean that guest which were installed with a bit less than 4GB
> RAM may now find a little bit of RAM moves above 4GB to make room for
> the bigger whole. If they can dynamically enable PAE that might be ok.

Yes, the amount of below 4G ram is going to be a bit less.


> Does this have any impact on Windows activation?

I don't think so: I assigned graphic cards with less than 512MB of
videoram to Windows guests before without compromising the Windows
license. I'll get more info on this.


> > > - move above 4G everything that doesn't fit and support 64-bit bars
> > > - print an error if the device doesn't fit and doesn't support 64-bit
> > > bars
> > 
> > Also, as I understand it, at the moment:
> > 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> > 2. Some devices (without 64-bit BARs) can't be relocated
> > 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> > 
> > So if we have #1 or #2, at the moment an option for a work-around is to 
> > use qemu-traditional.
> > 
> > However, if we add your "print an error if the device doesn't fit", then 
> > this option will go away -- this will be a regression in functionality 
> > from 4.2.
> 
> Only if print an error also involves aborting. It could print an error
> (lets call it a warning) and continue, which would leave the workaround
> viable.

Regardless of the warning we print and the strategy we decide to have to
fix the problem, we can always have a manually selectable workaround to
switch to the old behaviour if we want to (aside from the minimum PCI
hole size that might be harder to make selectable). It just
means more code in hvmloader.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M
@ 2013-06-13 15:40                         ` Stefano Stabellini
  0 siblings, 0 replies; 82+ messages in thread
From: Stefano Stabellini @ 2013-06-13 15:40 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Tim Deegan, Yongjie Ren, yanqiangjun, Keir Fraser, hanweidong,
	George Dunlap, Xudong Hao, Stefano Stabellini, luonengjun,
	qemu-devel, wangzhenguo, xiaowei.yang, arei.gonglei, Jan Beulich,
	Paolo Bonzini, YongweiX Xu, SongtaoX Liu, xen-devel

On Thu, 13 Jun 2013, Ian Campbell wrote:
> On Thu, 2013-06-13 at 14:54 +0100, George Dunlap wrote:
> > On 13/06/13 14:44, Stefano Stabellini wrote:
> > > On Wed, 12 Jun 2013, George Dunlap wrote:
> > >> On 12/06/13 08:25, Jan Beulich wrote:
> > >>>>>> On 11.06.13 at 19:26, Stefano Stabellini
> > >>>>>> <stefano.stabellini@eu.citrix.com> wrote:
> > >>>> I went through the code that maps the PCI MMIO regions in hvmloader
> > >>>> (tools/firmware/hvmloader/pci.c:pci_setup) and it looks like it already
> > >>>> maps the PCI region to high memory if the PCI bar is 64-bit and the MMIO
> > >>>> region is larger than 512MB.
> > >>>>
> > >>>> Maybe we could just relax this condition and map the device memory to
> > >>>> high memory no matter the size of the MMIO region if the PCI bar is
> > >>>> 64-bit?
> > >>> I can only recommend not to: For one, guests not using PAE or
> > >>> PSE-36 can't map such space at all (and older OSes may not
> > >>> properly deal with 64-bit BARs at all). And then one would generally
> > >>> expect this allocation to be done top down (to minimize risk of
> > >>> running into RAM), and doing so is going to present further risks of
> > >>> incompatibilities with guest OSes (Linux for example learned only in
> > >>> 2.6.36 that PFNs in ioremap() can exceed 32 bits, but even in
> > >>> 3.10-rc5 ioremap_pte_range(), while using "u64 pfn", passes the
> > >>> PFN to pfn_pte(), the respective parameter of which is
> > >>> "unsigned long").
> > >>>
> > >>> I think this ought to be done in an iterative process - if all MMIO
> > >>> regions together don't fit below 4G, the biggest one should be
> > >>> moved up beyond 4G first, followed by the next to biggest one
> > >>> etc.
> > >> First of all, the proposal to move the PCI BAR up to the 64-bit range is a
> > >> temporary work-around.  It should only be done if a device doesn't fit in the
> > >> current MMIO range.
> > >>
> > >> We have three options here:
> > >> 1. Don't do anything
> > >> 2. Have hvmloader move PCI devices up to the 64-bit MMIO hole if they don't
> > >> fit
> > >> 3. Convince qemu to allow MMIO regions to mask memory (or what it thinks is
> > >> memory).
> > >> 4. Add a mechanism to tell qemu that memory is being relocated.
> > >>
> > >> Number 4 is definitely the right answer long-term, but we just don't have time
> > >> to do that before the 4.3 release.  We're not sure yet if #3 is possible; even
> > >> if it is, it may have unpredictable knock-on effects.
> > >>
> > >> Doing #2, it is true that many guests will be unable to access the device
> > >> because of 32-bit limitations.  However, in #1, *no* guests will be able to
> > >> access the device.  At least in #2, *many* guests will be able to do so.  In
> > >> any case, apparently #2 is what KVM does, so having the limitation on guests
> > >> is not without precedent.  It's also likely to be a somewhat tested
> > >> configuration (unlike #3, for example).
> > > I would avoid #3, because I don't think is a good idea to rely on that
> > > behaviour.
> > > I would also avoid #4, because having seen QEMU's code, it's wouldn't be
> > > easy and certainly not doable in time for 4.3.
> > >
> > > So we are left to play with the PCI MMIO region size and location in
> > > hvmloader.
> > >
> > > I agree with Jan that we shouldn't relocate unconditionally all the
> > > devices to the region above 4G. I meant to say that we should relocate
> > > only the ones that don't fit. And we shouldn't try to dynamically
> > > increase the PCI hole below 4G because clearly that doesn't work.
> > > However we could still increase the size of the PCI hole below 4G by
> > > default from start at 0xf0000000 to starting at 0xe0000000.
> > > Why do we know that is safe? Because in the current configuration
> > > hvmloader *already* increases the PCI hole size by decreasing the start
> > > address every time a device doesn't fit.
> > > So it's already common for hvmloader to set pci_mem_start to
> > > 0xe0000000, you just need to assign a device with a PCI hole size big
> > > enough.
> 
> Isn't this the exact case which is broken? And therefore not known safe
> at all?

hvmloader sets pci_mem_start to 0xe0000000 and works with
qemu-xen-traditional but it doesn't with qemu-xen (before the patch that
increases the default pci hole size in QEMU).
What I was trying to say is: "it's already common for hvmloader to set
pci_mem_start to 0xe0000000 with qemu-xen-traditional".


> > > My proposed solution is:
> > >
> > > - set 0xe0000000 as the default PCI hole start for everybody, including
> > > qemu-xen-traditional
> 
> What is the impact on existing qemu-trad guests?
> 
> It does mean that guest which were installed with a bit less than 4GB
> RAM may now find a little bit of RAM moves above 4GB to make room for
> the bigger whole. If they can dynamically enable PAE that might be ok.

Yes, the amount of below 4G ram is going to be a bit less.


> Does this have any impact on Windows activation?

I don't think so: I assigned graphic cards with less than 512MB of
videoram to Windows guests before without compromising the Windows
license. I'll get more info on this.


> > > - move above 4G everything that doesn't fit and support 64-bit bars
> > > - print an error if the device doesn't fit and doesn't support 64-bit
> > > bars
> > 
> > Also, as I understand it, at the moment:
> > 1. Some operating systems (32-bit XP) won't be able to use relocated devices
> > 2. Some devices (without 64-bit BARs) can't be relocated
> > 3. qemu-traditional is fine with a resized <4GiB MMIO hole.
> > 
> > So if we have #1 or #2, at the moment an option for a work-around is to 
> > use qemu-traditional.
> > 
> > However, if we add your "print an error if the device doesn't fit", then 
> > this option will go away -- this will be a regression in functionality 
> > from 4.2.
> 
> Only if print an error also involves aborting. It could print an error
> (lets call it a warning) and continue, which would leave the workaround
> viable.

Regardless of the warning we print and the strategy we decide to have to
fix the problem, we can always have a manually selectable workaround to
switch to the old behaviour if we want to (aside from the minimum PCI
hole size that might be harder to make selectable). It just
means more code in hvmloader.

^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2013-06-14 14:37 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-07  9:20 [BUG 1747]Guest could't find bootable device with memory more than 3600M Xu, YongweiX
2013-06-07 12:15 ` Stefano Stabellini
2013-06-07 15:42   ` George Dunlap
2013-06-07 15:56     ` Stefano Stabellini
2013-06-08  7:27       ` Hao, Xudong
2013-06-10 11:49         ` George Dunlap
2013-06-11 17:26           ` [Qemu-devel] [Xen-devel] " Stefano Stabellini
2013-06-11 17:26             ` Stefano Stabellini
2013-06-12  7:25             ` [Qemu-devel] " Jan Beulich
2013-06-12  7:25               ` Jan Beulich
2013-06-12  8:31               ` [Qemu-devel] " Ian Campbell
2013-06-12  8:31                 ` Ian Campbell
2013-06-12  9:02                 ` [Qemu-devel] " Jan Beulich
2013-06-12  9:02                   ` Jan Beulich
2013-06-12  9:22                   ` [Qemu-devel] " Ian Campbell
2013-06-12  9:22                     ` Ian Campbell
2013-06-12 10:07                     ` [Qemu-devel] [Xen-devel] " Jan Beulich
2013-06-12 10:07                       ` Jan Beulich
2013-06-12 11:23                       ` [Qemu-devel] " Ian Campbell
2013-06-12 11:23                         ` Ian Campbell
2013-06-12 11:56                         ` [Qemu-devel] " Jan Beulich
2013-06-12 11:56                           ` Jan Beulich
2013-06-12 11:59                           ` [Qemu-devel] " Ian Campbell
2013-06-12 11:59                             ` Ian Campbell
2013-06-12 10:05               ` [Qemu-devel] " George Dunlap
2013-06-12 10:05                 ` George Dunlap
2013-06-12 10:11                 ` [Qemu-devel] " Jan Beulich
2013-06-12 10:11                   ` Jan Beulich
2013-06-12 10:15                   ` [Qemu-devel] " George Dunlap
2013-06-12 10:15                     ` George Dunlap
2013-06-12 13:23                 ` [Qemu-devel] " Paolo Bonzini
2013-06-12 13:23                   ` Paolo Bonzini
2013-06-12 13:49                   ` [Qemu-devel] " Jan Beulich
2013-06-12 13:49                     ` Jan Beulich
2013-06-12 14:02                     ` [Qemu-devel] " Paolo Bonzini
2013-06-12 14:02                       ` Paolo Bonzini
2013-06-12 14:19                       ` [Qemu-devel] " Jan Beulich
2013-06-12 14:19                         ` Jan Beulich
2013-06-12 15:25                         ` [Qemu-devel] " George Dunlap
2013-06-12 15:25                           ` George Dunlap
2013-06-12 20:13                           ` [Qemu-devel] " Paolo Bonzini
2013-06-12 20:13                             ` Paolo Bonzini
2013-06-13 13:44                 ` [Qemu-devel] " Stefano Stabellini
2013-06-13 13:44                   ` Stefano Stabellini
2013-06-13 13:54                   ` [Qemu-devel] " George Dunlap
2013-06-13 13:54                     ` George Dunlap
2013-06-13 14:50                     ` [Qemu-devel] " Stefano Stabellini
2013-06-13 14:50                       ` Stefano Stabellini
2013-06-13 15:06                       ` [Qemu-devel] [Xen-devel] " Jan Beulich
2013-06-13 15:06                         ` Jan Beulich
2013-06-13 15:29                       ` [Qemu-devel] [Xen-devel] " George Dunlap
2013-06-13 15:29                         ` George Dunlap
2013-06-13 16:13                         ` [Qemu-devel] " Stefano Stabellini
2013-06-13 16:13                           ` Stefano Stabellini
2013-06-13 15:34                       ` [Qemu-devel] " Ian Campbell
2013-06-13 15:34                         ` Ian Campbell
2013-06-13 16:55                         ` [Qemu-devel] " Stefano Stabellini
2013-06-13 16:55                           ` Stefano Stabellini
2013-06-13 17:22                           ` [Qemu-devel] " Ian Campbell
2013-06-13 17:22                             ` Ian Campbell
2013-06-14 10:53                             ` [Qemu-devel] " George Dunlap
2013-06-14 10:53                               ` George Dunlap
2013-06-14 11:34                               ` [Qemu-devel] [Xen-devel] " Ian Campbell
2013-06-14 11:34                                 ` Ian Campbell
2013-06-14 14:14                                 ` [Qemu-devel] " George Dunlap
2013-06-14 14:14                                   ` George Dunlap
2013-06-14 14:36                                   ` [Qemu-devel] " George Dunlap
2013-06-14 14:36                                     ` George Dunlap
2013-06-13 14:54                     ` [Qemu-devel] " Paolo Bonzini
2013-06-13 14:54                       ` Paolo Bonzini
2013-06-13 15:16                     ` [Qemu-devel] [Xen-devel] " Ian Campbell
2013-06-13 15:16                       ` Ian Campbell
2013-06-13 15:30                       ` [Qemu-devel] [Xen-devel] " George Dunlap
2013-06-13 15:30                         ` George Dunlap
2013-06-13 15:36                         ` [Qemu-devel] " Ian Campbell
2013-06-13 15:36                           ` Ian Campbell
2013-06-13 15:40                           ` [Qemu-devel] " George Dunlap
2013-06-13 15:40                             ` George Dunlap
2013-06-13 15:42                             ` [Qemu-devel] " Ian Campbell
2013-06-13 15:42                               ` Ian Campbell
2013-06-13 15:40                       ` [Qemu-devel] " Stefano Stabellini
2013-06-13 15:40                         ` Stefano Stabellini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.