* [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
@ 2013-07-16 17:22 Andrea Arcangeli
  2013-07-16 17:26 ` Paolo Bonzini
  2013-07-16 17:38 ` Eduardo Habkost
  0 siblings, 2 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 17:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, qemu-stable, Gleb Natapov

Without this patch the guest physical bits are advertised as 40, not
44 or more depending on the hardware capability of the host.

That leads to guest kernel crashes with injection of page faults 9
(see oops: 0009) as bits above 40 in the guest pagetables are
considered reserved.

exregion-0206 [324572448] [17] ex_system_memory_space: System-Memory (width 32) R/W 0 Address=00000000FED00000
BUG: unable to handle kernel paging request at ffffc9006030e000
IP: [<ffffffff812fbb6f>] acpi_ex_system_memory_space_handler+0x23e/0x2cb
PGD e01f875067 PUD 1001f075067 PMD e0178d8067 PTE 80000000fed00173
Oops: 0009 [#1] SMP

(see PUD with bit >=40 set)

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Chegu Vinod <chegu_vinod@hp.com>
---
 target-i386/cpu.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index e3f75a8..0e65673 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2108,6 +2108,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
             /* 64 bit processor */
 /* XXX: The physical address space is limited to 42 bits in exec.c. */
             *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
+            if (kvm_enabled()) {
+                uint32_t _eax;
+                host_cpuid(0x80000000, 0, &_eax, NULL, NULL, NULL);
+                if (_eax >= 0x80000008)
+                    host_cpuid(0x80000008, 0, eax, NULL, NULL, NULL);
+            }
         } else {
             if (env->features[FEAT_1_EDX] & CPUID_PSE36) {
                 *eax = 0x00000024; /* 36 bits physical */
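
For reference, CPUID leaf 0x80000008 returns the physical address width
in EAX bits 7:0 and the linear address width in EAX bits 15:8; that is
what the host_cpuid() calls in the hunk above read from the host.  A
minimal hedged sketch of the same query as a standalone program, using
GCC's <cpuid.h> rather than QEMU's host_cpuid() helper:

    /* Sketch only, not part of the patch: print the host address widths
     * that the patch copies into the guest's CPUID leaf 0x80000008. */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

        /* __get_cpuid() returns 0 if the requested leaf is not supported. */
        if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 0x80000008 not available\n");
            return 1;
        }
        printf("host physical address bits: %u\n", eax & 0xff);        /* e.g. 44 */
        printf("host virtual  address bits: %u\n", (eax >> 8) & 0xff); /* e.g. 48 */
        return 0;
    }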

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-16 17:22 [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests Andrea Arcangeli
@ 2013-07-16 17:26 ` Paolo Bonzini
  2013-07-16 17:38 ` Eduardo Habkost
  1 sibling, 0 replies; 14+ messages in thread
From: Paolo Bonzini @ 2013-07-16 17:26 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Eduardo Habkost, qemu-devel, Gleb Natapov, qemu-stable

Il 16/07/2013 19:22, Andrea Arcangeli ha scritto:
> Without this patch the guest physical bits are advertised as 40, not
> 44 or more depending on the hardware capability of the host.
> 
> That leads to guest kernel crashes with injection of page faults 9
> (see oops: 0009) as bits above 40 in the guest pagetables are
> considered reserved.
> 
> exregion-0206 [324572448] [17] ex_system_memory_space: System-Memory (width 32) R/W 0 Address=00000000FED00000
> BUG: unable to handle kernel paging request at ffffc9006030e000
> IP: [<ffffffff812fbb6f>] acpi_ex_system_memory_space_handler+0x23e/0x2cb
> PGD e01f875067 PUD 1001f075067 PMD e0178d8067 PTE 80000000fed00173
> Oops: 0009 [#1] SMP
> 
> (see PUD with bit >=40 set)
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: Chegu Vinod <chegu_vinod@hp.com>
> ---
>  target-i386/cpu.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index e3f75a8..0e65673 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -2108,6 +2108,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>              /* 64 bit processor */
>  /* XXX: The physical address space is limited to 42 bits in exec.c. */
>              *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
> +            if (kvm_enabled()) {
> +                uint32_t _eax;
> +                host_cpuid(0x80000000, 0, &_eax, NULL, NULL, NULL);
> +                if (_eax >= 0x80000008)
> +                    host_cpuid(0x80000008, 0, eax, NULL, NULL, NULL);
> +            }
>          } else {
>              if (env->features[FEAT_1_EDX] & CPUID_PSE36) {
>                  *eax = 0x00000024; /* 36 bits physical */
> 

This is fine by me.

It has the usual problem that "-cpu" does not affect most CPUID leaves
(and read-only MSRs too). We have the same problem for vPMU and I think
for leaf 0xD too.  I don't think this should block the patch, though,
it's just one more thing waiting for CPUID infrastructure work.

Let's see what Eduardo thinks.  If he acks the patch, I'll take this
into uq/master.

Paolo

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-16 17:22 [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests Andrea Arcangeli
  2013-07-16 17:26 ` Paolo Bonzini
@ 2013-07-16 17:38 ` Eduardo Habkost
  2013-07-16 17:46   ` Paolo Bonzini
  1 sibling, 1 reply; 14+ messages in thread
From: Eduardo Habkost @ 2013-07-16 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Paolo Bonzini, qemu-devel, Gleb Natapov, qemu-stable

On Tue, Jul 16, 2013 at 07:22:01PM +0200, Andrea Arcangeli wrote:
> Without this patch the guest physical bits are advertised as 40, not
> 44 or more depending on the hardware capability of the host.
> 
> That leads to guest kernel crashes with injection of page faults 9
> (see oops: 0009) as bits above 40 in the guest pagetables are
> considered reserved.
> 
> exregion-0206 [324572448] [17] ex_system_memory_space: System-Memory (width 32) R/W 0 Address=00000000FED00000
> BUG: unable to handle kernel paging request at ffffc9006030e000
> IP: [<ffffffff812fbb6f>] acpi_ex_system_memory_space_handler+0x23e/0x2cb
> PGD e01f875067 PUD 1001f075067 PMD e0178d8067 PTE 80000000fed00173
> Oops: 0009 [#1] SMP
> 
> (see PUD with bit >=40 set)

I am not sure I understand what caused this: if we are advertising 40
physical bits to the guest, why are we ending up with a PUD with
bit >= 40 set?

> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: Chegu Vinod <chegu_vinod@hp.com>
> ---
>  target-i386/cpu.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index e3f75a8..0e65673 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -2108,6 +2108,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>              /* 64 bit processor */
>  /* XXX: The physical address space is limited to 42 bits in exec.c. */
>              *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
> +            if (kvm_enabled()) {
> +                uint32_t _eax;
> +                host_cpuid(0x80000000, 0, &_eax, NULL, NULL, NULL);
> +                if (_eax >= 0x80000008)
> +                    host_cpuid(0x80000008, 0, eax, NULL, NULL, NULL);
> +            }

We can't expose a different virtual machine depending on host
capabilities. What if we live-migrate between hosts with different
physical address bit sizes?

>          } else {
>              if (env->features[FEAT_1_EDX] & CPUID_PSE36) {
>                  *eax = 0x00000024; /* 36 bits physical */
> 

-- 
Eduardo

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-16 17:38 ` Eduardo Habkost
@ 2013-07-16 17:46   ` Paolo Bonzini
  2013-07-16 17:48     ` Paolo Bonzini
  2013-07-16 18:11     ` Eduardo Habkost
  0 siblings, 2 replies; 14+ messages in thread
From: Paolo Bonzini @ 2013-07-16 17:46 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: Andrea Arcangeli, qemu-devel, Gleb Natapov, qemu-stable

Il 16/07/2013 19:38, Eduardo Habkost ha scritto:
> On Tue, Jul 16, 2013 at 07:22:01PM +0200, Andrea Arcangeli wrote:
>> Without this patch the guest physical bits are advertised as 40, not
>> 44 or more depending on the hardware capability of the host.
>>
>> That leads to guest kernel crashes with injection of page faults 9
>> (see oops: 0009) as bits above 40 in the guest pagetables are
>> considered reserved.
>>
>> exregion-0206 [324572448] [17] ex_system_memory_space: System-Memory (width 32) R/W 0 Address=00000000FED00000
>> BUG: unable to handle kernel paging request at ffffc9006030e000
>> IP: [<ffffffff812fbb6f>] acpi_ex_system_memory_space_handler+0x23e/0x2cb
>> PGD e01f875067 PUD 1001f075067 PMD e0178d8067 PTE 80000000fed00173
>> Oops: 0009 [#1] SMP
>>
>> (see PUD with bit >=40 set)
> 
> I am not sure I understand what caused this: if we are advertising 40
> physical bits to the guest, why are we ending up with a PUD with
> bit >= 40 set?

Because we create a guest that has bigger memory than what we advertise
in CPUID.

>>
>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>> Reported-by: Chegu Vinod <chegu_vinod@hp.com>
>> ---
>>  target-i386/cpu.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
>> index e3f75a8..0e65673 100644
>> --- a/target-i386/cpu.c
>> +++ b/target-i386/cpu.c
>> @@ -2108,6 +2108,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>              /* 64 bit processor */
>>  /* XXX: The physical address space is limited to 42 bits in exec.c. */
>>              *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
>> +            if (kvm_enabled()) {
>> +                uint32_t _eax;
>> +                host_cpuid(0x80000000, 0, &_eax, NULL, NULL, NULL);
>> +                if (_eax >= 0x80000008)
>> +                    host_cpuid(0x80000008, 0, eax, NULL, NULL, NULL);
>> +            }
> 
> We can't expose a different virtual machine depending on host
> capabilities. What if we live-migrate between hosts with different
> physical address bit sizes?

Same as for vPMU or leaf 0xD: who knows.  In practice, this has an
effect only for guests with 1T or more memory, otherwise the physical
memory is smaller than 40 bits.

Paolo

>>          } else {
>>              if (env->features[FEAT_1_EDX] & CPUID_PSE36) {
>>                  *eax = 0x00000024; /* 36 bits physical */
>>
> 

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-16 17:46   ` Paolo Bonzini
@ 2013-07-16 17:48     ` Paolo Bonzini
  2013-07-16 18:06       ` Andrea Arcangeli
  2013-07-16 18:11     ` Eduardo Habkost
  1 sibling, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2013-07-16 17:48 UTC (permalink / raw)
  Cc: Andrea Arcangeli, qemu-stable, Eduardo Habkost, Gleb Natapov, qemu-devel

Il 16/07/2013 19:46, Paolo Bonzini ha scritto:
>>> >> (see PUD with bit >=40 set)
>> > 
>> > I am not sure I understand what caused this: if we are advertising 40
>> > physical bits to the guest, why are we ending up with a PUD with
>> > bit >= 40 set?
> Because we create a guest that has bigger memory than what we advertise
> in CPUID.
> 

Also, note that the guest does not really care about this CPUID.  It is
only used by KVM itself to decide which bits in the page tables are
reserved.
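
To make that concrete: with the advertised MAXPHYADDR of 40, the
physical-address bits 51..40 of a paging-structure entry are treated as
reserved, and a violation is injected as a page fault with error code 9
(present + reserved), which is exactly the "Oops: 0009" in the report.
A hedged illustration of that check against the PUD value from the oops:

    /* Illustration only: the reserved-bit test KVM applies when the guest
     * CPUID advertises MAXPHYADDR = 40. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t pud = 0x1001f075067ULL;   /* PUD value from the oops above */
        unsigned int maxphyaddr = 40;      /* what QEMU advertised */

        /* Physical-address bits 51..MAXPHYADDR must be zero in the entry. */
        uint64_t rsvd = ((1ULL << 52) - 1) & ~((1ULL << maxphyaddr) - 1);

        if (pud & rsvd) {
            /* KVM injects #PF with error code P|RSVD = 0x9. */
            printf("reserved bits set: 0x%llx\n",
                   (unsigned long long)(pud & rsvd));
        }
        return 0;
    }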

Paolo

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-16 17:48     ` Paolo Bonzini
@ 2013-07-16 18:06       ` Andrea Arcangeli
  0 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 18:06 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-stable, Eduardo Habkost, Gleb Natapov, qemu-devel

On Tue, Jul 16, 2013 at 07:48:27PM +0200, Paolo Bonzini wrote:
> Il 16/07/2013 19:46, Paolo Bonzini ha scritto:
> >>> >> (see PUD with bit >=40 set)
> >> > 
> >> > I am not sure I understand what caused this: if we are advertising 40
> >> > physical bits to the guest, why are we ending up with a PUD with
> >> > bit >= 40 set?
> > Because we create a guest that has bigger memory than what we advertise
> > in CPUID.
> > 
> 
> Also, note that the guest does not really care about this CPUID.  It is
> only used by KVM itself to decide which bits in the page tables are
> reserved.

Yes, I suppose the guest thinks that if there's >1TB of RAM there are
enough bits in the pagetables to map whatever RAM "range" was found.

As for migrating >1TB guests, that will become possible as soon as qemu
starts to use the two features (remap_anon_pages + MADV_USERFAULT) I
posted on linux-kernel recently. With those two features, qemu should be
able to provide postcopy live migration by default, and once pre-copy is
complete it will never need to freeze the guest on the source node
anymore.

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-16 17:46   ` Paolo Bonzini
  2013-07-16 17:48     ` Paolo Bonzini
@ 2013-07-16 18:11     ` Eduardo Habkost
  2013-07-16 19:24       ` Paolo Bonzini
  1 sibling, 1 reply; 14+ messages in thread
From: Eduardo Habkost @ 2013-07-16 18:11 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Andrea Arcangeli, qemu-devel, Gleb Natapov, qemu-stable

On Tue, Jul 16, 2013 at 07:46:14PM +0200, Paolo Bonzini wrote:
> Il 16/07/2013 19:38, Eduardo Habkost ha scritto:
> > On Tue, Jul 16, 2013 at 07:22:01PM +0200, Andrea Arcangeli wrote:
> >> Without this patch the guest physical bits are advertised as 40, not
> >> 44 or more depending on the hardware capability of the host.
> >>
> >> That leads to guest kernel crashes with injection of page faults 9
> >> (see oops: 0009) as bits above 40 in the guest pagetables are
> >> considered reserved.
> >>
> >> exregion-0206 [324572448] [17] ex_system_memory_space: System-Memory (width 32) R/W 0 Address=00000000FED00000
> >> BUG: unable to handle kernel paging request at ffffc9006030e000
> >> IP: [<ffffffff812fbb6f>] acpi_ex_system_memory_space_handler+0x23e/0x2cb
> >> PGD e01f875067 PUD 1001f075067 PMD e0178d8067 PTE 80000000fed00173
> >> Oops: 0009 [#1] SMP
> >>
> >> (see PUD with bit >=40 set)
> > 
> > I am not sure I understand what caused this: if we are advertising 40
> > physical bits to the guest, why are we ending up with a PUD with
> > bit >= 40 set?
> 
> Because we create a guest that has bigger memory than what we advertise
> in CPUID.

That means we never supported guests that large, doesn't it? Why is it a
qemu-stable candidate, then?

> 
> >>
> >> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> >> Reported-by: Chegu Vinod <chegu_vinod@hp.com>
> >> ---
> >>  target-i386/cpu.c | 6 ++++++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> >> index e3f75a8..0e65673 100644
> >> --- a/target-i386/cpu.c
> >> +++ b/target-i386/cpu.c
> >> @@ -2108,6 +2108,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> >>              /* 64 bit processor */
> >>  /* XXX: The physical address space is limited to 42 bits in exec.c. */
> >>              *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
> >> +            if (kvm_enabled()) {
> >> +                uint32_t _eax;
> >> +                host_cpuid(0x80000000, 0, &_eax, NULL, NULL, NULL);
> >> +                if (_eax >= 0x80000008)
> >> +                    host_cpuid(0x80000008, 0, eax, NULL, NULL, NULL);
> >> +            }
> > 
> > We can't expose a different virtual machine depending on host
> > capabilities. What if we live-migrate between hosts with different
> > physical address bit sizes?
> 
> Same as for vPMU or leaf 0xD: who knows.  In practice, this has an
> effect only for guests with 1T or more memory, otherwise the physical
> memory is smaller than 40 bits.

Leaves 0xA and 0xD are buggy, and we need to disable that "passthrough
mode" behavior by default, because it breaks live-migration. If we want
to expose new features to the guest we need to make those features
configurable, not expose them automatically depending on host
capabilities.

(I was aware of the problem on vPMU/0xA, but I didn't notice 0xD was
broken as well. Thanks for noting.)

For physical bit size, what about extending it in a backwards-compatible
way? Something like this:

    *eax = 0x0003000; /* 48 bits virtual */
    if (ram_size < 1TB) {
        physical_size = 40; /* Keeping backwards compatibility */
    } else if (ram_size < 4TB) {
        physical_size = 42;
    } else {
        abort();
    }
    if (supported_host_physical_size() < physical_size) {
        abort();
    }
    *eax |= physical_size;

(Of course, the abort() calls should be replaced with proper error
reporting)

> 
> Paolo
> 
> >>          } else {
> >>              if (env->features[FEAT_1_EDX] & CPUID_PSE36) {
> >>                  *eax = 0x00000024; /* 36 bits physical */
> >>
> > 
> 
> 

-- 
Eduardo

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-16 18:11     ` Eduardo Habkost
@ 2013-07-16 19:24       ` Paolo Bonzini
  2013-07-16 19:42         ` Eduardo Habkost
  0 siblings, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2013-07-16 19:24 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: Andrea Arcangeli, qemu-devel, Gleb Natapov, qemu-stable

Il 16/07/2013 20:11, Eduardo Habkost ha scritto:
> For physical bit size, what about extending it in a backwards-compatible
> way? Something like this:
> 
>     *eax = 0x0003000; /* 48 bits virtual */
>     if (ram_size < 1TB) {
>         physical_size = 40; /* Keeping backwards compatibility */
>     } else if (ram_size < 4TB) {
>         physical_size = 42;

Why not go straight up to 44?

>     } else {
>         abort();
>     }
>     if (supported_host_physical_size() < physical_size) {
>         abort();
>     }
>     *eax |= physical_size;
> 
> (Of course, the abort() calls should be replaced with proper error
> reporting)

This makes sense too.  Though the best would be of course to use CPUID
values coming from the real processors, and only using 40 for backwards
compatibility.

Paolo

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-16 19:24       ` Paolo Bonzini
@ 2013-07-16 19:42         ` Eduardo Habkost
  2013-07-17  8:09           ` Paolo Bonzini
  2013-07-17 15:19           ` Gleb Natapov
  0 siblings, 2 replies; 14+ messages in thread
From: Eduardo Habkost @ 2013-07-16 19:42 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Andrea Arcangeli, qemu-devel, Gleb Natapov, qemu-stable

On Tue, Jul 16, 2013 at 09:24:30PM +0200, Paolo Bonzini wrote:
> Il 16/07/2013 20:11, Eduardo Habkost ha scritto:
> > For physical bit size, what about extending it in a backwards-compatible
> > way? Something like this:
> > 
> >     *eax = 0x0003000; /* 48 bits virtual */
> >     if (ram_size < 1TB) {
> >         physical_size = 40; /* Keeping backwards compatibility */
> >     } else if (ram_size < 4TB) {
> >         physical_size = 42;
> 
> Why not go straight up to 44?

I simply trusted the comment saying: "The physical address space is
limited to 42 bits in exec.c", and assumed we had a 42-bit limit
somewhere else.

We could also try something like this:

    if (ram_size < 1TB) {
        physical_size = 40; /* Keeping backwards compatibility */
    } else {
        physical_size = msb(ram_size);
    }
    if (supported_host_physical_size() < physical_size) {
        abort();
    }
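
A hedged sketch of the msb() placeholder used above, written as "number
of address bits needed to cover this size" so that a non-power-of-two
size is rounded up rather than truncated:

    /* Smallest n such that 'limit' bytes fit in an n-bit physical address
     * space, i.e. limit <= (1 << n).  Placeholder for msb() above. */
    #include <stdint.h>

    unsigned int phys_bits_for(uint64_t limit)
    {
        unsigned int bits = 0;

        while (bits < 63 && (1ULL << bits) < limit) {
            bits++;
        }
        return bits;
    }

    /* phys_bits_for(1ULL << 40) == 40, phys_bits_for((1ULL << 40) + 1) == 41 */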


> 
> >     } else {
> >         abort();
> >     }
> >     if (supported_host_physical_size() < physical_size) {
> >         abort();
> >     }
> >     *eax |= physical_size;
> > 
> > (Of course, the abort() calls should be replaced with proper error
> > reporting)
> 
> This makes sense too.  Though the best would be of course to use CPUID
> values coming from the real processors, and only using 40 for backwards
> compatibility.

We can't use the values coming from the real processors directly, or
we will break live migration.

If we sent those CPUID bits as part of the migration stream, it would
make it a little safer, but then it would be impossible for libvirt to
tell if it is really possible to migrate from one host to another.

In other words, if QEMU silently decides to expose a different ABI
depending on host capabilities, we have to either: a) silently break the
ABI when migrating to an incompatible host; or b) prevent migration to an
incompatible host, but make it impossible for the management layer to
know whether migration is really possible.

(The above applies to CPUID leaves 0xA and 0xD as well)

-- 
Eduardo

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-16 19:42         ` Eduardo Habkost
@ 2013-07-17  8:09           ` Paolo Bonzini
  2013-07-17 13:39             ` Eduardo Habkost
  2013-07-17 15:19           ` Gleb Natapov
  1 sibling, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2013-07-17  8:09 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: Andrea Arcangeli, qemu-devel, Gleb Natapov, qemu-stable

Il 16/07/2013 21:42, Eduardo Habkost ha scritto:
> On Tue, Jul 16, 2013 at 09:24:30PM +0200, Paolo Bonzini wrote:
>> Il 16/07/2013 20:11, Eduardo Habkost ha scritto:
>>> For physical bit size, what about extending it in a backwards-compatible
>>> way? Something like this:
>>>
>>>     *eax = 0x0003000; /* 48 bits virtual */
>>>     if (ram_size < 1TB) {
>>>         physical_size = 40; /* Keeping backwards compatibility */
>>>     } else if (ram_size < 4TB) {
>>>         physical_size = 42;
>>
>> Why not go straight up to 44?
> 
> I simply trusted the comment saying: "The physical address space is
> limited to 42 bits in exec.c", and assumed we had a 42-bit limit
> somewhere else.

Yeah, that's obsolete.  We now can go up to 64 (but actually only
support 52 because that's what Intel says will be the limit -- 4PB RAM
should be good for everyone, as Bill Gates used to say).

So far Intel has been upgrading the physical RAM size in x16 steps
(MAXPHYADDR was 36, then 40, then 44).  MAXPHYADDR is how Intel calls
what you wrote as physical_size.

>     if (ram_size < 1TB) {
>         physical_size = 40; /* Keeping backwards compatibility */
>     } else {
>         physical_size = msb(ram_size);
>     }
>     if (supported_host_physical_size() < physical_size) {
>         abort();
>     }

Not enough because there are processors with 36.  So perhaps, putting
together both of your ideas:

     if (supported_host_physical_size() < msb(ram_size)) {
         abort();
     }
     if (ram_size < 64GB && !some_compat_prop) {
         physical_size = 36;
     } else if (ram_size < 1TB) {
         physical_size = 40;
     } else {
         physical_size = 44;
     }

What do you think?
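
Folding both checks into one hedged C sketch, with an error return in
place of abort() (the host capability and the required bit count are
passed in, standing in for supported_host_physical_size() and
msb(ram_size), so those helper names stay placeholders):

    /* Sketch only.  Returns 0 and sets *guest_phys_bits, or -1 when the
     * host cannot address the requested amount of guest RAM. */
    #include <stdbool.h>
    #include <stdint.h>

    int pick_guest_phys_bits(uint64_t ram_size,
                             unsigned int host_phys_bits,
                             unsigned int needed_bits,
                             bool compat_40bit,   /* ~ some_compat_prop */
                             unsigned int *guest_phys_bits)
    {
        if (host_phys_bits < needed_bits) {
            return -1;                  /* report an error instead of abort() */
        }
        if (ram_size < (64ULL << 30) && !compat_40bit) {
            *guest_phys_bits = 36;
        } else if (ram_size < (1ULL << 40)) {
            *guest_phys_bits = 40;
        } else {
            *guest_phys_bits = 44;
        }
        return 0;
    }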

>> This makes sense too.  Though the best would be of course to use CPUID
>> values coming from the real processors, and only using 40 for backwards
>> compatibility.
> 
> We can't use the values coming from the real processors directly, or
> we will break live migration.

I said real processors, not host processors. :)

So a Core 2 has MAXPHYADDR=36, Nehalem has IIRC 40, Sandy Bridge has 44,
and so on.

> If we sent those CPUID bits as part of the migration stream, it would
> make it a little safer, but then it would be impossible for libvirt to
> tell if it is really possible to migrate from one host to another.

The libvirt problem still remains, since libvirt currently doesn't know
the MAXPHYADDRs and would have to learn them.

I guess the above "artificial" computation of MAXPHYADDR is enough.

Paolo

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-17  8:09           ` Paolo Bonzini
@ 2013-07-17 13:39             ` Eduardo Habkost
  2013-07-17 14:01               ` Paolo Bonzini
  0 siblings, 1 reply; 14+ messages in thread
From: Eduardo Habkost @ 2013-07-17 13:39 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Andrea Arcangeli, qemu-devel, Gleb Natapov, qemu-stable

On Wed, Jul 17, 2013 at 10:09:01AM +0200, Paolo Bonzini wrote:
> Il 16/07/2013 21:42, Eduardo Habkost ha scritto:
> > On Tue, Jul 16, 2013 at 09:24:30PM +0200, Paolo Bonzini wrote:
> >> Il 16/07/2013 20:11, Eduardo Habkost ha scritto:
> >>> For physical bit size, what about extending it in a backwards-compatible
> >>> way? Something like this:
> >>>
> >>>     *eax = 0x0003000; /* 48 bits virtual */
> >>>     if (ram_size < 1TB) {
> >>>         physical_size = 40; /* Keeping backwards compatibility */
> >>>     } else if (ram_size < 4TB) {
> >>>         physical_size = 42;
> >>
> >> Why not go straight up to 44?
> > 
> > I simply trusted the comment saying: "The physical address space is
> > limited to 42 bits in exec.c", and assumed we had a 42-bit limit
> > somewhere else.
> 
> Yeah, that's obsolete.  We now can go up to 64 (but actually only
> support 52 because that's what Intel says will be the limit----4PB RAM
> should be good for everyone, as Bill Gates used to say).
> 
> So far Intel has been upgrading the physical RAM size in x16 steps
> (MAXPHYADDR was 36, then 40, then 44).  MAXPHYADDR is how Intel calls
> what you wrote as physical_size.

Then if ram_size is too large, we could round it up to a multiple of 4?

> 
> >     if (ram_size < 1TB) {
> >         physical_size = 40; /* Keeping backwards compatibility */
> >     } else {
> >         physical_size = msb(ram_size);
> >     }
> >     if (supported_host_physical_size() < physical_size) {
> >         abort();
> >     }
> 
> Not enough because there are processors with 36.  So perhaps, putting
> together both of your ideas:
> 
>      if (supported_host_physical_size() < msb(ram_size)) {
>          abort();
>      }

What if the host max physical size is smaller than the MAXPHYADDR we're
setting for the VM (but still larger than msb(ram_size))? Will the CPU
or KVM complain, or will it work?

In other words, do we need a check for
  (supported_host_physical_size() < physical_size)
below, as well?

>      if (ram_size < 64GB && !some_compat_prop) {
>          physical_size = 36;
>      } else if (ram_size < 1TB) {
>          physical_size = 40;
>      } else {
>          physical_size = 44;
>      }
> 
> What do you think?

Why stop at 44? What if ram_size is larger than (1 << 44)?

We must never start a VM with physical_size < msb(ram_size), right?

But I am not sure if we should simply increase physical_size
automatically, or abort in some cases where physical_size ends up
being too small. (I believe we should simply increase it automatically.)

> 
> >> This makes sense too.  Though the best would be of course to use CPUID
> >> values coming from the real processors, and only using 40 for backwards
> >> compatibility.
> > 
> > We can't use the values coming from the real processors directly, or
> > we will break live migration.
> 
> I said real processors, not host processors. :)

So, you mean setting per-cpu-model values?

> 
> So a Core 2 has MAXPHYADDR=36, Nehalem has IIRC 40, Sandy Bridge has 44,
> and so on.

That would work as well (and on pc-1.5 and older we could keep
physical_size=40 or use the above algorithm). But: I wonder if it would
be usable. If somebody is using "-cpu Nehalem -m 8T", I believe it would
make sense to increase the physical address size automatically instead
of aborting. What do you think?

> 
> > If we sent those CPUID bits as part of the migration stream, it would
> > make it a little safer, but then it would be impossible for libvirt to
> > tell if it is really possible to migrate from one host to another.
> 
> The libvirt problem still remains, since libvirt currently doesn't know
> the MAXPHYADDRs and would have to learn them.
> 
> I guess the above "artificial" computation of MAXPHYADDR is enough.

If the VM is already broken and has MAXPHYADDR < msb(ram_size), I don't
worry about ABI stability, because those VMs are not even supposed to be
running properly. If somebody has such a broken VM running, it seems
better to make the MAXPHYADDR bits suddenly change to a reasonable value
(so that MAXPHYADDR >= msb(ram_size)) during live-migration than
aborting migration, or keeping the broken value.

So if we use an algorithm that always increases MAXPHYADDR automatically
(breaking the ABI in the cases where the current is already broken), we
are not going to abort migration unless the host really can't run our
guest (this seems to be a less serious problem than aborting because the
VM configuration is broken and needs to be manually adjusted).

-- 
Eduardo

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-17 13:39             ` Eduardo Habkost
@ 2013-07-17 14:01               ` Paolo Bonzini
  0 siblings, 0 replies; 14+ messages in thread
From: Paolo Bonzini @ 2013-07-17 14:01 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: Andrea Arcangeli, qemu-devel, Gleb Natapov, qemu-stable

Il 17/07/2013 15:39, Eduardo Habkost ha scritto:
> On Wed, Jul 17, 2013 at 10:09:01AM +0200, Paolo Bonzini wrote:
>> Il 16/07/2013 21:42, Eduardo Habkost ha scritto:
>>> On Tue, Jul 16, 2013 at 09:24:30PM +0200, Paolo Bonzini wrote:
>>>> Il 16/07/2013 20:11, Eduardo Habkost ha scritto:
>>>>> For physical bit size, what about extending it in a backwards-compatible
>>>>> way? Something like this:
>>>>>
>>>>>     *eax = 0x0003000; /* 48 bits virtual */
>>>>>     if (ram_size < 1TB) {
>>>>>         physical_size = 40; /* Keeping backwards compatibility */
>>>>>     } else if (ram_size < 4TB) {
>>>>>         physical_size = 42;
>>>>
>>>> Why not go straight up to 44?
>>>
>>> I simply trusted the comment saying: "The physical address space is
>>> limited to 42 bits in exec.c", and assumed we had a 42-bit limit
>>> somewhere else.
>>
>> Yeah, that's obsolete.  We now can go up to 64 (but actually only
>> support 52 because that's what Intel says will be the limit -- 4PB RAM
>> should be good for everyone, as Bill Gates used to say).
>>
>> So far Intel has been upgrading the physical RAM size in x16 steps
>> (MAXPHYADDR was 36, then 40, then 44).  MAXPHYADDR is how Intel calls
>> what you wrote as physical_size.
> 
> Then if ram_size is too large, we could round it up to a multiple of 4?

Yeah, that would be closer to real processors.

>>      if (supported_host_physical_size() < msb(ram_size)) {
>>          abort();
>>      }
> 
> What if the host max physical size is smaller than the MAXPHYADDR we're
> setting for the VM (but still larger than msb(ram_size))? Will the CPU
> or KVM complain, or will it work?

That would only happen with a buggy guest that points page tables to
non-existent RAM, i.e.:

- host has MAXPHYADDR = 36, corresponding to 64 GB physical address space

- guest has 2 GB RAM, but still we set MAXPHYADDR = 40 for backwards
compatibility

- guest creates a page table for RAM beyond 64GB just because it can

KVM would complain, real hardware probably would return all-ones or
something like that.  I think we can ignore it.

> In other words, do we need a check for
>   (supported_host_physical_size() < physical_size)
> below, as well?

That would have to be conditional on !some_compat_prop, too, because
MAXPHYADDR=40 is beyond what older VMX-enabled CPUs support (for example
Core 2 has MAXPHYADDR=36).  I don't think it's important though.  The
less stringent test vs. msb(ram_size) should be enough.

>>      if (ram_size < 64GB && !some_compat_prop) {
>>          physical_size = 36;
>>      } else if (ram_size < 1TB) {
>>          physical_size = 40;
>>      } else {
>>          physical_size = 44;
>>      }
>>
>> What do you think?
> 
> Why stop at 44? What if ram_size is larger than (1 << 44)?

You can certainly add more.  However, no processors exist yet that have
MAXPHYADDR > 44, so you would have already failed the check above.  I
didn't go beyond 44 because who knows, maybe Intel will go from 44 to 46
and break the tradition of increasing by 4.

> We must never start a VM with physical_size < msb(ram_size), right?
> 
> But I am not sure if we should simply increase physical_size
> automatically, or abort in some cases where physical_size ends up
> being too small. (I believe we should simply increase it automatically.)

Increasing automatically is fine, that's basically what those "if"s do.

>>>> This makes sense too.  Though the best would be of course to use CPUID
>>>> values coming from the real processors, and only using 40 for backwards
>>>> compatibility.
>>>
>>> We can't use the values coming from the real processors directly, or
>>> we will break live migration.
>>
>> I said real processors, not host processors. :)
> 
> So, you mean setting per-cpu-model values?

Yes.

>> So a Core 2 has MAXPHYADDR=36, Nehalem has IIRC 40, Sandy Bridge has 44,
>> and so on.
> 
> That would work as well (and on pc-1.5 and older we could keep
> physical_size=40 or use the above algorithm). But: I wonder if it would
> be usable. If somebody is using "-cpu Nehalem -m 8T", I believe it would
> make sense to increase the physical address size automatically instead
> of aborting. What do you think?

I think I agree.  This leaf is used so little by the guest (the
hypervisor uses it) that mimicking real CPU models in everything is of
little value.

> If the VM is already broken and has MAXPHYADDR < msb(ram_size), I don't
> worry about ABI stability, because those VMs are not even supposed to be
> running properly. If somebody has such a broken VM running, it seems
> better to make the MAXPHYADDR bits suddenly change to a reasonable value
> (so that MAXPHYADDR >= msb(ram_size)) during live-migration than
> aborting migration, or keeping the broken value.
> 
> So if we use an algorithm that always increases MAXPHYADDR automatically
> (breaking the ABI in the cases where the current is already broken), we
> are not going to abort migration unless the host really can't run our
> guest (this seems to be a less serious problem than aborting because the
> VM configuration is broken and needs to be manually adjusted).

Agreed.  So perhaps per-cpu-model values aren't too useful.

Paolo

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-16 19:42         ` Eduardo Habkost
  2013-07-17  8:09           ` Paolo Bonzini
@ 2013-07-17 15:19           ` Gleb Natapov
  2013-07-17 21:20             ` Eduardo Habkost
  1 sibling, 1 reply; 14+ messages in thread
From: Gleb Natapov @ 2013-07-17 15:19 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: Andrea Arcangeli, Paolo Bonzini, qemu-devel, qemu-stable

On Tue, Jul 16, 2013 at 04:42:38PM -0300, Eduardo Habkost wrote:
> On Tue, Jul 16, 2013 at 09:24:30PM +0200, Paolo Bonzini wrote:
> > Il 16/07/2013 20:11, Eduardo Habkost ha scritto:
> > > For physical bit size, what about extending it in a backwards-compatible
> > > way? Something like this:
> > > 
> > >     *eax = 0x0003000; /* 48 bits virtual */
> > >     if (ram_size < 1TB) {
> > >         physical_size = 40; /* Keeping backwards compatibility */
> > >     } else if (ram_size < 4TB) {
> > >         physical_size = 42;
> > 
> > Why not go straight up to 44?
> 
> I simply trusted the comment saying: "The physical address space is
> limited to 42 bits in exec.c", and assumed we had a 42-bit limit
> somewhere else.
> 
> We could also try something like this:
> 
>     if (ram_size < 1TB) {
>         physical_size = 40; /* Keeping backwards compatibility */
>     } else {
>         physical_size = msb(ram_size);
>     }
>     if (supported_host_physical_size() < physical_size) {
>         abort();
>     }
> 
> 
ram_size is the thing we set with -m, right? Because if it is, then
using it here is incorrect since, due to the PCI hole, the max phys
address is higher than ram_size.
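
A hedged sketch of that difference, assuming the usual pc layout where
RAM that does not fit below the 32-bit PCI hole is mapped again starting
at 4 GiB (the hole boundary below is illustrative, not the machine's
actual property):

    /* Illustration only: with RAM split around the PCI hole, the highest
     * guest-physical address is larger than ram_size itself. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t ram_size  = 4ULL << 40;      /* e.g. -m 4T */
        uint64_t hole_base = 0xe0000000ULL;   /* illustrative 32-bit hole start */

        uint64_t below_4g = ram_size < hole_base ? ram_size : hole_base;
        uint64_t above_4g = ram_size - below_4g;
        uint64_t max_phys = above_4g ? (4ULL << 30) + above_4g : below_4g;

        printf("ram_size      = 0x%llx\n", (unsigned long long)ram_size);
        printf("max phys addr = 0x%llx\n", (unsigned long long)max_phys);
        return 0;
    }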

--
			Gleb.

* Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
  2013-07-17 15:19           ` Gleb Natapov
@ 2013-07-17 21:20             ` Eduardo Habkost
  0 siblings, 0 replies; 14+ messages in thread
From: Eduardo Habkost @ 2013-07-17 21:20 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Andrea Arcangeli, Paolo Bonzini, qemu-devel, qemu-stable

On Wed, Jul 17, 2013 at 06:19:28PM +0300, Gleb Natapov wrote:
> On Tue, Jul 16, 2013 at 04:42:38PM -0300, Eduardo Habkost wrote:
> > On Tue, Jul 16, 2013 at 09:24:30PM +0200, Paolo Bonzini wrote:
> > > Il 16/07/2013 20:11, Eduardo Habkost ha scritto:
> > > > For physical bit size, what about extending it in a backwards-compatible
> > > > way? Something like this:
> > > > 
> > > >     *eax = 0x0003000; /* 48 bits virtual */
> > > >     if (ram_size < 1TB) {
> > > >         physical_size = 40; /* Keeping backwards compatibility */
> > > >     } else if (ram_size < 4TB) {
> > > >         physical_size = 42;
> > > 
> > > Why not go straight up to 44?
> > 
> > I simply trusted the comment saying: "The physical address space is
> > limited to 42 bits in exec.c", and assumed we had a 42-bit limit
> > somewhere else.
> > 
> > We could also try something like this:
> > 
> >     if (ram_size < 1TB) {
> >         physical_size = 40; /* Keeping backwards compatibility */
> >     } else {
> >         physical_size = msb(ram_size);
> >     }
> >     if (supported_host_physical_size() < physical_size) {
> >         abort();
> >     }
> > 
> > 
> ram_size is the thing we set with -m, right? Because if it is, then
> using it here is incorrect since, due to the PCI hole, the max phys
> address is higher than ram_size.

Yeah, everywhere I used ram_size I actually meant "max phys address".

-- 
Eduardo
