All of lore.kernel.org
* [Qemu-devel] [PATCH 0/5] x86: Physical address limit patches
@ 2016-06-16 17:12 Dr. David Alan Gilbert (git)
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro Dr. David Alan Gilbert (git)
                   ` (4 more replies)
  0 siblings, 5 replies; 73+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2016-06-16 17:12 UTC (permalink / raw)
  To: qemu-devel, pbonzini, aarcange, ehabkost

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

QEMU sets the guest's physical address bits to 40; this is wrong
on most hardware, and can theoretically be detected by the guest.
It also stops you from using really huge multi-TB VMs.

Red Hat has carried a patch downstream for a couple of years,
written by Andrea, that reads the host's value and uses that in the
guest.  That's correct as far as the guest can see, and lets you
create huge VMs.

The downside is that if you've got a mix of hosts, say an i7 and a Xeon,
life gets complicated in migration; prior to 2.6 it all apparently
worked (although a guest that looked closely might have spotted the change).
In 2.6 Paolo started checking MSR writes, and they failed when the
incoming MTRR mask didn't fit.

This series:
   a) Fixes up mtrr masks so that if you're migrating between hosts
      of different physical address size it tries to do something sensible.

   b) Lets you specify the guest physical address size via a CPU property, i.e.
        -cpu SandyBridge,phys-bits=36

      The default is to use the existing 40 bits value.

   c) Lets you tell qemu to use the same setting as the host, i.e.
        -cpu SandyBridge,phys-bits=0

Note that mixed-size hosts are still not necessarily safe; a guest
started on a host with a large physical address size might start using
those bits and get upset when it's moved to a smaller host.
However, that was already potentially broken in existing qemu, which
used a magic value of 40.

There's potential to add some extra guards against people
doing silly stuff, e.g. stopping people from running VMs that use 1TB of
address space on a tiny host.

Dave

Dr. David Alan Gilbert (5):
  BIT_RANGE convenience macro
  x86: Mask mtrr mask based on CPU physical address limits
  x86: fill high bits of mtrr mask
  x86: Allow physical address bits to be set
  x86: Set physical address bits based on host

 include/hw/i386/pc.h  |  5 +++++
 include/qemu/bitops.h |  3 +++
 target-i386/cpu.c     | 36 +++++++++++++++++++++++++++++++++---
 target-i386/cpu.h     |  7 +++++++
 target-i386/kvm.c     | 38 +++++++++++++++++++++++++++++++++++---
 5 files changed, 83 insertions(+), 6 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro
  2016-06-16 17:12 [Qemu-devel] [PATCH 0/5] x86: Physical address limit patches Dr. David Alan Gilbert (git)
@ 2016-06-16 17:12 ` Dr. David Alan Gilbert (git)
  2016-06-16 17:23   ` Paolo Bonzini
  2016-06-16 18:01   ` Peter Maydell
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 2/5] x86: Mask mtrr mask based on CPU physical address limits Dr. David Alan Gilbert (git)
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 73+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2016-06-16 17:12 UTC (permalink / raw)
  To: qemu-devel, pbonzini, aarcange, ehabkost

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

e.g. BIT_RANGE(15, 0) gives 0xff00

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/qemu/bitops.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
index 755fdd1..e411688 100644
--- a/include/qemu/bitops.h
+++ b/include/qemu/bitops.h
@@ -23,6 +23,9 @@
 #define BIT_MASK(nr)            (1UL << ((nr) % BITS_PER_LONG))
 #define BIT_WORD(nr)            ((nr) / BITS_PER_LONG)
 #define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
+/* e.g. BIT_RANGE(15, 0) -> 0xff00 */
+#define BIT_RANGE(hb, lb)     ((2ull << (hb)) - (1ull << (lb)))
+
 
 /**
  * set_bit - Set a bit in memory
-- 
2.7.4


* [Qemu-devel] [PATCH 2/5] x86: Mask mtrr mask based on CPU physical address limits
  2016-06-16 17:12 [Qemu-devel] [PATCH 0/5] x86: Physical address limit patches Dr. David Alan Gilbert (git)
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro Dr. David Alan Gilbert (git)
@ 2016-06-16 17:12 ` Dr. David Alan Gilbert (git)
  2016-06-16 19:59   ` Eduardo Habkost
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask Dr. David Alan Gilbert (git)
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2016-06-16 17:12 UTC (permalink / raw)
  To: qemu-devel, pbonzini, aarcange, ehabkost

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The CPU GPs if we try to set a bit in a variable MTRR mask above
the limit of physical address bits on the host.  We hit this
when loading a migration from a host with a larger physical
address limit than our destination (e.g. a Xeon->i7 of the same
generation), but we previously used to get away with it,
until 48e1a45 started checking that MSR writes actually worked.

In our case the GP apparently comes from KVM emulating
that GP.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 target-i386/cpu.c | 13 +++++++++++++
 target-i386/cpu.h |  1 +
 target-i386/kvm.c | 13 +++++++++++--
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 3665fec..f9302f9 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2668,6 +2668,19 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
     }
 }
 
+/* Returns the number of physical bits supported by the guest CPU */
+unsigned int cpu_x86_guest_phys_address_bits(CPUX86State *env)
+{
+    /* the cpuid emulation code already calculates a value to give to the
+     * guest and this should match.
+     */
+    uint32_t eax, unused;
+    cpu_x86_cpuid(env, 0x80000008, 0, &eax, &unused, &unused, &unused);
+
+    /* Bottom 8 bits of leaf 0x80000008 see Table 3-17 in CPUID definition */
+    return eax & 0xff;
+}
+
 /* CPUClass::reset() */
 static void x86_cpu_reset(CPUState *s)
 {
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index d9ab884..61c06b7 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -1238,6 +1238,7 @@ X86CPU *cpu_x86_create(const char *cpu_model, Error **errp);
 int cpu_x86_exec(CPUState *cpu);
 void x86_cpu_list(FILE *f, fprintf_function cpu_fprintf);
 int cpu_x86_support_mca_broadcast(CPUX86State *env);
+unsigned int cpu_x86_guest_phys_address_bits(CPUX86State *env);
 
 int cpu_get_pic_interrupt(CPUX86State *s);
 /* MSDOS compatibility mode FPU exception support */
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index abf50e6..d95d06b 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1674,6 +1674,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
             }
         }
         if (has_msr_mtrr) {
+            uint64_t phys_mask = BIT_RANGE(
+                                   cpu_x86_guest_phys_address_bits(env) - 1,
+                                   0);
+
             kvm_msr_entry_add(cpu, MSR_MTRRdefType, env->mtrr_deftype);
             kvm_msr_entry_add(cpu, MSR_MTRRfix64K_00000, env->mtrr_fixed[0]);
             kvm_msr_entry_add(cpu, MSR_MTRRfix16K_80000, env->mtrr_fixed[1]);
@@ -1687,10 +1691,15 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
             kvm_msr_entry_add(cpu, MSR_MTRRfix4K_F0000, env->mtrr_fixed[9]);
             kvm_msr_entry_add(cpu, MSR_MTRRfix4K_F8000, env->mtrr_fixed[10]);
             for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
+                /* The CPU GPs if we write to a bit above the physical limit of
+                 * the host CPU (and KVM emulates that)
+                 */
+                uint64_t mask = env->mtrr_var[i].mask;
+                mask &= phys_mask;
+
                 kvm_msr_entry_add(cpu, MSR_MTRRphysBase(i),
                                   env->mtrr_var[i].base);
-                kvm_msr_entry_add(cpu, MSR_MTRRphysMask(i),
-                                  env->mtrr_var[i].mask);
+                kvm_msr_entry_add(cpu, MSR_MTRRphysMask(i), mask);
             }
         }
 
-- 
2.7.4


* [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-16 17:12 [Qemu-devel] [PATCH 0/5] x86: Physical address limit patches Dr. David Alan Gilbert (git)
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro Dr. David Alan Gilbert (git)
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 2/5] x86: Mask mtrr mask based on CPU physical address limits Dr. David Alan Gilbert (git)
@ 2016-06-16 17:12 ` Dr. David Alan Gilbert (git)
  2016-06-16 20:14   ` Eduardo Habkost
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set Dr. David Alan Gilbert (git)
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 5/5] x86: Set physical address bits based on host Dr. David Alan Gilbert (git)
  4 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2016-06-16 17:12 UTC (permalink / raw)
  To: qemu-devel, pbonzini, aarcange, ehabkost

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Fill the bits from bit 51 down to the number of physical address bits
in the MTRR_PHYSMASKn variable range MTRR masks so that they're
consistent in the migration stream irrespective of the physical
address space of the migration source.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/hw/i386/pc.h |  5 +++++
 target-i386/cpu.c    |  1 +
 target-i386/cpu.h    |  3 +++
 target-i386/kvm.c    | 25 ++++++++++++++++++++++++-
 4 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 49566c8..3883d04 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -362,6 +362,11 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
         .driver   = TYPE_X86_CPU,\
         .property = "cpuid-0xb",\
         .value    = "off",\
+    },\
+    {\
+        .driver = TYPE_X86_CPU,\
+        .property = "fill-mtrr-mask",\
+        .value = "off",\
     },
 
 #define PC_COMPAT_2_5 \
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index f9302f9..4111240 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -3272,6 +3272,7 @@ static Property x86_cpu_properties[] = {
     DEFINE_PROP_BOOL("check", X86CPU, check_cpuid, true),
     DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
     DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),
+    DEFINE_PROP_BOOL("fill-mtrr-mask", X86CPU, fill_mtrr_mask, true),
     DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
     DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
     DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 61c06b7..13d8501 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -1181,6 +1181,9 @@ struct X86CPU {
     /* Compatibility bits for old machine types: */
     bool enable_cpuid_0xb;
 
+    /* if true fill the top bits of the MTRR_PHYSMASKn variable range */
+    bool fill_mtrr_mask;
+
     /* in order to simplify APIC support, we leave this pointer to the
        user */
     struct DeviceState *apic_state;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index d95d06b..9b5158e 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1934,6 +1934,7 @@ static int kvm_get_msrs(X86CPU *cpu)
     CPUX86State *env = &cpu->env;
     struct kvm_msr_entry *msrs = cpu->kvm_msr_buf->entries;
     int ret, i;
+    uint64_t mtrr_top_bits;
 
     kvm_msr_buf_reset(cpu);
 
@@ -2083,6 +2084,27 @@ static int kvm_get_msrs(X86CPU *cpu)
     }
 
     assert(ret == cpu->kvm_msr_buf->nmsrs);
+    /*
+     * MTRR masks: Each mask consists of 5 parts
+     * a  10..0: must be zero
+     * b  11   : valid bit
+     * c n-1.12: actual mask bits
+     * d  51..n: reserved must be zero
+     * e  63.52: reserved must be zero
+     *
+     * 'n' is the number of physical bits supported by the CPU and is
+     * apparently always <= 52.   We know our 'n' but don't know what
+     * the destination's 'n' is; it might be smaller, in which case
+     * it masks (c) on loading. It might be larger, in which case
+     * we fill 'd' so that d..c is consistent irrespective of the 'n'
+     * we're migrating to.
+     */
+    if (cpu->fill_mtrr_mask) {
+        mtrr_top_bits = BIT_RANGE(51, cpu_x86_guest_phys_address_bits(env));
+    } else {
+        mtrr_top_bits = 0;
+    }
+
     for (i = 0; i < ret; i++) {
         uint32_t index = msrs[i].index;
         switch (index) {
@@ -2278,7 +2300,8 @@ static int kvm_get_msrs(X86CPU *cpu)
             break;
         case MSR_MTRRphysBase(0) ... MSR_MTRRphysMask(MSR_MTRRcap_VCNT - 1):
             if (index & 1) {
-                env->mtrr_var[MSR_MTRRphysIndex(index)].mask = msrs[i].data;
+                env->mtrr_var[MSR_MTRRphysIndex(index)].mask = msrs[i].data |
+                                                               mtrr_top_bits;
             } else {
                 env->mtrr_var[MSR_MTRRphysIndex(index)].base = msrs[i].data;
             }
-- 
2.7.4


* [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-16 17:12 [Qemu-devel] [PATCH 0/5] x86: Physical address limit patches Dr. David Alan Gilbert (git)
                   ` (2 preceding siblings ...)
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask Dr. David Alan Gilbert (git)
@ 2016-06-16 17:12 ` Dr. David Alan Gilbert (git)
  2016-06-16 17:26   ` Paolo Bonzini
  2016-06-16 20:24   ` Eduardo Habkost
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 5/5] x86: Set physical address bits based on host Dr. David Alan Gilbert (git)
  4 siblings, 2 replies; 73+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2016-06-16 17:12 UTC (permalink / raw)
  To: qemu-devel, pbonzini, aarcange, ehabkost

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Currently QEMU sets the x86 number of physical address bits to the
magic number 40.  This is only correct on some small AMD systems;
Intel systems tend to have 36, 39, or 46 bits, and large AMD systems
tend to have 48.

Having a value different from your actual hardware is detectable
by the guest and in principle can cause problems.
The current limit of 40 stops multi-TB VMs being created by those lucky
enough to have that much memory.

This patch lets you set the physical bits via a CPU property, but
defaults to the same existing magic 40.

I've removed the ancient warning about the 42 bit limit in exec.c;
I can't find that limit in there and no one else seems to know where
it is.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 target-i386/cpu.c | 8 +++++---
 target-i386/cpu.h | 3 +++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 4111240..c3bbf8e 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2606,9 +2606,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         /* virtual & phys address size in low 2 bytes. */
 /* XXX: This value must match the one used in the MMU code. */
         if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
-            /* 64 bit processor */
-/* XXX: The physical address space is limited to 42 bits in exec.c. */
-            *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
+            /* 64 bit processor, 48 bits virtual, configurable
+             * physical bits.
+             */
+            *eax = 0x00003000 + cpu->phys_bits;
         } else {
             if (env->features[FEAT_1_EDX] & CPUID_PSE36) {
                 *eax = 0x00000024; /* 36 bits physical */
@@ -3273,6 +3274,7 @@ static Property x86_cpu_properties[] = {
     DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
     DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),
     DEFINE_PROP_BOOL("fill-mtrr-mask", X86CPU, fill_mtrr_mask, true),
+    DEFINE_PROP_UINT32("phys-bits", X86CPU, phys_bits, 40),
     DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
     DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
     DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 13d8501..fccba72 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -1184,6 +1184,9 @@ struct X86CPU {
     /* if true fill the top bits of the MTRR_PHYSMASKn variable range */
     bool fill_mtrr_mask;
 
+    /* Number of physical address bits supported */
+    uint32_t phys_bits;
+
     /* in order to simplify APIC support, we leave this pointer to the
        user */
     struct DeviceState *apic_state;
-- 
2.7.4


* [Qemu-devel] [PATCH 5/5] x86: Set physical address bits based on host
  2016-06-16 17:12 [Qemu-devel] [PATCH 0/5] x86: Physical address limit patches Dr. David Alan Gilbert (git)
                   ` (3 preceding siblings ...)
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set Dr. David Alan Gilbert (git)
@ 2016-06-16 17:12 ` Dr. David Alan Gilbert (git)
  2016-06-17  7:25   ` Igor Mammedov
  4 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2016-06-16 17:12 UTC (permalink / raw)
  To: qemu-devel, pbonzini, aarcange, ehabkost

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

A special case based on the phys-bits property from the previous patch:
if it's the magic value 0, use the host's capabilities.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 target-i386/cpu.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index c3bbf8e..e03e48f 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2609,7 +2609,21 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
             /* 64 bit processor, 48 bits virtual, configurable
              * physical bits.
              */
-            *eax = 0x00003000 + cpu->phys_bits;
+            if (cpu->phys_bits != 0) {
+                *eax = 0x00003000 + cpu->phys_bits;
+            } else {
+                /* phys_bits set to 0 -> Try and read the host, again
+                 * fall back to the magic 40 qemu used for a long time
+                 * Note: This is setting the virtual size as well from
+                 * the host; TODO: Split that out
+                 */
+                uint32_t _eax;
+                *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
+                host_cpuid(0x80000000, 0, &_eax, NULL, NULL, NULL);
+                if (_eax >= 0x80000008) {
+                    host_cpuid(0x80000008, 0, eax, NULL, NULL, NULL);
+                }
+            }
         } else {
             if (env->features[FEAT_1_EDX] & CPUID_PSE36) {
                 *eax = 0x00000024; /* 36 bits physical */
-- 
2.7.4


* Re: [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro Dr. David Alan Gilbert (git)
@ 2016-06-16 17:23   ` Paolo Bonzini
  2016-06-16 17:24     ` Dr. David Alan Gilbert
  2016-06-16 18:01   ` Peter Maydell
  1 sibling, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-16 17:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, aarcange, ehabkost



On 16/06/2016 19:12, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> e.g. BIT_RANGE(15, 0) gives 0xff00

That looks more like BIT_RANGE(15, 8). :)

Paolo

> Suggested by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/qemu/bitops.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
> index 755fdd1..e411688 100644
> --- a/include/qemu/bitops.h
> +++ b/include/qemu/bitops.h
> @@ -23,6 +23,9 @@
>  #define BIT_MASK(nr)            (1UL << ((nr) % BITS_PER_LONG))
>  #define BIT_WORD(nr)            ((nr) / BITS_PER_LONG)
>  #define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
> +/* e.g. BIT_RANGE(15, 0) -> 0xff00 */
> +#define BIT_RANGE(hb, lb)     ((2ull << (hb)) - (1ull << (lb)))
> +
>  


* Re: [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro
  2016-06-16 17:23   ` Paolo Bonzini
@ 2016-06-16 17:24     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-16 17:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, aarcange, ehabkost

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 16/06/2016 19:12, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > e.g. BIT_RANGE(15, 0) gives 0xff00
> 
> That looks more like BIT_RANGE(15, 8). :)

Yes (and in the comment below) - I'll repost after
waiting to see if there's any other nasties.

Dave

> 
> Paolo
> 
> > Suggested by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/qemu/bitops.h | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
> > index 755fdd1..e411688 100644
> > --- a/include/qemu/bitops.h
> > +++ b/include/qemu/bitops.h
> > @@ -23,6 +23,9 @@
> >  #define BIT_MASK(nr)            (1UL << ((nr) % BITS_PER_LONG))
> >  #define BIT_WORD(nr)            ((nr) / BITS_PER_LONG)
> >  #define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
> > +/* e.g. BIT_RANGE(15, 0) -> 0xff00 */
> > +#define BIT_RANGE(hb, lb)     ((2ull << (hb)) - (1ull << (lb)))
> > +
> >  
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set Dr. David Alan Gilbert (git)
@ 2016-06-16 17:26   ` Paolo Bonzini
  2016-06-16 18:09     ` Eduardo Habkost
  2016-06-16 20:24   ` Eduardo Habkost
  1 sibling, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-16 17:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel, aarcange, ehabkost



On 16/06/2016 19:12, Dr. David Alan Gilbert (git) wrote:
>          if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
> -            /* 64 bit processor */
> -/* XXX: The physical address space is limited to 42 bits in exec.c. */
> -            *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
> +            /* 64 bit processor, 48 bits virtual, configurable
> +             * physical bits.
> +             */
> +            *eax = 0x00003000 + cpu->phys_bits;

This probably ought to be range checked in realize.  I think you can
also implement the magic 0 value there and have a bit less indentation.

In addition, "-cpu host" should have phys-bits=0.

Otherwise looks great.

Paolo

>          } else {
>              if (env->features[FEAT_1_EDX] & CPUID_PSE36) {
>                  *eax = 0x00000024; /* 36 bits physical */
> @@ -3273,6 +3274,7 @@ static Property x86_cpu_properties[] = {
>      DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
>      DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),
>      DEFINE_PROP_BOOL("fill-mtrr-mask", X86CPU, fill_mtrr_mask, true),
> +    DEFINE_PROP_UINT32("phys-bits", X86CPU, phys_bits, 40),
>      DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
>      DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
>      DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index 13d8501..fccba72 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -1184,6 +1184,9 @@ struct X86CPU {
>      /* if true fill the top bits of the MTRR_PHYSMASKn variable range */
>      bool fill_mtrr_mask;
>  
> +    /* Number of physical address bits supported */
> +    uint32_t phys_bits;


* Re: [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro Dr. David Alan Gilbert (git)
  2016-06-16 17:23   ` Paolo Bonzini
@ 2016-06-16 18:01   ` Peter Maydell
  2016-06-16 18:05     ` Paolo Bonzini
  2016-06-20 14:11     ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 73+ messages in thread
From: Peter Maydell @ 2016-06-16 18:01 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: QEMU Developers, Paolo Bonzini, Andrea Arcangeli, Eduardo Habkost

On 16 June 2016 at 18:12, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> e.g. BIT_RANGE(15, 0) gives 0xff00
>
> Suggested by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  include/qemu/bitops.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
> index 755fdd1..e411688 100644
> --- a/include/qemu/bitops.h
> +++ b/include/qemu/bitops.h
> @@ -23,6 +23,9 @@
>  #define BIT_MASK(nr)            (1UL << ((nr) % BITS_PER_LONG))
>  #define BIT_WORD(nr)            ((nr) / BITS_PER_LONG)
>  #define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
> +/* e.g. BIT_RANGE(15, 0) -> 0xff00 */
> +#define BIT_RANGE(hb, lb)     ((2ull << (hb)) - (1ull << (lb)))

Isn't this undefined behaviour if the hb is 63?

Also there is semantic overlap with the MAKE_64BIT_MASK macro
proposed in https://lists.gnu.org/archive/html/qemu-devel/2016-05/msg02154.html
(which also has ub, but see
https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg02614.html
for the version which doesn't).

I prefer a "start, length" macro to "position, position",
because this matches what we already have for the deposit
and extract functions in this header.

thanks
-- PMM


* Re: [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro
  2016-06-16 18:01   ` Peter Maydell
@ 2016-06-16 18:05     ` Paolo Bonzini
  2016-06-20 14:11     ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-16 18:05 UTC (permalink / raw)
  To: Peter Maydell, Dr. David Alan Gilbert (git)
  Cc: QEMU Developers, Andrea Arcangeli, Eduardo Habkost



On 16/06/2016 20:01, Peter Maydell wrote:
>> diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
>> index 755fdd1..e411688 100644
>> --- a/include/qemu/bitops.h
>> +++ b/include/qemu/bitops.h
>> @@ -23,6 +23,9 @@
>>  #define BIT_MASK(nr)            (1UL << ((nr) % BITS_PER_LONG))
>>  #define BIT_WORD(nr)            ((nr) / BITS_PER_LONG)
>>  #define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
>> +/* e.g. BIT_RANGE(15, 0) -> 0xff00 */
>> +#define BIT_RANGE(hb, lb)     ((2ull << (hb)) - (1ull << (lb)))
> 
> Isn't this undefined behaviour if the hb is 63?

No, these are unsigned values so there's no undefined behavior.

> I prefer a "start, length" macro to "position, position",
> because this matches what we already have for the deposit
> and extract functions in this header.

That's fine too, albeit in the case of this patch the code would be uglier.

Also, if you want a "start, length" mask you could always deposit -1
into 0...

Paolo


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-16 17:26   ` Paolo Bonzini
@ 2016-06-16 18:09     ` Eduardo Habkost
  0 siblings, 0 replies; 73+ messages in thread
From: Eduardo Habkost @ 2016-06-16 18:09 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Dr. David Alan Gilbert (git), qemu-devel, aarcange

On Thu, Jun 16, 2016 at 07:26:01PM +0200, Paolo Bonzini wrote:
> 
> 
> On 16/06/2016 19:12, Dr. David Alan Gilbert (git) wrote:
> >          if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
> > -            /* 64 bit processor */
> > -/* XXX: The physical address space is limited to 42 bits in exec.c. */
> > -            *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
> > +            /* 64 bit processor, 48 bits virtual, configurable
> > +             * physical bits.
> > +             */
> > +            *eax = 0x00003000 + cpu->phys_bits;
> 
> This probably ought to be range checked in realize.  I think you can
> also implement the magic 0 value there and have a bit less indentation.

Yes, please. We should move everything that requires looking at
the host inside realize. Close to the "if (cpu->host_features)"
check.

> 
> In addition, "-cpu host" should have phys-bits=0.

Yes.

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 2/5] x86: Mask mtrr mask based on CPU physical address limits
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 2/5] x86: Mask mtrr mask based on CPU physical address limits Dr. David Alan Gilbert (git)
@ 2016-06-16 19:59   ` Eduardo Habkost
  2016-06-17  8:23     ` Dr. David Alan Gilbert
  2016-06-17 12:13     ` Paolo Bonzini
  0 siblings, 2 replies; 73+ messages in thread
From: Eduardo Habkost @ 2016-06-16 19:59 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, pbonzini, aarcange

On Thu, Jun 16, 2016 at 06:12:10PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The CPU GPs if we try and set a bit in a variable MTRR mask above
> the limit of physical address bits on the host.  We hit this
> when loading a migration from a host with a larger physical
> address limit than our destination (e.g. a Xeon->i7 of same
> generation) but previously used to get away with it
> until 48e1a45 started checking that msr writes actually worked.
> 
> It seems in our case the GP probably comes from KVM emulating
> that GP.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  target-i386/cpu.c | 13 +++++++++++++
>  target-i386/cpu.h |  1 +
>  target-i386/kvm.c | 13 +++++++++++--
>  3 files changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 3665fec..f9302f9 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -2668,6 +2668,19 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>      }
>  }
>  
> +/* Returns the number of physical bits supported by the guest CPU */
> +unsigned int cpu_x86_guest_phys_address_bits(CPUX86State *env)
> +{
> +    /* the cpuid emulation code already calculates a value to give to the
> +     * guest and this should match.
> +     */
> +    uint32_t eax, unused;
> +    cpu_x86_cpuid(env, 0x80000008, 0, &eax, &unused, &unused, &unused);
> +
> +    /* Bottom 8 bits of leaf 0x80000008 see Table 3-17 in CPUID definition */
> +    return eax & 0xff;
> +}

If you are adding X86CPU::phys_bits in patch 4/5, can't
kvm_put_msrs() just query it directly?

(It would require changing patch 5/5 to set phys_bits on
realizefn, like Paolo suggested.)

[...]
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index abf50e6..d95d06b 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -1674,6 +1674,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>              }
>          }
>          if (has_msr_mtrr) {
> +            uint64_t phys_mask = BIT_RANGE(
> +                                   cpu_x86_guest_phys_address_bits(env) - 1,
> +                                   0);
> +
>              kvm_msr_entry_add(cpu, MSR_MTRRdefType, env->mtrr_deftype);
>              kvm_msr_entry_add(cpu, MSR_MTRRfix64K_00000, env->mtrr_fixed[0]);
>              kvm_msr_entry_add(cpu, MSR_MTRRfix16K_80000, env->mtrr_fixed[1]);
> @@ -1687,10 +1691,15 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>              kvm_msr_entry_add(cpu, MSR_MTRRfix4K_F0000, env->mtrr_fixed[9]);
>              kvm_msr_entry_add(cpu, MSR_MTRRfix4K_F8000, env->mtrr_fixed[10]);
>              for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
> +                /* The CPU GPs if we write to a bit above the physical limit of
> +                 * the host CPU (and KVM emulates that)
> +                 */
> +                uint64_t mask = env->mtrr_var[i].mask;
> +                mask &= phys_mask;
> +

We are silently changing the MSR value seen by the guest, should
we print a warning in case mask != env->mtrr_var[i].mask?

>                  kvm_msr_entry_add(cpu, MSR_MTRRphysBase(i),
>                                    env->mtrr_var[i].base);
> -                kvm_msr_entry_add(cpu, MSR_MTRRphysMask(i),
> -                                  env->mtrr_var[i].mask);
> +                kvm_msr_entry_add(cpu, MSR_MTRRphysMask(i), mask);
>              }
>          }
>  
> -- 
> 2.7.4
> 

-- 
Eduardo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask Dr. David Alan Gilbert (git)
@ 2016-06-16 20:14   ` Eduardo Habkost
  2016-06-17  7:47     ` Paolo Bonzini
  2016-06-17  8:53     ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 73+ messages in thread
From: Eduardo Habkost @ 2016-06-16 20:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, pbonzini, aarcange

On Thu, Jun 16, 2016 at 06:12:11PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Fill the bits between 51..number-of-physical-address-bits in the
> MTRR_PHYSMASKn variable range mtrr masks so that they're consistent
> in the migration stream irrespective of the physical address space
> of the source VM in a migration.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  include/hw/i386/pc.h |  5 +++++
>  target-i386/cpu.c    |  1 +
>  target-i386/cpu.h    |  3 +++
>  target-i386/kvm.c    | 25 ++++++++++++++++++++++++-
>  4 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 49566c8..3883d04 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -362,6 +362,11 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
>          .driver   = TYPE_X86_CPU,\
>          .property = "cpuid-0xb",\
>          .value    = "off",\
> +    },\
> +    {\
> +        .driver = TYPE_X86_CPU,\
> +        .property = "fill-mtrr-mask",\
> +        .value = "off",\
>      },
>  
>  #define PC_COMPAT_2_5 \
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index f9302f9..4111240 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -3272,6 +3272,7 @@ static Property x86_cpu_properties[] = {
>      DEFINE_PROP_BOOL("check", X86CPU, check_cpuid, true),
>      DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
>      DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),
> +    DEFINE_PROP_BOOL("fill-mtrr-mask", X86CPU, fill_mtrr_mask, true),

This is necessary only when phys_bits is higher on the
destination, right?

Should we really default this to true? I would like to enable
this hack only when really necessary. Except when using host's
phys_bits (phys-bits=0), is there any valid reason to expect
higher phys-bits on the destination?



>      DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
>      DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
>      DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index 61c06b7..13d8501 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -1181,6 +1181,9 @@ struct X86CPU {
>      /* Compatibility bits for old machine types: */
>      bool enable_cpuid_0xb;
>  
> +    /* if true fill the top bits of the MTRR_PHYSMASKn variable range */
> +    bool fill_mtrr_mask;
> +
>      /* in order to simplify APIC support, we leave this pointer to the
>         user */
>      struct DeviceState *apic_state;
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index d95d06b..9b5158e 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -1934,6 +1934,7 @@ static int kvm_get_msrs(X86CPU *cpu)
>      CPUX86State *env = &cpu->env;
>      struct kvm_msr_entry *msrs = cpu->kvm_msr_buf->entries;
>      int ret, i;
> +    uint64_t mtrr_top_bits;
>  
>      kvm_msr_buf_reset(cpu);
>  
> @@ -2083,6 +2084,27 @@ static int kvm_get_msrs(X86CPU *cpu)
>      }
>  
>      assert(ret == cpu->kvm_msr_buf->nmsrs);
> +    /*
> +     * MTRR masks: Each mask consists of 5 parts
> +     * a  10..0: must be zero
> +     * b  11   : valid bit
> +     * c n-1.12: actual mask bits
> +     * d  51..n: reserved must be zero
> +     * e  63.52: reserved must be zero
> +     *
> +     * 'n' is the number of physical bits supported by the CPU and is
> +     * apparently always <= 52.   We know our 'n' but don't know what
> +     * the destinations 'n' is; it might be smaller, in which case
> +     * it masks (c) on loading. It might be larger, in which case
> +     * we fill 'd' so that d..c is consistent irrespetive of the 'n'
> +     * we're migrating to.
> +     */
> +    if (cpu->fill_mtrr_mask) {
> +        mtrr_top_bits = BIT_RANGE(51, cpu_x86_guest_phys_address_bits(env));
> +    } else {
> +        mtrr_top_bits = 0;
> +    }
> +
>      for (i = 0; i < ret; i++) {
>          uint32_t index = msrs[i].index;
>          switch (index) {
> @@ -2278,7 +2300,8 @@ static int kvm_get_msrs(X86CPU *cpu)
>              break;
>          case MSR_MTRRphysBase(0) ... MSR_MTRRphysMask(MSR_MTRRcap_VCNT - 1):
>              if (index & 1) {
> -                env->mtrr_var[MSR_MTRRphysIndex(index)].mask = msrs[i].data;
> +                env->mtrr_var[MSR_MTRRphysIndex(index)].mask = msrs[i].data |
> +                                                               mtrr_top_bits;
>              } else {
>                  env->mtrr_var[MSR_MTRRphysIndex(index)].base = msrs[i].data;
>              }
> -- 
> 2.7.4
> 

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set Dr. David Alan Gilbert (git)
  2016-06-16 17:26   ` Paolo Bonzini
@ 2016-06-16 20:24   ` Eduardo Habkost
  2016-06-17  8:15     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 73+ messages in thread
From: Eduardo Habkost @ 2016-06-16 20:24 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, pbonzini, aarcange

On Thu, Jun 16, 2016 at 06:12:12PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Currently QEMU sets the x86 number of physical address bits to the
> magic number 40.  This is only correct on some small AMD systems;
> Intel systems tend to have 36, 39, 46 bits, and large AMD systems
> tend to have 48.
> 
> Having the value different from your actual hardware is detectable
> by the guest and in principle can cause problems;

What kind of problems?

Is it a problem to have something smaller from the actual
hardware, or just if it's higher?

> The current limit of 40 stops TB VMs being created by those lucky
> enough to have that much.
> 
> This patch lets you set the physical bits by a cpu property but
> defaults to the same existing magic 40.
> 

The existing 40-bit default looks like a problem for 36-bit
systems. Do you know what kind of systems have 36 bits only? Only
old ones, or recent ones too? If only old ones, how old?

Can't we have a new default that is as small as possible for the
VM RAM+devices configuration?

> I've removed the ancient warning about the 42 bit limit in exec.c;
> I can't find that limit in there and no one else seems to know where
> it is.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  target-i386/cpu.c | 8 +++++---
>  target-i386/cpu.h | 3 +++
>  2 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 4111240..c3bbf8e 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -2606,9 +2606,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          /* virtual & phys address size in low 2 bytes. */
>  /* XXX: This value must match the one used in the MMU code. */

Do you know where's the MMU code mentioned here?

>          if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
> -            /* 64 bit processor */
> -/* XXX: The physical address space is limited to 42 bits in exec.c. */
> -            *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
> +            /* 64 bit processor, 48 bits virtual, configurable
> +             * physical bits.
> +             */
> +            *eax = 0x00003000 + cpu->phys_bits;

We should reject the configuration if phys-bits is set to
something larger than the host's phys_bits, shouldn't we?

Maybe we can't do that on old machine-types that already have the
40-bit default, but if we have a new reasonable default based on
VM size, we can be more strict.

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 5/5] x86: Set physical address bits based on host
  2016-06-16 17:12 ` [Qemu-devel] [PATCH 5/5] x86: Set physical address bits based on host Dr. David Alan Gilbert (git)
@ 2016-06-17  7:25   ` Igor Mammedov
  0 siblings, 0 replies; 73+ messages in thread
From: Igor Mammedov @ 2016-06-17  7:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: qemu-devel, pbonzini, aarcange, ehabkost

On Thu, 16 Jun 2016 18:12:13 +0100
"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:

> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> A special case based on the previous phys-bits property; if it's
> the magic value 0 then use the hosts capabilities.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  target-i386/cpu.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index c3bbf8e..e03e48f 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -2609,7 +2609,21 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>              /* 64 bit processor, 48 bits virtual, configurable
>               * physical bits.
>               */
> -            *eax = 0x00003000 + cpu->phys_bits;
> +            if (cpu->phys_bits != 0) {
> +                *eax = 0x00003000 + cpu->phys_bits;
> +            } else {
> +                /* phys_bits set to 0 -> Try and read the host, again
> +                 * fall back to the magic 40 qemu used for a long time
> +                 * Note: This is setting the virtual size as well from
> +                 * the host; TODO: Split that out
> +                 */
> +                uint32_t _eax;
> +                *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
> +                host_cpuid(0x80000000, 0, &_eax, NULL, NULL, NULL);
> +                if (_eax >= 0x80000008) {
> +                    host_cpuid(0x80000008, 0, eax, NULL, NULL, NULL);
> +                }
could host part be done at host_x86_cpu_initfn() time and set
needed value of cpu->phys_bits?

> +            }
>          } else {
>              if (env->features[FEAT_1_EDX] & CPUID_PSE36) {
>                  *eax = 0x00000024; /* 36 bits physical */


* Re: [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-16 20:14   ` Eduardo Habkost
@ 2016-06-17  7:47     ` Paolo Bonzini
  2016-06-17 12:46       ` Eduardo Habkost
  2016-06-17  8:53     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-17  7:47 UTC (permalink / raw)
  To: Eduardo Habkost, Dr. David Alan Gilbert (git); +Cc: aarcange, qemu-devel



On 16/06/2016 22:14, Eduardo Habkost wrote:
> This is necessary only when phys_bits is higher on the
> destination, right?
> 
> Should we really default this to true? I would like to enable
> this hack only when really necessary. Except when using host's
> phys_bits (phys-bits=0), is there any valid reason to expect
> higher phys-bits on the destination?

It would need a property even if you did it only for phys-bits==0, and
then it's simpler to just do it always.

The bits are reserved anyway, so we can do whatever we want with them.
In fact I think it's weird for the architecture to make them
must-be-zero, it might even make more sense to make them must-be-one...
It's a mask after all, and there's no way to access out-of-range
physical addresses.

Paolo


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-16 20:24   ` Eduardo Habkost
@ 2016-06-17  8:15     ` Dr. David Alan Gilbert
  2016-06-17  8:43       ` Paolo Bonzini
                         ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-17  8:15 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: qemu-devel, pbonzini, aarcange

* Eduardo Habkost (ehabkost@redhat.com) wrote:
> On Thu, Jun 16, 2016 at 06:12:12PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Currently QEMU sets the x86 number of physical address bits to the
> > magic number 40.  This is only correct on some small AMD systems;
> > Intel systems tend to have 36, 39, 46 bits, and large AMD systems
> > tend to have 48.
> > 
> > Having the value different from your actual hardware is detectable
> > by the guest and in principle can cause problems;
> 
> What kind of problems?
> 
> Is it a problem to have something smaller from the actual
> hardware, or just if it's higher?

I'm a bit vague on the failure cases; but my understanding of the two
cases are;

Larger is a problem if the guest tries to map something to a high
address that's not addressable.

Smaller is potentially a problem if the guest plays tricks with
what it thinks are spare bits in page tables but which are actually
interpreted.   I believe KVM plays a trick like this.

> > The current limit of 40 stops TB VMs being created by those lucky
> > enough to have that much.
> > 
> > This patch lets you set the physical bits by a cpu property but
> > defaults to the same existing magic 40.
> > 
> 
> The existing 40-bit default looks like a problem for 36-bit
> systems. Do you know what kind of systems have 36 bits only? Only
> old ones, or recent ones too? If only old ones, how old?

My Sandy Bridge (~2.5 year old) laptop is 36 bits; I've seen
some other single-socket 39bit machines (Ivy bridge and I think newer).

> Can't we have a new default that is as small as possible for the
> VM RAM+devices configuration?

That would be good to have, however:
   1) I didn't want to change the default upstream behaviour at first,
      this time through I just wanted a way to set it.
   2) While we have maxmem settings to tell us the top of VM RAM, do
      we have anything that tells us the top of IO space? What happens
      when we hotplug a PCI card?
   3) Is it better to stick to sizes that correspond to real hardware
      if you can?  For example I don't know of any machines with 37 bits
      - in practice I think it's best to stick with sizes that correspond
      to some real hardware.

> > I've removed the ancient warning about the 42 bit limit in exec.c;
> > I can't find that limit in there and no one else seems to know where
> > it is.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  target-i386/cpu.c | 8 +++++---
> >  target-i386/cpu.h | 3 +++
> >  2 files changed, 8 insertions(+), 3 deletions(-)
> > 
> > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > index 4111240..c3bbf8e 100644
> > --- a/target-i386/cpu.c
> > +++ b/target-i386/cpu.c
> > @@ -2606,9 +2606,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> >          /* virtual & phys address size in low 2 bytes. */
> >  /* XXX: This value must match the one used in the MMU code. */
> 
> Do you know where's the MMU code mentioned here?

No!

> >          if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
> > -            /* 64 bit processor */
> > -/* XXX: The physical address space is limited to 42 bits in exec.c. */
> > -            *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
> > +            /* 64 bit processor, 48 bits virtual, configurable
> > +             * physical bits.
> > +             */
> > +            *eax = 0x00003000 + cpu->phys_bits;
> 
> We should reject the configuration if phys-bits is set to
> something larger than the host's phys_bits, shouldn't we?
> 
> Maybe we can't do that on old machine-types that already have the
> 40-bit default, but if we have a new reasonable default based on
> VM size, we can be more strict.

See notes above; but my worry with rejecting stuff is I don't
know how many thousands of machines are already relying on it
in farms with a mix of single and multi socket hosts.

Dave

> 
> -- 
> Eduardo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH 2/5] x86: Mask mtrr mask based on CPU physical address limits
  2016-06-16 19:59   ` Eduardo Habkost
@ 2016-06-17  8:23     ` Dr. David Alan Gilbert
  2016-06-17 12:13     ` Paolo Bonzini
  1 sibling, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-17  8:23 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: qemu-devel, pbonzini, aarcange

* Eduardo Habkost (ehabkost@redhat.com) wrote:
> On Thu, Jun 16, 2016 at 06:12:10PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > The CPU GPs if we try to set a bit in a variable MTRR mask above
> > the limit of physical address bits on the host.  We hit this
> > when loading a migration from a host with a larger physical
> > address limit than our destination (e.g. a Xeon->i7 of the same
> > generation), but previously used to get away with it
> > until 48e1a45 started checking that MSR writes actually worked.
> > 
> > It seems in our case the GP probably comes from KVM emulating
> > that GP.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  target-i386/cpu.c | 13 +++++++++++++
> >  target-i386/cpu.h |  1 +
> >  target-i386/kvm.c | 13 +++++++++++--
> >  3 files changed, 25 insertions(+), 2 deletions(-)
> > 
> > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > index 3665fec..f9302f9 100644
> > --- a/target-i386/cpu.c
> > +++ b/target-i386/cpu.c
> > @@ -2668,6 +2668,19 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> >      }
> >  }
> >  
> > +/* Returns the number of physical bits supported by the guest CPU */
> > +unsigned int cpu_x86_guest_phys_address_bits(CPUX86State *env)
> > +{
> > +    /* the cpuid emulation code already calculates a value to give to the
> > +     * guest and this should match.
> > +     */
> > +    uint32_t eax, unused;
> > +    cpu_x86_cpuid(env, 0x80000008, 0, &eax, &unused, &unused, &unused);
> > +
> > +    /* Bottom 8 bits of leaf 0x80000008 see Table 3-17 in CPUID definition */
> > +    return eax & 0xff;
> > +}
> 
> If you are adding X86CPU::phys_bits in patch 4/5, can't
> kvm_put_msrs() just query it directly?
> 
> (It would require changing patch 5/5 to set phys_bits on
> realizefn, like Paolo suggested.)

Yes, I think that's doable.

> [...]
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index abf50e6..d95d06b 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -1674,6 +1674,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
> >              }
> >          }
> >          if (has_msr_mtrr) {
> > +            uint64_t phys_mask = BIT_RANGE(
> > +                                   cpu_x86_guest_phys_address_bits(env) - 1,
> > +                                   0);
> > +
> >              kvm_msr_entry_add(cpu, MSR_MTRRdefType, env->mtrr_deftype);
> >              kvm_msr_entry_add(cpu, MSR_MTRRfix64K_00000, env->mtrr_fixed[0]);
> >              kvm_msr_entry_add(cpu, MSR_MTRRfix16K_80000, env->mtrr_fixed[1]);
> > @@ -1687,10 +1691,15 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
> >              kvm_msr_entry_add(cpu, MSR_MTRRfix4K_F0000, env->mtrr_fixed[9]);
> >              kvm_msr_entry_add(cpu, MSR_MTRRfix4K_F8000, env->mtrr_fixed[10]);
> >              for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
> > +                /* The CPU GPs if we write to a bit above the physical limit of
> > +                 * the host CPU (and KVM emulates that)
> > +                 */
> > +                uint64_t mask = env->mtrr_var[i].mask;
> > +                mask &= phys_mask;
> > +
> 
> We are silently changing the MSR value seen by the guest, should
> we print a warning in case mask != env->mtrr_var[i].mask?

That would work, except that now we fill in the extra bits we'll always
get that warning.

Dave

> 
> >                  kvm_msr_entry_add(cpu, MSR_MTRRphysBase(i),
> >                                    env->mtrr_var[i].base);
> > -                kvm_msr_entry_add(cpu, MSR_MTRRphysMask(i),
> > -                                  env->mtrr_var[i].mask);
> > +                kvm_msr_entry_add(cpu, MSR_MTRRphysMask(i), mask);
> >              }
> >          }
> >  
> > -- 
> > 2.7.4
> > 
> 
> -- 
> Eduardo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17  8:15     ` Dr. David Alan Gilbert
@ 2016-06-17  8:43       ` Paolo Bonzini
  2016-06-17  9:17         ` Gerd Hoffmann
  2016-06-17  9:37       ` Dr. David Alan Gilbert
  2016-06-17 13:18       ` Eduardo Habkost
  2 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-17  8:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Eduardo Habkost; +Cc: aarcange, qemu-devel



On 17/06/2016 10:15, Dr. David Alan Gilbert wrote:
> Larger is a problem if the guest tries to map something to a high
> address that's not addressable.

Right.  It's not a problem for most emulated PCI devices (it would be a
problem for those that have large RAM BARs, but even our emulated video
cards do not have 64-bit RAM BARs, I think; emulated MMIO BARs are not
an issue because they're marked as not present or reserved in the
processor page tables).  However, it's bad for devices like ivshmem
(that do have large RAM BARs) and assigned devices, because EPT limits
guest physical addresses to MAXPHYADDR bits.  Assigned devices use EPT
page tables to map guest physical BAR addresses to host physical BAR
addresses.

> Smaller is potentially a problem if the guest plays tricks with
> what it thinks are spare bits in page tables but which are actually
> interpreted.   I believe KVM plays a trick like this.

It does, though it uses bit 51 so it's not a problem in practice with
current hardware.  It would be a problem if a processor existed with
52-bit physical address space.

>> The existing 40-bit default looks like a problem for 36-bit
>> systems. Do you know what kind of systems have 36 bits only? Only
>> old ones, or recent ones too? If only old ones, how old?
> 
> My Sandy Bridge (~2.5 year old) laptop is 36 bits; I've seen
> some other single-socket 39bit machines (Ivy bridge and I think newer).

I have an Ivy Bridge Core i7 laptop and it still has 36 bits.

>> Can't we have a new default that is as small as possible for the
>> VM RAM+devices configuration?
> 
> That would be good to have, however:
>    1) I didn't want to change the default upstream behaviour at first,
>       this time through I just wanted a way to set it.
>    2) While we have maxmem settings to tell us the top of VM RAM, do
>       we have anything that tells us the top of IO space? What happens
>       when we hotplug a PCI card?
>    3) Is it better to stick to sizes that correspond to real hardware
>       if you can?  For example I don't know of any machines with 37 bits
>       - in practice I think it's best to stick with sizes that correspond
>       to some real hardware.

I'm not worried about (3), guests really use this bit to do little more
than Linux's

	iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;

(arch/x86/kernel/setup.c) but I agree that (2) is a blocker.

You don't know how the guest will assign PCI BAR addresses, and as you
said there's hotplug too.

Paolo


* Re: [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-16 20:14   ` Eduardo Habkost
  2016-06-17  7:47     ` Paolo Bonzini
@ 2016-06-17  8:53     ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-17  8:53 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: qemu-devel, pbonzini, aarcange

* Eduardo Habkost (ehabkost@redhat.com) wrote:
> On Thu, Jun 16, 2016 at 06:12:11PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Fill the bits between 51..number-of-physical-address-bits in the
> > MTRR_PHYSMASKn variable range mtrr masks so that they're consistent
> > in the migration stream irrespective of the physical address space
> > of the source VM in a migration.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > ---
> >  include/hw/i386/pc.h |  5 +++++
> >  target-i386/cpu.c    |  1 +
> >  target-i386/cpu.h    |  3 +++
> >  target-i386/kvm.c    | 25 ++++++++++++++++++++++++-
> >  4 files changed, 33 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> > index 49566c8..3883d04 100644
> > --- a/include/hw/i386/pc.h
> > +++ b/include/hw/i386/pc.h
> > @@ -362,6 +362,11 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
> >          .driver   = TYPE_X86_CPU,\
> >          .property = "cpuid-0xb",\
> >          .value    = "off",\
> > +    },\
> > +    {\
> > +        .driver = TYPE_X86_CPU,\
> > +        .property = "fill-mtrr-mask",\
> > +        .value = "off",\
> >      },
> >  
> >  #define PC_COMPAT_2_5 \
> > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > index f9302f9..4111240 100644
> > --- a/target-i386/cpu.c
> > +++ b/target-i386/cpu.c
> > @@ -3272,6 +3272,7 @@ static Property x86_cpu_properties[] = {
> >      DEFINE_PROP_BOOL("check", X86CPU, check_cpuid, true),
> >      DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
> >      DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),
> > +    DEFINE_PROP_BOOL("fill-mtrr-mask", X86CPU, fill_mtrr_mask, true),
> 
> This is necessary only when phys_bits is higher on the
> destination, right?

Yes because otherwise the destination will mask it out.

> Should we really default this to true? I would like to enable
> this hack only when really necessary. Except when using host's
> phys_bits (phys-bits=0), is there any valid reason to expect
> higher phys-bits on the destination?

Who is to say? With the current default of 40 we've not
been setting these higher bits when going from 39-46 bit
systems for years; I'm not sure what the consequence has been.

Dave

> 
> 
> 
> >      DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
> >      DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
> >      DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
> > diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> > index 61c06b7..13d8501 100644
> > --- a/target-i386/cpu.h
> > +++ b/target-i386/cpu.h
> > @@ -1181,6 +1181,9 @@ struct X86CPU {
> >      /* Compatibility bits for old machine types: */
> >      bool enable_cpuid_0xb;
> >  
> > +    /* if true fill the top bits of the MTRR_PHYSMASKn variable range */
> > +    bool fill_mtrr_mask;
> > +
> >      /* in order to simplify APIC support, we leave this pointer to the
> >         user */
> >      struct DeviceState *apic_state;
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index d95d06b..9b5158e 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -1934,6 +1934,7 @@ static int kvm_get_msrs(X86CPU *cpu)
> >      CPUX86State *env = &cpu->env;
> >      struct kvm_msr_entry *msrs = cpu->kvm_msr_buf->entries;
> >      int ret, i;
> > +    uint64_t mtrr_top_bits;
> >  
> >      kvm_msr_buf_reset(cpu);
> >  
> > @@ -2083,6 +2084,27 @@ static int kvm_get_msrs(X86CPU *cpu)
> >      }
> >  
> >      assert(ret == cpu->kvm_msr_buf->nmsrs);
> > +    /*
> > +     * MTRR masks: Each mask consists of 5 parts
> > +     * a   10..0: must be zero
> > +     * b      11: valid bit
> > +     * c n-1..12: actual mask bits
> > +     * d   51..n: reserved, must be zero
> > +     * e  63..52: reserved, must be zero
> > +     *
> > +     * 'n' is the number of physical bits supported by the CPU and is
> > +     * apparently always <= 52.  We know our 'n' but don't know what
> > +     * the destination's 'n' is; it might be smaller, in which case
> > +     * it masks (c) on loading. It might be larger, in which case
> > +     * we fill 'd' so that d..c is consistent irrespective of the 'n'
> > +     * we're migrating to.
> > +     */
> > +    if (cpu->fill_mtrr_mask) {
> > +        mtrr_top_bits = BIT_RANGE(51, cpu_x86_guest_phys_address_bits(env));
> > +    } else {
> > +        mtrr_top_bits = 0;
> > +    }
> > +
> >      for (i = 0; i < ret; i++) {
> >          uint32_t index = msrs[i].index;
> >          switch (index) {
> > @@ -2278,7 +2300,8 @@ static int kvm_get_msrs(X86CPU *cpu)
> >              break;
> >          case MSR_MTRRphysBase(0) ... MSR_MTRRphysMask(MSR_MTRRcap_VCNT - 1):
> >              if (index & 1) {
> > -                env->mtrr_var[MSR_MTRRphysIndex(index)].mask = msrs[i].data;
> > +                env->mtrr_var[MSR_MTRRphysIndex(index)].mask = msrs[i].data |
> > +                                                               mtrr_top_bits;
> >              } else {
> >                  env->mtrr_var[MSR_MTRRphysIndex(index)].base = msrs[i].data;
> >              }
> > -- 
> > 2.7.4
> > 
> 
> -- 
> Eduardo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17  8:43       ` Paolo Bonzini
@ 2016-06-17  9:17         ` Gerd Hoffmann
  2016-06-17  9:52           ` Igor Mammedov
  0 siblings, 1 reply; 73+ messages in thread
From: Gerd Hoffmann @ 2016-06-17  9:17 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Dr. David Alan Gilbert, Eduardo Habkost, aarcange, qemu-devel

On Fr, 2016-06-17 at 10:43 +0200, Paolo Bonzini wrote:
> 
> On 17/06/2016 10:15, Dr. David Alan Gilbert wrote:
> > Larger is a problem if the guest tries to map something to a high
> > address that's not addressable.
> 
> Right.  It's not a problem for most emulated PCI devices (it would be a
> problem for those that have large RAM BARs, but even our emulated video
> cards do not have 64-bit RAM BARs, I think;

qxl can be configured to have one, try "-device
qxl-vga,vram64_size_mb=1024"

> >    2) While we have maxmem settings to tell us the top of VM RAM, do
> >       we have anything that tells us the top of IO space? What happens
> >       when we hotplug a PCI card?

> (arch/x86/kernel/setup.c) but I agree that (2) is a blocker.

seabios maps stuff right above ram (possibly with a hole due to
alignment requirements).

ovmf maps stuff into a 32G-aligned 32G hole.  Which lands at 32G and
therefore is addressable with 36 bits, unless you have tons of ram (>
30G) assigned to your guest.  A physical host machine where you can plug
in enough ram for such a configuration likely has more than 36 physical
address lines too ...

qemu checks where the firmware mapped 64bit bars, then adds those ranges
to the root bus pci resources in the acpi tables (see /proc/iomem).

> You don't know how the guest will assign PCI BAR addresses, and as you
> said there's hotplug too.

Not sure whether qemu adds some extra space for hotplug to the 64bit
hole, and if so how it calculates the size.  But the guest os should
stick to those ranges when configuring hotplugged devices.

cheers,
  Gerd

* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17  8:15     ` Dr. David Alan Gilbert
  2016-06-17  8:43       ` Paolo Bonzini
@ 2016-06-17  9:37       ` Dr. David Alan Gilbert
  2016-06-17  9:54         ` Paolo Bonzini
  2016-06-17 13:18       ` Eduardo Habkost
  2 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-17  9:37 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: qemu-devel, pbonzini, aarcange

* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * Eduardo Habkost (ehabkost@redhat.com) wrote:
> > On Thu, Jun 16, 2016 at 06:12:12PM +0100, Dr. David Alan Gilbert (git) wrote:

<snip>

> > The existing 40-bit default looks like a problem for 36-bit
> > systems. Do you know what kind of systems have 36 bits only? Only
> > old ones, or recent ones too? If only old ones, how old?
> 
> My Sandy Bridge (~2.5 year old) laptop is 36 bits; I've seen
> some other single-socket 39bit machines (Ivy bridge and I think newer).

and it looks like the single-socket Xeons are as well; a Xeon E3-1225
I just found has 36-bit physical addressing.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17  9:17         ` Gerd Hoffmann
@ 2016-06-17  9:52           ` Igor Mammedov
  2016-06-17 11:20             ` Gerd Hoffmann
  2016-06-17 16:07             ` Laszlo Ersek
  0 siblings, 2 replies; 73+ messages in thread
From: Igor Mammedov @ 2016-06-17  9:52 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Paolo Bonzini, aarcange, qemu-devel, Dr. David Alan Gilbert,
	Eduardo Habkost, Laszlo Ersek

On Fri, 17 Jun 2016 11:17:54 +0200
Gerd Hoffmann <kraxel@redhat.com> wrote:

> On Fr, 2016-06-17 at 10:43 +0200, Paolo Bonzini wrote:
> > 
> > On 17/06/2016 10:15, Dr. David Alan Gilbert wrote:  
> > > Larger is a problem if the guest tries to map something to a high
> > > address that's not addressable.  
> > 
> > Right.  It's not a problem for most emulated PCI devices (it would be a
> > problem for those that have large RAM BARs, but even our emulated video
> > cards do not have 64-bit RAM BARs, I think;  
> 
> qxl can be configured to have one, try "-device
> qxl-vga,vram64_size_mb=1024"
> 
> > >    2) While we have maxmem settings to tell us the top of VM RAM, do
> > >       we have anything that tells us the top of IO space? What happens
> > >       when we hotplug a PCI card?  
> 
> > (arch/x86/kernel/setup.c) but I agree that (2) is a blocker.  
> 
> seabios maps stuff right above ram (possibly with a hole due to
> alignment requirements).
> 
> ovmf maps stuff into a 32G-aligned 32G hole.  Which lands at 32G and
> therefore is addressable with 36 bits, unless you have tons of ram (>
> 30G) assigned to your guest.  A physical host machine where you can plug
> in enough ram for such a configuration likely has more than 36 physical
> address lines too ...
> 
> qemu checks where the firmware mapped 64bit bars, then adds those ranges
> to the root bus pci resources in the acpi tables (see /proc/iomem).
> 
> > You don't know how the guest will assign PCI BAR addresses, and as you
> > said there's hotplug too.  
> 
> Not sure whenever qemu adds some extra space for hotplug to the 64bit
> hole and if so how it calculates the size then.  But the guest os should
> stick to those ranges when configuring hotplugged devices.
currently firmware would assign 64-bit BARs after reserved-memory-end
(not sure about ovmf though), but QEMU on the ACPI side will add 64-bit
_CRS entries only for firmware-mapped devices (i.e. no space is reserved
for hotplug). And if I recall correctly, ovmf won't map a BAR if it
doesn't have a driver for the device, so the ACPI tables won't even have
space for unmapped 64-bit BARs.

There was another attempt to reserve more space in _CRS
  https://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg00090.html


> 
> cheers,
>   Gerd
> 
> 

* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17  9:37       ` Dr. David Alan Gilbert
@ 2016-06-17  9:54         ` Paolo Bonzini
  0 siblings, 0 replies; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-17  9:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Eduardo Habkost; +Cc: aarcange, qemu-devel



On 17/06/2016 11:37, Dr. David Alan Gilbert wrote:
>>> The existing 40-bit default looks like a problem for 36-bit
>>> > > systems. Do you know what kind of systems have 36 bits only? Only
>>> > > old ones, or recent ones too? If only old ones, how old?
>> > 
>> > My Sandy Bridge (~2.5 year old) laptop is 36 bits; I've seen
>> > some other single-socket 39bit machines (Ivy bridge and I think newer).
> and it looks like the single socket Xeons are as well; Xeon E3-1225
> I just found has 36 bit physical.

The Xeon E3 is really little more than a Core i7 (for example it doesn't
have APICv).  Laurent checked a single-socket Sandy Bridge Xeon E5-1620
and it has 46.

Paolo

* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17  9:52           ` Igor Mammedov
@ 2016-06-17 11:20             ` Gerd Hoffmann
  2016-06-17 16:20               ` Laszlo Ersek
  2016-06-17 16:07             ` Laszlo Ersek
  1 sibling, 1 reply; 73+ messages in thread
From: Gerd Hoffmann @ 2016-06-17 11:20 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Paolo Bonzini, aarcange, qemu-devel, Dr. David Alan Gilbert,
	Eduardo Habkost, Laszlo Ersek

  Hi,

> > Not sure whenever qemu adds some extra space for hotplug to the 64bit
> > hole and if so how it calculates the size then.  But the guest os should
> > stick to those ranges when configuring hotplugged devices.

> currently firmware would assign 64-bit BARs after reserved-memory-end
> (not sure about ovmf though)

Ah, right, reserved-memory-end is checked too if present.  Both seabios
and ovmf should do that.

> but QEMU on ACPI side will add 64-bit _CRS only
> for firmware mapped devices (i.e. no space reserved for hotplug).

Yes.  Tested meanwhile, looks like this (seabios):

100000000-17fffffff : System RAM
180000000-1c1ffffff : PCI Bus 0000:00
  180000000-1bfffffff : 0000:00:0f.0
  1c0000000-1c07fffff : PCI Bus 0000:04
    1c0000000-1c07fffff : 0000:04:00.0
      1c0000000-1c07fffff : virtio-pci
  1c0800000-1c0ffffff : PCI Bus 0000:03
    1c0800000-1c0ffffff : 0000:03:00.0
      1c0800000-1c0ffffff : virtio-pci
  1c1000000-1c17fffff : PCI Bus 0000:02
    1c1000000-1c17fffff : 0000:02:00.0
      1c1000000-1c17fffff : virtio-pci
  1c1800000-1c19fffff : PCI Bus 0000:08
  1c1a00000-1c1bfffff : PCI Bus 0000:07
  1c1c00000-1c1dfffff : PCI Bus 0000:06
  1c1e00000-1c1ffffff : PCI Bus 0000:05

seabios assigns a 2M memory window to pci bridges (which support
hotplug) even in case no device is connected, so there is some space for
hotplug because of that.

/me should try the same with ovmf ...

cheers,
  Gerd

* Re: [Qemu-devel] [PATCH 2/5] x86: Mask mtrr mask based on CPU physical address limits
  2016-06-16 19:59   ` Eduardo Habkost
  2016-06-17  8:23     ` Dr. David Alan Gilbert
@ 2016-06-17 12:13     ` Paolo Bonzini
  1 sibling, 0 replies; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-17 12:13 UTC (permalink / raw)
  To: Eduardo Habkost, Dr. David Alan Gilbert (git); +Cc: qemu-devel, aarcange



On 16/06/2016 21:59, Eduardo Habkost wrote:
> > +                /* The CPU GPs if we write to a bit above the physical limit of
> > +                 * the host CPU (and KVM emulates that)
> > +                 */
> > +                uint64_t mask = env->mtrr_var[i].mask;
> > +                mask &= phys_mask;
> > +
> 
> We are silently changing the MSR value seen by the guest, should
> we print a warning in case mask != env->mtrr_var[i].mask?

You're right.  post_load is probably a better place to remove these bits
(and to fail if they are not all ones).

Paolo

* Re: [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-17  7:47     ` Paolo Bonzini
@ 2016-06-17 12:46       ` Eduardo Habkost
  2016-06-17 13:01         ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Eduardo Habkost @ 2016-06-17 12:46 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Dr. David Alan Gilbert (git), aarcange, qemu-devel

On Fri, Jun 17, 2016 at 09:47:27AM +0200, Paolo Bonzini wrote:
> On 16/06/2016 22:14, Eduardo Habkost wrote:
> > This is necessary only when phys_bits is higher on the
> > destination, right?
> > 
> > Should we really default this to true? I would like to enable
> > this hack only when really necessary. Except when using host's
> > phys_bits (phys-bits=0), is there any valid reason to expect
> > higher phys-bits on the destination?
> 
> It would need a property even if you did it only for phys-bits==0, and
> then it's simpler to just do it always.
> 
> The bits are reserved anyway, so we can do whatever we want with them.
> In fact I think it's weird for the architecture to make them
> must-be-zero, it might even make more sense to make them must-be-one...
> It's a mask after all, and there's no way to access out-of-range
> physical addresses.

If we always fill the bits on the source, the destination can't
differentiate between a 40-bit source that set the MTRR to
0xffffffffff from a 36-bit source that set the MTRR to
0xfffffffff.

I really want to print a warning if the MTRR value or the
phys-bits value is being changed during migration, just in case
this has unintended consequences in the future. We can send the
current phys_bits value in the migration stream, so the
destination can decide how to handle it (and which warnings to
print).

-- 
Eduardo

* Re: [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-17 12:46       ` Eduardo Habkost
@ 2016-06-17 13:01         ` Paolo Bonzini
  2016-06-17 13:41           ` Eduardo Habkost
  2016-06-17 13:51           ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-17 13:01 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: Dr. David Alan Gilbert (git), aarcange, qemu-devel



On 17/06/2016 14:46, Eduardo Habkost wrote:
>> > 
>> > The bits are reserved anyway, so we can do whatever we want with them.
>> > In fact I think it's weird for the architecture to make them
>> > must-be-zero, it might even make more sense to make them must-be-one...
>> > It's a mask after all, and there's no way to access out-of-range
>> > physical addresses.
> 
> If we always fill the bits on the source, the destination can't
> differentiate between a 40-bit source that set the MTRR to
> 0xffffffffff from a 36-bit source that set the MTRR to
> 0xfffffffff.

That's not a bug, it's a feature.  MTRRs work by comparing (address &
mtrr_mask) with mtrr_base.

A 40-bit source that sets the MTRR to 0xffffffffff must set higher bits
of mtrr_base to zero, but when you migrate to another destination, you
want the comparison to hold.  There are two ways for it to hold:

- if the physical address space becomes shorter, the base bits must be
zero or writing mtrr_base fails, and the address bits must be zero or
the processor fails to walk EPT page tables.  So it's fine to strip the
top four bits of the mask.

- if the physical address space becomes larger, the base bits must be
zero but the mask bits must become one.  So why not migrate them this
way directly.

(For what it's worth, KVM also stores the mask extended to all ones, and
strips the unnecessary high bits when reading the MSRs).

> I really want to print a warning if the MTRR value or the
> phys-bits value is being changed during migration, just in case
> this has unintended consequences in the future. We can send the
> current phys_bits value in the migration stream, so the
> destination can decide how to handle it (and which warnings to
> print).

The problem with this is that it must be a warning only, because usually
things will work (evidence: they have worked on RHEL6 and RHEL7 since
2010).  No one will read it until it's too late and they have lost a VM.

Note that the failure mode is pretty brutal since KVM reports an
internal error right after restarting on the destination, and who knows
what used to happen before the assertion was introduced.  All MSRs after
MTRRs would be ignored by KVM!

Migrating and checking the configuration complicates the code and,
because it's only for a warning, doesn't save you from massaging the
values to make things work.

Paolo

* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17  8:15     ` Dr. David Alan Gilbert
  2016-06-17  8:43       ` Paolo Bonzini
  2016-06-17  9:37       ` Dr. David Alan Gilbert
@ 2016-06-17 13:18       ` Eduardo Habkost
  2016-06-17 13:38         ` Paolo Bonzini
  2016-06-17 14:24         ` Marcel Apfelbaum
  2 siblings, 2 replies; 73+ messages in thread
From: Eduardo Habkost @ 2016-06-17 13:18 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, pbonzini, aarcange, Marcel Apfelbaum, Michael S. Tsirkin

On Fri, Jun 17, 2016 at 09:15:06AM +0100, Dr. David Alan Gilbert wrote:
> * Eduardo Habkost (ehabkost@redhat.com) wrote:
> > On Thu, Jun 16, 2016 at 06:12:12PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Currently QEMU sets the x86 number of physical address bits to the
> > > magic number 40.  This is only correct on some small AMD systems;
> > > Intel systems tend to have 36, 39, 46 bits, and large AMD systems
> > > tend to have 48.
> > > 
> > > Having the value different from your actual hardware is detectable
> > > by the guest and in principle can cause problems;
> > 
> > What kind of problems?
> > 
> > Is it a problem to have something smaller from the actual
> > hardware, or just if it's higher?
> 
> I'm a bit vague on the failure cases; but my understanding of the two
> cases are;
> 
> Larger is a problem if the guest tries to map something to a high
> address that's not addressable.
> 
> Smaller is potentially a problem if the guest plays tricks with
> what it thinks are spare bits in page tables but which are actually
> interpreted.   I believe KVM plays a trick like this.

If both smaller and larger are a problem, we have a much bigger
problem than we thought. We need to confirm this.

So, what happens if the guest plays tricks in bits 40-45 when QEMU
sets the limit to 40 but we are running in a 46-bit host? Is it
really a problem? I assumed it would be safe.

> 
> > > The current limit of 40 stops TB VMs being created by those lucky
> > > enough to have that much.
> > > 
> > > This patch lets you set the physical bits by a cpu property but
> > > defaults to the same existing magic 40.
> > > 
> > 
> > The existing 40-bit default looks like a problem for 36-bit
> > systems. Do you know what kind of systems have 36 bits only? Only
> > old ones, or recent ones too? If only old ones, how old?
> 
> My Sandy Bridge (~2.5 year old) laptop is 36 bits; I've seen
> some other single-socket 39bit machines (Ivy bridge and I think newer).
> 
> > Can't we have a new default that is as small as possible for the
> > VM RAM+devices configuration?
> 
> That would be good to have, however:
>    1) I didn't want to change the default upstream behaviour at first,
>       this time through I just wanted a way to set it.

Agreed. We can change the default after we introduce the
mechanisms to handle the existing defaults properly.

>    2) While we have maxmem settings to tell us the top of VM RAM, do
>       we have anything that tells us the top of IO space? What happens
>       when we hotplug a PCI card?

(CCing Marcel and Michael, as we were discussing this recently.)

That's a good question. When calculating how many bits the
machine requires, machine code could choose to reserve a
reasonable amount of space for hotplug by default.

Whatever we choose as the default, in some corner cases (e.g.
almost-32GB VMs running in a 39-bit host) we will still need to
let the user choose between having extra space for hotplug and
being able to safely migrate to 36-bit hosts.

This can be solved by setting x86_64-cpu.phys-bits, but isn't it
too low level? We could have a higher level mechanism in the PC
machine class (e.g. a min-extra-phys-addr-space-for-hotplug
property).


>    3) Is it better to stick to sizes that correspond to real hardware
>       if you can?  For example I don't know of any machines with 37 bits
>       - in practice I think it's best to stick with sizes that correspond
>       to some real hardware.

Yeah, "as small as possible" could be actually "the smallest
possible value from a set of known-to-exist values". e.g. if we
find out that we need 37 bits, it's probably better to simply use
39 bits.

Choosing from a smaller set of values also makes corner cases
(like the example above) less likely to happen.

> 
> > > I've removed the ancient warning about the 42 bit limit in exec.c;
> > > I can't find that limit in there and no one else seems to know where
> > > it is.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  target-i386/cpu.c | 8 +++++---
> > >  target-i386/cpu.h | 3 +++
> > >  2 files changed, 8 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > > index 4111240..c3bbf8e 100644
> > > --- a/target-i386/cpu.c
> > > +++ b/target-i386/cpu.c
> > > @@ -2606,9 +2606,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> > >          /* virtual & phys address size in low 2 bytes. */
> > >  /* XXX: This value must match the one used in the MMU code. */
> > 
> > Do you know where's the MMU code mentioned here?
> 
> No!
> 
> > >          if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
> > > -            /* 64 bit processor */
> > > -/* XXX: The physical address space is limited to 42 bits in exec.c. */
> > > -            *eax = 0x00003028; /* 48 bits virtual, 40 bits physical */
> > > +            /* 64 bit processor, 48 bits virtual, configurable
> > > +             * physical bits.
> > > +             */
> > > +            *eax = 0x00003000 + cpu->phys_bits;
> > 
> > We should reject the configuration if phys-bits is set to
> > something larger than the host's phys_bits, shouldn't we?
> > 
> > Maybe we can't do that on old machine-types that already have the
> > 40-bit default, but if we have a new reasonable default based on
> > VM size, we can be more strict.
> 
> See notes above; but my worry with rejecting stuff is I don't
> know how many thousands of machines are already relying on it
> in farms with a mix of single and multi socket hosts.

That's true. Maybe we should just print a warning instead.

-- 
Eduardo

* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17 13:18       ` Eduardo Habkost
@ 2016-06-17 13:38         ` Paolo Bonzini
  2016-06-17 15:19           ` Eduardo Habkost
  2016-06-19  3:36           ` [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set Michael S. Tsirkin
  2016-06-17 14:24         ` Marcel Apfelbaum
  1 sibling, 2 replies; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-17 13:38 UTC (permalink / raw)
  To: Eduardo Habkost, Dr. David Alan Gilbert
  Cc: qemu-devel, aarcange, Marcel Apfelbaum, Michael S. Tsirkin



On 17/06/2016 15:18, Eduardo Habkost wrote:
> On Fri, Jun 17, 2016 at 09:15:06AM +0100, Dr. David Alan Gilbert wrote:
>> * Eduardo Habkost (ehabkost@redhat.com) wrote:
>>> On Thu, Jun 16, 2016 at 06:12:12PM +0100, Dr. David Alan Gilbert (git) wrote:
>>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>>
>>>> Currently QEMU sets the x86 number of physical address bits to the
>>>> magic number 40.  This is only correct on some small AMD systems;
>>>> Intel systems tend to have 36, 39, 46 bits, and large AMD systems
>>>> tend to have 48.
>>>>
>>>> Having the value different from your actual hardware is detectable
>>>> by the guest and in principal can cause problems;
>>>
>>> What kind of problems?
>>>
>>> Is it a problem to have something smaller from the actual
>>> hardware, or just if it's higher?
>>
>> I'm a bit vague on the failure cases; but my understanding of the two
>> cases are;
>>
>> Larger is a problem if the guest tries to map something to a high
>> address that's not addressable.

        (Note: this is a problem when migrating to hosts with _smaller_
               phys-bits)

>> Smaller is potentially a problem if the guest plays tricks with
>> what it thinks are spare bits in page tables but which are actually
>> interpreted.   I believe KVM plays a trick like this.

        (Note: this is a problem when migrating to hosts with _larger_
               phys-bits)

> If both smaller and larger are a problem, we have a much bigger
> problem than we thought. We need to confirm this.
> 
> So, what happens if the guest play tricks in bits 40-45 when QEMU
> sets the limit to 40 but we are running in a 46-bit host? Is it
> really a problem? I assumed it would be safe.

The guest expects a "reserved bit set" page fault, but doesn't get one.

>>    2) While we have maxmem settings to tell us the top of VM RAM, do
>>       we have anything that tells us the top of IO space? What happens
>>       when we hotplug a PCI card?
> 
> (CCing Marcel and Michael, as we were discussing this recently.)
> 
> That's a good question. When calculating how many bits the
> machine requires, machine code could choose to reserve a
> reasonable amount of space for hotplug by default.
> 
> Whatever we choose as the default, in some corner cases (e.g.
> almost-32GB VMs running in a 39-bit host) we will still need to
> let the user choose between having extra space for hotplug and
> being able to safely migrate to 36-bit hosts.

No, this is not possible unfortunately.  If you set phys-bits <
host-phys-bits, the guest may expect some bits to be reserved, when they
actually aren't.  In practice this doesn't happen for the reason I
mentioned in my other message (tl;dr: 1-the trick is rarely used though
KVM uses it, 2-if they use bit 51 they're safe in practice).  But still
making phys-bits smaller than host-phys-bits is a bad idea.

Making the guest's phys-bits larger than host-phys-bits would be okay if
you reserve the area in the e820 and assume the guest doesn't touch it.
But it is not a great idea either, because e820 describes RAM, so you're
telling the guest "look, there's 64 TB of reserved RAM up there".

>>    3) Is it better to stick to sizes that correspond to real hardware
>>       if you can?  For example I don't know of any machines with 37 bits
>>       - in practice I think it's best to stick with sizes that correspond
>>       to some real hardware.
> 
> Yeah, "as small as possible" could be actually "the smallest
> possible value from a set of known-to-exist values". e.g. if we
> find out that we need 37 bits, it's probably better to simply use
> 39 bits.
> 
> Choosing from a smaller set of values also makes corner cases
> (like the example above) less likely to happen.

Not really, because any value that doesn't match the host is
problematic, albeit in different ways.

Paolo

* Re: [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-17 13:01         ` Paolo Bonzini
@ 2016-06-17 13:41           ` Eduardo Habkost
  2016-06-17 14:25             ` Paolo Bonzini
  2016-06-17 13:51           ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 73+ messages in thread
From: Eduardo Habkost @ 2016-06-17 13:41 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Dr. David Alan Gilbert (git), aarcange, qemu-devel

On Fri, Jun 17, 2016 at 03:01:55PM +0200, Paolo Bonzini wrote:
> 
> 
> On 17/06/2016 14:46, Eduardo Habkost wrote:
> >> > 
> >> > The bits are reserved anyway, so we can do whatever we want with them.
> >> > In fact I think it's weird for the architecture to make them
> >> > must-be-zero, it might even make more sense to make them must-be-one...
> >> > It's a mask after all, and there's no way to access out-of-range
> >> > physical addresses.
> > 
> > If we always fill the bits on the source, the destination can't
> > differentiate between a 40-bit source that set the MTRR to
> > 0xffffffffff from a 36-bit source that set the MTRR to
> > 0xfffffffff.
> 
> That's not a bug, it's a feature.  MTRRs work by comparing (address &
> mtrr_mask) with mtrr_base.
> 
> A 40-bit source that sets the MTRR to 0xffffffffff must set higher bits
> of mtrr_base to zero, but when you migrate to another destination, you
> want the comparison to hold.  There are two ways for it to hold:
> 
> - if the physical address space becomes shorter, the base bits must be
> zero or writing mtrr_base fails, and the address bits must be zero or
> the processor fails to walk EPT page tables.  So it's fine to strip the
> top four bits of the mask.
> 
> - if the physical address space becomes larger, the base bits must be
> zero but the mask bits must become one.  So why not migrate them this
> way directly.
> 
> (For what it's worth, KVM also stores the mask extended to all ones, and
> strips the unnecessary high bits when reading the MSRs).

So the MSR is changing but the semantics are preserved. That's
good, but we may still have other problems if phys_addr changes
on migration, so:

> 
> > I really want to print a warning if the MTRR value or the
> > phys-bits value is being changed during migration, just in case
> > this has unintended consequences in the future. We can send the
> > current phys_bits value in the migration stream, so the
> > destination can decide how to handle it (and which warnings to
> > print).
> 
> The problem with this is that it must be a warning only, because usually
> things will work (evidence: they have worked on RHEL6 and RHEL7 since
> 2010).  No one will read it until it's too late and they have lost a VM.
> 
> Note that the failure mode is pretty brutal since KVM reports an
> internal error right after restarting on the destination, and who knows
> what used to happen before the assertion was introduced.  All MSRs after
> MTRRs would be ignored by KVM!
> 

It's probably too late for the user to act on it, but I want to
see the warning in the logs of a bug report in case something
else breaks. The MTRR changes may be 100% safe (and may not
deserve a warning), but other things can break when phys_bits
change on migration.

> Migrating and checking the configuration complicates the code and,
> because it's only for a warning, doesn't save you from massaging the
> values to make things work.

I would agree with you if we didn't have the existing
host-phys-bits mode already. In this case, phys_bits is
initialized to a different value depending on the host. You can
argue it is not machine state, but it is not user-specified
configuration either.

In theory, we should never initialize anything on the machine
based on the host we are running. In practice we sometimes do
that, and we know it's unsafe. Sending the value on the migration
stream is a solution to detect when this breaks something. We did
that before, for TSC frequency.

I prefer to add a little extra code, than to waste time debugging
when that stuff breaks.

-- 
Eduardo

* Re: [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-17 13:01         ` Paolo Bonzini
  2016-06-17 13:41           ` Eduardo Habkost
@ 2016-06-17 13:51           ` Dr. David Alan Gilbert
  2016-06-17 14:19             ` Paolo Bonzini
  1 sibling, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-17 13:51 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Eduardo Habkost, aarcange, qemu-devel

* Paolo Bonzini (pbonzini@redhat.com) wrote:

> Note that the failure mode is pretty brutal since KVM reports an
> internal error right after restarting on the destination, and who knows
> what used to happen before the assertion was introduced.  All MSRs after
> MTRRs would be ignored by KVM!

I think we got lucky because the MTRRs were close to the last MSRs
in the list, there was only some MCG/MCE stuff after it which probably
wouldn't have been missed;  if the MTRR wasn't set I'm guessing someone
would have just had a behaviour that looked like dreadful performance
after migration perhaps?

Dave

> Migrating and checking the configuration complicates the code and,
> because it's only for a warning, doesn't save you from massaging the
> values to make things work.
> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-17 13:51           ` Dr. David Alan Gilbert
@ 2016-06-17 14:19             ` Paolo Bonzini
  0 siblings, 0 replies; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-17 14:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Eduardo Habkost, aarcange, qemu-devel



On 17/06/2016 15:51, Dr. David Alan Gilbert wrote:
> 
>> > Note that the failure mode is pretty brutal since KVM reports an
>> > internal error right after restarting on the destination, and who knows
>> > what used to happen before the assertion was introduced.  All MSRs after
>> > MTRRs would be ignored by KVM!
> I think we got lucky because the MTRRs were close to the last MSRs
> in the list, there was only some MCG/MCE stuff after it which probably
> wouldn't have been missed;  if the MTRR wasn't set I'm guessing someone
> would have just had a behaviour that looked like dreadful performance
> after migration perhaps?

No, MTRRs are only used by KVM if there's an assigned device, and that's
incompatible with migration.

Paolo

* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17 13:18       ` Eduardo Habkost
  2016-06-17 13:38         ` Paolo Bonzini
@ 2016-06-17 14:24         ` Marcel Apfelbaum
  1 sibling, 0 replies; 73+ messages in thread
From: Marcel Apfelbaum @ 2016-06-17 14:24 UTC (permalink / raw)
  To: Eduardo Habkost, Dr. David Alan Gilbert
  Cc: qemu-devel, pbonzini, aarcange, Michael S. Tsirkin

On 06/17/2016 04:18 PM, Eduardo Habkost wrote:
> On Fri, Jun 17, 2016 at 09:15:06AM +0100, Dr. David Alan Gilbert wrote:
>> * Eduardo Habkost (ehabkost@redhat.com) wrote:
>>> On Thu, Jun 16, 2016 at 06:12:12PM +0100, Dr. David Alan Gilbert (git) wrote:
>>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>>
>>>> Currently QEMU sets the x86 number of physical address bits to the
>>>> magic number 40.  This is only correct on some small AMD systems;
>>>> Intel systems tend to have 36, 39, 46 bits, and large AMD systems
>>>> tend to have 48.
>>>>
>>>> Having the value different from your actual hardware is detectable
>>>> by the guest and in principle can cause problems;
>>>

[...]

>
>>     2) While we have maxmem settings to tell us the top of VM RAM, do
>>        we have anything that tells us the top of IO space? What happens
>>        when we hotplug a PCI card?
>
> (CCing Marcel and Michael, as we were discussing this recently.)
>
> That's a good question. When calculating how many bits the
> machine requires, machine code could choose to reserve a
> reasonable amount of space for hotplug by default.
>

Indeed we have a real problem with PCI hotplug. Numerous configurations (RAM size + devices present at boot)
leave us with little MMIO space for hotplug (the 32-bit PCI hole minus the devices present at boot), or none at all if the 32-bit hole is full.

We need a way to reserve the [max-ram-including-hotplug-reserved-mem, x] range for this.
But what is x? I thought cpu-max-addressable-addr was OK, but I no longer think it's a good idea,
because it would limit migration to hosts with at least the same limit. (Migrations that worked before would stop working.)

> Whatever we choose as the default, in some corner cases (e.g.
> almost-32GB VMs running in a 39-bit host) we will still need to
> let the user choose between having extra space for hotplug and
> being able to safely migrate to 36-bit hosts.
>
> This can be solved by setting x86_64-cpu.phys-bits, but isn't it
> too low level? We could have a higher level mechanism in the PC
> machine class (e.g. a min-extra-phys-addr-space-for-hotplug
> property).
>

I personally prefer min-extra-phys-addr-space-for-hotplug, but why min?
I would go for a phys-addr-space-for-hotplug that starts right after max-ram-including-hotplug-reserved-mem
and ends below cpu-max-addressable-addr.

But how do we know cpu-max-addressable-addr? David's cpuid wrapper looks just fine, but I understood
there are Windows versions that do not probe cpuid...


Thanks,
Marcel


[...]


* Re: [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-17 13:41           ` Eduardo Habkost
@ 2016-06-17 14:25             ` Paolo Bonzini
  2016-06-17 15:27               ` Eduardo Habkost
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-17 14:25 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: Dr. David Alan Gilbert (git), aarcange, qemu-devel



On 17/06/2016 15:41, Eduardo Habkost wrote:
> In theory, we should never initialize anything on the machine
> based on the host we are running. In practice we sometimes do
> that, and we know it's unsafe. Sending the value on the migration
> stream is a solution to detect when this breaks something. We did
> that before, for TSC frequency.
> 
> I prefer to add a little extra code, than to waste time debugging
> when that stuff breaks.

Fair enough (but still let's add and strip the 1s in pre_save and
post_load).  It is a good justification for sending configuration over
the wire, that we initialize things based on the host we're running.

But we shouldn't initialize more things based on the host (e.g. new
booleans should be "enforce"-style).  Also I don't really like
introducing sanity checks for those that aren't based on the host (as in
the LMCE thread).  This code will never trigger in practice, so it's
just extra cost for no benefit.

Paolo


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17 13:38         ` Paolo Bonzini
@ 2016-06-17 15:19           ` Eduardo Habkost
  2016-06-17 15:28             ` Paolo Bonzini
  2016-06-19  3:36           ` [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set Michael S. Tsirkin
  1 sibling, 1 reply; 73+ messages in thread
From: Eduardo Habkost @ 2016-06-17 15:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Dr. David Alan Gilbert, qemu-devel, aarcange, Marcel Apfelbaum,
	Michael S. Tsirkin

On Fri, Jun 17, 2016 at 03:38:53PM +0200, Paolo Bonzini wrote:
> 
> 
> On 17/06/2016 15:18, Eduardo Habkost wrote:
> > On Fri, Jun 17, 2016 at 09:15:06AM +0100, Dr. David Alan Gilbert wrote:
> >> * Eduardo Habkost (ehabkost@redhat.com) wrote:
> >>> On Thu, Jun 16, 2016 at 06:12:12PM +0100, Dr. David Alan Gilbert (git) wrote:
> >>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >>>>
> >>>> Currently QEMU sets the x86 number of physical address bits to the
> >>>> magic number 40.  This is only correct on some small AMD systems;
> >>>> Intel systems tend to have 36, 39, 46 bits, and large AMD systems
> >>>> tend to have 48.
> >>>>
> >>>> Having the value different from your actual hardware is detectable
> >>>> by the guest and in principle can cause problems;
> >>>
> >>> What kind of problems?
> >>>
> >>> Is it a problem to have something smaller from the actual
> >>> hardware, or just if it's higher?
> >>
> >> I'm a bit vague on the failure cases; but my understanding of the two
> >> cases are;
> >>
> >> Larger is a problem if the guest tries to map something to a high
> >> address that's not addressable.
> 
>         (Note: this is a problem when migrating to hosts with _smaller_
>                phys-bits)
> 
> >> Smaller is potentially a problem if the guest plays tricks with
> >> what it thinks are spare bits in page tables but which are actually
> >> interpreted.   I believe KVM plays a trick like this.
> 
>         (Note: this is a problem when migrating to hosts with _larger_
>                phys-bits)
> 
> > If both smaller and larger are a problem, we have a much bigger
> > problem than we thought. We need to confirm this.
> > 
> > So, what happens if the guest play tricks in bits 40-45 when QEMU
> > sets the limit to 40 but we are running in a 46-bit host? Is it
> > really a problem? I assumed it would be safe.
> 
> The guest expects a "reserved bit set" page fault, but doesn't get one.

Wait, are you talking about migration only, or are you really
talking about running current QEMU (hardcoded to 40) on a 46-bit
host? I'm not talking about migration, above.

We really can't emulate a 40-bit machine in a 46-bit host? I
didn't expect that.

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-17 14:25             ` Paolo Bonzini
@ 2016-06-17 15:27               ` Eduardo Habkost
  2016-06-17 15:29                 ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Eduardo Habkost @ 2016-06-17 15:27 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Dr. David Alan Gilbert (git), aarcange, qemu-devel

On Fri, Jun 17, 2016 at 04:25:57PM +0200, Paolo Bonzini wrote:
> 
> 
> On 17/06/2016 15:41, Eduardo Habkost wrote:
> > In theory, we should never initialize anything on the machine
> > based on the host we are running. In practice we sometimes do
> > that, and we know it's unsafe. Sending the value on the migration
> > stream is a solution to detect when this breaks something. We did
> > that before, for TSC frequency.
> > 
> > I prefer to add a little extra code, than to waste time debugging
> > when that stuff breaks.
> 
> Fair enough (but still let's add and strip the 1s in pre_save and
> post_load).  It is a good justification for sending configuration over
> the wire, that we initialize things based on the host we're running.
> 
> But we shouldn't initialize more things based on the host (e.g. new
> booleans should be "enforce"-style).

Strongly agreed.

> Also I don't really like
> introducing sanity checks for those that aren't based on the host (as in
> the LMCE thread).  This code will never trigger in practice, so it's
> just extra cost for no benefit.

Our experience with guest ABI and compatibility code tells me
otherwise. Keeping guest ABI is hard, so I would always support
adding extra code to help us catch mistakes (either in QEMU,
management software, or user mistakes).

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17 15:19           ` Eduardo Habkost
@ 2016-06-17 15:28             ` Paolo Bonzini
  2016-06-17 15:49               ` Eduardo Habkost
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-17 15:28 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Dr. David Alan Gilbert, qemu-devel, aarcange, Marcel Apfelbaum,
	Michael S. Tsirkin



On 17/06/2016 17:19, Eduardo Habkost wrote:
> > > So, what happens if the guest play tricks in bits 40-45 when QEMU
> > > sets the limit to 40 but we are running in a 46-bit host? Is it
> > > really a problem? I assumed it would be safe.
> > 
> > The guest expects a "reserved bit set" page fault, but doesn't get one.
> 
> Wait, are you talking about migration only, or are you really
> talking about running current QEMU (hardcoded to 40) on a 46-bit
> host? I'm not talking about migration, above.

I'm talking about both. :(

> We really can't emulate a 40-bit machine in a 46-bit host? I
> didn't expect that.

Unfortunately that's the case. :(

Paolo


* Re: [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-17 15:27               ` Eduardo Habkost
@ 2016-06-17 15:29                 ` Paolo Bonzini
  2016-06-17 15:35                   ` Eduardo Habkost
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-17 15:29 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: Dr. David Alan Gilbert (git), aarcange, qemu-devel



On 17/06/2016 17:27, Eduardo Habkost wrote:
> On Fri, Jun 17, 2016 at 04:25:57PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 17/06/2016 15:41, Eduardo Habkost wrote:
>>> In theory, we should never initialize anything on the machine
>>> based on the host we are running. In practice we sometimes do
>>> that, and we know it's unsafe. Sending the value on the migration
>>> stream is a solution to detect when this breaks something. We did
>>> that before, for TSC frequency.
>>>
>>> I prefer to add a little extra code, than to waste time debugging
>>> when that stuff breaks.
>>
>> Fair enough (but still let's add and strip the 1s in pre_save and
>> post_load).  It is a good justification for sending configuration over
>> the wire, that we initialize things based on the host we're running.
>>
>> But we shouldn't initialize more things based on the host (e.g. new
>> booleans should be "enforce"-style).
> 
> Strongly agreed.
> 
>> Also I don't really like
>> introducing sanity checks for those that aren't based on the host (as in
>> the LMCE thread).  This code will never trigger in practice, so it's
>> just extra cost for no benefit.
> 
> Our experience with guest ABI and compatibility code tells me
> otherwise. Keeping guest ABI is hard, so I would always support
> adding extra code to help us catch mistakes (either in QEMU,
> management software, or user mistakes).

I agree, but I think the "enforce"-style properties are enough (and
count as extra code :)).

Paolo


* Re: [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask
  2016-06-17 15:29                 ` Paolo Bonzini
@ 2016-06-17 15:35                   ` Eduardo Habkost
  0 siblings, 0 replies; 73+ messages in thread
From: Eduardo Habkost @ 2016-06-17 15:35 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Dr. David Alan Gilbert (git), aarcange, qemu-devel

On Fri, Jun 17, 2016 at 05:29:21PM +0200, Paolo Bonzini wrote:
> 
> 
> On 17/06/2016 17:27, Eduardo Habkost wrote:
> > On Fri, Jun 17, 2016 at 04:25:57PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 17/06/2016 15:41, Eduardo Habkost wrote:
> >>> In theory, we should never initialize anything on the machine
> >>> based on the host we are running. In practice we sometimes do
> >>> that, and we know it's unsafe. Sending the value on the migration
> >>> stream is a solution to detect when this breaks something. We did
> >>> that before, for TSC frequency.
> >>>
> >>> I prefer to add a little extra code, than to waste time debugging
> >>> when that stuff breaks.
> >>
> >> Fair enough (but still let's add and strip the 1s in pre_save and
> >> post_load).  It is a good justification for sending configuration over
> >> the wire, that we initialize things based on the host we're running.
> >>
> >> But we shouldn't initialize more things based on the host (e.g. new
> >> booleans should be "enforce"-style).
> > 
> > Strongly agreed.
> > 
> >> Also I don't really like
> >> introducing sanity checks for those that aren't based on the host (as in
> >> the LMCE thread).  This code will never trigger in practice, so it's
> >> just extra cost for no benefit.
> > 
> > Our experience with guest ABI and compatibility code tells me
> > otherwise. Keeping guest ABI is hard, so I would always support
> > adding extra code to help us catch mistakes (either in QEMU,
> > management software, or user mistakes).
> 
> I agree, but I think the "enforce"-style properties are enough (and
> count as extra code :)).

I agree they are enough when bits have an easy 1-1 correspondence
to command-line options. I was thinking about things like CPUID
bits, that depend on complex CPU model + accelerator +
machine-type interactions.

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17 15:28             ` Paolo Bonzini
@ 2016-06-17 15:49               ` Eduardo Habkost
  2016-06-21 19:44                 ` [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set) Eduardo Habkost
  0 siblings, 1 reply; 73+ messages in thread
From: Eduardo Habkost @ 2016-06-17 15:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Dr. David Alan Gilbert, qemu-devel, aarcange, Marcel Apfelbaum,
	Michael S. Tsirkin

On Fri, Jun 17, 2016 at 05:28:03PM +0200, Paolo Bonzini wrote:
> On 17/06/2016 17:19, Eduardo Habkost wrote:
> > > > So, what happens if the guest play tricks in bits 40-45 when QEMU
> > > > sets the limit to 40 but we are running in a 46-bit host? Is it
> > > > really a problem? I assumed it would be safe.
> > > 
> > > The guest expects a "reserved bit set" page fault, but doesn't get one.
> > 
> > Wait, are you talking about migration only, or are you really
> > talking about running current QEMU (hardcoded to 40) on a 46-bit
> > host? I'm not talking about migration, above.
> 
> I'm talking about both. :(
> 
> > We really can't emulate a 40-bit machine in a 46-bit host? I
> > didn't expect that.
> 
> Unfortunately that's the case. :(

OK, so please ignore all my suggestions about choosing reasonable
defaults based on VM size, etc.

Now all options look bad, and we need to choose the least harmful
as the default... I need to re-read this thread to be able to
give an opinion.

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17  9:52           ` Igor Mammedov
  2016-06-17 11:20             ` Gerd Hoffmann
@ 2016-06-17 16:07             ` Laszlo Ersek
  2016-06-19 16:13               ` Marcel Apfelbaum
  1 sibling, 1 reply; 73+ messages in thread
From: Laszlo Ersek @ 2016-06-17 16:07 UTC (permalink / raw)
  To: Igor Mammedov, Gerd Hoffmann
  Cc: Paolo Bonzini, aarcange, qemu-devel, Dr. David Alan Gilbert,
	Eduardo Habkost, Marcel Apfelbaum

On 06/17/16 11:52, Igor Mammedov wrote:
> On Fri, 17 Jun 2016 11:17:54 +0200
> Gerd Hoffmann <kraxel@redhat.com> wrote:
> 
>> On Fr, 2016-06-17 at 10:43 +0200, Paolo Bonzini wrote:
>>>
>>> On 17/06/2016 10:15, Dr. David Alan Gilbert wrote:  
>>>> Larger is a problem if the guest tries to map something to a high
>>>> address that's not addressable.  
>>>
>>> Right.  It's not a problem for most emulated PCI devices (it would be a
>>> problem for those that have large RAM BARs, but even our emulated video
>>> cards do not have 64-bit RAM BARs, I think;  
>>
>> qxl can be configured to have one, try "-device
>> qxl-vga,vram64_size_mb=1024"
>>
>>>>    2) While we have maxmem settings to tell us the top of VM RAM, do
>>>>       we have anything that tells us the top of IO space? What happens
>>>>       when we hotplug a PCI card?  
>>
>>> (arch/x86/kernel/setup.c) but I agree that (2) is a blocker.  
>>
>> seabios maps stuff right above ram (possibly with a hole due to
>> alignment requirements).
>>
>> ovmf maps stuff into a 32G-aligned 32G hole.  Which lands at 32G and
>> therefore is addressable with 36 bits, unless you have tons of ram (>
>> 30G) assigned to your guest.  A physical host machine where you can plug
>> in enough ram for such a configuration likely has more than 36 physical
>> address lines too ...
>>
>> qemu checks where the firmware mapped 64bit bars, then adds those ranges
>> to the root bus pci resources in the acpi tables (see /proc/iomem).
>>
>>> You don't know how the guest will assign PCI BAR addresses, and as you
>>> said there's hotplug too.  
>>
>> Not sure whenever qemu adds some extra space for hotplug to the 64bit
>> hole and if so how it calculates the size then.  But the guest os should
>> stick to those ranges when configuring hotplugged devices.
> currently firmware would assign 64-bit BARs after reserved-memory-end
> (not sure about ovmf though)

OVMF does the same as well. It makes sure that the 64-bit PCI MMIO
aperture is located above "etc/reserved-memory-end", if the latter exists.

> but QEMU on ACPI side will add 64-bit _CRS only
> for firmware mapped devices (i.e. no space reserved for hotplug).
> And if I recall correctly ovmf won't map BARs if it doesn't have
> a driver for it

Yes, that's correct, generally for all UEFI firmware.

More precisely, BARs will be allocated and programmed, but the MMIO
space decoding bit will not be set (permanently) in the device's command
register, if there is no matching driver in the firmware (or in the
device's own oprom).

> so ACPI tables won't even have a space for not mapped
> 64-bit BARs.

This used to be true, but that's not the case since
<https://github.com/tianocore/edk2/commit/8f35eb92c419>.

Namely, specifically for conforming to QEMU's ACPI generator, OVMF
*temporarily* enables, as a platform quirk, all PCI devices present in
the system, before triggering QEMU to generate the ACPI payload.

Thus, nowadays 64-bit BARs work fine with OVMF, both for virtio-modern
devices, and assigned physical devices. (This is very easy to test,
because, unlike SeaBIOS, the edk2 stuff built into OVMF prefers to
allocate 64-bit BARs outside of the 32-bit address space.)

Devices behind PXBs are a different story, but Marcel's been looking
into that, see <https://bugzilla.redhat.com/show_bug.cgi?id=1323976>.

> There was another attempt to reserve more space in _CRS
>   https://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg00090.html

That's actually Marcel's first own patch set for addressing RHBZ#1323976
that I mentioned above (see it linked in
<https://bugzilla.redhat.com/show_bug.cgi?id=1323976#c2>).

It might have wider effects, but it is entirely motivated, to my
knowledge, by PXB. If you don't have extra root bridges, and/or you plug
all your devices with 64-bit MMIO BARs into the "main" (default) root
bridge, then (I believe) that patch set is not supposed to make any
difference. (I could be wrong, it's been a while since I looked at
Marcel's work!)

Thanks
Laszlo


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17 11:20             ` Gerd Hoffmann
@ 2016-06-17 16:20               ` Laszlo Ersek
  0 siblings, 0 replies; 73+ messages in thread
From: Laszlo Ersek @ 2016-06-17 16:20 UTC (permalink / raw)
  To: Gerd Hoffmann, Igor Mammedov
  Cc: Paolo Bonzini, aarcange, qemu-devel, Dr. David Alan Gilbert,
	Eduardo Habkost, Marcel Apfelbaum

On 06/17/16 13:20, Gerd Hoffmann wrote:
>   Hi,
> 
>>> Not sure whenever qemu adds some extra space for hotplug to the 64bit
>>> hole and if so how it calculates the size then.  But the guest os should
>>> stick to those ranges when configuring hotplugged devices.
> 
>> currently firmware would assign 64-bit BARs after reserved-memory-end
>> (not sure about ovmf though)
> 
> Ah, right, reserved-memory-end is checked too if present.  Both seabios
> and ovmf should do that.

OVMF does that, yes.

> 
>> but QEMU on ACPI side will add 64-bit _CRS only
>> for firmware mapped devices (i.e. no space reserved for hotplug).
> 
> Yes.  Tested meanwhile, looks like this (seabios):
> 
> 100000000-17fffffff : System RAM
> 180000000-1c1ffffff : PCI Bus 0000:00
>   180000000-1bfffffff : 0000:00:0f.0
>   1c0000000-1c07fffff : PCI Bus 0000:04
>     1c0000000-1c07fffff : 0000:04:00.0
>       1c0000000-1c07fffff : virtio-pci
>   1c0800000-1c0ffffff : PCI Bus 0000:03
>     1c0800000-1c0ffffff : 0000:03:00.0
>       1c0800000-1c0ffffff : virtio-pci
>   1c1000000-1c17fffff : PCI Bus 0000:02
>     1c1000000-1c17fffff : 0000:02:00.0
>       1c1000000-1c17fffff : virtio-pci
>   1c1800000-1c19fffff : PCI Bus 0000:08
>   1c1a00000-1c1bfffff : PCI Bus 0000:07
>   1c1c00000-1c1dfffff : PCI Bus 0000:06
>   1c1e00000-1c1ffffff : PCI Bus 0000:05

(This is from /proc/iomem in the guest, right? A note about this later.)

> seabios assigns a 2M memory window to pci bridges (which support
> hotplug) even when no device is connected, so there is some space for
> hotplug because of that.
> 
> /me should try the same with ovmf ...

The generic PCI bus driver in edk2 (PciBusDxe), which is built into
OVMF, does not allocate IO and/or MMIO for a bridge if there are zero
devices behind the bridge that consume that kind of resource. I actually
consider this a feature, not a bug, because it helps avoid the
exhaustion of the limited IO port space.

In any case, /proc/iomem is not a good tool for investigating resources
assigned by the firmware. Linux is stubborn and will *itself* allocate
IO port ranges for bridges, for example, when OVMF -- justifiedly, see
above -- did no such thing. Marcel suggested to prevent this with
various bridge registers, and Paolo suggested to modify Linux's
behavior. Either way, what you see in /proc/iomem is not the
unadulterated effect of the firmware.

In OVMF's case, the resource map for every bridge / device can be seen
in the OVMF debug log, searching for the string "resource map"
(case-insensitively).

Regarding PCIe hotplug, it seemed to work under OVMF, after we fixed
<https://github.com/tianocore/edk2/issues/32> with Marcel's crucial help.

OVMF inherits the PcdPciBusHotplugDeviceSupport=TRUE build-time feature
flag from "MdeModulePkg/MdeModulePkg.dec":

  ## Indicates if PciBus driver supports the hot plug device.<BR><BR>
  #   TRUE  - PciBus driver supports the hot plug device.<BR>
  #   FALSE - PciBus driver doesn't support the hot plug device.<BR>
  # @Prompt Enable PciBus hot plug device support.

gEfiMdeModulePkgTokenSpaceGuid.PcdPciBusHotplugDeviceSupport|TRUE|BOOLEAN|0x0001003d

I've never checked how exactly that knob affects PciBusDxe, but it's
TRUE in OVMF.

Thanks
Laszlo


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17 13:38         ` Paolo Bonzini
  2016-06-17 15:19           ` Eduardo Habkost
@ 2016-06-19  3:36           ` Michael S. Tsirkin
  2016-06-20  7:04             ` Paolo Bonzini
  1 sibling, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2016-06-19  3:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Eduardo Habkost, Dr. David Alan Gilbert, qemu-devel, aarcange,
	Marcel Apfelbaum

On Fri, Jun 17, 2016 at 03:38:53PM +0200, Paolo Bonzini wrote:
> Making the guest's phys-bits larger than host-phys-bits would be okay if
> you reserve the area in the e820 and assume the guest doesn't touch it.

How would it touch it if there's no RAM there?
PCI BARs is the only thing that comes to mind,
but we fix that by making a sane _CRS.

What did I miss?

> But it is not a great idea too, because e820 describes RAM, so you're
> telling the guest "look, there's 64 TB of reserved RAM up there".
> 
> >>    3) Is it better to stick to sizes that correspond to real hardware
> >>       if you can?  For example I don't know of any machines with 37 bits
> >>       - in practice I think it's best to stick with sizes that correspond
> >>       to some real hardware.
> > 
> > Yeah, "as small as possible" could be actually "the smallest
> > possible value from a set of known-to-exist values". e.g. if we
> > find out that we need 37 bits, it's probably better to simply use
> > 39 bits.
> > 
> > Choosing from a smaller set of values also makes corner cases
> > (like the example above) less likely to happen.
> 
> Not really, because any value that doesn't match the host is
> problematic, albeit in different ways.
> 
> Paolo


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-17 16:07             ` Laszlo Ersek
@ 2016-06-19 16:13               ` Marcel Apfelbaum
  2016-06-20 10:42                 ` Igor Mammedov
  0 siblings, 1 reply; 73+ messages in thread
From: Marcel Apfelbaum @ 2016-06-19 16:13 UTC (permalink / raw)
  To: Laszlo Ersek, Igor Mammedov, Gerd Hoffmann
  Cc: Paolo Bonzini, aarcange, qemu-devel, Dr. David Alan Gilbert,
	Eduardo Habkost

On 06/17/2016 07:07 PM, Laszlo Ersek wrote:
> On 06/17/16 11:52, Igor Mammedov wrote:
>> On Fri, 17 Jun 2016 11:17:54 +0200
>> Gerd Hoffmann <kraxel@redhat.com> wrote:
>>
>>> On Fr, 2016-06-17 at 10:43 +0200, Paolo Bonzini wrote:
>>>>
>>>> On 17/06/2016 10:15, Dr. David Alan Gilbert wrote:
>>>>> Larger is a problem if the guest tries to map something to a high
>>>>> address that's not addressable.
>>>>
>>>> Right.  It's not a problem for most emulated PCI devices (it would be a
>>>> problem for those that have large RAM BARs, but even our emulated video
>>>> cards do not have 64-bit RAM BARs, I think;
>>>
>>> qxl can be configured to have one, try "-device
>>> qxl-vga,vram64_size_mb=1024"
>>>
>>>>>     2) While we have maxmem settings to tell us the top of VM RAM, do
>>>>>        we have anything that tells us the top of IO space? What happens
>>>>>        when we hotplug a PCI card?
>>>
>>>> (arch/x86/kernel/setup.c) but I agree that (2) is a blocker.
>>>
>>> seabios maps stuff right above ram (possibly with a hole due to
>>> alignment requirements).
>>>
>>> ovmf maps stuff into a 32G-aligned 32G hole.  Which lands at 32G and
>>> therefore is addressable with 36 bits, unless you have tons of ram (>
>>> 30G) assigned to your guest.  A physical host machine where you can plug
>>> in enough ram for such a configuration likely has more than 36 physical
>>> address lines too ...
>>>
>>> qemu checks where the firmware mapped 64bit bars, then adds those ranges
>>> to the root bus pci resources in the acpi tables (see /proc/iomem).
>>>
>>>> You don't know how the guest will assign PCI BAR addresses, and as you
>>>> said there's hotplug too.
>>>
>>> Not sure whenever qemu adds some extra space for hotplug to the 64bit
>>> hole and if so how it calculates the size then.  But the guest os should
>>> stick to those ranges when configuring hotplugged devices.
>> currently firmware would assign 64-bit BARs after reserved-memory-end
>> (not sure about ovmf though)
>
> OVMF does the same as well. It makes sure that the 64-bit PCI MMIO
> aperture is located above "etc/reserved-memory-end", if the latter exists.
>
>> but QEMU on ACPI side will add 64-bit _CRS only
>> for firmware mapped devices (i.e. no space reserved for hotplug).
>> And if I recall correctly ovmf won't map BARs if it doesn't have
>> a driver for it
>
> Yes, that's correct, generally for all UEFI firmware.
>
> More precisely, BARs will be allocated and programmed, but the MMIO
> space decoding bit will not be set (permanently) in the device's command
> register, if there is no matching driver in the firmware (or in the
> device's own oprom).
>
>> so ACPI tables won't even have a space for not mapped
>> 64-bit BARs.
>
> This used to be true, but that's not the case since
> <https://github.com/tianocore/edk2/commit/8f35eb92c419>.
>
> Namely, specifically for conforming to QEMU's ACPI generator, OVMF
> *temporarily* enables, as a platform quirk, all PCI devices present in
> the system, before triggering QEMU to generate the ACPI payload.
>
> Thus, nowadays 64-bit BARs work fine with OVMF, both for virtio-modern
> devices, and assigned physical devices. (This is very easy to test,
> because, unlike SeaBIOS, the edk2 stuff built into OVMF prefers to
> allocate 64-bit BARs outside of the 32-bit address space.)
>
> Devices behind PXBs are a different story, but Marcel's been looking
> into that, see <https://bugzilla.redhat.com/show_bug.cgi?id=1323976>.
>
>> There was another attempt to reserve more space in _CRS
>>    https://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg00090.html
>
> That's actually Marcel's first own patch set for addressing RHBZ#1323976
> that I mentioned above (see it linked in
> <https://bugzilla.redhat.com/show_bug.cgi?id=1323976#c2>).
>
> It might have wider effects, but it is entirely motivated, to my
> knowledge, by PXB. If you don't have extra root bridges, and/or you plug
> all your devices with 64-bit MMIO BARs into the "main" (default) root
> bridge, then (I believe) that patch set is not supposed to make any
> difference. (I could be wrong, it's been a while since I looked at
> Marcel's work!)
>

Patches 3 and 4 are indeed for PXB only, but the patch 'pci: reserve 64 bit MMIO range for PCI hotplug'
(see https://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg00091.html) tries
to reserve the [above_4g_mem_size, max_addressable_cpu_bits] range for PCI hotplug.

The implementation is not good enough because the number of addressable bits is hard-coded.
However, we now have David's wrapper, which I can use.


Thanks,
Marcel

> Thanks
> Laszlo
>


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-19  3:36           ` [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set Michael S. Tsirkin
@ 2016-06-20  7:04             ` Paolo Bonzini
  0 siblings, 0 replies; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-20  7:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Dr. David Alan Gilbert, qemu-devel, aarcange,
	Marcel Apfelbaum



On 19/06/2016 05:36, Michael S. Tsirkin wrote:
> > Making the guest's phys-bits larger than host-phys-bits would be okay if
> > you reserve the area in the e820 and assume the guest doesn't touch it.
>
> How would it touch it if there's no RAM there?
> PCI BARs is the only thing that comes to mind,
> but we fix that by making a sane _CRS.

Yes, I was thinking of PCI BARs.  We're talking about making guest
phys-bits independent from host-phys-bits, so I assume that you also
want _CRS to be independent of host-phys-bits.  I assume that _CRS would
be related to guest-phys-bits.

Paolo

> What did I miss?


* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-19 16:13               ` Marcel Apfelbaum
@ 2016-06-20 10:42                 ` Igor Mammedov
  2016-06-20 11:13                   ` Marcel Apfelbaum
  0 siblings, 1 reply; 73+ messages in thread
From: Igor Mammedov @ 2016-06-20 10:42 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: Laszlo Ersek, Gerd Hoffmann, Paolo Bonzini, aarcange, qemu-devel,
	Dr. David Alan Gilbert, Eduardo Habkost

On Sun, 19 Jun 2016 19:13:17 +0300
Marcel Apfelbaum <marcel@redhat.com> wrote:

> On 06/17/2016 07:07 PM, Laszlo Ersek wrote:
> > On 06/17/16 11:52, Igor Mammedov wrote:
> >> On Fri, 17 Jun 2016 11:17:54 +0200
> >> Gerd Hoffmann <kraxel@redhat.com> wrote:
> >>
> >>> On Fr, 2016-06-17 at 10:43 +0200, Paolo Bonzini wrote:
> >>>>
> >>>> On 17/06/2016 10:15, Dr. David Alan Gilbert wrote:
> >>>>> Larger is a problem if the guest tries to map something to a
> >>>>> high address that's not addressable.
> >>>>
> >>>> Right.  It's not a problem for most emulated PCI devices (it
> >>>> would be a problem for those that have large RAM BARs, but even
> >>>> our emulated video cards do not have 64-bit RAM BARs, I think;
> >>>
> >>> qxl can be configured to have one, try "-device
> >>> qxl-vga,vram64_size_mb=1024"
> >>>
> >>>>>     2) While we have maxmem settings to tell us the top of VM
> >>>>> RAM, do we have anything that tells us the top of IO space?
> >>>>> What happens when we hotplug a PCI card?
> >>>
> >>>> (arch/x86/kernel/setup.c) but I agree that (2) is a blocker.
> >>>
> >>> seabios maps stuff right above ram (possibly with a hole due to
> >>> alignment requirements).
> >>>
> >>> ovmf maps stuff into a 32G-aligned 32G hole.  Which lands at 32G
> >>> and therefore is addressable with 36 bits, unless you have tons
> >>> of ram (> 30G) assigned to your guest.  A physical host machine
> >>> where you can plug in enough ram for such a configuration likely
> >>> has more than 36 physical address lines too ...
> >>>
> >>> qemu checks where the firmware mapped 64bit bars, then adds those
> >>> ranges to the root bus pci resources in the acpi tables
> >>> (see /proc/iomem).
> >>>
> >>>> You don't know how the guest will assign PCI BAR addresses, and
> >>>> as you said there's hotplug too.
> >>>
> >>> Not sure whenever qemu adds some extra space for hotplug to the
> >>> 64bit hole and if so how it calculates the size then.  But the
> >>> guest os should stick to those ranges when configuring hotplugged
> >>> devices.
> >> currently firmware would assign 64-bit BARs after
> >> reserved-memory-end (not sure about ovmf though)
> >
> > OVMF does the same as well. It makes sure that the 64-bit PCI MMIO
> > aperture is located above "etc/reserved-memory-end", if the latter
> > exists.
> >
> >> but QEMU on ACPI side will add 64-bit _CRS only
> >> for firmware mapped devices (i.e. no space reserved for hotplug).
> >> And is I recall correctly ovmf won't map BARs if it doesn't have
> >> a driver for it
> >
> > Yes, that's correct, generally for all UEFI firmware.
> >
> > More precisely, BARs will be allocated and programmed, but the MMIO
> > space decoding bit will not be set (permanently) in the device's
> > command register, if there is no matching driver in the firmware
> > (or in the device's own oprom).
> >
> >> so ACPI tables won't even have a space for not mapped
> >> 64-bit BARs.
> >
> > This used to be true, but that's not the case since
> > <https://github.com/tianocore/edk2/commit/8f35eb92c419>.
> >
> > Namely, specifically for conforming to QEMU's ACPI generator, OVMF
> > *temporarily* enables, as a platform quirk, all PCI devices present
> > in the system, before triggering QEMU to generate the ACPI payload.
> >
> > Thus, nowadays 64-bit BARs work fine with OVMF, both for
> > virtio-modern devices, and assigned physical devices. (This is very
> > easy to test, because, unlike SeaBIOS, the edk2 stuff built into
> > OVMF prefers to allocate 64-bit BARs outside of the 32-bit address
> > space.)
> >
> > Devices behind PXBs are a different story, but Marcel's been looking
> > into that, see
> > <https://bugzilla.redhat.com/show_bug.cgi?id=1323976>.
> >
> >> There was another attempt to reserve more space in _CRS
> >>    https://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg00090.html
> >
> > That's actually Marcel's first own patch set for addressing
> > RHBZ#1323976 that I mentioned above (see it linked in
> > <https://bugzilla.redhat.com/show_bug.cgi?id=1323976#c2>).
> >
> > It might have wider effects, but it is entirely motivated, to my
> > knowledge, by PXB. If you don't have extra root bridges, and/or you
> > plug all your devices with 64-bit MMIO BARs into the
> > "main" (default) root bridge, then (I believe) that patch set is
> > not supposed to make any difference. (I could be wrong, it's been a
> > while since I looked at Marcel's work!)
> >
> 
> Patch 3 and 4 indeed are for PXB only. but patch 'pci: reserve 64 bit
> MMIO range for PCI hotplug' (see
> https://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg00091.html)
> tries to reserve [above_4g_mem_size, max_addressable_cpu_bits] range
> for PCI hotplug.
it should be [reserved-memory-end, max_addressable_cpu_bits]

> 
> The implementation is not good enough because the number of
> addressable bits is hard-coded. However, we have now David's wrapper
> I can use.
> 
> 
> Thanks,
> Marcel
> 
> 
> 
> 
> 
> 
> 
> > Thanks
> > Laszlo
> >
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set
  2016-06-20 10:42                 ` Igor Mammedov
@ 2016-06-20 11:13                   ` Marcel Apfelbaum
  0 siblings, 0 replies; 73+ messages in thread
From: Marcel Apfelbaum @ 2016-06-20 11:13 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Laszlo Ersek, Gerd Hoffmann, Paolo Bonzini, aarcange, qemu-devel,
	Dr. David Alan Gilbert, Eduardo Habkost

On 06/20/2016 01:42 PM, Igor Mammedov wrote:
> On Sun, 19 Jun 2016 19:13:17 +0300
> Marcel Apfelbaum <marcel@redhat.com> wrote:
>
>> On 06/17/2016 07:07 PM, Laszlo Ersek wrote:
>>> On 06/17/16 11:52, Igor Mammedov wrote:
>>>> On Fri, 17 Jun 2016 11:17:54 +0200
>>>> Gerd Hoffmann <kraxel@redhat.com> wrote:
>>>>
>>>>> On Fr, 2016-06-17 at 10:43 +0200, Paolo Bonzini wrote:
>>>>>>
>>>>>> On 17/06/2016 10:15, Dr. David Alan Gilbert wrote:
>>>>>>> Larger is a problem if the guest tries to map something to a
>>>>>>> high address that's not addressable.
>>>>>>
>>>>>> Right.  It's not a problem for most emulated PCI devices (it
>>>>>> would be a problem for those that have large RAM BARs, but even
>>>>>> our emulated video cards do not have 64-bit RAM BARs, I think;
>>>>>
>>>>> qxl can be configured to have one, try "-device
>>>>> qxl-vga,vram64_size_mb=1024"
>>>>>
>>>>>>>      2) While we have maxmem settings to tell us the top of VM
>>>>>>> RAM, do we have anything that tells us the top of IO space?
>>>>>>> What happens when we hotplug a PCI card?
>>>>>
>>>>>> (arch/x86/kernel/setup.c) but I agree that (2) is a blocker.
>>>>>
>>>>> seabios maps stuff right above ram (possibly with a hole due to
>>>>> alignment requirements).
>>>>>
>>>>> ovmf maps stuff into a 32G-aligned 32G hole.  Which lands at 32G
>>>>> and therefore is addressable with 36 bits, unless you have tons
>>>>> of ram (> 30G) assigned to your guest.  A physical host machine
>>>>> where you can plug in enough ram for such a configuration likely
>>>>> has more than 36 physical address lines too ...
>>>>>
>>>>> qemu checks where the firmware mapped 64bit bars, then adds those
>>>>> ranges to the root bus pci resources in the acpi tables
>>>>> (see /proc/iomem).
>>>>>
>>>>>> You don't know how the guest will assign PCI BAR addresses, and
>>>>>> as you said there's hotplug too.
>>>>>
>>>>> Not sure whenever qemu adds some extra space for hotplug to the
>>>>> 64bit hole and if so how it calculates the size then.  But the
>>>>> guest os should stick to those ranges when configuring hotplugged
>>>>> devices.
>>>> currently firmware would assign 64-bit BARs after
>>>> reserved-memory-end (not sure about ovmf though)
>>>
>>> OVMF does the same as well. It makes sure that the 64-bit PCI MMIO
>>> aperture is located above "etc/reserved-memory-end", if the latter
>>> exists.
>>>
>>>> but QEMU on ACPI side will add 64-bit _CRS only
>>>> for firmware mapped devices (i.e. no space reserved for hotplug).
>>>> And is I recall correctly ovmf won't map BARs if it doesn't have
>>>> a driver for it
>>>
>>> Yes, that's correct, generally for all UEFI firmware.
>>>
>>> More precisely, BARs will be allocated and programmed, but the MMIO
>>> space decoding bit will not be set (permanently) in the device's
>>> command register, if there is no matching driver in the firmware
>>> (or in the device's own oprom).
>>>
>>>> so ACPI tables won't even have a space for not mapped
>>>> 64-bit BARs.
>>>
>>> This used to be true, but that's not the case since
>>> <https://github.com/tianocore/edk2/commit/8f35eb92c419>.
>>>
>>> Namely, specifically for conforming to QEMU's ACPI generator, OVMF
>>> *temporarily* enables, as a platform quirk, all PCI devices present
>>> in the system, before triggering QEMU to generate the ACPI payload.
>>>
>>> Thus, nowadays 64-bit BARs work fine with OVMF, both for
>>> virtio-modern devices, and assigned physical devices. (This is very
>>> easy to test, because, unlike SeaBIOS, the edk2 stuff built into
>>> OVMF prefers to allocate 64-bit BARs outside of the 32-bit address
>>> space.)
>>>
>>> Devices behind PXBs are a different story, but Marcel's been looking
>>> into that, see
>>> <https://bugzilla.redhat.com/show_bug.cgi?id=1323976>.
>>>
>>>> There was another attempt to reserve more space in _CRS
>>>>     https://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg00090.html
>>>
>>> That's actually Marcel's first own patch set for addressing
>>> RHBZ#1323976 that I mentioned above (see it linked in
>>> <https://bugzilla.redhat.com/show_bug.cgi?id=1323976#c2>).
>>>
>>> It might have wider effects, but it is entirely motivated, to my
>>> knowledge, by PXB. If you don't have extra root bridges, and/or you
>>> plug all your devices with 64-bit MMIO BARs into the
>>> "main" (default) root bridge, then (I believe) that patch set is
>>> not supposed to make any difference. (I could be wrong, it's been a
>>> while since I looked at Marcel's work!)
>>>
>>
>> Patch 3 and 4 indeed are for PXB only. but patch 'pci: reserve 64 bit
>> MMIO range for PCI hotplug' (see
>> https://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg00091.html)
>> tries to reserve [above_4g_mem_size, max_addressable_cpu_bits] range
>> for PCI hotplug.
> it should be [reserved-memory-end, max_addressable_cpu_bits]
>

Right, thanks, actually the patch works like you pointed out.

Thanks,
Marcel

>>
>> The implementation is not good enough because the number of
>> addressable bits is hard-coded. However, we have now David's wrapper
>> I can use.
>>
>>
>> Thanks,
>> Marcel
>>
>>
>>
>>
>>
>>
>>
>>> Thanks
>>> Laszlo
>>>
>>
>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro
  2016-06-16 18:01   ` Peter Maydell
  2016-06-16 18:05     ` Paolo Bonzini
@ 2016-06-20 14:11     ` Dr. David Alan Gilbert
  2016-06-20 14:17       ` Peter Maydell
  1 sibling, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-20 14:11 UTC (permalink / raw)
  To: Peter Maydell
  Cc: QEMU Developers, Paolo Bonzini, Andrea Arcangeli, Eduardo Habkost

* Peter Maydell (peter.maydell@linaro.org) wrote:
> On 16 June 2016 at 18:12, Dr. David Alan Gilbert (git)
> <dgilbert@redhat.com> wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > e.g. BIT_RANGE(15, 0) gives 0xffff
> >
> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  include/qemu/bitops.h | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
> > index 755fdd1..e411688 100644
> > --- a/include/qemu/bitops.h
> > +++ b/include/qemu/bitops.h
> > @@ -23,6 +23,9 @@
> >  #define BIT_MASK(nr)            (1UL << ((nr) % BITS_PER_LONG))
> >  #define BIT_WORD(nr)            ((nr) / BITS_PER_LONG)
> >  #define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
> > +/* e.g. BIT_RANGE(15, 0) -> 0xffff */
> > +#define BIT_RANGE(hb, lb)     ((2ull << (hb)) - (1ull << (lb)))
> 
> Isn't this undefined behaviour if the hb is 63?

I've checked the C99 spec; it explicitly defines the unsigned
behaviour ('reduced modulo one more than the maximum value representable in the result type').

> Also there is semantic overlap with the MAKE_64BIT_MASK macro
> proposed in https://lists.gnu.org/archive/html/qemu-devel/2016-05/msg02154.html
> (which also has ub, but see
> https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg02614.html
> for the version which doesn't).
> 
> I prefer a "start, length" macro to "position, position",
> because this matches what we already have for the deposit
> and extract functions in this header.

I think it depends on the use; I agree that makes sense
for things like extracting an n-bit integer. In this case
what we have is something which is fixed at bit 51 and
another bit - we don't ever think about the difference between
those two bits.

Dave

> 
> thanks
> -- PMM
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro
  2016-06-20 14:11     ` Dr. David Alan Gilbert
@ 2016-06-20 14:17       ` Peter Maydell
  0 siblings, 0 replies; 73+ messages in thread
From: Peter Maydell @ 2016-06-20 14:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: QEMU Developers, Paolo Bonzini, Andrea Arcangeli, Eduardo Habkost

On 20 June 2016 at 15:11, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> * Peter Maydell (peter.maydell@linaro.org) wrote:
>> I prefer a "start, length" macro to "position, position",
>> because this matches what we already have for the deposit
>> and extract functions in this header.
>
> I think it depends on the use; I agree that makes sense
> for things like extracting an n-bit integer; in this case
> what we have is something which is fixed at bit 51 and
> another bit - we dont ever think about the difference between
> those two bits.

Well, sure, sometimes device descriptions define fields in
registers as "from bit X to bit Y", but we don't have two
versions of extract32(). We've already settled on using
"start, length" for other operations in this header, so
I think we should continue in that vein, not have some
things using "start, length" and others using "start, end".

thanks
-- PMM

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-17 15:49               ` Eduardo Habkost
@ 2016-06-21 19:44                 ` Eduardo Habkost
  2016-06-22 12:41                   ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Eduardo Habkost @ 2016-06-21 19:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aarcange, Marcel Apfelbaum, Michael S. Tsirkin,
	Dr. David Alan Gilbert, qemu-devel

On Fri, Jun 17, 2016 at 12:49:05PM -0300, Eduardo Habkost wrote:
> On Fri, Jun 17, 2016 at 05:28:03PM +0200, Paolo Bonzini wrote:
> > On 17/06/2016 17:19, Eduardo Habkost wrote:
> > > > > So, what happens if the guest play tricks in bits 40-45 when QEMU
> > > > > sets the limit to 40 but we are running in a 46-bit host? Is it
> > > > > really a problem? I assumed it would be safe.
> > > > 
> > > > The guest expects a "reserved bit set" page fault, but doesn't get one.
> > > 
> > > Wait, are you talking about migration only, or are you really
> > > talking about running current QEMU (hardcoded to 40) on a 46-bit
> > > host? I'm not talking about migration, above.
> > 
> > I'm talking about both. :(
> > 
> > > We really can't emulate a 40-bit machine in a 46-bit host? I
> > > didn't expect that.
> > 
> > Unfortunately that's the case. :(
> 
> OK, so please ignore all my suggestions about choosing reasonable
> defaults based on VM size, etc.
> 
> Now all options look bad, and we need to choose the least harmful
> as the default... I need to re-read this thread to be able to
> give an opinion.
> 

I noticed we didn't continue to discuss what would be a good
default.

We can delegate this decision to libvirt, but we would still have
to evaluate the options and tell the libvirt folks which default
we recommend.

In an offline discussion, Paolo suggested making the host
phys-addr-bits the default, but I am not sure this is really the
best option.

The consequences of migrating (or having migration blocked) to a
host with smaller phys-addr-bits sound worse to me than the
consequences of just having guest's phys-addr-bits smaller than
the host's.

-- 
Eduardo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-21 19:44                 ` [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set) Eduardo Habkost
@ 2016-06-22 12:41                   ` Paolo Bonzini
  2016-06-22 14:24                     ` Andrea Arcangeli
  2016-06-22 22:40                     ` Michael S. Tsirkin
  0 siblings, 2 replies; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-22 12:41 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: aarcange, Marcel Apfelbaum, Michael S. Tsirkin,
	Dr. David Alan Gilbert, qemu-devel



On 21/06/2016 21:44, Eduardo Habkost wrote:
> The consequences of migrating (or having migration blocked) to a
> host with smaller phys-addr-bits sound worse to me than the
> consequences of just having guest's phys-addr-bits smaller than
> the host's.

There is no correct answer.  We've been using host phys-addr-bits in
RHEL for 6 years and no one has ever reported a bug.

Most data centers (the ones that actually use migration) will all have
Xeon E5s and above, and pretty much all of them have 46-bits physical
address bits since at least Sandy Bridge.  That probably helps.
Save/restore is usually done on the same machine, which also helps
because host phys-addr-bits doesn't change.

From a semantics point of view, using a smaller phys-addr-bits than the
host is the worst, because you tell the guest that some bits are
must-be-zero, when they're not.  Using a larger phys-addr-bits cannot
cause malfunctioning, only crashes (and as Gerd said, if you cross your
fingers and hope the guest doesn't put anything so high in memory,
chances are you'll succeed), and this makes it "safer".  I'm not sure
which one is more likely to happen.

So there's no correct answer, and that's why I think the lesser evil is
to go with the time-tested alternative and use host phys-addr-bits as
the default, even if it causes weird behavior on migration.  If a fixed
phys-addr-bits is specified on the destination, it should match the
value that was used on the source though.

Paolo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-22 12:41                   ` Paolo Bonzini
@ 2016-06-22 14:24                     ` Andrea Arcangeli
  2016-06-22 14:33                       ` Paolo Bonzini
  2016-06-22 22:44                       ` Michael S. Tsirkin
  2016-06-22 22:40                     ` Michael S. Tsirkin
  1 sibling, 2 replies; 73+ messages in thread
From: Andrea Arcangeli @ 2016-06-22 14:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Eduardo Habkost, Marcel Apfelbaum, Michael S. Tsirkin,
	Dr. David Alan Gilbert, qemu-devel

Hello,

On Wed, Jun 22, 2016 at 02:41:22PM +0200, Paolo Bonzini wrote:
> From a semantics point of view, using a smaller phys-addr-bits than the
> host is the worst, because you tell the guest that some bits are
> must-be-zero, when they're not.  Using a larger phys-addr-bits cannot

Ok, so EPT/KVM should always use the host phys bits, never the guest
ones, for EPT violations. KVM runs in the host so that's not a
concern. EPT is irrelevant.

The only relevance is in the guest pagetables with EPT. I don't think
any sane OS would break. It would also be inefficient to burn a
cacheline checking the phys bits at runtime before setting up pagetables.

Even if the MTRR doesn't set the "valid" phys bits to 1 (because the
guest bits are reduced), it should still be safe.

> cause malfunctioning, only crashes (and as Gerd said, if you cross your
> fingers and hope the guest doesn't put anything so high in memory,
> chances are you'll succeed), and this makes it "safer".  I'm not sure
> which one is more likely to happen.

But the crash with guest phys bits > host phys bits is material; Linux
will definitely crash in that condition.

Linux could not possibly crash instead if host phys bits > guest phys
bits because it will never depend on GPF triggering if the must be
zero bits of the guest pagetables are set. Linux won't ever try to set
those bits and I'd be shocked if any other OS does.

So while not perfect emulation of the hardware, the risk with known OS
should be zero.

> So there's no correct answer, and that's why I think the lesser evil is
> to go with the time-tested alternative and use host phys-addr-bits as
> the default, even if it causes weird behavior on migration.  If a fixed
> phys-addr-bits is specified on the destination, it should match the
> value that was used on the source though.

I agree with should start with the host phys bits like we use in
production (plus the mtrr fix).

It is a net improvement compared to upstream because it restricts the
risk to live migration only and otherwise always runs perfectly
safely. Upstream is never safe on any host with phys bits != 40,
especially if the host phys bits is < 40.

My solution has the main benefit of avoiding having to compute the
highest guest physical address that could ever be mapped (RAM or PCI
BAR), in order to generate a "soft" guest phys bits.

Later we could still consider to introduce a "soft" guest phys bits
with the only objective of preventing the risk of migration breakages.

qemu shouldn't let the guest migrate if the destination host phys
bits are < the "soft" guest phys bits.

Then a command line quirk -cpu=force_host_phys_bits would set the
"soft" guest phys bits to the host value and prevent live migration to
any destination with host phys bits != "soft" guest phys bits. It
should be used only for a hypothetical OS that depends on
must-be-zero-bit violations in the guest pagetables.

Whether this is a good idea as a second step boils down to how
difficult it is to calculate the highest possible guest physical
address at boot time. If that's impossible with PCI hotplug (memory
hotplug shouldn't be an issue), then the "soft" guest phys bits also
becomes mostly worthless (unless we require -cpu=force_host_phys_bits
for PCI hotplug to work).

Again, starting with host -> guest phys bits sounds fine to me; at
least everything will work perfectly in all cases except live
migration, and you should know what you're doing with live migration
if you have very diverse host phys bits across your cloud nodes and
very large guests, or a guest with a weird OS that depends on
must-be-zero-bit violations in the guest pagetables.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-22 14:24                     ` Andrea Arcangeli
@ 2016-06-22 14:33                       ` Paolo Bonzini
  2016-06-22 14:44                         ` Andrea Arcangeli
  2016-06-22 22:44                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-22 14:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Eduardo Habkost, Marcel Apfelbaum, Michael S. Tsirkin,
	Dr. David Alan Gilbert, qemu-devel



On 22/06/2016 16:24, Andrea Arcangeli wrote:
> Linux could not possibly crash instead if host phys bits > guest phys
> bits because it will never depend on GPF triggering if the must be
> zero bits of the guest pagetables are set. Linux won't ever try to set
> those bits and I'd be shocked if any other OS does.

Well, KVM does.  It sets _all_ bits up to 51, not just one, but still we
have a counterexample.

The reason to do this is that you can distinguish a not-present from a
present-reserved page fault, and handle the present-reserved page fault
from a cache without having to walk the page tables.

Paolo

> So while not perfect emulation of the hardware, the risk with known OS
> should be zero.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-22 14:33                       ` Paolo Bonzini
@ 2016-06-22 14:44                         ` Andrea Arcangeli
  2016-06-22 14:48                           ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Andrea Arcangeli @ 2016-06-22 14:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Eduardo Habkost, Marcel Apfelbaum, Michael S. Tsirkin,
	Dr. David Alan Gilbert, qemu-devel

On Wed, Jun 22, 2016 at 04:33:18PM +0200, Paolo Bonzini wrote:
> 
> 
> On 22/06/2016 16:24, Andrea Arcangeli wrote:
> > Linux could not possibly crash instead if host phys bits > guest phys
> > bits because it will never depend on GPF triggering if the must be
> > zero bits of the guest pagetables are set. Linux won't ever try to set
> > those bits and I'd be shocked if any other OS does.
> 
> Well, KVM does.  It sets _all_ bits up to 51, not just one, but still we
> have a counterexample.

How can that crash? KVM doesn't use the host phys bits (or the
level-1 host phys bits); the bits to set are hardcoded up to 51, on
the assumption that no host could possibly fail at that.

The concern in the KVM case is for nested virt as it is for an host in
this regard.

The scenario you are concerned about only happens if the bit set is not
hardcoded to 51 but is a function of the host phys bits, which is not
what KVM does.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-22 14:44                         ` Andrea Arcangeli
@ 2016-06-22 14:48                           ` Paolo Bonzini
  2016-06-22 15:02                             ` Andrea Arcangeli
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2016-06-22 14:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Eduardo Habkost, Marcel Apfelbaum, Michael S. Tsirkin,
	Dr. David Alan Gilbert, qemu-devel



On 22/06/2016 16:44, Andrea Arcangeli wrote:
> On Wed, Jun 22, 2016 at 04:33:18PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 22/06/2016 16:24, Andrea Arcangeli wrote:
>>> Linux could not possibly crash instead if host phys bits > guest phys
>>> bits because it will never depend on GPF triggering if the must be
>>> zero bits of the guest pagetables are set. Linux won't ever try to set
>>> those bits and I'd be shocked if any other OS does.
>>
>> Well, KVM does.  It sets _all_ bits up to 51, not just one, but still we
>> have a counterexample.
> 
> How can that crash? KVM doesn't use the host phys bits (or level1 host
> phys bits), the bit to set is hardcoded up to 51 and assumed no host
> would possibly fail at that.

KVM encodes other information in the sPTE when it sets the reserved bit
(a generation count).  Instead of using all bits up to 51, KVM could
well use bit MAXPHYADDR+1 as a marker and add bits MAXPHYADDR+2...51 to
the generation count.

You cannot really rely on what the guest does, no matter how crazy it seems.

Paolo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-22 14:48                           ` Paolo Bonzini
@ 2016-06-22 15:02                             ` Andrea Arcangeli
  0 siblings, 0 replies; 73+ messages in thread
From: Andrea Arcangeli @ 2016-06-22 15:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Eduardo Habkost, Marcel Apfelbaum, Michael S. Tsirkin,
	Dr. David Alan Gilbert, qemu-devel

On Wed, Jun 22, 2016 at 04:48:50PM +0200, Paolo Bonzini wrote:
> KVM encodes other information in the sPTE when it sets the reserved bit
> (a generation count).  Instead of using all bits up to 51, KVM could
> well use bit MAXPHYADDR+1 as a marker and add bits MAXPHYADDR+2...51 to
> the generation count.

It could, but it doesn't, does it?

> You cannot really rely on what the guest does, no matter how crazy it seems.

I agree, but then to support such a guest with full safety we should
disable live migration whenever the host phys bits change, which we
clearly aren't doing (at least by default), nor planning to do, as
it's way too restrictive and would be a nuisance for a purely
theoretical issue.

Considering the unfixable issues there would be in live migration if a
guest used the phys bits to calculate the first must-be-zero bit in
the pagetables, I don't think any OS will start doing that, nor will
KVM in nested virt.

host phys bits < "soft" guest phys bits is the more material issue,
one the guest cannot prevent in software; it already broke the MTRR in
fact (which thankfully can be fixed in the emulation layer with
David's patch).

This is why I think a "soft" guest phys bits would provide peace of
mind for 100% reliable live migration in the future.

Starting with the production solution (plus the MTRR fix that, as
expected, only matters for live migration) sounds fine to me; how far
we go with the "soft" guest phys bits then depends on the complexity
of calculating the highest possible physical address with PCI/memory
hotplug.

The only thing I'm not in full agreement with is treating "soft" guest
phys bits < host phys bits as a material concern; to me that sounds
purely theoretical, and in any case it's fixable by changing the guest
source code to provide full reliability when "soft" guest phys bits <
host phys bits.

With "soft" guest phys bits > host phys bits, by contrast, there is
nothing the guest software can do to avoid running into trouble.

If we later introduce a "soft" guest phys bits set to the lowest
possible value that can fit the guest's highest possible physical
address, we should still have a -cpu=force_host_phys_bits flag to
handle the extreme case of a proprietary OS that depends on
must-be-zero-bit pagetable violations, but I don't think anybody would
ever need to use such a flag.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-22 12:41                   ` Paolo Bonzini
  2016-06-22 14:24                     ` Andrea Arcangeli
@ 2016-06-22 22:40                     ` Michael S. Tsirkin
  2016-06-22 23:15                       ` Andrea Arcangeli
  1 sibling, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2016-06-22 22:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Eduardo Habkost, aarcange, Marcel Apfelbaum,
	Dr. David Alan Gilbert, qemu-devel

On Wed, Jun 22, 2016 at 02:41:22PM +0200, Paolo Bonzini wrote:
> 
> 
> On 21/06/2016 21:44, Eduardo Habkost wrote:
> > The consequences of migrating (or having migration blocked) to a
> > host with smaller phys-addr-bits sound worse to me than the
> > consequences of just having guest's phys-addr-bits smaller than
> > the host's.
> 
> There is no correct answer.  We've been using host phys-addr-bits in
> RHEL for 6 years and no one has ever reported a bug.
> 
> Most data centers (the ones that actually use migration) will all have
> Xeon E5s and above, and pretty much all of them have 46-bits physical
> address bits since at least Sandy Bridge.  That probably helps.
> Save/restore is usually done on the same machine, which also helps
> because host phys-addr-bits doesn't change.
> 
> From a semantics point of view, using a smaller phys-addr-bits than the
> host is the worst, because you tell the guest that some bits are
> must-be-zero, when they're not.  Using a larger phys-addr-bits cannot
> cause malfunctioning, only crashes (and as Gerd said, if you cross your
> fingers and hope the guest doesn't put anything so high in memory,
> chances are you'll succeed), and this makes it "safer".  I'm not sure
> which one is more likely to happen.
> 
> So there's no correct answer,


Wait a second:

1. Use CPUID to tell the guest it can address 46 bits
2. Use e820 to tell the guest RAM has addresses below e.g. 39 bits
3. Use _CRS to tell the guest not to put anything above e.g. 39 bits

will result in the guest never using any addresses above 39 bits, but
also always setting bits 39 to 46 to zero.

No crashes, no corruptions.

Where's a problem then?
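The arithmetic behind the three steps above can be sketched as follows (a hypothetical illustration, not code from QEMU or libvirt; the limits are exclusive top addresses):

```python
def scheme_is_safe(cpuid_phys_bits, e820_ram_top, crs_window_top, host_phys_bits):
    """Sketch of the invariant: CPUID may advertise more phys bits than
    the host really has, as long as everything the guest is actually
    handed an address for (RAM via e820, PCI windows via _CRS) stays
    below both the host limit and the advertised CPUID limit."""
    host_limit = 1 << host_phys_bits
    cpuid_limit = 1 << cpuid_phys_bits
    return e820_ram_top <= host_limit and crs_window_top <= min(host_limit, cpuid_limit)

# CPUID says 46 bits, the host has 39; RAM and _CRS kept within 39 bits:
print(scheme_is_safe(46, 1 << 38, 1 << 39, 39))  # True
# A _CRS window reaching past the host limit breaks the invariant:
print(scheme_is_safe(46, 1 << 38, 1 << 40, 39))  # False
```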



> and that's why I think the lesser evil is
> to go with the time-tested alternative and use host phys-addr-bits as
> the default, even if it causes weird behavior on migration.  If a fixed
> phys-addr-bits is specified on the destination, it should match the
> value that was used on the source though.
> 
> Paolo


I agree, but I don't see a need to use host bits at all.
So I think that all we need is a way to let libvirt control
the _CRS range. Teach it that _CRS must fit within what
host can support. Also check and fail kvm init if _CRS exceeds
what host can support.

Hmm?

-- 
MST


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-22 14:24                     ` Andrea Arcangeli
  2016-06-22 14:33                       ` Paolo Bonzini
@ 2016-06-22 22:44                       ` Michael S. Tsirkin
  2016-06-22 23:23                         ` Andrea Arcangeli
  1 sibling, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2016-06-22 22:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Dr. David Alan Gilbert, qemu-devel

On Wed, Jun 22, 2016 at 04:24:14PM +0200, Andrea Arcangeli wrote:
> > cause malfunctioning, only crashes (and as Gerd said, if you cross your
> > fingers and hope the guest doesn't put anything so high in memory,
> > chances are you'll succeed), and this makes it "safer".  I'm not sure
> > which one is more likely to happen.
> 
> But the crash with guest phys bits > host phys bits is material, linux
> will definitely crash in such condition.

Why would it? Most GPA addresses are not guest controllable.
Don't give the guest addresses that the host can't access, and you
will not get a crash.

The only exception I know of is PCI BARs but we can limit
these to a safe addressable range using _CRS method in ACPI.

Could you explain please?

-- 
MST


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-22 22:40                     ` Michael S. Tsirkin
@ 2016-06-22 23:15                       ` Andrea Arcangeli
  0 siblings, 0 replies; 73+ messages in thread
From: Andrea Arcangeli @ 2016-06-22 23:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Dr. David Alan Gilbert, qemu-devel

On Thu, Jun 23, 2016 at 01:40:42AM +0300, Michael S. Tsirkin wrote:
> Where's a problem then?

If EPT/NPT is enabled, the guest pagetables are parsed by the hardware
and not by the KVM shadow MMU in software. The hardware speaks host
phys bits, and AFAIK the hardware will behave differently depending on
the host phys bits.

In fact the guest could probe the host phys bits anyway.

Now the breakage with guest phys bits < host phys bits happens only
with EPT/NPT, and only if the guest, instead of "probing" the host phys
bits, just runs cpuid and assumes the value it receives would match the
effect of a "probe".

The guest could then assume the probing effect would match the guest
phys bits returned by the guest cpuid insn, and do important stuff
based on that (i.e. expecting a GPF which won't materialize if the
host phys bits is > guest phys bits).

The guest must do something somewhat weird for any breakage to happen
(notably changing the pagetable format depending on the cpuid retval).
The guest could of course also be changed to stop being weird, and then
it wouldn't break anymore.
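For context, the value the guest's cpuid reports here comes from leaf 0x80000008: EAX[7:0] is the physical address width and EAX[15:8] the linear (virtual) address width. A minimal decoding sketch (the sample EAX value is made up, typical of a 46-bit Xeon):

```python
def decode_cpuid_80000008_eax(eax):
    """CPUID leaf 0x80000008, EAX register: bits 7:0 give the physical
    address width, bits 15:8 the linear (virtual) address width."""
    phys_bits = eax & 0xFF
    virt_bits = (eax >> 8) & 0xFF
    return phys_bits, virt_bits

print(decode_cpuid_80000008_eax(0x302E))  # (46, 48)
```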

So just in case there's any weird proprietary OS like that, we can
still add a -cpu=force_host_phys_bits fallback to prevent the
discrepancy between cpuid and the probing effect, in turn eliminating
any risk of guest failures (but then we should also prevent live
migration if source host phys bits != destination host phys bits, to
provide the same guarantee to the weird guest through live migration).

> So I think that all we need is a way to let libvirt control
> the _CRS range. Teach it that _CRS must fit within what
> host can support. Also check and fail kvm init if _CRS exceeds
> what host can support.

Right.

The production solution is such a simple patch that I certainly agree
it can be applied first, along with the mtrr fix.

The complexity of dealing with _CRS, and of teaching all the upper
layers about this subtle phys bits detail in order to calculate the
highest possible guest physical address, is what makes the production
solution attractive in the short term.

Then if we implement the "soft" guest phys bits exercise, it is all
about adding robustness to live migration (and save/restore). For it
not to risk being futile, it'd be nice if the phys bits checks were
all contained inside qemu.

Initially libvirt/ovirt/OpenStack would just return a generic live
migration error to the user, in the unlikely case there's a phys bits
mismatch during the live migration or restore (i.e. "soft" guest phys
bits > destination host phys bits). That will still avoid us getting
weird bugreports and, way more important, it'll avoid any risk of
unexpected guest crashes for customers.

The managers that load-balance the load in the cloud can, if they want,
still do their own calculation on the host/guest phys bits, matching
the qemu internal calculation, and guarantee themselves that they'll
never run into the qemu live migration error because of a too low
destination host phys bits (either that, or they can check for a proper
error from the migration command).

Thanks,
Andrea


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-22 22:44                       ` Michael S. Tsirkin
@ 2016-06-22 23:23                         ` Andrea Arcangeli
  2016-06-22 23:45                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 73+ messages in thread
From: Andrea Arcangeli @ 2016-06-22 23:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Dr. David Alan Gilbert, qemu-devel

On Thu, Jun 23, 2016 at 01:44:06AM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 22, 2016 at 04:24:14PM +0200, Andrea Arcangeli wrote:
> > > cause malfunctioning, only crashes (and as Gerd said, if you cross your
> > > fingers and hope the guest doesn't put anything so high in memory,
> > > chances are you'll succeed), and this makes it "safer".  I'm not sure
> > > which one is more likely to happen.
> > 
> > But the crash with guest phys bits > host phys bits is material, linux
> > will definitely crash in such condition.
> 
> Why would it? Most GPA addresses are not guest controllable.
> Don't give guest addresses that host can't access, you will not get
> a crash.
> 
> The only exception I know of is PCI BARs but we can limit
> these to a safe addressable range using _CRS method in ACPI.
> 
> Could you explain please?

Well, the crash with guest phys bits > host phys bits should be easy to
reproduce by booting a 65GB guest on a 64GB RAM + 2GB swap host with
36 host phys bits, using the upstream qemu that forces the guest phys
bits to 40.

Likely the guest won't boot properly regardless of whether the PCI bars
are at the end, but it may have a chance to print something meaningful
on the console while trying, instead of failing in some unexpected way.

Now the production patch fixes it 100% by using the host bits instead
of the value 40. However you'd run into the instability if you migrate
the same guest to the aforementioned host.

No amount of guest changes can fix the above. So then we can avoid any
risk of breakage, also during live migration, by introducing a "soft"
guest phys bits set as low as possible. And live migration restore can
check it against the host phys bits.
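A sketch of what such a "soft" value and the accompanying migration check could look like (hypothetical illustration, not the actual patch):

```python
def soft_phys_bits(highest_guest_phys_addr):
    """Lowest phys-bits value whose address space still covers the
    highest guest physical address (RAM top, hotplug area, PCI windows).
    The argument is the highest addressable byte, inclusive."""
    return highest_guest_phys_addr.bit_length()

def migration_allowed(soft_bits, dest_host_phys_bits):
    # Restore/migration check: the destination host must be able to
    # address everything the guest may already be using.
    return soft_bits <= dest_host_phys_bits

print(soft_phys_bits(64 * 2**30 - 1))  # 36: a 64GB guest fits in 36 bits
print(migration_allowed(36, 46))       # True
print(migration_allowed(40, 36))       # False
```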

The other case (guest phys bits < host phys bits) instead requires the
guest to do something strange; it is definitely never going to be a
problem with linux as the guest, at least.


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-22 23:23                         ` Andrea Arcangeli
@ 2016-06-22 23:45                           ` Michael S. Tsirkin
  2016-06-23  8:40                             ` Gerd Hoffmann
  0 siblings, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2016-06-22 23:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Dr. David Alan Gilbert, qemu-devel

On Thu, Jun 23, 2016 at 01:23:08AM +0200, Andrea Arcangeli wrote:
> On Thu, Jun 23, 2016 at 01:44:06AM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 22, 2016 at 04:24:14PM +0200, Andrea Arcangeli wrote:
> > > > cause malfunctioning, only crashes (and as Gerd said, if you cross your
> > > > fingers and hope the guest doesn't put anything so high in memory,
> > > > chances are you'll succeed), and this makes it "safer".  I'm not sure
> > > > which one is more likely to happen.
> > > 
> > > But the crash with guest phys bits > host phys bits is material, linux
> > > will definitely crash in such condition.
> > 
> > Why would it? Most GPA addresses are not guest controllable.
> > Don't give guest addresses that host can't access, you will not get
> > a crash.
> > 
> > The only exception I know of is PCI BARs but we can limit
> > these to a safe addressable range using _CRS method in ACPI.
> > 
> > Could you explain please?
> 
> Well the crash of guest phys bits > host phys bits, should be easy to
> reproduce by booting a 65GB guest on a 64GB RAM + 2GB swap host with
> 36 host phys bits using the upstream qemu that forces the guest phys
> bits to 40.

So you supply more RAM than host can address, and guest crashes?

Why are we worried about it?

I would say that's a management bug.


> Likely the guest won't boot properly regardless if the PCI bars are at
> the end, but it may have a chance to print something meaningful on the
> console while trying instead of failing in some unexpected way.
> 
> Now the production patch fixes it 100% by using the host bits instead
> of value 40. However you'd run into the instability if you migrate the
> same guest to the aformentioned host.
> 
> No amount of guest changes can fix the above. So then we can avoid any
> risk of breakages also during live migration introducing a "soft"
> guest phys bits set as low as possible. And live migration restore can
> check it against the host phys bits.

I don't think it's worth fixing. Just don't give more RAM than
host can address.

> The other case (guest phys bits < host phys bits) instead requires a
> guest doing something strange, definitely never going to be a problem
> with linux as guest at least.


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-22 23:45                           ` Michael S. Tsirkin
@ 2016-06-23  8:40                             ` Gerd Hoffmann
  2016-06-23 16:38                               ` Michael S. Tsirkin
  2016-06-29 16:42                               ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 73+ messages in thread
From: Gerd Hoffmann @ 2016-06-23  8:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Andrea Arcangeli, Marcel Apfelbaum, Paolo Bonzini, qemu-devel,
	Eduardo Habkost, Dr. David Alan Gilbert

  Hi,

> > Well the crash of guest phys bits > host phys bits, should be easy to
> > reproduce by booting a 65GB guest on a 64GB RAM + 2GB swap host with
> > 36 host phys bits using the upstream qemu that forces the guest phys
> > bits to 40.
> 
> So you supply more RAM than host can address, and guest crashes?

Yep.  The only reason we don't see this happening in practice is that
it's probably next to impossible to find a machine which has (a) only 36
physical address lines and (b) allows plugging in that much RAM.

> Why are we worried about it?

It's more an issue with pci resources.  In theory seabios/edk2 could go
figure out how big the physical address space is, then map 64bit pci
bars as high as possible, thereby making stuff like
etc/reserved-memory-end in fw_cfg unnecessary.

But with qemu saying 40 phys bits are available even if they are not
this approach isn't going to fly ...

cheers,
  Gerd


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-23  8:40                             ` Gerd Hoffmann
@ 2016-06-23 16:38                               ` Michael S. Tsirkin
  2016-06-24  5:55                                 ` Gerd Hoffmann
  2016-06-29 16:42                               ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2016-06-23 16:38 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Andrea Arcangeli, Marcel Apfelbaum, Paolo Bonzini, qemu-devel,
	Eduardo Habkost, Dr. David Alan Gilbert

On Thu, Jun 23, 2016 at 10:40:03AM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > > Well the crash of guest phys bits > host phys bits, should be easy to
> > > reproduce by booting a 65GB guest on a 64GB RAM + 2GB swap host with
> > > 36 host phys bits using the upstream qemu that forces the guest phys
> > > bits to 40.
> > 
> > So you supply more RAM than host can address, and guest crashes?
> 
> Yep.  The only reason we don't see this happening in practice is that
> it's probably next to impossible to find a machine which has (a) only 36
> physical address lines and (b) allows to plug that much RAM.
> 
> > Why are we worried about it?
> 
> It's more a issue with pci ressources.  In theory seabios/edk2 could go
> figure how big the physical address space is, then map 64bit pci bars as
> high as possible, thereby making stuff like etc/reserved-memory-end in
> fw_cfg unnecessary.
> 
> But with qemu saying 40 phys bits are available even if they are not
> this approach isn't going to fly ...
> 
> cheers,
>   Gerd

Nah, x86 guests really need to go by _CRS. The bios doesn't want to
parse that, so it can go by some fw cfg file instead.

Going by phys bits won't work on old qemu, so I don't believe it's
practical.

-- 
MST


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-23 16:38                               ` Michael S. Tsirkin
@ 2016-06-24  5:55                                 ` Gerd Hoffmann
  2016-06-24 23:12                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 73+ messages in thread
From: Gerd Hoffmann @ 2016-06-24  5:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Andrea Arcangeli, Marcel Apfelbaum, Paolo Bonzini, qemu-devel,
	Eduardo Habkost, Dr. David Alan Gilbert

On Do, 2016-06-23 at 19:38 +0300, Michael S. Tsirkin wrote:
> On Thu, Jun 23, 2016 at 10:40:03AM +0200, Gerd Hoffmann wrote:
> >   Hi,
> > 
> > > > Well the crash of guest phys bits > host phys bits, should be easy to
> > > > reproduce by booting a 65GB guest on a 64GB RAM + 2GB swap host with
> > > > 36 host phys bits using the upstream qemu that forces the guest phys
> > > > bits to 40.
> > > 
> > > So you supply more RAM than host can address, and guest crashes?
> > 
> > Yep.  The only reason we don't see this happening in practice is that
> > it's probably next to impossible to find a machine which has (a) only 36
> > physical address lines and (b) allows to plug that much RAM.
> > 
> > > Why are we worried about it?
> > 
> > It's more a issue with pci ressources.  In theory seabios/edk2 could go
> > figure how big the physical address space is, then map 64bit pci bars as
> > high as possible, thereby making stuff like etc/reserved-memory-end in
> > fw_cfg unnecessary.
> > 
> > But with qemu saying 40 phys bits are available even if they are not
> > this approach isn't going to fly ...
> > 
> > cheers,
> >   Gerd
> 
> Nah, x86 guests really need to go by _CRS.

Yep, we can implement the "soft-phys-bits" that way.

> bios doesn't want to parse that
> so it can go by some fw cfg file instead.

firmware can't use it anyway, because the firmware first maps the bars,
then loads the acpi tables (while qemu generates _CRS entries according
to the bios mappings).

> Going by phys bits won't work on old qemu so I don't believe it's
> practical.

Indeed, so I guess we'll have to stick to the current approach of
mapping 64bit bars above ram (or etc/reserved-memory-end if present).

cheers,
  Gerd


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-24  5:55                                 ` Gerd Hoffmann
@ 2016-06-24 23:12                                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 73+ messages in thread
From: Michael S. Tsirkin @ 2016-06-24 23:12 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Andrea Arcangeli, Marcel Apfelbaum, Paolo Bonzini, qemu-devel,
	Eduardo Habkost, Dr. David Alan Gilbert

On Fri, Jun 24, 2016 at 07:55:33AM +0200, Gerd Hoffmann wrote:
> On Do, 2016-06-23 at 19:38 +0300, Michael S. Tsirkin wrote:
> > On Thu, Jun 23, 2016 at 10:40:03AM +0200, Gerd Hoffmann wrote:
> > >   Hi,
> > > 
> > > > > Well the crash of guest phys bits > host phys bits, should be easy to
> > > > > reproduce by booting a 65GB guest on a 64GB RAM + 2GB swap host with
> > > > > 36 host phys bits using the upstream qemu that forces the guest phys
> > > > > bits to 40.
> > > > 
> > > > So you supply more RAM than host can address, and guest crashes?
> > > 
> > > Yep.  The only reason we don't see this happening in practice is that
> > > it's probably next to impossible to find a machine which has (a) only 36
> > > physical address lines and (b) allows to plug that much RAM.
> > > 
> > > > Why are we worried about it?
> > > 
> > > It's more a issue with pci ressources.  In theory seabios/edk2 could go
> > > figure how big the physical address space is, then map 64bit pci bars as
> > > high as possible, thereby making stuff like etc/reserved-memory-end in
> > > fw_cfg unnecessary.
> > > 
> > > But with qemu saying 40 phys bits are available even if they are not
> > > this approach isn't going to fly ...
> > > 
> > > cheers,
> > >   Gerd
> > 
> > Nah, x86 guests really need to go by _CRS.
> 
> Yep, we can implement the "soft-phys-bits" that way.

The advantage being no guest changes are necessary. Guests
already use _CRS.

> > bios doesn't want to parse that
> > so it can go by some fw cfg file instead.
> 
> firmware can't use it anyway because the firmware first maps the bars,
> the loads acpi tables (while qemu generates _CRS entries according to
> the bios mappings).
> 
> > Going by phys bits won't work on old qemu so I don't believe it's
> > practical.
> 
> Indeed, so I guess we'll have to stick to the current approach of
> mapping 64bit bars above ram (or etc/reserved-memory-end if present).
> 
> cheers,
>   Gerd


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-23  8:40                             ` Gerd Hoffmann
  2016-06-23 16:38                               ` Michael S. Tsirkin
@ 2016-06-29 16:42                               ` Dr. David Alan Gilbert
  2016-06-30  6:10                                 ` Gerd Hoffmann
  1 sibling, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-29 16:42 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Michael S. Tsirkin, Andrea Arcangeli, Marcel Apfelbaum,
	Paolo Bonzini, qemu-devel, Eduardo Habkost

* Gerd Hoffmann (kraxel@redhat.com) wrote:
>   Hi,
> 
> > > Well the crash of guest phys bits > host phys bits, should be easy to
> > > reproduce by booting a 65GB guest on a 64GB RAM + 2GB swap host with
> > > 36 host phys bits using the upstream qemu that forces the guest phys
> > > bits to 40.
> > 
> > So you supply more RAM than host can address, and guest crashes?
> 
> Yep.  The only reason we don't see this happening in practice is that
> it's probably next to impossible to find a machine which has (a) only 36
> physical address lines and (b) allows to plug that much RAM.
> 
> > Why are we worried about it?
> 
> It's more a issue with pci ressources.  In theory seabios/edk2 could go
> figure how big the physical address space is, then map 64bit pci bars as
> high as possible, thereby making stuff like etc/reserved-memory-end in
> fw_cfg unnecessary.
> 
> But with qemu saying 40 phys bits are available even if they are not
> this approach isn't going to fly ...

Something somewhere in qemu/ kernel/ firmware is already reading the number
of physical bits to determine PCI mapping; if I do:

./x86_64-softmmu/qemu-system-x86_64 -machine q35,accel=kvm,usb=off -m 4096,slots=16,maxmem=128T   -vga none -device qxl-vga,bus=pcie.0,ram_size_mb=2048,vram64_size_mb=2048 -vnc 0.0.0.0:0 /home/vms/7.2a.qcow2 -chardev stdio,mux=on,id=mon -mon chardev=mon,mode=readline -cpu host,phys-bits=48

it will happily map the qxl VRAM right up high, but if I lower
the phys-bits down to 46 it won't.

    VGA controller: PCI device 1b36:0100
      IRQ 11.
      BAR0: 32 bit memory at 0xc0000000 [0xdfffffff].
      BAR1: 32 bit memory at 0xe0000000 [0xe3ffffff].
      BAR2: 32 bit memory at 0xe4070000 [0xe4071fff].
      BAR3: I/O at 0xc080 [0xc09f].
      BAR4: 64 bit prefetchable memory at 0x800480000000 [0x8004ffffffff].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0000fffe].
      id ""

I'll admit to not understanding why it all still boots and doesn't
fall into a mess with that mapping (46 bit Xeon).

Dave
> 
> cheers,
>   Gerd
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-29 16:42                               ` Dr. David Alan Gilbert
@ 2016-06-30  6:10                                 ` Gerd Hoffmann
  2016-06-30 10:59                                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 73+ messages in thread
From: Gerd Hoffmann @ 2016-06-30  6:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Michael S. Tsirkin, Andrea Arcangeli, Marcel Apfelbaum,
	Paolo Bonzini, qemu-devel, Eduardo Habkost

  Hi,

> Something somewhere in qemu/ kernel/ firmware is already reading the number
> of physical bits to determine PCI mapping; if I do:
> 
> ./x86_64-softmmu/qemu-system-x86_64 -m 4096,slots=16,maxmem=128T

No, it's not the physbits.  You add some memory hotplug slots here.
Qemu will ask seabios to reserve address space for those, which seabios
promptly does and maps 64bit pci bars above the reserved address space.

>    -vga none -device qxl-vga,bus=pcie.0,ram_size_mb=2048,vram64_size_mb=2048 -vnc 0.0.0.0:0 /home/vms/7.2a.qcow2 -chardev stdio,mux=on,id=mon -mon chardev=mon,mode=readline -cpu host,phys-bits=48
> 
> it will happily map the qxl VRAM right up high, but if I lower
> the phys-bits down to 46 it won't.

I suspect the linux kernel remaps the bar because the seabios mapping is
unreachable.  Check dmesg.

cheers,
  Gerd


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-30  6:10                                 ` Gerd Hoffmann
@ 2016-06-30 10:59                                   ` Dr. David Alan Gilbert
  2016-06-30 16:14                                     ` Gerd Hoffmann
  0 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-30 10:59 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Michael S. Tsirkin, Andrea Arcangeli, Marcel Apfelbaum,
	Paolo Bonzini, qemu-devel, Eduardo Habkost

* Gerd Hoffmann (kraxel@redhat.com) wrote:
>   Hi,
> 
> > Something somewhere in qemu/ kernel/ firmware is already reading the number
> > of physical bits to determine PCI mapping; if I do:
> > 
> > ./x86_64-softmmu/qemu-system-x86_64 -m 4096,slots=16,maxmem=128T
> 
> No, it's not the physbits.  You add some memory hotplug slots here.
> Qemu will ask seabios to reserve address space for those, which seabios
> promptly does and maps 64bit pci bars above the reserved address space.

Right, that's what I was trying to do - I wanted to see if I could get something
to use the non-existing address space.

> >    -vga none -device qxl-vga,bus=pcie.0,ram_size_mb=2048,vram64_size_mb=2048 -vnc 0.0.0.0:0 /home/vms/7.2a.qcow2 -chardev stdio,mux=on,id=mon -mon chardev=mon,mode=readline -cpu host,phys-bits=48
> > 
> > it will happily map the qxl VRAM right up high, but if I lower
> > the phys-bits down to 46 it won't.
> 
> I suspect the linux kernel remaps the bar because the seabios mapping is
> unreachable.  Check dmesg.

Right, and that is dependent on physbits; if I run with:

./x86_64-softmmu/qemu-system-x86_64 -machine q35,accel=kvm,usb=off -m 4096,slots=16,maxmem=128T   -vga none -device qxl-vga,bus=pcie.0,ram_size_mb=2048,vram64_size_mb=2048 -vnc 0.0.0.0:0 /home/vms/7.2a.qcow2 -chardev stdio,mux=on,id=mon -mon chardev=mon,mode=readline -cpu host,phys-bits=48
  (on a 46 bit xeon) it happily maps that 64-bit bar into somewhere
that shouldn't be accessible:

[    0.266183] pci_bus 0000:00: root bus resource [mem 0x800480000000-0x8004ffffffff]
[    0.321611] pci 0000:00:02.0: reg 0x20: [mem 0x800480000000-0x8004ffffffff 64bit pref]
[    0.423257] pci_bus 0000:00: resource 8 [mem 0x800480000000-0x8004ffffffff]

lspci -v:

00:02.0 VGA compatible controller: Red Hat, Inc. QXL paravirtual graphic card (rev 04) (prog-if 00 [VGA controller])
	Subsystem: Red Hat, Inc QEMU Virtual Machine
	Flags: fast devsel, IRQ 22
	Memory at c0000000 (32-bit, non-prefetchable) [size=512M]
	Memory at e0000000 (32-bit, non-prefetchable) [size=64M]
	Memory at e4070000 (32-bit, non-prefetchable) [size=8K]
	I/O ports at c080 [size=32]
	Memory at 800480000000 (64-bit, prefetchable) [size=2G]
	Expansion ROM at e4060000 [disabled] [size=64K]
	Kernel driver in use: qxl

So that's mapped at an address beyond host phys-bits.
And it hasn't failed/crashed etc - but I guess maybe nothing is using that 2G space?

If I change the phys-bits=48 to 46 the kernel avoids it:
    [    0.414867] acpi PNP0A08:00: host bridge window [0x800480000000-0x8004ffffffff] (ignored, not CPU addressable)
    [    0.683134] pci 0000:00:02.0: can't claim BAR 4 [mem 0x800480000000-0x8004ffffffff 64bit pref]: no compatible bridge window
    [    0.703948] pci 0000:00:02.0: BAR 4: [mem size 0x80000000 64bit pref] conflicts with PCI mem [mem 0x00000000-0x3fffffffffff]
    [    0.703951] pci 0000:00:02.0: BAR 4: failed to assign [mem size 0x80000000 64bit pref]

lspci shows:
    Memory at <ignored> (64-bit, prefetchable)

(Although, interestingly, qemu's info pci still shows it.)

The 'ignored, not CPU addressable' comes from the kernel's drivers/acpi/pci_root.c acpi_pci_root_validate_resources
that uses a value set in arch/x86/kernel/setup.c:
    iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;

So at least the Linux kernel does sanity check using the phys_bits value.
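The effect of that check can be reproduced with a little arithmetic (a sketch mirroring the kernel logic, using the host-bridge window from the logs above):

```python
def window_is_cpu_addressable(window_end, phys_bits):
    """Mirror of the kernel check: iomem_resource.end is set to
    (1 << phys_bits) - 1, and a host bridge window ending beyond
    that is ignored as 'not CPU addressable'."""
    iomem_resource_end = (1 << phys_bits) - 1
    return window_end <= iomem_resource_end

# The [0x800480000000-0x8004ffffffff] window from the dmesg above:
print(window_is_cpu_addressable(0x8004FFFFFFFF, 46))  # False -> ignored
print(window_is_cpu_addressable(0x8004FFFFFFFF, 48))  # True  -> accepted
```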

Obviously 128T is a bit silly for maxmem at the moment, however I was
worrying about what happens with 36/39/40bit hosts, and it's not
unusual to pick a maxmem that's a few TB even if the VMs you're
initially creating are only a handful of GB. (oVirt/RHEV seems to use
a 4TB default for maxmem.)

Still, this only hits as a problem if you hit the combination of:
   a) You use large PCI bars
   b) On a 36/39/40bit host
   c) With a large maxmem that forces those PCI bars up to something silly.

Dave

> 
> cheers,
>   Gerd
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-30 10:59                                   ` Dr. David Alan Gilbert
@ 2016-06-30 16:14                                     ` Gerd Hoffmann
  2016-06-30 17:12                                       ` Dr. David Alan Gilbert
  2016-07-01 19:03                                       ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 73+ messages in thread
From: Gerd Hoffmann @ 2016-06-30 16:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Michael S. Tsirkin, Andrea Arcangeli, Marcel Apfelbaum,
	Paolo Bonzini, qemu-devel, Eduardo Habkost

> So that's mapped at an address beyond host phys-bits.
> And it hasn't failed/crashed etc - but I guess maybe nothing is using that 2G space?

root@fedora ~# dmesg | grep Surface
[    4.830095] [drm] qxl: 2048M of Surface memory size

qxl bar 4 (64bit) and qxl bar 1 (32bit) are the same thing.  The 64bit
bar can be a lot larger, obviously.  The 32bit bar is just an alias for
the first portion of the 64bit bar.  So I guess qxl just falls back to
using bar 1 instead of bar 4 because ioremap() on bar 4 fails.

> Obviously 128T is a bit silly for maxmem at the moment, however I was worrying what
> happens with 36/39/40bit hosts, and it's not unusual to pick a maxmem that's a few TB
> even if the VMs you're initially creating are only a handful of GB. (oVirt/RHEV seems to use
> a 4TB default for maxmem).

Oh, ok.  Should be fixed I guess.

> Still, this only hits as a problem if you hit the combination of:
>    a) You use large PCI bars

ovmf will map all 64bit bars high, even without running out of 32bit
address space.  And with virtio 1.0 pretty much every virtual machine
will have 64bit bars.

cheers,
  Gerd


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-30 16:14                                     ` Gerd Hoffmann
@ 2016-06-30 17:12                                       ` Dr. David Alan Gilbert
  2016-07-01 19:03                                       ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2016-06-30 17:12 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Michael S. Tsirkin, Andrea Arcangeli, Marcel Apfelbaum,
	Paolo Bonzini, qemu-devel, Eduardo Habkost

* Gerd Hoffmann (kraxel@redhat.com) wrote:
> > So that's mapped at an address beyond host phys-bits.
> > And it hasn't failed/crashed etc - but I guess maybe nothing is using that 2G space?
> 
> root@fedora ~# dmesg | grep Surface
> [    4.830095] [drm] qxl: 2048M of Surface memory size
> 
> qxl bar 4 (64bit) and qxl bar 1 (32bit) are the same thing.  The 64bit
> bar can be a lot larger, obviously.  The 32bit bar is just an alias for
> the first portion of the 64bit bar.  So I guess qxl just falls back to
> using bar 1 instead of bar 4 because ioremap() on bar 4 fails.

Hmm, for me it's saying it mapped 64M on the setup with 64T maxmem and 48-bit phys-bits,
even though the BAR shows as OK; how is the guest's ioremap() detecting a problem?

> > Obviously 128T is a bit silly for maxmem at the moment, however I was worrying what
> > happens with 36/39/40bit hosts, and it's not unusual to pick a maxmem that's a few TB
> > even if the VMs you're initially creating are only a handful of GB. (oVirt/RHEV seems to use
> > a 4TB default for maxmem).
> 
> Oh, ok.  Should be fixed I guess.
> 
> > Still, this only hits as a problem if you hit the combination of:
> >    a) You use large PCI bars
> 
> ovmf will map all 64bit bars high, even without running out of 32bit
> address space.  And with virtio 1.0 pretty much every virtual machine
> will have 64bit bars.

Hmm, OK, let me think about this; using the current BIOS + old virtio + host phys-bits
on a 36-bit host with maxmem=4T would apparently work, because
nothing would actually get mapped that high.

The same with OVMF would work as well, because generally you wouldn't have
any 64bit bars; but then you turn on virtio 1.0 and.. well, then what happens?
The guest sees 36 physical address bits, so Linux probably drops the
BARs associated with the virtio devices?  Hmm.

With the current downstream qemu's 40-bit physical address value you'd
get those BARs mapped, so it might break badly - unless
we can figure out what stops my 2GB qxl BAR from being used,
as at the start of this message.
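For concreteness, the combinations being reasoned about would look roughly like this on the command line (illustrative invocations only; the machine type, disks, and the usual OVMF pflash setup are elided, and phys-bits=0 is the proposed "follow the host" setting from this series):

```shell
# SeaBIOS + legacy virtio, host phys-bits, on a 36-bit host:
# nothing gets mapped above 2^36, so maxmem=4T appears to work.
qemu-system-x86_64 -cpu SandyBridge,phys-bits=0 -m 4G,slots=8,maxmem=4T

# OVMF + virtio 1.0 on the same host: 64-bit BARs are mapped high,
# potentially above what 36 phys-bits can address.
qemu-system-x86_64 -bios OVMF.fd -cpu SandyBridge,phys-bits=0 \
    -m 4G,slots=8,maxmem=4T \
    -device virtio-blk-pci,disable-legacy=on
```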

Dave

> cheers,
>   Gerd
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set)
  2016-06-30 16:14                                     ` Gerd Hoffmann
  2016-06-30 17:12                                       ` Dr. David Alan Gilbert
@ 2016-07-01 19:03                                       ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2016-07-01 19:03 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Michael S. Tsirkin, Andrea Arcangeli, Marcel Apfelbaum,
	Paolo Bonzini, qemu-devel, Eduardo Habkost

* Gerd Hoffmann (kraxel@redhat.com) wrote:
> > So that's mapped at an address beyond host phys-bits.
> > And it hasn't failed/crashed etc - but I guess maybe nothing is using that 2G space?
> 
> root@fedora ~# dmesg | grep Surface
> [    4.830095] [drm] qxl: 2048M of Surface memory size
> 
> qxl bar 4 (64bit) and qxl bar 1 (32bit) are the same thing.  The 64bit
> bar can be a lot larger, obviously.  The 32bit bar is just an alias for
> the first portion of the 64bit bar.  So I guess qxl just falls back to
> using bar 1 instead of bar 4 because ioremap() on bar 4 fails.
> 
> > Obviously 128T is a bit silly for maxmem at the moment, however I was worrying what
> > happens with 36/39/40bit hosts, and it's not unusual to pick a maxmem that's a few TB
> > even if the VMs you're initially creating are only a handful of GB. (oVirt/RHEV seems to use
> > a 4TB default for maxmem).
> 
> Oh, ok.  Should be fixed I guess.
> 
> > Still, this only hits as a problem if you hit the combination of:
> >    a) You use large PCI bars
> 
> ovmf will map all 64bit bars high, even without running out of 32bit
> address space.  And with virtio 1.0 pretty much every virtual machine
> will have 64bit bars.

OK, yes, I got that working, and you're right, it does map it high
(with recent OVMF running virtio 1.0; that needed recent guest/host kernels),
and it also fails easily if the memory doesn't fit, so for
example:

(all on a Xeon with a 46-bit physical address host)

specifying maxmem=1T - the upstream build hangs - but it works if I specify phys-bits=46;
    so yes, it's noticing when the guest phys-bits is too small,
    even if the host can manage it.

It's OK when running with a small amount of RAM and phys-bits=40.

maxmem=64T hangs with any phys-bits:

specifying maxmem=64T with phys-bits=46 on the Xeon hangs
specifying maxmem=64T with phys-bits=48 on the Xeon hangs
specifying maxmem=32T with phys-bits=46-48 on the Xeon works
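The 64T-with-phys-bits=46 hang is at least consistent with simple arithmetic: 2^46 bytes is exactly 64 TiB, so a 64T hotplug region plus anything else (low RAM, the 64-bit PCI window) cannot fit below 46 bits, while 32T leaves plenty of headroom (the phys-bits=48 hang presumably has some other cause). A sketch:

```python
TiB = 1 << 40

# 2^46 bytes is exactly 64 TiB.
print((1 << 46) == 64 * TiB)                 # True
# Adding even a 4G base to 64T overflows a 46-bit address space...
print(64 * TiB + (4 << 30) <= (1 << 46))     # False
# ...whereas 32T plus 4G fits comfortably.
print(32 * TiB + (4 << 30) <= (1 << 46))     # True
```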

So for example we see:
  Bus  2, device   4, function 0:
    SCSI controller: PCI device 1af4:1042
      IRQ 10.
      BAR1: 32 bit memory at 0x98000000 [0x98000fff].
      BAR4: 64 bit prefetchable memory at 0x200800000000 [0x2008007fffff].
      id "virtio-disk0"

  and that works nicely.
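The BAR4 end address in that listing does indeed fit within phys-bits=46, which matches it working; a quick check:

```python
bar4_end = 0x2008007fffff            # end of the virtio-disk BAR4 listed above
print(bar4_end.bit_length())         # 46: the highest bit set is bit 45
print(bar4_end < (1 << 46))          # True: reachable with phys-bits=46
print(bar4_end < (1 << 40))          # False: the old 40-bit default would not reach it
```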

Dave

> 
> cheers,
>   Gerd
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


end of thread, other threads:[~2016-07-01 19:04 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-16 17:12 [Qemu-devel] [PATCH 0/5] x86: Physical address limit patches Dr. David Alan Gilbert (git)
2016-06-16 17:12 ` [Qemu-devel] [PATCH 1/5] BIT_RANGE convenience macro Dr. David Alan Gilbert (git)
2016-06-16 17:23   ` Paolo Bonzini
2016-06-16 17:24     ` Dr. David Alan Gilbert
2016-06-16 18:01   ` Peter Maydell
2016-06-16 18:05     ` Paolo Bonzini
2016-06-20 14:11     ` Dr. David Alan Gilbert
2016-06-20 14:17       ` Peter Maydell
2016-06-16 17:12 ` [Qemu-devel] [PATCH 2/5] x86: Mask mtrr mask based on CPU physical address limits Dr. David Alan Gilbert (git)
2016-06-16 19:59   ` Eduardo Habkost
2016-06-17  8:23     ` Dr. David Alan Gilbert
2016-06-17 12:13     ` Paolo Bonzini
2016-06-16 17:12 ` [Qemu-devel] [PATCH 3/5] x86: fill high bits of mtrr mask Dr. David Alan Gilbert (git)
2016-06-16 20:14   ` Eduardo Habkost
2016-06-17  7:47     ` Paolo Bonzini
2016-06-17 12:46       ` Eduardo Habkost
2016-06-17 13:01         ` Paolo Bonzini
2016-06-17 13:41           ` Eduardo Habkost
2016-06-17 14:25             ` Paolo Bonzini
2016-06-17 15:27               ` Eduardo Habkost
2016-06-17 15:29                 ` Paolo Bonzini
2016-06-17 15:35                   ` Eduardo Habkost
2016-06-17 13:51           ` Dr. David Alan Gilbert
2016-06-17 14:19             ` Paolo Bonzini
2016-06-17  8:53     ` Dr. David Alan Gilbert
2016-06-16 17:12 ` [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set Dr. David Alan Gilbert (git)
2016-06-16 17:26   ` Paolo Bonzini
2016-06-16 18:09     ` Eduardo Habkost
2016-06-16 20:24   ` Eduardo Habkost
2016-06-17  8:15     ` Dr. David Alan Gilbert
2016-06-17  8:43       ` Paolo Bonzini
2016-06-17  9:17         ` Gerd Hoffmann
2016-06-17  9:52           ` Igor Mammedov
2016-06-17 11:20             ` Gerd Hoffmann
2016-06-17 16:20               ` Laszlo Ersek
2016-06-17 16:07             ` Laszlo Ersek
2016-06-19 16:13               ` Marcel Apfelbaum
2016-06-20 10:42                 ` Igor Mammedov
2016-06-20 11:13                   ` Marcel Apfelbaum
2016-06-17  9:37       ` Dr. David Alan Gilbert
2016-06-17  9:54         ` Paolo Bonzini
2016-06-17 13:18       ` Eduardo Habkost
2016-06-17 13:38         ` Paolo Bonzini
2016-06-17 15:19           ` Eduardo Habkost
2016-06-17 15:28             ` Paolo Bonzini
2016-06-17 15:49               ` Eduardo Habkost
2016-06-21 19:44                 ` [Qemu-devel] Default for phys-addr-bits? (was Re: [PATCH 4/5] x86: Allow physical address bits to be set) Eduardo Habkost
2016-06-22 12:41                   ` Paolo Bonzini
2016-06-22 14:24                     ` Andrea Arcangeli
2016-06-22 14:33                       ` Paolo Bonzini
2016-06-22 14:44                         ` Andrea Arcangeli
2016-06-22 14:48                           ` Paolo Bonzini
2016-06-22 15:02                             ` Andrea Arcangeli
2016-06-22 22:44                       ` Michael S. Tsirkin
2016-06-22 23:23                         ` Andrea Arcangeli
2016-06-22 23:45                           ` Michael S. Tsirkin
2016-06-23  8:40                             ` Gerd Hoffmann
2016-06-23 16:38                               ` Michael S. Tsirkin
2016-06-24  5:55                                 ` Gerd Hoffmann
2016-06-24 23:12                                   ` Michael S. Tsirkin
2016-06-29 16:42                               ` Dr. David Alan Gilbert
2016-06-30  6:10                                 ` Gerd Hoffmann
2016-06-30 10:59                                   ` Dr. David Alan Gilbert
2016-06-30 16:14                                     ` Gerd Hoffmann
2016-06-30 17:12                                       ` Dr. David Alan Gilbert
2016-07-01 19:03                                       ` Dr. David Alan Gilbert
2016-06-22 22:40                     ` Michael S. Tsirkin
2016-06-22 23:15                       ` Andrea Arcangeli
2016-06-19  3:36           ` [Qemu-devel] [PATCH 4/5] x86: Allow physical address bits to be set Michael S. Tsirkin
2016-06-20  7:04             ` Paolo Bonzini
2016-06-17 14:24         ` Marcel Apfelbaum
2016-06-16 17:12 ` [Qemu-devel] [PATCH 5/5] x86: Set physical address bits based on host Dr. David Alan Gilbert (git)
2016-06-17  7:25   ` Igor Mammedov
