From: Paolo Bonzini
To: Eduardo Habkost
Cc: Andrea Arcangeli, qemu-devel@nongnu.org, Gleb Natapov, qemu-stable@nongnu.org
Subject: Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
Date: Wed, 17 Jul 2013 16:01:55 +0200
Message-ID: <51E6A3D3.4020908@redhat.com>
In-Reply-To: <20130717133721.GH11420@otherpad.lan.raisama.net>

On 17/07/2013 15:39, Eduardo Habkost wrote:
> On Wed, Jul 17, 2013 at 10:09:01AM +0200, Paolo Bonzini wrote:
>> On 16/07/2013 21:42, Eduardo Habkost wrote:
>>> On Tue, Jul 16, 2013 at 09:24:30PM +0200, Paolo Bonzini wrote:
>>>> On 16/07/2013 20:11, Eduardo Habkost wrote:
>>>>> For the physical bit size, what about extending it in a
>>>>> backwards-compatible way?  Something like this:
>>>>>
>>>>>     *eax = 0x00003000; /* 48 bits virtual */
>>>>>     if (ram_size < 1TB) {
>>>>>         physical_size = 40; /* keeping backwards compatibility */
>>>>>     } else if (ram_size < 4TB) {
>>>>>         physical_size = 42;
>>>>
>>>> Why not go straight up to 44?
>>>
>>> I simply trusted the comment saying "The physical address space is
>>> limited to 42 bits in exec.c", and assumed we had a 42-bit limit
>>> somewhere else.
>>
>> Yeah, that's obsolete.  We can now go up to 64 (but actually only
>> support 52, because that's what Intel says will be the limit; 4PB of
>> RAM should be good for everyone, as Bill Gates used to say).
>>
>> So far Intel has been growing the physical address space in x16
>> steps, i.e. 4 bits at a time (MAXPHYADDR was 36, then 40, then 44).
>> MAXPHYADDR is Intel's name for what you wrote as physical_size.
>
> Then if ram_size is too large, we could round it up to a multiple
> of 4?

Yeah, that would be closer to real processors.

>> if (supported_host_physical_size() < msb(ram_size)) {
>>     abort();
>> }
>
> What if the host's max physical size is smaller than the MAXPHYADDR
> we're setting for the VM (but still larger than msb(ram_size))?  Will
> the CPU or KVM complain, or will it work?

That would only happen with a buggy guest that points page tables at
non-existent RAM, i.e.:

- the host has MAXPHYADDR = 36, corresponding to a 64 GB physical
  address space;

- the guest has 2 GB of RAM, but we still set MAXPHYADDR = 40 for
  backwards compatibility;

- the guest creates a page table for RAM beyond 64 GB just because it
  can.

KVM would complain; real hardware would probably return all-ones or
something like that.  I think we can ignore it.
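As a concrete sketch of the check quoted above: the snippet below
computes msb(ram_size), rounds it up to a multiple of 4 as suggested
earlier in the thread, and refuses to start if the host cannot cover
the guest's RAM.  Note that msb() and supported_host_physical_size()
are placeholder names from this thread, not existing QEMU functions,
and the host value is hard-coded purely for illustration:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Bits needed to address ram_size bytes, i.e. msb(ram_size). */
    static unsigned msb(uint64_t ram_size)
    {
        unsigned bits = 0;
        while (bits < 63 && (UINT64_C(1) << bits) < ram_size) {
            bits++;
        }
        return bits;
    }

    /* Host MAXPHYADDR; real code would read it from CPUID leaf
     * 0x80000008, EAX bits 7:0, on the host.  40 stands in for a
     * Nehalem-class machine. */
    static unsigned supported_host_physical_size(void)
    {
        return 40;
    }

    int main(void)
    {
        uint64_t ram_size = UINT64_C(512) << 30;     /* 512 GB guest */
        unsigned needed = msb(ram_size);
        unsigned physical_size = (needed + 3) & ~3u; /* round up to 4 */

        if (supported_host_physical_size() < needed) {
            abort();   /* the host cannot map all of guest RAM */
        }
        printf("MAXPHYADDR for the guest: %u\n", physical_size);
        return 0;
    }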
> In other words, do we need a check for
>
>     (supported_host_physical_size() < physical_size)
>
> below, as well?

That would have to be conditional on !some_compat_prop too, because
MAXPHYADDR=40 is beyond what older VMX-enabled CPUs support (for
example, Core 2 has MAXPHYADDR=36).  I don't think it's important,
though; the less stringent test against msb(ram_size) should be
enough.

>> if (ram_size < 64GB && !some_compat_prop) {
>>     physical_size = 36;
>> } else if (ram_size < 1TB) {
>>     physical_size = 40;
>> } else {
>>     physical_size = 44;
>> }
>>
>> What do you think?
>
> Why stop at 44?  What if ram_size is larger than (1 << 44)?

You can certainly add more.  However, no processors exist yet that
have MAXPHYADDR > 44, so you would already have failed the check
above.  I didn't go beyond 44 because, who knows, maybe Intel will go
from 44 to 46 and break the tradition of increasing by 4.

> We must never start a VM with physical_size < msb(ram_size), right?
>
> But I am not sure whether we should simply increase physical_size
> automatically, or abort in the cases where physical_size ends up
> being too small.  (I believe we should simply increase it
> automatically.)

Increasing automatically is fine; that's basically what those "if"s
do.

>>>> This makes sense too.  Though the best would of course be to use
>>>> the CPUID values coming from the real processors, and only use 40
>>>> for backwards compatibility.
>>>
>>> We can't use the values coming from the real processors directly,
>>> or we will break live migration.
>>
>> I said real processors, not host processors. :)
>
> So, you mean setting per-cpu-model values?

Yes.

>> So a Core 2 has MAXPHYADDR=36, Nehalem has IIRC 40, Sandy Bridge has
>> 44, and so on.
>
> That would work as well (and on pc-1.5 and older we could keep
> physical_size=40, or use the above algorithm).  But I wonder whether
> it would be usable.  If somebody is using "-cpu Nehalem -m 8T", I
> believe it would make sense to increase the physical address size
> automatically instead of aborting.  What do you think?

I think I agree.  This leaf is used so little by the guest (the
hypervisor uses it) that mimicking real CPU models in everything is of
little value.

> If the VM is already broken and has MAXPHYADDR < msb(ram_size), I
> don't worry about ABI stability, because those VMs are not even
> supposed to be running properly.  If somebody has such a broken VM
> running, it seems better to make the MAXPHYADDR bits suddenly change
> to a reasonable value (so that MAXPHYADDR >= msb(ram_size)) during
> live migration than to abort the migration, or to keep the broken
> value.
>
> So if we use an algorithm that always increases MAXPHYADDR
> automatically (breaking the ABI in the cases where the current value
> is already broken), we are not going to abort migration unless the
> host really can't run our guest (this seems a less serious problem
> than aborting because the VM configuration is broken and needs to be
> manually adjusted).

Agreed.  So perhaps per-cpu-model values aren't too useful.

Paolo
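For completeness, here is a self-contained sketch that puts together
the pieces discussed in this thread: the compat-gated 36/40/44
selection, the automatic increase in steps of 4 (so a VM never starts
with MAXPHYADDR < msb(ram_size)), and the CPUID leaf 0x80000008
packing (virtual bits in EAX bits 15:8, MAXPHYADDR in bits 7:0).  The
names (pick_guest_maxphyaddr, the compat flag standing in for
some_compat_prop) are illustrative, not the actual QEMU
implementation:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define GiB (UINT64_C(1) << 30)
    #define TiB (UINT64_C(1) << 40)

    /* Bits needed to address ram_size bytes. */
    static unsigned phys_bits_needed(uint64_t ram_size)
    {
        unsigned bits = 0;
        while (bits < 63 && (UINT64_C(1) << bits) < ram_size) {
            bits++;
        }
        return bits;
    }

    /* Compat-gated 36/40/44 choice, then automatic increase in
     * steps of 4, mirroring Intel's historical MAXPHYADDR steps. */
    static unsigned pick_guest_maxphyaddr(uint64_t ram_size, bool compat)
    {
        unsigned physical_size;

        if (ram_size < 64 * GiB && !compat) {
            physical_size = 36;  /* Core 2 class */
        } else if (ram_size < 1 * TiB) {
            physical_size = 40;  /* backwards-compatible default */
        } else {
            physical_size = 44;  /* current Intel maximum */
        }
        while (physical_size < phys_bits_needed(ram_size)) {
            physical_size += 4;
        }
        return physical_size;
    }

    /* CPUID leaf 0x80000008 EAX: virtual bits 15:8, physical 7:0. */
    static uint32_t cpuid_80000008_eax(unsigned physical_size)
    {
        return (48u << 8) | physical_size;  /* 48 bits virtual */
    }

    int main(void)
    {
        uint64_t ram_size = 8 * TiB;        /* the "-m 8T" example */
        unsigned maxphyaddr = pick_guest_maxphyaddr(ram_size, false);

        printf("MAXPHYADDR = %u, EAX = 0x%08x\n", maxphyaddr,
               (unsigned)cpuid_80000008_eax(maxphyaddr));
        return 0;
    }

With 8 TiB of RAM this selects MAXPHYADDR = 44 and packs EAX as
0x0000302c, matching the 36/40/44 chain plus automatic increase that
the thread converged on.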