From: Paolo Bonzini
To: Eduardo Habkost
Cc: Andrea Arcangeli, qemu-devel@nongnu.org, Gleb Natapov, qemu-stable@nongnu.org
Subject: Re: [Qemu-devel] [PATCH] fix guest physical bits to match host, to go beyond 1TB guests
Date: Wed, 17 Jul 2013 16:01:55 +0200
Message-ID: <51E6A3D3.4020908@redhat.com>
In-Reply-To: <20130717133721.GH11420@otherpad.lan.raisama.net>

On 17/07/2013 15:39, Eduardo Habkost wrote:
> On Wed, Jul 17, 2013 at 10:09:01AM +0200, Paolo Bonzini wrote:
>> On 16/07/2013 21:42, Eduardo Habkost wrote:
>>> On Tue, Jul 16, 2013 at 09:24:30PM +0200, Paolo Bonzini wrote:
>>>> On 16/07/2013 20:11, Eduardo Habkost wrote:
>>>>> For the physical bit size, what about extending it in a
>>>>> backwards-compatible way?  Something like this:
>>>>>
>>>>>     *eax = 0x00003000; /* 48 bits virtual */
>>>>>     if (ram_size < 1TB) {
>>>>>         physical_size = 40; /* keeping backwards compatibility */
>>>>>     } else if (ram_size < 4TB) {
>>>>>         physical_size = 42;
>>>>
>>>> Why not go straight up to 44?
>>>
>>> I simply trusted the comment saying "The physical address space is
>>> limited to 42 bits in exec.c", and assumed we had a 42-bit limit
>>> somewhere else.
>>
>> Yeah, that's obsolete.  We can now go up to 64 (but actually only
>> support 52, because that's what Intel says will be the limit; 4PB of
>> RAM should be good for everyone, as Bill Gates used to say).
>>
>> So far Intel has been growing the physical address space in x16
>> steps, i.e. 4 bits at a time (MAXPHYADDR was 36, then 40, then 44).
>> MAXPHYADDR is Intel's name for what you wrote as physical_size.
>
> Then if ram_size is too large, we could round it up to a multiple
> of 4?

Yeah, that would be closer to real processors.

>> if (supported_host_physical_size() < msb(ram_size)) {
>>     abort();
>> }
>
> What if the host's max physical size is smaller than the MAXPHYADDR
> we're setting for the VM (but still larger than msb(ram_size))?  Will
> the CPU or KVM complain, or will it work?

That would only happen with a buggy guest that points page tables at
non-existent RAM, i.e.:

- the host has MAXPHYADDR = 36, corresponding to a 64 GB physical
  address space;

- the guest has 2 GB of RAM, but we still set MAXPHYADDR = 40 for
  backwards compatibility;

- the guest creates a page table for RAM beyond 64 GB just because it
  can.

KVM would complain; real hardware would probably return all-ones or
something like that.  I think we can ignore it.
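As a concrete sketch of the check quoted above: the snippet below
computes msb(ram_size), rounds it up to a multiple of 4 as suggested
earlier in the thread, and refuses to start if the host cannot cover
the guest's RAM.  Note that msb() and supported_host_physical_size()
are placeholder names from this thread, not existing QEMU functions,
and the host value is hard-coded purely for illustration:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Bits needed to address ram_size bytes, i.e. msb(ram_size). */
    static unsigned msb(uint64_t ram_size)
    {
        unsigned bits = 0;
        while (bits < 63 && (UINT64_C(1) << bits) < ram_size) {
            bits++;
        }
        return bits;
    }

    /* Host MAXPHYADDR; real code would read it from CPUID leaf
     * 0x80000008, EAX bits 7:0, on the host.  40 stands in for a
     * Nehalem-class machine. */
    static unsigned supported_host_physical_size(void)
    {
        return 40;
    }

    int main(void)
    {
        uint64_t ram_size = UINT64_C(512) << 30;     /* 512 GB guest */
        unsigned needed = msb(ram_size);
        unsigned physical_size = (needed + 3) & ~3u; /* round up to 4 */

        if (supported_host_physical_size() < needed) {
            abort();   /* the host cannot map all of guest RAM */
        }
        printf("MAXPHYADDR for the guest: %u\n", physical_size);
        return 0;
    }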
> In other words, do we need a check for
>
>     (supported_host_physical_size() < physical_size)
>
> below, as well?

That would have to be conditional on !some_compat_prop too, because
MAXPHYADDR=40 is beyond what older VMX-enabled CPUs support (for
example, Core 2 has MAXPHYADDR=36).  I don't think it's important,
though; the less stringent test against msb(ram_size) should be
enough.

>> if (ram_size < 64GB && !some_compat_prop) {
>>     physical_size = 36;
>> } else if (ram_size < 1TB) {
>>     physical_size = 40;
>> } else {
>>     physical_size = 44;
>> }
>>
>> What do you think?
>
> Why stop at 44?  What if ram_size is larger than (1 << 44)?

You can certainly add more.  However, no processors exist yet that
have MAXPHYADDR > 44, so you would already have failed the check
above.  I didn't go beyond 44 because, who knows, maybe Intel will go
from 44 to 46 and break the tradition of increasing by 4.

> We must never start a VM with physical_size < msb(ram_size), right?
>
> But I am not sure whether we should simply increase physical_size
> automatically, or abort in the cases where physical_size ends up
> being too small.  (I believe we should simply increase it
> automatically.)

Increasing automatically is fine; that's basically what those "if"s
do.

>>>> This makes sense too.  Though the best would of course be to use
>>>> the CPUID values coming from the real processors, and only use 40
>>>> for backwards compatibility.
>>>
>>> We can't use the values coming from the real processors directly,
>>> or we will break live migration.
>>
>> I said real processors, not host processors. :)
>
> So, you mean setting per-cpu-model values?

Yes.

>> So a Core 2 has MAXPHYADDR=36, Nehalem has IIRC 40, Sandy Bridge has
>> 44, and so on.
>
> That would work as well (and on pc-1.5 and older we could keep
> physical_size=40, or use the above algorithm).  But I wonder whether
> it would be usable.  If somebody is using "-cpu Nehalem -m 8T", I
> believe it would make sense to increase the physical address size
> automatically instead of aborting.  What do you think?

I think I agree.  This leaf is used so little by the guest (the
hypervisor uses it) that mimicking real CPU models in everything is of
little value.

> If the VM is already broken and has MAXPHYADDR < msb(ram_size), I
> don't worry about ABI stability, because those VMs are not even
> supposed to be running properly.  If somebody has such a broken VM
> running, it seems better to make the MAXPHYADDR bits suddenly change
> to a reasonable value (so that MAXPHYADDR >= msb(ram_size)) during
> live migration than to abort the migration, or to keep the broken
> value.
>
> So if we use an algorithm that always increases MAXPHYADDR
> automatically (breaking the ABI in the cases where the current value
> is already broken), we are not going to abort migration unless the
> host really can't run our guest (this seems a less serious problem
> than aborting because the VM configuration is broken and needs to be
> manually adjusted).

Agreed.  So perhaps per-cpu-model values aren't too useful.

Paolo
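For completeness, here is a self-contained sketch that puts together
the pieces discussed in this thread: the compat-gated 36/40/44
selection, the automatic increase in steps of 4 (so a VM never starts
with MAXPHYADDR < msb(ram_size)), and the CPUID leaf 0x80000008
packing (virtual bits in EAX bits 15:8, MAXPHYADDR in bits 7:0).  The
names (pick_guest_maxphyaddr, the compat flag standing in for
some_compat_prop) are illustrative, not the actual QEMU
implementation:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define GiB (UINT64_C(1) << 30)
    #define TiB (UINT64_C(1) << 40)

    /* Bits needed to address ram_size bytes. */
    static unsigned phys_bits_needed(uint64_t ram_size)
    {
        unsigned bits = 0;
        while (bits < 63 && (UINT64_C(1) << bits) < ram_size) {
            bits++;
        }
        return bits;
    }

    /* Compat-gated 36/40/44 choice, then automatic increase in
     * steps of 4, mirroring Intel's historical MAXPHYADDR steps. */
    static unsigned pick_guest_maxphyaddr(uint64_t ram_size, bool compat)
    {
        unsigned physical_size;

        if (ram_size < 64 * GiB && !compat) {
            physical_size = 36;  /* Core 2 class */
        } else if (ram_size < 1 * TiB) {
            physical_size = 40;  /* backwards-compatible default */
        } else {
            physical_size = 44;  /* current Intel maximum */
        }
        while (physical_size < phys_bits_needed(ram_size)) {
            physical_size += 4;
        }
        return physical_size;
    }

    /* CPUID leaf 0x80000008 EAX: virtual bits 15:8, physical 7:0. */
    static uint32_t cpuid_80000008_eax(unsigned physical_size)
    {
        return (48u << 8) | physical_size;  /* 48 bits virtual */
    }

    int main(void)
    {
        uint64_t ram_size = 8 * TiB;        /* the "-m 8T" example */
        unsigned maxphyaddr = pick_guest_maxphyaddr(ram_size, false);

        printf("MAXPHYADDR = %u, EAX = 0x%08x\n", maxphyaddr,
               (unsigned)cpuid_80000008_eax(maxphyaddr));
        return 0;
    }

With 8 TiB of RAM this selects MAXPHYADDR = 44 and packs EAX as
0x0000302c, matching the 36/40/44 chain plus automatic increase that
the thread converged on.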