Date: Tue, 29 Oct 2013 19:21:59 -0200
From: Marcelo Tosatti
Message-ID: <20131029212149.GA32615@amt.cnet>
References: <20131024211158.064049176@amt.cnet> <20131024211249.723543071@amt.cnet>
 <5269B378.6040409@redhat.com> <20131025045805.GA18280@amt.cnet>
 <20131025115718.15b6e788@redhat.com> <20131025133421.GA27529@amt.cnet>
 <20131027162044.19769397@redhat.com> <20131028140406.GA18025@amt.cnet>
 <20131029190054.0c9faec5@nial.usersys.redhat.com>
In-Reply-To: <20131029190054.0c9faec5@nial.usersys.redhat.com>
Subject: Re: [Qemu-devel] [patch 2/2] i386: pc: align gpa<->hpa on 1GB boundary
To: Igor Mammedov
Cc: aarcange@redhat.com, Paolo Bonzini, qemu-devel@nongnu.org, gleb@redhat.com

On Tue, Oct 29, 2013 at 07:00:54PM +0100, Igor Mammedov wrote:
> On Mon, 28 Oct 2013 12:04:06 -0200
> Marcelo Tosatti wrote:
>
> > On Sun, Oct 27, 2013 at 04:20:44PM +0100, Igor Mammedov wrote:
> > > > Yes, I thought of that; unfortunately it is cumbersome to add an
> > > > interface for the user to supply both 2MB and 1GB hugetlbfs pages.
> > > Could the 2MB tails be automated? Meaning that if the host uses 1GB
> > > hugepages and there is a tail, QEMU should be able to figure out the
> > > alignment issues and allocate it with appropriately sized pages.
> >
> > Yes, that would be ideal, but the problem with hugetlbfs is that pages
> > are preallocated.
> >
> > So in the end you'd have to expose the split of guest RAM into 2MB/1GB
> > types to the user (it would be necessary for the user to calculate the
> > size of the hole, etc.).
> Exposing it to the user might not be necessary;
> QEMU could allocate 5GB+3MB of RAM without user intervention:

It is necessary, because the user has to allocate the hugetlbfs pages
(see the end of the email).

> 3GB low.ram.aligned.region    // using huge pages
> 1MB low.ram.unaligned.region, if below_4g_ram_size - 3GB  // so as not to waste precious low RAM; uses fallback allocation
> // hypothetically the hole starts at 3GB+1MB
> 2GB high.ram.aligned.region   // using huge pages
> 2MB high.ram.unaligned.region // so as not to waste a 1GB huge page on it

You want memory areas that are not backed by 1GB pages to be backed by
2MB pages, so that the possibility of creating TLB entries per physical
address range is similar to, or matches, that of physical hardware (*).
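To make that concrete, here is a rough stand-alone sketch of the split,
using the sizes from your 5GB+3MB example. This is illustrative C only,
not the patch; below_4g_ram_size/above_4g_ram_size simply mirror the
names used above.

#include <stdint.h>
#include <stdio.h>

#define SZ_2M (2ULL << 20)
#define SZ_1G (1ULL << 30)

int main(void)
{
    /* Sizes from the 5GB+3MB example: 3GB+1MB below the hole, 2GB+2MB above. */
    uint64_t below_4g_ram_size = 3 * SZ_1G + (1ULL << 20);
    uint64_t above_4g_ram_size = 2 * SZ_1G + SZ_2M;

    /* 1GB-aligned parts: back them with 1GB hugetlbfs pages. */
    uint64_t low_aligned  = below_4g_ram_size & ~(SZ_1G - 1);
    uint64_t high_aligned = above_4g_ram_size & ~(SZ_1G - 1);

    /* Remainders: back them with 2MB pages (or fallback), see (*) above. */
    uint64_t low_tail  = below_4g_ram_size - low_aligned;
    uint64_t high_tail = above_4g_ram_size - high_aligned;

    printf("low:  %llu MB with 1GB pages + %llu MB tail\n",
           (unsigned long long)(low_aligned >> 20),
           (unsigned long long)(low_tail >> 20));
    printf("high: %llu MB with 1GB pages + %llu MB tail\n",
           (unsigned long long)(high_aligned >> 20),
           (unsigned long long)(high_tail >> 20));
    return 0;
}

This prints a 1MB low tail and a 2MB high tail, i.e. exactly the regions
that need 2MB (or fallback) backing, and exactly the pages the user
still has to preallocate.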
> > > The goal is to separate the host-side allocation aspect from the
> > > guest-related one; aliasing the 32-bit hole size at the end doesn't
> > > help that at all, quite the opposite: it makes the current code more
> > > complicated and harder to fix in the future.
> >
> > You can simply back the 1GB areas in which the hole resides with 2MB
> > pages.
> I'm not getting what you mean here.

The 1GB-sized areas that cannot be mapped with 1GB TLB entries, such as
the [3GB,4GB] guest physical address range, can be mapped with 2MB TLB
entries instead.

> > Can't see why having the tail of RAM map to the hole is problematic.
> The problem I see is that with the proposed aliasing there is no
> one-to-one mapping to the future "memdev", where each DIMM device
> (guest/model-visible memory block) has a corresponding memdev backend
> (host memory block).

1) What is the dependency of memdev on a linear host memory block?
(That is, I can't see the reasoning behind a one-to-one mapping.)

2) Why can't memdev access host memory via mappings? (That is, why does
memdev require each DIMM to be mapped linearly in QEMU's virtual address
space?)

> Moreover, with the current hugepage handling in QEMU, including this
> patch, and with 1GB hugepages in use, QEMU might lose ~1GB for
> -m "hpagesize*n+1", which is by itself a good reason to use several
> allocations with different allocator backends.
>
> > I understand your concern, but the complication is necessary: the host
> > virtual/physical addresses and the guest physical addresses must be
> > aligned on large-page boundaries.
> I don't argue against that, only about the best way to achieve it.
>
> If we assume a possible conversion of the ad hoc way of allocating
> initial RAM to DIMM devices in the future, then changing the region
> layout several times in incompatible ways doesn't seem to be the best
> approach. If we are going to change it, let's at least minimize
> compatibility issues and do it right the first time.
>
> I'll post an RFC patch as a reply to this thread.
>
> >
> > Do you foresee any problem with memory hotplug?
> I don't see any problem with memory hotplug so far, but as noted above
> there will be problems with converting initial RAM to DIMM devices.
>
> >
> > Could add a warning to the memory API: if a memory region is larger
> > than 1GB, RAM is 1GB backed, and the region is not properly aligned,
> > warn.
> Perhaps it would be better to abort and ask the user to fix the
> configuration, and on hugepage allocation failure not fall back to
> malloc but abort and tell the user the number of hugepages needed to
> run the guest with the hugepage backend.

You want to back your guest with 1GB hugepages. Worst case, you get one
such page at a time.

You either

1) map the guest physical address space region (1GB sized) where the
hole is located with smaller page sizes, which must be 2MB (see * above);
this requires the user to specify a different hugetlbfs mount path with
a sufficient number of 2MB huge pages, or

2) move the pieces of guest memory which can't be mapped with 1GB pages
so that they are still backed by 1GB hugepages, and map the remaining
1GB-aligned regions to individual 1GB pages.

I am trying to avoid 1), as it complicates management (and fixes a bug).
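To illustrate the management burden of 1): the user (or the management
layer) would have to prepare two hugetlbfs mounts, one per page size, and
QEMU would map the hole-covering region from the 2MB one. A minimal
sketch follows; the mount paths, file names and sizes are made up, error
handling is reduced to abort(), and a 64-bit host is assumed.

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map "size" bytes of anonymous-looking guest RAM from a hugetlbfs file.
 * The page size is whatever the containing mount was created with. */
static void *map_hugetlb_file(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        abort();
    if (ftruncate(fd, size) < 0)
        abort();
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        abort();
    close(fd);          /* the mapping stays valid after close() */
    return p;
}

int main(void)
{
    const size_t gb = (size_t)1 << 30;

    /* Bulk of guest RAM, 1GB-aligned, from a 1GB hugetlbfs mount
     * (hypothetical path) ... */
    void *ram_1g = map_hugetlb_file("/hugepages-1G/guest-ram", 4 * gb);
    /* ... and the 1GB-sized region containing the PCI hole from a
     * separate 2MB mount (hypothetical path). */
    void *ram_2m = map_hugetlb_file("/hugepages-2M/guest-ram-hole", 1 * gb);

    (void)ram_1g;
    (void)ram_2m;
    return 0;
}

Option 2) keeps a single hugetlbfs mount and a single page size for the
admin to preallocate and reason about, which is the point.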