Date: Wed, 30 Oct 2013 16:51:29 -0200
From: Marcelo Tosatti
To: Igor Mammedov
Cc: aarcange@redhat.com, peter.maydell@linaro.org, gleb@redhat.com,
 quintela@redhat.com, jan.kiszka@siemens.com, qemu-devel@nongnu.org,
 aliguori@amazon.com, pbonzini@redhat.com, afaerber@suse.de, rth@twiddle.net
Subject: Re: [Qemu-devel] [RFC PATCH] pc: align gpa<->hpa on 1GB boundary by splitting RAM on several regions
Message-ID: <20131030185129.GB18378@amt.cnet>
In-Reply-To: <20131030174949.2fb0d2c2@nial.usersys.redhat.com>
References: <20131028140406.GA18025@amt.cnet> <1383070729-19427-1-git-send-email-imammedo@redhat.com> <20131029213844.GB32615@amt.cnet> <20131030174949.2fb0d2c2@nial.usersys.redhat.com>

On Wed, Oct 30, 2013 at 05:49:49PM +0100, Igor Mammedov wrote:
> On Tue, 29 Oct 2013 19:38:44 -0200
> Marcelo Tosatti wrote:
> 
> > On Tue, Oct 29, 2013 at 07:18:49PM +0100, Igor Mammedov wrote:
> > > Otherwise 1GB TLBs cannot be cached for the range.
> > 
> > This fails to back non-1GB-aligned but 2MB-aligned gpas with 2MB
> > large pages.
> With the current command line only one hugetlbfs mount point is possible,
> so RAM is backed with whatever page size the specified hugetlbfs mount
> point has. Anything that does not fit into a hugepage-aligned region goes
> to the tail, allocated with the non-hugepage-backed
> phys_mem_set_alloc()=qemu_anon_ram_alloc() allocator.

The patch you propose allocates the non-1GB-aligned tail of RAM with 4k
pages. As mentioned, this is not acceptable (2MB pages should be used
whenever 1GB alignment is not possible).

I believe it is easier for the user to allocate enough 1GB pages to back
all of guest RAM, since allocation is static, than to allocate mixed
1GB/2MB pages in hugetlbfs.
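For illustration only (the page count, mount path, and qemu invocation
below are examples, not taken from the patch), backing all of guest RAM
with statically allocated 1GB pages means something like:

# kernel command line: default_hugepagesz=1G hugepagesz=1G hugepages=4
mount -t hugetlbfs -o pagesize=1G none /hugetlbfs/1gb
qemu-system-x86_64 -m 4096 -mem-path /hugetlbfs/1gb ...

Note that 1GB pages generally have to be reserved at boot, which is what
makes the allocation static.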
> > Since hugetlbfs allocation is static, it requires the user to inform
> > different 1GB and 2MB sized hugetlbfs mount points (with the proper
> > number of corresponding hugetlbfs pages allocated). This is incompatible
> > with the current command line, and I'd like to see this problem handled
> > in a way that is command-line backwards compatible.
> The patch doesn't change that: it uses the provided hugetlbfs and falls
> back (hunk 2) to phys_mem_alloc if the requested memory region is not
> hugepage-size aligned. So there is no CLI change, only a memory leak fix.
> 
> > Also, if the argument for one-to-one mapping between dimms and linear
> > host virtual address sections holds, it means virtual DIMMs must be
> > partitioned into whatever hugepage alignment is necessary (and in that
> > case, why can't they be partitioned similarly with the memory region
> > aliases?).
> Because during hotplug a new memory region of the desired size is
> allocated and it can be mapped directly without any aliasing. And if some
> day we convert the ad hoc initial memory allocation to dimm devices,
> there is no reason to allocate one huge block and then invent means to
> alias the hole somewhere else; we could just reuse memdev/dimm and
> allocate several memory regions with the desired properties, each
> represented by a memdev/dimm pair.
> 
> One-to-one mapping simplifies the design and the interface with the ACPI
> part during memory hotplug.
> 
> For the hotplug case the flow could look like:
> 
> memdev_add id=x1,size=1Gb,mem-path=/hugetlbfs/1gb,other-host-related-stuff-options
> #memdev could enforce size to be backend aligned
> device_add dimm,id=y1,backend=x1,addr=xxxxxx
> #dimm could get alignment from associated memdev or fail if addr
> #doesn't meet alignment of memdev backend
> 
> memdev_add id=x2,size=2mb,mem-path=/hugetlbfs/2mb
> device_add dimm,id=y2,backend=x2,addr=yyyyyyy
> 
> memdev_add id=x3,size=1mb
> device_add dimm,id=y3,backend=x3,addr=xxxxxxx
> 
> A linear memory block is allocated at runtime (the user has to make sure
> that enough hugepages are available) by each memdev_add command, and that
> RAM memory region is mapped into GPA space by the virtual DIMM as is;
> there wouldn't be any need for aliasing.
> 
> Now back to initial memory and the bright future we are looking forward
> to (i.e. the ability to create a machine from a configuration file
> without ad hoc coding like pc_memory_init()):
> 
> The legacy command line "-m 4512 -mem-path /hugetlbfs/1gb" could be
> automatically translated into:
> 
> -memdev id=x1,size=3g,mem-path=/hugetlbfs/1gb -device dimm,backend=x1,addr=0
> -memdev id=x2,size=1g,mem-path=/hugetlbfs/1gb -device dimm,backend=x2,addr=4g
> -memdev id=x3,size=512m -device dimm,backend=x3,addr=5g
> 
> Or the user could drop the legacy CLI and assume fine-grained control
> over the memory configuration:
> 
> -memdev id=x1,size=3g,mem-path=/hugetlbfs/1gb -device dimm,backend=x1,addr=0
> -memdev id=x2,size=1g,mem-path=/hugetlbfs/1gb -device dimm,backend=x2,addr=4g
> -memdev id=x3,size=512m,mem-path=/hugetlbfs/2mb -device dimm,backend=x3,addr=5g
> 
> So if we are going to break migration compatibility for the new machine
> type, let's do it in a way that can painlessly be changed to
> memdev/device in the future.

OK, then please improve your proposal to allow for multiple hugetlbfs
mount points.

> > > PS:
> > > as a side effect we are not wasting ~1GB of memory if
> > > 1GB hugepages are used and -m "hpagesize(in MB)*n + 1"
> > 
> > This is how hugetlbfs works. You waste a 1GB hugepage if an extra
> > byte is requested.
> It looks more like a bug than a feature;
> why do it if the leak can be avoided as shown below?

Because IMO it is confusing for the user, since hugetlbfs allocation is
static.

But if you have a necessity for the one-to-one relationship, feel free to
support mixed hugetlbfs page sizes.
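For illustration, the mixed-page-size host setup such a proposal implies
would look something like this (page counts and mount paths are examples
chosen to match the memdev commands above):

# kernel command line: hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=256
mkdir -p /hugetlbfs/1gb /hugetlbfs/2mb
mount -t hugetlbfs -o pagesize=1G none /hugetlbfs/1gb
mount -t hugetlbfs -o pagesize=2M none /hugetlbfs/2mb
# 2MB pages (unlike 1GB ones) can also be adjusted at runtime:
echo 256 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Each mount point would then back one memdev in the scheme above.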