Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled

From: Marc Zyngier <maz@kernel.org>
To: Ricardo Koller <ricarkol@google.com>
Cc: Oliver Upton <oliver.upton@linux.dev>,
	pbonzini@redhat.com, oupton@google.com, yuzenghui@huawei.com,
	dmatlack@google.com, kvm@vger.kernel.org, kvmarm@lists.linux.dev,
	qperret@google.com, catalin.marinas@arm.com,
	andrew.jones@linux.dev, seanjc@google.com,
	alexandru.elisei@arm.com, suzuki.poulose@arm.com,
	eric.auger@redhat.com, gshan@redhat.com, reijiw@google.com,
	rananta@google.com, bgardon@google.com, ricarkol@gmail.com
Subject: Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
Date: Tue, 31 Jan 2023 10:28:47 +0000	[thread overview]
Message-ID: <86h6w70zhc.wl-maz@kernel.org> (raw)
In-Reply-To: <CAOHnOrx-vvuZ9n8xDRmJTBCZNiqvcqURVyrEt2tDpw5bWT0qew@mail.gmail.com>

On Fri, 27 Jan 2023 15:45:15 +0000,
Ricardo Koller <ricarkol@google.com> wrote:
> 
> > The one thing that would convince me to make it an option is the
> > amount of memory this thing consumes. 512+ pages is a huge amount, and
> > I'm not overly happy about that. Why can't this be a userspace visible
> > option, selectable on a per VM (or memslot) basis?
> >
> 
> It should be possible.  I am exploring a couple of ideas that could
> help when the hugepages are not 1G (e.g., 2M).  However, they add
> complexity and I'm not sure they help much.
> 
> (will be using PAGE_SIZE=4K to make things simpler)
> 
> This feature pre-allocates 513 pages before splitting every 1G range.
> For example, it converts 1G block PTEs into trees made of 513 pages.
> When not using this feature, the same 513 pages would be allocated,
> but lazily over a longer period of time.

This is an important difference. It avoids the upfront allocation
"thermal shock", giving time to the kernel to reclaim memory from
somewhere else. Doing it upfront means you *must* have 2MB+ of
immediately available memory for each GB of RAM you guest uses.

> 
> Eager-splitting pre-allocates those pages in order to split huge-pages
> into fully populated trees.  Which is needed in order to use FEAT_BBM
> and skipping the expensive TLBI broadcasts.  513 is just the number of
> pages needed to break a 1G huge-page.

I understand that. But it also clear that 1GB huge pages are unlikely
to be THPs, and I wonder if we should treat the two differently. Using
HugeTLBFS pages is significant here.

> 
> We could optimize for smaller huge-pages, like 2M by splitting 1
> huge-page at a time: only preallocate one 4K page at a time.  The
> trick is how to know that we are splitting 2M huge-pages.  We could
> either get the vma pagesize or use hints from userspace.  I'm not sure
> that this is worth it though.  The user will most likely want to split
> big ranges of memory (>1G), so optimizing for smaller huge-pages only
> converts the left into the right:
> 
> alloc 1 page            |    |  alloc 512 pages
> split 2M huge-page      |    |  split 2M huge-page
> alloc 1 page            |    |  split 2M huge-page
> split 2M huge-page      | => |  split 2M huge-page
>                         ...
> alloc 1 page            |    |  split 2M huge-page
> split 2M huge-page      |    |  split 2M huge-page
> 
> Still thinking of what else to do.

I think the 1G case fits your own use case, but I doubt this covers
the majority of the users. Most people rely on the kernel ability to
use THPs, which are capped at the first level of block mapping.

2MB (and 32MB for 16kB base pages) are the most likely mappings in my
experience (512MB with 64kB pages are vanishingly rare).

Having to pay an upfront cost for HugeTLBFS doesn't shock me, and it
fits the model. For THPs, where everything is opportunistic and the
user not involved, this is a lot more debatable.

This is why I'd like this behaviour to be a buy-in, either directly (a
first class userspace API) or indirectly (the provenance of the
memory).

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.