kvmarm.lists.cs.columbia.edu archive mirror
From: Marc Zyngier <maz@kernel.org>
To: Ricardo Koller <ricarkol@google.com>
Cc: Oliver Upton <oliver.upton@linux.dev>,
	pbonzini@redhat.com, oupton@google.com, yuzenghui@huawei.com,
	dmatlack@google.com, kvm@vger.kernel.org, kvmarm@lists.linux.dev,
	qperret@google.com, catalin.marinas@arm.com,
	andrew.jones@linux.dev, seanjc@google.com,
	alexandru.elisei@arm.com, suzuki.poulose@arm.com,
	eric.auger@redhat.com, gshan@redhat.com, reijiw@google.com,
	rananta@google.com, bgardon@google.com, ricarkol@gmail.com
Subject: Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
Date: Tue, 31 Jan 2023 10:28:47 +0000
Message-ID: <86h6w70zhc.wl-maz@kernel.org>
In-Reply-To: <CAOHnOrx-vvuZ9n8xDRmJTBCZNiqvcqURVyrEt2tDpw5bWT0qew@mail.gmail.com>

On Fri, 27 Jan 2023 15:45:15 +0000,
Ricardo Koller <ricarkol@google.com> wrote:
> 
> > The one thing that would convince me to make it an option is the
> > amount of memory this thing consumes. 512+ pages is a huge amount, and
> > I'm not overly happy about that. Why can't this be a userspace visible
> > option, selectable on a per VM (or memslot) basis?
> >
> 
> It should be possible.  I am exploring a couple of ideas that could
> help when the hugepages are not 1G (e.g., 2M).  However, they add
> complexity and I'm not sure they help much.
> 
> (will be using PAGE_SIZE=4K to make things simpler)
> 
> This feature pre-allocates 513 pages before splitting every 1G range.
> For example, it converts 1G block PTEs into trees made of 513 pages.
> When not using this feature, the same 513 pages would be allocated,
> but lazily over a longer period of time.

This is an important difference. Lazy allocation avoids the upfront
allocation "thermal shock", giving the kernel time to reclaim memory
from somewhere else. Doing it upfront means you *must* have 2MB+ of
immediately available memory for each GB of RAM your guest uses.
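
The 513-page and "2MB+" figures follow from the page-table geometry. Below is a minimal sketch of that arithmetic, assuming 4K base pages and 8-byte descriptors (512 entries per table page); the function name `tables_to_split` is made up for illustration and is not a kernel helper.

```python
DESC_SIZE = 8  # bytes per page-table descriptor (assumption: arm64, 64-bit descriptors)


def tables_to_split(block_size, page_size=4096):
    """Count the table pages needed to map `block_size` entirely with base pages."""
    entries = page_size // DESC_SIZE      # 512 entries per table with 4K pages
    span = page_size * entries            # memory covered by one last-level table (2M)
    pages = 0
    while span <= block_size:
        pages += block_size // span       # tables needed at this level
        span *= entries                   # move one level up
    return pages


one_gig = 1 << 30
pages = tables_to_split(one_gig)          # 512 last-level tables + 1 table above them
upfront_kib = pages * 4096 // 1024        # memory that must be available immediately
print(pages, upfront_kib)
```

Under these assumptions, splitting one 1G block takes 513 table pages (2052 KiB), while splitting a single 2M block takes one.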

> 
> Eager-splitting pre-allocates those pages in order to split huge-pages
> into fully populated trees, which is needed in order to use FEAT_BBM
> and skip the expensive TLBI broadcasts.  513 is just the number of
> pages needed to break a 1G huge-page.

I understand that. But it is also clear that 1GB huge pages are
unlikely to be THPs, and I wonder if we should treat the two
differently. Using HugeTLBFS pages is significant here.

> 
> We could optimize for smaller huge-pages, like 2M by splitting 1
> huge-page at a time: only preallocate one 4K page at a time.  The
> trick is how to know that we are splitting 2M huge-pages.  We could
> either get the vma pagesize or use hints from userspace.  I'm not sure
> that this is worth it though.  The user will most likely want to split
> big ranges of memory (>1G), so optimizing for smaller huge-pages only
> converts the left into the right:
> 
> alloc 1 page            |    |  alloc 512 pages
> split 2M huge-page      |    |  split 2M huge-page
> alloc 1 page            |    |  split 2M huge-page
> split 2M huge-page      | => |  split 2M huge-page
>                         ...
> alloc 1 page            |    |  split 2M huge-page
> split 2M huge-page      |    |  split 2M huge-page
> 
> Still thinking of what else to do.

I think the 1G case fits your own use case, but I doubt this covers
the majority of the users. Most people rely on the kernel's ability to
use THPs, which are capped at the first level of block mapping.

2MB (and 32MB for 16kB base pages) are the most likely mappings in my
experience (512MB mappings with 64kB pages are vanishingly rare).
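
These block sizes come straight from the granule size. A small sketch, assuming 8-byte descriptors so that one table page holds page_size / 8 entries; `first_level_block_size` is a hypothetical name used only here.

```python
DESC_SIZE = 8  # bytes per descriptor (assumption: 64-bit page-table entries)


def first_level_block_size(page_size):
    """Size of the smallest block mapping: one entry in the table just above the leaf level."""
    entries_per_table = page_size // DESC_SIZE
    return page_size * entries_per_table


for granule_kib in (4, 16, 64):
    size = first_level_block_size(granule_kib * 1024)
    print(f"{granule_kib}kB pages -> {size >> 20}MB block")
```

With 4kB, 16kB and 64kB granules this gives 2MB, 32MB and 512MB respectively, matching the sizes quoted above.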

Having to pay an upfront cost for HugeTLBFS doesn't shock me, and it
fits the model. For THPs, where everything is opportunistic and the
user is not involved, this is a lot more debatable.

This is why I'd like this behaviour to be a buy-in, either directly (a
first class userspace API) or indirectly (the provenance of the
memory).

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
