From: Sean Christopherson <seanjc@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: David Matlack <dmatlack@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, Ben Gardon <bgardon@google.com>,
	Joerg Roedel <joro@8bytes.org>, Jim Mattson <jmattson@google.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Janis Schoetterl-Glausch <scgl@linux.vnet.ibm.com>,
	Junaid Shahid <junaids@google.com>,
	Oliver Upton <oupton@google.com>,
	Harish Barathvajasankar <hbarath@google.com>,
	Peter Shier <pshier@google.com>
Subject: Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
Date: Wed, 1 Dec 2021 18:29:04 +0000	[thread overview]
Message-ID: <Yae+8Oshu9sVrrvd@google.com> (raw)
In-Reply-To: <YabeFZxWqPAuoEtZ@xz-m1.local>

On Wed, Dec 01, 2021, Peter Xu wrote:
> On Tue, Nov 30, 2021 at 05:29:10PM -0800, David Matlack wrote:
> > On Tue, Nov 30, 2021 at 5:01 PM Sean Christopherson <seanjc@google.com> wrote:
> > > So '1' is technically correct, but I think it's the wrong choice given the behavior
> > > of this code.  E.g. if there's 1 object in the cache, the initial top-up will do
> > > nothing,
> > 
> > This scenario will not happen though, since we free the caches after
> > splitting. So, the next time userspace enables dirty logging on a
> > memslot and we go to do the initial top-up the caches will have 0
> > objects.

Ah.

> > > and then tdp_mmu_split_large_pages_root() will almost immediately drop
> > > mmu_lock to topup the cache.  Since the in-loop usage explicitly checks for an
> > > empty cache, i.e. any non-zero @min will have identical behavior, I think it makes
> > > sense to use KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE _and_ add a comment explaining why.
> > 
> > If we set the min to KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> > kvm_mmu_topup_memory_cache will return ENOMEM if it can't allocate at
> > least KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects, even though we really
> > only need 1 to make forward progress.
> > 
> > It's a total edge case but there could be a scenario where userspace
> > sets the cgroup memory limits so tight that we can't allocate
> > KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects when splitting the last few
> > pages and in the end we only needed 1 or 2 objects to finish
> > splitting. In this case we'd end up with a spurious pr_warn and may
> > not split the last few pages depending on which cache failed to get
> > topped up.
> 
> IMHO, when -ENOMEM happens, instead of continuing to try with 1 shadow sp we
> should just bail out even earlier.
> 
> Say, if we only have 10 (<40) pages left for shadow SPs, we'd be better off
> keeping them in reserve to be consumed lazily by follow-up page faults when the
> guest accesses any of the huge pages, rather than spending them all to split
> the next contiguous huge pages on the assumption that it'll be helpful.
> 
> From that POV I have a slight preference for Sean's suggestion because that'll
> make us fail earlier.  But I agree it shouldn't be a big deal.
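
For reference, my recollection of the top-up helper's min handling (paraphrased
from memory, so treat this as a sketch rather than the exact virt/kvm/kvm_main.c
code):

int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
{
	void *obj;

	/* A non-zero @min only matters for the "is the cache full enough" checks. */
	if (mc->nobjs >= min)
		return 0;

	while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
		/*
		 * Return -ENOMEM if fewer than @min objects could be allocated,
		 * even if the caller could have made forward progress with one.
		 */
		if (!obj)
			return mc->nobjs >= min ? 0 : -ENOMEM;
		mc->objects[mc->nobjs++] = obj;
	}
	return 0;
}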

Hmm, in this particular case, I think using the caches is the wrong approach.  The
behavior of pre-filling the caches makes sense for vCPUs because faults may need
multiple objects and filling the cache ensures the entire fault can be handled
without dropping mmu_lock.  And any extra/unused objects can be used by future
faults.  For page splitting, neither of those really holds true.  If there are a
lot of pages to split, KVM will have to drop mmu_lock to refill the cache.  And if
there are few pages to split, or the caches are refilled toward the end of the walk,
KVM may end up with a pile of unused objects it needs to free.
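
To illustrate the contrast, the vCPU fault path fills its caches before taking
mmu_lock, roughly along these lines (again paraphrasing from memory, the exact
function names and signatures may be slightly off):

static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	int r;

	/*
	 * Top up the caches while sleeping is still allowed; the worst case
	 * number of objects for a single fault is known up front (one page
	 * table plus one header per level).
	 */
	r = mmu_topup_memory_caches(vcpu, false);
	if (r)
		return r;

	/* ... mmu_notifier retry checks, etc. elided ... */

	read_lock(&vcpu->kvm->mmu_lock);

	/* The entire fault is handled using only the pre-filled caches. */
	r = kvm_tdp_mmu_map(vcpu, fault);

	read_unlock(&vcpu->kvm->mmu_lock);
	return r;
}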

Since this code already needs to handle failure, and more importantly, it's a
best-effort optimization, I think trying to use the caches is a square peg, round
hole scenario.

Rather than use the caches, we could do allocation 100% on-demand and never drop
mmu_lock to allocate.  The one caveat is that direct reclaim would need to be
disallowed so that the allocation won't sleep.  That would mean eager splitting
could fail under heavy memory pressure when it otherwise might succeed by
reclaiming, in which case vCPUs would get penalized as they'd need to do the
splitting on fault and potentially do direct reclaim as well.  It's not obvious
that that would be a problem in practice, e.g. the vCPU is probably already seeing
a fair amount of disruption due to memory pressure, and slowing down vCPUs might
alleviate some of that pressure.

Not using the cache would also reduce the extra complexity, e.g. no need for
special mmu_cache handling or a variant of tdp_mmu_iter_cond_resched().

I'm thinking something like this (very incomplete):

static void init_tdp_mmu_page(struct kvm_mmu_page *sp, u64 *spt, gfn_t gfn,
			      union kvm_mmu_page_role role)
{
	sp->spt = spt;
	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

	sp->role = role;
	sp->gfn = gfn;
	sp->tdp_mmu_page = true;

	trace_kvm_mmu_get_page(sp, true);
}

static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
					       union kvm_mmu_page_role role)
{
	struct kvm_mmu_page *sp;
	u64 *spt;

	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
	spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);

	init_tdp_mmu_page(sp, spt, gfn, role);

	return sp;
}

static union kvm_mmu_page_role get_child_page_role(struct tdp_iter *iter)
{
	struct kvm_mmu_page *parent = sptep_to_sp(rcu_dereference(iter->sptep));
	union kvm_mmu_page_role role = parent->role;

	role.level--;
	return role;
}

static bool tdp_mmu_install_sp_atomic(struct kvm *kvm,
				      struct tdp_iter *iter,
				      struct kvm_mmu_page *sp,
				      bool account_nx)
{
	u64 spte;

	spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);

	if (tdp_mmu_set_spte_atomic(kvm, iter, spte)) {
		tdp_mmu_link_page(kvm, sp, account_nx);
		return true;
	}
	return false;
}

static void tdp_mmu_split_large_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
					   gfn_t start, gfn_t end, int target_level)
{
	/*
	 * Disallow direct reclaim, allocations will be made while holding
	 * mmu_lock and must not sleep.
	 */
	gfp_t gfp = (GFP_KERNEL_ACCOUNT | __GFP_ZERO) & ~__GFP_DIRECT_RECLAIM;
	struct kvm_mmu_page *sp = NULL;
	struct tdp_iter iter;
	bool flush = false;
	u64 *spt = NULL;

	rcu_read_lock();

	/*
	 * Traverse the page table splitting all large pages above the target
	 * level into one lower level. For example, if we encounter a 1GB page
	 * we split it into 512 2MB pages.
	 *
	 * Since the TDP iterator uses a pre-order traversal, we are guaranteed
	 * to visit an SPTE before ever visiting its children, which means we
	 * will correctly recursively split large pages that are more than one
	 * level above the target level (e.g. splitting 1GB to 2MB to 4KB).
	 */
	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true))
			continue;

		if (!is_shadow_present_pte(iter.old_spte) ||
		    !is_large_pte(iter.old_spte))
			continue;

		if (!sp) {
			sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
			if (!sp)
				break;
			spt = (void *)__get_free_page(gfp);
			if (!spt)
				break;
		}

		init_tdp_mmu_page(sp, spt, iter.gfn,
				  get_child_page_role(&iter));

		if (!tdp_mmu_split_large_page(kvm, &iter, sp))
			goto retry;

		sp = NULL;
		spt = NULL;
	}

	/* Free the unused sp/spt left over from the final iteration, if any. */
	free_page((unsigned long)spt);
	if (sp)
		kmem_cache_free(mmu_page_header_cache, sp);

	rcu_read_unlock();

	if (flush)
		kvm_flush_remote_tlbs(kvm);
}
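
For completeness, the caller side would presumably just iterate over the relevant
roots for the memslot being dirty logged, something like the below.  The wrapper
name and iteration details are invented for illustration; the RFC's actual entry
point may look different.

/* Hypothetical caller, names invented purely for illustration. */
static void kvm_tdp_mmu_split_large_pages(struct kvm *kvm,
					  const struct kvm_memory_slot *slot,
					  int target_level)
{
	struct kvm_mmu_page *root;
	gfn_t start = slot->base_gfn;
	gfn_t end = start + slot->npages;

	lockdep_assert_held_read(&kvm->mmu_lock);

	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
		tdp_mmu_split_large_pages_root(kvm, root, start, end,
					       target_level);
}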
