From: Mingwei Zhang <mizhang@google.com>
To: Sean Christopherson <seanjc@google.com>
Cc: "Kalra, Ashish" <ashish.kalra@amd.com>,
	ovidiu.panait@windriver.com, kvm@vger.kernel.org,
	liam.merwick@oracle.com, pbonzini@redhat.com,
	thomas.lendacky@amd.com, michael.roth@amd.com, pgonda@google.com,
	marcorr@google.com, alpergun@google.com, jarkko@kernel.org,
	jroedel@suse.de, bp@alien8.de, rientjes@google.com
Subject: Re: [PATCH 5.4 1/1] KVM: SEV: add cache flush to solve SEV cache incoherency issues
Date: Thu, 6 Oct 2022 23:03:46 -0700
Message-ID: <CAL715WL4L=9vhhU3TvY7TOe3HZ73weWFNiaP2RyBtzN-kZ4EoQ@mail.gmail.com>
In-Reply-To: <Yz99nF+d6D+37efE@google.com>

On Thu, Oct 6, 2022 at 6:15 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Oct 06, 2022, Kalra, Ashish wrote:
> > For the MMU invalidation notifiers we are going to make two changes
> > currently:
> >
> > 1). Use clflush/clflushopt instead of wbinvd_on_all_cpus() for range <= 2MB.
>
> IMO, this isn't worth pursuing, to the point where I might object to this code
> being added upstream.  Avoiding WBINVD for the mmu_notifiers doesn't prevent a
> malicious userspace from using SEV-induced WBINVD to effectively DoS the host,
> e.g. userspace can simply ADD+DELETE memslots, or mprotect() chunks > 2mb.
>

I think using clflush/clflushopt is a tactical workaround for SNP VMs.
As mentioned earlier by Ashish:

"For SNP guests we don't need to invoke the MMU invalidation notifiers
and the cache flush should be done at the point of RMP ownership change
instead of mmu_notifier, which will be when the unregister_enc_region
ioctl is called, but as we don't trust the userspace (which can bypass
this ioctl), therefore we continue to use the MMU invalidation
notifiers."

So if that is true: SNP VMs also have to use mmu_notifiers for
splitting the PMDs, then I think using clflush/clflushopt might be the
only workaround that I know of.

> Using clflushopt also effectively puts a requirement on mm/ that the notifiers
> be invoked _before_ PTEs are modified in the primary MMU, otherwise KVM may not
> be able to resolve the VA=>PFN, or even worse, resolve the wrong PFN.

I don't understand this. Isn't it always true that MM should fire
mmu_notifiers before modifying PTEs in the host MMU? This should be a
strict rule as far as I know, no?

>
> And no sane VMM should be modifying userspace mappings that cover SEV guest memory
> at any reasonable rate.
>
> In other words, switching to CLFLUSHOPT for SEV+SEV-ES VMs is effectively a
> band-aid for the NUMA balancing issue.

That's not true. KSM might also use the same mechanism. For NUMA
balancing and KSM, there seems to be a pattern: blindly flushing
mmu_notifier first, then try to do the actual work.

I have limited knowledge of MM, but from my observations, it looks
like the property of a page being "PINNED" is very unreliable (or
expensive), i.e., anyone can jump in and pin the page. So it is hard
to see whether a page is truly "PINNED" or maybe just someone is
"working" on it without holding the lock. Holding the refcount of a
struct page requires a spinlock. I suspect that might be the reason
why NUMA balancing and KSM are just aggressively firing mmu_notifiers
first. I don't know if there is other stuff in MM following the same
pattern.

Concretely, my deep worry is the mmu_notifier in try_to_unmap_one(). I
cannot enumerate all of the callers. But if there is someone who calls
into this, it might be a disaster level (4K) performance lock. Hope we
can prove that won't happen.
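Added example, not code from this thread or from the kernel: a user-space model of the policy being debated above, flushing cache lines one at a time for ranges up to 2 MB and falling back to a whole-cache flush for anything larger. The kernel's fallback is wbinvd_on_all_cpus(), which is privileged, so here it is replaced by a counter; the 2 MB threshold comes from the thread, while the 64-byte line size, the use of CLFLUSH rather than CLFLUSHOPT, and all function names are assumptions of this example. It makes the cost side of the argument concrete: a 2 MB range is 32768 line flushes, and anything above that still ends in the full flush that the DoS concern is about.

```c
/* Added example (not from the thread or the kernel): model of "flush by line
 * for ranges <= 2 MB, else full flush". The full flush is a counter standing
 * in for wbinvd_on_all_cpus(), so the decision logic runs unprivileged. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence: SSE2, baseline on x86-64 */
#endif

#define CACHE_LINE_SIZE 64UL           /* assumed x86 line size */
#define RANGE_FLUSH_MAX (2UL << 20)    /* 2 MB threshold quoted in the thread */

enum flush_kind { FLUSH_RANGE = 0, FLUSH_ALL = 1 };

static unsigned long full_flush_count;   /* stand-in for wbinvd_on_all_cpus() calls */

/* Cache lines overlapping [addr, addr + len), counting partially covered lines. */
size_t cache_lines_for_range(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first = addr & ~(CACHE_LINE_SIZE - 1);
    uintptr_t last  = (addr + len - 1) & ~(CACHE_LINE_SIZE - 1);
    return (last - first) / CACHE_LINE_SIZE + 1;
}

/* Policy: small ranges are flushed line by line, large ones with a full flush. */
enum flush_kind choose_flush(size_t len)
{
    return len <= RANGE_FLUSH_MAX ? FLUSH_RANGE : FLUSH_ALL;
}

/* Flush every cache line overlapping [p, p + len); returns the number of lines.
 * On non-x86 hosts the loop only walks the lines so the accounting still runs. */
size_t flush_lines(const void *p, size_t len)
{
    uintptr_t addr = (uintptr_t)p;
    size_t n = cache_lines_for_range(addr, len);
    uintptr_t line = addr & ~(CACHE_LINE_SIZE - 1);
    for (size_t i = 0; i < n; i++, line += CACHE_LINE_SIZE) {
#if defined(__x86_64__)
        _mm_clflush((const void *)line);
#endif
    }
#if defined(__x86_64__)
    _mm_mfence();   /* order the flushes before anything that follows */
#endif
    return n;
}

/* Apply the policy to one range; returns which kind of flush was used. */
enum flush_kind flush_memory(const void *p, size_t len)
{
    enum flush_kind kind = choose_flush(len);
    if (kind == FLUSH_RANGE)
        flush_lines(p, len);
    else
        full_flush_count++;   /* kernel: wbinvd_on_all_cpus() */
    return kind;
}

unsigned long full_flushes_issued(void)
{
    return full_flush_count;
}

int main(void)
{
    static unsigned char buf[4096 + 17];   /* constructed sample range */
    size_t lines = flush_lines(buf, sizeof buf);
    printf("flushed %zu lines for %zu bytes\n", lines, sizeof buf);
    printf("2 MB range = %zu line flushes\n",
           cache_lines_for_range(0, RANGE_FLUSH_MAX));
    printf("4 MB range -> %s\n",
           choose_flush(4UL << 20) == FLUSH_ALL ? "full flush" : "per-line flush");
    return 0;
}
```

The threshold check and the line count are kept in separate functions so the policy can be tuned or measured without touching the flush loop; note that the per-line path only works when the caller can still resolve the range's addresses, which is why the notifier-ordering point raised above matters for it.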


Thread overview: 10+ messages
2022-09-26 14:52 [PATCH 5.4 1/1] KVM: SEV: add cache flush to solve SEV cache incoherency issues Ovidiu Panait
2022-09-26 19:00 ` Liam Merwick
2022-09-27  0:07 ` Ashish Kalra
2022-09-27  0:37   ` Sean Christopherson
2022-10-06 17:36     ` Kalra, Ashish
2022-10-07  1:15       ` Sean Christopherson
2022-10-07  6:03         ` Mingwei Zhang [this message]
2022-10-07 15:51           ` Sean Christopherson
2022-10-07 17:00             ` Sean Christopherson
2022-09-27  8:03 ` Ovidiu Panait
