Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory

From: Will Deacon <will@kernel.org>
To: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: maz@kernel.org, kvmarm@lists.cs.columbia.edu,
	linux-arm-kernel@lists.infradead.org
Subject: Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
Date: Mon, 1 Aug 2022 18:00:56 +0100
Message-ID: <20220801170055.GB26471@willie-the-truck>
In-Reply-To: <Yt5nFAscgrRGNGoH@monolith.localdoman>

Hi Alex,

On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > > The approach I've taken so far in adding support for SPE in KVM [1] relies
> > > on pinning the entire VM memory to avoid SPE triggering stage 2 faults
> > > altogether. I've taken this approach because:
> > > 
> > > 1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults,
> > > and at the moment KVM has no way to resolve the VA to IPA translation.  The
> > > AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
> > > in the case of a stage 2 fault on a stage 1 translation table walk.
> > > 
> > > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > > means there will be a window where profiling is stopped from the moment SPE
> > > triggers the fault until the PE takes the interrupt. This blackout window
> > > is obviously not present when running on bare metal, as there is no second
> > > stage of address translation being performed.
> > 
> > Are these faults actually recoverable? My memory is a bit hazy here, but I
> > thought SPE buffer data could be written out in whacky ways such that even
> > a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1),
> > and so pinning is the only game in town.
> > 
> > A funkier approach might be to defer pinning of the buffer until the SPE is
> > enabled and avoid pinning all of VM memory that way, although I can't
> > immediately tell how flexible the architecture is in allowing you to cache
> > the base/limit values.
> 
> I was investigating this approach, and Mark raised a concern that I think
> might be a showstopper.
> 
> Let's consider this scenario:
> 
> Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> 
> 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> 2. Guest programs SPE to enable profiling at **EL0**
> (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> 3. Guest changes the translation table entries for the buffer. The
> architecture allows this.

The architecture also allows MMIO accesses to use writeback addressing
modes, but it doesn't provide a mechanism to virtualise them sensibly.

So I'd prefer that we don't pin all of guest memory just to satisfy a corner
case -- as long as the impact of a guest doing this funny sequence is
constrained to the guest, then I think pinning only what is required is
probably the most pragmatic approach.
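
For illustration only, "pinning only what is required" could look roughly
like the sketch below. It is not taken from any posted patch. The names
spe_pin_buffer(), pin_page_fn and demo_pin() are hypothetical, the 4K page
size is an assumption, and the callback stands in for whatever would
translate a guest VA and pin the backing page. The only architectural facts
relied on are that PMBLIMITR_EL1.E is bit 0 and that the LIMIT field
(bits [63:12]) holds the first address past the end of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4K pages. A real implementation would use the guest's granule. */
#define SKETCH_PAGE_SIZE	4096ULL
#define SKETCH_PAGE_MASK	(~(SKETCH_PAGE_SIZE - 1))

/* PMBLIMITR_EL1.E (bit 0): profiling buffer enable. */
#define PMBLIMITR_EL1_E		(1ULL << 0)

/*
 * Hypothetical hook: translate one guest VA and pin the page behind it.
 * Returns 0 on success, non-zero on failure.
 */
typedef int (*pin_page_fn)(uint64_t guest_va, void *ctx);

/*
 * Walk every page in [base_va, LIMIT) and hand it to the pin hook.
 * Does nothing if the buffer is disabled. On failure, stops at the first
 * page that could not be pinned; *nr_pinned holds how many pages succeeded.
 */
static int spe_pin_buffer(uint64_t base_va, uint64_t pmblimitr,
			  pin_page_fn pin, void *ctx, size_t *nr_pinned)
{
	uint64_t limit = pmblimitr & SKETCH_PAGE_MASK;
	uint64_t va;

	*nr_pinned = 0;

	if (!(pmblimitr & PMBLIMITR_EL1_E))
		return 0;

	for (va = base_va & SKETCH_PAGE_MASK; va < limit; va += SKETCH_PAGE_SIZE) {
		int ret = pin(va, ctx);

		if (ret)
			return ret;
		(*nr_pinned)++;
	}

	return 0;
}

/*
 * Demo hook for experimenting: pins nothing. If ctx points at a guest VA,
 * pretend that pinning that particular page fails.
 */
static int demo_pin(uint64_t guest_va, void *ctx)
{
	const uint64_t *fail_va = ctx;

	if (fail_va && *fail_va == guest_va)
		return -1;
	return 0;
}
```

With a buffer starting at 0x10000800 and a limit of 0x10003000, this visits
the three pages at 0x10000000, 0x10001000 and 0x10002000. The scenario
quoted above is exactly where such a scheme falls short: if the guest
rewrites the translation table entries for the buffer after it has been
enabled, the pages pinned here are no longer the ones SPE writes to.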

Is it ideal? No, of course not, and we should probably try to get the debug
architecture extended to be properly virtualisable, but in the meantime
having major operating systems as guests and being able to use SPE without
pinning seems like a major design goal to me.

In any case, that's just my thinking on this and I defer to Oliver and
Marc on the ultimate decision.

Will
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm