All of lore.kernel.org
 help / color / mirror / Atom feed
From: Will Deacon <will@kernel.org>
To: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: maz@kernel.org, kvmarm@lists.cs.columbia.edu,
	linux-arm-kernel@lists.infradead.org
Subject: Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
Date: Mon, 1 Aug 2022 18:00:56 +0100	[thread overview]
Message-ID: <20220801170055.GB26471@willie-the-truck> (raw)
In-Reply-To: <Yt5nFAscgrRGNGoH@monolith.localdoman>

Hi Alex,

On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > > The approach I've taken so far in adding support for SPE in KVM [1] relies
> > > on pinning the entire VM memory to avoid SPE triggering stage 2 faults
> > > altogether. I've taken this approach because:
> > > 
> > > 1. SPE reports the guest VA on an stage 2 fault, similar to stage 1 faults,
> > > and at the moment KVM has no way to resolve the VA to IPA translation.  The
> > > AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
> > > in the case of a stage 2 fault on a stage 1 translation table walk.
> > > 
> > > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > > means there will be a window where profiling is stopped from the moment SPE
> > > triggers the fault and when the PE taks the interrupt. This blackout window
> > > is obviously not present when running on bare metal, as there is no second
> > > stage of address translation being performed.
> > 
> > Are these faults actually recoverable? My memory is a bit hazy here, but I
> > thought SPE buffer data could be written out in whacky ways such that even
> > a bog-standard page fault could result in uncoverable data loss (i.e. DL=1),
> > and so pinning is the only game in town.
> > 
> > A funkier approach might be to defer pinning of the buffer until the SPE is
> > enabled and avoid pinning all of VM memory that way, although I can't
> > immediately tell how flexible the architecture is in allowing you to cache
> > the base/limit values.
> 
> I was investigating this approach, and Mark raised a concern that I think
> might be a showstopper.
> 
> Let's consider this scenario:
> 
> Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> 
> 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> 2. Guest programs SPE to enable profiling at **EL0**
> (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> 3. Guest changes the translation table entries for the buffer. The
> architecture allows this.

The architecture also allows MMIO accesses to use writeback addressing
modes, but it doesn't provide a mechanism to virtualise them sensibly.

So I'd prefer that we don't pin all of guest memory just to satisfy a corner
case -- as long as the impact of a guest doing this funny sequence is
constrained to the guest, then I think pinning only what is required is
probably the most pragmatic approach.

Is it ideal? No, of course not, and we should probably try to get the debug
architecture extended to be properly virtualisable, but in the meantime
having major operating systems as guests and being able to use SPE without
pinning seems like a major design goal to me.

In any case, that's just my thinking on this and I defer to Oliver and
Marc on the ultimate decision.

Will
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

WARNING: multiple messages have this Message-ID (diff)
From: Will Deacon <will@kernel.org>
To: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: mark.rutland@arm.com, linux-arm-kernel@lists.infradead.org,
	maz@kernel.org, james.morse@arm.com, suzuki.poulose@arm.com,
	kvmarm@lists.cs.columbia.edu
Subject: Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
Date: Mon, 1 Aug 2022 18:00:56 +0100	[thread overview]
Message-ID: <20220801170055.GB26471@willie-the-truck> (raw)
In-Reply-To: <Yt5nFAscgrRGNGoH@monolith.localdoman>

Hi Alex,

On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > > The approach I've taken so far in adding support for SPE in KVM [1] relies
> > > on pinning the entire VM memory to avoid SPE triggering stage 2 faults
> > > altogether. I've taken this approach because:
> > > 
> > > 1. SPE reports the guest VA on an stage 2 fault, similar to stage 1 faults,
> > > and at the moment KVM has no way to resolve the VA to IPA translation.  The
> > > AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
> > > in the case of a stage 2 fault on a stage 1 translation table walk.
> > > 
> > > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > > means there will be a window where profiling is stopped from the moment SPE
> > > triggers the fault and when the PE taks the interrupt. This blackout window
> > > is obviously not present when running on bare metal, as there is no second
> > > stage of address translation being performed.
> > 
> > Are these faults actually recoverable? My memory is a bit hazy here, but I
> > thought SPE buffer data could be written out in whacky ways such that even
> > a bog-standard page fault could result in uncoverable data loss (i.e. DL=1),
> > and so pinning is the only game in town.
> > 
> > A funkier approach might be to defer pinning of the buffer until the SPE is
> > enabled and avoid pinning all of VM memory that way, although I can't
> > immediately tell how flexible the architecture is in allowing you to cache
> > the base/limit values.
> 
> I was investigating this approach, and Mark raised a concern that I think
> might be a showstopper.
> 
> Let's consider this scenario:
> 
> Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> 
> 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> 2. Guest programs SPE to enable profiling at **EL0**
> (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> 3. Guest changes the translation table entries for the buffer. The
> architecture allows this.

The architecture also allows MMIO accesses to use writeback addressing
modes, but it doesn't provide a mechanism to virtualise them sensibly.

So I'd prefer that we don't pin all of guest memory just to satisfy a corner
case -- as long as the impact of a guest doing this funny sequence is
constrained to the guest, then I think pinning only what is required is
probably the most pragmatic approach.

Is it ideal? No, of course not, and we should probably try to get the debug
architecture extended to be properly virtualisable, but in the meantime
having major operating systems as guests and being able to use SPE without
pinning seems like a major design goal to me.

In any case, that's just my thinking on this and I defer to Oliver and
Marc on the ultimate decision.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  parent reply	other threads:[~2022-08-01 17:01 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-19 13:51 KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory Alexandru Elisei
2022-04-19 13:51 ` Alexandru Elisei
2022-04-19 14:10 ` Will Deacon
2022-04-19 14:10   ` Will Deacon
2022-04-19 14:44   ` Alexandru Elisei
2022-04-19 14:44     ` Alexandru Elisei
2022-04-19 14:59     ` Will Deacon
2022-04-19 14:59       ` Will Deacon
2022-04-19 15:20       ` Alexandru Elisei
2022-04-19 15:20         ` Alexandru Elisei
2022-04-19 15:35         ` Alexandru Elisei
2022-04-19 15:35           ` Alexandru Elisei
2022-07-25 10:06   ` Alexandru Elisei
2022-07-25 10:06     ` Alexandru Elisei
2022-07-26 17:51     ` Oliver Upton
2022-07-26 17:51       ` Oliver Upton
2022-07-27  9:30       ` Marc Zyngier
2022-07-27  9:30         ` Marc Zyngier
2022-07-27  9:52         ` Marc Zyngier
2022-07-27  9:52           ` Marc Zyngier
2022-07-27 10:38           ` Alexandru Elisei
2022-07-27 10:38             ` Alexandru Elisei
2022-07-27 16:06             ` Oliver Upton
2022-07-27 16:06               ` Oliver Upton
2022-07-27 10:56         ` Alexandru Elisei
2022-07-27 10:56           ` Alexandru Elisei
2022-07-27 11:18           ` Marc Zyngier
2022-07-27 11:18             ` Marc Zyngier
2022-07-27 12:10             ` Alexandru Elisei
2022-07-27 12:10               ` Alexandru Elisei
2022-07-27 10:19       ` Alexandru Elisei
2022-07-27 10:19         ` Alexandru Elisei
2022-07-27 10:29         ` Marc Zyngier
2022-07-27 10:29           ` Marc Zyngier
2022-07-27 10:44           ` Alexandru Elisei
2022-07-27 10:44             ` Alexandru Elisei
2022-07-27 11:08             ` Marc Zyngier
2022-07-27 11:08               ` Marc Zyngier
2022-07-27 11:57               ` Alexandru Elisei
2022-07-27 11:57                 ` Alexandru Elisei
2022-07-27 15:15                 ` Oliver Upton
2022-07-27 15:15                   ` Oliver Upton
2022-07-27 11:00       ` Alexandru Elisei
2022-07-27 11:00         ` Alexandru Elisei
2022-08-01 17:00     ` Will Deacon [this message]
2022-08-01 17:00       ` Will Deacon
2022-08-02  9:49       ` Alexandru Elisei
2022-08-02  9:49         ` Alexandru Elisei
2022-08-02 19:34         ` Oliver Upton
2022-08-02 19:34           ` Oliver Upton
2022-08-09 14:01           ` Alexandru Elisei
2022-08-09 14:01             ` Alexandru Elisei
2022-08-09 18:43             ` Oliver Upton
2022-08-09 18:43               ` Oliver Upton
2022-08-10  9:37               ` Alexandru Elisei
2022-08-10  9:37                 ` Alexandru Elisei
2022-08-10 15:25                 ` Oliver Upton
2022-08-10 15:25                   ` Oliver Upton
2022-08-12 13:05                   ` Alexandru Elisei
2022-08-12 13:05                     ` Alexandru Elisei
2022-08-17 15:05                     ` Oliver Upton
2022-08-17 15:05                       ` Oliver Upton
2022-09-12 14:50                       ` Alexandru Elisei
2022-09-12 14:50                         ` Alexandru Elisei
2022-09-13 10:58                         ` Oliver Upton
2022-09-13 10:58                           ` Oliver Upton
2022-09-13 12:41                           ` Alexandru Elisei
2022-09-13 12:41                             ` Alexandru Elisei
2022-09-13 14:13                             ` Oliver Upton
2022-09-13 14:13                               ` Oliver Upton
2023-01-03 14:26                               ` Alexandru Elisei
2023-01-03 14:26                                 ` Alexandru Elisei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220801170055.GB26471@willie-the-truck \
    --to=will@kernel.org \
    --cc=alexandru.elisei@arm.com \
    --cc=kvmarm@lists.cs.columbia.edu \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=maz@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.