* KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
From: Alexandru Elisei @ 2022-04-19 13:51 UTC
  To: will, mark.rutland, linux-arm-kernel, maz, james.morse,
	suzuki.poulose, kvmarm

The approach I've taken so far in adding support for SPE in KVM [1] relies
on pinning the entire VM memory to avoid SPE triggering stage 2 faults
altogether. I've taken this approach because:

1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults,
and at the moment KVM has no way to resolve the VA to IPA translation.  The
AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
in the case of a stage 2 fault on a stage 1 translation table walk.

2. The stage 2 fault is reported asynchronously via an interrupt, which
means there will be a window where profiling is stopped from the moment SPE
triggers the fault until the PE takes the interrupt. This blackout window
is obviously not present when running on bare metal, as there is no second
stage of address translation being performed.

I've been thinking about this approach and I was considering translating
the VA reported by SPE to the IPA instead, thus treating the SPE stage 2
data aborts more like regular (MMU) data aborts. As I see it, this approach
has several merits over memory pinning:

- The stage 1 translation table walker is also needed for nested
  virtualization, to emulate AT S1* instructions executed by the L1
  guest hypervisor (a minimal walker is sketched after this list).

- Walking the guest's translation tables is less of a departure from the
  way KVM manages physical memory for a virtual machine today.
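
To make the idea concrete, below is a minimal sketch of such a walker,
assuming a 4KB granule, 48-bit addresses and none of the corner cases a
real walker must handle (other granule sizes, permission and attribute
checks, TCR_EL1 parsing, endianness). kvm_read_guest() and
vcpu_read_sys_reg() are existing KVM helpers; the function itself is
illustrative:

static int guest_s1_walk_va_to_ipa(struct kvm_vcpu *vcpu, u64 va, u64 *ipa)
{
	/* Simplified TTBR select: VA bit 55 picks TTBR1_EL1 for kernel addresses */
	u64 ttbr = vcpu_read_sys_reg(vcpu, (va & BIT_ULL(55)) ? TTBR1_EL1 : TTBR0_EL1);
	u64 table = ttbr & GENMASK_ULL(47, 12);
	int level;

	for (level = 0; level <= 3; level++) {
		int shift = 12 + 9 * (3 - level);
		u64 idx = (va >> shift) & GENMASK_ULL(8, 0);
		u64 desc;

		/* The walk itself can fault at stage 2, just like the buffer write */
		if (kvm_read_guest(vcpu->kvm, table + idx * 8, &desc, sizeof(desc)))
			return -EFAULT;

		if (!(desc & BIT(0)))
			return -ENOENT;			/* invalid descriptor */

		if (level == 3 && !(desc & BIT(1)))
			return -ENOENT;			/* reserved level 3 encoding */

		if (level == 3 || !(desc & BIT(1))) {	/* page or block: output address */
			u64 offset = GENMASK_ULL(shift - 1, 0);

			*ipa = (desc & GENMASK_ULL(47, 12) & ~offset) | (va & offset);
			return 0;
		}

		table = desc & GENMASK_ULL(47, 12);	/* next-level table address */
	}

	return -EINVAL;
}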

I had a discussion with Mark offline about this approach and he expressed a
very sensible concern: when a guest is profiling, there is a blackout
window where profiling is stopped which doesn't happen on bare metal (point
2 above).

My questions are:

1. Is having this blackout window, regardless of its size, unacceptable?
If it is, then I'll continue with the memory pinning approach.

2. If having a blackout window is acceptable, how large can this window be
before it becomes too much? I can try to take some performance measurements
to evaluate the blackout window when using a stage 1 walker in relation to
the buffer write speed on different hardware. I have access to an N1SDP
machine and an Ampere Altra for this.

[1] https://lore.kernel.org/all/20211117153842.302159-1-alexandru.elisei@arm.com/

Thanks,
Alex

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
From: Will Deacon @ 2022-04-19 14:10 UTC
  To: Alexandru Elisei; +Cc: maz, kvmarm, linux-arm-kernel

On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> The approach I've taken so far in adding support for SPE in KVM [1] relies
> on pinning the entire VM memory to avoid SPE triggering stage 2 faults
> altogether. I've taken this approach because:
> 
> 1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults,
> and at the moment KVM has no way to resolve the VA to IPA translation.  The
> AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
> in the case of a stage 2 fault on a stage 1 translation table walk.
> 
> 2. The stage 2 fault is reported asynchronously via an interrupt, which
> means there will be a window where profiling is stopped from the moment SPE
> triggers the fault until the PE takes the interrupt. This blackout window
> is obviously not present when running on bare metal, as there is no second
> stage of address translation being performed.

Are these faults actually recoverable? My memory is a bit hazy here, but I
thought SPE buffer data could be written out in whacky ways such that even
a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1),
and so pinning is the only game in town.

A funkier approach might be to defer pinning of the buffer until the SPE is
enabled and avoid pinning all of VM memory that way, although I can't
immediately tell how flexible the architecture is in allowing you to cache
the base/limit values.
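
Something like the below, completely untested, assuming KVM traps the SPE
registers and keeps them in the vCPU context (as the RFC series does); the
kvm_spe_{pin,unpin}_buffer() helpers are made up:

static bool access_pmblimitr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
			     const struct sys_reg_desc *r)
{
	if (p->is_write) {
		vcpu_write_sys_reg(vcpu, p->regval, PMBLIMITR_EL1);
		if (p->regval & BIT(0))			/* PMBLIMITR_EL1.E */
			kvm_spe_pin_buffer(vcpu);	/* hypothetical */
		else
			kvm_spe_unpin_buffer(vcpu);	/* hypothetical */
	} else {
		p->regval = vcpu_read_sys_reg(vcpu, PMBLIMITR_EL1);
	}

	return true;
}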

Will

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
From: Alexandru Elisei @ 2022-04-19 14:44 UTC
  To: Will Deacon; +Cc: maz, kvmarm, linux-arm-kernel

Hi Will,

On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > The approach I've taken so far in adding support for SPE in KVM [1] relies
> > on pinning the entire VM memory to avoid SPE triggering stage 2 faults
> > altogether. I've taken this approach because:
> > 
> > 1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults,
> > and at the moment KVM has no way to resolve the VA to IPA translation.  The
> > AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
> > in the case of a stage 2 fault on a stage 1 translation table walk.
> > 
> > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > means there will be a window where profiling is stopped from the moment SPE
> > triggers the fault until the PE takes the interrupt. This blackout window
> > is obviously not present when running on bare metal, as there is no second
> > stage of address translation being performed.
> 
> Are these faults actually recoverable? My memory is a bit hazy here, but I
> thought SPE buffer data could be written out in whacky ways such that even
> a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1),
> and so pinning is the only game in town.

Ah, I forgot about that, I think you're right (ARM DDI 0487H.a, page
D10-5177):

"The architecture does not require that a sample record is written
sequentially by the SPU, only that:
[..]
- On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates
  whether PMBPTR_EL1 points to the first byte after the last complete
  sample record.
- On an MMU fault or synchronous External abort, PMBPTR_EL1 serves as a
  Fault Address Register."

and (page D10-5179):

"If a write to the Profiling Buffer generates a fault and PMBSR_EL1.S is 0,
then a Profiling Buffer management event is generated:
[..]
- If PMBPTR_EL1 is not the address of the first byte after the last
  complete sample record written by the SPU, then PMBSR_EL1.DL is set to 1.
  Otherwise, PMBSR_EL1.DL is unchanged."

Since there is no way to know the record size (well, unless
PMSIDR_EL1.MaxSize == PMBIDR_EL1.Align, but that's not an architectural
requirement), KVM cannot restore the write pointer to the
address of the last complete record + 1, to allow the guest to resume
profiling without corrupted records.
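
As a sketch, the only case where a rewind would be safe is when the record
size is fixed and discoverable; the field offsets below are the
architectural ones from ARM DDI 0487, everything else is illustrative:

static bool spe_records_have_fixed_size(u64 pmsidr, u64 pmbidr)
{
	u64 max_size = (pmsidr >> 12) & 0xf;	/* PMSIDR_EL1.MaxSize, bits [15:12] */
	u64 align    = pmbidr & 0xf;		/* PMBIDR_EL1.Align, bits [3:0] */

	return max_size == align;		/* both log2 of a byte count */
}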

> 
> A funkier approach might be to defer pinning of the buffer until the SPE is
> enabled and avoid pinning all of VM memory that way, although I can't
> immediately tell how flexible the architecture is in allowing you to cache
> the base/limit values.

A guest can use this to pin the VM memory (or a significant part of it),
either by doing it on purpose, or by allocating new buffers as they get
full. This will probably result in KVM killing the VM if the pinned memory
is larger than ulimit's max locked memory, which I believe is going to be a
bad experience for the user caught unaware. Unless we don't want KVM to
take ulimit into account when pinning the memory, which as far as I can
tell goes against KVM's approach so far.
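
For reference, a sketch of how the pinning path could charge the pinned
pages against RLIMIT_MEMLOCK the way other pinning users do;
account_locked_vm() is an existing helper from <linux/mm.h>,
spe_pin_range() is made up:

static int spe_account_and_pin(struct mm_struct *mm, unsigned long npages)
{
	int ret;

	/* Fails if npages would push locked_vm over RLIMIT_MEMLOCK */
	ret = account_locked_vm(mm, npages, true);
	if (ret)
		return ret;

	ret = spe_pin_range(npages);	/* hypothetical pin_user_pages() loop */
	if (ret)
		account_locked_vm(mm, npages, false);	/* undo the charge */

	return ret;
}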

Thanks,
Alex

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
From: Will Deacon @ 2022-04-19 14:59 UTC
  To: Alexandru Elisei; +Cc: maz, kvmarm, linux-arm-kernel

On Tue, Apr 19, 2022 at 03:44:02PM +0100, Alexandru Elisei wrote:
> On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > > means there will be a window where profiling is stopped from the moment SPE
> > > triggers the fault until the PE takes the interrupt. This blackout window
> > > is obviously not present when running on bare metal, as there is no second
> > > stage of address translation being performed.
> > 
> > Are these faults actually recoverable? My memory is a bit hazy here, but I
> > thought SPE buffer data could be written out in whacky ways such that even
> > a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1),
> > and so pinning is the only game in town.
> 
> Ah, I forgot about that, I think you're right (ARM DDI 0487H.a, page
> D10-5177):
> 
> "The architecture does not require that a sample record is written
> sequentially by the SPU, only that:
> [..]
> - On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates
>   whether PMBPTR_EL1 points to the first byte after the last complete
>   sample record.
> - On an MMU fault or synchronous External abort, PMBPTR_EL1 serves as a
>   Fault Address Register."
> 
> and (page D10-5179):
> 
> "If a write to the Profiling Buffer generates a fault and PMBSR_EL1.S is 0,
> then a Profiling Buffer management event is generated:
> [..]
> - If PMBPTR_EL1 is not the address of the first byte after the last
>   complete sample record written by the SPU, then PMBSR_EL1.DL is set to 1.
>   Otherwise, PMBSR_EL1.DL is unchanged."
> 
> Since there is no way to know the record size (well, unless
> PMSIDR_EL1.MaxSize == PMBIDR_EL1.Align, but that's not an architectural
> requirement), KVM cannot restore the write pointer to the
> address of the last complete record + 1, to allow the guest to resume
> profiling without corrupted records.
> 
> > 
> > A funkier approach might be to defer pinning of the buffer until the SPE is
> > enabled and avoid pinning all of VM memory that way, although I can't
> > immediately tell how flexible the architecture is in allowing you to cache
> > the base/limit values.
> 
> A guest can use this to pin the VM memory (or a significant part of it),
> either by doing it on purpose, or by allocating new buffers as they get
> full. This will probably result in KVM killing the VM if the pinned memory
> is larger than ulimit's max locked memory, which I believe is going to be a
> bad experience for the user caught unaware. Unless we don't want KVM to
> take ulimit into account when pinning the memory, which as far as I can
> tell goes against KVM's approach so far.

Yeah, it gets pretty messy and ulimit definitely needs to be taken into
account, as it is today.

That said, we could just continue if the pinning fails and the guest gets to
keep the pieces if we get a stage-2 fault -- putting the device into an
error state and re-injecting the interrupt should cause the perf session in
the guest to fail gracefully. I don't think the complexity is necessarily
worth it, but pinning all of guest memory is really crap so it's worth
thinking about alternatives.
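
Something like the below, completely untested, assuming the SPE registers
live in the vCPU context and that the VMM tells KVM which PPI the buffer
management interrupt is wired to (SPE_PPI is a placeholder):

static void kvm_spe_stop_with_error(struct kvm_vcpu *vcpu)
{
	/* PMBSR_EL1: S (bit 17) = 1, DL (bit 19) = 1, EC = buffer management event */
	vcpu_write_sys_reg(vcpu, BIT(17) | BIT(19), PMBSR_EL1);

	/* SPE_PPI: placeholder for the interrupt number chosen by the VMM */
	kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id, SPE_PPI, true, NULL);
}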

Will

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
From: Alexandru Elisei @ 2022-04-19 15:20 UTC
  To: Will Deacon; +Cc: maz, kvmarm, linux-arm-kernel

Hi,

On Tue, Apr 19, 2022 at 03:59:46PM +0100, Will Deacon wrote:
> On Tue, Apr 19, 2022 at 03:44:02PM +0100, Alexandru Elisei wrote:
> > On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> > > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > > > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > > > means there will be a window where profiling is stopped from the moment SPE
> > > > triggers the fault until the PE takes the interrupt. This blackout window
> > > > is obviously not present when running on bare metal, as there is no second
> > > > stage of address translation being performed.
> > > 
> > > Are these faults actually recoverable? My memory is a bit hazy here, but I
> > > thought SPE buffer data could be written out in whacky ways such that even
> > > a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1),
> > > and so pinning is the only game in town.
> > 
> > Ah, I forgot about that, I think you're right (ARM DDI 0487H.a, page
> > D10-5177):
> > 
> > "The architecture does not require that a sample record is written
> > sequentially by the SPU, only that:
> > [..]
> > - On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates
> >   whether PMBPTR_EL1 points to the first byte after the last complete
> >   sample record.
> > - On an MMU fault or synchronous External abort, PMBPTR_EL1 serves as a
> >   Fault Address Register."
> > 
> > and (page D10-5179):
> > 
> > "If a write to the Profiling Buffer generates a fault and PMBSR_EL1.S is 0,
> > then a Profiling Buffer management event is generated:
> > [..]
> > - If PMBPTR_EL1 is not the address of the first byte after the last
> >   complete sample record written by the SPU, then PMBSR_EL1.DL is set to 1.
> >   Otherwise, PMBSR_EL1.DL is unchanged."
> > 
> > Since there is no way to know the record size (well, unless
> > PMSIDR_EL1.MaxSize == PMBIDR_EL1.Align, but that's not an architectural
> > requirement), KVM cannot restore the write pointer to the
> > address of the last complete record + 1, to allow the guest to resume
> > profiling without corrupted records.
> > 
> > > 
> > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > enabled and avoid pinning all of VM memory that way, although I can't
> > > immediately tell how flexible the architecture is in allowing you to cache
> > > the base/limit values.
> > 
> > A guest can use this to pin the VM memory (or a significant part of it),
> > either by doing it on purpose, or by allocating new buffers as they get
> > full. This will probably result in KVM killing the VM if the pinned memory
> > is larger than ulimit's max locked memory, which I believe is going to be a
> > bad experience for the user caught unaware. Unless we don't want KVM to
> > take ulimit into account when pinning the memory, which as far as I can
> > tell goes against KVM's approach so far.
> 
> Yeah, it gets pretty messy and ulimit definitely needs to be taken into
> account, as it is today.
> 
> That said, we could just continue if the pinning fails and the guest gets to
> keep the pieces if we get a stage-2 fault -- putting the device into an
> error state and re-injecting the interrupt should cause the perf session in
> the guest to fail gracefully. I don't think the complexity is necessarily
> worth it, but pinning all of guest memory is really crap so it's worth
> thinking about alternatives.

On the subject of pinning the memory when the guest enables SPE, the guest can
configure SPE to profile userspace only. Programming is done at EL1, and in
this case SPE is disabled. KVM doesn't trap the ERET to EL0, so the only
sensible thing to do here is to pin the memory when SPE is disabled. If it
fails, then how should KVM notify the guest that something went wrong when
SPE is disabled? KVM could inject an interrupt, as those are asynchronous
and one could (rather weakly) argue that the interrupt might have been
raised because of something that happened in the previous profiling
session, but what if the guest never enabled SPE? What if the guest is in
the middle of configuring SPE and the interrupt handler isn't even set? Or
should KVM not use an interrupt to report error conditions to the guest, in
which case, how can the guest detect that SPE is stopped?

Neither option looks particularly appealing to me.

Thanks,
Alex

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
From: Alexandru Elisei @ 2022-04-19 15:35 UTC
  To: Will Deacon; +Cc: maz, kvmarm, linux-arm-kernel

Hi,

On Tue, Apr 19, 2022 at 04:20:09PM +0100, Alexandru Elisei wrote:
> Hi,
> 
> On Tue, Apr 19, 2022 at 03:59:46PM +0100, Will Deacon wrote:
> > On Tue, Apr 19, 2022 at 03:44:02PM +0100, Alexandru Elisei wrote:
> > > On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> > > > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > > > > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > > > > means there will be a window where profiling is stopped from the moment SPE
> > > > > triggers the fault until the PE takes the interrupt. This blackout window
> > > > > is obviously not present when running on bare metal, as there is no second
> > > > > stage of address translation being performed.
> > > > 
> > > > Are these faults actually recoverable? My memory is a bit hazy here, but I
> > > > thought SPE buffer data could be written out in whacky ways such that even
> > > > a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1),
> > > > and so pinning is the only game in town.
> > > 
> > > Ah, I forgot about that, I think you're right (ARM DDI 0487H.a, page
> > > D10-5177):
> > > 
> > > "The architecture does not require that a sample record is written
> > > sequentially by the SPU, only that:
> > > [..]
> > > - On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates
> > >   whether PMBPTR_EL1 points to the first byte after the last complete
> > >   sample record.
> > > - On an MMU fault or synchronous External abort, PMBPTR_EL1 serves as a
> > >   Fault Address Register."
> > > 
> > > and (page D10-5179):
> > > 
> > > "If a write to the Profiling Buffer generates a fault and PMBSR_EL1.S is 0,
> > > then a Profiling Buffer management event is generated:
> > > [..]
> > > - If PMBPTR_EL1 is not the address of the first byte after the last
> > >   complete sample record written by the SPU, then PMBSR_EL1.DL is set to 1.
> > >   Otherwise, PMBSR_EL1.DL is unchanged."
> > > 
> > > Since there is no way to know the record size (well, unless
> > > PMSIDR_EL1.MaxSize == PMBIDR_EL1.Align, but that's not an architectural
> > > requirement), KVM cannot restore the write pointer to the
> > > address of the last complete record + 1, to allow the guest to resume
> > > profiling without corrupted records.
> > > 
> > > > 
> > > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > > enabled and avoid pinning all of VM memory that way, although I can't
> > > > immediately tell how flexible the architecture is in allowing you to cache
> > > > the base/limit values.
> > > 
> > > A guest can use this to pin the VM memory (or a significant part of it),
> > > either by doing it on purpose, or by allocating new buffers as they get
> > > full. This will probably result in KVM killing the VM if the pinned memory
> > > is larger than ulimit's max locked memory, which I believe is going to be a
> > > bad experience for the user caught unaware. Unless we don't want KVM to
> > > take ulimit into account when pinning the memory, which as far as I can
> > > tell goes against KVM's approach so far.
> > 
> > Yeah, it gets pretty messy and ulimit definitely needs to be taken into
> > account, as it is today.
> > 
> > That said, we could just continue if the pinning fails and the guest gets to
> > keep the pieces if we get a stage-2 fault -- putting the device into an
> > error state and re-injecting the interrupt should cause the perf session in
> > the guest to fail gracefully. I don't think the complexity is necessarily
> > worth it, but pinning all of guest memory is really crap so it's worth
> > thinking about alternatives.
> 
> On the subject of pinning the memory when the guest enables SPE, the guest can
> configure SPE to profile userspace only. Programming is done at EL1, and in
> this case SPE is disabled. KVM doesn't trap the ERET to EL0, so the only
> sensible thing to do here is to pin the memory when SPE is disabled. If it
> fails, then how should KVM notify the guest that something went wrong when
> SPE is disabled? KVM could inject an interrupt, as those are asynchronous
> and one could (rather weakly) argue that the interrupt might have been
> raised because of something that happened in the previous profiling
> session, but what if the guest never enabled SPE? What if the guest is in
> the middle of configuring SPE and the interrupt handler isn't even set? Or
> should KVM not use an interrupt to report error conditions to the guest, in
> which case, how can the guest detect that SPE is stopped?

Come to think of it, KVM can defer injecting the interrupt until after an
exit from the guest when the guest was executing at EL0 (and profiling
would have been enabled from the guest's point of view). I think this
should work, as a delay between the condition that causes an interrupt and
the PE taking said interrupt is expected.
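
A sketch of the deferral, with a made-up err_pending flag in the vCPU's SPE
state and the kvm_spe_stop_with_error() helper sketched earlier in the
thread:

static void kvm_spe_flush_pending_error(struct kvm_vcpu *vcpu)
{
	if (!vcpu->arch.spe.err_pending)
		return;

	/* Only deliver once an exit catches the guest running at EL0 */
	if ((*vcpu_cpsr(vcpu) & PSR_MODE_MASK) == PSR_MODE_EL0t) {
		kvm_spe_stop_with_error(vcpu);
		vcpu->arch.spe.err_pending = false;
	}
}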

Thoughts?

I too would prefer not to have to pin the entire VM memory; not having to
ask userspace to increase max locked memory to the size of the VM memory
looks a lot better to me.

Thanks,
Alex

> 
> Neither option looks particularly appealing to me.
> 
> Thanks,
> Alex

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
From: Alexandru Elisei @ 2022-07-25 10:06 UTC
  To: Will Deacon; +Cc: maz, kvmarm, linux-arm-kernel

Hi,

On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > The approach I've taken so far in adding support for SPE in KVM [1] relies
> > on pinning the entire VM memory to avoid SPE triggering stage 2 faults
> > altogether. I've taken this approach because:
> > 
> > 1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults,
> > and at the moment KVM has no way to resolve the VA to IPA translation.  The
> > AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
> > in the case of a stage 2 fault on a stage 1 translation table walk.
> > 
> > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > means there will be a window where profiling is stopped from the moment SPE
> > triggers the fault until the PE takes the interrupt. This blackout window
> > is obviously not present when running on bare metal, as there is no second
> > stage of address translation being performed.
> 
> Are these faults actually recoverable? My memory is a bit hazy here, but I
> thought SPE buffer data could be written out in whacky ways such that even
> a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1),
> and so pinning is the only game in town.
> 
> A funkier approach might be to defer pinning of the buffer until the SPE is
> enabled and avoid pinning all of VM memory that way, although I can't
> immediately tell how flexible the architecture is in allowing you to cache
> the base/limit values.

I was investigating this approach, and Mark raised a concern that I think
might be a showstopper.

Let's consider this scenario:

Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).

1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
2. Guest programs SPE to enable profiling at **EL0**
(PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
3. Guest changes the translation table entries for the buffer. The
architecture allows this.
4. Guest does an ERET to EL0, thus enabling profiling.

Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
the buffer at stage 2 when profiling gets enabled at EL0.

I can see two solutions here:

a. Accept the limitation (and advertise it in the documentation) that if
someone wants to use SPE when running as a Linux guest, the kernel used by
the guest must not change the buffer translation table entries after the
buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so
running a Linux guest should not be a problem. I don't know how other OSes
do it (but I can find out). We could also phrase it that the buffer
translation table entries can be changed after enabling the buffer, but
only if profiling happens at EL1. But that sounds very arbitrary.

b. Pin the buffer after the stage 2 DABT that SPE will report in the
situation above. This means that there is a blackout window, but it will
happen only once after each time the guest reprograms the buffer. I don't
know if this is acceptable. We could say that if this blackout window
is not acceptable, then the guest kernel shouldn't change the translation
table entries after enabling the buffer.
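
A sketch of option b, reusing the hypothetical helpers from earlier in the
thread; on the SPE stage 2 data abort, PMBPTR_EL1 serves as the fault
address register:

static int kvm_spe_handle_s2_fault(struct kvm_vcpu *vcpu)
{
	u64 va = vcpu_read_sys_reg(vcpu, PMBPTR_EL1);
	u64 ipa;
	int ret;

	ret = guest_s1_walk_va_to_ipa(vcpu, va, &ipa);	/* walker sketched earlier */
	if (!ret)
		ret = kvm_spe_pin_buffer_at(vcpu, ipa);	/* hypothetical */
	if (ret)
		kvm_spe_stop_with_error(vcpu);		/* error path sketched earlier */

	return ret;
}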

Or drop the approach of pinning the buffer and go back to pinning the
entire memory of the VM.

Any thoughts on this? I would very much prefer to try to pin only the
buffer.

Thanks,
Alex

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
From: Oliver Upton @ 2022-07-26 17:51 UTC
  To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hi Alex,

On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:

[...]

> > A funkier approach might be to defer pinning of the buffer until the SPE is
> > enabled and avoid pinning all of VM memory that way, although I can't
> > immediately tell how flexible the architecture is in allowing you to cache
> > the base/limit values.
> 
> I was investigating this approach, and Mark raised a concern that I think
> might be a showstopper.
> 
> Let's consider this scenario:
> 
> Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> 
> 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> 2. Guest programs SPE to enable profiling at **EL0**
> (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> 3. Guest changes the translation table entries for the buffer. The
> architecture allows this.
> 4. Guest does an ERET to EL0, thus enabling profiling.
> 
> Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
> the buffer at stage 2 when profiling gets enabled at EL0.

Not saying we necessarily should, but this is possible with FGT, no?

> I can see two solutions here:
> 
> a. Accept the limitation (and advertise it in the documentation) that if
> someone wants to use SPE when running as a Linux guest, the kernel used by
> the guest must not change the buffer translation table entries after the
> buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so
> running a Linux guest should not be a problem. I don't know how other OSes
> do it (but I can find out). We could also phrase it that the buffer
> translation table entries can be changed after enabling the buffer, but
> only if profiling happens at EL1. But that sounds very arbitrary.
> 
> b. Pin the buffer after the stage 2 DABT that SPE will report in the
> situation above. This means that there is a blackout window, but it will
> happen only once after each time the guest reprograms the buffer. I don't
> know if this is acceptable. We could say that if this blackout window
> is not acceptable, then the guest kernel shouldn't change the translation
> table entries after enabling the buffer.
> 
> Or drop the approach of pinning the buffer and go back to pinning the
> entire memory of the VM.
> 
> Any thoughts on this? I would very much prefer to try to pin only the
> buffer.

Doesn't pinning the buffer also imply pinning the stage 1 tables
responsible for its translation as well? I agree that pinning the buffer
is likely the best way forward as pinning the whole of guest memory is
entirely impractical.

I'm also a bit confused on how we would manage to un-pin memory on the
way out with this. The guest is free to muck with the stage 1 and could
cause the SPU to spew a bunch of stage 2 aborts if it wanted to be
annoying. One way to tackle it would be to only allow a single
root-to-target walk to be pinned by a vCPU at a time. Any time a new
stage 2 abort comes from the SPU, we un-pin the old walk and pin the new
one instead.
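
A minimal sketch of that bookkeeping, assuming a hypothetical per-vCPU
record of the pinned range and kvm_spe_{pin,unpin}_walk() helpers that
don't exist today:

/* Sketch only: none of these fields or helpers exist in KVM. */
struct kvm_spe_pinned_walk {
        gpa_t base;     /* IPA of the currently pinned buffer range */
        size_t size;
        bool valid;
};

static int kvm_spe_repin(struct kvm_vcpu *vcpu, gpa_t base, size_t size)
{
        struct kvm_spe_pinned_walk *walk = &vcpu->arch.spe_walk;

        /* Only one root-to-target walk stays pinned per vCPU. */
        if (walk->valid)
                kvm_spe_unpin_walk(vcpu, walk->base, walk->size);

        walk->base = base;
        walk->size = size;
        walk->valid = true;

        return kvm_spe_pin_walk(vcpu, base, size);
}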

Live migration also throws a wrench in this. IOW, there are still potential
sources of blackout unattributable to guest manipulation of the SPU.

Going to think on this some more...

--
Thanks,
Oliver
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-26 17:51       ` Oliver Upton
@ 2022-07-27  9:30         ` Marc Zyngier
  -1 siblings, 0 replies; 72+ messages in thread
From: Marc Zyngier @ 2022-07-27  9:30 UTC (permalink / raw)
  To: Oliver Upton; +Cc: Will Deacon, kvmarm, linux-arm-kernel

On Tue, 26 Jul 2022 18:51:21 +0100,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> Hi Alex,
> 
> On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> 
> [...]
> 
> > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > enabled and avoid pinning all of VM memory that way, although I can't
> > > immediately tell how flexible the architecture is in allowing you to cache
> > > the base/limit values.
> > 
> > I was investigating this approach, and Mark raised a concern that I think
> > might be a showstopper.
> > 
> > Let's consider this scenario:
> > 
> > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> > 
> > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> > 2. Guest programs SPE to enable profiling at **EL0**
> > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> > 3. Guest changes the translation table entries for the buffer. The
> > architecture allows this.
> > 4. Guest does an ERET to EL0, thus enabling profiling.
> > 
> > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
> > the buffer at stage 2 when profiling gets enabled at EL0.
> 
> Not saying we necessarily should, but this is possible with FGT no?

Given how often ERET is used at EL1, I'd really refrain from doing
so. NV uses the same mechanism to multiplex vEL2 and vEL1 on the real
EL1, and this comes at a serious cost (even an exception return that
stays at the same EL gets trapped). Once EL1 runs, we disengage this
trap because it is otherwise way too costly.

>
> > I can see two solutions here:
> > 
> > a. Accept the limitation (and advertise it in the documentation) that if
> > someone wants to use SPE when running as a Linux guest, the kernel used by
> > the guest must not change the buffer translation table entries after the
> > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so
> > running a Linux guest should not be a problem. I don't know how other OSes
> > do it (but I can find out). We could also phrase it that the buffer
> > translation table entries can be changed after enabling the buffer, but
> > only if profiling happens at EL1. But that sounds very arbitrary.
> > 
> > b. Pin the buffer after the stage 2 DABT that SPE will report in the
> > situation above. This means that there is a blackout window, but it will
> > happen only once after each time the guest reprograms the buffer. I don't
> > know if this is acceptable. We could say that if this blackout window is
> > not acceptable, then the guest kernel shouldn't change the translation
> > table entries after enabling the buffer.
> > 
> > Or drop the approach of pinning the buffer and go back to pinning the
> > entire memory of the VM.
> > 
> > Any thoughts on this? I would very much prefer to try to pin only the
> > buffer.
> 
> Doesn't pinning the buffer also imply pinning the stage 1 tables
> responsible for its translation as well? I agree that pinning the buffer
> is likely the best way forward as pinning the whole of guest memory is
> entirely impractical.

How different is this from device assignment, which also relies on
full page pinning? The way I look at it, SPE is a device directly
assigned to the guest, and isn't capable of generating synchronous
exceptions. Not that I'm madly in love with the approach, but this is
at least consistent. There were also some concerns around buggy HW that
would blow itself up on S2 faults, but I think these implementations
are confidential enough that we don't need to worry about them.

> I'm also a bit confused on how we would manage to un-pin memory on the
> way out with this. The guest is free to muck with the stage 1 and could
> cause the SPU to spew a bunch of stage 2 aborts if it wanted to be
> annoying. One way to tackle it would be to only allow a single
> root-to-target walk to be pinned by a vCPU at a time. Any time a new
> stage 2 abort comes from the SPU, we un-pin the old walk and pin the new
> one instead.

This sounds like a reasonable option. Only one IPA range covering the
SPE buffer (as described by the translation of PMBPTR_EL1) is pinned
at any given time. Generate a SPE S2 fault outside of this range, and
we unpin the region before mapping in the next one. Yes, the guest can
play tricks on us and exploit the latency of the interrupt. But at the
end of the day, this is its own problem.

Of course, this results in larger blind windows. Ideally, we should be
able to report these to the guest, either as sideband data or in the
actual profiling buffer (but I have no idea whether this is possible).

> Live migration also throws a wrench in this. IOW, there are still potential
> sources of blackout unattributable to guest manipulation of the SPU.

Can you shed some light on this? I appreciate that you can't play the
R/O trick on the SPE buffer as it invalidates the above discussion,
but it should be relatively easy to track these pages and never reset
them as clean until the vcpu is stopped. Unless you foresee other
issues?

To be clear, I don't worry too much about these blind windows. The
architecture doesn't really give us the right tools to make it work
reliably, making this a best effort only. Unless we pin the whole
guest and forego migration and other fault-driven mechanisms.

Maybe that is a choice we need to give to the user: cheap, fast,
reliable. Pick two.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-27  9:30         ` Marc Zyngier
@ 2022-07-27  9:52           ` Marc Zyngier
  -1 siblings, 0 replies; 72+ messages in thread
From: Marc Zyngier @ 2022-07-27  9:52 UTC (permalink / raw)
  To: Oliver Upton; +Cc: Will Deacon, kvmarm, linux-arm-kernel

On Wed, 27 Jul 2022 10:30:59 +0100,
Marc Zyngier <maz@kernel.org> wrote:
> 
> On Tue, 26 Jul 2022 18:51:21 +0100,
> Oliver Upton <oliver.upton@linux.dev> wrote:
> > 
> > Doesn't pinning the buffer also imply pinning the stage 1 tables
> > responsible for its translation as well? I agree that pinning the buffer
> > is likely the best way forward as pinning the whole of guest memory is
> > entirely impractical.

Huh, I just realised that you were talking about S1. I don't think we
need to do this. As long as the translation falls into a mapped
region (pinned or not), we don't need to worry.

If we get a S2 translation fault from SPE, we just go and map it. And
TBH the pinning here is just an optimisation against things like swap,
KSM and similar things. The only thing we need to make sure of is that
the fault is handled in the context of the vcpu that owns this SPU.
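
As a sketch (helper names made up for illustration), the no-pinning
variant would boil down to:

/* Sketch only: all helpers below are hypothetical. */
static int kvm_spe_handle_irq(struct kvm_vcpu *vcpu)
{
        u64 pmbsr = kvm_spe_read_pmbsr(vcpu);

        /* Not a stage 2 data abort: just forward the event to the guest. */
        if (!kvm_spe_is_s2_fault(pmbsr))
                return kvm_spe_inject_irq(vcpu);

        /*
         * Treat it like any other stage 2 fault: translate the reported
         * VA via the guest's stage 1, map the IPA at stage 2, and do it
         * all in the context of the vcpu that owns this SPU.
         */
        return kvm_spe_map_buffer_page(vcpu);
}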

Alex, can you think of anything that would cause a problem (other than
performance and possible blackout windows) if we didn't do any pinning
at all and just handled the SPE interrupts as normal page faults?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-26 17:51       ` Oliver Upton
@ 2022-07-27 10:19         ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-07-27 10:19 UTC (permalink / raw)
  To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hi Oliver,

Thank you for the help, replies below.

On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote:
> Hi Alex,
> 
> On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> 
> [...]
> 
> > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > enabled and avoid pinning all of VM memory that way, although I can't
> > > immediately tell how flexible the architecture is in allowing you to cache
> > > the base/limit values.
> > 
> > I was investigating this approach, and Mark raised a concern that I think
> > might be a showstopper.
> > 
> > Let's consider this scenario:
> > 
> > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> > 
> > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> > 2. Guest programs SPE to enable profiling at **EL0**
> > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> > 3. Guest changes the translation table entries for the buffer. The
> > architecture allows this.
> > 4. Guest does an ERET to EL0, thus enabling profiling.
> > 
> > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
> > the buffer at stage 2 when profiling gets enabled at EL0.
> 
> Not saying we necessarily should, but this is possible with FGT no?

It doesn't look to me like FEAT_FGT offers any knobs to trap ERET from EL1.
Unless there's no other way, I would prefer not to have the emulation of one
feature depend on the presence of another feature.

> 
> > I can see two solutions here:
> > 
> > a. Accept the limitation (and advertise it in the documentation) that if
> > someone wants to use SPE when running as a Linux guest, the kernel used by
> > the guest must not change the buffer translation table entries after the
> > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so
> > running a Linux guest should not be a problem. I don't know how other OSes
> > do it (but I can find out). We could also phrase it that the buffer
> > translation table entries can be changed after enabling the buffer, but
> > only if profiling happens at EL1. But that sounds very arbitrary.
> > 
> > b. Pin the buffer after the stage 2 DABT that SPE will report in the
> > situation above. This means that there is a blackout window, but it will
> > happen only once after each time the guest reprograms the buffer. I don't
> > know if this is acceptable. We could say that if this blackout window is
> > not acceptable, then the guest kernel shouldn't change the translation
> > table entries after enabling the buffer.
> > 
> > Or drop the approach of pinning the buffer and go back to pinning the
> > entire memory of the VM.
> > 
> > Any thoughts on this? I would very much prefer to try to pin only the
> > buffer.
> 
> Doesn't pinning the buffer also imply pinning the stage 1 tables
> responsible for its translation as well? I agree that pinning the buffer

See my reply [1] to a question someone asked in an earlier iteration of the
pKVM series. My conclusion is that it's impossible to stop the
invalidate_range_start() MMU notifiers from being invoked for pinned pages.
But I believe that can be circumvented by passing the enum mmu_notifier_event
event field to the arm64 KVM code and using it to decide whether to do the
unmapping or not. I am still investigating that, but it looks promising.

[1] https://lore.kernel.org/all/YuEMkKY2RU%2F2KiZW@monolith.localdoman/
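
The shape I have in mind is roughly the following (a sketch only: the
range passed to the arch code carries no event field today, and the
other helpers are made up):

static bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
        /*
         * Skip the unmap for notifier events that the pinned SPE buffer
         * must survive; exactly which events qualify is what is still
         * being investigated.
         */
        if (!spe_event_requires_unmap(range->event) &&
            kvm_range_overlaps_pinned(kvm, range->start, range->end))
                return false;

        return __kvm_unmap_gfn_range(kvm, range);
}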

> is likely the best way forward as pinning the whole of guest memory is
> entirely impractical.

I would say it's undesirable, not impractical. Like Marc said, vfio already
pins the entire guest memory with the VFIO_IOMMU_MAP_DMA ioctl. The
difference there is that the SMMU tables are unmapped via the explicit
VFIO_IOMMU_UNMAP_DMA ioctl; the SMMU doesn't use the MMU notifiers to keep
in sync with the host's stage 1 like KVM does.

> 
> I'm also a bit confused on how we would manage to un-pin memory on the
> way out with this. The guest is free to muck with the stage 1 and could
> cause the SPU to spew a bunch of stage 2 aborts if it wanted to be
> annoying. One way to tackle it would be to only allow a single
> root-to-target walk to be pinned by a vCPU at a time. Any time a new
> stage 2 abort comes from the SPU, we un-pin the old walk and pin the new
> one instead.
> 
> Live migration also throws a wrench in this. IOW, there are still potential
> sources of blackout unattributable to guest manipulation of the SPU.

I have a proposal [2] to handle that, if you want to have a look.
Basically, userspace tells KVM to never allow the guest to start profiling.
That means a possibly huge blackout window while the guest is being
migrated, but I don't see any better solutions.

[2] https://lore.kernel.org/all/20211117153842.302159-35-alexandru.elisei@arm.com/
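
In rough strokes (names invented here, the actual API is the one
defined in [2]):

/* Sketch; see [2] for the real proposal. */
static int kvm_spe_stop_profiling(struct kvm_vcpu *vcpu)
{
        /*
         * Once set, guest writes that would enable sampling
         * (PMBLIMITR_EL1.E, PMSCR_EL1.{E0SPE,E1SPE}) are masked until
         * userspace lifts the restriction after migration completes.
         */
        vcpu->arch.spe.profiling_blocked = true;
        kvm_spe_mask_enable_bits(vcpu);

        return 0;
}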

Thanks,
Alex
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-27 10:19         ` Alexandru Elisei
@ 2022-07-27 10:29           ` Marc Zyngier
  -1 siblings, 0 replies; 72+ messages in thread
From: Marc Zyngier @ 2022-07-27 10:29 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: linux-arm-kernel, Will Deacon, kvmarm

On 2022-07-27 11:19, Alexandru Elisei wrote:
> Hi Oliver,
> 
> Thank you for the help, replies below.
> 
> On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote:
>> Hi Alex,
>> 
>> On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
>> 
>> [...]
>> 
>> > > A funkier approach might be to defer pinning of the buffer until the SPE is
>> > > enabled and avoid pinning all of VM memory that way, although I can't
>> > > immediately tell how flexible the architecture is in allowing you to cache
>> > > the base/limit values.
>> >
>> > I was investigating this approach, and Mark raised a concern that I think
>> > might be a showstopper.
>> >
>> > Let's consider this scenario:
>> >
>> > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
>> > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
>> >
>> > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
>> > 2. Guest programs SPE to enable profiling at **EL0**
>> > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
>> > 3. Guest changes the translation table entries for the buffer. The
>> > architecture allows this.
>> > 4. Guest does an ERET to EL0, thus enabling profiling.
>> >
>> > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
>> > the buffer at stage 2 when profiling gets enabled at EL0.
>> 
>> Not saying we necessarily should, but this is possible with FGT no?
> 
> It doesn't look to me like FEAT_FGT offers any knobs to trap ERET from 
> EL1.

See HFGITR.ERET.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-27  9:52           ` Marc Zyngier
@ 2022-07-27 10:38             ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-07-27 10:38 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: linux-arm-kernel, Will Deacon, kvmarm

Hi Marc,

On Wed, Jul 27, 2022 at 10:52:34AM +0100, Marc Zyngier wrote:
> On Wed, 27 Jul 2022 10:30:59 +0100,
> Marc Zyngier <maz@kernel.org> wrote:
> > 
> > On Tue, 26 Jul 2022 18:51:21 +0100,
> > Oliver Upton <oliver.upton@linux.dev> wrote:
> > > 
> > > Doesn't pinning the buffer also imply pinning the stage 1 tables
> > > responsible for its translation as well? I agree that pinning the buffer
> > > is likely the best way forward as pinning the whole of guest memory is
> > > entirely impractical.
> 
> Huh, I just realised that you were talking about S1. I don't think we
> need to do this. As long as the translation falls into a mapped
> region (pinned or not), we don't need to worry.
> 
> If we get a S2 translation fault from SPE, we just go and map it. And
> TBH the pinning here is just an optimisation against things like swap,
> KSM and similar things. The only thing we need to make sure of is that
> the fault is handled in the context of the vcpu that owns this SPU.
> 
> Alex, can you think of anything that would cause a problem (other than
> performance and possible blackout windows) if we didn't do any pinning
> at all and just handled the SPE interrupts as normal page faults?

PMBSR_EL1.DL might be set to 1 as a result of a stage 2 fault reported by
SPE, which means the last record written is incomplete. Records have a
variable size, so it's impossible for KVM to revert to the end of the last
known good record without parsing the buffer (references here [1]). And
even if KVM knew the size of a record, there's this bit in the Arm ARM
which worries me (ARM DDI 0487H.a, page D10-5177):

"The architecture does not require that a sample record is written
sequentially by the SPU, only that:
[..]
- On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates
  whether PMBPTR_EL1 points to the first byte after the last complete
  sample record."

So there might be gaps in the buffer, meaning that the entire buffer would
have to be discarded if DL is set as a result of a stage 2 fault.

Also, I'm not sure if you're aware of this, but SPE reports the guest VA in
PMBPTR_EL1 (not the IPA) on a fault, so KVM would have to walk the guest's
stage 1 tables, which would add to the overhead of servicing the fault.
Don't know if that makes a difference, just thought I should mention it as
another peculiarity of SPE.

[1] https://lore.kernel.org/all/Yl7KewpTj+7NSonf@monolith.localdoman/
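
To illustrate the constraint, a handler would have to treat the DL bit
roughly like this (a sketch: the bit position follows the architecture,
the helpers are made up):

#define PMBSR_DL        BIT(19) /* PMBSR_EL1.DL (ARM DDI 0487H.a) */

static void kvm_spe_handle_buffer_fault(struct kvm_vcpu *vcpu, u64 pmbsr)
{
        if (pmbsr & PMBSR_DL) {
                /*
                 * The last record is incomplete and the buffer may
                 * contain gaps, so nothing after the last known good
                 * record can be trusted: discard the whole buffer.
                 */
                kvm_spe_discard_buffer(vcpu);
                return;
        }

        /* DL clear: PMBPTR_EL1 points after the last complete record. */
        kvm_spe_resume_after_fault(vcpu);
}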

Thanks,
Alex
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-27 10:29           ` Marc Zyngier
@ 2022-07-27 10:44             ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-07-27 10:44 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: linux-arm-kernel, Will Deacon, kvmarm

On Wed, Jul 27, 2022 at 11:29:03AM +0100, Marc Zyngier wrote:
> On 2022-07-27 11:19, Alexandru Elisei wrote:
> > Hi Oliver,
> > 
> > Thank you for the help, replies below.
> > 
> > On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote:
> > > Hi Alex,
> > > 
> > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> > > 
> > > [...]
> > > 
> > > > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > > > enabled and avoid pinning all of VM memory that way, although I can't
> > > > > immediately tell how flexible the architecture is in allowing you to cache
> > > > > the base/limit values.
> > > >
> > > > I was investigating this approach, and Mark raised a concern that I think
> > > > might be a showstopper.
> > > >
> > > > Let's consider this scenario:
> > > >
> > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> > > >
> > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> > > > 2. Guest programs SPE to enable profiling at **EL0**
> > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> > > > 3. Guest changes the translation table entries for the buffer. The
> > > > architecture allows this.
> > > > 4. Guest does an ERET to EL0, thus enabling profiling.
> > > >
> > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
> > > > the buffer at stage 2 when profiling gets enabled at EL0.
> > > 
> > > Not saying we necessarily should, but this is possible with FGT no?
> > 
> > It doesn't look to me like FEAT_FGT offers any knobs to trap ERET from
> > EL1.
> 
> See HFGITR.ERET.

Ah, so that's the register, thanks!

I still am not sure that having FEAT_SPE, an Armv8.3 extension, depend on
FEAT_FGT, an Armv8.6 extension, is the best idea. Do you know of any machines
that have FEAT_SPE and FEAT_FGT?

On the plus side, KVM could enable the trap only in the case above, and disable
it after the ERET is trapped, so it should be relatively cheap to use.
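
Something like this, as a sketch (HFGITR_EL2_ERET here stands for the
HFGITR_EL2.ERET trap bit, whatever the tree ends up calling it):

static void kvm_spe_set_eret_trap(bool trap)
{
        u64 val = read_sysreg_s(SYS_HFGITR_EL2);

        if (trap)
                val |= HFGITR_EL2_ERET;
        else
                val &= ~HFGITR_EL2_ERET;

        write_sysreg_s(SYS_HFGITR_EL2, val);
}

The trap would only be engaged between the guest enabling the buffer for
EL0-only profiling and the first trapped ERET, and disengaged right after.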

Thanks,
Alex

> 
> Thanks,
> 
>         M.
> -- 
> Jazz is not dead. It just smells funny...
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-27  9:30         ` Marc Zyngier
@ 2022-07-27 10:56           ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-07-27 10:56 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: linux-arm-kernel, Will Deacon, kvmarm

Hi Marc,

On Wed, Jul 27, 2022 at 10:30:59AM +0100, Marc Zyngier wrote:
> On Tue, 26 Jul 2022 18:51:21 +0100,
> Oliver Upton <oliver.upton@linux.dev> wrote:
> > 
> > Hi Alex,
> > 
> > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> > 
> > [...]
> > 
> > > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > > enabled and avoid pinning all of VM memory that way, although I can't
> > > > immediately tell how flexible the architecture is in allowing you to cache
> > > > the base/limit values.
> > > 
> > > I was investigating this approach, and Mark raised a concern that I think
> > > might be a showstopper.
> > > 
> > > Let's consider this scenario:
> > > 
> > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> > > 
> > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> > > 2. Guest programs SPE to enable profiling at **EL0**
> > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> > > 3. Guest changes the translation table entries for the buffer. The
> > > architecture allows this.
> > > 4. Guest does an ERET to EL0, thus enabling profiling.
> > > 
> > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
> > > the buffer at stage 2 when profiling gets enabled at EL0.
> > 
> > Not saying we necessarily should, but this is possible with FGT no?
> 
> Given how often ERET is used at EL1, I'd really refrain from doing
> so. NV uses the same mechanism to multiplex vEL2 and vEL1 on the real
> EL1, and this comes at a serious cost (even an exception return that
> stays at the same EL gets trapped). Once EL1 runs, we disengage this
> trap because it is otherwise way too costly.
> 
> >
> > > I can see two solutions here:
> > > 
> > > a. Accept the limitation (and advertise it in the documentation) that if
> > > someone wants to use SPE when running as a Linux guest, the kernel used by
> > > the guest must not change the buffer translation table entries after the
> > > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so
> > > running a Linux guest should not be a problem. I don't know how other OSes
> > > do it (but I can find out). We could also phrase it that the buffer
> > > translation table entries can be changed after enabling the buffer, but
> > > only if profiling happens at EL1. But that sounds very arbitrary.
> > > 
> > > b. Pin the buffer after the stage 2 DABT that SPE will report in the
> > > situation above. This means that there is a blackout window, but it will
> > > happen only once after each time the guest reprograms the buffer. I don't
> > > know if this is acceptable. We could say that if this blackout window is
> > > not acceptable, then the guest kernel shouldn't change the translation
> > > table entries after enabling the buffer.
> > > 
> > > Or drop the approach of pinning the buffer and go back to pinning the
> > > entire memory of the VM.
> > > 
> > > Any thoughts on this? I would very much prefer to try to pin only the
> > > buffer.
> > 
> > Doesn't pinning the buffer also imply pinning the stage 1 tables
> > responsible for its translation as well? I agree that pinning the buffer
> > is likely the best way forward as pinning the whole of guest memory is
> > entirely impractical.
> 
> How different is this from device assignment, which also relies on
> full page pinning? The way I look at it, SPE is a device directly
> assigned to the guest, and isn't capable of generating synchronous
> exceptions. Not that I'm madly in love with the approach, but this is
> at least consistent. There were also some concerns around buggy HW that
> would blow itself up on S2 faults, but I think these implementations
> are confidential enough that we don't need to worry about them.
> 
> > I'm also a bit confused on how we would manage to un-pin memory on the
> > way out with this. The guest is free to muck with the stage 1 and could
> > cause the SPU to spew a bunch of stage 2 aborts if it wanted to be
> > annoying. One way to tackle it would be to only allow a single
> > root-to-target walk to be pinned by a vCPU at a time. Any time a new
> > stage 2 abort comes from the SPU, we un-pin the old walk and pin the new
> > one instead.
> 
> This sounds like a reasonable option. Only one IPA range covering the
> SPE buffer (as described by the translation of PMBPTR_EL1) is pinned
> at any given time. Generate a SPE S2 fault outside of this range, and
> we unpin the region before mapping in the next one. Yes, the guest can
> play tricks on us and exploit the latency of the interrupt. But at the
> end of the day, this is its own problem.
> 
> Of course, this results in larger blind windows. Ideally, we should be
> able to report these to the guest, either as sideband data or in the
> actual profiling buffer (but I have no idea whether this is possible).

I believe solution b would reduce the blackouts to a minimum: pin the
buffer when the guest enables profiling (where by profiling enabled I mean
that StatisticalProfilingEnabled() returns true), and pin the buffer as a
result of a stage 2 fault only in the situation that I described.
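
For reference, a simplified version of that check for a non-nested guest
(a condensation of the StatisticalProfilingEnabled() pseudocode; EL2 and
Secure state are ignored, bit positions per the architecture):

#define PMBLIMITR_E     BIT(0)  /* PMBLIMITR_EL1.E */
#define PMBSR_S         BIT(17) /* PMBSR_EL1.S */
#define PMSCR_E0SPE     BIT(0)  /* PMSCR_EL1.E0SPE */
#define PMSCR_E1SPE     BIT(1)  /* PMSCR_EL1.E1SPE */

static bool guest_profiling_enabled(u64 pmblimitr, u64 pmbsr, u64 pmscr,
                                    bool at_el0)
{
        /* Buffer enabled and no buffer management event pending. */
        if (!(pmblimitr & PMBLIMITR_E) || (pmbsr & PMBSR_S))
                return false;

        /* Sampling is enabled per exception level. */
        return at_el0 ? !!(pmscr & PMSCR_E0SPE) : !!(pmscr & PMSCR_E1SPE);
}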

Thanks,
Alex

> 
> > Live migration also throws a wrench in this. IOW, there are still potential
> > sources of blackout unattributable to guest manipulation of the SPU.
> 
> Can you shed some light on this? I appreciate that you can't play the
> R/O trick on the SPE buffer as it invalidates the above discussion,
> but it should be relatively easy to track these pages and never reset
> them as clean until the vcpu is stopped. Unless you foresee other
> issues?
> 
> To be clear, I don't worry too much about these blind windows. The
> architecture doesn't really give us the right tools to make it work
> reliably, making this a best effort only. Unless we pin the whole
> guest and forego migration and other fault-driven mechanisms.
> 
> Maybe that is a choice we need to give to the user: cheap, fast,
> reliable. Pick two.
> 
> Thanks,
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
@ 2022-07-27 10:56           ` Alexandru Elisei
  0 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-07-27 10:56 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Oliver Upton, Will Deacon, kvmarm, linux-arm-kernel

Hi Marc,

On Wed, Jul 27, 2022 at 10:30:59AM +0100, Marc Zyngier wrote:
> On Tue, 26 Jul 2022 18:51:21 +0100,
> Oliver Upton <oliver.upton@linux.dev> wrote:
> > 
> > Hi Alex,
> > 
> > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> > 
> > [...]
> > 
> > > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > > enabled and avoid pinning all of VM memory that way, although I can't
> > > > immediately tell how flexible the architecture is in allowing you to cache
> > > > the base/limit values.
> > > 
> > > I was investigating this approach, and Mark raised a concern that I think
> > > might be a showstopper.
> > > 
> > > Let's consider this scenario:
> > > 
> > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> > > 
> > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> > > 2. Guest programs SPE to enable profiling at **EL0**
> > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> > > 3. Guest changes the translation table entries for the buffer. The
> > > architecture allows this.
> > > 4. Guest does an ERET to EL0, thus enabling profiling.
> > > 
> > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
> > > the buffer at stage 2 when profiling gets enabled at EL0.
> > 
> > Not saying we necessarily should, but this is possible with FGT no?
> 
> Given how often ERET is used at EL1, I'd really refrain from doing
> so. NV uses the same mechanism to multiplex vEL2 and vEL1 on the real
> EL1, and this comes at a serious cost (even an exception return that
> stays at the same EL gets trapped). Once EL1 runs, we disengage this
> trap because it is otherwise way too costly.
> 
> >
> > > I can see two solutions here:
> > > 
> > > a. Accept the limitation (and advertise it in the documentation) that if
> > > someone wants to use SPE when running as a Linux guest, the kernel used by
> > > the guest must not change the buffer translation table entries after the
> > > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so
> > > running a Linux guest should not be a problem. I don't know how other OSes
> > > do it (but I can find out). We could also phrase it that the buffer
> > > translation table entries can be changed after enabling the buffer, but
> > > only if profiling happens at EL1. But that sounds very arbitrary.
> > > 
> > > b. Pin the buffer after the stage 2 DABT that SPE will report in the
> > > situation above. This means that there is a blackout window, but it will
> > > happen only once after each time the guest reprograms the buffer. I don't
> > > know if this is acceptable. We could say that if this blackout window
> > > is not acceptable, then the guest kernel shouldn't change the translation
> > > table entries after enabling the buffer.
> > > 
> > > Or drop the approach of pinning the buffer and go back to pinning the
> > > entire memory of the VM.
> > > 
> > > Any thoughts on this? I would very much prefer to try to pin only the
> > > buffer.
> > 
> > Doesn't pinning the buffer also imply pinning the stage 1 tables
> > responsible for its translation as well? I agree that pinning the buffer
> > is likely the best way forward as pinning the whole of guest memory is
> > entirely impractical.
> 
> How different is this from device assignment, which also relies on
> full page pinning? The way I look at it, SPE is a device directly
> assigned to the guest, and isn't capable of generating synchronous
> exceptions. Not that I'm madly in love with the approach, but this is
> at least consistent. There were also some concerns around buggy HW that
> would blow itself up on S2 faults, but I think these implementations
> are confidential enough that we don't need to worry about them.
> 
> > I'm also a bit confused on how we would manage to un-pin memory on the
> > way out with this. The guest is free to muck with the stage 1 and could
> > cause the SPU to spew a bunch of stage 2 aborts if it wanted to be
> > annoying. One way to tackle it would be to only allow a single
> > root-to-target walk to be pinned by a vCPU at a time. Any time a new
> > stage 2 abort comes from the SPU, we un-pin the old walk and pin the new
> > one instead.
> 
> This sounds like a reasonable option. Only one IPA range covering the
> SPE buffer (as described by the translation of PMBPTR_EL1) is pinned
> at any given time. Generate a SPE S2 fault outside of this range, and
> we unpin the region before mapping in the next one. Yes, the guest can
> play tricks on us and exploit the latency of the interrupt. But at the
> end of the day, this is its own problem.
> 
> Of course, this results in larger blind windows. Ideally, we should be
> able to report these to the guest, either as sideband data or in the
> actual profiling buffer (but I have no idea whether this is possible).

I believe solution b, pinning the buffer when the guest enables profiling
(where by profiling enabled I mean that StatisticalProfilingEnabled()
returns true), and pinning the buffer as a result of a stage 2 fault only
in the situation I described above, would reduce the blackouts to a
minimum.
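
To make that concrete, here is a minimal sketch of the decision point. It
is not from any posted patch: only the PMBLIMITR_EL1.E and
PMSCR_EL1.{E0SPE,E1SPE} bit positions come from the architecture, and
struct spe_vcpu and all the helpers are hypothetical.

#include <stdbool.h>
#include <stdint.h>

#define PMBLIMITR_EL1_E		(1UL << 0)	/* buffer enable */
#define PMSCR_EL1_E0SPE		(1UL << 0)	/* sample at EL0 */
#define PMSCR_EL1_E1SPE		(1UL << 1)	/* sample at EL1 */

struct spe_vcpu {
	uint64_t pmblimitr;	/* shadow of PMBLIMITR_EL1 */
	uint64_t pmscr;		/* shadow of PMSCR_EL1 */
	bool buffer_pinned;
};

/* Hypothetical: walk stage 1 from PMBPTR_EL1, pin the IPA range at stage 2 */
static bool pin_guest_buffer(struct spe_vcpu *s)
{
	(void)s;
	return true;
}

/* Roughly StatisticalProfilingEnabled() for a guest at EL1 or EL0 */
static bool spe_profiling_enabled(struct spe_vcpu *s, bool at_el0)
{
	if (!(s->pmblimitr & PMBLIMITR_EL1_E))
		return false;
	return at_el0 ? !!(s->pmscr & PMSCR_EL1_E0SPE)
		      : !!(s->pmscr & PMSCR_EL1_E1SPE);
}

/*
 * Called after a trapped write to PMBLIMITR_EL1 or PMSCR_EL1. The EL0-only
 * case still needs the stage 2 fault fallback, because the enabling ERET
 * cannot be trapped cheaply.
 */
static void spe_sysreg_written(struct spe_vcpu *s, bool guest_at_el0)
{
	if (!s->buffer_pinned && spe_profiling_enabled(s, guest_at_el0))
		s->buffer_pinned = pin_guest_buffer(s);
}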

Thanks,
Alex

> 
> > Live migration also throws a wrench in this. IOW, there are still potential
> > sources of blackout unattributable to guest manipulation of the SPU.
> 
> Can you shed some light on this? I appreciate that you can't play the
> R/O trick on the SPE buffer as it invalidates the above discussion,
> but it should be relatively easy to track these pages and never reset
> them as clean until the vcpu is stopped. Unless you foresee other
> issues?
> 
> To be clear, I don't worry too much about these blind windows. The
> architecture doesn't really give us the right tools to make it work
> reliably, making this a best effort only. Unless we pin the whole
> guest and forego migration and other fault-driven mechanisms.
> 
> Maybe that is a choice we need to give to the user: cheap, fast,
> reliable. Pick two.
> 
> Thanks,
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.


* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-26 17:51       ` Oliver Upton
@ 2022-07-27 11:00         ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-07-27 11:00 UTC (permalink / raw)
  To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hi Oliver,

On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote:
> Hi Alex,
> 
> On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> 
> [...]
> 
> I'm also a bit confused on how we would manage to un-pin memory on the
> way out with this. The guest is free to muck with the stage 1 and could
> cause the SPU to spew a bunch of stage 2 aborts if it wanted to be
> annoying. One way to tackle it would be to only allow a single
> root-to-target walk to be pinned by a vCPU at a time. Any time a new
> stage 2 abort comes from the SPU, we un-pin the old walk and pin the new
> one instead.

On the topic of memory unpinning, for a well-behaved guest I believe that should
be done the next time the buffer is pinned. The buffer can (and should!) be
drained when both the buffer and sampling are disabled; unpinning the buffer when
profiling becomes disabled would lead to unnecessary stage 2 faults when
draining it.

That approach also means that KVM wouldn't have to do anything special for SPE
stage 2 faults.
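
As a rough sketch of that ordering, with a single pinned walk per vCPU as
Oliver suggested (all names hypothetical, the pin helpers are stubs):

#include <stdbool.h>
#include <stdint.h>

struct pinned_range {
	uint64_t ipa_base;
	uint64_t size;
	bool valid;
};

/* Hypothetical stubs for the actual stage 2 pin/unpin machinery */
static void pin_ipa_range(uint64_t base, uint64_t size) { (void)base; (void)size; }
static void unpin_ipa_range(uint64_t base, uint64_t size) { (void)base; (void)size; }

/* The one pinned root-to-target walk for this vCPU */
static struct pinned_range cur;

static void spe_pin_buffer(uint64_t ipa_base, uint64_t size)
{
	/* Tear down the previous pin only now, so a guest draining a
	 * disabled buffer never takes avoidable stage 2 faults. */
	if (cur.valid)
		unpin_ipa_range(cur.ipa_base, cur.size);
	pin_ipa_range(ipa_base, size);
	cur = (struct pinned_range){ ipa_base, size, true };
}

/* Deliberately no unpin on profiling-disable: the stale pin is
 * reclaimed by the next spe_pin_buffer() call. */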

Thanks,
Alex

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-27 10:44             ` Alexandru Elisei
@ 2022-07-27 11:08               ` Marc Zyngier
  -1 siblings, 0 replies; 72+ messages in thread
From: Marc Zyngier @ 2022-07-27 11:08 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: Will Deacon, kvmarm, linux-arm-kernel

On 2022-07-27 11:44, Alexandru Elisei wrote:
> On Wed, Jul 27, 2022 at 11:29:03AM +0100, Marc Zyngier wrote:
>> On 2022-07-27 11:19, Alexandru Elisei wrote:
>> > Hi Oliver,
>> >
>> > Thank you for the help, replies below.
>> >
>> > On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote:
>> > > Hi Alex,
>> > >
>> > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
>> > >
>> > > [...]
>> > >
>> > > > > A funkier approach might be to defer pinning of the buffer until the SPE is
>> > > > > enabled and avoid pinning all of VM memory that way, although I can't
>> > > > > immediately tell how flexible the architecture is in allowing you to cache
>> > > > > the base/limit values.
>> > > >
>> > > > I was investigating this approach, and Mark raised a concern that I think
>> > > > might be a showstopper.
>> > > >
>> > > > Let's consider this scenario:
>> > > >
>> > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
>> > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
>> > > >
>> > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
>> > > > 2. Guest programs SPE to enable profiling at **EL0**
>> > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
>> > > > 3. Guest changes the translation table entries for the buffer. The
>> > > > architecture allows this.
>> > > > 4. Guest does an ERET to EL0, thus enabling profiling.
>> > > >
>> > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
>> > > > the buffer at stage 2 when profiling gets enabled at EL0.
>> > >
>> > > Not saying we necessarily should, but this is possible with FGT no?
>> >
>> > It doesn't look to me like FEAT_FGT offers any knobs to trap ERET from
>> > EL1.
>> 
>> See HFGITR.ERET.
> 
> Ah, so that's the register, thanks!
> 
> I still am not sure that having FEAT_SPE, an Armv8.3 extension, depend on
> FEAT_FGT, an Armv8.6 extension, is the best idea. Do you know of any
> machines that have FEAT_SPE and FEAT_FGT?

None. Both are pretty niche, and the combination is nowhere
to be seen at the moment.

> On the plus side, KVM could enable the trap only in the case above, and
> disable it after the ERET is trapped, so it should be relatively cheap to
> use.

This feels pretty horrible. Nothing says *when* EL1 will
alter the PTs. It could take tons of EL1->EL1 exceptions
before returning to EL0. And the change could happen after
an EL1->EL0->EL1 transition. At which point do you stop?

If you want to rely on ERET for that, you need to trap
ERET all the time, because all ERETs to EL0 will be
suspect. And doing that to handle such a corner case feels
pretty horrible.

         M.
-- 
Jazz is not dead. It just smells funny...

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-27 10:56           ` Alexandru Elisei
@ 2022-07-27 11:18             ` Marc Zyngier
  -1 siblings, 0 replies; 72+ messages in thread
From: Marc Zyngier @ 2022-07-27 11:18 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: linux-arm-kernel, Will Deacon, kvmarm

On 2022-07-27 11:56, Alexandru Elisei wrote:
> Hi Marc,
> 
> On Wed, Jul 27, 2022 at 10:30:59AM +0100, Marc Zyngier wrote:
>> On Tue, 26 Jul 2022 18:51:21 +0100,
>> Oliver Upton <oliver.upton@linux.dev> wrote:
>> >
>> > Hi Alex,
>> >
>> > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
>> >
>> > [...]
>> >
>> > > > A funkier approach might be to defer pinning of the buffer until the SPE is
>> > > > enabled and avoid pinning all of VM memory that way, although I can't
>> > > > immediately tell how flexible the architecture is in allowing you to cache
>> > > > the base/limit values.
>> > >
>> > > I was investigating this approach, and Mark raised a concern that I think
>> > > might be a showstopper.
>> > >
>> > > Let's consider this scenario:
>> > >
>> > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
>> > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
>> > >
>> > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
>> > > 2. Guest programs SPE to enable profiling at **EL0**
>> > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
>> > > 3. Guest changes the translation table entries for the buffer. The
>> > > architecture allows this.
>> > > 4. Guest does an ERET to EL0, thus enabling profiling.
>> > >
>> > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
>> > > the buffer at stage 2 when profiling gets enabled at EL0.
>> >
>> > Not saying we necessarily should, but this is possible with FGT no?
>> 
>> Given how often ERET is used at EL1, I'd really refrain from doing
>> so. NV uses the same mechanism to multiplex vEL2 and vEL1 on the real
>> EL1, and this comes at a serious cost (even an exception return that
>> stays at the same EL gets trapped). Once EL1 runs, we disengage this
>> trap because it is otherwise way too costly.
>> 
>> >
>> > > I can see two solutions here:
>> > >
>> > > a. Accept the limitation (and advertise it in the documentation) that if
>> > > someone wants to use SPE when running as a Linux guest, the kernel used by
>> > > the guest must not change the buffer translation table entries after the
>> > > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so
>> > > running a Linux guest should not be a problem. I don't know how other OSes
>> > > do it (but I can find out). We could also phrase it that the buffer
>> > > translation table entries can be changed after enabling the buffer, but
>> > > only if profiling happens at EL1. But that sounds very arbitrary.
>> > >
>> > > b. Pin the buffer after the stage 2 DABT that SPE will report in the
>> > > situation above. This means that there is a blackout window, but it will
>> > > happen only once after each time the guest reprograms the buffer. I don't
>> > > know if this is acceptable. We could say that if this blackout window
>> > > is not acceptable, then the guest kernel shouldn't change the translation
>> > > table entries after enabling the buffer.
>> > >
>> > > Or drop the approach of pinning the buffer and go back to pinning the
>> > > entire memory of the VM.
>> > >
>> > > Any thoughts on this? I would very much prefer to try to pin only the
>> > > buffer.
>> >
>> > Doesn't pinning the buffer also imply pinning the stage 1 tables
>> > responsible for its translation as well? I agree that pinning the buffer
>> > is likely the best way forward as pinning the whole of guest memory is
>> > entirely impractical.
>> 
>> How different is this from device assignment, which also relies on
>> full page pinning? The way I look at it, SPE is a device directly
>> assigned to the guest, and isn't capable of generating synchronous
>> exceptions. Not that I'm madly in love with the approach, but this is
>> at least consistent. There were also some concerns around buggy HW that
>> would blow itself up on S2 faults, but I think these implementations
>> are confidential enough that we don't need to worry about them.
>> 
>> > I'm also a bit confused on how we would manage to un-pin memory on the
>> > way out with this. The guest is free to muck with the stage 1 and could
>> > cause the SPU to spew a bunch of stage 2 aborts if it wanted to be
>> > annoying. One way to tackle it would be to only allow a single
>> > root-to-target walk to be pinned by a vCPU at a time. Any time a new
>> > stage 2 abort comes from the SPU, we un-pin the old walk and pin the new
>> > one instead.
>> 
>> This sounds like a reasonable option. Only one IPA range covering the
>> SPE buffer (as described by the translation of PMBPTR_EL1) is pinned
>> at any given time. Generate a SPE S2 fault outside of this range, and
>> we unpin the region before mapping in the next one. Yes, the guest can
>> play tricks on us and exploit the latency of the interrupt. But at the
>> end of the day, this is its own problem.
>> 
>> Of course, this results in larger blind windows. Ideally, we should be
>> able to report these to the guest, either as sideband data or in the
>> actual profiling buffer (but I have no idea whether this is possible).
> 
> I believe solution b, pinning the buffer when the guest enables profiling
> (where by profiling enabled I mean that StatisticalProfilingEnabled()
> returns true), and pinning the buffer as a result of a stage 2 fault only
> in the situation I described above, would reduce the blackouts to a
> minimum.

In all honesty, I'd rather see everything be done as the result
of an S2 fault for now, and only introduce heuristics to reduce the
blackout window at a later time. And this includes buffer pinning
if that can be avoided.

My hunch is that people wanting zero blackout will always pin
all their memory, one way or another, and that the rest of us
will be happy just to get *something* out of SPE in a VM...
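
As an illustration, a hedged sketch of such a purely fault-driven handler.
The PMBSR_EL1 field encodings follow the Arm ARM; the handler plumbing and
every helper below are hypothetical:

#include <stdint.h>

#define PMBSR_EL1_S		(1UL << 17)	/* service bit */
#define PMBSR_EL1_EC_SHIFT	26
#define PMBSR_EL1_EC_MASK	0x3fUL
#define PMBSR_EC_STAGE2_DABT	0x25		/* 0b100101 */

struct spe_vcpu;
uint64_t read_pmbsr(struct spe_vcpu *s);		/* hypothetical */
uint64_t read_pmbptr(struct spe_vcpu *s);		/* hypothetical */
uint64_t walk_guest_stage1(struct spe_vcpu *s, uint64_t va);
void map_at_stage2(struct spe_vcpu *s, uint64_t ipa);
void clear_service_and_resume(struct spe_vcpu *s);
void inject_spe_event(struct spe_vcpu *s, uint64_t pmbsr);

void spe_handle_maintenance_irq(struct spe_vcpu *s)
{
	uint64_t pmbsr = read_pmbsr(s);
	uint64_t ec = (pmbsr >> PMBSR_EL1_EC_SHIFT) & PMBSR_EL1_EC_MASK;

	if (!(pmbsr & PMBSR_EL1_S))
		return;

	if (ec == PMBSR_EC_STAGE2_DABT) {
		/* SPE reports the guest VA: walk the guest's stage 1
		 * tables to get the IPA, then map/pin it at stage 2. */
		uint64_t ipa = walk_guest_stage1(s, read_pmbptr(s));

		map_at_stage2(s, ipa);
		clear_service_and_resume(s);
		return;
	}

	/* Stage 1 faults and other events are the guest's problem. */
	inject_spe_event(s, pmbsr);
}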

         M.
-- 
Jazz is not dead. It just smells funny...

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-27 11:08               ` Marc Zyngier
@ 2022-07-27 11:57                 ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-07-27 11:57 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Will Deacon, kvmarm, linux-arm-kernel

Hi,

On Wed, Jul 27, 2022 at 12:08:11PM +0100, Marc Zyngier wrote:
> On 2022-07-27 11:44, Alexandru Elisei wrote:
> > On Wed, Jul 27, 2022 at 11:29:03AM +0100, Marc Zyngier wrote:
> > > On 2022-07-27 11:19, Alexandru Elisei wrote:
> > > > Hi Oliver,
> > > >
> > > > Thank you for the help, replies below.
> > > >
> > > > On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote:
> > > > > Hi Alex,
> > > > >
> > > > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > > > > > enabled and avoid pinning all of VM memory that way, although I can't
> > > > > > > immediately tell how flexible the architecture is in allowing you to cache
> > > > > > > the base/limit values.
> > > > > >
> > > > > > I was investigating this approach, and Mark raised a concern that I think
> > > > > > might be a showstopper.
> > > > > >
> > > > > > Let's consider this scenario:
> > > > > >
> > > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> > > > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> > > > > >
> > > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> > > > > > 2. Guest programs SPE to enable profiling at **EL0**
> > > > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> > > > > > 3. Guest changes the translation table entries for the buffer. The
> > > > > > architecture allows this.
> > > > > > 4. Guest does an ERET to EL0, thus enabling profiling.
> > > > > >
> > > > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
> > > > > > the buffer at stage 2 when profiling gets enabled at EL0.
> > > > >
> > > > > Not saying we necessarily should, but this is possible with FGT no?
> > > >
> > > > It doesn't look to me like FEAT_FGT offers any knobs to trap ERET from
> > > > EL1.
> > > 
> > > See HFGITR.ERET.
> > 
> > Ah, so that's the register, thanks!
> > 
> > I still am not sure that having FEAT_SPE, an Armv8.3 extension, depend on
> > FEAT_FGT, an Armv8.6 extension, is the best idea. Do you know of any
> > machines that have FEAT_SPE and FEAT_FGT?
> 
> None. Both are pretty niche, and the combination is nowhere
> to be seen at the moment.

That was also my impression.

> 
> > On the plus side, KVM could enable the trap only in the case above, and
> > disable it after the ERET is trapped, so it should be relatively cheap to
> > use.
> 
> This feels pretty horrible. Nothing says *when* EL1 will
> alter the PTs. It could take tons of EL1->EL1 exceptions
> before returning to EL0. And the change could happen after
> an EL1->EL0->EL1 transition. At which point do you stop?

ERET trapping is enabled when PMBLIMITR_EL1.E = 1 and PMSCR_EL1.{E0SPE,E1SPE}
= {1,0}. The first guest ERET from EL1 to EL0 enables profiling, at which
point the buffer is pinned and ERET trapping is disabled.

The guest messing with the translation tables while profiling is enabled is
the guest's problem, because that's not permitted by the architecture. Any
stage 2 DABT taken while the buffer is pinned would be injected back into the
guest as an SPE external abort (or something equivalent). Stage 1 DABTs are
entirely the guest's problem to solve and would be injected back regardless
of the status of the buffer.
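
For concreteness, a sketch of the trap dance this would have implied.
HFGITR_EL2.ERET is the real FEAT_FGT bit that traps ERET from EL1, and the
PMBLIMITR_EL1/PMSCR_EL1 bit layouts are architectural; everything else
below is hypothetical:

#include <stdbool.h>
#include <stdint.h>

#define PMBLIMITR_EL1_E		(1UL << 0)
#define PMSCR_EL1_E0SPE		(1UL << 0)
#define PMSCR_EL1_E1SPE		(1UL << 1)

void set_hfgitr_eret_trap(bool enable);	/* hypothetical */
bool pin_guest_buffer(void);		/* hypothetical */

/* Re-evaluated after every trapped PMBLIMITR_EL1/PMSCR_EL1 write: arm the
 * trap only for the EL0-only case that sysreg traps alone cannot catch. */
void spe_update_eret_trap(uint64_t pmblimitr, uint64_t pmscr, bool pinned)
{
	bool need = (pmblimitr & PMBLIMITR_EL1_E) &&
		    (pmscr & PMSCR_EL1_E0SPE) &&
		    !(pmscr & PMSCR_EL1_E1SPE) && !pinned;

	set_hfgitr_eret_trap(need);
}

/* First trapped ERET with the trap armed: profiling is about to start, so
 * pin the buffer and disengage the (expensive) trap. Any EL1->EL1 ERET
 * trapped before this point is pure overhead. */
void spe_handle_trapped_eret(void)
{
	pin_guest_buffer();
	set_hfgitr_eret_trap(false);
}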

Yes, I agree, there could be a lot of ERETs from EL1 to EL1 before the ERET
to EL0; those ERETs would be uselessly trapped.

The above is a moot point anyway, because I believe we both agree that
having SPE emulation depend on FEAT_FGT is best to be avoided.

Thanks,
Alex

> 
> If you want to rely on ERET for that, you need to trap
> ERET all the time, because all ERETs to EL0 will be
> suspect. And doing that to handle such a corner case feels
> pretty horrible.
> 
>         M.
> -- 
> Jazz is not dead. It just smells funny...

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-27 11:18             ` Marc Zyngier
@ 2022-07-27 12:10               ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-07-27 12:10 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: linux-arm-kernel, Will Deacon, kvmarm

Hi,

On Wed, Jul 27, 2022 at 12:18:41PM +0100, Marc Zyngier wrote:
> On 2022-07-27 11:56, Alexandru Elisei wrote:
> > Hi Marc,
> > 
> > On Wed, Jul 27, 2022 at 10:30:59AM +0100, Marc Zyngier wrote:
> > > On Tue, 26 Jul 2022 18:51:21 +0100,
> > > Oliver Upton <oliver.upton@linux.dev> wrote:
> > > >
> > > > Hi Alex,
> > > >
> > > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> > > >
> > > > [...]
> > > >
> > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > > > > enabled and avoid pinning all of VM memory that way, although I can't
> > > > > > immediately tell how flexible the architecture is in allowing you to cache
> > > > > > the base/limit values.
> > > > >
> > > > > I was investigating this approach, and Mark raised a concern that I think
> > > > > might be a showstopper.
> > > > >
> > > > > Let's consider this scenario:
> > > > >
> > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> > > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> > > > >
> > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> > > > > 2. Guest programs SPE to enable profiling at **EL0**
> > > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> > > > > 3. Guest changes the translation table entries for the buffer. The
> > > > > architecture allows this.
> > > > > 4. Guest does an ERET to EL0, thus enabling profiling.
> > > > >
> > > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
> > > > > the buffer at stage 2 when profiling gets enabled at EL0.
> > > >
> > > > Not saying we necessarily should, but this is possible with FGT no?
> > > 
> > > Given how often ERET is used at EL1, I'd really refrain from doing
> > > so. NV uses the same mechanism to multiplex vEL2 and vEL1 on the real
> > > EL1, and this comes at a serious cost (even an exception return that
> > > stays at the same EL gets trapped). Once EL1 runs, we disengage this
> > > trap because it is otherwise way too costly.
> > > 
> > > >
> > > > > I can see two solutions here:
> > > > >
> > > > > a. Accept the limitation (and advertise it in the documentation) that if
> > > > > someone wants to use SPE when running as a Linux guest, the kernel used by
> > > > > the guest must not change the buffer translation table entries after the
> > > > > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so
> > > > > running a Linux guest should not be a problem. I don't know how other OSes
> > > > > do it (but I can find out). We could also phrase it that the buffer
> > > > > translation table entries can be changed after enabling the buffer, but
> > > > > only if profiling happens at EL1. But that sounds very arbitrary.
> > > > >
> > > > > b. Pin the buffer after the stage 2 DABT that SPE will report in the
> > > > > situation above. This means that there is a blackout window, but it will
> > > > > happen only once after each time the guest reprograms the buffer. I don't
> > > > > know if this is acceptable. We could say that if this blackout window
> > > > > is not acceptable, then the guest kernel shouldn't change the translation
> > > > > table entries after enabling the buffer.
> > > > >
> > > > > Or drop the approach of pinning the buffer and go back to pinning the
> > > > > entire memory of the VM.
> > > > >
> > > > > Any thoughts on this? I would very much prefer to try to pin only the
> > > > > buffer.
> > > >
> > > > Doesn't pinning the buffer also imply pinning the stage 1 tables
> > > > responsible for its translation as well? I agree that pinning the buffer
> > > > is likely the best way forward as pinning the whole of guest memory is
> > > > entirely impractical.
> > > 
> > > How different is this from device assignment, which also relies on
> > > full page pinning? The way I look at it, SPE is a device directly
> > > assigned to the guest, and isn't capable of generating synchronous
> > > exceptions. Not that I'm madly in love with the approach, but this is
> > > at least consistent. There were also some concerns around buggy HW that
> > > would blow itself up on S2 faults, but I think these implementations
> > > are confidential enough that we don't need to worry about them.
> > > 
> > > > I'm also a bit confused on how we would manage to un-pin memory on the
> > > > way out with this. The guest is free to muck with the stage 1 and could
> > > > cause the SPU to spew a bunch of stage 2 aborts if it wanted to be
> > > > annoying. One way to tackle it would be to only allow a single
> > > > root-to-target walk to be pinned by a vCPU at a time. Any time a new
> > > > stage 2 abort comes from the SPU, we un-pin the old walk and pin the new
> > > > one instead.
> > > 
> > > This sounds like a reasonable option. Only one IPA range covering the
> > > SPE buffer (as described by the translation of PMBPTR_EL1) is pinned
> > > at any given time. Generate a SPE S2 fault outside of this range, and
> > > we unpin the region before mapping in the next one. Yes, the guest can
> > > play tricks on us and exploit the latency of the interrupt. But at the
> > > end of the day, this is its own problem.
> > > 
> > > Of course, this results in larger blind windows. Ideally, we should be
> > > able to report these to the guest, either as sideband data or in the
> > > actual profiling buffer (but I have no idea whether this is possible).
> > 
> > I believe solution b, pinning the buffer when the guest enables profiling
> > (where by profiling enabled I mean that StatisticalProfilingEnabled()
> > returns true), and pinning the buffer as a result of a stage 2 fault only
> > in the situation I described above, would reduce the blackouts to a
> > minimum.
> 
> In all honesty, I'd rather see everything be done as the result
> of an S2 fault for now, and only introduce heuristics to reduce the blackout
> window at a later time. And this includes buffer pinning
> if that can be avoided.

I believe it's not feasible to do everything as a result of an SPE stage 2
fault. I've explained why in this reply [1]. Sorry for fragmenting the
discussion into so many different threads.

Having the first write, and only that first write, trigger a stage 2 fault
that KVM handles by pinning the buffer works because the guest hasn't
written anything useful to the buffer.

[1] https://lore.kernel.org/all/YuEVq8Au7YsDLOdI@monolith.localdoman/

> 
> My hunch is that people wanting zero blackout will always pin
> all their memory, one way or another, and that the rest of us
> will be happy just to get *something* out of SPE in a VM...

What do you have in mind when you say "one way or another"? Because
that would need changes to KVM (mlock() is not enough).

Thanks,
Alex

> 
>         M.
> -- 
> Jazz is not dead. It just smells funny...

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-27 11:57                 ` Alexandru Elisei
@ 2022-07-27 15:15                   ` Oliver Upton
  -1 siblings, 0 replies; 72+ messages in thread
From: Oliver Upton @ 2022-07-27 15:15 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: Marc Zyngier, Will Deacon, kvmarm, linux-arm-kernel

On Wed, Jul 27, 2022 at 12:57:16PM +0100, Alexandru Elisei wrote:
> Hi,
> 
> On Wed, Jul 27, 2022 at 12:08:11PM +0100, Marc Zyngier wrote:
> > On 2022-07-27 11:44, Alexandru Elisei wrote:
> > > On Wed, Jul 27, 2022 at 11:29:03AM +0100, Marc Zyngier wrote:
> > > > On 2022-07-27 11:19, Alexandru Elisei wrote:
> > > > > Hi Oliver,
> > > > >
> > > > > Thank you for the help, replies below.
> > > > >
> > > > > On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote:
> > > > > > Hi Alex,
> > > > > >
> > > > > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> > > > > >
> > > > > > [...]
> > > > > >
> > > > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > > > > > > enabled and avoid pinning all of VM memory that way, although I can't
> > > > > > > > immediately tell how flexible the architecture is in allowing you to cache
> > > > > > > > the base/limit values.
> > > > > > >
> > > > > > > I was investigating this approach, and Mark raised a concern that I think
> > > > > > > might be a showstopper.
> > > > > > >
> > > > > > > Let's consider this scenario:
> > > > > > >
> > > > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> > > > > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> > > > > > >
> > > > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> > > > > > > 2. Guest programs SPE to enable profiling at **EL0**
> > > > > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> > > > > > > 3. Guest changes the translation table entries for the buffer. The
> > > > > > > architecture allows this.
> > > > > > > 4. Guest does an ERET to EL0, thus enabling profiling.
> > > > > > >
> > > > > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
> > > > > > > the buffer at stage 2 when profiling gets enabled at EL0.
> > > > > >
> > > > > > Not saying we necessarily should, but this is possible with FGT no?
> > > > >
> > > > > It doesn't look to me like FEAT_FGT offers any knobs to trap ERET from
> > > > > EL1.
> > > > 
> > > > See HFGITR.ERET.
> > > 
> > > Ah, so that's the register, thanks!
> > > 
> > > I still am not sure that having FEAT_SPE, an Armv8.3 extension, depend on
> > > FEAT_FGT, an Armv8.6 extension, is the best idea. Do you know of any
> > > machines that have FEAT_SPE and FEAT_FGT?
> > 
> > None. Both are pretty niche, and the combination is nowhere
> > to be seen at the moment.
> 
> That was also my impression.
> 
> > 
> > > On the plus side, KVM could enable the trap only in the case above, and
> > > disable it after the ERET is trapped, so it should be relatively cheap
> > > to use.
> > 
> > This feels pretty horrible. Nothing says *when* EL1 will
> > alter the PTs. It could take tons of EL1->EL1 exceptions
> > before returning to EL0. And the change could happen after
> > an EL1->EL0->EL1 transition. At which point do you stop?
> 
> ERET trapping is enabled when PMBLIMITR_EL1.E = 1 and PMSCR_EL1.{E0SPE,E1SPE}
> = {1,0}. The first guest ERET from EL1 to EL0 enables profiling, at which
> point the buffer is pinned and ERET trapping is disabled.
> 
> The guest messing with the translation tables while profiling is enabled is
> the guest's problem, because that's not permitted by the architecture. Any
> stage 2 DABT taken while the buffer is pinned would be injected back into the
> guest as an SPE external abort (or something equivalent). Stage 1 DABTs are
> entirely the guest's problem to solve and would be injected back regardless
> of the status of the buffer.
> 
> Yes, I agree, there could be a lot of ERETs from EL1 to EL1 before the ERET
> to EL0; those ERETs would be uselessly trapped.
> 
> The above is a moot point anyway, because I believe we both agree that
> having SPE emulation depend on FEAT_FGT is best to be avoided.

LOL, I probably shouldn't have even mentioned it :) Completely agree
with you both, trapping ERET is bordering on mad.

--
Thanks,
Oliver

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-27 10:38             ` Alexandru Elisei
@ 2022-07-27 16:06               ` Oliver Upton
  -1 siblings, 0 replies; 72+ messages in thread
From: Oliver Upton @ 2022-07-27 16:06 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: Marc Zyngier, Will Deacon, kvmarm, linux-arm-kernel

On Wed, Jul 27, 2022 at 11:38:53AM +0100, Alexandru Elisei wrote:
> Hi Marc,
> 
> On Wed, Jul 27, 2022 at 10:52:34AM +0100, Marc Zyngier wrote:
> > On Wed, 27 Jul 2022 10:30:59 +0100,
> > Marc Zyngier <maz@kernel.org> wrote:
> > > 
> > > On Tue, 26 Jul 2022 18:51:21 +0100,
> > > Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > 
> > > > Doesn't pinning the buffer also imply pinning the stage 1 tables
> > > > responsible for its translation as well? I agree that pinning the buffer
> > > > is likely the best way forward as pinning the whole of guest memory is
> > > > entirely impractical.
> > 
> > Huh, I just realised that you were talking about S1. I don't think we
> > need to do this. As long as the translation falls into a mapped
> > region (pinned or not), we don't need to worry.

Right, but my issue is what happens when a fragment of the S1 becomes
unmapped at S2. We were discussing the idea of faulting once on the
buffer at the beginning of profiling, but it seems to me that the same
thing could just as easily happen at runtime and get tripped up by what
Alex points out below:

> PMBSR_EL1.DL might be set to 1 as a result of a stage 2 fault reported by SPE,
> which means the last record written is incomplete. Records have a variable
> size, so it's impossible for KVM to revert to the end of the last known
> good record without parsing the buffer (references here [1]). And even if
> KVM would know the size of a record, there's this bit in the Arm ARM which
> worries me (ARM DDI 0487H.a, page D10-5177):
> 
> "The architecture does not require that a sample record is written
> sequentially by the SPU, only that:
> [..]
> - On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates
>   whether PMBPTR_EL1 points to the first byte after the last complete
>   sample record."
> 
> So there might be gaps in the buffer, meaning that the entire buffer would
> have to be discarded if DL is set as a result of a stage 2 fault.
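
To spell out what that means for KVM's handler: on the buffer management
interrupt, something like the below is about the best we could do (sketch
only; the field positions are my reading of DDI 0487H.a, and every
kvm_spe_* helper is invented):

/* PMBSR_EL1 fields */
#define PMBSR_EC                GENMASK_ULL(31, 26)
#define PMBSR_EC_S2_FAULT       0x25    /* stage 2 DABT on buffer write */
#define PMBSR_DL                BIT(19) /* last record may be incomplete */
#define PMBSR_S                 BIT(17) /* service: an event is pending */

static void kvm_spe_handle_bmi(struct kvm_vcpu *vcpu, u64 pmbsr)
{
        if (!(pmbsr & PMBSR_S))
                return;

        /* Stage 1 faults, buffer full, etc. are the guest's to handle. */
        if (FIELD_GET(PMBSR_EC, pmbsr) != PMBSR_EC_S2_FAULT) {
                kvm_spe_inject_bmi(vcpu, pmbsr);        /* hypothetical */
                return;
        }

        kvm_spe_map_buffer_page(vcpu);                  /* hypothetical */

        /*
         * DL == 1 means PMBPTR_EL1 may sit past an incomplete record,
         * and gaps are permitted before it, so nothing written this
         * session can be trusted any more.
         */
        if (pmbsr & PMBSR_DL)
                kvm_spe_discard_buffer(vcpu);           /* hypothetical */
}

i.e. with DL set there is no way to salvage the data; mapping the page
after the fact doesn't help.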

I'm attempting to avoid thrashing with more threads, so I'm going to
summon back some context from your original reply, Marc:

> > > > Live migration also throws a wrench in this. IOW, there are still potential
> > > > sources of blackout unattributable to guest manipulation of the SPU.
> > >
> > > > Can you shed some light on this? I appreciate that you can't play the
> > > R/O trick on the SPE buffer as it invalidates the above discussion,
> > > but it should be relatively easy to track these pages and never reset
> > > them as clean until the vcpu is stopped. Unless you foresee other
> > > issues?

Right, we can play tricks on pre-copy to avoid write protecting the SPE
buffer. My concern was more around post-copy, where userspace could've
decided to leave the buffer behind and demand it back on the resulting
S2 fault.

> > > To be clear, I don't worry too much about these blind windows. The
> > > architecture doesn't really give us the right tools to make it work
> > > reliably, making this a best effort only. Unless we pin the whole
> > > guest and forego migration and other fault-driven mechanisms.
> > >
> > > Maybe that is a choice we need to give to the user: cheap, fast,
> > > reliable. Pick two.

As long as we crisply document the errata in KVM's virtualized SPE (and
inform the guest), that sounds reasonable. I'm just uneasy about
proceeding with an implementation w/ so many gotchas unless all parties
involved are aware of the quirks.

--
Thanks,
Oliver

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-07-25 10:06     ` Alexandru Elisei
@ 2022-08-01 17:00       ` Will Deacon
  -1 siblings, 0 replies; 72+ messages in thread
From: Will Deacon @ 2022-08-01 17:00 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, kvmarm, linux-arm-kernel

Hi Alex,

On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > > The approach I've taken so far in adding support for SPE in KVM [1] relies
> > > on pinning the entire VM memory to avoid SPE triggering stage 2 faults
> > > altogether. I've taken this approach because:
> > > 
> > > 1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults,
> > > and at the moment KVM has no way to resolve the VA to IPA translation.  The
> > > AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
> > > in the case of a stage 2 fault on a stage 1 translation table walk.
> > > 
> > > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > > means there will be a window where profiling is stopped from the moment SPE
> > triggers the fault to when the PE takes the interrupt. This blackout window
> > > is obviously not present when running on bare metal, as there is no second
> > > stage of address translation being performed.
> > 
> > Are these faults actually recoverable? My memory is a bit hazy here, but I
> > thought SPE buffer data could be written out in whacky ways such that even
> > a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1),
> > and so pinning is the only game in town.
> > 
> > A funkier approach might be to defer pinning of the buffer until the SPE is
> > enabled and avoid pinning all of VM memory that way, although I can't
> > immediately tell how flexible the architecture is in allowing you to cache
> > the base/limit values.
> 
> I was investigating this approach, and Mark raised a concern that I think
> might be a showstopper.
> 
> Let's consider this scenario:
> 
> Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> 
> 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> 2. Guest programs SPE to enable profiling at **EL0**
> (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> 3. Guest changes the translation table entries for the buffer. The
> architecture allows this.

The architecture also allows MMIO accesses to use writeback addressing
modes, but it doesn't provide a mechanism to virtualise them sensibly.

So I'd prefer that we don't pin all of guest memory just to satisfy a corner
case -- as long as the impact of a guest doing this funny sequence is
constrained to the guest, then I think pinning only what is required is
probably the most pragmatic approach.

Is it ideal? No, of course not, and we should probably try to get the debug
architecture extended to be properly virtualisable, but in the meantime
having major operating systems as guests and being able to use SPE without
pinning seems like a major design goal to me.

In any case, that's just my thinking on this and I defer to Oliver and
Marc on the ultimate decision.

Will

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-08-01 17:00       ` Will Deacon
@ 2022-08-02  9:49         ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-08-02  9:49 UTC (permalink / raw)
  To: Will Deacon; +Cc: maz, kvmarm, linux-arm-kernel

Hi,

(+Oliver)

On Mon, Aug 01, 2022 at 06:00:56PM +0100, Will Deacon wrote:
> Hi Alex,
> 
> On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
> > On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> > > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > > > The approach I've taken so far in adding support for SPE in KVM [1] relies
> > > > on pinning the entire VM memory to avoid SPE triggering stage 2 faults
> > > > altogether. I've taken this approach because:
> > > > 
> > > > 1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults,
> > > > and at the moment KVM has no way to resolve the VA to IPA translation.  The
> > > > AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
> > > > in the case of a stage 2 fault on a stage 1 translation table walk.
> > > > 
> > > > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > > > means there will be a window where profiling is stopped from the moment SPE
> > > > triggers the fault to when the PE takes the interrupt. This blackout window
> > > > is obviously not present when running on bare metal, as there is no second
> > > > stage of address translation being performed.
> > > 
> > > Are these faults actually recoverable? My memory is a bit hazy here, but I
> > > thought SPE buffer data could be written out in whacky ways such that even
> > > a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1),
> > > and so pinning is the only game in town.
> > > 
> > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > enabled and avoid pinning all of VM memory that way, although I can't
> > > immediately tell how flexible the architecture is in allowing you to cache
> > > the base/limit values.
> > 
> > I was investigating this approach, and Mark raised a concern that I think
> > might be a showstopper.
> > 
> > Let's consider this scenario:
> > 
> > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> > 
> > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> > 2. Guest programs SPE to enable profiling at **EL0**
> > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> > 3. Guest changes the translation table entries for the buffer. The
> > architecture allows this.
> 
> The architecture also allows MMIO accesses to use writeback addressing
> modes, but it doesn't provide a mechanism to virtualise them sensibly.
> 
> So I'd prefer that we don't pin all of guest memory just to satisfy a corner
> case -- as long as the impact of a guest doing this funny sequence is
> constrained to the guest, then I think pinning only what is required is
> probably the most pragmatic approach.
> 
> Is it ideal? No, of course not, and we should probably try to get the debug
> architecture extended to be properly virtualisable, but in the meantime
> having major operating systems as guests and being able to use SPE without
> pinning seems like a major design goal to me.
> 
> In any case, that's just my thinking on this and I defer to Oliver and
> Marc on the ultimate decision.

Thank you for the input.

To summarize the approaches we've discussed so far:

1. Pinning the entire guest memory
- Heavy handed and not ideal.
- Tried this approach in v5 of the SPE series [1], patches #2-#12.

2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2
faults reported by SPE.
- Not feasible, because the entire contents of the buffer must be discarded if
  PMBSR_EL1.DL is set to 1 when taking the fault.
- Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA,
  not the IPA.

3. Pinning the guest SPE buffer when profiling becomes enabled*:
- There is the corner case described above, when profiling becomes enabled as a
  result of an ERET to EL0. This can happen when the buffer is enabled and
  PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
- The previous buffer is unpinned when a new buffer is pinned, to avoid SPE
  stage 2 faults when draining the buffer, which is performed with profiling
  disabled.
- Also requires KVM to walk the guest's stage 1 tables.

4. Pin the entire guest SPE buffer after the first stage 2 fault reported by
SPE.
- Gets rid of the corner case at 3.
- Same approach to buffer unpinning as 3.
- Introduces a blackout window before the first record is written.
- Also requires KVM to walk the guest's stage 1 tables.

As for the corner case at 3, I proposed either:

a) Mandate that guest operating systems must never modify the buffer
translation entries if the buffer is enabled and
PMSCR_EL1.{E0SPE,E1SPE} = {1,0}.

b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE,
but **only** for this corner case. For all other cases, the buffer is pinned
when profiling becomes enabled, to eliminate the blackout window. Guest
operating systems can be modified to not change the translation entries for the
buffer if this blackout window is not desirable.

Pinning as a result of the **first** stage 2 fault should work, because there
are no prior records that would have to be discarded if PMBSR_EL1.DL = 1.
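
To make b) a bit more concrete, the pin path I have in mind is roughly
the following (sketch only; kvm_walk_guest_s1() and kvm_s2_pin_page()
are exactly the pieces that don't exist today, and the PMBPTR_EL1 /
PMBLIMITR_EL1 shadow state comes with the SPE series):

static int kvm_spe_pin_buffer(struct kvm_vcpu *vcpu)
{
        u64 ptr = __vcpu_sys_reg(vcpu, PMBPTR_EL1) & PAGE_MASK;
        u64 limit = __vcpu_sys_reg(vcpu, PMBLIMITR_EL1) & GENMASK_ULL(63, 12);
        u64 va;
        int ret;

        /* Drop the previously pinned buffer (the unpinning rule above). */
        kvm_spe_unpin_buffer(vcpu);

        /* Pin from the current write pointer up to the limit. */
        for (va = ptr; va < limit; va += PAGE_SIZE) {
                gpa_t ipa;

                /* VA -> IPA via the guest's stage 1 tables */
                ret = kvm_walk_guest_s1(vcpu, va, &ipa);
                if (ret)
                        return ret;     /* stage 1 fault: guest's problem */

                /* map and pin the IPA at stage 2 */
                ret = kvm_s2_pin_page(vcpu->kvm, ipa);
                if (ret)
                        return ret;
        }

        return 0;
}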

I hope I haven't missed anything. Thoughts and suggestions more than welcome.

*Profiling enabled, as per the Arm ARM, means the buffer is enabled and sampling is
enabled at the current exception level.

[1] https://lore.kernel.org/all/20211117153842.302159-1-alexandru.elisei@arm.com/

Thanks,
Alex

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-08-02  9:49         ` Alexandru Elisei
@ 2022-08-02 19:34           ` Oliver Upton
  -1 siblings, 0 replies; 72+ messages in thread
From: Oliver Upton @ 2022-08-02 19:34 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hi folks,

On Tue, Aug 02, 2022 at 10:49:07AM +0100, Alexandru Elisei wrote:
> > > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > > enabled and avoid pinning all of VM memory that way, although I can't
> > > > immediately tell how flexible the architecture is in allowing you to cache
> > > > the base/limit values.
> > > 
> > > I was investigating this approach, and Mark raised a concern that I think
> > > might be a showstopper.
> > > 
> > > Let's consider this scenario:
> > > 
> > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> > > 
> > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> > > 2. Guest programs SPE to enable profiling at **EL0**
> > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> > > 3. Guest changes the translation table entries for the buffer. The
> > > architecture allows this.
> > 
> > The architecture also allows MMIO accesses to use writeback addressing
> > modes, but it doesn't provide a mechanism to virtualise them sensibly.
> > 
> > So I'd prefer that we don't pin all of guest memory just to satisfy a corner
> > case -- as long as the impact of a guest doing this funny sequence is
> > constrained to the guest, then I think pinning only what is required is
> > probably the most pragmatic approach.
> > 
> > Is it ideal? No, of course not, and we should probably try to get the debug
> > architecture extended to be properly virtualisable, but in the meantime
> > having major operating systems as guests and being able to use SPE without
> > pinning seems like a major design goal to me.
> > 
> > In any case, that's just my thinking on this and I defer to Oliver and
> > Marc on the ultimate decision.

Thanks for chiming in, Will; I very much agree that pragmatism is likely
the best route forward. While fun to poke at all the pitfalls of
virtualizing SPE, pulling tricks in KVM probably has marginal return
over a simpler approach.

> Thank you for the input.
> 
> To summarize the approaches we've discussed so far:
> 
> 1. Pinning the entire guest memory
> - Heavy handed and not ideal.
> - Tried this approach in v5 of the SPE series [1], patches #2-#12.
> 
> 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2
> faults reported by SPE.
> - Not feasible, because the entire contents of the buffer must be discarded if
>   PMBSR_EL1.DL is set to 1 when taking the fault.
> - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA,
>   not the IPA.
> 
> 3. Pinning the guest SPE buffer when profiling becomes enabled*:
> - There is the corner case described above, when profiling becomes enabled as a
>   result of an ERET to EL0. This can happen when the buffer is enabled and
>   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
> - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE
>   stage 2 faults when draining the buffer, which is performed with profiling
>   disabled.
> - Also requires KVM to walk the guest's stage 1 tables.
> 
> 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by
> SPE.
> - Gets rid of the corner case at 3.
> - Same approach to buffer unpinning as 3.
> - Introduces a blackout window before the first record is written.
> - Also requires KVM to walk the guest's stage 1 tables.
> 
> As for the corner case at 3, I proposed either:
> 
> a) Mandate that guest operating systems must never modify the buffer
> translation entries if the buffer is enabled and
> PMSCR_EL1.{E0SPE,E1SPE} = {1,0}.
> 
> b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE,
> but **only** for this corner case. For all other cases, the buffer is pinned
> when profiling becomes enabled, to eliminate the blackout window. Guest
> operating systems can be modified to not change the translation entries for the
> buffer if this blackout window is not desirable.
> 
> Pinning as a result of the **first** stage 2 fault should work, because there
> are no prior records that would have to be discarded if PMBSR_EL1.DL = 1.
> 
> I hope I haven't missed anything. Thoughts and suggestions more than welcome.

Thanks Alex for pulling together all of the context here.

Unless there are any other strong opinions on the topic, it seems to me
that option #4 (pin on S2 fault) is probably the best approach for
the initial implementation. No amount of tricks in KVM can work around
the fact that SPE has some serious issues w.r.t. virtualization. With
that, we should probably document the behavior of SPE as a known erratum
of KVM.

If folks complain about EL1 profile blackout, eagerly pinning when
profiling is enabled could layer on top quite easily by treating it as
a synthetic S2 fault and triggering the implementation of #4. Having
said that I don't believe it is a hard requirement for enabling some
flavor of SPE for guests.

Walking guest S1 in KVM doesn't sound too exciting although it'll need to
be done eventually.
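
For the sunny-day case it needn't be much more than the below, though
(very rough sketch: 4K granule, 48-bit VAs, four levels, and it ignores
permissions, the access flag, endianness and every TCR_EL1 knob; only
kvm_read_guest() is a real API):

static int kvm_walk_guest_s1(struct kvm_vcpu *vcpu, u64 va, u64 *ipa)
{
        bool hi = va & BIT_ULL(55);     /* VA bit 55 selects the TTBR */
        u64 desc = __vcpu_sys_reg(vcpu, hi ? TTBR1_EL1 : TTBR0_EL1);
        unsigned int level, shift;

        for (level = 0; ; level++) {
                shift = 12 + 9 * (3 - level);

                /* read one 8-byte descriptor from guest memory */
                if (kvm_read_guest(vcpu->kvm,
                                   (desc & GENMASK_ULL(47, 12)) +
                                   ((va >> shift) & 0x1ff) * 8,
                                   &desc, sizeof(desc)))
                        return -EFAULT;

                if (!(desc & BIT(0)))
                        return -ENOENT;                 /* invalid entry */
                if (level == 3 || !(desc & BIT(1)))     /* page or block */
                        break;
        }

        *ipa = (desc & GENMASK_ULL(47, shift)) |
               (va & GENMASK_ULL(shift - 1, 0));
        return 0;
}

It's supporting all the other configurations (16K/64K granules, 52-bit
OAs, ...) that turns it into a project.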

Do you feel like this is an OK route forward, or have I missed
something?

--
Thanks,
Oliver

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-08-02 19:34           ` Oliver Upton
@ 2022-08-09 14:01             ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-08-09 14:01 UTC (permalink / raw)
  To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hi,

On Tue, Aug 02, 2022 at 12:34:40PM -0700, Oliver Upton wrote:
> Hi folks,
> 
> On Tue, Aug 02, 2022 at 10:49:07AM +0100, Alexandru Elisei wrote:
> > > > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > > > enabled and avoid pinning all of VM memory that way, although I can't
> > > > > immediately tell how flexible the architecture is in allowing you to cache
> > > > > the base/limit values.
> > > > 
> > > > I was investigating this approach, and Mark raised a concern that I think
> > > > might be a showstopper.
> > > > 
> > > > Let's consider this scenario:
> > > > 
> > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
> > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
> > > > 
> > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
> > > > 2. Guest programs SPE to enable profiling at **EL0**
> > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
> > > > 3. Guest changes the translation table entries for the buffer. The
> > > > architecture allows this.
> > > 
> > > The architecture also allows MMIO accesses to use writeback addressing
> > > modes, but it doesn't provide a mechanism to virtualise them sensibly.
> > > 
> > > So I'd prefer that we don't pin all of guest memory just to satisfy a corner
> > > case -- as long as the impact of a guest doing this funny sequence is
> > > constrained to the guest, then I think pinning only what is required is
> > > probably the most pragmatic approach.
> > > 
> > > Is it ideal? No, of course not, and we should probably try to get the debug
> > > architecture extended to be properly virtualisable, but in the meantime
> > > having major operating systems as guests and being able to use SPE without
> > > pinning seems like a major design goal to me.
> > > 
> > > In any case, that's just my thinking on this and I defer to Oliver and
> > > Marc on the ultimate decision.
> 
> Thanks for chiming in Will, very much agree that pragmatism is likely
> the best route forward. While fun to poke at all the pitfalls of
> virtualizing SPE, pulling tricks in KVM probably has marginal return
> over a simpler approach.
> 
> > Thank you for the input.
> > 
> > To summarize the approaches we've discussed so far:
> > 
> > 1. Pinning the entire guest memory
> > - Heavy handed and not ideal.
> > - Tried this approach in v5 of the SPE series [1], patches #2-#12.
> > 
> > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2
> > faults reported by SPE.
> > - Not feasible, because the entire contents of the buffer must be discarded if
> >   PMBSR_EL1.DL is set to 1 when taking the fault.
> > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA,
> >   not the IPA.
> > 
> > 3. Pinning the guest SPE buffer when profiling becomes enabled*:
> > - There is the corner case described above, when profiling becomes enabled as a
> >   result of an ERET to EL0. This can happen when the buffer is enabled and
> >   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
> > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE
> >   stage 2 faults when draining the buffer, which is performed with profiling
> >   disabled.
> > - Also requires KVM to walk the guest's stage 1 tables.
> > 
> > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by
> > SPE.
> > - Gets rid of the corner case at 3.
> > - Same approach to buffer unpinning as 3.
> > - Introduces a blackout window before the first record is written.
> > - Also requires KVM to walk the guest's stage 1 tables.
> > 
> > As for the corner case at 3, I proposed either:
> > 
> > a) Mandate that guest operating systems must never modify the buffer
> > translation entries if the buffer is enabled and
> > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}.
> > 
> > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE,
> > but **only** for this corner case. For all other cases, the buffer is pinned
> > when profiling becomes enabled, to eliminate the blackout window. Guest
> > operating systems can be modified to not change the translation entries for the
> > buffer if this blackout window is not desirable.
> > 
> > Pinning as a result of the **first** stage 2 fault should work, because there
> > are no prior records that would have to be discarded if PMBSR_EL1.DL = 1.
> > 
> > I hope I haven't missed anything. Thoughts and suggestions more than welcome.
> 
> Thanks Alex for pulling together all of the context here.
> 
> Unless there's any other strong opinions on the topic, it seems to me
> that option #4 (pin on S2 fault) is probably the best approach for
> the initial implementation. No amount of tricks in KVM can work around
> the fact that SPE has some serious issues w.r.t. virtualization. With
> that, we should probably document the behavior of SPE as a known erratum
> of KVM.
> 
> If folks complain about EL1 profile blackout, eagerly pinning when
> profiling is enabled could layer on top quite easily by treating it as
> a synthetic S2 fault and triggering the implementation of #4. Having

I'm not sure I follow what you mean by "treating it as a
synthetic S2 fault"; would you mind elaborating?

> said that I don't believe it is a hard requirement for enabling some
> flavor of SPE for guests.
> 
> Walking guest S1 in KVM doesn't sound too exciting although it'll need to
> be done eventually.
> 
> Do you feel like this is an OK route forward, or have I missed
> something?

I've been giving this some thought, and I prefer approach #3, because with
#4 (pinning the buffer as a result of a stage 2 fault reported by SPE) it
will be impossible to distinguish between a valid stage 2 fault (a fault
caused by the guest reprogramming the buffer and enabling profiling) and
KVM messing something up when pinning the buffer. I believe this to be
important, as experience has shown me that pinning the buffer at stage 2 is
not trivial and there isn't a mechanism in Linux today to do that
(explanation and examples here [1]).

With approach #4, it would be impossible to figure out if the results of a
profiling operation inside a guest are representative of the workload or
not, because those SPE stage 2 faults triggered by a bug in KVM can happen
multiple times per profiling session, introducing multiple blackout windows
that can skew the results.

If you're proposing that the blackout window when the first record is
written be documented as an erratum for KVM, then why not go a step further
and document as an erratum that changing the buffer translation tables
after the buffer has been enabled will lead to an SPE SError? That will
allow us to always pin the buffer when profiling is enabled.

[1] https://lore.kernel.org/all/YuEMkKY2RU%2F2KiZW@monolith.localdoman/

Thanks,
Alex

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-08-09 14:01             ` Alexandru Elisei
@ 2022-08-09 18:43               ` Oliver Upton
  -1 siblings, 0 replies; 72+ messages in thread
From: Oliver Upton @ 2022-08-09 18:43 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hi Alex,

On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote:

[...]

> > > To summarize the approaches we've discussed so far:
> > > 
> > > 1. Pinning the entire guest memory
> > > - Heavy handed and not ideal.
> > > - Tried this approach in v5 of the SPE series [1], patches #2-#12.
> > > 
> > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2
> > > faults reported by SPE.
> > > - Not feasible, because the entire contents of the buffer must be discarded if
> > >   PMBSR_EL1.DL is set to 1 when taking the fault.
> > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA,
> > >   not the IPA.
> > > 
> > > 3. Pinning the guest SPE buffer when profiling becomes enabled*:
> > > - There is the corner case described above, when profiling becomes enabled as a
> > >   result of an ERET to EL0. This can happen when the buffer is enabled and
> > >   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
> > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE
> > >   stage 2 faults when draining the buffer, which is performed with profiling
> > >   disabled.
> > > - Also requires KVM to walk the guest's stage 1 tables.
> > > 
> > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by
> > > SPE.
> > > - Gets rid of the corner case at 3.
> > > - Same approach to buffer unpinning as 3.
> > > - Introduces a blackout window before the first record is written.
> > > - Also requires KVM to walk the guest's stage 1 tables.
> > > 
> > > As for the corner case at 3, I proposed either:
> > > 
> > > a) Mandate that guest operating systems must never modify the buffer
> > > translation entries if the buffer is enabled and
> > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}.
> > > 
> > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE,
> > > but **only** for this corner case. For all other cases, the buffer is pinned
> > > when profiling becomes enabled, to eliminate the blackout window. Guest
> > > operating systems can be modified to not change the translation entries for the
> > > buffer if this blackout window is not desirable.
> > > 
> > > Pinning as a result of the **first** stage 2 fault should work, because there
> > > are no prior records that would have to be discarded if PMBSR_EL1.DL = 1.
> > > 
> > > I hope I haven't missed anything. Thoughts and suggestions more than welcome.
> > 
> > Thanks Alex for pulling together all of the context here.
> > 
> > Unless there's any other strong opinions on the topic, it seems to me
> > that option #4 (pin on S2 fault) is probably the best approach for
> > the initial implementation. No amount of tricks in KVM can work around
> > the fact that SPE has some serious issues w.r.t. virtualization. With
> > that, we should probably document the behavior of SPE as a known erratum
> > of KVM.
> > 
> > If folks complain about EL1 profile blackout, eagerly pinning when
> > profiling is enabled could layer on top quite easily by treating it as
> > a synthetic S2 fault and triggering the implementation of #4. Having
> 
> I'm not sure I follow what you mean by "treating it as a synthetic S2
> fault", would you mind elaborating?

Assuming approach #4 is implemented, we will already have an SPE fault
handler that walks stage-1 and pins the buffer. At that point,
implementing approach #3 would be relatively easy. When EL1 sets
PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer.
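
(Roughly, and with made-up names, assuming per-vCPU storage for the SPE
registers along the lines of your series:)

	/* Sketch: trap handler for a guest write to PMSCR_EL1. */
	static void kvm_spe_write_pmscr(struct kvm_vcpu *vcpu, u64 val)
	{
		__vcpu_sys_reg(vcpu, PMSCR_EL1) = val;

		/* E1SPE is PMSCR_EL1 bit 1. */
		if (val & BIT(1))
			/*
			 * Synthetic S2 fault: reuse the approach #4
			 * fault handler on the guest VA of the buffer
			 * write pointer.
			 */
			kvm_spe_handle_s2_fault(vcpu,
					__vcpu_sys_reg(vcpu, PMBPTR_EL1));
	}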

> > said that I don't believe it is a hard requirement for enabling some
> > flavor of SPE for guests.
> > 
> > Walking guest S1 in KVM doesn't sound too exciting although it'll need to
> > be done eventually.
> > 
> > Do you feel like this is an OK route forward, or have I missed
> > something?
> 
> I've been giving this some thought, and I prefer approach #3 because with
> #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it
> will be impossible to distinguish between a valid stage 2 fault (a fault
> caused by the guest reprogramming the buffer and enabling profiling) and
> KVM messing something up when pinning the buffer. I believe this to be
> important, as experience has shown me that pinning the buffer at stage 2 is
> not trivial and there isn't a mechanism today in Linux to do that
> (explanation and examples here [1]).

How does eagerly pinning avoid stage-2 aborts, though? As you note in
[1], page pinning does not avoid the possibility of the MMU notifiers
being called on a given range. Want to make sure I'm following, what
is your suggestion for approach #3 to handle the profile buffer when
only enabled at EL0?
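
(For reference, the gap I have in mind, sketched in kernel pseudo-C:)

	/* Long-term pinning keeps the pages resident ... */
	pin_user_pages(hva, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
		       pages, NULL);

	/*
	 * ... but it does not prevent the primary MMU from zapping the
	 * PTEs (MADV_DONTNEED, for example): the mmu_notifier still
	 * fires, KVM unmaps the range at stage 2, and the next SPE
	 * write to the buffer faults at stage 2 anyway.
	 */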

> With approach #4, it would be impossible to figure out if the results of a
> profiling operation inside a guest are representative of the workload or
> not, because those SPE stage 2 faults triggered by a bug in KVM can happen
> multiple times per profiling session, introducing multiple blackout windows
> that can skew the results.
> 
> If you're proposing that the blackout window when the first record is
> written be documented as an erratum for KVM, then why not go a step further
> and document as an erratum that changing the buffer translation tables
> after the buffer has been enabled will lead to an SPE SError? That will
> allow us to always pin the buffer when profiling is enabled.

Ah, there are certainly more errata in virtualizing SPE beyond what I
had said :) Preserving the stage-1 translations while profiling is
active is a good recommendation, although I'm not sure that we've
completely eliminated the risk of stage-2 faults. 

It seems impossible to blame the guest for all stage-2 faults that happen
in the middle of a profiling session. In addition to host mm driven changes
to stage-2, live migration is busted as well. You'd need to build out
stage-2 on the target before resuming the guest and guarantee that the
appropriate pages have been demanded from the source (in case of post-copy).

So, are we going to inject an SError for stage-2 faults outside of guest
control as well? An external abort reported as an SPE buffer management
event seems to be gracefully handled by the Linux driver, but that behavior
is disallowed by SPEv1p3.
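
(For reference, my reading of the syndrome KVM would have to synthesize;
the bit positions are from the Arm ARM, the macro names are made up:)

	/* PMBSR_EL1 layout (SPE buffer management event syndrome). */
	#define PMBSR_EC_SHIFT	26		/* event class, bits [31:26] */
	#define PMBSR_DL	(1UL << 19)	/* data lost */
	#define PMBSR_EA	(1UL << 18)	/* external abort */
	#define PMBSR_S		(1UL << 17)	/* service required */

	/* An injected external abort would look something like this. */
	u64 pmbsr = PMBSR_EA | PMBSR_S | PMBSR_DL;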

To sum up the point I'm getting at: I agree that there are ways to
reduce the risk of stage-2 faults in the middle of profiling, but I
don't believe the current architecture allows KVM to virtualize the
feature to the letter of the specification.

--
Thanks,
Oliver
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-08-09 18:43               ` Oliver Upton
@ 2022-08-10  9:37                 ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-08-10  9:37 UTC (permalink / raw)
  To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hi,

On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote:
> Hi Alex,
> 
> On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote:
> 
> [...]
> 
> > > > To summarize the approaches we've discussed so far:
> > > > 
> > > > 1. Pinning the entire guest memory
> > > > - Heavy handed and not ideal.
> > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12.
> > > > 
> > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2
> > > > faults reported by SPE.
> > > > - Not feasible, because the entire contents of the buffer must be discarded if
> > > >   PMBSR_EL1.DL is set to 1 when taking the fault.
> > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA,
> > > >   not the IPA.
> > > > 
> > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*:
> > > > - There is the corner case described above, when profiling becomes enabled as a
> > > >   result of an ERET to EL0. This can happen when the buffer is enabled and
> > > >   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
> > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE
> > > >   stage 2 faults when draining the buffer, which is performed with profiling
> > > >   disabled.
> > > > - Also requires KVM to walk the guest's stage 1 tables.
> > > > 
> > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by
> > > > SPE.
> > > > - Gets rid of the corner case at 3.
> > > > - Same approach to buffer unpinning as 3.
> > > > - Introduces a blackout window before the first record is written.
> > > > - Also requires KVM to walk the guest's stage 1 tables.
> > > > 
> > > > As for the corner case at 3, I proposed either:
> > > > 
> > > > a) Mandate that guest operating systems must never modify the buffer
> > > > translation entries if the buffer is enabled and
> > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}.
> > > > 
> > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE,
> > > > but **only** for this corner case. For all other cases, the buffer is pinned
> > > > when profiling becomes enabled, to eliminate the blackout window. Guest
> > > > operating systems can be modified to not change the translation entries for the
> > > > buffer if this blackout window is not desirable.
> > > > 
> > > > Pinning as a result of the **first** stage 2 fault should work, because there
> > > > are no prior records that would have to be discarded if PMBSR_EL1.DL = 1.
> > > > 
> > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome.
> > > 
> > > Thanks Alex for pulling together all of the context here.
> > > 
> > > Unless there's any other strong opinions on the topic, it seems to me
> > > that option #4 (pin on S2 fault) is probably the best approach for
> > > the initial implementation. No amount of tricks in KVM can work around
> > > the fact that SPE has some serious issues w.r.t. virtualization. With
> > > that, we should probably document the behavior of SPE as a known erratum
> > > of KVM.
> > > 
> > > If folks complain about EL1 profile blackout, eagerly pinning when
> > > profiling is enabled could layer on top quite easily by treating it as
> > > a synthetic S2 fault and triggering the implementation of #4. Having
> > 
> > I'm not sure I follow what you mean by "treating it as a synthetic S2
> > fault", would you mind elaborating?
> 
> Assuming approach #4 is implemented, we will already have an SPE fault
> handler that walks stage-1 and pins the buffer. At that point,
> implementing approach #3 would be relatively easy. When EL1 sets
> PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer.

I see, that makes sense, thanks,

> 
> > > said that I don't believe it is a hard requirement for enabling some
> > > flavor of SPE for guests.
> > > 
> > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to
> > > be done eventually.
> > > 
> > > Do you feel like this is an OK route forward, or have I missed
> > > something?
> > 
> > I've been giving this some thought, and I prefer approach #3 because with
> > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it
> > will be impossible to distinguish between a valid stage 2 fault (a fault
> > caused by the guest reprogramming the buffer and enabling profiling) and
> > KVM messing something up when pinning the buffer. I believe this to be
> > important, as experience has shown me that pinning the buffer at stage 2 is
> > not trivial and there isn't a mechanism today in Linux to do that
> > (explanation and examples here [1]).
> 
> How does eagerly pinning avoid stage-2 aborts, though? As you note in
> [1], page pinning does not avoid the possibility of the MMU notifiers
> being called on a given range. Want to make sure I'm following, what
> is your suggestion for approach #3 to handle the profile buffer when
> only enabled at EL0?
> 
> > With approach #4, it would be impossible to figure out if the results of a
> > profiling operation inside a guest are representative of the workload or
> > not, because those SPE stage 2 faults triggered by a bug in KVM can happen
> > multiple times per profiling session, introducing multiple blackout windows
> > that can skew the results.
> > 
> > If you're proposing that the blackout window when the first record is
> > written be documented as an erratum for KVM, then why not go a step further
> > and document as an erratum that changing the buffer translation tables
> > after the buffer has been enabled will lead to an SPE SError? That will
> > allow us to always pin the buffer when profiling is enabled.
> 
> Ah, there are certainly more errata in virtualizing SPE beyond what I
> had said :) Preserving the stage-1 translations while profiling is
> active is a good recommendation, although I'm not sure that we've
> completely eliminated the risk of stage-2 faults. 
> 
> It seems impossible to blame the guest for all stage-2 faults that happen
> in the middle of a profiling session. In addition to host mm driven changes
> to stage-2, live migration is busted as well. You'd need to build out
> stage-2 on the target before resuming the guest and guarantee that the
> appropriate pages have been demanded from the source (in case of post-copy).
> 
> So, are we going to inject an SError for stage-2 faults outside of guest
> control as well? An external abort reported as an SPE buffer management
> event seems to be gracefully handled by the Linux driver, but that behavior
> is disallowed by SPEv1p3.
> 
> To sum up the point I'm getting at: I agree that there are ways to
> reduce the risk of stage-2 faults in the middle of profiling, but I
> don't believe the current architecture allows KVM to virtualize the
> feature to the letter of the specification.

I believe there's some confusion here: emulating SPE **does not work** if
stage 2 faults are triggered in the middle of a profiling session. Being
able to have a memory range never unmapped from stage 2 is a
**prerequisite** and is **required** for SPE emulation, it's not a
nice-to-have.

A stage 2 fault before the first record is written is acceptable because
there are no other records already written which need to be thrown away.
Stage 2 faults after at least one record has been written are unacceptable
because it means that the contents of the buffer need to be thrown away.
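
(Stated as code, the rule I'm describing looks something like this, with
hypothetical helpers:)

	/* On an SPE-reported stage 2 fault. */
	if (!kvm_spe_records_written(vcpu)) {
		/*
		 * First fault: there is nothing to discard even if
		 * PMBSR_EL1.DL = 1, so pin the buffer and let the
		 * guest restart profiling.
		 */
		ret = kvm_spe_pin_buffer(vcpu, gva);
	} else {
		/*
		 * Later faults: records may have been lost and the
		 * entire buffer contents must be thrown away, which
		 * breaks the profiling session.
		 */
		ret = -EIO;
	}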

Does that make sense to you?

I believe it is doable to have addresses always mapped at stage 2 with some
changes to KVM, but that's not what this thread is about. This thread is
about how and when to pin the buffer.

As long as we're all agreed that buffer memory needs "pinning" (as in the
IPAs are never unmapped from stage 2 until KVM decides otherwise as part of
SPE emulation), I believe that live migration is tangential to figuring out
how and when the buffer should be "pinned". I'm more than happy to start a
separate thread about live migration after we figure out how we should go
about "pinning" the buffer, I think your insight would be most helpful :)

Thanks,
Alex

> 
> --
> Thanks,
> Oliver
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-08-10  9:37                 ` Alexandru Elisei
@ 2022-08-10 15:25                   ` Oliver Upton
  -1 siblings, 0 replies; 72+ messages in thread
From: Oliver Upton @ 2022-08-10 15:25 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

On Wed, Aug 10, 2022 at 10:37:26AM +0100, Alexandru Elisei wrote:
> Hi,
> 
> On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote:
> > Hi Alex,
> > 
> > On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote:
> > 
> > [...]
> > 
> > > > > To summarize the approaches we've discussed so far:
> > > > > 
> > > > > 1. Pinning the entire guest memory
> > > > > - Heavy handed and not ideal.
> > > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12.
> > > > > 
> > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2
> > > > > faults reported by SPE.
> > > > > - Not feasible, because the entire contents of the buffer must be discarded if
> > > > >   PMBSR_EL1.DL is set to 1 when taking the fault.
> > > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA,
> > > > >   not the IPA.
> > > > > 
> > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*:
> > > > > - There is the corner case described above, when profiling becomes enabled as a
> > > > >   result of an ERET to EL0. This can happen when the buffer is enabled and
> > > > >   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
> > > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE
> > > > >   stage 2 faults when draining the buffer, which is performed with profiling
> > > > >   disabled.
> > > > > - Also requires KVM to walk the guest's stage 1 tables.
> > > > > 
> > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by
> > > > > SPE.
> > > > > - Gets rid of the corner case at 3.
> > > > > - Same approach to buffer unpinning as 3.
> > > > > - Introduces a blackout window before the first record is written.
> > > > > - Also requires KVM to walk the guest's stage 1 tables.
> > > > > 
> > > > > As for the corner case at 3, I proposed either:
> > > > > 
> > > > > a) Mandate that guest operating systems must never modify the buffer
> > > > > translation entries if the buffer is enabled and
> > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}.
> > > > > 
> > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE,
> > > > > but **only** for this corner case. For all other cases, the buffer is pinned
> > > > > when profiling becomes enabled, to eliminate the blackout window. Guest
> > > > > operating systems can be modified to not change the translation entries for the
> > > > > buffer if this blackout window is not desirable.
> > > > > 
> > > > > Pinning as a result of the **first** stage 2 fault should work, because there
> > > > > are no prior records that would have to be discarded if PMBSR_EL1.DL = 1.
> > > > > 
> > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome.
> > > > 
> > > > Thanks Alex for pulling together all of the context here.
> > > > 
> > > > Unless there's any other strong opinions on the topic, it seems to me
> > > > that option #4 (pin on S2 fault) is probably the best approach for
> > > > the initial implementation. No amount of tricks in KVM can work around
> > > > the fact that SPE has some serious issues w.r.t. virtualization. With
> > > > that, we should probably document the behavior of SPE as a known erratum
> > > > of KVM.
> > > > 
> > > > If folks complain about EL1 profile blackout, eagerly pinning when
> > > > profiling is enabled could layer on top quite easily by treating it as
> > > > a synthetic S2 fault and triggering the implementation of #4. Having
> > > 
> > > I'm not sure I follow what you mean by "treating it as a synthetic S2
> > > fault", would you mind elaborating?
> > 
> > Assuming approach #4 is implemented, we will already have an SPE fault
> > handler that walks stage-1 and pins the buffer. At that point,
> > implementing approach #3 would be relatively easy. When EL1 sets
> > PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer.
> 
> I see, that makes sense, thanks,
> 
> > 
> > > > said that I don't believe it is a hard requirement for enabling some
> > > > flavor of SPE for guests.
> > > > 
> > > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to
> > > > be done eventually.
> > > > 
> > > > Do you feel like this is an OK route forward, or have I missed
> > > > something?
> > > 
> > > I've been giving this some thought, and I prefer approach #3 because with
> > > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it
> > > will be impossible to distinguish between a valid stage 2 fault (a fault
> > > caused by the guest reprogramming the buffer and enabling profiling) and
> > > KVM messing something up when pinning the buffer. I believe this to be
> > > important, as experience has shown me that pinning the buffer at stage 2 is
> > > not trivial and there isn't a mechanism today in Linux to do that
> > > (explanation and examples here [1]).
> > 
> > How does eagerly pinning avoid stage-2 aborts, though? As you note in
> > [1], page pinning does not avoid the possibility of the MMU notifiers
> > being called on a given range. Want to make sure I'm following, what
> > is your suggestion for approach #3 to handle the profile buffer when
> > only enabled at EL0?
> > 
> > > With approach #4, it would be impossible to figure out if the results of a
> > > profiling operation inside a guest are representative of the workload or
> > > not, because those SPE stage 2 faults triggered by a bug in KVM can happen
> > > multiple times per profiling session, introducing multiple blackout windows
> > > that can skew the results.
> > > 
> > > If you're proposing that the blackout window when the first record is
> > > written be documented as an erratum for KVM, then why not go a step further
> > > and document as an erratum that changing the buffer translation tables
> > > after the buffer has been enabled will lead to an SPE SError? That will
> > > allow us to always pin the buffer when profiling is enabled.
> > 
> > Ah, there are certainly more errata in virtualizing SPE beyond what I
> > had said :) Preserving the stage-1 translations while profiling is
> > active is a good recommendation, although I'm not sure that we've
> > completely eliminated the risk of stage-2 faults. 
> > 
> > It seems impossible to blame the guest for all stage-2 faults that happen
> > in the middle of a profiling session. In addition to host mm driven changes
> > to stage-2, live migration is busted as well. You'd need to build out
> > stage-2 on the target before resuming the guest and guarantee that the
> > appropriate pages have been demanded from the source (in case of post-copy).
> > 
> > So, are we going to inject an SError for stage-2 faults outside of guest
> > control as well? An external abort reported as an SPE buffer management
> > event seems to be gracefully handled by the Linux driver, but that behavior
> > is disallowed by SPEv1p3.
> > 
> > To sum up the point I'm getting at: I agree that there are ways to
> > reduce the risk of stage-2 faults in the middle of profiling, but I
> > don't believe the current architecture allows KVM to virtualize the
> > feature to the letter of the specification.
> 
> I believe there's some confusion here: emulating SPE **does not work** if
> stage 2 faults are triggered in the middle of a profiling session. Being
> able to have a memory range never unmapped from stage 2 is a
> **prerequisite** and is **required** for SPE emulation, it's not a
> nice-to-have.
> 
> A stage 2 fault before the first record is written is acceptable because
> there are no other records already written which need to be thrown away.
> Stage 2 faults after at least one record has been written are unacceptable
> because it means that the contents of the buffer need to be thrown away.
> 
> Does that make sense to you?
> 
> I believe it is doable to have addresses always mapped at stage 2 with some
> changes to KVM, but that's not what this thread is about. This thread is
> about how and when to pin the buffer.

Sorry if I've been forcing a tangent, but I believe there is a lot of
value in discussing what is to be done for keeping the stage-2 mapping
alive. I've been whining about it out of the very concern you highlight:
a stage-2 fault in the middle of the profile is game over. Otherwise,
optimizations in *when* we pin the buffer seem meaningless as stage-2
faults appear unavoidable.

Nonetheless, back to your proposal. Injecting some context from earlier:

> 3. Pinning the guest SPE buffer when profiling becomes enabled*:

So we are only doing this when enabled for EL1, right?
(PMSCR_EL1.{E0SPE,E1SPE} = {x, 1})

> - There is the corner case described above, when profiling becomes enabled as a
>   result of an ERET to EL0. This can happen when the buffer is enabled and
>   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};

Is your proposal for the EL0 case to pin on fault or pin when E0SPE is set
(outside of the architecture's definition of when profiling is enabled)?
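
(Restating the corner case I'm asking about as code; names are made up:)

	/* Sketch: can KVM pin eagerly at the trapped PMSCR_EL1 write? */
	static bool kvm_spe_pin_at_pmscr_write(u64 pmscr)
	{
		bool e0spe = pmscr & BIT(0);	/* sample at EL0 */
		bool e1spe = pmscr & BIT(1);	/* sample at EL1 */

		if (e1spe)
			return true;	/* profiling enabled immediately */

		/*
		 * {E0SPE,E1SPE} = {1,0}: profiling only becomes enabled
		 * on the ERET to EL0, which KVM does not trap, so there
		 * is no write to hook and the fallback is the first SPE
		 * stage 2 fault.
		 */
		return false;
	}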

> - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE
>   stage 2 faults when draining the buffer, which is performed with profiling
>   disabled.

Sounds reasonable.

> As long as we're all agreed that buffer memory needs "pinning" (as in the
> IPAs are never unmapped from stage 2 until KVM decides otherwise as part of
> SPE emulation), I believe that live migration is tangential to figuring out
> how and when the buffer should be "pinned". I'm more than happy to start a
> separate thread about live migration after we figure out how we should go
> about "pinning" the buffer, I think your insight would be most helpful :)

Fair enough, let's see how this all shakes out and then figure out LM
thereafter :)

--
Thanks,
Oliver
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-08-10 15:25                   ` Oliver Upton
@ 2022-08-12 13:05                     ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-08-12 13:05 UTC (permalink / raw)
  To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hi Oliver,

Just a note: for some reason some of your emails, but not all, don't show
up in my email client (mutt). That's why it might take me a while to send
a reply (I noticed that you had replied by looking for this thread on
lore.kernel.org).

On Wed, Aug 10, 2022 at 10:25:56AM -0500, Oliver Upton wrote:
> On Wed, Aug 10, 2022 at 10:37:26AM +0100, Alexandru Elisei wrote:
> > Hi,
> > 
> > On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote:
> > > Hi Alex,
> > > 
> > > On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote:
> > > 
> > > [...]
> > > 
> > > > > > To summarize the approaches we've discussed so far:
> > > > > > 
> > > > > > 1. Pinning the entire guest memory
> > > > > > - Heavy handed and not ideal.
> > > > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12.
> > > > > > 
> > > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2
> > > > > > faults reported by SPE.
> > > > > > - Not feasible, because the entire contents of the buffer must be discarded if
> > > > > >   PMBSR_EL1.DL is set to 1 when taking the fault.
> > > > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA,
> > > > > >   not the IPA.
> > > > > > 
> > > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*:
> > > > > > - There is the corner case described above, when profiling becomes enabled as a
> > > > > >   result of an ERET to EL0. This can happen when the buffer is enabled and
> > > > > >   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
> > > > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE
> > > > > >   stage 2 faults when draining the buffer, which is performed with profiling
> > > > > >   disabled.
> > > > > > - Also requires KVM to walk the guest's stage 1 tables.
> > > > > > 
> > > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by
> > > > > > SPE.
> > > > > > - Gets rid of the corner case at 3.
> > > > > > - Same approach to buffer unpinning as 3.
> > > > > > - Introduces a blackout window before the first record is written.
> > > > > > - Also requires KVM to walk the guest's stage 1 tables.
> > > > > > 
> > > > > > As for the corner case at 3, I proposed either:
> > > > > > 
> > > > > > a) Mandate that guest operating systems must never modify the buffer
> > > > > > translation entries if the buffer is enabled and
> > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}.
> > > > > > 
> > > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE,
> > > > > > but **only** for this corner case. For all other cases, the buffer is pinned
> > > > > > when profiling becomes enabled, to eliminate the blackout window. Guest
> > > > > > operating systems can be modified to not change the translation entries for the
> > > > > > buffer if this blackout window is not desirable.
> > > > > > 
> > > > > > Pinning as a result of the **first** stage 2 fault should work, because there
> > > > > > are no prior records that would have to be discarded if PMBSR_EL1.DL = 1.
> > > > > > 
> > > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome.
> > > > > 
> > > > > Thanks Alex for pulling together all of the context here.
> > > > > 
> > > > > Unless there's any other strong opinions on the topic, it seems to me
> > > > > that option #4 (pin on S2 fault) is probably the best approach for
> > > > > the initial implementation. No amount of tricks in KVM can work around
> > > > > the fact that SPE has some serious issues w.r.t. virtualization. With
> > > > > that, we should probably document the behavior of SPE as a known erratum
> > > > > of KVM.
> > > > > 
> > > > > If folks complain about EL1 profile blackout, eagerly pinning when
> > > > > profiling is enabled could layer on top quite easily by treating it as
> > > > > a synthetic S2 fault and triggering the implementation of #4. Having
> > > > 
> > > > I'm not sure I follow what you mean by "treating it as a synthetic S2
> > > > fault", would you mind elaborating?
> > > 
> > > Assuming approach #4 is implemented, we will already have an SPE fault
> > > handler that walks stage-1 and pins the buffer. At that point,
> > > implementing approach #3 would be relatively easy. When EL1 sets
> > > PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer.
> > 
> > I see, that makes sense, thanks,
> > 
> > > 
> > > > > said that I don't believe it is a hard requirement for enabling some
> > > > > flavor of SPE for guests.
> > > > > 
> > > > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to
> > > > > be done eventually.
> > > > > 
> > > > > Do you feel like this is an OK route forward, or have I missed
> > > > > something?
> > > > 
> > > > I've been giving this some thought, and I prefer approach #3 because with
> > > > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it
> > > > will be impossible to distinguish between a valid stage 2 fault (a fault
> > > > caused by the guest reprogramming the buffer and enabling profiling) and
> > > > KVM messing something up when pinning the buffer. I believe this to be
> > > > important, as experience has shown me that pinning the buffer at stage 2 is
> > > > not trivial and there isn't a mechanism today in Linux to do that
> > > > (explanation and examples here [1]).
> > > 
> > > How does eagerly pinning avoid stage-2 aborts, though? As you note in
> > > [1], page pinning does not avoid the possibility of the MMU notifiers
> > > being called on a given range. Want to make sure I'm following: what
> > > is your suggestion for approach #3 to handle the profile buffer when
> > > only enabled at EL0?
> > > 
> > > > With approach #4, it would be impossible to figure out if the results of a
> > > > profiling operation inside a guest are representative of the workload or
> > > > not, because those SPE stage 2 faults triggered by a bug in KVM can happen
> > > > multiple times per profiling session, introducing multiple blackout windows
> > > > that can skew the results.
> > > > 
> > > > If you're proposing that the blackout window when the first record is
> > > > written be documented as an erratum for KVM, then why not go a step further
> > > > and document as an erratum that changing the buffer translation tables
> > > > after the buffer has been enabled will lead to an SPE SError? That will
> > > > allow us to always pin the buffer when profiling is enabled.
> > > 
> > > Ah, there are certainly more errata in virtualizing SPE beyond what I
> > > had said :) Preserving the stage-1 translations while profiling is
> > > active is a good recommendation, although I'm not sure that we've
> > > completely eliminated the risk of stage-2 faults. 
> > > 
> > > It seems impossible to blame the guest for all stage-2 faults that happen
> > > in the middle of a profiling session. In addition to host mm-driven changes
> > > to stage-2, live migration is busted as well. You'd need to build out
> > > stage-2 on the target before resuming the guest and guarantee that the
> > > appropriate pages have been demanded from the source (in case of post-copy).
> > > 
> > > So, are we going to inject an SError for stage-2 faults outside of guest
> > > control as well? An external abort reported as an SPE buffer management
> > > event seems to be gracefully handled by the Linux driver, but that behavior
> > > is disallowed by SPEv1p3.
> > > 
> > > To sum up the point I'm getting at: I agree that there are ways to
> > > reduce the risk of stage-2 faults in the middle of profiling, but I
> > > don't believe the current architecture allows KVM to virtualize the
> > > feature to the letter of the specification.
> > 
> > I believe there's some confusion here: emulating SPE **does not work** if
> > stage 2 faults are triggered in the middle of a profiling session. Being
> > able to have a memory range never unmapped from stage 2 is a
> > **prerequisite** and is **required** for SPE emulation, it's not a nice to
> > have.
> > 
> > A stage 2 fault before the first record is written is acceptable because
> > there are no other records already written which need to be thrown away.
> > Stage 2 faults after at least one record has been written are unacceptable
> > because it means that the contents of the buffer need to be thrown away.
> > 
> > Does that make sense to you?
> > 
> > I believe it is doable to have addresses always mapped at stage 2 with some
> > changes to KVM, but that's not what this thread is about. This thread is
> > about how and when to pin the buffer.
> 
> Sorry if I've been forcing a tangent, but I believe there is a lot of
> value in discussing what is to be done for keeping the stage-2 mapping
> alive. I've been whining about it out of the very concern you highlight:
> a stage-2 fault in the middle of the profile is game over. Otherwise,
> optimizations in *when* we pin the buffer seem meaningless as stage-2
> faults appear unavoidable.

The idea I had was to propagate the mmu_notifier_range->event field to the
arch code. Then keep track of the IPAs which KVM pinned with
pin_user_page(s) that translate the guest buffer, and don't unmap that IPA
from stage 2 if the event != MMU_NOTIFY_UNMAP. For a pinned page, all
notifier events except MMU_NOTIFY_UNMAP are caused by the mm subsystem
trying to change how that particular page is mapped.
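
Something like the rough sketch below is what I have in mind (untested,
and kvm_spe_ipa_is_pinned() is a made-up helper standing in for a lookup
in the list of pinned IPAs):

/*
 * Rough sketch only. Assumes the mmu_notifier event has been plumbed
 * through to the arch unmap path; kvm_spe_ipa_is_pinned() is an
 * invented helper that checks whether the IPA is one that KVM pinned
 * with pin_user_pages() when it pinned the profiling buffer.
 */
static bool kvm_spe_skip_unmap(struct kvm *kvm, gpa_t ipa,
			       enum mmu_notifier_event event)
{
	/* Pages that don't back the profiling buffer are unmapped as usual. */
	if (!kvm_spe_ipa_is_pinned(kvm, ipa))
		return false;

	/*
	 * For a pinned page, every event other than MMU_NOTIFY_UNMAP means
	 * the mm subsystem is only changing how the page is mapped, so keep
	 * the stage 2 entry in place.
	 */
	return event != MMU_NOTIFY_UNMAP;
}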

> 
> Nonetheless, back to your proposal. Injecting some context from earlier:
> 
> > 3. Pinning the guest SPE buffer when profiling becomes enabled*:
> 
> So we are only doing this when enabled for EL1, right?
> (PMSCR_EL1.{E0SPE,E1SPE} = {x, 1})

Yes, pin when PMBLIMITR_EL1.E = 1 and PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}.
Accesses to those registers can be trapped by KVM, so verifying the
condition becomes trivial.
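
For example, something like this (sketch only: SPE registers aren't part
of KVM's vcpu sysreg context today, so the register indices and the bit
macros below are assumed, and kvm_spe_pin_buffer() is a placeholder for
the stage 1 walk plus pin_user_pages() sequence):

/*
 * Sketch of the check run after a trapped write to PMBLIMITR_EL1 or
 * PMSCR_EL1 completes; names other than __vcpu_sys_reg() are assumptions.
 */
static void kvm_spe_update_pin(struct kvm_vcpu *vcpu)
{
	u64 pmblimitr = __vcpu_sys_reg(vcpu, PMBLIMITR_EL1);
	u64 pmscr = __vcpu_sys_reg(vcpu, PMSCR_EL1);

	/* Buffer enabled and sampling enabled at EL1? */
	if ((pmblimitr & PMBLIMITR_EL1_E) && (pmscr & PMSCR_EL1_E1SPE))
		kvm_spe_pin_buffer(vcpu);
}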

> 
> > - There is the corner case described above, when profiling becomes enabled as a
> >   result of an ERET to EL0. This can happen when the buffer is enabled and
> >   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
> 
> Is your proposal for the EL0 case to pin on fault or pin when E0SPE is set
> (outside of the architecture's definition of when profiling is enabled)?

The original proposal was to pin on the first fault in this case, yes.
That's because the architecture doesn't forbid changing the translation
entries for the buffer when PMBLIMITR_EL1.E = 1 and sampling is disabled
(PMSCR_EL1.{E0SPE,E1SPE} = {x, 0}).

But you mentioned adding a quirk/erratum to KVM in your proposal, and I was
thinking that we could add an erratum to avoid the case above by saying
that that behaviour is unpredictable. But that might restrict what
operating systems KVM can run in an SPE-enabled VM; I can do some digging
to find out how other operating systems use SPE, if you think adding the
quirk sounds reasonable.

> 
> > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE
> >   stage 2 faults when draining the buffer, which is performed with profiling
> >   disabled.
> 
> Sounds reasonable.
> 
> > As long as we're all agreed that buffer memory needs "pinning" (as in the
> > IPA are never unmapped from stage 2 until KVM decides otherwise as part of
> > SPE emulation), I believe that live migration is tangential to figuring out
> > how and when the buffer should be "pinned". I'm more than happy to start a
> > separate thread about live migration after we figure out how we should go
> > about "pinning" the buffer, I think your insight would be most helpful :)
> 
> Fair enough, let's see how this all shakes out and then figure out LM
> thereafter :)

Great, thanks!

Alex

> 
> --
> Thanks,
> Oliver
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-08-12 13:05                     ` Alexandru Elisei
@ 2022-08-17 15:05                       ` Oliver Upton
  -1 siblings, 0 replies; 72+ messages in thread
From: Oliver Upton @ 2022-08-17 15:05 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hi Alex,

On Fri, Aug 12, 2022 at 02:05:45PM +0100, Alexandru Elisei wrote:
> Hi Oliver,
> 
> Just a note, for some reason some of your emails, but not all, don't show up in
> my email client (mutt). That's why it might take me a while to send a reply
> (noticed that you replied by looking for this thread on lore.kernel.org).

Urgh, that's weird. Am I getting thrown into spam or something? Also, do
you know if you've been receiving Drew's email since he switched to
@linux.dev?

> On Wed, Aug 10, 2022 at 10:25:56AM -0500, Oliver Upton wrote:
> > On Wed, Aug 10, 2022 at 10:37:26AM +0100, Alexandru Elisei wrote:
> > > Hi,
> > > 
> > > On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote:
> > > > Hi Alex,
> > > > 
> > > > On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote:
> > > > 
> > > > [...]
> > > > 
> > > > > > > To summarize the approaches we've discussed so far:
> > > > > > > 
> > > > > > > 1. Pinning the entire guest memory
> > > > > > > - Heavy handed and not ideal.
> > > > > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12.
> > > > > > > 
> > > > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2
> > > > > > > faults reported by SPE.
> > > > > > > - Not feasible, because the entire contents of the buffer must be discarded if
> > > > > > >   PMBSR_EL1.DL is set to 1 when taking the fault.
> > > > > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA,
> > > > > > >   not the IPA.
> > > > > > > 
> > > > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*:
> > > > > > > - There is the corner case described above, when profiling becomes enabled as a
> > > > > > >   result of an ERET to EL0. This can happen when the buffer is enabled and
> > > > > > >   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
> > > > > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE
> > > > > > >   stage 2 faults when draining the buffer, which is performed with profiling
> > > > > > >   disabled.
> > > > > > > - Also requires KVM to walk the guest's stage 1 tables.
> > > > > > > 
> > > > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by
> > > > > > > SPE.
> > > > > > > - Gets rid of the corner case at 3.
> > > > > > > - Same approach to buffer unpinning as 3.
> > > > > > > - Introduces a blackout window before the first record is written.
> > > > > > > - Also requires KVM to walk the guest's stage 1 tables.
> > > > > > > 
> > > > > > > As for the corner case at 3, I proposed either:
> > > > > > > 
> > > > > > > a) Mandate that guest operating systems must never modify the buffer
> > > > > > > translation entries if the buffer is enabled and
> > > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}.
> > > > > > > 
> > > > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE,
> > > > > > > but **only** for this corner case. For all other cases, the buffer is pinned
> > > > > > > when profiling becomes enabled, to eliminate the blackout window. Guest
> > > > > > > operating systems can be modified to not change the translation entries for the
> > > > > > > buffer if this blackout window is not desirable.
> > > > > > > 
> > > > > > > Pinning as a result of the **first** stage 2 fault should work, because there
> > > > > > > are no prior records that would have to be discarded if PMBSR_EL1.DL = 1.
> > > > > > > 
> > > > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome.
> > > > > > 
> > > > > > Thanks Alex for pulling together all of the context here.
> > > > > > 
> > > > > > Unless there's any other strong opinions on the topic, it seems to me
> > > > > > that option #4 (pin on S2 fault) is probably the best approach for
> > > > > > the initial implementation. No amount of tricks in KVM can work around
> > > > > > the fact that SPE has some serious issues w.r.t. virtualization. With
> > > > > > that, we should probably document the behavior of SPE as a known erratum
> > > > > > of KVM.
> > > > > > 
> > > > > > If folks complain about EL1 profile blackout, eagerly pinning when
> > > > > > profiling is enabled could layer on top quite easily by treating it as
> > > > > > a synthetic S2 fault and triggering the implementation of #4. Having
> > > > > 
> > > > > I'm not sure I follow what you mean by "treating it as a
> > > > > synthetic S2 fault", would you mind elaborating?
> > > > 
> > > > Assuming approach #4 is implemented, we will already have an SPE fault
> > > > handler that walks stage-1 and pins the buffer. At that point,
> > > > implementing approach #3 would be relatively easy. When EL1 sets
> > > > PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer.
> > > 
> > > I see, that makes sense, thanks,
> > > 
> > > > 
> > > > > > said that I don't believe it is a hard requirement for enabling some
> > > > > > flavor of SPE for guests.
> > > > > > 
> > > > > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to
> > > > > > be done eventually.
> > > > > > 
> > > > > > Do you feel like this is an OK route forward, or have I missed
> > > > > > something?
> > > > > 
> > > > > I've been giving this some thought, and I prefer approach #3 because with
> > > > > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it
> > > > > will be impossible to distinguish between a valid stage 2 fault (a fault
> > > > > caused by the guest reprogramming the buffer and enabling profiling) and
> > > > > KVM messing something up when pinning the buffer. I believe this to be
> > > > > important, as experience has shown me that pinning the buffer at stage 2 is
> > > > > not trivial and there isn't a mechanism today in Linux to do that
> > > > > (explanation and examples here [1]).
> > > > 
> > > > How does eagerly pinning avoid stage-2 aborts, though? As you note in
> > > > [1], page pinning does not avoid the possibility of the MMU notifiers
> > > > being called on a given range. Want to make sure I'm following: what
> > > > is your suggestion for approach #3 to handle the profile buffer when
> > > > only enabled at EL0?
> > > > 
> > > > > With approach #4, it would be impossible to figure out if the results of a
> > > > > profiling operation inside a guest are representative of the workload or
> > > > > not, because those SPE stage 2 faults triggered by a bug in KVM can happen
> > > > > multiple times per profiling session, introducing multiple blackout windows
> > > > > that can skew the results.
> > > > > 
> > > > > If you're proposing that the blackout window when the first record is
> > > > > written be documented as an erratum for KVM, then why not go a step further
> > > > > and document as an erratum that changing the buffer translation tables
> > > > > after the buffer has been enabled will lead to an SPE SError? That will
> > > > > allow us to always pin the buffer when profiling is enabled.
> > > > 
> > > > Ah, there are certainly more errata in virtualizing SPE beyond what I
> > > > had said :) Preserving the stage-1 translations while profiling is
> > > > active is a good recommendation, although I'm not sure that we've
> > > > completely eliminated the risk of stage-2 faults. 
> > > > 
> > > > It seems impossible to blame the guest for all stage-2 faults that happen
> > > > in the middle of a profiling session. In addition to host mm-driven changes
> > > > to stage-2, live migration is busted as well. You'd need to build out
> > > > stage-2 on the target before resuming the guest and guarantee that the
> > > > appropriate pages have been demanded from the source (in case of post-copy).
> > > > 
> > > > So, are we going to inject an SError for stage-2 faults outside of guest
> > > > control as well? An external abort reported as an SPE buffer management
> > > > event seems to be gracefully handled by the Linux driver, but that behavior
> > > > is disallowed by SPEv1p3.
> > > > 
> > > > To sum up the point I'm getting at: I agree that there are ways to
> > > > reduce the risk of stage-2 faults in the middle of profiling, but I
> > > > don't believe the current architecture allows KVM to virtualize the
> > > > feature to the letter of the specification.
> > > 
> > > I believe there's some confusion here: emulating SPE **does not work** if
> > > stage 2 faults are triggered in the middle of a profiling session. Being
> > > able to have a memory range never unmapped from stage 2 is a
> > > **prerequisite** and is **required** for SPE emulation, it's not a nice to
> > > have.
> > > 
> > > A stage 2 fault before the first record is written is acceptable because
> > > there are no other records already written which need to be thrown away.
> > > Stage 2 faults after at least one record has been written are unacceptable
> > > because it means that the contents of the buffer need to be thrown away.
> > > 
> > > Does that make sense to you?
> > > 
> > > I believe it is doable to have addresses always mapped at stage 2 with some
> > > changes to KVM, but that's not what this thread is about. This thread is
> > > about how and when to pin the buffer.
> > 
> > Sorry if I've been forcing a tangent, but I believe there is a lot of
> > value in discussing what is to be done for keeping the stage-2 mapping
> > alive. I've been whining about it out of the very concern you highlight:
> > a stage-2 fault in the middle of the profile is game over. Otherwise,
> > optimizations in *when* we pin the buffer seem meaningless as stage-2
> > faults appear unavoidable.
> 
> The idea I had was to propagate the mmu_notifier_range->event field to the
> arch code. Then keep track of the IPAs which KVM pinned with
> pin_user_page(s) that translate the guest buffer, and don't unmap that IPA
> from stage 2 if the event != MMU_NOTIFY_UNMAP. For a pinned page, all
> notifier events except MMU_NOTIFY_UNMAP are caused by the mm subsystem
> trying to change how that particular page is mapped.
> 
> > 
> > Nonetheless, back to your proposal. Injecting some context from earlier:
> > 
> > > 3. Pinning the guest SPE buffer when profiling becomes enabled*:
> > 
> > So we are only doing this when enabled for EL1, right?
> > (PMSCR_EL1.{E0SPE,E1SPE} = {x, 1})
> 
> Yes, pin when PMBLIMITR_EL1.E = 1 and PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}.
> Accesses to those registers can be trapped by KVM, so verifying the
> condition becomes trivial.
> 
> > 
> > > - There is the corner case described above, when profiling becomes enabled as a
> > >   result of an ERET to EL0. This can happen when the buffer is enabled and
> > >   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
> > 
> > Is your proposal for the EL0 case to pin on fault or pin when E0SPE is set
> > (outside of the architecture's definition of when profiling is enabled)?
> 
> The original proposal was to pin on the first fault in this case, yes.
> That's because the architecture doesn't forbid changing the translation
> entries for the buffer when PMBLIMITR_EL1.E = 1 and sampling is disabled
> (PMSCR_EL1.{E0SPE,E1SPE} = {x, 0}).
> 
> But you mentioned adding a quirk/erratum to KVM in your proposal, and I was
> thinking that we could add an erratum to avoid the case above by saying
> that that behaviour is unpredictable. But that might restrict what
> operating systems KVM can run in an SPE-enabled VM; I can do some digging
> to find out how other operating systems use SPE, if you think adding the
> quirk sounds reasonable.

Yeah, that would be good to follow up on what other OSes are doing.
You'll still have a nondestructive S2 fault handler for the SPE, right?
IOW, if PMBSR_EL1.DL=0 KVM will just unpin the old buffer and repin the
new one.
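
i.e. something along these lines (just a sketch, all the helper names
and the DL bit macro are invented):

static void kvm_spe_handle_s2_fault(struct kvm_vcpu *vcpu)
{
	u64 pmbsr = kvm_spe_read_pmbsr(vcpu);

	if (pmbsr & PMBSR_EL1_DL) {
		/* Records were lost; the buffer contents are unusable. */
		kvm_spe_inject_external_abort(vcpu);
		return;
	}

	/*
	 * DL = 0: nothing in the buffer was lost, so unpin the old
	 * buffer, repin at the faulting GVA, and let the guest carry
	 * on profiling.
	 */
	kvm_spe_unpin_buffer(vcpu);
	kvm_spe_pin_buffer(vcpu);
}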

--
Thanks,
Oliver
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-08-17 15:05                       ` Oliver Upton
@ 2022-09-12 14:50                         ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-09-12 14:50 UTC (permalink / raw)
  To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hi Oliver, 

On Wed, Aug 17, 2022 at 10:05:51AM -0500, Oliver Upton wrote:
> Hi Alex,
> 
> On Fri, Aug 12, 2022 at 02:05:45PM +0100, Alexandru Elisei wrote:
> > Hi Oliver,
> > 
> > Just a note, for some reason some of your emails, but not all, don't show up in
> > my email client (mutt). That's why it might take me a while to send a reply
> > (noticed that you replied by looking for this thread on lore.kernel.org).
> 
> Urgh, that's weird. Am I getting thrown into spam or something? Also, do
> you know if you've been receiving Drew's email since he switched to
> @linux.dev?

As far as I can tell, I am able to receive emails from Drew's new email
address.

I think it's because some of the macros I've been using in mutt interact
in a weird way with imap_keepalive. I disabled imap_keepalive and
everything looks to have been sorted out.

> 
> > On Wed, Aug 10, 2022 at 10:25:56AM -0500, Oliver Upton wrote:
> > > On Wed, Aug 10, 2022 at 10:37:26AM +0100, Alexandru Elisei wrote:
> > > > Hi,
> > > > 
> > > > On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote:
> > > > > Hi Alex,
> > > > > 
> > > > > On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > > > To summarize the approaches we've discussed so far:
> > > > > > > > 
> > > > > > > > 1. Pinning the entire guest memory
> > > > > > > > - Heavy handed and not ideal.
> > > > > > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12.
> > > > > > > > 
> > > > > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2
> > > > > > > > faults reported by SPE.
> > > > > > > > - Not feasible, because the entire contents of the buffer must be discarded if
> > > > > > > >   PMBSR_EL1.DL is set to 1 when taking the fault.
> > > > > > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA,
> > > > > > > >   not the IPA.
> > > > > > > > 
> > > > > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*:
> > > > > > > > - There is the corner case described above, when profiling becomes enabled as a
> > > > > > > >   result of an ERET to EL0. This can happen when the buffer is enabled and
> > > > > > > >   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
> > > > > > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE
> > > > > > > >   stage 2 faults when draining the buffer, which is performed with profiling
> > > > > > > >   disabled.
> > > > > > > > - Also requires KVM to walk the guest's stage 1 tables.
> > > > > > > > 
> > > > > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by
> > > > > > > > SPE.
> > > > > > > > - Gets rid of the corner case at 3.
> > > > > > > > - Same approach to buffer unpinning as 3.
> > > > > > > > - Introduces a blackout window before the first record is written.
> > > > > > > > - Also requires KVM to walk the guest's stage 1 tables.
> > > > > > > > 
> > > > > > > > As for the corner case at 3, I proposed either:
> > > > > > > > 
> > > > > > > > a) Mandate that guest operating systems must never modify the buffer
> > > > > > > > translation entries if the buffer is enabled and
> > > > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}.
> > > > > > > > 
> > > > > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE,
> > > > > > > > but **only** for this corner case. For all other cases, the buffer is pinned
> > > > > > > > when profiling becomes enabled, to eliminate the blackout window. Guest
> > > > > > > > operating systems can be modified to not change the translation entries for the
> > > > > > > > buffer if this blackout window is not desirable.
> > > > > > > > 
> > > > > > > > Pinning as a result of the **first** stage 2 fault should work, because there
> > > > > > > > are no prior records that would have to be discarded if PMBSR_EL1.DL = 1.
> > > > > > > > 
> > > > > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome.
> > > > > > > 
> > > > > > > Thanks Alex for pulling together all of the context here.
> > > > > > > 
> > > > > > > Unless there's any other strong opinions on the topic, it seems to me
> > > > > > > that option #4 (pin on S2 fault) is probably the best approach for
> > > > > > > the initial implementation. No amount of tricks in KVM can work around
> > > > > > > the fact that SPE has some serious issues w.r.t. virtualization. With
> > > > > > > that, we should probably document the behavior of SPE as a known erratum
> > > > > > > of KVM.
> > > > > > > 
> > > > > > > If folks complain about EL1 profile blackout, eagerly pinning when
> > > > > > > profiling is enabled could layer on top quite easily by treating it as
> > > > > > > a synthetic S2 fault and triggering the implementation of #4. Having
> > > > > > 
> > > > > > I'm not sure I follow what you mean by "treating it as a
> > > > > > synthetic S2 fault", would you mind elaborating?
> > > > > 
> > > > > Assuming approach #4 is implemented, we will already have an SPE fault
> > > > > handler that walks stage-1 and pins the buffer. At that point,
> > > > > implementing approach #3 would be relatively easy. When EL1 sets
> > > > > PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer.
> > > > 
> > > > I see, that makes sense, thanks,
> > > > 
> > > > > 
> > > > > > > said that I don't believe it is a hard requirement for enabling some
> > > > > > > flavor of SPE for guests.
> > > > > > > 
> > > > > > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to
> > > > > > > be done eventually.
> > > > > > > 
> > > > > > > Do you feel like this is an OK route forward, or have I missed
> > > > > > > something?
> > > > > > 
> > > > > > I've been giving this some thought, and I prefer approach #3 because with
> > > > > > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it
> > > > > > will be impossible to distinguish between a valid stage 2 fault (a fault
> > > > > > caused by the guest reprogramming the buffer and enabling profiling) and
> > > > > > KVM messing something up when pinning the buffer. I believe this to be
> > > > > > important, as experience has shown me that pinning the buffer at stage 2 is
> > > > > > not trivial and there isn't a mechanism today in Linux to do that
> > > > > > (explanation and examples here [1]).
> > > > > 
> > > > > How does eagerly pinning avoid stage-2 aborts, though? As you note in
> > > > > [1], page pinning does not avoid the possibility of the MMU notifiers
> > > > > being called on a given range. Want to make sure I'm following: what
> > > > > is your suggestion for approach #3 to handle the profile buffer when
> > > > > only enabled at EL0?
> > > > > 
> > > > > > With approach #4, it would be impossible to figure out if the results of a
> > > > > > profiling operation inside a guest are representative of the workload or
> > > > > > not, because those SPE stage 2 faults triggered by a bug in KVM can happen
> > > > > > multiple times per profiling session, introducing multiple blackout windows
> > > > > > that can skew the results.
> > > > > > 
> > > > > > If you're proposing that the blackout window when the first record is
> > > > > > written be documented as an erratum for KVM, then why not go a step further
> > > > > > and document as an erratum that changing the buffer translation tables
> > > > > > after the buffer has been enabled will lead to an SPE SError? That will
> > > > > > allow us to always pin the buffer when profiling is enabled.
> > > > > 
> > > > > Ah, there are certainly more errata in virtualizing SPE beyond what I
> > > > > had said :) Preserving the stage-1 translations while profiling is
> > > > > active is a good recommendation, although I'm not sure that we've
> > > > > completely eliminated the risk of stage-2 faults. 
> > > > > 
> > > > > It seems impossible to blame the guest for all stage-2 faults that happen
> > > > > in the middle of a profiling session. In addition to host mm-driven changes
> > > > > to stage-2, live migration is busted as well. You'd need to build out
> > > > > stage-2 on the target before resuming the guest and guarantee that the
> > > > > appropriate pages have been demanded from the source (in case of post-copy).
> > > > > 
> > > > > So, are we going to inject an SError for stage-2 faults outside of guest
> > > > > control as well? An external abort reported as an SPE buffer management
> > > > > event seems to be gracefully handled by the Linux driver, but that behavior
> > > > > is disallowed by SPEv1p3.
> > > > > 
> > > > > To sum up the point I'm getting at: I agree that there are ways to
> > > > > reduce the risk of stage-2 faults in the middle of profiling, but I
> > > > > don't believe the current architecture allows KVM to virtualize the
> > > > > feature to the letter of the specification.
> > > > 
> > > > I believe there's some confusion here: emulating SPE **does not work** if
> > > > stage 2 faults are triggered in the middle of a profiling session. Being
> > > > able to have a memory range never unmapped from stage 2 is a
> > > > **prerequisite** and is **required** for SPE emulation, it's not a nice to
> > > > have.
> > > > 
> > > > A stage 2 fault before the first record is written is acceptable because
> > > > there are no other records already written which need to be thrown away.
> > > > Stage 2 faults after at least one record has been written are unacceptable
> > > > because it means that the contents of the buffer need to be thrown away.
> > > > 
> > > > Does that make sense to you?
> > > > 
> > > > I believe it is doable to have addresses always mapped at stage 2 with some
> > > > changes to KVM, but that's not what this thread is about. This thread is
> > > > about how and when to pin the buffer.
> > > 
> > > Sorry if I've been forcing a tangent, but I believe there is a lot of
> > > value in discussing what is to be done for keeping the stage-2 mapping
> > > alive. I've been whining about it out of the very concern you highlight:
> > > a stage-2 fault in the middle of the profile is game over. Otherwise,
> > > optimizations in *when* we pin the buffer seem meaningless as stage-2
> > > faults appear unavoidable.
> > 
> > The idea I had was to propagate the mmu_notifier_range->event field to the
> > arch code. Then keep track of the IPAs which KVM pinned with
> > pin_user_page(s) that translate the guest buffer, and don't unmap that IPA
> > from stage 2 if the event != MMU_NOTIFY_UNMAP. For a pinned page, all
> > notifier events except MMU_NOTIFY_UNMAP are caused by the mm subsystem
> > trying to change how that particular page is mapped.
> > 
> > > 
> > > Nonetheless, back to your proposal. Injecting some context from earlier:
> > > 
> > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*:
> > > 
> > > So we are only doing this when enabled for EL1, right?
> > > (PMSCR_EL1.{E0SPE,E1SPE} = {x, 1})
> > 
> > Yes, pin when PMBLIMITR_EL1.E = 1 and PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}.
> > Accesses to those registers can be trapped by KVM, so verifying the
> > condition becomes trivial.
> > 
> > > 
> > > > - There is the corner case described above, when profiling becomes enabled as a
> > > >   result of an ERET to EL0. This can happen when the buffer is enabled and
> > > >   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
> > > 
> > > Is your proposal for the EL0 case to pin on fault or pin when E0SPE is set
> > > (outside of the architecture's definition of when profiling is enabled)?
> > 
> > The original proposal was to pin on the first fault in this case, yes.
> > That's because the architecture doesn't forbid changing the translation
> > entries for the buffer when PMBLIMITR_EL1.E = 1 and sampling is disabled
> > (PMSCR_EL1.{E0SPE,E1SPE} = {x, 0}).
> > 
> > But you mentioned adding a quirk/erratum to KVM in your proposal, and I was
> > thinking that we could add an erratum to avoid the case above by saying
> > that that behaviour is unpredictable. But that might restrict what
> > operating systems KVM can run in an SPE-enabled VM; I can do some digging
> > to find out how other operating systems use SPE, if you think adding the
> > quirk sounds reasonable.
> 
> Yeah, that would be good to follow up on what other OSes are doing.

FreeBSD doesn't have an SPE driver.

I'm currently in the process of finding out how/if Windows implements
one.

> You'll still have a nondestructive S2 fault handler for the SPE, right?
> IOW, if PMBSR_EL1.DL=0 KVM will just unpin the old buffer and repin the
> new one.

This is how I think about it: an S2 DABT where DL == 0 can happen because
of something that the VMM, KVM, or the guest has done:

1. If it's because of something that the host's userspace did (a memslot
was changed while the VM was running, memory was munmap'ed, etc.). In this
case, there's no way for KVM to handle the SPE fault, so I would say that
the sensible approach would be to inject an SPE external abort.

2. If it's because of something that KVM did, that can only be because of a
bug in SPE emulation. In this case, it can happen again, which means
arbitrary blackout windows which can skew the profiling results. I would
much rather inject an SPE external abort then let the guest rely on
potentially bad profiling information.

3. The guest changes the mapping for the buffer when it shouldn't have: A.
when the architecture does allow it, but KVM doesn't support, or B. when
the architecture doesn't allow it. For both cases, I would much rather
inject an SPE external abort for the reasons above. Furthermore, for B, I
think it would be better to let the guest know as soon as possible that
it's not following the architecture.

In conclusion, I would prefer to treat all SPE S2 faults as errors.
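
As a sketch, the SPE stage 2 fault handler would then collapse to
something like this, with kvm_spe_inject_external_abort() standing in for
a hypothetical helper that sets up PMBSR_EL1 to report an external abort:

static int kvm_spe_handle_s2_fault(struct kvm_vcpu *vcpu)
{
	/*
	 * No fixup attempts: whatever the cause (1, 2 or 3 above), the
	 * profile can no longer be trusted, so report the error and let
	 * the guest kill the session.
	 */
	kvm_spe_inject_external_abort(vcpu);
	return 1;	/* resume the guest */
}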

Thanks,
Alex

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-09-12 14:50                         ` Alexandru Elisei
@ 2022-09-13 10:58                           ` Oliver Upton
  -1 siblings, 0 replies; 72+ messages in thread
From: Oliver Upton @ 2022-09-13 10:58 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hey Alex,

On Mon, Sep 12, 2022 at 03:50:46PM +0100, Alexandru Elisei wrote:

[...]

> > Yeah, that would be good to follow up on what other OSes are doing.
> 
> FreeBSD doesn't have an SPE driver.
> 
> I'm currently in the process of finding out how/if Windows implements an
> SPE driver.
> 
> > You'll still have a nondestructive S2 fault handler for the SPE, right?
> > IOW, if PMBSR_EL1.DL=0 KVM will just unpin the old buffer and repin the
> > new one.
> 
> This is how I think about it: an S2 DABT where DL == 0 can happen because of
> something that the VMM, KVM or the guest has done:
> 
> 1. It's because of something that the host's userspace did (the memslot was
> changed while the VM was running, memory was munmap'ed, etc.). In this case,
> there's no way for KVM to handle the SPE fault, so I would say that the
> sensible approach would be to inject an SPE external abort.
> 
> 2. It's because of something that KVM did, which can only be a bug in SPE
> emulation. In this case, it can happen again, which means arbitrary
> blackout windows which can skew the profiling results. I would much rather
> inject an SPE external abort than let the guest rely on potentially bad
> profiling information.
> 
> 3. The guest changes the mapping for the buffer when it shouldn't have: A.
> when the architecture does allow it, but KVM doesn't support it, or B. when
> the architecture doesn't allow it. For both cases, I would much rather
> inject an SPE external abort for the reasons above. Furthermore, for B, I
> think it would be better to let the guest know as soon as possible that
> it's not following the architecture.
> 
> In conclusion, I would prefer to treat all SPE S2 faults as errors.

My main concern with treating S2 faults as a synthetic external abort is
how this behavior progresses in later versions of the architecture.
SPEv1p3 disallows implementations from reporting external aborts via the
SPU, instead allowing only for an SError to be delivered to the core.

I caught up with Will on this for a little bit:

Instead of an external abort, how about reporting an IMP DEF buffer
management event to the guest? At least for the Linux driver it should
have the same effect of killing the session but the VM will stay
running. This way there's no architectural requirement to promote to an
SError.

--
Thanks,
Oliver

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-09-13 10:58                           ` Oliver Upton
@ 2022-09-13 12:41                             ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-09-13 12:41 UTC (permalink / raw)
  To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hi Oliver,

On Tue, Sep 13, 2022 at 11:58:47AM +0100, Oliver Upton wrote:
> Hey Alex,
> 
> On Mon, Sep 12, 2022 at 03:50:46PM +0100, Alexandru Elisei wrote:
> 
> [...]
> 
> > > Yeah, that would be good to follow up on what other OSes are doing.
> > 
> > FreeBSD doesn't have an SPE driver.
> > 
> > I'm currently in the process of finding out how/if Windows implements an
> > SPE driver.
> > 
> > > You'll still have a nondestructive S2 fault handler for the SPE, right?
> > > IOW, if PMBSR_EL1.DL=0 KVM will just unpin the old buffer and repin the
> > > new one.
> > 
> > This is how I think about it: an S2 DABT where DL == 0 can happen because of
> > something that the VMM, KVM or the guest has done:
> > 
> > 1. It's because of something that the host's userspace did (the memslot was
> > changed while the VM was running, memory was munmap'ed, etc.). In this case,
> > there's no way for KVM to handle the SPE fault, so I would say that the
> > sensible approach would be to inject an SPE external abort.
> > 
> > 2. It's because of something that KVM did, which can only be a bug in SPE
> > emulation. In this case, it can happen again, which means arbitrary
> > blackout windows which can skew the profiling results. I would much rather
> > inject an SPE external abort than let the guest rely on potentially bad
> > profiling information.
> > 
> > 3. The guest changes the mapping for the buffer when it shouldn't have: A.
> > when the architecture does allow it, but KVM doesn't support it, or B. when
> > the architecture doesn't allow it. For both cases, I would much rather
> > inject an SPE external abort for the reasons above. Furthermore, for B, I
> > think it would be better to let the guest know as soon as possible that
> > it's not following the architecture.
> > 
> > In conclusion, I would prefer to treat all SPE S2 faults as errors.
> 
> My main concern with treating S2 faults as a synthetic external abort is
> how this behavior progresses in later versions of the architecture.
> SPEv1p3 disallows implementations from reporting external aborts via the
> SPU, instead allowing only for an SError to be delivered to the core.

Ah, yes, missed that bit for SPEv1p3 (ARM DDI 0487H.a, page D10-5180).

> 
> I caught up with Will on this for a little bit:
> 
> Instead of an external abort, how about reporting an IMP DEF buffer
> management event to the guest? At least for the Linux driver it should
> have the same effect of killing the session but the VM will stay
> running. This way there's no architectural requirement to promote to an
> SError.

The only reason I proposed to inject an external abort is that KVM needs
a way to tell the guest that something outside of the guest's control went
wrong and it should drop the contents of the current profiling session. An
external abort reported by the SPU seemed to fit the bill.

By IMP DEF buffer management event I assume you mean PMBSR_EL1.EC=0b011111
(Buffer management event for an IMPLEMENTATION DEFINED reason). I'm
thinking that someone might run a custom kernel in a VM, like a vendor
downstream kernel, with patches that actually handle this exception class,
and injecting such an exception might not have the effects that KVM
expects. Am I overthinking things? Is that something that KVM should take
into consideration? I suppose KVM can and should also set
PMBSR_EL1.DL = 1, as that means per the architecture that the buffer
contents should be discarded.
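
For the sake of argument, the injection could then boil down to something
like this (rough sketch: the PMBSR_EL1 field offsets are from the Arm ARM,
but the vcpu sysreg slot and the helper which raises the buffer management
interrupt are made up):

/* PMBSR_EL1: EC is bits [31:26], DL is bit 19, S is bit 17. */
#define PMBSR_EC_SHIFT		26
#define PMBSR_EC_IMP_DEF	0x1fUL	/* 0b011111 */
#define PMBSR_DL		BIT(19)
#define PMBSR_S			BIT(17)

static void kvm_spe_inject_imp_def_event(struct kvm_vcpu *vcpu)
{
	u64 pmbsr = (PMBSR_EC_IMP_DEF << PMBSR_EC_SHIFT) |
		    PMBSR_S |	/* a buffer management event is pending */
		    PMBSR_DL;	/* records may be lost, discard the buffer */

	__vcpu_sys_reg(vcpu, PMBSR_EL1) = pmbsr;
	kvm_spe_raise_buffer_irq(vcpu);	/* made-up: kick the SPE interrupt */
}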

Thanks,
Alex

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-09-13 12:41                             ` Alexandru Elisei
@ 2022-09-13 14:13                               ` Oliver Upton
  -1 siblings, 0 replies; 72+ messages in thread
From: Oliver Upton @ 2022-09-13 14:13 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

On Tue, Sep 13, 2022 at 01:41:56PM +0100, Alexandru Elisei wrote:
> Hi Oliver,
> 
> On Tue, Sep 13, 2022 at 11:58:47AM +0100, Oliver Upton wrote:
> > Hey Alex,
> > 
> > On Mon, Sep 12, 2022 at 03:50:46PM +0100, Alexandru Elisei wrote:
> > 
> > [...]
> > 
> > > > Yeah, that would be good to follow up on what other OSes are doing.
> > > 
> > > FreeBSD doesn't have an SPE driver.
> > > 
> > > I'm currently in the process of finding out how/if Windows implements an
> > > SPE driver.
> > > 
> > > > You'll still have a nondestructive S2 fault handler for the SPE, right?
> > > > IOW, if PMBSR_EL1.DL=0 KVM will just unpin the old buffer and repin the
> > > > new one.
> > > 
> > > This is how I think about it: an S2 DABT where DL == 0 can happen because of
> > > something that the VMM, KVM or the guest has done:
> > > 
> > > 1. It's because of something that the host's userspace did (the memslot was
> > > changed while the VM was running, memory was munmap'ed, etc.). In this case,
> > > there's no way for KVM to handle the SPE fault, so I would say that the
> > > sensible approach would be to inject an SPE external abort.
> > > 
> > > 2. It's because of something that KVM did, which can only be a bug in SPE
> > > emulation. In this case, it can happen again, which means arbitrary
> > > blackout windows which can skew the profiling results. I would much rather
> > > inject an SPE external abort than let the guest rely on potentially bad
> > > profiling information.
> > > 
> > > 3. The guest changes the mapping for the buffer when it shouldn't have: A.
> > > when the architecture does allow it, but KVM doesn't support it, or B. when
> > > the architecture doesn't allow it. For both cases, I would much rather
> > > inject an SPE external abort for the reasons above. Furthermore, for B, I
> > > think it would be better to let the guest know as soon as possible that
> > > it's not following the architecture.
> > > 
> > > In conclusion, I would prefer to treat all SPE S2 faults as errors.
> > 
> > My main concern with treating S2 faults as a synthetic external abort is
> > how this behavior progresses in later versions of the architecture.
> > SPEv1p3 disallows implementations from reporting external aborts via the
> > SPU, instead allowing only for an SError to be delivered to the core.
> 
> Ah, yes, missed that bit for SPEv1p3 (ARM DDI 0487H.a, page D10-5180).
> 
> > 
> > I caught up with Will on this for a little bit:
> > 
> > Instead of an external abort, how about reporting an IMP DEF buffer
> > management event to the guest? At least for the Linux driver it should
> > have the same effect of killing the session but the VM will stay
> > running. This way there's no architectural requirement to promote to an
> > SError.
> 
> > The only reason I proposed to inject an external abort is that KVM needs
> > a way to tell the guest that something outside of the guest's control went
> > wrong and it should drop the contents of the current profiling session. An
> > external abort reported by the SPU seemed to fit the bill.
> 
> By IMP DEF buffer management event I assume you mean PMBSR_EL1.EC=0b011111
> (Buffer management event for an IMPLEMENTATION DEFINED reason).

Yup, that's it. You also get two whole bytes of room in PMBSR_EL1.MSS
which is also IMP DEF, so we could even stick some ASCII in there to
tell the guest how we really feel! :-P

> I'm thinking that someone might run a custom kernel in a VM, like a vendor
> downstream kernel, with patches that actually handle this exception class,
> and injecting such an exception might not have the effects that KVM
> expects. Am I overthinking things? Is that something that KVM should take
> into consideration? I suppose KVM can and should also set
> PMBSR_EL1.DL = 1, as that means per the architecture that the buffer
> contents should be discarded.

I agree with you that PMBSR_EL1.DL=1 is the right call for this. With
that, I'd be surprised if there was a guest that tried to pull some
tricks other than blowing away the profile. The other option that I
find funny is if we plainly report the S2 abort to the guest, but that
won't work well when nested comes into the picture.

--
Thanks,
Oliver

* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-09-13 14:13                               ` Oliver Upton
@ 2023-01-03 14:26                                 ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2023-01-03 14:26 UTC (permalink / raw)
  To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel

Hi,

Just a heads-up: I've sent a new proposal for SPE emulation which removes the
need to pin memory at stage 2 [1].

[1] https://lists.cs.columbia.edu/pipermail/kvmarm/2022-November/056637.html

Thanks,
Alex

On Tue, Sep 13, 2022 at 03:13:31PM +0100, Oliver Upton wrote:
> On Tue, Sep 13, 2022 at 01:41:56PM +0100, Alexandru Elisei wrote:
> > Hi Oliver,
> > 
> > On Tue, Sep 13, 2022 at 11:58:47AM +0100, Oliver Upton wrote:
> > > Hey Alex,
> > > 
> > > On Mon, Sep 12, 2022 at 03:50:46PM +0100, Alexandru Elisei wrote:
> > > 
> > > [...]
> > > 
> > > > > Yeah, that would be good to follow up on what other OSes are doing.
> > > > 
> > > > FreeBSD doesn't have an SPE driver.
> > > > 
> > > > I'm currently in the process of finding out how/if Windows implements an
> > > > SPE driver.
> > > > 
> > > > > You'll still have a nondestructive S2 fault handler for the SPE, right?
> > > > > IOW, if PMBSR_EL1.DL=0 KVM will just unpin the old buffer and repin the
> > > > > new one.
> > > > 
> > > > This is how I think about it: an S2 DABT where DL == 0 can happen because of
> > > > something that the VMM, KVM or the guest has done:
> > > > 
> > > > 1. It's because of something that the host's userspace did (the memslot was
> > > > changed while the VM was running, memory was munmap'ed, etc.). In this case,
> > > > there's no way for KVM to handle the SPE fault, so I would say that the
> > > > sensible approach would be to inject an SPE external abort.
> > > > 
> > > > 2. It's because of something that KVM did, which can only be a bug in SPE
> > > > emulation. In this case, it can happen again, which means arbitrary
> > > > blackout windows which can skew the profiling results. I would much rather
> > > > inject an SPE external abort than let the guest rely on potentially bad
> > > > profiling information.
> > > > 
> > > > 3. The guest changes the mapping for the buffer when it shouldn't have: A.
> > > > when the architecture does allow it, but KVM doesn't support it, or B. when
> > > > the architecture doesn't allow it. For both cases, I would much rather
> > > > inject an SPE external abort for the reasons above. Furthermore, for B, I
> > > > think it would be better to let the guest know as soon as possible that
> > > > it's not following the architecture.
> > > > 
> > > > In conclusion, I would prefer to treat all SPE S2 faults as errors.
> > > 
> > > My main concern with treating S2 faults as a synthetic external abort is
> > > how this behavior progresses in later versions of the architecture.
> > > SPEv1p3 disallows implementations from reporting external aborts via the
> > > SPU, instead allowing only for an SError to be delivered to the core.
> > 
> > Ah, yes, missed that bit for SPEv1p3 (ARM DDI 0487H.a, page D10-5180).
> > 
> > > 
> > > I caught up with Will on this for a little bit:
> > > 
> > > Instead of an external abort, how about reporting an IMP DEF buffer
> > > management event to the guest? At least for the Linux driver it should
> > > have the same effect of killing the session but the VM will stay
> > > running. This way there's no architectural requirement to promote to an
> > > SError.
> > 
> > The only reason I proposed to inject an external abort is that KVM needs
> > a way to tell the guest that something outside of the guest's control went
> > wrong and it should drop the contents of the current profiling session. An
> > external abort reported by the SPU seemed to fit the bill.
> > 
> > By IMP DEF buffer management event I assume you mean PMBSR_EL1.EC=0b011111
> > (Buffer management event for an IMPLEMENTATION DEFINED reason).
> 
> Yup, that's it. You also get two whole bytes of room in PMBSR_EL1.MSS
> which is also IMP DEF, so we could even stick some ASCII in there to
> tell the guest how we really feel! :-P
> 
> > I'm thinking that someone might run a custom kernel in a VM, like a vendor
> > downstream kernel, with patches that actually handle this exception class,
> > and injecting such an exception might not have the effects that KVM
> > expects. Am I overthinking things? Is that something that KVM should take
> > into consideration? I suppose KVM can and should also set
> > PMBSR_EL1.DL = 1, as that means per the architecture that the buffer
> > contents should be discarded.
> 
> I agree with you that PMBSR_EL1.DL=1 is the right call for this. With
> that, I'd be surprised if there was a guest that tried to pull some
> tricks other than blowing away the profile. The other option that I
> find funny is if we plainly report the S2 abort to the guest, but that
> won't work well when nested comes into the picture.
> 
> --
> Thanks,
> Oliver

end of thread

Thread overview:
2022-04-19 13:51 KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory Alexandru Elisei
2022-04-19 14:10 ` Will Deacon
2022-04-19 14:44   ` Alexandru Elisei
2022-04-19 14:59     ` Will Deacon
2022-04-19 15:20       ` Alexandru Elisei
2022-04-19 15:35         ` Alexandru Elisei
2022-07-25 10:06   ` Alexandru Elisei
2022-07-26 17:51     ` Oliver Upton
2022-07-27  9:30       ` Marc Zyngier
2022-07-27  9:52         ` Marc Zyngier
2022-07-27 10:38           ` Alexandru Elisei
2022-07-27 16:06             ` Oliver Upton
2022-07-27 10:56         ` Alexandru Elisei
2022-07-27 11:18           ` Marc Zyngier
2022-07-27 12:10             ` Alexandru Elisei
2022-07-27 10:19       ` Alexandru Elisei
2022-07-27 10:29         ` Marc Zyngier
2022-07-27 10:44           ` Alexandru Elisei
2022-07-27 11:08             ` Marc Zyngier
2022-07-27 11:57               ` Alexandru Elisei
2022-07-27 15:15                 ` Oliver Upton
2022-07-27 11:00       ` Alexandru Elisei
2022-08-01 17:00     ` Will Deacon
2022-08-02  9:49       ` Alexandru Elisei
2022-08-02 19:34         ` Oliver Upton
2022-08-09 14:01           ` Alexandru Elisei
2022-08-09 18:43             ` Oliver Upton
2022-08-10  9:37               ` Alexandru Elisei
2022-08-10 15:25                 ` Oliver Upton
2022-08-12 13:05                   ` Alexandru Elisei
2022-08-17 15:05                     ` Oliver Upton
2022-09-12 14:50                       ` Alexandru Elisei
2022-09-13 10:58                         ` Oliver Upton
2022-09-13 12:41                           ` Alexandru Elisei
2022-09-13 14:13                             ` Oliver Upton
2023-01-03 14:26                               ` Alexandru Elisei