* KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
@ 2022-04-19 13:51 ` Alexandru Elisei
  0 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-04-19 13:51 UTC (permalink / raw)
  To: will, mark.rutland, linux-arm-kernel, maz, james.morse, suzuki.poulose, kvmarm

The approach I've taken so far in adding support for SPE in KVM [1] relies
on pinning the entire VM memory to avoid SPE triggering stage 2 faults
altogether. I've taken this approach because:

1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults,
and at the moment KVM has no way to resolve the VA to IPA translation. The
AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
in the case of a stage 2 fault on a stage 1 translation table walk.

2. The stage 2 fault is reported asynchronously via an interrupt, which
means there will be a window where profiling is stopped from the moment SPE
triggers the fault to the moment the PE takes the interrupt. This blackout
window is obviously not present when running on bare metal, as there is no
second stage of address translation being performed.

I've been thinking about this approach, and I am considering translating
the VA reported by SPE to the IPA instead, thus treating SPE stage 2 data
aborts more like regular (MMU) data aborts. As I see it, this approach has
several merits over memory pinning:

- The stage 1 translation table walker is also needed for nested
virtualization, to emulate AT S1* instructions executed by the L1 guest
hypervisor.

- Walking the guest's translation tables is less of a departure from the
way KVM manages physical memory for a virtual machine today.

I had a discussion with Mark offline about this approach and he expressed a
very sensible concern: when a guest is profiling, there is a blackout
window where profiling is stopped which doesn't happen on bare metal (point
2 above).

My questions are:

1. Is having this blackout window, regardless of its size, unacceptable? If
it is, then I'll continue with the memory pinning approach.

2. If having a blackout window is acceptable, how large can this window be
before it becomes too much?

I can try to take some performance measurements to evaluate the blackout
window when using a stage 1 walker in relation to the buffer write speed on
different hardware. I have access to an N1SDP machine and an Ampere Altra
for this.

[1] https://lore.kernel.org/all/20211117153842.302159-1-alexandru.elisei@arm.com/

Thanks,
Alex

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-04-19 13:51 ` Alexandru Elisei
@ 2022-04-19 14:10 ` Will Deacon
  -1 siblings, 0 replies; 72+ messages in thread
From: Will Deacon @ 2022-04-19 14:10 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, kvmarm, linux-arm-kernel

On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> The approach I've taken so far in adding support for SPE in KVM [1] relies
> on pinning the entire VM memory to avoid SPE triggering stage 2 faults
> altogether. I've taken this approach because:
>
> 1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults,
> and at the moment KVM has no way to resolve the VA to IPA translation. The
> AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
> in the case of a stage 2 fault on a stage 1 translation table walk.
>
> 2. The stage 2 fault is reported asynchronously via an interrupt, which
> means there will be a window where profiling is stopped from the moment SPE
> triggers the fault to the moment the PE takes the interrupt. This blackout
> window is obviously not present when running on bare metal, as there is no
> second stage of address translation being performed.

Are these faults actually recoverable? My memory is a bit hazy here, but I
thought SPE buffer data could be written out in wacky ways such that even a
bog-standard page fault could result in unrecoverable data loss (i.e.
DL=1), and so pinning is the only game in town.

A funkier approach might be to defer pinning of the buffer until SPE is
enabled and avoid pinning all of VM memory that way, although I can't
immediately tell how flexible the architecture is in allowing you to cache
the base/limit values.

Will
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-04-19 14:10 ` Will Deacon
@ 2022-04-19 14:44 ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-04-19 14:44 UTC (permalink / raw)
  To: Will Deacon; +Cc: maz, kvmarm, linux-arm-kernel

Hi Will,

On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > The approach I've taken so far in adding support for SPE in KVM [1] relies
> > on pinning the entire VM memory to avoid SPE triggering stage 2 faults
> > altogether. I've taken this approach because:
> >
> > 1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults,
> > and at the moment KVM has no way to resolve the VA to IPA translation. The
> > AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
> > in the case of a stage 2 fault on a stage 1 translation table walk.
> >
> > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > means there will be a window where profiling is stopped from the moment SPE
> > triggers the fault to the moment the PE takes the interrupt. This blackout
> > window is obviously not present when running on bare metal, as there is no
> > second stage of address translation being performed.
>
> Are these faults actually recoverable? My memory is a bit hazy here, but I
> thought SPE buffer data could be written out in wacky ways such that even
> a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1),
> and so pinning is the only game in town.

Ah, I forgot about that, I think you're right (ARM DDI 0487H.a, page
D10-5177):

"The architecture does not require that a sample record is written
sequentially by the SPU, only that:
[..]
- On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates
whether PMBPTR_EL1 points to the first byte after the last complete
sample record.
- On an MMU fault or synchronous External abort, PMBPTR_EL1 serves as a
Fault Address Register."

and (page D10-5179):

"If a write to the Profiling Buffer generates a fault and PMBSR_EL1.S is 0,
then a Profiling Buffer management event is generated:
[..]
- If PMBPTR_EL1 is not the address of the first byte after the last
complete sample record written by the SPU, then PMBSR_EL1.DL is set to 1.
Otherwise, PMBSR_EL1.DL is unchanged."

Since there is no way to know the record size (well, unless
PMSIDR_EL1.MaxSize == PMBIDR_EL1.Align, but that's not an architectural
requirement), KVM cannot restore the write pointer to the address of the
last complete record + 1 to allow the guest to resume profiling without
corrupted records.

> A funkier approach might be to defer pinning of the buffer until SPE is
> enabled and avoid pinning all of VM memory that way, although I can't
> immediately tell how flexible the architecture is in allowing you to cache
> the base/limit values.

A guest can use this to pin the VM memory (or a significant part of it),
either by doing it on purpose, or by allocating new buffers as they get
full. This will probably result in KVM killing the VM if the pinned memory
is larger than ulimit's max locked memory, which I believe is going to be a
bad experience for a user caught unaware. Unless we don't want KVM to take
ulimit into account when pinning the memory, which as far as I can tell
goes against KVM's approach so far.

Thanks,
Alex
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-04-19 14:44 ` Alexandru Elisei
@ 2022-04-19 14:59 ` Will Deacon
  -1 siblings, 0 replies; 72+ messages in thread
From: Will Deacon @ 2022-04-19 14:59 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: maz, kvmarm, linux-arm-kernel

On Tue, Apr 19, 2022 at 03:44:02PM +0100, Alexandru Elisei wrote:
> On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > > means there will be a window where profiling is stopped from the moment SPE
> > > triggers the fault to the moment the PE takes the interrupt. This blackout
> > > window is obviously not present when running on bare metal, as there is no
> > > second stage of address translation being performed.
> >
> > Are these faults actually recoverable? My memory is a bit hazy here, but I
> > thought SPE buffer data could be written out in wacky ways such that even
> > a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1),
> > and so pinning is the only game in town.
>
> Ah, I forgot about that, I think you're right (ARM DDI 0487H.a, page
> D10-5177):
>
> "The architecture does not require that a sample record is written
> sequentially by the SPU, only that:
> [..]
> - On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates
> whether PMBPTR_EL1 points to the first byte after the last complete
> sample record.
> - On an MMU fault or synchronous External abort, PMBPTR_EL1 serves as a
> Fault Address Register."
>
> and (page D10-5179):
>
> "If a write to the Profiling Buffer generates a fault and PMBSR_EL1.S is 0,
> then a Profiling Buffer management event is generated:
> [..]
> - If PMBPTR_EL1 is not the address of the first byte after the last
> complete sample record written by the SPU, then PMBSR_EL1.DL is set to 1.
> Otherwise, PMBSR_EL1.DL is unchanged."
>
> Since there is no way to know the record size (well, unless
> PMSIDR_EL1.MaxSize == PMBIDR_EL1.Align, but that's not an architectural
> requirement), KVM cannot restore the write pointer to the address of the
> last complete record + 1 to allow the guest to resume profiling without
> corrupted records.
>
> > A funkier approach might be to defer pinning of the buffer until SPE is
> > enabled and avoid pinning all of VM memory that way, although I can't
> > immediately tell how flexible the architecture is in allowing you to cache
> > the base/limit values.
>
> A guest can use this to pin the VM memory (or a significant part of it),
> either by doing it on purpose, or by allocating new buffers as they get
> full. This will probably result in KVM killing the VM if the pinned memory
> is larger than ulimit's max locked memory, which I believe is going to be a
> bad experience for a user caught unaware. Unless we don't want KVM to take
> ulimit into account when pinning the memory, which as far as I can tell
> goes against KVM's approach so far.

Yeah, it gets pretty messy and ulimit definitely needs to be taken into
account, as it is today.

That said, we could just continue if the pinning fails and the guest gets to
keep the pieces if we get a stage-2 fault -- putting the device into an
error state and re-injecting the interrupt should cause the perf session in
the guest to fail gracefully. I don't think the complexity is necessarily
worth it, but pinning all of guest memory is really crap so it's worth
thinking about alternatives.

Will
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
  2022-04-19 14:59 ` Will Deacon
@ 2022-04-19 15:20 ` Alexandru Elisei
  -1 siblings, 0 replies; 72+ messages in thread
From: Alexandru Elisei @ 2022-04-19 15:20 UTC (permalink / raw)
  To: Will Deacon; +Cc: maz, kvmarm, linux-arm-kernel

Hi,

On Tue, Apr 19, 2022 at 03:59:46PM +0100, Will Deacon wrote:
> On Tue, Apr 19, 2022 at 03:44:02PM +0100, Alexandru Elisei wrote:
> > On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> > > Are these faults actually recoverable? My memory is a bit hazy here, but I
> > > thought SPE buffer data could be written out in wacky ways such that even
> > > a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1),
> > > and so pinning is the only game in town.
> >
> > Since there is no way to know the record size (well, unless
> > PMSIDR_EL1.MaxSize == PMBIDR_EL1.Align, but that's not an architectural
> > requirement), KVM cannot restore the write pointer to the address of the
> > last complete record + 1 to allow the guest to resume profiling without
> > corrupted records.
> >
> > > A funkier approach might be to defer pinning of the buffer until SPE is
> > > enabled and avoid pinning all of VM memory that way, although I can't
> > > immediately tell how flexible the architecture is in allowing you to cache
> > > the base/limit values.
> >
> > A guest can use this to pin the VM memory (or a significant part of it),
> > either by doing it on purpose, or by allocating new buffers as they get
> > full. This will probably result in KVM killing the VM if the pinned memory
> > is larger than ulimit's max locked memory, which I believe is going to be a
> > bad experience for a user caught unaware.
>
> Yeah, it gets pretty messy and ulimit definitely needs to be taken into
> account, as it is today.
>
> That said, we could just continue if the pinning fails and the guest gets to
> keep the pieces if we get a stage-2 fault -- putting the device into an
> error state and re-injecting the interrupt should cause the perf session in
> the guest to fail gracefully. I don't think the complexity is necessarily
> worth it, but pinning all of guest memory is really crap so it's worth
> thinking about alternatives.

On the subject of pinning the memory when the guest enables SPE: the guest
can configure SPE to profile userspace only. Programming is done at EL1,
where SPE is disabled, and KVM doesn't trap the ERET to EL0, so the only
sensible thing to do here is to pin the memory while SPE is still disabled.

If that fails, then how should KVM notify the guest that something went
wrong while SPE is disabled? KVM could inject an interrupt, as those are
asynchronous, and one could (rather weakly) argue that the interrupt might
have been raised because of something that happened in the previous
profiling session. But what if the guest never enabled SPE? What if the
guest is in the middle of configuring SPE and the interrupt handler isn't
even set up? Or should KVM not use an interrupt to report error conditions
to the guest, in which case, how can the guest detect that SPE is stopped?
Neither option looks particularly appealing to me.

Thanks,
Alex
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory @ 2022-04-19 15:20 ` Alexandru Elisei 0 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-04-19 15:20 UTC (permalink / raw) To: Will Deacon Cc: mark.rutland, linux-arm-kernel, maz, james.morse, suzuki.poulose, kvmarm Hi, On Tue, Apr 19, 2022 at 03:59:46PM +0100, Will Deacon wrote: > On Tue, Apr 19, 2022 at 03:44:02PM +0100, Alexandru Elisei wrote: > > On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote: > > > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote: > > > > 2. The stage 2 fault is reported asynchronously via an interrupt, which > > > > means there will be a window where profiling is stopped from the moment SPE > > > > triggers the fault and when the PE taks the interrupt. This blackout window > > > > is obviously not present when running on bare metal, as there is no second > > > > stage of address translation being performed. > > > > > > Are these faults actually recoverable? My memory is a bit hazy here, but I > > > thought SPE buffer data could be written out in whacky ways such that even > > > a bog-standard page fault could result in uncoverable data loss (i.e. DL=1), > > > and so pinning is the only game in town. > > > > Ah, I forgot about that, I think you're right (ARM DDI 0487H.a, page > > D10-5177): > > > > "The architecture does not require that a sample record is written > > sequentially by the SPU, only that: > > [..] > > - On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates > > whether PMBPTR_EL1 points to the first byte after the last complete > > sample record. > > - On an MMU fault or synchronous External abort, PMBPTR_EL1 serves as a > > Fault Address Register." > > > > and (page D10-5179): > > > > "If a write to the Profiling Buffer generates a fault and PMBSR_EL1.S is 0, > > then a Profiling Buffer management event is generated: > > [..] 
> > - If PMBPTR_EL1 is not the address of the first byte after the last > > complete sample record written by the SPU, then PMBSR_EL1.DL is set to 1. > > Otherwise, PMBSR_EL1.DL is unchanged." > > > > Since there is no way to know the record size (well, unless > > PMSIDR_EL1.MaxSize == PMBIDR_EL1.Align, but that's not an architectural > > requirement), it means that KVM cannot restore the write pointer to the > > address of the last complete record + 1, to allow the guest to resume > > profiling without corrupted records. > > > > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > enabled and avoid pinning all of VM memory that way, although I can't > > > immediately tell how flexible the architecture is in allowing you to cache > > > the base/limit values. > > > > A guest can use this to pin the VM memory (or a significant part of it), > > either by doing it on purpose, or by allocating new buffers as they get > > full. This will probably result in KVM killing the VM if the pinned memory > > is larger than ulimit's max locked memory, which I believe is going to be a > > bad experience for the user caught unaware. Unless we don't want KVM to > > take ulimit into account when pinning the memory, which as far as I can > > goes against KVM's approach so far. > > Yeah, it gets pretty messy and ulimit definitely needs to be taken into > account, as it is today. > > That said, we could just continue if the pinning fails and the guest gets to > keep the pieces if we get a stage-2 fault -- putting the device into an > error state and re-injecting the interrupt should cause the perf session in > the guest to fail gracefully. I don't think the complexity is necessarily > worth it, but pinning all of guest memory is really crap so it's worth > thinking about alternatives. On the subject of pinning the memory when guest enables SPE, the guest can configure SPE to profile userspace only. 
The programming is done at EL1, where, in this configuration, profiling is disabled. KVM doesn't trap the ERET to EL0, so the only sensible thing to do here is to pin the memory while SPE is disabled. If pinning fails, how should KVM notify the guest that something went wrong while SPE is disabled? KVM could inject an interrupt, as those are asynchronous, and one could (rather weakly) argue that the interrupt might have been raised because of something that happened in the previous profiling session, but what if the guest never enabled SPE? What if the guest is in the middle of configuring SPE and the interrupt handler isn't even set up? Or should KVM not use an interrupt to report error conditions to the guest, in which case, how can the guest detect that SPE has stopped? Neither option looks particularly appealing to me. Thanks, Alex _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 72+ messages in thread
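The recoverability rule quoted from the Arm ARM above can be condensed into a predicate. This is a hypothetical sketch, not KVM code: the helper name and the way the register fields are passed in are illustrative. If PMBSR_EL1.DL is clear, PMBPTR_EL1 already points to the first byte after the last complete record; if DL is set, the pointer can only be recomputed when every record has a fixed, known size, which per the discussion above holds only when PMBIDR_EL1.Align equals PMSIDR_EL1.MaxSize (an implementation choice, not an architectural requirement):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: can the hypervisor restore PMBPTR_EL1 to
 * "last complete record + 1" after a Profiling Buffer management event?
 * Align and MaxSize are passed as their log2 register encodings.
 */
static bool spe_write_ptr_restorable(uint8_t pmbidr_align,
                                     uint8_t pmsidr_maxsize,
                                     bool pmbsr_dl)
{
    /* DL=0: the pointer already sits just past the last complete record. */
    if (!pmbsr_dl)
        return true;
    /*
     * DL=1: the pointer may be mid-record.  Rounding down to a record
     * boundary is only possible when records have a deterministic size,
     * i.e. the buffer alignment equals the maximum record size.
     */
    return pmbidr_align == pmsidr_maxsize;
}
```

With any implementation where Align != MaxSize, a DL=1 fault is unrecoverable, which is what makes pinning attractive in the first place.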
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-04-19 15:20 ` Alexandru Elisei @ 2022-04-19 15:35 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-04-19 15:35 UTC (permalink / raw) To: Will Deacon; +Cc: maz, kvmarm, linux-arm-kernel Hi, On Tue, Apr 19, 2022 at 04:20:09PM +0100, Alexandru Elisei wrote: > Hi, > > On Tue, Apr 19, 2022 at 03:59:46PM +0100, Will Deacon wrote: > > On Tue, Apr 19, 2022 at 03:44:02PM +0100, Alexandru Elisei wrote: > > > On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote: > > > > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote: > > > > > 2. The stage 2 fault is reported asynchronously via an interrupt, which > > > > > means there will be a window where profiling is stopped from the moment SPE > > > > > triggers the fault and when the PE taks the interrupt. This blackout window > > > > > is obviously not present when running on bare metal, as there is no second > > > > > stage of address translation being performed. > > > > > > > > Are these faults actually recoverable? My memory is a bit hazy here, but I > > > > thought SPE buffer data could be written out in whacky ways such that even > > > > a bog-standard page fault could result in uncoverable data loss (i.e. DL=1), > > > > and so pinning is the only game in town. > > > > > > Ah, I forgot about that, I think you're right (ARM DDI 0487H.a, page > > > D10-5177): > > > > > > "The architecture does not require that a sample record is written > > > sequentially by the SPU, only that: > > > [..] > > > - On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates > > > whether PMBPTR_EL1 points to the first byte after the last complete > > > sample record. > > > - On an MMU fault or synchronous External abort, PMBPTR_EL1 serves as a > > > Fault Address Register." 
> > > > > > and (page D10-5179): > > > > > > "If a write to the Profiling Buffer generates a fault and PMBSR_EL1.S is 0, > > > then a Profiling Buffer management event is generated: > > > [..] > > > - If PMBPTR_EL1 is not the address of the first byte after the last > > > complete sample record written by the SPU, then PMBSR_EL1.DL is set to 1. > > > Otherwise, PMBSR_EL1.DL is unchanged." > > > > > > Since there is no way to know the record size (well, unless > > > PMSIDR_EL1.MaxSize == PMBIDR_EL1.Align, but that's not an architectural > > > requirement), it means that KVM cannot restore the write pointer to the > > > address of the last complete record + 1, to allow the guest to resume > > > profiling without corrupted records. > > > > > > > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > > enabled and avoid pinning all of VM memory that way, although I can't > > > > immediately tell how flexible the architecture is in allowing you to cache > > > > the base/limit values. > > > > > > A guest can use this to pin the VM memory (or a significant part of it), > > > either by doing it on purpose, or by allocating new buffers as they get > > > full. This will probably result in KVM killing the VM if the pinned memory > > > is larger than ulimit's max locked memory, which I believe is going to be a > > > bad experience for the user caught unaware. Unless we don't want KVM to > > > take ulimit into account when pinning the memory, which as far as I can > > > goes against KVM's approach so far. > > > > Yeah, it gets pretty messy and ulimit definitely needs to be taken into > > account, as it is today. > > > > That said, we could just continue if the pinning fails and the guest gets to > > keep the pieces if we get a stage-2 fault -- putting the device into an > > error state and re-injecting the interrupt should cause the perf session in > > the guest to fail gracefully. 
I don't think the complexity is necessarily > > worth it, but pinning all of guest memory is really crap so it's worth > > thinking about alternatives. > > On the subject of pinning the memory when guest enables SPE, the guest can > configure SPE to profile userspace only. Programming is done at EL1, and in > this case SPE is disabled. KVM doesn't trap the ERET to EL0, so the only > sensible thing to do here is to pin the memory when SPE is disabled. If it > fails, then how should KVM notify the guest that something went wrong when > SPE is disabled? KVM could inject an interrupt, as those are asynchronous > and one could (rather weakly) argue that the interrupt might have been > raised because of something that happened in the previous profiling > session, but what if the guest never enabled SPE? What if the guest is in > the middle of configuring SPE and the interrupt handler isn't even set? Or > should KVM not use an interrupt to report error conditions to the guest, in > which case, how can the guest detect that SPE is stopped? Come to think of it, KVM can defer injecting the interrupt until after an exit from the guest while the guest was executing at EL0 (when profiling would have been enabled from the guest's point of view). I think this should work, as a delay between the condition that causes an interrupt and the PE taking said interrupt is expected. Thoughts? I too would prefer not to have to pin the entire VM memory, and not having to ask userspace to increase max locked memory to the size of the VM memory looks a lot better to me. Thanks, Alex > > Both options don't look particularly appealing to me. 
> > Thanks, > Alex > _______________________________________________ > kvmarm mailing list > kvmarm@lists.cs.columbia.edu > https://lists.cs.columbia.edu/mailman/listinfo/kvmarm _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm ^ permalink raw reply [flat|nested] 72+ messages in thread
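The deferral idea above can be sketched as a tiny state machine. This is an assumption-laden illustration, not KVM code: the struct, field names, and the notion of a per-exit hook are all hypothetical. The point is that a pin failure recorded while the guest was at EL1 (profiling not yet live) only turns into an injected buffer management interrupt once the vCPU is observed exiting from EL0, where profiling would have been enabled:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vCPU SPE emulation state. */
struct vspe {
    bool pin_failed;   /* buffer pinning failed while SPE was disabled */
    bool irq_pending;  /* buffer management PPI queued for injection  */
};

/*
 * Called on every exit from the guest; guest_el is the exception level
 * the vCPU was running at.  Deferring until an EL0 exit means the guest
 * sees the interrupt only in a context where profiling was active, which
 * sidesteps the "interrupt handler isn't even set up yet" problem.
 */
static void vspe_on_guest_exit(struct vspe *s, int guest_el)
{
    if (s->pin_failed && guest_el == 0) {
        s->irq_pending = true;  /* delay before delivery is architecturally OK */
        s->pin_failed = false;
    }
}
```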
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-04-19 14:10 ` Will Deacon @ 2022-07-25 10:06 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-07-25 10:06 UTC (permalink / raw) To: Will Deacon; +Cc: maz, kvmarm, linux-arm-kernel Hi, On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote: > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote: > > The approach I've taken so far in adding support for SPE in KVM [1] relies > > on pinning the entire VM memory to avoid SPE triggering stage 2 faults > > altogether. I've taken this approach because: > > > > 1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults, > > and at the moment KVM has no way to resolve the VA to IPA translation. The > > AT instruction is not useful here, because PAR_EL1 doesn't report the IPA > > in the case of a stage 2 fault on a stage 1 translation table walk. > > > > 2. The stage 2 fault is reported asynchronously via an interrupt, which > > means there will be a window where profiling is stopped from the moment SPE > > triggers the fault and when the PE takes the interrupt. This blackout window > > is obviously not present when running on bare metal, as there is no second > > stage of address translation being performed. > > Are these faults actually recoverable? My memory is a bit hazy here, but I > thought SPE buffer data could be written out in whacky ways such that even > a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1), > and so pinning is the only game in town. > > A funkier approach might be to defer pinning of the buffer until the SPE is > enabled and avoid pinning all of VM memory that way, although I can't > immediately tell how flexible the architecture is in allowing you to cache > the base/limit values. I was investigating this approach, and Mark raised a concern that I think might be a showstopper. 
Let's consider this scenario: Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). 2. Guest programs SPE to enable profiling at **EL0** (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). 3. Guest changes the translation table entries for the buffer. The architecture allows this. 4. Guest does an ERET to EL0, thus enabling profiling. Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin the buffer at stage 2 when profiling gets enabled at EL0. I can see two solutions here: a. Accept the limitation (and advertise it in the documentation) that if someone wants to use SPE when running as a Linux guest, the kernel used by the guest must not change the buffer translation table entries after the buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so running a Linux guest should not be a problem. I don't know how other OSes do it (but I can find out). We could also phrase it as: the buffer translation table entries can be changed after enabling the buffer, but only if profiling happens at EL1. But that sounds very arbitrary. b. Pin the buffer after the stage 2 DABT that SPE will report in the situation above. This means that there is a blackout window, but it will happen only once each time the guest reprograms the buffer. I don't know if this is acceptable. We could say that if this blackout window is not acceptable, then the guest kernel shouldn't change the translation table entries after enabling the buffer. Or drop the approach of pinning the buffer and go back to pinning the entire memory of the VM. Any thoughts on this? I would very much prefer to try to pin only the buffer. Thanks, Alex _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm ^ permalink raw reply [flat|nested] 72+ messages in thread
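For option (b), the extent of what would need pinning after the stage 2 DABT is straightforward to compute: everything from the faulting write pointer (PMBPTR_EL1) up to the buffer limit (PMBLIMITR_EL1.LIMIT, which the architecture defines as page-aligned). A minimal sketch, assuming 4K pages; in real code each page would additionally need a stage 1 walk to find its IPA before being pinned, which this helper deliberately leaves out:

```c
#include <assert.h>
#include <stdint.h>

#define SPE_PAGE_SIZE 4096ULL
#define SPE_PAGE_MASK (~(SPE_PAGE_SIZE - 1))

/*
 * Hypothetical helper: how many guest pages option (b) would pin after
 * the SPE stage 2 DABT, covering the remainder of the profiling buffer.
 */
static uint64_t spe_pages_to_pin(uint64_t pmbptr, uint64_t limit)
{
    if (limit <= pmbptr)
        return 0; /* buffer exhausted or misprogrammed: nothing to pin */
    /* Round the write pointer down to a page boundary; limit is aligned. */
    return (limit - (pmbptr & SPE_PAGE_MASK)) / SPE_PAGE_SIZE;
}
```

The blackout window for option (b) is then bounded by the time to walk and pin this many pages, once per buffer reprogramming.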
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-25 10:06 ` Alexandru Elisei @ 2022-07-26 17:51 ` Oliver Upton -1 siblings, 0 replies; 72+ messages in thread From: Oliver Upton @ 2022-07-26 17:51 UTC (permalink / raw) To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi Alex, On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: [...] > > A funkier approach might be to defer pinning of the buffer until the SPE is > > enabled and avoid pinning all of VM memory that way, although I can't > > immediately tell how flexible the architecture is in allowing you to cache > > the base/limit values. > > I was investigating this approach, and Mark raised a concern that I think > might be a showstopper. > > Let's consider this scenario: > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > 2. Guest programs SPE to enable profiling at **EL0** > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > 3. Guest changes the translation table entries for the buffer. The > architecture allows this. > 4. Guest does an ERET to EL0, thus enabling profiling. > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin > the buffer at stage 2 when profiling gets enabled at EL0. Not saying we necessarily should, but this is possible with FGT no? > I can see two solutions here: > > a. Accept the limitation (and advertise it in the documentation) that if > someone wants to use SPE when running as a Linux guest, the kernel used by > the guest must not change the buffer translation table entries after the > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so > running a Linux guest should not be a problem. I don't know how other OSes > do it (but I can find out). 
We could also phrase it that the buffer > translation table entries can be changed after enabling the buffer, but > only if profiling happens at EL1. But that sounds very arbitrary. > > b. Pin the buffer after the stage 2 DABT that SPE will report in the > situation above. This means that there is a blackout window, but will > happen only once after each time the guest reprograms the buffer. I don't > know if this is acceptable. We could say that this if this blackout window > is not acceptable, then the guest kernel shouldn't change the translation > table entries after enabling the buffer. > > Or drop the approach of pinning the buffer and go back to pinning the > entire memory of the VM. > > Any thoughts on this? I would very much prefer to try to pin only the > buffer. Doesn't pinning the buffer also imply pinning the stage 1 tables responsible for its translation as well? I agree that pinning the buffer is likely the best way forward as pinning the whole of guest memory is entirely impractical. I'm also a bit confused on how we would manage to un-pin memory on the way out with this. The guest is free to muck with the stage 1 and could cause the SPU to spew a bunch of stage 2 aborts if it wanted to be annoying. One way to tackle it would be to only allow a single root-to-target walk to be pinned by a vCPU at a time. Any time a new stage 2 abort comes from the SPU, we un-pin the old walk and pin the new one instead. Live migration also throws a wrench in this. IOW, there are still potential sources of blackout unattributable to guest manipulation of the SPU. Going to think on this some more.. -- Thanks, Oliver _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm ^ permalink raw reply [flat|nested] 72+ messages in thread
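The "single pinned root-to-target walk per vCPU" idea above can be sketched as follows. This is illustrative only: the struct and function names are invented, and the actual pin/unpin of pages (and of the stage 1 tables translating them) is reduced to comments. The invariant is that a vCPU holds at most one pinned IPA range covering the SPE buffer, and a fresh SPU stage 2 abort outside that range swaps the pin rather than accumulating more:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU record of the one currently pinned IPA range. */
struct pinned_range {
    bool     valid;
    uint64_t base, size;
};

static bool in_range(const struct pinned_range *r, uint64_t ipa)
{
    return r->valid && ipa >= r->base && ipa < r->base + r->size;
}

/*
 * Handle an SPU stage 2 abort at 'ipa'.  Aborts inside the pinned range
 * are spurious; anything outside drops the old pin and establishes a new
 * one, so a guest mucking with its stage 1 can cause churn but never
 * grow the amount of pinned memory.
 */
static void on_spu_s2_abort(struct pinned_range *r, uint64_t ipa,
                            uint64_t size)
{
    if (in_range(r, ipa))
        return;
    /* unpin_range(r) would go here, if r->valid */
    r->base = ipa;
    r->size = size;
    r->valid = true;
    /* pin_range(r) would go here */
}
```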
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-26 17:51 ` Oliver Upton @ 2022-07-27 9:30 ` Marc Zyngier -1 siblings, 0 replies; 72+ messages in thread From: Marc Zyngier @ 2022-07-27 9:30 UTC (permalink / raw) To: Oliver Upton; +Cc: Will Deacon, kvmarm, linux-arm-kernel On Tue, 26 Jul 2022 18:51:21 +0100, Oliver Upton <oliver.upton@linux.dev> wrote: > > Hi Alex, > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: > > [...] > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > enabled and avoid pinning all of VM memory that way, although I can't > > > immediately tell how flexible the architecture is in allowing you to cache > > > the base/limit values. > > > > I was investigating this approach, and Mark raised a concern that I think > > might be a showstopper. > > > > Let's consider this scenario: > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > > 2. Guest programs SPE to enable profiling at **EL0** > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > > 3. Guest changes the translation table entries for the buffer. The > > architecture allows this. > > 4. Guest does an ERET to EL0, thus enabling profiling. > > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin > > the buffer at stage 2 when profiling gets enabled at EL0. > > Not saying we necessarily should, but this is possible with FGT no? Given how often ERET is used at EL1, I'd really refrain from doing so. NV uses the same mechanism to multiplex vEL2 and vEL1 on the real EL1, and this comes at a serious cost (even an exception return that stays at the same EL gets trapped). Once EL1 runs, we disengage this trap because it is otherwise way too costly. > > > I can see two solutions here: > > > > a. 
Accept the limitation (and advertise it in the documentation) that if > > someone wants to use SPE when running as a Linux guest, the kernel used by > > the guest must not change the buffer translation table entries after the > > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so > > running a Linux guest should not be a problem. I don't know how other OSes > > do it (but I can find out). We could also phrase it that the buffer > > translation table entries can be changed after enabling the buffer, but > > only if profiling happens at EL1. But that sounds very arbitrary. > > > > b. Pin the buffer after the stage 2 DABT that SPE will report in the > > situation above. This means that there is a blackout window, but will > > happen only once after each time the guest reprograms the buffer. I don't > > know if this is acceptable. We could say that this if this blackout window > > is not acceptable, then the guest kernel shouldn't change the translation > > table entries after enabling the buffer. > > > > Or drop the approach of pinning the buffer and go back to pinning the > > entire memory of the VM. > > > > Any thoughts on this? I would very much prefer to try to pin only the > > buffer. > > Doesn't pinning the buffer also imply pinning the stage 1 tables > responsible for its translation as well? I agree that pinning the buffer > is likely the best way forward as pinning the whole of guest memory is > entirely impractical. How different is this from device assignment, which also relies on full page pinning? The way I look at it, SPE is a device directly assigned to the guest, and isn't capable of generating synchronous exceptions. Not that I'm madly in love with the approach, but this is at least consistent. There were also some concerns around buggy HW that would blow itself up on S2 faults, but I think these implementations are confidential enough that we don't need to worry about them. 
> I'm also a bit confused on how we would manage to un-pin memory on the > way out with this. The guest is free to muck with the stage 1 and could > cause the SPU to spew a bunch of stage 2 aborts if it wanted to be > annoying. One way to tackle it would be to only allow a single > root-to-target walk to be pinned by a vCPU at a time. Any time a new > stage 2 abort comes from the SPU, we un-pin the old walk and pin the new > one instead. This sounds like a reasonable option. Only one IPA range covering the SPE buffer (as described by the translation of PMBPTR_EL1) is pinned at any given time. Generate an SPE S2 fault outside of this range, and we unpin the region before mapping in the next one. Yes, the guest can play tricks on us and exploit the latency of the interrupt. But at the end of the day, this is its own problem. Of course, this results in larger blind windows. Ideally, we should be able to report these to the guest, either as sideband data or in the actual profiling buffer (but I have no idea whether this is possible). > Live migration also throws a wrench in this. IOW, there are still potential > sources of blackout unattributable to guest manipulation of the SPU. Can you shed some light on this? I appreciate that you can't play the R/O trick on the SPE buffer as it invalidates the above discussion, but it should be relatively easy to track these pages and never reset them as clean until the vcpu is stopped. Unless you foresee other issues? To be clear, I don't worry too much about these blind windows. The architecture doesn't really give us the right tools to make it work reliably, making this a best effort only. Unless we pin the whole guest and forego migration and other fault-driven mechanisms. Maybe that is a choice we need to give to the user: cheap, fast, reliable. Pick two. Thanks, M. -- Without deviation from the norm, progress is not possible. 
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-27 9:30 ` Marc Zyngier @ 2022-07-27 9:52 ` Marc Zyngier -1 siblings, 0 replies; 72+ messages in thread From: Marc Zyngier @ 2022-07-27 9:52 UTC (permalink / raw) To: Oliver Upton; +Cc: Will Deacon, kvmarm, linux-arm-kernel On Wed, 27 Jul 2022 10:30:59 +0100, Marc Zyngier <maz@kernel.org> wrote: > > On Tue, 26 Jul 2022 18:51:21 +0100, > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > Doesn't pinning the buffer also imply pinning the stage 1 tables > > responsible for its translation as well? I agree that pinning the buffer > > is likely the best way forward as pinning the whole of guest memory is > > entirely impractical. Huh, I just realised that you were talking about S1. I don't think we need to do this. As long as the translation falls into a mapped region (pinned or not), we don't need to worry. If we get an S2 translation fault from SPE, we just go and map it. And TBH the pinning here is just an optimisation against things like swap, KSM and similar things. The only thing we need to make sure is that the fault is handled in the context of the vcpu that owns this SPU. Alex, can you think of anything that would cause a problem (other than performance and possible blackout windows) if we didn't do any pinning at all and just handled the SPE interrupts as normal page faults? Thanks, M. -- Without deviation from the norm, progress is not possible. 
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-27 9:52 ` Marc Zyngier @ 2022-07-27 10:38 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-07-27 10:38 UTC (permalink / raw) To: Marc Zyngier; +Cc: linux-arm-kernel, Will Deacon, kvmarm Hi Marc, On Wed, Jul 27, 2022 at 10:52:34AM +0100, Marc Zyngier wrote: > On Wed, 27 Jul 2022 10:30:59 +0100, > Marc Zyngier <maz@kernel.org> wrote: > > > > On Tue, 26 Jul 2022 18:51:21 +0100, > > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > > > Doesn't pinning the buffer also imply pinning the stage 1 tables > > > responsible for its translation as well? I agree that pinning the buffer > > > is likely the best way forward as pinning the whole of guest memory is > > > entirely impractical. > > Huh, I just realised that you were talking about S1. I don't think we > need to do this. As long as the translation falls into a mapped > region (pinned or not), we don't need to worry. > > If we get an S2 translation fault from SPE, we just go and map it. And > TBH the pinning here is just an optimisation against things like swap, > KSM and similar things. The only thing we need to make sure is that > the fault is handled in the context of the vcpu that owns this SPU. > > Alex, can you think of anything that would cause a problem (other than > performance and possible blackout windows) if we didn't do any pinning > at all and just handled the SPE interrupts as normal page faults? PMBSR_EL1.DL might be set to 1 as a result of a stage 2 fault reported by SPE, which means the last record written is incomplete. Records have a variable size, so it's impossible for KVM to revert to the end of the last known good record without parsing the buffer (references here [1]). 
And even if KVM would know the size of a record, there's this bit in the Arm ARM which worries me (ARM DDI 0487H.a, page D10-5177): "The architecture does not require that a sample record is written sequentially by the SPU, only that: [..] - On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates whether PMBPTR_EL1 points to the first byte after the last complete sample record." So there might be gaps in the buffer, meaning that the entire buffer would have to be discarded if DL is set as a result of a stage 2 fault. Also, I'm not sure if you're aware of this, but SPE reports the guest VA in PMBPTR_EL1 (not the IPA) on a fault, so KVM would have to walk the guest's stage 1 tables to service the faults, which would add to the overhead of servicing the fault. Don't know if that makes a difference, just thought I should mention it as another peculiarity of SPE. [1] https://lore.kernel.org/all/Yl7KewpTj+7NSonf@monolith.localdoman/ Thanks, Alex 
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-27 10:38 ` Alexandru Elisei @ 2022-07-27 16:06 ` Oliver Upton -1 siblings, 0 replies; 72+ messages in thread From: Oliver Upton @ 2022-07-27 16:06 UTC (permalink / raw) To: Alexandru Elisei; +Cc: Marc Zyngier, Will Deacon, kvmarm, linux-arm-kernel On Wed, Jul 27, 2022 at 11:38:53AM +0100, Alexandru Elisei wrote: > Hi Marc, > > On Wed, Jul 27, 2022 at 10:52:34AM +0100, Marc Zyngier wrote: > > On Wed, 27 Jul 2022 10:30:59 +0100, > > Marc Zyngier <maz@kernel.org> wrote: > > > > > > On Tue, 26 Jul 2022 18:51:21 +0100, > > > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > > > > > Doesn't pinning the buffer also imply pinning the stage 1 tables > > > > responsible for its translation as well? I agree that pinning the buffer > > > > is likely the best way forward as pinning the whole of guest memory is > > > > entirely impractical. > > > > Huh, I just realised that you were talking about S1. I don't think we > > need to do this. As long as the translation falls into a mapped > > region (pinned or not), we don't need to worry. Right, but my issue is what happens when a fragment of the S1 becomes unmapped at S2. We were discussing the idea of faulting once on the buffer at the beginning of profiling; it seems to me that it could just as easily happen at runtime and get tripped up by what Alex points out below: > PMBSR_EL1.DL might be set to 1 as a result of a stage 2 fault reported by SPE, > which means the last record written is incomplete. Records have a variable > size, so it's impossible for KVM to revert to the end of the last known > good record without parsing the buffer (references here [1]). And even if > KVM would know the size of a record, there's this bit in the Arm ARM which > worries me (ARM DDI 0487H.a, page D10-5177): > > "The architecture does not require that a sample record is written > sequentially by the SPU, only that: > [..] 
> - On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates > whether PMBPTR_EL1 points to the first byte after the last complete > sample record." > > So there might be gaps in the buffer, meaning that the entire buffer would > have to be discarded if DL is set as a result of a stage 2 fault. Attempting to avoid thrashing with more threads, so I'm going to summon back some context from your original reply, Marc: > > > > Live migration also throws a wrench in this. IOW, there are still potential > > > > sources of blackout unattributable to guest manipulation of the SPU. > > > > > > Can you shed some light on this? I appreciate that you can't play the > > > R/O trick on the SPE buffer as it invalidates the above discussion, > > > but it should be relatively easy to track these pages and never reset > > > them as clean until the vcpu is stopped. Unless you foresee other > > > issues? Right, we can play tricks on pre-copy to avoid write protecting the SPE buffer. My concern was more around post-copy, where userspace could've decided to leave the buffer behind and demand it back on the resulting S2 fault. > > > To be clear, I don't worry too much about these blind windows. The > > > architecture doesn't really give us the right tools to make it work > > > reliably, making this a best effort only. Unless we pin the whole > > > guest and forego migration and other fault-driven mechanisms. > > > > > > Maybe that is a choice we need to give to the user: cheap, fast, > > > reliable. Pick two. As long as we crisply document the errata in KVM's virtualized SPE (and inform the guest), that sounds reasonable. I'm just uneasy about proceeding with an implementation w/ so many gotchas unless all parties involved are aware of the quirks. -- Thanks, Oliver 
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-27 9:30 ` Marc Zyngier @ 2022-07-27 10:56 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-07-27 10:56 UTC (permalink / raw) To: Marc Zyngier; +Cc: linux-arm-kernel, Will Deacon, kvmarm Hi Marc, On Wed, Jul 27, 2022 at 10:30:59AM +0100, Marc Zyngier wrote: > On Tue, 26 Jul 2022 18:51:21 +0100, > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > Hi Alex, > > > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: > > > > [...] > > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > > enabled and avoid pinning all of VM memory that way, although I can't > > > > immediately tell how flexible the architecture is in allowing you to cache > > > > the base/limit values. > > > > > > I was investigating this approach, and Mark raised a concern that I think > > > might be a showstopper. > > > > > > Let's consider this scenario: > > > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > > > 2. Guest programs SPE to enable profiling at **EL0** > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > > > 3. Guest changes the translation table entries for the buffer. The > > > architecture allows this. > > > 4. Guest does an ERET to EL0, thus enabling profiling. > > > > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin > > > the buffer at stage 2 when profiling gets enabled at EL0. > > > > Not saying we necessarily should, but this is possible with FGT no? > > Given how often ERET is used at EL1, I'd really refrain from doing > so. NV uses the same mechanism to multiplex vEL2 and vEL1 on the real > EL1, and this comes at a serious cost (even an exception return that > stays at the same EL gets trapped). 
Once EL1 runs, we disengage this > trap because it is otherwise way too costly. > > > > > > I can see two solutions here: > > > > > > a. Accept the limitation (and advertise it in the documentation) that if > > > someone wants to use SPE when running as a Linux guest, the kernel used by > > > the guest must not change the buffer translation table entries after the > > > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so > > > running a Linux guest should not be a problem. I don't know how other OSes > > > do it (but I can find out). We could also phrase it that the buffer > > > translation table entries can be changed after enabling the buffer, but > > > only if profiling happens at EL1. But that sounds very arbitrary. > > > > > > b. Pin the buffer after the stage 2 DABT that SPE will report in the > > > situation above. This means that there is a blackout window, but it will > > > happen only once after each time the guest reprograms the buffer. I don't > > > know if this is acceptable. We could say that if this blackout window > > > is not acceptable, then the guest kernel shouldn't change the translation > > > table entries after enabling the buffer. > > > > > > Or drop the approach of pinning the buffer and go back to pinning the > > > entire memory of the VM. > > > > > > Any thoughts on this? I would very much prefer to try to pin only the > > > buffer. > > > > Doesn't pinning the buffer also imply pinning the stage 1 tables > > responsible for its translation as well? I agree that pinning the buffer > > is likely the best way forward as pinning the whole of guest memory is > > entirely impractical. > > How different is this from device assignment, which also relies on > full page pinning? The way I look at it, SPE is a device directly > assigned to the guest, and isn't capable of generating synchronous > exceptions. Not that I'm madly in love with the approach, but this is > at least consistent. 
There were also some concerns around buggy HW that > would blow itself up on S2 faults, but I think these implementations > are confidential enough that we don't need to worry about them. > > > I'm also a bit confused on how we would manage to un-pin memory on the > > way out with this. The guest is free to muck with the stage 1 and could > > cause the SPU to spew a bunch of stage 2 aborts if it wanted to be > > annoying. One way to tackle it would be to only allow a single > > root-to-target walk to be pinned by a vCPU at a time. Any time a new > > stage 2 abort comes from the SPU, we un-pin the old walk and pin the new > > one instead. > > This sounds like a reasonable option. Only one IPA range covering the > SPE buffer (as described by the translation of PMBPTR_EL1) is pinned > at any given time. Generate an SPE S2 fault outside of this range, and > we unpin the region before mapping in the next one. Yes, the guest can > play tricks on us and exploit the latency of the interrupt. But at the > end of the day, this is its own problem. > > Of course, this results in larger blind windows. Ideally, we should be > able to report these to the guest, either as sideband data or in the > actual profiling buffer (but I have no idea whether this is possible). I believe solution b, pinning the buffer when the guest enables profiling (where by profiling enabled I mean that StatisticalProfilingEnabled() returns true), and pinning it as a result of a stage 2 fault only in the situation I described, would reduce the blackouts to a minimum. Thanks, Alex > > > Live migration also throws a wrench in this. IOW, there are still potential > > sources of blackout unattributable to guest manipulation of the SPU. > > Can you shed some light on this? I appreciate that you can't play the > R/O trick on the SPE buffer as it invalidates the above discussion, > but it should be relatively easy to track these pages and never reset > them as clean until the vcpu is stopped. 
Unless you foresee other > issues? > > To be clear, I don't worry too much about these blind windows. The > architecture doesn't really give us the right tools to make it work > reliably, making this a best effort only. Unless we pin the whole > guest and forego migration and other fault-driven mechanisms. > > Maybe that is a choice we need to give to the user: cheap, fast, > reliable. Pick two. > > Thanks, > > M. > > -- > Without deviation from the norm, progress is not possible. 
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory @ 2022-07-27 10:56 ` Alexandru Elisei 0 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-07-27 10:56 UTC (permalink / raw) To: Marc Zyngier; +Cc: Oliver Upton, Will Deacon, kvmarm, linux-arm-kernel Hi Marc, On Wed, Jul 27, 2022 at 10:30:59AM +0100, Marc Zyngier wrote: > On Tue, 26 Jul 2022 18:51:21 +0100, > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > Hi Alex, > > > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: > > > > [...] > > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > > enabled and avoid pinning all of VM memory that way, although I can't > > > > immediately tell how flexible the architecture is in allowing you to cache > > > > the base/limit values. > > > > > > I was investigating this approach, and Mark raised a concern that I think > > > might be a showstopper. > > > > > > Let's consider this scenario: > > > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > > > 2. Guest programs SPE to enable profiling at **EL0** > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > > > 3. Guest changes the translation table entries for the buffer. The > > > architecture allows this. > > > 4. Guest does an ERET to EL0, thus enabling profiling. > > > > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin > > > the buffer at stage 2 when profiling gets enabled at EL0. > > > > Not saying we necessarily should, but this is possible with FGT no? > > Given how often ERET is used at EL1, I'd really refrain from doing > so. NV uses the same mechanism to multiplex vEL2 and vEL1 on the real > EL1, and this comes at a serious cost (even an exception return that > stays at the same EL gets trapped). 
Once EL1 runs, we disengage this > trap because it is otherwise way too costly. > > > > > > I can see two solutions here: > > > > > > a. Accept the limitation (and advertise it in the documentation) that if > > > someone wants to use SPE when running as a Linux guest, the kernel used by > > > the guest must not change the buffer translation table entries after the > > > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so > > > running a Linux guest should not be a problem. I don't know how other OSes > > > do it (but I can find out). We could also phrase it that the buffer > > > translation table entries can be changed after enabling the buffer, but > > > only if profiling happens at EL1. But that sounds very arbitrary. > > > > > > b. Pin the buffer after the stage 2 DABT that SPE will report in the > > > situation above. This means that there is a blackout window, but it will > > > happen only once after each time the guest reprograms the buffer. I don't > > > know if this is acceptable. We could say that if this blackout window > > > is not acceptable, then the guest kernel shouldn't change the translation > > > table entries after enabling the buffer. > > > > > > Or drop the approach of pinning the buffer and go back to pinning the > > > entire memory of the VM. > > > > > > Any thoughts on this? I would very much prefer to try to pin only the > > > buffer. > > > > Doesn't pinning the buffer also imply pinning the stage 1 tables > > responsible for its translation as well? I agree that pinning the buffer > > is likely the best way forward as pinning the whole of guest memory is > > entirely impractical. > > How different is this from device assignment, which also relies on > full page pinning? The way I look at it, SPE is a device directly > assigned to the guest, and isn't capable of generating synchronous > exceptions. Not that I'm madly in love with the approach, but this is > at least consistent. 
There were also some concerns around buggy HW that > would blow itself up on S2 faults, but I think these implementations > are confidential enough that we don't need to worry about them. > > > I'm also a bit confused on how we would manage to un-pin memory on the > > way out with this. The guest is free to muck with the stage 1 and could > > cause the SPU to spew a bunch of stage 2 aborts if it wanted to be > > annoying. One way to tackle it would be to only allow a single > > root-to-target walk to be pinned by a vCPU at a time. Any time a new > > stage 2 abort comes from the SPU, we un-pin the old walk and pin the new > > one instead. > > This sounds like a reasonable option. Only one IPA range covering the > SPE buffer (as described by the translation of PMBPTR_EL1) is pinned > at any given time. Generate a SPE S2 fault outside of this range, and > we unpin the region before mapping in the next one. Yes, the guest can > play tricks on us and exploit the latency of the interrupt. But at the > end of the day, this is its own problem. > > Of course, this results in larger blind windows. Ideally, we should be > able to report these to the guest, either as sideband data or in the > actual profiling buffer (but I have no idea whether this is possible). I believe solution b (pin the buffer when the guest enables profiling, where by "profiling enabled" I mean StatisticalProfilingEnabled() returns true, and pin the buffer as a result of a stage 2 fault only in the situation that I described) would reduce the blackouts to a minimum. Thanks, Alex > > > Live migration also throws a wrench in this. IOW, there are still potential > > sources of blackout unattributable to guest manipulation of the SPU. > > Can you shed some light on this? I appreciate that you can't play the > R/O trick on the SPE buffer as it invalidates the above discussion, > but it should be relatively easy to track these pages and never reset > them as clean until the vcpu is stopped. 
Unless you foresee other > issues? > > To be clear, I don't worry too much about these blind windows. The > architecture doesn't really give us the right tools to make it work > reliably, making this a best effort only. Unless we pin the whole > guest and forego migration and other fault-driven mechanisms. > > Maybe that is a choice we need to give to the user: cheap, fast, > reliable. Pick two. > > Thanks, > > M. > > -- > Without deviation from the norm, progress is not possible.
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-27 10:56 ` Alexandru Elisei @ 2022-07-27 11:18 ` Marc Zyngier -1 siblings, 0 replies; 72+ messages in thread From: Marc Zyngier @ 2022-07-27 11:18 UTC (permalink / raw) To: Alexandru Elisei; +Cc: linux-arm-kernel, Will Deacon, kvmarm On 2022-07-27 11:56, Alexandru Elisei wrote: > Hi Marc, > > On Wed, Jul 27, 2022 at 10:30:59AM +0100, Marc Zyngier wrote: >> On Tue, 26 Jul 2022 18:51:21 +0100, >> Oliver Upton <oliver.upton@linux.dev> wrote: >> > >> > Hi Alex, >> > >> > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: >> > >> > [...] >> > >> > > > A funkier approach might be to defer pinning of the buffer until the SPE is >> > > > enabled and avoid pinning all of VM memory that way, although I can't >> > > > immediately tell how flexible the architecture is in allowing you to cache >> > > > the base/limit values. >> > > >> > > I was investigating this approach, and Mark raised a concern that I think >> > > might be a showstopper. >> > > >> > > Let's consider this scenario: >> > > >> > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, >> > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). >> > > >> > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). >> > > 2. Guest programs SPE to enable profiling at **EL0** >> > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). >> > > 3. Guest changes the translation table entries for the buffer. The >> > > architecture allows this. >> > > 4. Guest does an ERET to EL0, thus enabling profiling. >> > > >> > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin >> > > the buffer at stage 2 when profiling gets enabled at EL0. >> > >> > Not saying we necessarily should, but this is possible with FGT no? >> >> Given how often ERET is used at EL1, I'd really refrain from doing >> so. 
NV uses the same mechanism to multiplex vEL2 and vEL1 on the real >> EL1, and this comes at a serious cost (even an exception return that >> stays at the same EL gets trapped). Once EL1 runs, we disengage this >> trap because it is otherwise way too costly. >> >> > >> > > I can see two solutions here: >> > > >> > > a. Accept the limitation (and advertise it in the documentation) that if >> > > someone wants to use SPE when running as a Linux guest, the kernel used by >> > > the guest must not change the buffer translation table entries after the >> > > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so >> > > running a Linux guest should not be a problem. I don't know how other OSes >> > > do it (but I can find out). We could also phrase it that the buffer >> > > translation table entries can be changed after enabling the buffer, but >> > > only if profiling happens at EL1. But that sounds very arbitrary. >> > > >> > > b. Pin the buffer after the stage 2 DABT that SPE will report in the >> > > situation above. This means that there is a blackout window, but it will >> > > happen only once after each time the guest reprograms the buffer. I don't >> > > know if this is acceptable. We could say that if this blackout window >> > > is not acceptable, then the guest kernel shouldn't change the translation >> > > table entries after enabling the buffer. >> > > >> > > Or drop the approach of pinning the buffer and go back to pinning the >> > > entire memory of the VM. >> > > >> > > Any thoughts on this? I would very much prefer to try to pin only the >> > > buffer. >> > >> > Doesn't pinning the buffer also imply pinning the stage 1 tables >> > responsible for its translation as well? I agree that pinning the buffer >> > is likely the best way forward as pinning the whole of guest memory is >> > entirely impractical. >> >> How different is this from device assignment, which also relies on >> full page pinning? 
The way I look at it, SPE is a device directly >> assigned to the guest, and isn't capable of generating synchronous >> exceptions. Not that I'm madly in love with the approach, but this is >> at least consistent. There were also some concerns around buggy HW that >> would blow itself up on S2 faults, but I think these implementations >> are confidential enough that we don't need to worry about them. >> >> > I'm also a bit confused on how we would manage to un-pin memory on the >> > way out with this. The guest is free to muck with the stage 1 and could >> > cause the SPU to spew a bunch of stage 2 aborts if it wanted to be >> > annoying. One way to tackle it would be to only allow a single >> > root-to-target walk to be pinned by a vCPU at a time. Any time a new >> > stage 2 abort comes from the SPU, we un-pin the old walk and pin the new >> > one instead. >> >> This sounds like a reasonable option. Only one IPA range covering the >> SPE buffer (as described by the translation of PMBPTR_EL1) is pinned >> at any given time. Generate a SPE S2 fault outside of this range, and >> we unpin the region before mapping in the next one. Yes, the guest can >> play tricks on us and exploit the latency of the interrupt. But at the >> end of the day, this is its own problem. >> >> Of course, this results in larger blind windows. Ideally, we should be >> able to report these to the guest, either as sideband data or in the >> actual profiling buffer (but I have no idea whether this is possible). > > I believe solution b, pin the buffer when guest enables profiling > (where by > profiling enabled I mean StatisticalProfilingEnabled() returns true), > and > only in the situation that I described pin the buffer as a result of a > stage 2 fault, would reduce the blackouts to a minimum. In all honesty, I'd rather see everything be done as the result of a S2 fault for now, and only introduce heuristics to reduce the blackout window at a later time. 
And this includes buffer pinning if that can be avoided. My hunch is that people wanting zero blackout will always pin all their memory, one way or another, and that the rest of us will be happy just to get *something* out of SPE in a VM... M. -- Jazz is not dead. It just smells funny...
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-27 11:18 ` Marc Zyngier @ 2022-07-27 12:10 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-07-27 12:10 UTC (permalink / raw) To: Marc Zyngier; +Cc: linux-arm-kernel, Will Deacon, kvmarm Hi, On Wed, Jul 27, 2022 at 12:18:41PM +0100, Marc Zyngier wrote: > On 2022-07-27 11:56, Alexandru Elisei wrote: > > Hi Marc, > > > > On Wed, Jul 27, 2022 at 10:30:59AM +0100, Marc Zyngier wrote: > > > On Tue, 26 Jul 2022 18:51:21 +0100, > > > Oliver Upton <oliver.upton@linux.dev> wrote: > > > > > > > > Hi Alex, > > > > > > > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: > > > > > > > > [...] > > > > > > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > > > > enabled and avoid pinning all of VM memory that way, although I can't > > > > > > immediately tell how flexible the architecture is in allowing you to cache > > > > > > the base/limit values. > > > > > > > > > > I was investigating this approach, and Mark raised a concern that I think > > > > > might be a showstopper. > > > > > > > > > > Let's consider this scenario: > > > > > > > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > > > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > > > > > > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > > > > > 2. Guest programs SPE to enable profiling at **EL0** > > > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > > > > > 3. Guest changes the translation table entries for the buffer. The > > > > > architecture allows this. > > > > > 4. Guest does an ERET to EL0, thus enabling profiling. > > > > > > > > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin > > > > > the buffer at stage 2 when profiling gets enabled at EL0. 
> > > > > > > > Not saying we necessarily should, but this is possible with FGT no? > > > > > > Given how often ERET is used at EL1, I'd really refrain from doing > > > so. NV uses the same mechanism to multiplex vEL2 and vEL1 on the real > > > EL1, and this comes at a serious cost (even an exception return that > > > stays at the same EL gets trapped). Once EL1 runs, we disengage this > > > trap because it is otherwise way too costly. > > > > > > > > > > > > I can see two solutions here: > > > > > > > > > > a. Accept the limitation (and advertise it in the documentation) that if > > > > > someone wants to use SPE when running as a Linux guest, the kernel used by > > > > > the guest must not change the buffer translation table entries after the > > > > > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so > > > > > running a Linux guest should not be a problem. I don't know how other OSes > > > > > do it (but I can find out). We could also phrase it that the buffer > > > > > translation table entries can be changed after enabling the buffer, but > > > > > only if profiling happens at EL1. But that sounds very arbitrary. > > > > > > > > > > b. Pin the buffer after the stage 2 DABT that SPE will report in the > > > > > situation above. This means that there is a blackout window, but it will > > > > > happen only once after each time the guest reprograms the buffer. I don't > > > > > know if this is acceptable. We could say that if this blackout window > > > > > is not acceptable, then the guest kernel shouldn't change the translation > > > > > table entries after enabling the buffer. > > > > > > > > > > Or drop the approach of pinning the buffer and go back to pinning the > > > > > entire memory of the VM. > > > > > > > > > > Any thoughts on this? I would very much prefer to try to pin only the > > > > > buffer. > > > > > > > > Doesn't pinning the buffer also imply pinning the stage 1 tables > > > > responsible for its translation as well? 
I agree that pinning the buffer > > > > is likely the best way forward as pinning the whole of guest memory is > > > > entirely impractical. > > > > > > How different is this from device assignment, which also relies on > > > full page pinning? The way I look at it, SPE is a device directly > > > assigned to the guest, and isn't capable of generating synchronous > > > exceptions. Not that I'm madly in love with the approach, but this is > > > at least consistent. There were also some concerns around buggy HW that > > > would blow itself up on S2 faults, but I think these implementations > > > are confidential enough that we don't need to worry about them. > > > > > > > I'm also a bit confused on how we would manage to un-pin memory on the > > > > way out with this. The guest is free to muck with the stage 1 and could > > > > cause the SPU to spew a bunch of stage 2 aborts if it wanted to be > > > > annoying. One way to tackle it would be to only allow a single > > > > root-to-target walk to be pinned by a vCPU at a time. Any time a new > > > > stage 2 abort comes from the SPU, we un-pin the old walk and pin the new > > > > one instead. > > > > > > This sounds like a reasonable option. Only one IPA range covering the > > > SPE buffer (as described by the translation of PMBPTR_EL1) is pinned > > > at any given time. Generate a SPE S2 fault outside of this range, and > > > we unpin the region before mapping in the next one. Yes, the guest can > > > play tricks on us and exploit the latency of the interrupt. But at the > > > end of the day, this is its own problem. > > > > > > Of course, this results in larger blind windows. Ideally, we should be > > > able to report these to the guest, either as sideband data or in the > > > actual profiling buffer (but I have no idea whether this is possible). 
> > > > I believe solution b, pin the buffer when guest enables profiling (where > > by > > profiling enabled I mean StatisticalProfilingEnabled() returns true), > > and > > only in the situation that I described pin the buffer as a result of a > > stage 2 fault, would reduce the blackouts to a minimum. > > In all honesty, I'd rather see everything be done as the result > of a S2 fault for now, and only introduce heuristics to reduce the blackout > window at a later time. And this includes buffer pinning > if that can be avoided. I believe it's not feasible to do everything as a result of a SPE stage 2 fault. I've explained where in this reply [1]. Sorry for fragmenting the discussion into so many different threads. Having the first write, and only that first write, trigger a stage 2 fault that KVM handles by pinning the buffer works because the guest hasn't written anything useful to the buffer. [1] https://lore.kernel.org/all/YuEVq8Au7YsDLOdI@monolith.localdoman/ > > My hunch is that people wanting zero blackout will always pin > all their memory, one way or another, and that the rest of us > will be happy just to get *something* out of SPE in a VM... What do you have in mind when you say "one way or another"? Because that would need changes to KVM (mlock() is not enough). Thanks, Alex > > M. > -- > Jazz is not dead. It just smells funny...
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-26 17:51 ` Oliver Upton @ 2022-07-27 10:19 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-07-27 10:19 UTC (permalink / raw) To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi Oliver, Thank you for the help, replies below. On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote: > Hi Alex, > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: > > [...] > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > enabled and avoid pinning all of VM memory that way, although I can't > > > immediately tell how flexible the architecture is in allowing you to cache > > > the base/limit values. > > > > I was investigating this approach, and Mark raised a concern that I think > > might be a showstopper. > > > > Let's consider this scenario: > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > > 2. Guest programs SPE to enable profiling at **EL0** > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > > 3. Guest changes the translation table entries for the buffer. The > > architecture allows this. > > 4. Guest does an ERET to EL0, thus enabling profiling. > > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin > > the buffer at stage 2 when profiling gets enabled at EL0. > > Not saying we necessarily should, but this is possible with FGT no? It doesn't look to me like FEAT_FGT offers any knobs to trap ERET from EL1. Unless there's no other way, I would prefer not to have the emulation of one feature depend on the presence of another feature, > > > I can see two solutions here: > > > > a. 
Accept the limitation (and advertise it in the documentation) that if > > someone wants to use SPE when running as a Linux guest, the kernel used by > > the guest must not change the buffer translation table entries after the > > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so > > running a Linux guest should not be a problem. I don't know how other OSes > > do it (but I can find out). We could also phrase it that the buffer > > translation table entries can be changed after enabling the buffer, but > > only if profiling happens at EL1. But that sounds very arbitrary. > > > > b. Pin the buffer after the stage 2 DABT that SPE will report in the > > situation above. This means that there is a blackout window, but it will > > happen only once after each time the guest reprograms the buffer. I don't > > know if this is acceptable. We could say that if this blackout window > > is not acceptable, then the guest kernel shouldn't change the translation > > table entries after enabling the buffer. > > > > Or drop the approach of pinning the buffer and go back to pinning the > > entire memory of the VM. > > > > Any thoughts on this? I would very much prefer to try to pin only the > > buffer. > > Doesn't pinning the buffer also imply pinning the stage 1 tables > responsible for its translation as well? I agree that pinning the buffer See my reply [1] to a question someone asked in an earlier iteration of the pKVM series. My conclusion is that it's impossible to stop the invalidate_range_start() MMU notifiers from being invoked for pinned pages. But I believe that can be circumvented by passing the enum mmu_notifier_event event field to the arm64 KVM code and using that to decide whether to do the unmapping or not. I am still investigating that, but it looks promising. [1] https://lore.kernel.org/all/YuEMkKY2RU%2F2KiZW@monolith.localdoman/ > is likely the best way forward as pinning the whole of guest memory is > entirely impractical. 
I would say it's undesirable, not impractical. Like Marc said, vfio already pins the entire guest memory with the VFIO_IOMMU_MAP_DMA ioctl. The difference there is that the SMMU tables are unmapped via the explicit ioctl VFIO_IOMMU_UNMAP_DMA; the SMMU doesn't use the MMU notifiers to keep in sync with the host's stage 1 like KVM does. > > I'm also a bit confused on how we would manage to un-pin memory on the > way out with this. The guest is free to muck with the stage 1 and could > cause the SPU to spew a bunch of stage 2 aborts if it wanted to be > annoying. One way to tackle it would be to only allow a single > root-to-target walk to be pinned by a vCPU at a time. Any time a new > stage 2 abort comes from the SPU, we un-pin the old walk and pin the new > one instead. > > Live migration also throws a wrench in this. IOW, there are still potential > sources of blackout unattributable to guest manipulation of the SPU. I have a proposal [2] to handle that, if you want to have a look. Basically, userspace tells KVM to never allow the guest to start profiling. That means a possibly huge blackout window while the guest is being migrated, but I don't see any better solutions. [2] https://lore.kernel.org/all/20211117153842.302159-35-alexandru.elisei@arm.com/ Thanks, Alex 
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-27 10:19 ` Alexandru Elisei @ 2022-07-27 10:29 ` Marc Zyngier -1 siblings, 0 replies; 72+ messages in thread From: Marc Zyngier @ 2022-07-27 10:29 UTC (permalink / raw) To: Alexandru Elisei; +Cc: linux-arm-kernel, Will Deacon, kvmarm On 2022-07-27 11:19, Alexandru Elisei wrote: > Hi Oliver, > > Thank you for the help, replies below. > > On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote: >> Hi Alex, >> >> On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: >> >> [...] >> >> > > A funkier approach might be to defer pinning of the buffer until the SPE is >> > > enabled and avoid pinning all of VM memory that way, although I can't >> > > immediately tell how flexible the architecture is in allowing you to cache >> > > the base/limit values. >> > >> > I was investigating this approach, and Mark raised a concern that I think >> > might be a showstopper. >> > >> > Let's consider this scenario: >> > >> > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, >> > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). >> > >> > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). >> > 2. Guest programs SPE to enable profiling at **EL0** >> > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). >> > 3. Guest changes the translation table entries for the buffer. The >> > architecture allows this. >> > 4. Guest does an ERET to EL0, thus enabling profiling. >> > >> > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin >> > the buffer at stage 2 when profiling gets enabled at EL0. >> >> Not saying we necessarily should, but this is possible with FGT no? > > It doesn't look to me like FEAT_FGT offers any knobs to trap ERET from > EL1. See HFGITR.ERET. Thanks, M. -- Jazz is not dead. It just smells funny... 
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-27 10:29 ` Marc Zyngier @ 2022-07-27 10:44 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-07-27 10:44 UTC (permalink / raw) To: Marc Zyngier; +Cc: linux-arm-kernel, Will Deacon, kvmarm On Wed, Jul 27, 2022 at 11:29:03AM +0100, Marc Zyngier wrote: > On 2022-07-27 11:19, Alexandru Elisei wrote: > > Hi Oliver, > > > > Thank you for the help, replies below. > > > > On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote: > > > Hi Alex, > > > > > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: > > > > > > [...] > > > > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > > > enabled and avoid pinning all of VM memory that way, although I can't > > > > > immediately tell how flexible the architecture is in allowing you to cache > > > > > the base/limit values. > > > > > > > > I was investigating this approach, and Mark raised a concern that I think > > > > might be a showstopper. > > > > > > > > Let's consider this scenario: > > > > > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > > > > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > > > > 2. Guest programs SPE to enable profiling at **EL0** > > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > > > > 3. Guest changes the translation table entries for the buffer. The > > > > architecture allows this. > > > > 4. Guest does an ERET to EL0, thus enabling profiling. > > > > > > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin > > > > the buffer at stage 2 when profiling gets enabled at EL0. > > > > > > Not saying we necessarily should, but this is possible with FGT no? > > > > It doesn't look to me like FEAT_FGT offers any knobs to trap ERET from > > EL1. > > See HFGITR.ERET. 
Ah, so that's the register, thanks! I still am not sure that having FEAT_SPE, an Armv8.2 extension, depend on FEAT_FGT, an Armv8.6 extension, is the best idea. Do you know of any machines that have FEAT_SPE and FEAT_FGT? On the plus side, KVM could enable the trap only in the case above, and disable it after the ERET is trapped, so it should be relatively cheap to use. Thanks, Alex > > Thanks, > > M. > -- > Jazz is not dead. It just smells funny... 
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-27 10:44 ` Alexandru Elisei @ 2022-07-27 11:08 ` Marc Zyngier -1 siblings, 0 replies; 72+ messages in thread From: Marc Zyngier @ 2022-07-27 11:08 UTC (permalink / raw) To: Alexandru Elisei; +Cc: Will Deacon, kvmarm, linux-arm-kernel On 2022-07-27 11:44, Alexandru Elisei wrote: > On Wed, Jul 27, 2022 at 11:29:03AM +0100, Marc Zyngier wrote: >> On 2022-07-27 11:19, Alexandru Elisei wrote: >> > Hi Oliver, >> > >> > Thank you for the help, replies below. >> > >> > On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote: >> > > Hi Alex, >> > > >> > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: >> > > >> > > [...] >> > > >> > > > > A funkier approach might be to defer pinning of the buffer until the SPE is >> > > > > enabled and avoid pinning all of VM memory that way, although I can't >> > > > > immediately tell how flexible the architecture is in allowing you to cache >> > > > > the base/limit values. >> > > > >> > > > I was investigating this approach, and Mark raised a concern that I think >> > > > might be a showstopper. >> > > > >> > > > Let's consider this scenario: >> > > > >> > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, >> > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). >> > > > >> > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). >> > > > 2. Guest programs SPE to enable profiling at **EL0** >> > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). >> > > > 3. Guest changes the translation table entries for the buffer. The >> > > > architecture allows this. >> > > > 4. Guest does an ERET to EL0, thus enabling profiling. >> > > > >> > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin >> > > > the buffer at stage 2 when profiling gets enabled at EL0. >> > > >> > > Not saying we necessarily should, but this is possible with FGT no? 
>> > >> > It doesn't look to me like FEAT_FGT offers any knobs to trap ERET from >> > EL1. >> >> See HFGITR.ERET. > > Ah, so that's the register, thanks! > > I still am not sure that having FEAT_SPE, an Armv8.2 extension, depend > on > FEAT_FGT, an Armv8.6 extension, is the best idea. Do you know of any > machines > that have FEAT_SPE and FEAT_FGT? None. Both are pretty niche, and the combination is nowhere to be seen at the moment. > On the plus side, KVM could enable the trap only in the case above, and > disable > it after the ERET is trapped, so it should be relatively cheap to use. This feels pretty horrible. Nothing says *when* will EL1 alter the PTs. It could take tons of EL1->EL1 exceptions before returning to EL0. And the change could happen after an EL1->EL0->EL1 transition. At which point do you stop? If you want to rely on ERET for that, you need to trap ERET all the time, because all ERETs to EL0 will be suspect. And doing that to handle such a corner case feels pretty horrible. M. -- Jazz is not dead. It just smells funny... 
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-27 11:08 ` Marc Zyngier @ 2022-07-27 11:57 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-07-27 11:57 UTC (permalink / raw) To: Marc Zyngier; +Cc: Will Deacon, kvmarm, linux-arm-kernel Hi, On Wed, Jul 27, 2022 at 12:08:11PM +0100, Marc Zyngier wrote: > On 2022-07-27 11:44, Alexandru Elisei wrote: > > On Wed, Jul 27, 2022 at 11:29:03AM +0100, Marc Zyngier wrote: > > > On 2022-07-27 11:19, Alexandru Elisei wrote: > > > > Hi Oliver, > > > > > > > > Thank you for the help, replies below. > > > > > > > > On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote: > > > > > Hi Alex, > > > > > > > > > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: > > > > > > > > > > [...] > > > > > > > > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > > > > > enabled and avoid pinning all of VM memory that way, although I can't > > > > > > > immediately tell how flexible the architecture is in allowing you to cache > > > > > > > the base/limit values. > > > > > > > > > > > > I was investigating this approach, and Mark raised a concern that I think > > > > > > might be a showstopper. > > > > > > > > > > > > Let's consider this scenario: > > > > > > > > > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > > > > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > > > > > > > > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > > > > > > 2. Guest programs SPE to enable profiling at **EL0** > > > > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > > > > > > 3. Guest changes the translation table entries for the buffer. The > > > > > > architecture allows this. > > > > > > 4. Guest does an ERET to EL0, thus enabling profiling. 
> > > > > > > > > > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin > > > > > > the buffer at stage 2 when profiling gets enabled at EL0. > > > > > > > > > > Not saying we necessarily should, but this is possible with FGT no? > > > > > > > > It doesn't look to me like FEAT_FGT offers any knobs to trap ERET from > > > > EL1. > > > > > > See HFGITR.ERET. > > > > Ah, so that's the register, thanks! > > > > I still am not sure that having FEAT_SPE, an Armv8.2 extension, depend on > > FEAT_FGT, an Armv8.6 extension, is the best idea. Do you know of any > > machines > > that have FEAT_SPE and FEAT_FGT? > > None. Both are pretty niche, and the combination is nowhere > to be seen at the moment. That was also my impression. > > > On the plus side, KVM could enable the trap only in the case above, and > > disable > > it after the ERET is trapped, so it should be relatively cheap to use. > > This feels pretty horrible. Nothing says *when* will EL1 > alter the PTs. It could take tons of EL1->EL1 exceptions > before returning to EL0. And the change could happen after > an EL1->EL0->EL1 transition. At which point do you stop? ERET trapping is enabled when PMBLIMITR_EL1.E = 1, PMSCR_EL1.{E0SPE,E1SPE} = {1,0}. The first guest ERET from EL1 to EL0 enables profiling, at which point the buffer is pinned and ERET trapping is disabled. Guest messing with the translation tables while profiling is enabled is the guest's problem because that's not permitted by the architecture. Any stage 2 DABT taken when the buffer is pinned would be injected back into the guest as an SPE external abort (or something equivalent). Stage 1 DABTs are entirely the guest's problem to solve and would be injected back regardless of the status of the buffer. Yes, I agree, there could be a lot of ERETs from EL1 to EL1 before the ERET to EL0; those ERETs would be uselessly trapped. 
The above is a moot point anyway, because I believe we both agree that having SPE emulation depend on FEAT_FGT is best to be avoided. Thanks, Alex > > If you want to rely on ERET for that, you need to trap > ERET all the time, because all ERETs to EL0 will be > suspect. And doing that to handle such a corner case feels > pretty horrible. > > M. > -- > Jazz is not dead. It just smells funny...
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-27 11:57 ` Alexandru Elisei @ 2022-07-27 15:15 ` Oliver Upton -1 siblings, 0 replies; 72+ messages in thread From: Oliver Upton @ 2022-07-27 15:15 UTC (permalink / raw) To: Alexandru Elisei; +Cc: Marc Zyngier, Will Deacon, kvmarm, linux-arm-kernel On Wed, Jul 27, 2022 at 12:57:16PM +0100, Alexandru Elisei wrote: > Hi, > > On Wed, Jul 27, 2022 at 12:08:11PM +0100, Marc Zyngier wrote: > > On 2022-07-27 11:44, Alexandru Elisei wrote: > > > On Wed, Jul 27, 2022 at 11:29:03AM +0100, Marc Zyngier wrote: > > > > On 2022-07-27 11:19, Alexandru Elisei wrote: > > > > > Hi Oliver, > > > > > > > > > > Thank you for the help, replies below. > > > > > > > > > > On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote: > > > > > > Hi Alex, > > > > > > > > > > > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: > > > > > > > > > > > > [...] > > > > > > > > > > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > > > > > > enabled and avoid pinning all of VM memory that way, although I can't > > > > > > > > immediately tell how flexible the architecture is in allowing you to cache > > > > > > > > the base/limit values. > > > > > > > > > > > > > > I was investigating this approach, and Mark raised a concern that I think > > > > > > > might be a showstopper. > > > > > > > > > > > > > > Let's consider this scenario: > > > > > > > > > > > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > > > > > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > > > > > > > > > > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > > > > > > > 2. Guest programs SPE to enable profiling at **EL0** > > > > > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > > > > > > > 3. Guest changes the translation table entries for the buffer. The > > > > > > > architecture allows this. > > > > > > > 4. 
Guest does an ERET to EL0, thus enabling profiling. > > > > > > > > > > > > > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin > > > > > > > the buffer at stage 2 when profiling gets enabled at EL0. > > > > > > > > > > > > Not saying we necessarily should, but this is possible with FGT no? > > > > > > > > > > It doesn't look to me like FEAT_FGT offers any knobs to trap ERET from > > > > > EL1. > > > > > > > > See HFGITR.ERET. > > > > > > Ah, so that's the register, thanks! > > > > > > I still am not sure that having FEAT_SPE, an Armv8.3 extension, depend on > > > FEAT_FGT, an Armv8.6 extension, is the best idea. Do you know of any > > > machines > > > that have FEAT_SPE and FEAT_FGT? > > > > None. Both are pretty niche, and the combination is nowhere > > to be seen at the moment. > > That was also my impression. > > > > > On the plus side, KVM could enable the trap only in the case above, and > > > disable > > > it after the ERET is trapped, so it should be relatively cheap to use. > > > > This feels pretty horrible. Nothing says *when* will EL1 > > alter the PTs. It could take tons of EL1->EL1 exceptions > > before returning to EL0. And the change could happen after > > an EL1->EL0->EL1 transition. At which point do you stop? > > ERET trapping is enabled when PMBLIMITR_EL1.E = 1, PMSCR_EL1.{E0SPE,E1SPE} > = {1,0}. The first guest ERET from EL1 to EL0 enables profiling, at which > point the buffer is pinned and ERET trapping is disabled. > > Guest messing with the translation tables while profiling is enabled is the > guest's problem because that's not permitted by the architecture. Any stage > 2 dabt taken when the buffer is pinned would be injected back into the > guest as an SPE external abort (or something equivalent). Stage 1 dabts are > entirely the guest's problem to solve and would be injected back regardless > of the status of the buffer. 
> > Yes, I agree, there could be a lot of ERETs from EL1 to EL1 before the ERET > to EL0; those ERETs would be uselessly trapped. > > The above is a moot point anyway, because I believe we both agree that > having SPE emulation depend on FEAT_FGT is best to be avoided. LOL, I probably shouldn't have even mentioned it :) Completely agree with you both, trapping ERET is bordering on mad. -- Thanks, Oliver
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-26 17:51 ` Oliver Upton @ 2022-07-27 11:00 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-07-27 11:00 UTC (permalink / raw) To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi Oliver, On Tue, Jul 26, 2022 at 10:51:21AM -0700, Oliver Upton wrote: > Hi Alex, > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: > > [...] > > I'm also a bit confused on how we would manage to un-pin memory on the > way out with this. The guest is free to muck with the stage 1 and could > cause the SPU to spew a bunch of stage 2 aborts if it wanted to be > annoying. One way to tackle it would be to only allow a single > root-to-target walk to be pinned by a vCPU at a time. Any time a new > stage 2 abort comes from the SPU, we un-pin the old walk and pin the new > one instead. On the topic of memory unpinning, for a well-behaved guest I believe that should be done the next time the buffer is pinned. The buffer can (and should!) be drained when both the buffer and sampling are disabled; unpinning the buffer when profiling becomes disabled would lead to unnecessary stage 2 faults when draining it. That approach also means that KVM wouldn't have to do anything special for SPE stage 2 faults. Thanks, Alex
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-07-25 10:06 ` Alexandru Elisei @ 2022-08-01 17:00 ` Will Deacon -1 siblings, 0 replies; 72+ messages in thread From: Will Deacon @ 2022-08-01 17:00 UTC (permalink / raw) To: Alexandru Elisei; +Cc: maz, kvmarm, linux-arm-kernel Hi Alex, On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: > On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote: > > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote: > > > The approach I've taken so far in adding support for SPE in KVM [1] relies > > > on pinning the entire VM memory to avoid SPE triggering stage 2 faults > > > altogether. I've taken this approach because: > > > > > > 1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults, > > > and at the moment KVM has no way to resolve the VA to IPA translation. The > > > AT instruction is not useful here, because PAR_EL1 doesn't report the IPA > > > in the case of a stage 2 fault on a stage 1 translation table walk. > > > > > > 2. The stage 2 fault is reported asynchronously via an interrupt, which > > > means there will be a window where profiling is stopped from the moment SPE > > > triggers the fault and when the PE takes the interrupt. This blackout window > > > is obviously not present when running on bare metal, as there is no second > > > stage of address translation being performed. > > > > Are these faults actually recoverable? My memory is a bit hazy here, but I > > thought SPE buffer data could be written out in whacky ways such that even > > a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1), > > and so pinning is the only game in town. > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > enabled and avoid pinning all of VM memory that way, although I can't > > immediately tell how flexible the architecture is in allowing you to cache > > the base/limit values. 
> > I was investigating this approach, and Mark raised a concern that I think > might be a showstopper. > > Let's consider this scenario: > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > 2. Guest programs SPE to enable profiling at **EL0** > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > 3. Guest changes the translation table entries for the buffer. The > architecture allows this. The architecture also allows MMIO accesses to use writeback addressing modes, but it doesn't provide a mechanism to virtualise them sensibly. So I'd prefer that we don't pin all of guest memory just to satisfy a corner case -- as long as the impact of a guest doing this funny sequence is constrained to the guest, then I think pinning only what is required is probably the most pragmatic approach. Is it ideal? No, of course not, and we should probably try to get the debug architecture extended to be properly virtualisable, but in the meantime having major operating systems as guests and being able to use SPE without pinning seems like a major design goal to me. In any case, that's just my thinking on this and I defer to Oliver and Marc on the ultimate decision. Will
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-08-01 17:00 ` Will Deacon @ 2022-08-02 9:49 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-08-02 9:49 UTC (permalink / raw) To: Will Deacon; +Cc: maz, kvmarm, linux-arm-kernel Hi, (+Oliver) On Mon, Aug 01, 2022 at 06:00:56PM +0100, Will Deacon wrote: > Hi Alex, > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: > > On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote: > > > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote: > > > > The approach I've taken so far in adding support for SPE in KVM [1] relies > > > > on pinning the entire VM memory to avoid SPE triggering stage 2 faults > > > > altogether. I've taken this approach because: > > > > > > > > 1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults, > > > > and at the moment KVM has no way to resolve the VA to IPA translation. The > > > > AT instruction is not useful here, because PAR_EL1 doesn't report the IPA > > > > in the case of a stage 2 fault on a stage 1 translation table walk. > > > > > > > > 2. The stage 2 fault is reported asynchronously via an interrupt, which > > > > means there will be a window where profiling is stopped from the moment SPE > > > > triggers the fault and when the PE takes the interrupt. This blackout window > > > > is obviously not present when running on bare metal, as there is no second > > > > stage of address translation being performed. > > > > > > Are these faults actually recoverable? My memory is a bit hazy here, but I > > > thought SPE buffer data could be written out in whacky ways such that even > > > a bog-standard page fault could result in unrecoverable data loss (i.e. DL=1), > > > and so pinning is the only game in town. 
> > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > enabled and avoid pinning all of VM memory that way, although I can't > > > immediately tell how flexible the architecture is in allowing you to cache > > > the base/limit values. > > > > I was investigating this approach, and Mark raised a concern that I think > > might be a showstopper. > > > > Let's consider this scenario: > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > > 2. Guest programs SPE to enable profiling at **EL0** > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > > 3. Guest changes the translation table entries for the buffer. The > > architecture allows this. > > The architecture also allows MMIO accesses to use writeback addressing > modes, but it doesn't provide a mechanism to virtualise them sensibly. > > So I'd prefer that we don't pin all of guest memory just to satisfy a corner > case -- as long as the impact of a guest doing this funny sequence is > constrained to the guest, then I think pinning only what is required is > probably the most pragmatic approach. > > Is it ideal? No, of course not, and we should probably try to get the debug > architecture extended to be properly virtualisable, but in the meantime > having major operating systems as guests and being able to use SPE without > pinning seems like a major design goal to me. > > In any case, that's just my thinking on this and I defer to Oliver and > Marc on the ultimate decision.

Thank you for the input. To summarize the approaches we've discussed so far:

1. Pinning the entire guest memory
- Heavy handed and not ideal.
- Tried this approach in v5 of the SPE series [1], patches #2-#12.

2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2 faults reported by SPE.
- Not feasible, because the entire contents of the buffer must be discarded if PMBSR_EL1.DL is set to 1 when taking the fault.
- Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA, not the IPA.

3. Pinning the guest SPE buffer when profiling becomes enabled*:
- There is the corner case described above, when profiling becomes enabled as a result of an ERET to EL0. This can happen when the buffer is enabled and PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
- The previous buffer is unpinned when a new buffer is pinned, to avoid SPE stage 2 faults when draining the buffer, which is performed with profiling disabled.
- Also requires KVM to walk the guest's stage 1 tables.

4. Pin the entire guest SPE buffer after the first stage 2 fault reported by SPE.
- Gets rid of the corner case at 3.
- Same approach to buffer unpinning as 3.
- Introduces a blackout window before the first record is written.
- Also requires KVM to walk the guest's stage 1 tables.

As for the corner case at 3, I proposed either:

a) Mandate that guest operating systems must never modify the buffer translation entries if the buffer is enabled and PMSCR_EL1.{E0SPE,E1SPE} = {1,0}.

b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE, but **only** for this corner case. For all other cases, the buffer is pinned when profiling becomes enabled, to eliminate the blackout window. Guest operating systems can be modified to not change the translation entries for the buffer if this blackout window is not desirable. Pinning as a result of the **first** stage 2 fault should work, because there are no prior records that would have to be discarded if PMBSR_EL1.DL = 1.

I hope I haven't missed anything. Thoughts and suggestions more than welcome.

*Profiling enabled, as per the Arm ARM, means the buffer is enabled and sampling is enabled at the current exception level. 
[1] https://lore.kernel.org/all/20211117153842.302159-1-alexandru.elisei@arm.com/ Thanks, Alex
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory @ 2022-08-02 9:49 ` Alexandru Elisei 0 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-08-02 9:49 UTC (permalink / raw) To: Will Deacon Cc: mark.rutland, linux-arm-kernel, maz, james.morse, suzuki.poulose, kvmarm, oliver.upton Hi, (+Oliver) On Mon, Aug 01, 2022 at 06:00:56PM +0100, Will Deacon wrote: > Hi Alex, > > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote: > > On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote: > > > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote: > > > > The approach I've taken so far in adding support for SPE in KVM [1] relies > > > > on pinning the entire VM memory to avoid SPE triggering stage 2 faults > > > > altogether. I've taken this approach because: > > > > > > > > 1. SPE reports the guest VA on an stage 2 fault, similar to stage 1 faults, > > > > and at the moment KVM has no way to resolve the VA to IPA translation. The > > > > AT instruction is not useful here, because PAR_EL1 doesn't report the IPA > > > > in the case of a stage 2 fault on a stage 1 translation table walk. > > > > > > > > 2. The stage 2 fault is reported asynchronously via an interrupt, which > > > > means there will be a window where profiling is stopped from the moment SPE > > > > triggers the fault and when the PE taks the interrupt. This blackout window > > > > is obviously not present when running on bare metal, as there is no second > > > > stage of address translation being performed. > > > > > > Are these faults actually recoverable? My memory is a bit hazy here, but I > > > thought SPE buffer data could be written out in whacky ways such that even > > > a bog-standard page fault could result in uncoverable data loss (i.e. DL=1), > > > and so pinning is the only game in town. 
> > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > enabled and avoid pinning all of VM memory that way, although I can't > > > immediately tell how flexible the architecture is in allowing you to cache > > > the base/limit values. > > > > I was investigating this approach, and Mark raised a concern that I think > > might be a showstopper. > > > > Let's consider this scenario: > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > > 2. Guest programs SPE to enable profiling at **EL0** > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > > 3. Guest changes the translation table entries for the buffer. The > > architecture allows this. > > The architecture also allows MMIO accesses to use writeback addressing > modes, but it doesn't provide a mechanism to virtualise them sensibly. > > So I'd prefer that we don't pin all of guest memory just to satisfy a corner > case -- as long as the impact of a guest doing this funny sequence is > constrained to the guest, then I think pinning only what is required is > probably the most pragmatic approach. > > Is it ideal? No, of course not, and we should probably try to get the debug > architecture extended to be properly virtualisable, but in the meantime > having major operating systems as guests and being able to use SPE without > pinning seems like a major design goal to me. > > In any case, that's just my thinking on this and I defer to Oliver and > Marc on the ultimate decision. Thank you for the input. To summarize the approaches we've discussed so far: 1. Pinning the entire guest memory - Heavy handed and not ideal. - Tried this approach in v5 of the SPE series [1], patches #2-#12. 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2 faults reported by SPE. 
- Not feasible, because the entire contents of the buffer must be discarded if PMBSR_EL1.DL is set to 1 when taking the fault.
- Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA, not the IPA.

3. Pinning the guest SPE buffer when profiling becomes enabled*:
- There is the corner case described above, when profiling becomes enabled as a result of an ERET to EL0. This can happen when the buffer is enabled and PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
- The previous buffer is unpinned when a new buffer is pinned, to avoid SPE stage 2 faults when draining the buffer, which is performed with profiling disabled.
- Also requires KVM to walk the guest's stage 1 tables.

4. Pin the entire guest SPE buffer after the first stage 2 fault reported by SPE.
- Gets rid of the corner case at 3.
- Same approach to buffer unpinning as 3.
- Introduces a blackout window before the first record is written.
- Also requires KVM to walk the guest's stage 1 tables.

As for the corner case at 3, I proposed either:

a) Mandate that guest operating systems must never modify the buffer translation entries if the buffer is enabled and PMSCR_EL1.{E0SPE,E1SPE} = {1,0}.

b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE, but **only** for this corner case. For all other cases, the buffer is pinned when profiling becomes enabled, to eliminate the blackout window. Guest operating systems can be modified to not change the translation entries for the buffer if this blackout window is not desirable.

Pinning as a result of the **first** stage 2 fault should work, because there are no prior records that would have to be discarded if PMBSR_EL1.DL = 1.

I hope I haven't missed anything. Thoughts and suggestions more than welcome.

*Profiling enabled, as per the Arm ARM, means the buffer is enabled and sampling is enabled at the current exception level.
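[Editorial aside] The corner case at 3 can be made concrete with a minimal sketch. This is not KVM code: the register fields are reduced to plain booleans and all names are invented for illustration. It models the Arm ARM's "profiling enabled" condition (buffer enabled via PMBLIMITR_EL1.E, sampling enabled at the current exception level via PMSCR_EL1.{E0SPE,E1SPE}) and shows why a trap taken when "profiling becomes enabled" fires only on the ERET to EL0 in the {1,0} configuration:

```c
#include <stdbool.h>

/* Illustrative only: SPE controls reduced to booleans. */
struct spe_regs {
	bool pmblimitr_e;	/* PMBLIMITR_EL1.E: buffer enable */
	bool pmscr_e0spe;	/* PMSCR_EL1.E0SPE: sampling at EL0 */
	bool pmscr_e1spe;	/* PMSCR_EL1.E1SPE: sampling at EL1 */
};

/*
 * "Profiling enabled" per the Arm ARM definition quoted above: the
 * buffer is enabled AND sampling is enabled at the current EL.
 */
static bool profiling_enabled(const struct spe_regs *r, int current_el)
{
	if (!r->pmblimitr_e)
		return false;
	return current_el == 0 ? r->pmscr_e0spe : r->pmscr_e1spe;
}
```

With PMBLIMITR_EL1.E = 1 and PMSCR_EL1.{E0SPE,E1SPE} = {1,0}, `profiling_enabled()` is false while the guest runs at EL1 and flips to true only on the ERET to EL0 -- by which time step 3 of the scenario may already have changed the buffer's translation table entries, so pin-at-enable has nothing valid to pin.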
[1] https://lore.kernel.org/all/20211117153842.302159-1-alexandru.elisei@arm.com/ Thanks, Alex _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-08-02 9:49 ` Alexandru Elisei @ 2022-08-02 19:34 ` Oliver Upton -1 siblings, 0 replies; 72+ messages in thread From: Oliver Upton @ 2022-08-02 19:34 UTC (permalink / raw) To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi folks, On Tue, Aug 02, 2022 at 10:49:07AM +0100, Alexandru Elisei wrote: > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > > enabled and avoid pinning all of VM memory that way, although I can't > > > > immediately tell how flexible the architecture is in allowing you to cache > > > > the base/limit values. > > > > > > I was investigating this approach, and Mark raised a concern that I think > > > might be a showstopper. > > > > > > Let's consider this scenario: > > > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > > > 2. Guest programs SPE to enable profiling at **EL0** > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > > > 3. Guest changes the translation table entries for the buffer. The > > > architecture allows this. > > > > The architecture also allows MMIO accesses to use writeback addressing > > modes, but it doesn't provide a mechanism to virtualise them sensibly. > > > > So I'd prefer that we don't pin all of guest memory just to satisfy a corner > > case -- as long as the impact of a guest doing this funny sequence is > > constrained to the guest, then I think pinning only what is required is > > probably the most pragmatic approach. > > > > Is it ideal? No, of course not, and we should probably try to get the debug > > architecture extended to be properly virtualisable, but in the meantime > > having major operating systems as guests and being able to use SPE without > > pinning seems like a major design goal to me. 
> > > > In any case, that's just my thinking on this and I defer to Oliver and > > Marc on the ultimate decision. Thanks for chiming in Will, very much agree that pragmatism is likely the best route forward. While fun to poke at all the pitfalls of virtualizing SPE, pulling tricks in KVM probably has marginal return over a simpler approach. > Thank you for the input. > > To summarize the approaches we've discussed so far: > > 1. Pinning the entire guest memory > - Heavy handed and not ideal. > - Tried this approach in v5 of the SPE series [1], patches #2-#12. > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2 > faults reported by SPE. > - Not feasible, because the entire contents of the buffer must be discarded is > PMBSR_EL1.DL is set to 1 when taking the fault. > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA, > not the IPA. > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > - There is the corner case described above, when profiling becomes enabled as a > result of an ERET to EL0. This can happen when the buffer is enabled and > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > stage 2 faults when draining the buffer, which is performed with profiling > disabled. > - Also requires KVM to walk the guest's stage 1 tables. > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by > SPE. > - Gets rid of the corner case at 3. > - Same approach to buffer unpinning as 3. > - Introduces a blackout window before the first record is written. > - Also requires KVM to walk the guest's stage 1 tables. > > As for the corner case at 3, I proposed either: > > a) Mandate that guest operating systems must never modify the buffer > translation entries if the buffer is enabled and > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}. 
> > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE, > but **only** for this corner case. For all other cases, the buffer is pinned > when profiling becomes enabled, to eliminate the blackout window. Guest > operating systems can be modified to not change the translation entries for the > buffer if this blackout window is not desirable. > > Pinning as a result of the **first** stage 2 fault should work, because there > are no prior records that would have to be discarded if PMSBR_EL1.DL = 1. > > I hope I haven't missed anything. Thoughts and suggestions more than welcome. Thanks Alex for pulling together all of the context here. Unless there's any other strong opinions on the topic, it seems to me that option #4 (pin on S2 fault) is probably the best approach for the initial implementation. No amount of tricks in KVM can work around the fact that SPE has some serious issues w.r.t. virtualization. With that, we should probably document the behavior of SPE as a known erratum of KVM. If folks complain about EL1 profile blackout, eagerly pinning when profiling is enabled could layer on top quite easily by treating it as a synthetic S2 fault and triggering the implementation of #4. Having said that I don't believe it is a hard requirement for enabling some flavor of SPE for guests. Walking guest S1 in KVM doesn't sound too exciting although it'll need to be done eventually. Do you feel like this is an OK route forward, or have I missed something? -- Thanks, Oliver _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm ^ permalink raw reply [flat|nested] 72+ messages in thread
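[Editorial aside] Option #4 (pin on S2 fault), which Oliver favours above, could be sketched roughly as follows. Every function here is hypothetical -- none of them exist in KVM -- and the stage 1 walk and pinning primitives are stubbed out; a 4KiB guest page size is assumed for the arithmetic:

```c
#include <stdint.h>
#include <stddef.h>

struct spe_buf {
	uint64_t ipa_base;
	size_t nr_pages;
	int pinned;
};

/* Stub: guest stage 1 walk, VA -> IPA (the walker KVM would need to grow). */
static int guest_s1_walk(uint64_t va, uint64_t *ipa)
{
	*ipa = va;	/* identity-mapped stand-in for this sketch */
	return 0;
}

/* Stubs: pin/unpin a range of guest memory at stage 2. */
static int pin_ipa_range(uint64_t ipa, size_t nr_pages) { (void)ipa; (void)nr_pages; return 0; }
static void unpin_ipa_range(uint64_t ipa, size_t nr_pages) { (void)ipa; (void)nr_pages; }

/*
 * On the first SPE-reported stage 2 fault: translate the reported VA
 * through the guest's stage 1, pin the whole buffer up to the limit,
 * and release any previously pinned buffer (the unpinning policy
 * shared by approaches #3 and #4 in the summary above).
 */
static int spe_s2_fault(struct spe_buf *buf, uint64_t fault_va, uint64_t limit_va)
{
	uint64_t ipa;
	size_t nr_pages = (limit_va - fault_va + 4095) / 4096;
	int ret = guest_s1_walk(fault_va, &ipa);

	if (ret)
		return ret;	/* guest S1 fault: reflect it back to the guest */

	if (buf->pinned)
		unpin_ipa_range(buf->ipa_base, buf->nr_pages);

	ret = pin_ipa_range(ipa, nr_pages);
	if (!ret) {
		buf->ipa_base = ipa;
		buf->nr_pages = nr_pages;
		buf->pinned = 1;
	}
	return ret;
}
```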
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-08-02 19:34 ` Oliver Upton @ 2022-08-09 14:01 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-08-09 14:01 UTC (permalink / raw) To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi, On Tue, Aug 02, 2022 at 12:34:40PM -0700, Oliver Upton wrote: > Hi folks, > > On Tue, Aug 02, 2022 at 10:49:07AM +0100, Alexandru Elisei wrote: > > > > > A funkier approach might be to defer pinning of the buffer until the SPE is > > > > > enabled and avoid pinning all of VM memory that way, although I can't > > > > > immediately tell how flexible the architecture is in allowing you to cache > > > > > the base/limit values. > > > > > > > > I was investigating this approach, and Mark raised a concern that I think > > > > might be a showstopper. > > > > > > > > Let's consider this scenario: > > > > > > > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0, > > > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}). > > > > > > > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1). > > > > 2. Guest programs SPE to enable profiling at **EL0** > > > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}). > > > > 3. Guest changes the translation table entries for the buffer. The > > > > architecture allows this. > > > > > > The architecture also allows MMIO accesses to use writeback addressing > > > modes, but it doesn't provide a mechanism to virtualise them sensibly. > > > > > > So I'd prefer that we don't pin all of guest memory just to satisfy a corner > > > case -- as long as the impact of a guest doing this funny sequence is > > > constrained to the guest, then I think pinning only what is required is > > > probably the most pragmatic approach. > > > > > > Is it ideal? 
No, of course not, and we should probably try to get the debug > > > architecture extended to be properly virtualisable, but in the meantime > > > having major operating systems as guests and being able to use SPE without > > > pinning seems like a major design goal to me. > > > > > > In any case, that's just my thinking on this and I defer to Oliver and > > > Marc on the ultimate decision. > > Thanks for chiming in Will, very much agree that pragmatism is likely > the best route forward. While fun to poke at all the pitfalls of > virtualizing SPE, pulling tricks in KVM probably has marginal return > over a simpler approach. > > > Thank you for the input. > > > > To summarize the approaches we've discussed so far: > > > > 1. Pinning the entire guest memory > > - Heavy handed and not ideal. > > - Tried this approach in v5 of the SPE series [1], patches #2-#12. > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2 > > faults reported by SPE. > > - Not feasible, because the entire contents of the buffer must be discarded is > > PMBSR_EL1.DL is set to 1 when taking the fault. > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA, > > not the IPA. > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > > - There is the corner case described above, when profiling becomes enabled as a > > result of an ERET to EL0. This can happen when the buffer is enabled and > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > > stage 2 faults when draining the buffer, which is performed with profiling > > disabled. > > - Also requires KVM to walk the guest's stage 1 tables. > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by > > SPE. > > - Gets rid of the corner case at 3. > > - Same approach to buffer unpinning as 3. > > - Introduces a blackout window before the first record is written. 
> > - Also requires KVM to walk the guest's stage 1 tables. > > > > As for the corner case at 3, I proposed either: > > > > a) Mandate that guest operating systems must never modify the buffer > > translation entries if the buffer is enabled and > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}. > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE, > > but **only** for this corner case. For all other cases, the buffer is pinned > > when profiling becomes enabled, to eliminate the blackout window. Guest > > operating systems can be modified to not change the translation entries for the > > buffer if this blackout window is not desirable. > > > > Pinning as a result of the **first** stage 2 fault should work, because there > > are no prior records that would have to be discarded if PMSBR_EL1.DL = 1. > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome. > > Thanks Alex for pulling together all of the context here. > > Unless there's any other strong opinions on the topic, it seems to me > that option #4 (pin on S2 fault) is probably the best approach for > the initial implementation. No amount of tricks in KVM can work around > the fact that SPE has some serious issues w.r.t. virtualization. With > that, we should probably document the behavior of SPE as a known erratum > of KVM. > > If folks complain about EL1 profile blackout, eagerly pinning when > profiling is enabled could layer on top quite easily by treating it as > a synthetic S2 fault and triggering the implementation of #4. Having I'm not sure I follow, I understand what you mean by "treating it as a synthetic S2 fault", would you mind elaborating? > said that I don't believe it is a hard requirement for enabling some > flavor of SPE for guests. > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to > be done eventually. > > Do you feel like this is an OK route forward, or have I missed > something? 
I've been giving this some thought, and I prefer approach #3 because with #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it will be impossible to distinguish between a valid stage 2 fault (a fault caused by the guest reprogramming the buffer and enabling profiling) and KVM messing something up when pinning the buffer. I believe this to be important, as experience has shown me that pinning the buffer at stage 2 is not trivial and there isn't a mechanism today in Linux to do that (explanation and examples here [1]).

With approach #4, it would be impossible to figure out if the results of a profiling operation inside a guest are representative of the workload or not, because those SPE stage 2 faults triggered by a bug in KVM can happen multiple times per profiling session, introducing multiple blackout windows that can skew the results.

If you're proposing that the blackout window when the first record is written be documented as an erratum for KVM, then why not go a step further and document as an erratum that changing the buffer translation tables after the buffer has been enabled will lead to an SPE SError? That will allow us to always pin the buffer when profiling is enabled.

[1] https://lore.kernel.org/all/YuEMkKY2RU%2F2KiZW@monolith.localdoman/

Thanks,
Alex

>
> --
> Thanks,
> Oliver
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

_______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-08-09 14:01 ` Alexandru Elisei @ 2022-08-09 18:43 ` Oliver Upton -1 siblings, 0 replies; 72+ messages in thread From: Oliver Upton @ 2022-08-09 18:43 UTC (permalink / raw) To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi Alex, On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote: [...] > > > To summarize the approaches we've discussed so far: > > > > > > 1. Pinning the entire guest memory > > > - Heavy handed and not ideal. > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12. > > > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2 > > > faults reported by SPE. > > > - Not feasible, because the entire contents of the buffer must be discarded is > > > PMBSR_EL1.DL is set to 1 when taking the fault. > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA, > > > not the IPA. > > > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > > > - There is the corner case described above, when profiling becomes enabled as a > > > result of an ERET to EL0. This can happen when the buffer is enabled and > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > > > stage 2 faults when draining the buffer, which is performed with profiling > > > disabled. > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by > > > SPE. > > > - Gets rid of the corner case at 3. > > > - Same approach to buffer unpinning as 3. > > > - Introduces a blackout window before the first record is written. > > > - Also requires KVM to walk the guest's stage 1 tables. 
> > > > > > As for the corner case at 3, I proposed either: > > > > > > a) Mandate that guest operating systems must never modify the buffer > > > translation entries if the buffer is enabled and > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}. > > > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE, > > > but **only** for this corner case. For all other cases, the buffer is pinned > > > when profiling becomes enabled, to eliminate the blackout window. Guest > > > operating systems can be modified to not change the translation entries for the > > > buffer if this blackout window is not desirable. > > > > > > Pinning as a result of the **first** stage 2 fault should work, because there > > > are no prior records that would have to be discarded if PMBSR_EL1.DL = 1. > > > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome. > > > > Thanks Alex for pulling together all of the context here. > > > > Unless there are any other strong opinions on the topic, it seems to me > > that option #4 (pin on S2 fault) is probably the best approach for > > the initial implementation. No amount of tricks in KVM can work around > > the fact that SPE has some serious issues w.r.t. virtualization. With > > that, we should probably document the behavior of SPE as a known erratum > > of KVM. > > > > If folks complain about EL1 profile blackout, eagerly pinning when > > profiling is enabled could layer on top quite easily by treating it as > > a synthetic S2 fault and triggering the implementation of #4. Having > > I'm not sure I understand what you mean by "treating it as a > synthetic S2 fault", would you mind elaborating? Assuming approach #4 is implemented, we will already have an SPE fault handler that walks stage-1 and pins the buffer. At that point, implementing approach #3 would be relatively easy. When EL1 sets PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer. 
> > said that I don't believe it is a hard requirement for enabling some > > flavor of SPE for guests. > > > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to > > be done eventually. > > > > Do you feel like this is an OK route forward, or have I missed > > something? > > I've been giving this some thought, and I prefer approach #3 because with > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it > will be impossible to distinguish between a valid stage 2 fault (a fault > caused by the guest reprogramming the buffer and enabling profiling) and > KVM messing something up when pinning the buffer. I believe this to be > important, as experience has shown me that pinning the buffer at stage 2 is > not trivial and there isn't a mechanism today in Linux to do that > (explanation and examples here [1]). How does eagerly pinning avoid stage-2 aborts, though? As you note in [1], page pinning does not avoid the possibility of the MMU notifiers being called on a given range. Want to make sure I'm following: what is your suggestion for approach #3 to handle the profile buffer when only enabled at EL0? > With approach #4, it would be impossible to figure out if the results of a > profiling operation inside a guest are representative of the workload or > not, because those SPE stage 2 faults triggered by a bug in KVM can happen > multiple times per profiling session, introducing multiple blackout windows > that can skew the results. > > If you're proposing that the blackout window when the first record is > written be documented as an erratum for KVM, then why not go a step further > and document as an erratum that changing the buffer translation tables > after the buffer has been enabled will lead to an SPE SError? 
Ah, there are certainly more errata in virtualizing SPE beyond what I had said :) Preserving the stage-1 translations while profiling is active is a good recommendation, although I'm not sure that we've completely eliminated the risk of stage-2 faults. It seems impossible to blame the guest for all stage-2 faults that happen in the middle of a profiling session. In addition to host mm driven changes to stage-2, live migration is busted as well. You'd need to build out stage-2 on the target before resuming the guest and guarantee that the appropriate pages have been demanded from the source (in case of post-copy). So, are we going to inject an SError for stage-2 faults outside of guest control as well? An external abort reported as an SPE buffer management event seems to be gracefully handled by the Linux driver, but that behavior is disallowed by SPEv1p3. To sum up the point I'm getting at: I agree that there are ways to reduce the risk of stage-2 faults in the middle of profiling, but I don't believe the current architecture allows KVM to virtualize the feature to the letter of the specification. -- Thanks, Oliver
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-08-09 18:43 ` Oliver Upton @ 2022-08-10 9:37 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-08-10 9:37 UTC (permalink / raw) To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi, On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote: > Hi Alex, > > On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote: > > [...] > > > > > To summarize the approaches we've discussed so far: > > > > > > > > 1. Pinning the entire guest memory > > > > - Heavy handed and not ideal. > > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12. > > > > > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2 > > > > faults reported by SPE. > > > > - Not feasible, because the entire contents of the buffer must be discarded is > > > > PMBSR_EL1.DL is set to 1 when taking the fault. > > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA, > > > > not the IPA. > > > > > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > > > > - There is the corner case described above, when profiling becomes enabled as a > > > > result of an ERET to EL0. This can happen when the buffer is enabled and > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > > > > stage 2 faults when draining the buffer, which is performed with profiling > > > > disabled. > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by > > > > SPE. > > > > - Gets rid of the corner case at 3. > > > > - Same approach to buffer unpinning as 3. > > > > - Introduces a blackout window before the first record is written. > > > > - Also requires KVM to walk the guest's stage 1 tables. 
> > > > > > > > As for the corner case at 3, I proposed either: > > > > > > > > a) Mandate that guest operating systems must never modify the buffer > > > > translation entries if the buffer is enabled and > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}. > > > > > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE, > > > > but **only** for this corner case. For all other cases, the buffer is pinned > > > > when profiling becomes enabled, to eliminate the blackout window. Guest > > > > operating systems can be modified to not change the translation entries for the > > > > buffer if this blackout window is not desirable. > > > > > > > > Pinning as a result of the **first** stage 2 fault should work, because there > > > > are no prior records that would have to be discarded if PMSBR_EL1.DL = 1. > > > > > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome. > > > > > > Thanks Alex for pulling together all of the context here. > > > > > > Unless there's any other strong opinions on the topic, it seems to me > > > that option #4 (pin on S2 fault) is probably the best approach for > > > the initial implementation. No amount of tricks in KVM can work around > > > the fact that SPE has some serious issues w.r.t. virtualization. With > > > that, we should probably document the behavior of SPE as a known erratum > > > of KVM. > > > > > > If folks complain about EL1 profile blackout, eagerly pinning when > > > profiling is enabled could layer on top quite easily by treating it as > > > a synthetic S2 fault and triggering the implementation of #4. Having > > > > I'm not sure I follow, I understand what you mean by "treating it as a > > synthetic S2 fault", would you mind elaborating? > > Assuming approach #4 is implemented, we will already have an SPE fault > handler that walks stage-1 and pins the buffer. At that point, > implementing approach #3 would be relatively easy. 
When EL1 sets > PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer. I see, that makes sense, thanks, > > > > said that I don't believe it is a hard requirement for enabling some > > > flavor of SPE for guests. > > > > > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to > > > be done eventually. > > > > > > Do you feel like this is an OK route forward, or have I missed > > > something? > > > > I've been giving this some thought, and I prefer approach #3 because with > > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it > > will be impossible to distinguish between a valid stage 2 fault (a fault > > caused by the guest reprogramming the buffer and enabling profiling) and > > KVM messing something up when pinning the buffer. I believe this to be > > important, as experience has shown me that pinning the buffer at stage 2 is > > not trivial and there isn't a mechanism today in Linux to do that > > (explanation and examples here [1]). > > How does eagerly pinning avoid stage-2 aborts, though? As you note in > [1], page pinning does not avoid the possibility of the MMU notifiers > being called on a given range. Want to make sure I'm following, what > is your suggestion for approach #3 to handle the profile buffer when > only enabled at EL0? > > > With approach #4, it would be impossible to figure out if the results of a > > profiling operations inside a guest are representative of the workload or > > not, because those SPE stage 2 faults triggered by a bug in KVM can happen > > multiple times per profiling session, introducing multiple blackout windows > > that can skew the results. > > > > If you're proposing that the blackout window when the first record is > > written be documented as an erratum for KVM, then why no got a step further > > and document as an erratum that changing the buffer translation tables > > after the buffer has been enabled will lead to an SPE Serror? 
That will > > allow us to always pin the buffer when profiling is enabled. > > Ah, there are certainly more errata in virtualizing SPE beyond what I > had said :) Preserving the stage-1 translations while profiling is > active is a good recommendation, although I'm not sure that we've > completely eliminated the risk of stage-2 faults. > > It seems impossible to blame the guest for all stage-2 faults that happen > in the middle of a profiling session. In addition to host mm driven changes > to stage-2, live migration is busted as well. You'd need to build out > stage-2 on the target before resuming the guest and guarantee that the > appropriate pages have been demanded from the source (in case of post-copy). > > So, are we going to inject an SError for stage-2 faults outside of guest > control as well? An external abort reported as an SPE buffer management > event seems to be gracefully handled by the Linux driver, but that behavior > is disallowed by SPEv1p3. > > To sum up the point I'm getting at: I agree that there are ways to > reduce the risk of stage-2 faults in the middle of profiling, but I > don't believe the current architecture allows KVM to virtualize the > feature to the letter of the specification. I believe there's some confusion here: emulating SPE **does not work** if stage 2 faults are triggered in the middle of a profiling session. Being able to have a memory range never unmapped from stage 2 is a **prerequisite** and is **required** for SPE emulation, it's not a nice-to-have. A stage 2 fault before the first record is written is acceptable because there are no other records already written which need to be thrown away. Stage 2 faults after at least one record has been written are unacceptable because it means that the contents of the buffer need to be thrown away. Does that make sense to you? I believe it is doable to have addresses always mapped at stage 2 with some changes to KVM, but that's not what this thread is about. 
This thread is about how and when to pin the buffer. As long as we're all agreed that buffer memory needs "pinning" (as in, the IPAs are never unmapped from stage 2 until KVM decides otherwise as part of SPE emulation), I believe that live migration is tangential to figuring out how and when the buffer should be "pinned". I'm more than happy to start a separate thread about live migration after we figure out how we should go about "pinning" the buffer; I think your insight would be most helpful :) Thanks, Alex > > -- > Thanks, > Oliver
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-08-10 9:37 ` Alexandru Elisei @ 2022-08-10 15:25 ` Oliver Upton -1 siblings, 0 replies; 72+ messages in thread From: Oliver Upton @ 2022-08-10 15:25 UTC (permalink / raw) To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel On Wed, Aug 10, 2022 at 10:37:26AM +0100, Alexandru Elisei wrote: > Hi, > > On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote: > > Hi Alex, > > > > On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote: > > > > [...] > > > > > > > To summarize the approaches we've discussed so far: > > > > > > > > > > 1. Pinning the entire guest memory > > > > > - Heavy handed and not ideal. > > > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12. > > > > > > > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2 > > > > > faults reported by SPE. > > > > > - Not feasible, because the entire contents of the buffer must be discarded is > > > > > PMBSR_EL1.DL is set to 1 when taking the fault. > > > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA, > > > > > not the IPA. > > > > > > > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > > > > > - There is the corner case described above, when profiling becomes enabled as a > > > > > result of an ERET to EL0. This can happen when the buffer is enabled and > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > > > > > stage 2 faults when draining the buffer, which is performed with profiling > > > > > disabled. > > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by > > > > > SPE. > > > > > - Gets rid of the corner case at 3. > > > > > - Same approach to buffer unpinning as 3. 
> > > > > - Introduces a blackout window before the first record is written. > > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > > > As for the corner case at 3, I proposed either: > > > > > > > > > > a) Mandate that guest operating systems must never modify the buffer > > > > > translation entries if the buffer is enabled and > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}. > > > > > > > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE, > > > > > but **only** for this corner case. For all other cases, the buffer is pinned > > > > > when profiling becomes enabled, to eliminate the blackout window. Guest > > > > > operating systems can be modified to not change the translation entries for the > > > > > buffer if this blackout window is not desirable. > > > > > > > > > > Pinning as a result of the **first** stage 2 fault should work, because there > > > > > are no prior records that would have to be discarded if PMSBR_EL1.DL = 1. > > > > > > > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome. > > > > > > > > Thanks Alex for pulling together all of the context here. > > > > > > > > Unless there's any other strong opinions on the topic, it seems to me > > > > that option #4 (pin on S2 fault) is probably the best approach for > > > > the initial implementation. No amount of tricks in KVM can work around > > > > the fact that SPE has some serious issues w.r.t. virtualization. With > > > > that, we should probably document the behavior of SPE as a known erratum > > > > of KVM. > > > > > > > > If folks complain about EL1 profile blackout, eagerly pinning when > > > > profiling is enabled could layer on top quite easily by treating it as > > > > a synthetic S2 fault and triggering the implementation of #4. Having > > > > > > I'm not sure I follow, I understand what you mean by "treating it as a > > > synthetic S2 fault", would you mind elaborating? 
> > > > Assuming approach #4 is implemented, we will already have an SPE fault > > handler that walks stage-1 and pins the buffer. At that point, > > implementing approach #3 would be relatively easy. When EL1 sets > > PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer. > > I see, that makes sense, thanks, > > > > > > said that I don't believe it is a hard requirement for enabling some > > > > flavor of SPE for guests. > > > > > > > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to > > > > be done eventually. > > > > > > > > Do you feel like this is an OK route forward, or have I missed > > > > something? > > > > > > I've been giving this some thought, and I prefer approach #3 because with > > > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it > > > will be impossible to distinguish between a valid stage 2 fault (a fault > > > caused by the guest reprogramming the buffer and enabling profiling) and > > > KVM messing something up when pinning the buffer. I believe this to be > > > important, as experience has shown me that pinning the buffer at stage 2 is > > > not trivial and there isn't a mechanism today in Linux to do that > > > (explanation and examples here [1]). > > > > How does eagerly pinning avoid stage-2 aborts, though? As you note in > > [1], page pinning does not avoid the possibility of the MMU notifiers > > being called on a given range. Want to make sure I'm following, what > > is your suggestion for approach #3 to handle the profile buffer when > > only enabled at EL0? > > > > > With approach #4, it would be impossible to figure out if the results of a > > > profiling operation inside a guest are representative of the workload or > > > not, because those SPE stage 2 faults triggered by a bug in KVM can happen > > > multiple times per profiling session, introducing multiple blackout windows > > > that can skew the results.
> > > > > > If you're proposing that the blackout window when the first record is > > > written be documented as an erratum for KVM, then why not go a step further > > > and document as an erratum that changing the buffer translation tables > > > after the buffer has been enabled will lead to an SPE SError? That will > > > allow us to always pin the buffer when profiling is enabled. > > > > Ah, there are certainly more errata in virtualizing SPE beyond what I > > had said :) Preserving the stage-1 translations while profiling is > > active is a good recommendation, although I'm not sure that we've > > completely eliminated the risk of stage-2 faults. > > > > It seems impossible to blame the guest for all stage-2 faults that happen > > in the middle of a profiling session. In addition to host mm driven changes > > to stage-2, live migration is busted as well. You'd need to build out > > stage-2 on the target before resuming the guest and guarantee that the > > appropriate pages have been demanded from the source (in case of post-copy). > > > > So, are we going to inject an SError for stage-2 faults outside of guest > > control as well? An external abort reported as an SPE buffer management > > event seems to be gracefully handled by the Linux driver, but that behavior > > is disallowed by SPEv1p3. > > > > To sum up the point I'm getting at: I agree that there are ways to > > reduce the risk of stage-2 faults in the middle of profiling, but I > > don't believe the current architecture allows KVM to virtualize the > > feature to the letter of the specification. > > I believe there's some confusion here: emulating SPE **does not work** if > stage 2 faults are triggered in the middle of a profiling session. Being > able to have a memory range never unmapped from stage 2 is a > **prerequisite** and is **required** for SPE emulation, it's not a nice to > have.
> > A stage 2 fault before the first record is written is acceptable because > there are no other records already written which need to be thrown away. > Stage 2 faults after at least one record has been written are unacceptable > because it means that the contents of the buffer need to be thrown away. > > Does that make sense to you? > > I believe it is doable to have addresses always mapped at stage 2 with some > changes to KVM, but that's not what this thread is about. This thread is > about how and when to pin the buffer. Sorry if I've been forcing a tangent, but I believe there is a lot of value in discussing what is to be done for keeping the stage-2 mapping alive. I've been whining about it out of the very concern you highlight: a stage-2 fault in the middle of the profile is game over. Otherwise, optimizations in *when* we pin the buffer seem meaningless as stage-2 faults appear unavoidable. Nonetheless, back to your proposal. Injecting some context from earlier: > 3. Pinning the guest SPE buffer when profiling becomes enabled*: So we are only doing this when enabled for EL1, right? (PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}) > - There is the corner case described above, when profiling becomes enabled as a > result of an ERET to EL0. This can happen when the buffer is enabled and > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; Is your proposal for the EL0 case to pin on fault or pin when E0SPE is set (outside of the architecture's definition of when profiling is enabled)? > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > stage 2 faults when draining the buffer, which is performed with profiling > disabled. Sounds reasonable. > As long as we're all agreed that buffer memory needs "pinning" (as in the > IPAs are never unmapped from stage 2 until KVM decides otherwise as part of > SPE emulation), I believe that live migration is tangential to figuring out > how and when the buffer should be "pinned".
I'm more than happy to start a > separate thread about live migration after we figure out how we should go > about "pinning" the buffer, I think your insight would be most helpful :) Fair enough, let's see how this all shakes out and then figure out LM thereafter :) -- Thanks, Oliver
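Oliver's "synthetic S2 fault" layering can be sketched as a small stand-alone model. Everything below is an illustrative assumption, not actual KVM code: the struct layout, function names, and the choice of PMBPTR_EL1 as the GVA to hand to the fault handler are all hypothetical; the sketch only assumes, per approach #4, that an SPE fault handler which walks stage 1 and pins the buffer already exists.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the relevant vCPU SPE state. */
struct spe_vcpu {
	unsigned long pmscr_el1;
	unsigned long pmblimitr_el1;
	unsigned long pmbptr_el1;	/* guest VA the buffer is written to */
	bool buffer_pinned;
};

#define PMSCR_EL1_E1SPE		(1UL << 1)	/* EL1 sampling enable */
#define PMBLIMITR_EL1_E		(1UL << 0)	/* buffer enable */

/*
 * Approach #4's SPE fault handler: walk the guest's stage 1 tables for
 * the reported VA and pin the buffer at stage 2 (stubbed out here).
 */
static int spe_fault_handler(struct spe_vcpu *vcpu, unsigned long gva)
{
	(void)gva;	/* a real handler would translate gva to an IPA */
	vcpu->buffer_pinned = true;
	return 0;
}

/*
 * Approach #3 layered on top: a trapped write to PMSCR_EL1 that turns on
 * EL1 sampling while the buffer is enabled is treated as a synthetic SPE
 * stage 2 fault on the buffer's GVA, reusing the same handler.
 */
static int handle_pmscr_el1_write(struct spe_vcpu *vcpu, unsigned long val)
{
	bool was_on = vcpu->pmscr_el1 & PMSCR_EL1_E1SPE;

	vcpu->pmscr_el1 = val;
	if (!was_on && (val & PMSCR_EL1_E1SPE) &&
	    (vcpu->pmblimitr_el1 & PMBLIMITR_EL1_E))
		return spe_fault_handler(vcpu, vcpu->pmbptr_el1);
	return 0;
}
```

The point of the layering is that eager pinning adds no new machinery: the trapped register write simply re-enters the exact path a real SPE-reported stage 2 fault would take.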
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-08-10 15:25 ` Oliver Upton @ 2022-08-12 13:05 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-08-12 13:05 UTC (permalink / raw) To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi Oliver, Just a note, for some reason some of your emails, but not all, don't show up in my email client (mutt). That's why it might take me a while to send a reply (noticed that you replied by looking for this thread on lore.kernel.org). On Wed, Aug 10, 2022 at 10:25:56AM -0500, Oliver Upton wrote: > On Wed, Aug 10, 2022 at 10:37:26AM +0100, Alexandru Elisei wrote: > > Hi, > > > > On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote: > > > Hi Alex, > > > > > > On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote: > > > > > > [...] > > > > > > > > > To summarize the approaches we've discussed so far: > > > > > > > > > > > > 1. Pinning the entire guest memory > > > > > > - Heavy handed and not ideal. > > > > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12. > > > > > > > > > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2 > > > > > > faults reported by SPE. > > > > > > - Not feasible, because the entire contents of the buffer must be discarded if > > > > > > PMBSR_EL1.DL is set to 1 when taking the fault. > > > > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA, > > > > > > not the IPA. > > > > > > > > > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > > > > > > - There is the corner case described above, when profiling becomes enabled as a > > > > > > result of an ERET to EL0.
This can happen when the buffer is enabled and > > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > > > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > > > > > > stage 2 faults when draining the buffer, which is performed with profiling > > > > > > disabled. > > > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by > > > > > > SPE. > > > > > > - Gets rid of the corner case at 3. > > > > > > - Same approach to buffer unpinning as 3. > > > > > > - Introduces a blackout window before the first record is written. > > > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > > > > > As for the corner case at 3, I proposed either: > > > > > > > > > > > > a) Mandate that guest operating systems must never modify the buffer > > > > > > translation entries if the buffer is enabled and > > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}. > > > > > > > > > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE, > > > > > > but **only** for this corner case. For all other cases, the buffer is pinned > > > > > > when profiling becomes enabled, to eliminate the blackout window. Guest > > > > > > operating systems can be modified to not change the translation entries for the > > > > > > buffer if this blackout window is not desirable. > > > > > > > > > > > > Pinning as a result of the **first** stage 2 fault should work, because there > > > > > > are no prior records that would have to be discarded if PMBSR_EL1.DL = 1. > > > > > > > > > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome. > > > > > > > > > > Thanks Alex for pulling together all of the context here. > > > > > > > > > > Unless there's any other strong opinions on the topic, it seems to me > > > > > that option #4 (pin on S2 fault) is probably the best approach for > > > > > the initial implementation.
No amount of tricks in KVM can work around > > > > > the fact that SPE has some serious issues w.r.t. virtualization. With > > > > > that, we should probably document the behavior of SPE as a known erratum > > > > > of KVM. > > > > > > > > > > If folks complain about EL1 profile blackout, eagerly pinning when > > > > > profiling is enabled could layer on top quite easily by treating it as > > > > > a synthetic S2 fault and triggering the implementation of #4. Having > > > > > > > > I'm not sure I follow, I understand what you mean by "treating it as a > > > > synthetic S2 fault", would you mind elaborating? > > > > > > Assuming approach #4 is implemented, we will already have an SPE fault > > > handler that walks stage-1 and pins the buffer. At that point, > > > implementing approach #3 would be relatively easy. When EL1 sets > > > PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer. > > > > I see, that makes sense, thanks, > > > > > > > > > > said that I don't believe it is a hard requirement for enabling some > > > > > flavor of SPE for guests. > > > > > > > > > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to > > > > > be done eventually. > > > > > > > > > > Do you feel like this is an OK route forward, or have I missed > > > > > something? > > > > > > > > I've been giving this some thought, and I prefer approach #3 because with > > > > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it > > > > will be impossible to distinguish between a valid stage 2 fault (a fault > > > > caused by the guest reprogramming the buffer and enabling profiling) and > > > > KVM messing something up when pinning the buffer. I believe this to be > > > > important, as experience has shown me that pinning the buffer at stage 2 is > > > > not trivial and there isn't a mechanism today in Linux to do that > > > > (explanation and examples here [1]). > > > > > > How does eagerly pinning avoid stage-2 aborts, though? 
As you note in > > > [1], page pinning does not avoid the possibility of the MMU notifiers > > > being called on a given range. Want to make sure I'm following, what > > > is your suggestion for approach #3 to handle the profile buffer when > > > only enabled at EL0? > > > > > > > With approach #4, it would be impossible to figure out if the results of a > > > > profiling operation inside a guest are representative of the workload or > > > > not, because those SPE stage 2 faults triggered by a bug in KVM can happen > > > > multiple times per profiling session, introducing multiple blackout windows > > > > that can skew the results. > > > > > > > > If you're proposing that the blackout window when the first record is > > > > written be documented as an erratum for KVM, then why not go a step further > > > > and document as an erratum that changing the buffer translation tables > > > > after the buffer has been enabled will lead to an SPE SError? That will > > > > allow us to always pin the buffer when profiling is enabled. > > > > > > Ah, there are certainly more errata in virtualizing SPE beyond what I > > > had said :) Preserving the stage-1 translations while profiling is > > > active is a good recommendation, although I'm not sure that we've > > > completely eliminated the risk of stage-2 faults. > > > > > > It seems impossible to blame the guest for all stage-2 faults that happen > > > in the middle of a profiling session. In addition to host mm driven changes > > > to stage-2, live migration is busted as well. You'd need to build out > > > stage-2 on the target before resuming the guest and guarantee that the > > > appropriate pages have been demanded from the source (in case of post-copy). > > > > > > So, are we going to inject an SError for stage-2 faults outside of guest > > > control as well?
An external abort reported as an SPE buffer management > > > event seems to be gracefully handled by the Linux driver, but that behavior > > > is disallowed by SPEv1p3. > > > > > > To sum up the point I'm getting at: I agree that there are ways to > > > reduce the risk of stage-2 faults in the middle of profiling, but I > > > don't believe the current architecture allows KVM to virtualize the > > > feature to the letter of the specification. > > > > I believe there's some confusion here: emulating SPE **does not work** if > > stage 2 faults are triggered in the middle of a profiling session. Being > > able to have a memory range never unmapped from stage 2 is a > > **prerequisite** and is **required** for SPE emulation, it's not a nice to > > have. > > > > A stage 2 fault before the first record is written is acceptable because > > there are no other records already written which need to be thrown away. > > Stage 2 faults after at least one record has been written are unacceptable > > because it means that the contents of the buffer need to be thrown away. > > > > Does that make sense to you? > > > > I believe it is doable to have addresses always mapped at stage 2 with some > > changes to KVM, but that's not what this thread is about. This thread is > > about how and when to pin the buffer. > > Sorry if I've been forcing a tangent, but I believe there is a lot of > value in discussing what is to be done for keeping the stage-2 mapping > alive. I've been whining about it out of the very concern you highlight: > a stage-2 fault in the middle of the profile is game over. Otherwise, > optimizations in *when* we pin the buffer seem meaningless as stage-2 > faults appear unavoidable. The idea I had was to propagate the mmu_notifier_range->event field to the arch code. Then keep track of the IPAs which KVM pinned with pin_user_page(s) that translate the guest buffer, and don't unmap that IPA from stage 2 if the event != MMU_NOTIFY_UNMAP.
For a pinned page, all notifier events except MMU_NOTIFY_UNMAP are caused by the mm subsystem trying to change how that particular page is mapped. > > Nonetheless, back to your proposal. Injecting some context from earlier: > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > > So we are only doing this when enabled for EL1, right? > (PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}) Yes, pin when PMBLIMITR_EL1.E = 1 and PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}. Accesses to those registers can be trapped by KVM, and verifying the condition becomes trivial. > > > - There is the corner case described above, when profiling becomes enabled as a > > result of an ERET to EL0. This can happen when the buffer is enabled and > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > Is your proposal for the EL0 case to pin on fault or pin when E0SPE is set > (outside of the architecture's definition of when profiling is enabled)? The original proposal was to pin on the first fault in this case, yes. That's because the architecture doesn't forbid changing the translation entries for the buffer when PMBLIMITR_EL1.E = 1 and sampling is disabled (PMSCR_EL1.{E0SPE, E1SPE} = {x, 0}). But you mentioned adding a quirk/erratum to KVM in your proposal, and I was thinking that we could add an erratum to avoid the case above by saying that that behaviour is unpredictable. But that might restrict what operating systems KVM can run in an SPE-enabled VM; I can do some digging to find out how other operating systems use SPE, if you think adding the quirk sounds reasonable. > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > > stage 2 faults when draining the buffer, which is performed with profiling > > disabled. > > Sounds reasonable.
> > > As long as we're all agreed that buffer memory needs "pinning" (as in the > > IPAs are never unmapped from stage 2 until KVM decides otherwise as part of > > SPE emulation), I believe that live migration is tangential to figuring out > > how and when the buffer should be "pinned". I'm more than happy to start a > > separate thread about live migration after we figure out how we should go > > about "pinning" the buffer, I think your insight would be most helpful :) > > Fair enough, let's see how this all shakes out and then figure out LM > thereafter :) Great, thanks! Alex > > -- > Thanks, > Oliver
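The mmu_notifier idea Alex describes (propagate mmu_notifier_range->event to the arch code, and for IPAs pinned on behalf of the guest SPE buffer honour only a genuine MMU_NOTIFY_UNMAP) might look roughly like the following stand-alone sketch. The enum values and helpers are simplified stand-ins for the kernel types, not the real KVM/mm interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's enum mmu_notifier_event. */
enum mmu_notifier_event {
	MMU_NOTIFY_UNMAP,
	MMU_NOTIFY_CLEAR,
	MMU_NOTIFY_PROTECTION_PAGE,
};

/* An IPA range pinned (e.g. via pin_user_pages()) for the guest SPE buffer. */
struct pinned_range {
	unsigned long ipa_start;	/* inclusive */
	unsigned long ipa_end;		/* exclusive */
};

/* Does [ipa, ipa + size) overlap any pinned SPE buffer range? */
static bool ipa_range_pinned(const struct pinned_range *pins, size_t nr,
			     unsigned long ipa, unsigned long size)
{
	for (size_t i = 0; i < nr; i++)
		if (ipa < pins[i].ipa_end && ipa + size > pins[i].ipa_start)
			return true;
	return false;
}

/*
 * Core of the proposal: a notifier-driven stage 2 unmap proceeds as usual
 * for unpinned memory, but for a pinned buffer range only a real unmap
 * event is honoured; other events (CoW breaking, protection changes,
 * migration attempts) are skipped so the stage 2 mapping stays alive
 * while profiling runs.
 */
static bool stage2_may_unmap(const struct pinned_range *pins, size_t nr,
			     unsigned long ipa, unsigned long size,
			     enum mmu_notifier_event event)
{
	if (!ipa_range_pinned(pins, nr, ipa, size))
		return true;
	return event == MMU_NOTIFY_UNMAP;
}
```

Under this filtering, host mm activity leaves the pinned stage 2 mapping untouched, while a genuine unmap (for example the VMM tearing down the memslot) is still honoured, matching the observation that all non-UNMAP events on a pinned page are the mm subsystem trying to change how the page is mapped.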
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory @ 2022-08-12 13:05 ` Alexandru Elisei 0 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-08-12 13:05 UTC (permalink / raw) To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi Oliver, Just a note, for some reason some of your emails, but not all, don't show up in my email client (mutt). That's why it might take me a while to send a reply (noticed that you replied by looking for this thread on lore.kernel.org). On Wed, Aug 10, 2022 at 10:25:56AM -0500, Oliver Upton wrote: > On Wed, Aug 10, 2022 at 10:37:26AM +0100, Alexandru Elisei wrote: > > Hi, > > > > On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote: > > > Hi Alex, > > > > > > On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote: > > > > > > [...] > > > > > > > > > To summarize the approaches we've discussed so far: > > > > > > > > > > > > 1. Pinning the entire guest memory > > > > > > - Heavy handed and not ideal. > > > > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12. > > > > > > > > > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2 > > > > > > faults reported by SPE. > > > > > > - Not feasible, because the entire contents of the buffer must be discarded is > > > > > > PMBSR_EL1.DL is set to 1 when taking the fault. > > > > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA, > > > > > > not the IPA. > > > > > > > > > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > > > > > > - There is the corner case described above, when profiling becomes enabled as a > > > > > > result of an ERET to EL0. 
This can happen when the buffer is enabled and > > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > > > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > > > > > > stage 2 faults when draining the buffer, which is performed with profiling > > > > > > disabled. > > > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by > > > > > > SPE. > > > > > > - Gets rid of the corner case at 3. > > > > > > - Same approach to buffer unpinning as 3. > > > > > > - Introduces a blackout window before the first record is written. > > > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > > > > > As for the corner case at 3, I proposed either: > > > > > > > > > > > > a) Mandate that guest operating systems must never modify the buffer > > > > > > translation entries if the buffer is enabled and > > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}. > > > > > > > > > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE, > > > > > > but **only** for this corner case. For all other cases, the buffer is pinned > > > > > > when profiling becomes enabled, to eliminate the blackout window. Guest > > > > > > operating systems can be modified to not change the translation entries for the > > > > > > buffer if this blackout window is not desirable. > > > > > > > > > > > > Pinning as a result of the **first** stage 2 fault should work, because there > > > > > > are no prior records that would have to be discarded if PMSBR_EL1.DL = 1. > > > > > > > > > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome. > > > > > > > > > > Thanks Alex for pulling together all of the context here. > > > > > > > > > > Unless there's any other strong opinions on the topic, it seems to me > > > > > that option #4 (pin on S2 fault) is probably the best approach for > > > > > the initial implementation. 
No amount of tricks in KVM can work around > > > > > the fact that SPE has some serious issues w.r.t. virtualization. With > > > > > that, we should probably document the behavior of SPE as a known erratum > > > > > of KVM. > > > > > > > > > > If folks complain about EL1 profile blackout, eagerly pinning when > > > > > profiling is enabled could layer on top quite easily by treating it as > > > > > a synthetic S2 fault and triggering the implementation of #4. Having > > > > > > > > I'm not sure I follow, I understand what you mean by "treating it as a > > > > synthetic S2 fault", would you mind elaborating? > > > > > > Assuming approach #4 is implemented, we will already have an SPE fault > > > handler that walks stage-1 and pins the buffer. At that point, > > > implementing approach #3 would be relatively easy. When EL1 sets > > > PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer. > > > > I see, that makes sense, thanks, > > > > > > > > > > said that I don't believe it is a hard requirement for enabling some > > > > > flavor of SPE for guests. > > > > > > > > > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to > > > > > be done eventually. > > > > > > > > > > Do you feel like this is an OK route forward, or have I missed > > > > > something? > > > > > > > > I've been giving this some thought, and I prefer approach #3 because with > > > > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it > > > > will be impossible to distinguish between a valid stage 2 fault (a fault > > > > caused by the guest reprogramming the buffer and enabling profiling) and > > > > KVM messing something up when pinning the buffer. I believe this to be > > > > important, as experience has shown me that pinning the buffer at stage 2 is > > > > not trivial and there isn't a mechanism today in Linux to do that > > > > (explanation and examples here [1]). > > > > > > How does eagerly pinning avoid stage-2 aborts, though? 
As you note in > > > [1], page pinning does not avoid the possibility of the MMU notifiers > > > being called on a given range. Want to make sure I'm following, what > > > is your suggestion for approach #3 to handle the profile buffer when > > > only enabled at EL0? > > > > > > > With approach #4, it would be impossible to figure out if the results of a > > > > profiling operation inside a guest are representative of the workload or > > > > not, because those SPE stage 2 faults triggered by a bug in KVM can happen > > > > multiple times per profiling session, introducing multiple blackout windows > > > > that can skew the results. > > > > > > > > If you're proposing that the blackout window when the first record is > > > > written be documented as an erratum for KVM, then why not go a step further > > > > and document as an erratum that changing the buffer translation tables > > > > after the buffer has been enabled will lead to an SPE SError? That will > > > > allow us to always pin the buffer when profiling is enabled. > > > > > > Ah, there are certainly more errata in virtualizing SPE beyond what I > > > had said :) Preserving the stage-1 translations while profiling is > > > active is a good recommendation, although I'm not sure that we've > > > completely eliminated the risk of stage-2 faults. > > > > > > It seems impossible to blame the guest for all stage-2 faults that happen > > > in the middle of a profiling session. In addition to host mm driven changes > > > to stage-2, live migration is busted as well. You'd need to build out > > > stage-2 on the target before resuming the guest and guarantee that the > > > appropriate pages have been demanded from the source (in case of post-copy). > > > > > > So, are we going to inject an SError for stage-2 faults outside of guest > > > control as well? 
An external abort reported as an SPE buffer management > > > event seems to be gracefully handled by the Linux driver, but that behavior > > > is disallowed by SPEv1p3. > > > > > > To sum up the point I'm getting at: I agree that there are ways to > > > reduce the risk of stage-2 faults in the middle of profiling, but I > > > don't believe the current architecture allows KVM to virtualize the > > > feature to the letter of the specification. > > > > I believe there's some confusion here: emulating SPE **does not work** if > > stage 2 faults are triggered in the middle of a profiling session. Being > > able to have a memory range never unmapped from stage 2 is a > > **prerequisite** and is **required** for SPE emulation, it's not a nice-to-have. > > > > A stage 2 fault before the first record is written is acceptable because > > there are no other records already written which need to be thrown away. > > Stage 2 faults after at least one record has been written are unacceptable > > because it means that the contents of the buffer need to be thrown away. > > > > Does that make sense to you? > > > > I believe it is doable to have addresses always mapped at stage 2 with some > > changes to KVM, but that's not what this thread is about. This thread is > > about how and when to pin the buffer. > > Sorry if I've been forcing a tangent, but I believe there is a lot of > value in discussing what is to be done for keeping the stage-2 mapping > alive. I've been whining about it out of the very concern you highlight: > a stage-2 fault in the middle of the profile is game over. Otherwise, > optimizations in *when* we pin the buffer seem meaningless as stage-2 > faults appear unavoidable. The idea I had was to propagate the mmu_notifier_range->event field to the arch code. Then keep track of the IPAs which KVM pinned with pin_user_page(s) that translate the guest buffer, and don't unmap that IPA from stage 2 if the event != MMU_NOTIFY_UNMAP. 
For a pinned page, all notifier events except MMU_NOTIFY_UNMAP are caused by the mm subsystem trying to change how that particular page is mapped. > > Nonetheless, back to your proposal. Injecting some context from earlier: > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > > So we are only doing this when enabled for EL1, right? > (PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}) Yes, pin when PMBLIMITR_EL1.E = 1 and PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}. Accesses to those registers can be trapped by KVM, and verifying the condition becomes trivial. > > > - There is the corner case described above, when profiling becomes enabled as a > > result of an ERET to EL0. This can happen when the buffer is enabled and > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > Is your proposal for the EL0 case to pin on fault or pin when E0SPE is set > (outside of the architecture's definition of when profiling is enabled)? The original proposal was to pin on the first fault in this case, yes. That's because the architecture doesn't forbid changing the translation entries for the buffer when PMBLIMITR_EL1.E = 1 and sampling is disabled (PMSCR_EL1.{E0SPE,E1SPE} = {x, 0}). But you mentioned adding a quirk/erratum to KVM in your proposal, and I was thinking that we could add an erratum to avoid the case above by saying that that behaviour is unpredictable. But that might restrict what operating systems KVM can run in an SPE-enabled VM. I can do some digging to find out how other operating systems use SPE, if you think adding the quirk sounds reasonable. > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > > stage 2 faults when draining the buffer, which is performed with profiling > > disabled. > > Sounds reasonable. 
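The trap-time condition agreed on above (pin when PMBLIMITR_EL1.E = 1 and PMSCR_EL1.E1SPE = 1, defer the EL0-only case to the first SPE-reported fault) boils down to a single predicate. A minimal self-contained sketch; the bit positions follow the Arm ARM, but the helper name and macros are illustrative, not actual KVM code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Field positions per the Arm ARM; macro names here are illustrative. */
#define PMBLIMITR_EL1_E	(UINT64_C(1) << 0)	/* profiling buffer enabled */
#define PMSCR_EL1_E0SPE	(UINT64_C(1) << 0)	/* sampling enabled at EL0 */
#define PMSCR_EL1_E1SPE	(UINT64_C(1) << 1)	/* sampling enabled at EL1 */

/*
 * Pin eagerly only when the buffer is enabled and sampling is enabled at
 * EL1. The EL0-only case (E0SPE=1, E1SPE=0) is deliberately left to the
 * first SPE-reported stage 2 fault, as discussed in the thread.
 */
bool spe_should_pin_on_trap(uint64_t pmblimitr, uint64_t pmscr)
{
	return (pmblimitr & PMBLIMITR_EL1_E) && (pmscr & PMSCR_EL1_E1SPE);
}
```

Since KVM traps writes to both registers, the check can run on every trapped write and trigger the stage-1 walk and pinning when it first becomes true.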
> > > As long as we're all agreed that buffer memory needs "pinning" (as in the > > IPA are never unmapped from stage 2 until KVM decides otherwise as part of > > SPE emulation), I believe that live migration is tangential to figuring out > > how and when the buffer should be "pinned". I'm more than happy to start a > > separate thread about live migration after we figure out how we should go > > about "pinning" the buffer, I think your insight would be most helpful :) > > Fair enough, let's see how this all shakes out and then figure out LM > thereafter :) Great, thanks! Alex > > -- > Thanks, > Oliver ^ permalink raw reply [flat|nested] 72+ messages in thread
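The mmu_notifier_range->event idea from the exchange above reduces to one decision in the stage 2 unmap path. A self-contained sketch under stated assumptions: the enum is a trimmed mirror of `enum mmu_notifier_event` from include/linux/mmu_notifier.h, and `stage2_keep_mapping()` is a hypothetical helper, not an existing KVM function:

```c
#include <stdbool.h>

/* Trimmed mirror of enum mmu_notifier_event; only ordering of the first
 * value matters for this sketch. */
enum mmu_notifier_event {
	MMU_NOTIFY_UNMAP = 0,
	MMU_NOTIFY_CLEAR,
	MMU_NOTIFY_PROTECTION_VMA,
	MMU_NOTIFY_PROTECTION_PAGE,
};

/*
 * Decision the proposal would add to KVM's stage 2 unmap path: if the
 * IPA backs a page pinned with pin_user_page(s) for the SPE buffer,
 * every notifier event other than MMU_NOTIFY_UNMAP only changes how the
 * host maps the page, so the stage 2 entry can be left in place.
 */
bool stage2_keep_mapping(bool ipa_is_pinned, enum mmu_notifier_event event)
{
	return ipa_is_pinned && event != MMU_NOTIFY_UNMAP;
}
```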
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-08-12 13:05 ` Alexandru Elisei @ 2022-08-17 15:05 ` Oliver Upton -1 siblings, 0 replies; 72+ messages in thread From: Oliver Upton @ 2022-08-17 15:05 UTC (permalink / raw) To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi Alex, On Fri, Aug 12, 2022 at 02:05:45PM +0100, Alexandru Elisei wrote: > Hi Oliver, > > Just a note, for some reason some of your emails, but not all, don't show up in > my email client (mutt). That's why it might take me a while to send a reply > (noticed that you replied by looking for this thread on lore.kernel.org). Urgh, that's weird. Am I getting thrown into spam or something? Also, do you know if you've been receiving Drew's email since he switched to @linux.dev? > On Wed, Aug 10, 2022 at 10:25:56AM -0500, Oliver Upton wrote: > > On Wed, Aug 10, 2022 at 10:37:26AM +0100, Alexandru Elisei wrote: > > > Hi, > > > > > > On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote: > > > > Hi Alex, > > > > > > > > On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote: > > > > > > > > [...] > > > > > > > > > > > To summarize the approaches we've discussed so far: > > > > > > > > > > > > > > 1. Pinning the entire guest memory > > > > > > > - Heavy handed and not ideal. > > > > > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12. > > > > > > > > > > > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2 > > > > > > > faults reported by SPE. > > > > > > > - Not feasible, because the entire contents of the buffer must be discarded is > > > > > > > PMBSR_EL1.DL is set to 1 when taking the fault. > > > > > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA, > > > > > > > not the IPA. > > > > > > > > > > > > > > 3. 
Pinning the guest SPE buffer when profiling becomes enabled*: > > > > > > > - There is the corner case described above, when profiling becomes enabled as a > > > > > > > result of an ERET to EL0. This can happen when the buffer is enabled and > > > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > > > > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > > > > > > > stage 2 faults when draining the buffer, which is performed with profiling > > > > > > > disabled. > > > > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > > > > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by > > > > > > > SPE. > > > > > > > - Gets rid of the corner case at 3. > > > > > > > - Same approach to buffer unpinning as 3. > > > > > > > - Introduces a blackout window before the first record is written. > > > > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > > > > > > > As for the corner case at 3, I proposed either: > > > > > > > > > > > > > > a) Mandate that guest operating systems must never modify the buffer > > > > > > > translation entries if the buffer is enabled and > > > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}. > > > > > > > > > > > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE, > > > > > > > but **only** for this corner case. For all other cases, the buffer is pinned > > > > > > > when profiling becomes enabled, to eliminate the blackout window. Guest > > > > > > > operating systems can be modified to not change the translation entries for the > > > > > > > buffer if this blackout window is not desirable. > > > > > > > > > > > > > > Pinning as a result of the **first** stage 2 fault should work, because there > > > > > > > are no prior records that would have to be discarded if PMSBR_EL1.DL = 1. > > > > > > > > > > > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome. 
> > > > > > > > > > > > Thanks Alex for pulling together all of the context here. > > > > > > > > > > > > Unless there's any other strong opinions on the topic, it seems to me > > > > > > that option #4 (pin on S2 fault) is probably the best approach for > > > > > > the initial implementation. No amount of tricks in KVM can work around > > > > > > the fact that SPE has some serious issues w.r.t. virtualization. With > > > > > > that, we should probably document the behavior of SPE as a known erratum > > > > > > of KVM. > > > > > > > > > > > > If folks complain about EL1 profile blackout, eagerly pinning when > > > > > > profiling is enabled could layer on top quite easily by treating it as > > > > > > a synthetic S2 fault and triggering the implementation of #4. Having > > > > > > > > > > I'm not sure I follow, I understand what you mean by "treating it as a > > > > > synthetic S2 fault", would you mind elaborating? > > > > > > > > Assuming approach #4 is implemented, we will already have an SPE fault > > > > handler that walks stage-1 and pins the buffer. At that point, > > > > implementing approach #3 would be relatively easy. When EL1 sets > > > > PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer. > > > > > > I see, that makes sense, thanks, > > > > > > > > > > > > > said that I don't believe it is a hard requirement for enabling some > > > > > > flavor of SPE for guests. > > > > > > > > > > > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to > > > > > > be done eventually. > > > > > > > > > > > > Do you feel like this is an OK route forward, or have I missed > > > > > > something? 
> > > > > > > > > > I've been giving this some thought, and I prefer approach #3 because with > > > > > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it > > > > > will be impossible to distinguish between a valid stage 2 fault (a fault > > > > > caused by the guest reprogramming the buffer and enabling profiling) and > > > > > KVM messing something up when pinning the buffer. I believe this to be > > > > > important, as experience has shown me that pinning the buffer at stage 2 is > > > > > not trivial and there isn't a mechanism today in Linux to do that > > > > > (explanation and examples here [1]). > > > > > > > > How does eagerly pinning avoid stage-2 aborts, though? As you note in > > > > [1], page pinning does not avoid the possibility of the MMU notifiers > > > > being called on a given range. Want to make sure I'm following, what > > > > is your suggestion for approach #3 to handle the profile buffer when > > > > only enabled at EL0? > > > > > > > > > With approach #4, it would be impossible to figure out if the results of a > > > > > profiling operations inside a guest are representative of the workload or > > > > > not, because those SPE stage 2 faults triggered by a bug in KVM can happen > > > > > multiple times per profiling session, introducing multiple blackout windows > > > > > that can skew the results. > > > > > > > > > > If you're proposing that the blackout window when the first record is > > > > > written be documented as an erratum for KVM, then why no got a step further > > > > > and document as an erratum that changing the buffer translation tables > > > > > after the buffer has been enabled will lead to an SPE Serror? That will > > > > > allow us to always pin the buffer when profiling is enabled. 
> > > > > > > > Ah, there are certainly more errata in virtualizing SPE beyond what I > > > > had said :) Preserving the stage-1 translations while profiling is > > > > active is a good recommendation, although I'm not sure that we've > > > > completely eliminated the risk of stage-2 faults. > > > > > > > > It seems impossible to blame the guest for all stage-2 faults that happen > > > > in the middle of a profiling session. In addition to host mm driven changes > > > > to stage-2, live migration is a busted as well. You'd need to build out > > > > stage-2 on the target before resuming the guest and guarantee that the > > > > appropriate pages have been demanded from the source (in case of post-copy). > > > > > > > > So, are we going to inject an SError for stage-2 faults outside of guest > > > > control as well? An external abort reported as an SPE buffer management > > > > event seems to be gracefully handled by the Linux driver, but that behavior > > > > is disallowed by SPEv1p3. > > > > > > > > To sum up the point I'm getting at: I agree that there are ways to > > > > reduce the risk of stage-2 faults in the middle of profiling, but I > > > > don't believe the current architecture allows KVM to virtualize the > > > > feature to the letter of the specification. > > > > > > I believe there's some confusion here: emulating SPE **does not work** if > > > stage 2 faults are triggered in the middle of a profiling session. Being > > > able to have a memory range never unmapped from stage 2 is a > > > **prerequisite** and is **required** for SPE emulation, it's not a nice to > > > have. > > > > > > A stage 2 fault before the first record is written is acceptable because > > > there are no other records already written which need to be thrown away. > > > Stage 2 faults after at least one record has been written are unacceptable > > > because it means that the contents of the buffer needs to thrown away. > > > > > > Does that make sense to you? 
> > > > > > I believe it is doable to have addresses always mapped at stage 2 with some > > > changes to KVM, but that's not what this thread is about. This thread is > > > about how and when to pin the buffer. > > > > Sorry if I've been forcing a tangent, but I believe there is a lot of > > value in discussing what is to be done for keeping the stage-2 mapping > > alive. I've been whining about it out of the very concern you highlight: > > a stage-2 fault in the middle of the profile is game over. Otherwise, > > optimizations in *when* we pin the buffer seem meaningless as stage-2 > > faults appear unavoidable. > > The idea I had was to propagate the mmu_notifier_range->event field to the > arch code. Then keep track of the IPAs which KVM pinned with > pin_user_page(s) that translate the guest buffer, and don't unmap that IPA > from stage 2 if the event != MMU_NOTIFY_UNMAP. For a pinned page, all > notifier events except MMU_NOTIFY_UNMAP are caused by the mm subsystem > trying to change how that particular page is mapped. > > > > > Nonetheless, back to your proposal. Injecting some context from earlier: > > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > > > > So we are only doing this when enabled for EL1, right? > > (PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}) > > Yes, pin when PMBLIMITR_EL1.E = 1 and PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}. > Accesses to those registers can be trapped by KVM, and to verify the > condition becomes trivial. > > > > > > - There is the corner case described above, when profiling becomes enabled as a > > > result of an ERET to EL0. This can happen when the buffer is enabled and > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > > > Is your proposal for the EL0 case to pin on fault or pin when E0SPE is set > > (outside of the architectures definition of when profiling is enabled)? > > The original proposal was to pin on the first fault in this case, yes. 
> That's because the architecture doesn't forbid changing the translation > entries for the buffer when PMBLIMITR_EL1.E = 1 and sampling is disabled > (PMSCR_EL1.{E0SPE,E1SPE} = {x, 0}). > > But you mentioned adding a quirk/erratum to KVM in your proposal, and I was > thinking that we could add an erratum to avoid the case above by saying > that that behaviour is unpredictable. But that might restrict what > operating systems KVM can run in an SPE-enabled VM. I can do some digging > to find out how other operating systems use SPE, if you think adding the > quirk sounds reasonable. Yeah, that would be good to follow up on what other OSes are doing. You'll still have a nondestructive S2 fault handler for the SPE, right? IOW, if PMBSR_EL1.DL=0 KVM will just unpin the old buffer and repin the new one. -- Thanks, Oliver ^ permalink raw reply [flat|nested] 72+ messages in thread
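The nondestructive handling Oliver asks about hinges on PMBSR_EL1.DL at fault time. A sketch of that decision, assuming illustrative names (the DL bit position follows the Arm ARM, the enum and helper are hypothetical):

```c
#include <stdint.h>

/* PMBSR_EL1.DL (Data Lost) is bit 19 per the Arm ARM. */
#define PMBSR_EL1_DL	(UINT64_C(1) << 19)

enum spe_fault_action {
	SPE_REPIN_BUFFER,	/* DL=0: no records lost, swap pins quietly */
	SPE_DISCARD_SESSION,	/* DL=1: records written so far are unusable */
};

/*
 * On an SPE-reported stage 2 fault: with DL clear, no records were lost,
 * so KVM can unpin the old buffer and pin the newly faulting one without
 * disturbing the guest's profile. With DL set, the buffer contents must
 * be thrown away, which is exactly the case the thread tries to avoid.
 */
enum spe_fault_action spe_handle_s2_fault(uint64_t pmbsr)
{
	return (pmbsr & PMBSR_EL1_DL) ? SPE_DISCARD_SESSION : SPE_REPIN_BUFFER;
}
```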
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-08-17 15:05 ` Oliver Upton @ 2022-09-12 14:50 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-09-12 14:50 UTC (permalink / raw) To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi Oliver, On Wed, Aug 17, 2022 at 10:05:51AM -0500, Oliver Upton wrote: > Hi Alex, > > On Fri, Aug 12, 2022 at 02:05:45PM +0100, Alexandru Elisei wrote: > > Hi Oliver, > > > > Just a note, for some reason some of your emails, but not all, don't show up in > > my email client (mutt). That's why it might take me a while to send a reply > > (noticed that you replied by looking for this thread on lore.kernel.org). > > Urgh, that's weird. Am I getting thrown into spam or something? Also, do > you know if you've been receiving Drew's email since he switched to > @linux.dev? As far as I can tell, I am able to receive emails from Drew's new email address. I think it's because some of the macros I've been using in mutt seem to interact in a weird way with imap_keepalive. I disabled imap_keepalive and everything looks to have been sorted out. > > > On Wed, Aug 10, 2022 at 10:25:56AM -0500, Oliver Upton wrote: > > > On Wed, Aug 10, 2022 at 10:37:26AM +0100, Alexandru Elisei wrote: > > > > Hi, > > > > > > > > On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote: > > > > > Hi Alex, > > > > > > > > > > On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote: > > > > > > > > > > [...] > > > > > > > > > > > > > To summarize the approaches we've discussed so far: > > > > > > > > > > > > > > > > 1. Pinning the entire guest memory > > > > > > > > - Heavy handed and not ideal. > > > > > > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12. > > > > > > > > > > > > > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2 > > > > > > > > faults reported by SPE. 
> > > > > > > > - Not feasible, because the entire contents of the buffer must be discarded if > > > > > > > > PMBSR_EL1.DL is set to 1 when taking the fault. > > > > > > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA, > > > > > > > > not the IPA. > > > > > > > > > > > > > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > > > > > > > > - There is the corner case described above, when profiling becomes enabled as a > > > > > > > > result of an ERET to EL0. This can happen when the buffer is enabled and > > > > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > > > > > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > > > > > > > > stage 2 faults when draining the buffer, which is performed with profiling > > > > > > > > disabled. > > > > > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > > > > > > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by > > > > > > > > SPE. > > > > > > > > - Gets rid of the corner case at 3. > > > > > > > > - Same approach to buffer unpinning as 3. > > > > > > > > - Introduces a blackout window before the first record is written. > > > > > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > > > > > > > > > As for the corner case at 3, I proposed either: > > > > > > > > > > > > > > > > a) Mandate that guest operating systems must never modify the buffer > > > > > > > > translation entries if the buffer is enabled and > > > > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}. > > > > > > > > > > > > > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE, > > > > > > > > but **only** for this corner case. For all other cases, the buffer is pinned > > > > > > > > when profiling becomes enabled, to eliminate the blackout window. 
Guest > > > > > > > > operating systems can be modified to not change the translation entries for the > > > > > > > > buffer if this blackout window is not desirable. > > > > > > > > > > > > > > > > Pinning as a result of the **first** stage 2 fault should work, because there > > > > > > > > are no prior records that would have to be discarded if PMBSR_EL1.DL = 1. > > > > > > > > > > > > > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome. > > > > > > > > > > > > > > Thanks Alex for pulling together all of the context here. > > > > > > > > > > > > > > Unless there are any other strong opinions on the topic, it seems to me > > > > > > > that option #4 (pin on S2 fault) is probably the best approach for > > > > > > > the initial implementation. No amount of tricks in KVM can work around > > > > > > > the fact that SPE has some serious issues w.r.t. virtualization. With > > > > > > > that, we should probably document the behavior of SPE as a known erratum > > > > > > > of KVM. > > > > > > > > > > > > > > If folks complain about EL1 profile blackout, eagerly pinning when > > > > > > > profiling is enabled could layer on top quite easily by treating it as > > > > > > > a synthetic S2 fault and triggering the implementation of #4. Having > > > > > > > > > > > > I'm not sure I follow what you mean by "treating it as a > > > > > > synthetic S2 fault", would you mind elaborating? > > > > > > > > > > Assuming approach #4 is implemented, we will already have an SPE fault > > > > > handler that walks stage-1 and pins the buffer. At that point, > > > > > implementing approach #3 would be relatively easy. When EL1 sets > > > > > PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer. > > > > I see, that makes sense, thanks, > > > > > > > > > > > > > > > > said that I don't believe it is a hard requirement for enabling some > > > > > > > flavor of SPE for guests. 
> > > > > > > > > > > > > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to > > > > > > > be done eventually. > > > > > > > > > > > > > > Do you feel like this is an OK route forward, or have I missed > > > > > > > something? > > > > > > I've been giving this some thought, and I prefer approach #3 because with > > > > > > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it > > > > > > will be impossible to distinguish between a valid stage 2 fault (a fault > > > > > > caused by the guest reprogramming the buffer and enabling profiling) and > > > > > > KVM messing something up when pinning the buffer. I believe this to be > > > > > > important, as experience has shown me that pinning the buffer at stage 2 is > > > > > > not trivial and there isn't a mechanism today in Linux to do that > > > > > > (explanation and examples here [1]). > > > > > > > > > > How does eagerly pinning avoid stage-2 aborts, though? As you note in > > > > > [1], page pinning does not avoid the possibility of the MMU notifiers > > > > > being called on a given range. Want to make sure I'm following: what > > > > > is your suggestion for approach #3 to handle the profile buffer when > > > > > only enabled at EL0? > > > > > > > > > > > With approach #4, it would be impossible to figure out if the results of a > > > > > > profiling operation inside a guest are representative of the workload or > > > > > > not, because those SPE stage 2 faults triggered by a bug in KVM can happen > > > > > > multiple times per profiling session, introducing multiple blackout windows > > > > > > that can skew the results. > > > > > > > > > > > > If you're proposing that the blackout window when the first record is > > > > > > written be documented as an erratum for KVM, then why not go a step further > > > > > > and document as an erratum that changing the buffer translation tables > > > > > > after the buffer has been enabled will lead to an SPE SError? 
That will > > > > > > allow us to always pin the buffer when profiling is enabled. > > > > > > > > > > Ah, there are certainly more errata in virtualizing SPE beyond what I > > > > > had said :) Preserving the stage-1 translations while profiling is > > > > > active is a good recommendation, although I'm not sure that we've > > > > > completely eliminated the risk of stage-2 faults. > > > > > > > > > > It seems impossible to blame the guest for all stage-2 faults that happen > > > > > in the middle of a profiling session. In addition to host mm driven changes > > > > > to stage-2, live migration is busted as well. You'd need to build out > > > > > stage-2 on the target before resuming the guest and guarantee that the > > > > > appropriate pages have been demanded from the source (in case of post-copy). > > > > > > > > > > So, are we going to inject an SError for stage-2 faults outside of guest > > > > > control as well? An external abort reported as an SPE buffer management > > > > > event seems to be gracefully handled by the Linux driver, but that behavior > > > > > is disallowed by SPEv1p3. > > > > > > > > > > To sum up the point I'm getting at: I agree that there are ways to > > > > > reduce the risk of stage-2 faults in the middle of profiling, but I > > > > > don't believe the current architecture allows KVM to virtualize the > > > > > feature to the letter of the specification. > > > > > > > > I believe there's some confusion here: emulating SPE **does not work** if > > > > stage 2 faults are triggered in the middle of a profiling session. Being > > > > able to have a memory range never unmapped from stage 2 is a > > > > **prerequisite** and is **required** for SPE emulation, it's not a nice to > > > > have. > > > > > > > > A stage 2 fault before the first record is written is acceptable because > > > > there are no other records already written which need to be thrown away. 
> > > > Stage 2 faults after at least one record has been written are unacceptable > > > > because it means that the contents of the buffer needs to be thrown away. > > > > > > > > Does that make sense to you? > > > > > > > > I believe it is doable to have addresses always mapped at stage 2 with some > > > > changes to KVM, but that's not what this thread is about. This thread is > > > > about how and when to pin the buffer. > > > > > > Sorry if I've been forcing a tangent, but I believe there is a lot of > > > value in discussing what is to be done for keeping the stage-2 mapping > > > alive. I've been whining about it out of the very concern you highlight: > > > a stage-2 fault in the middle of the profile is game over. Otherwise, > > > optimizations in *when* we pin the buffer seem meaningless as stage-2 > > > faults appear unavoidable. > > > > The idea I had was to propagate the mmu_notifier_range->event field to the > > arch code. Then keep track of the IPAs which KVM pinned with > > pin_user_page(s) that translate the guest buffer, and don't unmap that IPA > > from stage 2 if the event != MMU_NOTIFY_UNMAP. For a pinned page, all > > notifier events except MMU_NOTIFY_UNMAP are caused by the mm subsystem > > trying to change how that particular page is mapped. > > > > > > > > Nonetheless, back to your proposal. Injecting some context from earlier: > > > > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > > > > > > So we are only doing this when enabled for EL1, right? > > > (PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}) > > > > Yes, pin when PMBLIMITR_EL1.E = 1 and PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}. > > Accesses to those registers can be trapped by KVM, and to verify the > > condition becomes trivial. > > > > > > > > > - There is the corner case described above, when profiling becomes enabled as a > > > > result of an ERET to EL0. 
This can happen when the buffer is enabled and > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > > > > > Is your proposal for the EL0 case to pin on fault or pin when E0SPE is set > > > (outside of the architecture's definition of when profiling is enabled)? > > > > The original proposal was to pin on the first fault in this case, yes. > > That's because the architecture doesn't forbid changing the translation > > entries for the buffer when PMBLIMITR_EL1.E = 1 and sampling is disabled > > (PMSCR_EL1.{E0SPE, E1SPE} = {x, 0}). > > > > But you mentioned adding a quirk/erratum to KVM in your proposal, and I was > > thinking that we could add an erratum to avoid the case above by saying > > that that behaviour is unpredictable. But that might restrict what > > operating systems KVM can run in an SPE-enabled VM. I can do some digging > > to find out how other operating systems use SPE, if you think adding the > > quirk sounds reasonable. > > Yeah, that would be good to follow up on what other OSes are doing. FreeBSD doesn't have an SPE driver. Currently in the process of finding out how/if Windows implements the driver. > You'll still have a nondestructive S2 fault handler for the SPE, right? > IOW, if PMBSR_EL1.DL=0 KVM will just unpin the old buffer and repin the > new one. This is how I think about it: an S2 DABT where DL == 0 can happen because of something that the VMM, KVM or the guest has done: 1. If it's because of something that the host's userspace did (memslot was changed while the VM was running, memory was munmap'ed, etc). In this case, there's no way for KVM to handle the SPE fault, so I would say that the sensible approach would be to inject an SPE external abort. 2. If it's because of something that KVM did, that can only be because of a bug in SPE emulation. In this case, it can happen again, which means arbitrary blackout windows which can skew the profiling results. 
I would much rather inject an SPE external abort than let the guest rely on potentially bad profiling information. 3. The guest changes the mapping for the buffer when it shouldn't have: A. when the architecture does allow it, but KVM doesn't support it, or B. when the architecture doesn't allow it. For both cases, I would much rather inject an SPE external abort for the reasons above. Furthermore, for B, I think it would be better to let the guest know as soon as possible that it's not following the architecture. In conclusion, I would prefer to treat all SPE S2 faults as errors. Thanks, Alex _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm ^ permalink raw reply [flat|nested] 72+ messages in thread
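The "pin when profiling is enabled" condition discussed above (PMBLIMITR_EL1.E = 1 and PMSCR_EL1.E1SPE = 1) reduces to checking two register bits in the trap handlers. A minimal stand-alone sketch, assuming the usual bit assignments from the Arm ARM (PMBLIMITR_EL1.E at bit [0], PMSCR_EL1.E0SPE/E1SPE at bits [0]/[1]); spe_should_pin_buffer() is a made-up name, not an existing KVM function:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, per the Arm ARM register descriptions:
 * PMBLIMITR_EL1.E is bit [0]; PMSCR_EL1.E0SPE/E1SPE are bits [0]/[1]. */
#define PMBLIMITR_EL1_E		(1UL << 0)
#define PMSCR_EL1_E0SPE		(1UL << 0)
#define PMSCR_EL1_E1SPE		(1UL << 1)

/*
 * Hypothetical trap-handler helper: after the guest writes PMBLIMITR_EL1
 * or PMSCR_EL1, decide whether KVM should (re)pin the buffer now. Per
 * the discussion above, pin eagerly only when sampling is enabled at
 * EL1; the EL0-only case (E0SPE=1, E1SPE=0) is deferred to the first
 * reported SPE stage 2 fault.
 */
static bool spe_should_pin_buffer(uint64_t pmblimitr, uint64_t pmscr)
{
	return (pmblimitr & PMBLIMITR_EL1_E) && (pmscr & PMSCR_EL1_E1SPE);
}
```

With both registers trapped, this check runs on every guest write, which is what makes the "verify the condition" step trivial.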
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-09-12 14:50 ` Alexandru Elisei @ 2022-09-13 10:58 ` Oliver Upton -1 siblings, 0 replies; 72+ messages in thread From: Oliver Upton @ 2022-09-13 10:58 UTC (permalink / raw) To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hey Alex, On Mon, Sep 12, 2022 at 03:50:46PM +0100, Alexandru Elisei wrote: [...] > > Yeah, that would be good to follow up on what other OSes are doing. > > FreeBSD doesn't have an SPE driver. > > Currently in the process of finding out how/if Windows implements the > driver. > > > You'll still have a nondestructive S2 fault handler for the SPE, right? > > IOW, if PMBSR_EL1.DL=0 KVM will just unpin the old buffer and repin the > > new one. > > This is how I think about it: an S2 DABT where DL == 0 can happen because of > something that the VMM, KVM or the guest has done: > > 1. If it's because of something that the host's userspace did (memslot was > changed while the VM was running, memory was munmap'ed, etc). In this case, > there's no way for KVM to handle the SPE fault, so I would say that the > sensible approach would be to inject an SPE external abort. > > 2. If it's because of something that KVM did, that can only be because of a > bug in SPE emulation. In this case, it can happen again, which means > arbitrary blackout windows which can skew the profiling results. I would > much rather inject an SPE external abort than let the guest rely on > potentially bad profiling information. > > 3. The guest changes the mapping for the buffer when it shouldn't have: A. > when the architecture does allow it, but KVM doesn't support it, or B. when > the architecture doesn't allow it. For both cases, I would much rather > inject an SPE external abort for the reasons above. Furthermore, for B, I > think it would be better to let the guest know as soon as possible that > it's not following the architecture. 
> > In conclusion, I would prefer to treat all SPE S2 faults as errors. My main concern with treating S2 faults as a synthetic external abort is how this behavior progresses in later versions of the architecture. SPEv1p3 disallows implementations from reporting external aborts via the SPU, instead allowing only for an SError to be delivered to the core. I caught up with Will on this for a little bit: Instead of an external abort, how about reporting an IMP DEF buffer management event to the guest? At least for the Linux driver it should have the same effect of killing the session but the VM will stay running. This way there's no architectural requirement to promote to an SError. -- Thanks, Oliver _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm ^ permalink raw reply [flat|nested] 72+ messages in thread
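Synthesizing the IMP DEF buffer management event suggested above would amount to writing an appropriate PMBSR_EL1 value before injecting the buffer management interrupt. A stand-alone sketch, assuming the PMBSR_EL1 field layout from the Arm ARM (EC in bits [31:26], with 0b011111 encoding the IMPLEMENTATION DEFINED buffer management event, and the S bit at [17]); the helper name is hypothetical:

```c
#include <stdint.h>

/* Assumed PMBSR_EL1 field layout, per the Arm ARM SPE chapter:
 * EC in bits [31:26], S (service, i.e. event pending) at bit [17];
 * EC == 0b011111 encodes an IMPLEMENTATION DEFINED buffer management
 * event. */
#define PMBSR_EL1_EC_SHIFT	26
#define PMBSR_EL1_EC_MASK	(0x3FUL << PMBSR_EL1_EC_SHIFT)
#define PMBSR_EL1_EC_IMP_DEF	(0x1FUL << PMBSR_EL1_EC_SHIFT)
#define PMBSR_EL1_S		(1UL << 17)

/*
 * Hypothetical helper: build the PMBSR_EL1 value KVM would expose to
 * the guest to report an IMP DEF buffer management event, keeping the
 * guest's other status bits intact. Setting S asserts the buffer
 * management interrupt; the Linux driver would terminate the profiling
 * session but the VM keeps running, which is the behaviour suggested
 * above.
 */
static uint64_t spe_make_imp_def_event(uint64_t pmbsr)
{
	pmbsr &= ~PMBSR_EL1_EC_MASK;	/* replace any previous event class */
	pmbsr |= PMBSR_EL1_EC_IMP_DEF;
	pmbsr |= PMBSR_EL1_S;		/* buffer management event pending */
	return pmbsr;
}
```

Because the event class is IMPLEMENTATION DEFINED rather than an external abort, nothing in later SPE revisions forces this report to be promoted to an SError.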
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-09-13 10:58 ` Oliver Upton @ 2022-09-13 12:41 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2022-09-13 12:41 UTC (permalink / raw) To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi Oliver, On Tue, Sep 13, 2022 at 11:58:47AM +0100, Oliver Upton wrote: > Hey Alex, > > On Mon, Sep 12, 2022 at 03:50:46PM +0100, Alexandru Elisei wrote: > > [...] > > > > Yeah, that would be good to follow up on what other OSes are doing. > > > > FreeBSD doesn't have a SPE driver. > > > > Currently in the process of finding out how/if Windows implements the > > driver. > > > > > You'll still have a nondestructive S2 fault handler for the SPE, right? > > > IOW, if PMBSR_EL1.DL=0 KVM will just unpin the old buffer and repin the > > > new one. > > > > This is how I think about it: a S2 DABT where DL == 0 can happen because of > > something that the VMM, KVM or the guest has done: > > > > 1. If it's because of something that the host's userspace did (memslot was > > changed while the VM was running, memory was munmap'ed, etc). In this case, > > there's no way for KVM to handle the SPE fault, so I would say that the > > sensible approach would be to inject an SPE external abort. > > > > 2. If it's because of something that KVM did, that can only be because of a > > bug in SPE emulation. In this case, it can happen again, which means > > arbitrary blackout windows which can skew the profiling results. I would > > much rather inject an SPE external abort then let the guest rely on > > potentially bad profiling information. > > > > 3. The guest changes the mapping for the buffer when it shouldn't have: A. > > when the architecture does allow it, but KVM doesn't support, or B. when > > the architecture doesn't allow it. For both cases, I would much rather > > inject an SPE external abort for the reasons above. 
Furthermore, for B, I > > think it would be better to let the guest know as soon as possible that > > it's not following the architecture. > > > > In conclusion, I would prefer to treat all SPE S2 faults as errors. > > My main concern with treating S2 faults as a synthetic external abort is > how this behavior progresses in later versions of the architecture. > SPEv1p3 disallows implementations from reporting external aborts via the > SPU, instead allowing only for an SError to be delivered to the core. Ah, yes, missed that bit for SPEv1p3 (ARM DDI 0487H.a, page D10-5180). > > I caught up with Will on this for a little bit: > > Instead of an external abort, how about reporting an IMP DEF buffer > management event to the guest? At least for the Linux driver it should > have the same effect of killing the session but the VM will stay > running. This way there's no architectural requirement to promote to an > SError. The only reason I proposed to inject an external abort is because KVM needs a way to tell the guest that something outside of the guest's control went wrong and it should drop the contents of the current profiling session. An external abort reported by the SPU seemed to fit the bill. By IMP DEF buffer management event I assume you mean PMBSR_EL1.EC=0b011111 (Buffer management event for an IMPLEMENTATION DEFINED reason). I'm thinking that someone might run a custom kernel in a VM, like a vendor downstream kernel, with patches that actually handle this exception class, and injecting such an exception might not have the effects that KVM expects. Am I overthinking things? Is that something that KVM should take into consideration? I suppose KVM can and should also set PMBSR_EL1.DL = 1, as that means per the architecture that the buffer contents should be discarded.
Thanks, Alex
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-09-13 12:41 ` Alexandru Elisei @ 2022-09-13 14:13 ` Oliver Upton -1 siblings, 0 replies; 72+ messages in thread From: Oliver Upton @ 2022-09-13 14:13 UTC (permalink / raw) To: Alexandru Elisei; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel On Tue, Sep 13, 2022 at 01:41:56PM +0100, Alexandru Elisei wrote: > Hi Oliver, > > On Tue, Sep 13, 2022 at 11:58:47AM +0100, Oliver Upton wrote: > > Hey Alex, > > > > On Mon, Sep 12, 2022 at 03:50:46PM +0100, Alexandru Elisei wrote: > > > > [...] > > > > > > Yeah, that would be good to follow up on what other OSes are doing. > > > > > > FreeBSD doesn't have a SPE driver. > > > > > > Currently in the process of finding out how/if Windows implements the > > > driver. > > > > > > > You'll still have a nondestructive S2 fault handler for the SPE, right? > > > > IOW, if PMBSR_EL1.DL=0 KVM will just unpin the old buffer and repin the > > > > new one. > > > > > > This is how I think about it: a S2 DABT where DL == 0 can happen because of > > > something that the VMM, KVM or the guest has done: > > > > > > 1. If it's because of something that the host's userspace did (memslot was > > > changed while the VM was running, memory was munmap'ed, etc). In this case, > > > there's no way for KVM to handle the SPE fault, so I would say that the > > > sensible approach would be to inject an SPE external abort. > > > > > > 2. If it's because of something that KVM did, that can only be because of a > > > bug in SPE emulation. In this case, it can happen again, which means > > > arbitrary blackout windows which can skew the profiling results. I would > > > much rather inject an SPE external abort then let the guest rely on > > > potentially bad profiling information. > > > > > > 3. The guest changes the mapping for the buffer when it shouldn't have: A. > > > when the architecture does allow it, but KVM doesn't support, or B. 
when > > > the architecture doesn't allow it. For both cases, I would much rather > > > inject an SPE external abort for the reasons above. Furthermore, for B, I > > > think it would be better to let the guest know as soon as possible that > > > it's not following the architecture. > > > > > > In conclusion, I would prefer to treat all SPE S2 faults as errors. > > > > My main concern with treating S2 faults as a synthetic external abort is > > how this behavior progresses in later versions of the architecture. > > SPEv1p3 disallows implementations from reporting external aborts via the > > SPU, instead allowing only for an SError to be delivered to the core. > > Ah, yes, missed that bit for SPEv1p3 (ARM DDI 0487H.a, page D10-5180). > > > > > I caught up with Will on this for a little bit: > > > > Instead of an external abort, how about reporting an IMP DEF buffer > > management event to the guest? At least for the Linux driver it should > > have the same effect of killing the session but the VM will stay > > running. This way there's no architectural requirement to promote to an > > SError. > > The only reason I proposed to inject an external abort is because KVM needs > a way to tell the guest that something outside of the guest's control went > wrong and it should drop the contents of the current profiling session. An > external abort reported by the SPU seemed to fit the bit. > > By IMP DEF buffer management event I assume you mean PMBSR_EL1.EC=0b011111 > (Buffer management event for an IMPLEMENTATION DEFINED reason). Yup, that's it. You also get two whole bytes of room in PMBSR_EL1.MSS which is also IMP DEF, so we could even stick some ASCII in there to tell the guest how we really feel! :-P > I'm thinking that someone might run a custom kernel in a VM, like a vendor > downstream kernel, with patches that actually handle this exception class, > and injecting such an exception might not have the effects that KVM > expects. Am I overthinking things? 
Is that something that KVM should take > into consideration? I suppose KVM can and should also set > PMBSR_EL1.DL = 1, as that means per the architecture that the buffer > contents should be discarded. I agree with you that PMBSR_EL1.DL=1 is the right call for this. With that, I'd be surprised if there was a guest that tried to pull some tricks other than blowing away the profile. The other option that I find funny is if we plainly report the S2 abort to the guest, but that won't work well when nested comes into the picture. -- Thanks, Oliver
* Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory 2022-09-13 14:13 ` Oliver Upton @ 2023-01-03 14:26 ` Alexandru Elisei -1 siblings, 0 replies; 72+ messages in thread From: Alexandru Elisei @ 2023-01-03 14:26 UTC (permalink / raw) To: Oliver Upton; +Cc: maz, Will Deacon, kvmarm, linux-arm-kernel Hi, Just a heads-up, sent a new proposal for SPE emulation which removes the need to pin memory at stage 2 [1]. [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2022-November/056637.html Thanks, Alex On Tue, Sep 13, 2022 at 03:13:31PM +0100, Oliver Upton wrote: > On Tue, Sep 13, 2022 at 01:41:56PM +0100, Alexandru Elisei wrote: > > Hi Oliver, > > > > On Tue, Sep 13, 2022 at 11:58:47AM +0100, Oliver Upton wrote: > > > Hey Alex, > > > > > > On Mon, Sep 12, 2022 at 03:50:46PM +0100, Alexandru Elisei wrote: > > > > > > [...] > > > > > > > > Yeah, that would be good to follow up on what other OSes are doing. > > > > > > > > FreeBSD doesn't have a SPE driver. > > > > > > > > Currently in the process of finding out how/if Windows implements the > > > > driver. > > > > > > > > > You'll still have a nondestructive S2 fault handler for the SPE, right? > > > > > IOW, if PMBSR_EL1.DL=0 KVM will just unpin the old buffer and repin the > > > > > new one. > > > > > > > > This is how I think about it: a S2 DABT where DL == 0 can happen because of > > > > something that the VMM, KVM or the guest has done: > > > > > > > > 1. If it's because of something that the host's userspace did (memslot was > > > > changed while the VM was running, memory was munmap'ed, etc). In this case, > > > > there's no way for KVM to handle the SPE fault, so I would say that the > > > > sensible approach would be to inject an SPE external abort. > > > > > > > > 2. If it's because of something that KVM did, that can only be because of a > > > > bug in SPE emulation. 
In this case, it can happen again, which means > > > > arbitrary blackout windows which can skew the profiling results. I would > > > > much rather inject an SPE external abort then let the guest rely on > > > > potentially bad profiling information. > > > > > > > > 3. The guest changes the mapping for the buffer when it shouldn't have: A. > > > > when the architecture does allow it, but KVM doesn't support, or B. when > > > > the architecture doesn't allow it. For both cases, I would much rather > > > > inject an SPE external abort for the reasons above. Furthermore, for B, I > > > > think it would be better to let the guest know as soon as possible that > > > > it's not following the architecture. > > > > > > > > In conclusion, I would prefer to treat all SPE S2 faults as errors. > > > > > > My main concern with treating S2 faults as a synthetic external abort is > > > how this behavior progresses in later versions of the architecture. > > > SPEv1p3 disallows implementations from reporting external aborts via the > > > SPU, instead allowing only for an SError to be delivered to the core. > > > > Ah, yes, missed that bit for SPEv1p3 (ARM DDI 0487H.a, page D10-5180). > > > > > > > > I caught up with Will on this for a little bit: > > > > > > Instead of an external abort, how about reporting an IMP DEF buffer > > > management event to the guest? At least for the Linux driver it should > > > have the same effect of killing the session but the VM will stay > > > running. This way there's no architectural requirement to promote to an > > > SError. > > > > The only reason I proposed to inject an external abort is because KVM needs > > a way to tell the guest that something outside of the guest's control went > > wrong and it should drop the contents of the current profiling session. An > > external abort reported by the SPU seemed to fit the bit. 
> > > > By IMP DEF buffer management event I assume you mean PMBSR_EL1.EC=0b011111 > > (Buffer management event for an IMPLEMENTATION DEFINED reason). > > Yup, that's it. You also get two whole bytes of room in PMBSR_EL1.MSS > which is also IMP DEF, so we could even stick some ASCII in there to > tell the guest how we really feel! :-P > > > I'm thinking that someone might run a custom kernel in a VM, like a vendor > > downstream kernel, with patches that actually handle this exception class, > > and injecting such an exception might not have the effects that KVM > > expects. Am I overthinking things? Is that something that KVM should take > > into consideration? I suppose KVM can and should also set > > PMBSR_EL1.DL = 1, as that means per the architecture that the buffer > > contents should be discarded. > > I agree with you that PMBSR_EL1.DL=1 is the right call for this. With > that, I'd be surprised if there was a guest that tried to pull some > tricks other than blowing away the profile. The other option that I > find funny is if we plainly report the S2 abort to the guest, but that > wont work well when nested comes into the picture. > > -- > Thanks, > Oliver _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm ^ permalink raw reply [flat|nested] 72+ messages in thread
end of thread, other threads:[~2023-01-03 17:20 UTC | newest] Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page --
2022-04-19 13:51 KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory Alexandru Elisei
2022-04-19 14:10 ` Will Deacon
2022-04-19 14:44 ` Alexandru Elisei
2022-04-19 14:59 ` Will Deacon
2022-04-19 15:20 ` Alexandru Elisei
2022-04-19 15:35 ` Alexandru Elisei
2022-07-25 10:06 ` Alexandru Elisei
2022-07-26 17:51 ` Oliver Upton
2022-07-27  9:30 ` Marc Zyngier
2022-07-27  9:52 ` Marc Zyngier
2022-07-27 10:38 ` Alexandru Elisei
2022-07-27 16:06 ` Oliver Upton
2022-07-27 10:56 ` Alexandru Elisei
2022-07-27 11:18 ` Marc Zyngier
2022-07-27 12:10 ` Alexandru Elisei
2022-07-27 10:19 ` Alexandru Elisei
2022-07-27 10:29 ` Marc Zyngier
2022-07-27 10:44 ` Alexandru Elisei
2022-07-27 11:08 ` Marc Zyngier
2022-07-27 11:57 ` Alexandru Elisei
2022-07-27 15:15 ` Oliver Upton
2022-07-27 11:00 ` Alexandru Elisei
2022-08-01 17:00 ` Will Deacon
2022-08-02  9:49 ` Alexandru Elisei
2022-08-02 19:34 ` Oliver Upton
2022-08-09 14:01 ` Alexandru Elisei
2022-08-09 18:43 ` Oliver Upton
2022-08-10  9:37 ` Alexandru Elisei
2022-08-10 15:25 ` Oliver Upton
2022-08-12 13:05 ` Alexandru Elisei
2022-08-17 15:05 ` Oliver Upton
2022-09-12 14:50 ` Alexandru Elisei
2022-09-13 10:58 ` Oliver Upton
2022-09-13 12:41 ` Alexandru Elisei
2022-09-13 14:13 ` Oliver Upton
2023-01-03 14:26 ` Alexandru Elisei