* KVM/arm64: Guest ABI changes do not appear rollback-safe @ 2021-08-24 21:15 Oliver Upton 2021-08-24 21:34 ` [RFC PATCH] KVM: arm64: Allow VMMs to opt-out of KVM_CAP_PTP_KVM Oliver Upton 2021-08-25 9:27 ` KVM/arm64: Guest ABI changes do not appear rollback-safe Marc Zyngier 0 siblings, 2 replies; 25+ messages in thread From: Oliver Upton @ 2021-08-24 21:15 UTC (permalink / raw) To: kvmarm Cc: maz, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, alexandru.elisei, suzuki.poulose Hey folks, Ricardo and I were discussing hypercall support in KVM/arm64 and something seems to be a bit problematic. I do not see anywhere that KVM requires opt-in from the VMM to expose new hypercalls to the guest. To name a few, the TRNG and KVM PTP hypercalls are unconditionally provided to the guest. Exposing new hypercalls to guests in this manner seems very unsafe to me. Suppose an operator is trying to upgrade from kernel N to kernel N+1, which brings in the new 'widget' hypercall. Guests are live migrated onto the N+1 kernel, but the operator finds a defect that warrants a kernel rollback. VMs are then migrated from kernel N+1 -> N. Any guests that discovered the 'widget' hypercall are likely going to get fussy _very_ quickly on the old kernel. Really, we need to ensure that the exposed guest ABI is backwards-compatible. Running a VMM that is blissfully unaware of the 'widget' hypercall should not implicitly expose it to its guest on a new kernel. This conversation was in the context of devising a new UAPI that allows VMMs to trap hypercalls to userspace. While such an interface would easily work around the issue, the onus of ABI compatibility lies with the kernel. So, this is all a long-winded way of asking: how do we dig ourselves out of this situation, and how to we avoid it happening again in the future? I believe the answer to both is to have new VM capabilities for sets of hypercalls exposed to the guest. Unfortunately, the unconditional exposure of TRNG and PTP hypercalls is ABI now, so we'd have to provide an opt-out at this point. For anything new, require opt-in from the VMM before we give it to the guest. Have I missed something blatantly obvious, or do others see this as an issue as well? I'll reply with an example of adding opt-out for PTP. I'm sure other hypercalls could be handled similarly. -- Thanks, Oliver ^ permalink raw reply [flat|nested] 25+ messages in thread
* [RFC PATCH] KVM: arm64: Allow VMMs to opt-out of KVM_CAP_PTP_KVM 2021-08-24 21:15 KVM/arm64: Guest ABI changes do not appear rollback-safe Oliver Upton @ 2021-08-24 21:34 ` Oliver Upton 2021-08-25 9:27 ` KVM/arm64: Guest ABI changes do not appear rollback-safe Marc Zyngier 1 sibling, 0 replies; 25+ messages in thread From: Oliver Upton @ 2021-08-24 21:34 UTC (permalink / raw) To: kvm, kvmarm Cc: Marc Zyngier, Peter Shier, Ricardo Koller, Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei, Suzuki K Poulose, rejiw, Oliver Upton commit 3bf725699bf6 ("KVM: arm64: Add support for the KVM PTP service") introduced support for a hypercall-based interface through which a KVM guest may query the host's walltime relative to its physical or virtual counter. Unfortunately, KVM does not require opt-in for the feature, and unconditionally provides it to guests when available. This is extremely problematic for operators who want to ensure guest migrations are rollback safe. If an operator were to live migrate guests to a kernel with KVM_CAP_PTP_KVM and subsequently need to roll back the kernel, guests that discovered the hypercall will get fussy *very* quickly. Plug the hazard by introducing a new capability, KVM_CAP_DISABLE_PTP_KVM. To maintain ABI compatibility with the aforementioned change, this cap is off by default. When enabled, hide the KVM PTP hypercall from the guest. Fixes: 3bf725699bf6 ("KVM: arm64: Add support for the KVM PTP service") Signed-off-by: Oliver Upton <oupton@google.com> --- Patch cleanly applies on v5.14-rc7. Delightfully untested beyond building it :) Documentation/virt/kvm/api.rst | 13 +++++++++++++ arch/arm64/include/asm/kvm_host.h | 3 +++ arch/arm64/kvm/arm.c | 5 +++++ arch/arm64/kvm/hypercalls.c | 7 +++++-- include/uapi/linux/kvm.h | 1 + 5 files changed, 27 insertions(+), 2 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index dae68e68ca23..4866418a2bb6 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -7241,3 +7241,16 @@ The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset of the result of KVM_CHECK_EXTENSION. KVM will forward to userspace the hypercalls whose corresponding bit is in the argument, and return ENOSYS for the others. + +8.35 KVM_CAP_DISABLE_PTP_KVM +---------------------------- + +:Architectures: arm64 + +This capability indicates that a VMM may disable the KVM virtual PTP +service for a guest. KVM_CAP_PTP_KVM introduced support for this +hypercall interface, but it is unconditionally enabled without any +opt-out. + +When this capability is enabled, KVM will hide the KVM virtual PTP +service from the guest. diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 41911585ae0c..8795228aa08e 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -136,6 +136,9 @@ struct kvm_arch { /* Memory Tagging Extension enabled for the guest */ bool mte_enabled; + + /* PTP KVM hypercall disabled for this guest */ + bool ptp_kvm_disabled; }; struct kvm_vcpu_fault_info { diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 0ca72f5cda41..b8f3b2eafd45 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -103,6 +103,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, } mutex_unlock(&kvm->lock); break; + case KVM_CAP_DISABLE_PTP_KVM: + kvm->arch.ptp_kvm_disabled = true; + r = 0; + break; default: r = -EINVAL; break; @@ -217,6 +221,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SET_GUEST_DEBUG: case KVM_CAP_VCPU_ATTRIBUTES: case KVM_CAP_PTP_KVM: + case KVM_CAP_DISABLE_PTP_KVM: r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c index 30da78f72b3b..8e9f2e1329e7 100644 --- a/arch/arm64/kvm/hypercalls.c +++ b/arch/arm64/kvm/hypercalls.c @@ -62,6 +62,7 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu) { u32 func_id = smccc_get_function(vcpu); u64 val[4] = {SMCCC_RET_NOT_SUPPORTED}; + struct kvm *kvm = vcpu->kvm; u32 feature; gpa_t gpa; @@ -128,10 +129,12 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu) break; case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID: val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES); - val[0] |= BIT(ARM_SMCCC_KVM_FUNC_PTP); + if (!kvm->arch.ptp_kvm_disabled) + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_PTP); break; case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID: - kvm_ptp_get_time(vcpu, val); + if (!kvm->arch.ptp_kvm_disabled) + kvm_ptp_get_time(vcpu, val); break; case ARM_SMCCC_TRNG_VERSION: case ARM_SMCCC_TRNG_FEATURES: diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index d9e4aabcb31a..d8419c336ec8 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1112,6 +1112,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_BINARY_STATS_FD 203 #define KVM_CAP_EXIT_ON_EMULATION_FAILURE 204 #define KVM_CAP_ARM_MTE 205 +#define KVM_CAP_DISABLE_PTP_KVM 206 #ifdef KVM_CAP_IRQ_ROUTING -- 2.33.0.rc2.250.ged5fa647cd-goog ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-08-24 21:15 KVM/arm64: Guest ABI changes do not appear rollback-safe Oliver Upton 2021-08-24 21:34 ` [RFC PATCH] KVM: arm64: Allow VMMs to opt-out of KVM_CAP_PTP_KVM Oliver Upton @ 2021-08-25 9:27 ` Marc Zyngier 2021-08-25 10:02 ` Oliver Upton 1 sibling, 1 reply; 25+ messages in thread From: Marc Zyngier @ 2021-08-25 9:27 UTC (permalink / raw) To: Oliver Upton Cc: kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, alexandru.elisei, suzuki.poulose, Drew Jones, Peter Maydell Hi Oliver, Adding Andrew and Peter to the discussion. On Tue, 24 Aug 2021 22:15:03 +0100, Oliver Upton <oupton@google.com> wrote: > > Hey folks, > > Ricardo and I were discussing hypercall support in KVM/arm64 and > something seems to be a bit problematic. I do not see anywhere that KVM > requires opt-in from the VMM to expose new hypercalls to the guest. To > name a few, the TRNG and KVM PTP hypercalls are unconditionally provided > to the guest. > > Exposing new hypercalls to guests in this manner seems very unsafe to > me. Suppose an operator is trying to upgrade from kernel N to kernel > N+1, which brings in the new 'widget' hypercall. Guests are live > migrated onto the N+1 kernel, but the operator finds a defect that > warrants a kernel rollback. VMs are then migrated from kernel N+1 -> N. > Any guests that discovered the 'widget' hypercall are likely going to > get fussy _very_ quickly on the old kernel. This goes against what we decided to support for the *only* publicly available VMM that cares about save/restore, which is that we only move forward and don't rollback. Hypercalls are the least of your worries, and there is a whole range of other architectural features that will have also appeared/disappeared (your own CNTPOFF series is a glaring example of this). I appreciate that you may have different considerations, but I felt that it was important to state *why* this is the way it is. > > Really, we need to ensure that the exposed guest ABI is > backwards-compatible. Running a VMM that is blissfully unaware of the > 'widget' hypercall should not implicitly expose it to its guest on a new > kernel. > > This conversation was in the context of devising a new UAPI that allows > VMMs to trap hypercalls to userspace. While such an interface would > easily work around the issue, the onus of ABI compatibility lies with > the kernel. > > So, this is all a long-winded way of asking: how do we dig ourselves out > of this situation, and how to we avoid it happening again in the future? > I believe the answer to both is to have new VM capabilities for sets of > hypercalls exposed to the guest. Unfortunately, the unconditional > exposure of TRNG and PTP hypercalls is ABI now, so we'd have to provide > an opt-out at this point. For anything new, require opt-in from the VMM > before we give it to the guest. > > Have I missed something blatantly obvious, or do others see this as an > issue as well? I'll reply with an example of adding opt-out for PTP. > I'm sure other hypercalls could be handled similarly. Why do we need this? For future hypercalls, we could have some buy-in capabilities. For existing ones, it is too late, and negative features are just too horrible. For KVM-specific hypercalls, we could get the VMM to save/restore the bitmap of supported functions. That would be "less horrible". This could be implemented using extra "firmware pseudo-registers" such as the ones described in Documentation/virt/kvm/arm/psci.rst. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-08-25 9:27 ` KVM/arm64: Guest ABI changes do not appear rollback-safe Marc Zyngier @ 2021-08-25 10:02 ` Oliver Upton 2021-08-25 10:39 ` Marc Zyngier 0 siblings, 1 reply; 25+ messages in thread From: Oliver Upton @ 2021-08-25 10:02 UTC (permalink / raw) To: Marc Zyngier Cc: kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Drew Jones, Peter Maydell On Wed, Aug 25, 2021 at 2:27 AM Marc Zyngier <maz@kernel.org> wrote: > > Exposing new hypercalls to guests in this manner seems very unsafe to > > me. Suppose an operator is trying to upgrade from kernel N to kernel > > N+1, which brings in the new 'widget' hypercall. Guests are live > > migrated onto the N+1 kernel, but the operator finds a defect that > > warrants a kernel rollback. VMs are then migrated from kernel N+1 -> N. > > Any guests that discovered the 'widget' hypercall are likely going to > > get fussy _very_ quickly on the old kernel. > > This goes against what we decided to support for the *only* publicly > available VMM that cares about save/restore, which is that we only > move forward and don't rollback. Ah, I was definitely missing this context. Current behavior makes much more sense then. > Hypercalls are the least of your > worries, and there is a whole range of other architectural features > that will have also appeared/disappeared (your own CNTPOFF series is a > glaring example of this). Isn't that a tad bit different though? I'll admit, I'm just as guilty with my own series forgetting to add a KVM_CAP (oops), but it is in my queue to kick out with the fix for nVHE/ptimer. Nonetheless, if a user takes up a new KVM UAPI, it is up to the user to run on a new kernel. My concerns are explicitly with the 'under the nose' changes, where KVM modifies the guest feature set without userspace opting in. Based on your comment, though, it would appear that other parts of KVM are affected too. It doesn't have to be rollback safety, either. There may simply be a hypercall which an operator doesn't want to give its guests, and it needs a way to tell KVM to hide it. > > Have I missed something blatantly obvious, or do others see this as an > > issue as well? I'll reply with an example of adding opt-out for PTP. > > I'm sure other hypercalls could be handled similarly. > > Why do we need this? For future hypercalls, we could have some buy-in > capabilities. For existing ones, it is too late, and negative features > are just too horrible. Oh, agreed on the nastiness. Lazy hack to realize the intended functional change.. > For KVM-specific hypercalls, we could get the VMM to save/restore the > bitmap of supported functions. That would be "less horrible". This > could be implemented using extra "firmware pseudo-registers" such as > the ones described in Documentation/virt/kvm/arm/psci.rst. This seems more reasonable, especially since we do this for migrating the guest's PSCI version. Alternatively, I had thought about using a VM attribute, given the fact that it is non-architectural information and we avoid ABI issues in KVM_GET_REG_LIST without buy-in through a KVM_CAP. > Thanks, > > M. > > -- > Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-08-25 10:02 ` Oliver Upton @ 2021-08-25 10:39 ` Marc Zyngier 2021-08-25 15:07 ` Andrew Jones 0 siblings, 1 reply; 25+ messages in thread From: Marc Zyngier @ 2021-08-25 10:39 UTC (permalink / raw) To: Oliver Upton Cc: kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Drew Jones, Peter Maydell On Wed, 25 Aug 2021 11:02:28 +0100, Oliver Upton <oupton@google.com> wrote: > > On Wed, Aug 25, 2021 at 2:27 AM Marc Zyngier <maz@kernel.org> wrote: > > > Exposing new hypercalls to guests in this manner seems very unsafe to > > > me. Suppose an operator is trying to upgrade from kernel N to kernel > > > N+1, which brings in the new 'widget' hypercall. Guests are live > > > migrated onto the N+1 kernel, but the operator finds a defect that > > > warrants a kernel rollback. VMs are then migrated from kernel N+1 -> N. > > > Any guests that discovered the 'widget' hypercall are likely going to > > > get fussy _very_ quickly on the old kernel. > > > > This goes against what we decided to support for the *only* publicly > > available VMM that cares about save/restore, which is that we only > > move forward and don't rollback. > > Ah, I was definitely missing this context. Current behavior makes much > more sense then. > > > Hypercalls are the least of your > > worries, and there is a whole range of other architectural features > > that will have also appeared/disappeared (your own CNTPOFF series is a > > glaring example of this). > > Isn't that a tad bit different though? I'll admit, I'm just as guilty > with my own series forgetting to add a KVM_CAP (oops), but it is in my > queue to kick out with the fix for nVHE/ptimer. Nonetheless, if a user > takes up a new KVM UAPI, it is up to the user to run on a new kernel. The two are linked. Exposing a new register to userspace and/or guest result in the same thing: you can't rollback. That's specially true in the QEMU case, which *learns* from the kernel what registers are available, and doesn't maintain a fixed list. > My concerns are explicitly with the 'under the nose' changes, where > KVM modifies the guest feature set without userspace opting in. Based > on your comment, though, it would appear that other parts of KVM are > affected too. Any new system register that is exposed by a new kernel feature breaks rollback. And so far, we only consider it a bug if the set of exposed registers reduces. Anything can be added safely (as checked by one of the selftests added by Drew). < It doesn't have to be rollback safety, either. There may > simply be a hypercall which an operator doesn't want to give its > guests, and it needs a way to tell KVM to hide it. Fair enough. But this has to be done in a scalable way, which individual capability cannot provide. > > > Have I missed something blatantly obvious, or do others see this as an > > > issue as well? I'll reply with an example of adding opt-out for PTP. > > > I'm sure other hypercalls could be handled similarly. > > > > Why do we need this? For future hypercalls, we could have some buy-in > > capabilities. For existing ones, it is too late, and negative features > > are just too horrible. > > Oh, agreed on the nastiness. Lazy hack to realize the intended > functional change.. Well, you definitely achieved your goal of attracting my attention :). > > For KVM-specific hypercalls, we could get the VMM to save/restore the > > bitmap of supported functions. That would be "less horrible". This > > could be implemented using extra "firmware pseudo-registers" such as > > the ones described in Documentation/virt/kvm/arm/psci.rst. > > This seems more reasonable, especially since we do this for migrating > the guest's PSCI version. > > Alternatively, I had thought about using a VM attribute, given the > fact that it is non-architectural information and we avoid ABI issues > in KVM_GET_REG_LIST without buy-in through a KVM_CAP. The whole point is that these settings get exposed by KVM_GET_REG_LIST, as this is QEMU's way to dump a VM state. Given that we already have this for things like the spectre management state, we can just as well expose the bitmaps that deal with the KVM-specific hypercalls. After all, this falls into the realm of "KVM as VM firmware". For ARM-architected hypercalls (TRNG, stolen time), we may need a similar extension. M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-08-25 10:39 ` Marc Zyngier @ 2021-08-25 15:07 ` Andrew Jones 2021-08-25 18:14 ` Oliver Upton 0 siblings, 1 reply; 25+ messages in thread From: Andrew Jones @ 2021-08-25 15:07 UTC (permalink / raw) To: Marc Zyngier Cc: Oliver Upton, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Wed, Aug 25, 2021 at 11:39:34AM +0100, Marc Zyngier wrote: > On Wed, 25 Aug 2021 11:02:28 +0100, > Oliver Upton <oupton@google.com> wrote: > > > > On Wed, Aug 25, 2021 at 2:27 AM Marc Zyngier <maz@kernel.org> wrote: > > > > Exposing new hypercalls to guests in this manner seems very unsafe to > > > > me. Suppose an operator is trying to upgrade from kernel N to kernel > > > > N+1, which brings in the new 'widget' hypercall. Guests are live > > > > migrated onto the N+1 kernel, but the operator finds a defect that > > > > warrants a kernel rollback. VMs are then migrated from kernel N+1 -> N. > > > > Any guests that discovered the 'widget' hypercall are likely going to > > > > get fussy _very_ quickly on the old kernel. > > > > > > This goes against what we decided to support for the *only* publicly > > > available VMM that cares about save/restore, which is that we only > > > move forward and don't rollback. > > > > Ah, I was definitely missing this context. Current behavior makes much > > more sense then. > > > > > Hypercalls are the least of your > > > worries, and there is a whole range of other architectural features > > > that will have also appeared/disappeared (your own CNTPOFF series is a > > > glaring example of this). > > > > Isn't that a tad bit different though? I'll admit, I'm just as guilty > > with my own series forgetting to add a KVM_CAP (oops), but it is in my > > queue to kick out with the fix for nVHE/ptimer. Nonetheless, if a user > > takes up a new KVM UAPI, it is up to the user to run on a new kernel. > > The two are linked. Exposing a new register to userspace and/or guest > result in the same thing: you can't rollback. That's specially true in > the QEMU case, which *learns* from the kernel what registers are > available, and doesn't maintain a fixed list. > > > My concerns are explicitly with the 'under the nose' changes, where > > KVM modifies the guest feature set without userspace opting in. Based > > on your comment, though, it would appear that other parts of KVM are > > affected too. > > Any new system register that is exposed by a new kernel feature breaks > rollback. And so far, we only consider it a bug if the set of exposed > registers reduces. Anything can be added safely (as checked by one of > the selftests added by Drew). > > < It doesn't have to be rollback safety, either. There may > > simply be a hypercall which an operator doesn't want to give its > > guests, and it needs a way to tell KVM to hide it. > > Fair enough. But this has to be done in a scalable way, which > individual capability cannot provide. > > > > > Have I missed something blatantly obvious, or do others see this as an > > > > issue as well? I'll reply with an example of adding opt-out for PTP. > > > > I'm sure other hypercalls could be handled similarly. > > > > > > Why do we need this? For future hypercalls, we could have some buy-in > > > capabilities. For existing ones, it is too late, and negative features > > > are just too horrible. > > > > Oh, agreed on the nastiness. Lazy hack to realize the intended > > functional change.. > > Well, you definitely achieved your goal of attracting my attention :). > > > > For KVM-specific hypercalls, we could get the VMM to save/restore the > > > bitmap of supported functions. That would be "less horrible". This > > > could be implemented using extra "firmware pseudo-registers" such as > > > the ones described in Documentation/virt/kvm/arm/psci.rst. > > > > This seems more reasonable, especially since we do this for migrating > > the guest's PSCI version. > > > > Alternatively, I had thought about using a VM attribute, given the > > fact that it is non-architectural information and we avoid ABI issues > > in KVM_GET_REG_LIST without buy-in through a KVM_CAP. > > The whole point is that these settings get exposed by > KVM_GET_REG_LIST, as this is QEMU's way to dump a VM state. Given that > we already have this for things like the spectre management state, we > can just as well expose the bitmaps that deal with the KVM-specific > hypercalls. After all, this falls into the realm of "KVM as VM > firmware". > > For ARM-architected hypercalls (TRNG, stolen time), we may need a > similar extension. > Thanks for including me Marc. I think you've mentioned all the examples of why we don't generally expect N+1 -> N migrations to work that I can think of. While some of the examples like get-reg-list could eventually be eliminated if we had CPU models to tighten our machine type state, I think N+1 -> N migrations will always be best effort at most. I agree with giving userspace control over the exposer of the hypercalls though. Using pseudo-registers for that purpose rather than a pile of CAPs also seems reasonable to me. And, while I don't think this patch is going to proceed, I thought I'd point out that the opt-out approach doesn't help much with expanding our migration support unless we require the VMM to be upgraded first. And, even then, the (N_kern, N+1_vmm) -> (N+1_kern, N_vmm) case won't work as expected, since the source enforce opt-out, but the destination won't. Also, since the VMM doesn't key off the kernel version, for the most part N+1 VMMs won't know when they're supposed to opt-out or not, leaving it to the user to ensure they consider everything. opt-in usually only needs the user to consider what machine type they want to launch. Thanks, drew ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-08-25 15:07 ` Andrew Jones @ 2021-08-25 18:14 ` Oliver Upton 2021-08-26 8:37 ` Marc Zyngier 2021-08-26 8:49 ` Andrew Jones 0 siblings, 2 replies; 25+ messages in thread From: Oliver Upton @ 2021-08-25 18:14 UTC (permalink / raw) To: Andrew Jones Cc: Marc Zyngier, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Wed, Aug 25, 2021 at 8:07 AM Andrew Jones <drjones@redhat.com> wrote: > > On Wed, Aug 25, 2021 at 11:39:34AM +0100, Marc Zyngier wrote: > > On Wed, 25 Aug 2021 11:02:28 +0100, > > Oliver Upton <oupton@google.com> wrote: > > > > > > On Wed, Aug 25, 2021 at 2:27 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > Exposing new hypercalls to guests in this manner seems very unsafe to > > > > > me. Suppose an operator is trying to upgrade from kernel N to kernel > > > > > N+1, which brings in the new 'widget' hypercall. Guests are live > > > > > migrated onto the N+1 kernel, but the operator finds a defect that > > > > > warrants a kernel rollback. VMs are then migrated from kernel N+1 -> N. > > > > > Any guests that discovered the 'widget' hypercall are likely going to > > > > > get fussy _very_ quickly on the old kernel. > > > > > > > > This goes against what we decided to support for the *only* publicly > > > > available VMM that cares about save/restore, which is that we only > > > > move forward and don't rollback. > > > > > > Ah, I was definitely missing this context. Current behavior makes much > > > more sense then. > > > > > > > Hypercalls are the least of your > > > > worries, and there is a whole range of other architectural features > > > > that will have also appeared/disappeared (your own CNTPOFF series is a > > > > glaring example of this). > > > > > > Isn't that a tad bit different though? I'll admit, I'm just as guilty > > > with my own series forgetting to add a KVM_CAP (oops), but it is in my > > > queue to kick out with the fix for nVHE/ptimer. Nonetheless, if a user > > > takes up a new KVM UAPI, it is up to the user to run on a new kernel. > > > > The two are linked. Exposing a new register to userspace and/or guest > > result in the same thing: you can't rollback. That's specially true in > > the QEMU case, which *learns* from the kernel what registers are > > available, and doesn't maintain a fixed list. > > > > > My concerns are explicitly with the 'under the nose' changes, where > > > KVM modifies the guest feature set without userspace opting in. Based > > > on your comment, though, it would appear that other parts of KVM are > > > affected too. > > > > Any new system register that is exposed by a new kernel feature breaks > > rollback. And so far, we only consider it a bug if the set of exposed > > registers reduces. Anything can be added safely (as checked by one of > > the selftests added by Drew). > > > > < It doesn't have to be rollback safety, either. There may > > > simply be a hypercall which an operator doesn't want to give its > > > guests, and it needs a way to tell KVM to hide it. > > > > Fair enough. But this has to be done in a scalable way, which > > individual capability cannot provide. > > > > > > > Have I missed something blatantly obvious, or do others see this as an > > > > > issue as well? I'll reply with an example of adding opt-out for PTP. > > > > > I'm sure other hypercalls could be handled similarly. > > > > > > > > Why do we need this? For future hypercalls, we could have some buy-in > > > > capabilities. For existing ones, it is too late, and negative features > > > > are just too horrible. > > > > > > Oh, agreed on the nastiness. Lazy hack to realize the intended > > > functional change.. > > > > Well, you definitely achieved your goal of attracting my attention :). > > > > > > For KVM-specific hypercalls, we could get the VMM to save/restore the > > > > bitmap of supported functions. That would be "less horrible". This > > > > could be implemented using extra "firmware pseudo-registers" such as > > > > the ones described in Documentation/virt/kvm/arm/psci.rst. > > > > > > This seems more reasonable, especially since we do this for migrating > > > the guest's PSCI version. > > > > > > Alternatively, I had thought about using a VM attribute, given the > > > fact that it is non-architectural information and we avoid ABI issues > > > in KVM_GET_REG_LIST without buy-in through a KVM_CAP. > > > > The whole point is that these settings get exposed by > > KVM_GET_REG_LIST, as this is QEMU's way to dump a VM state. Given that > > we already have this for things like the spectre management state, we > > can just as well expose the bitmaps that deal with the KVM-specific > > hypercalls. After all, this falls into the realm of "KVM as VM > > firmware". > > > > For ARM-architected hypercalls (TRNG, stolen time), we may need a > > similar extension. > > > > Thanks for including me Marc. I think you've mentioned all the examples > of why we don't generally expect N+1 -> N migrations to work that I > can think of. While some of the examples like get-reg-list could > eventually be eliminated if we had CPU models to tighten our machine type > state, I think N+1 -> N migrations will always be best effort at most. > > I agree with giving userspace control over the exposer of the hypercalls > though. Using pseudo-registers for that purpose rather than a pile of > CAPs also seems reasonable to me. > > And, while I don't think this patch is going to proceed, I thought I'd > point out that the opt-out approach doesn't help much with expanding > our migration support unless we require the VMM to be upgraded first. > > And, even then, the (N_kern, N+1_vmm) -> (N+1_kern, N_vmm) case won't > work as expected, since the source enforce opt-out, but the destination > won't. Right, there's going to need to be a fence in both kernel and VMM versions. Before the fence, you can't rollback with either component. Once on the other side of the fence, the user may freely migrate between kernel + VMM combinations. > Also, since the VMM doesn't key off the kernel version, for the > most part N+1 VMMs won't know when they're supposed to opt-out or not, > leaving it to the user to ensure they consider everything. opt-in > usually only needs the user to consider what machine type they want to > launch. Going the register route will implicitly require opt-out for all old hypercalls. We exposed them unconditionally to the guest before, and we must uphold that behavior. The default value for the bitmap will have those features set. Any hypercalls added after that register interface will then require explicit opt-in from userspace. With regards to the pseudoregister interface, how would a VMM discover new bits? From my perspective, you need to have two bitmaps that the VMM can get at: the set of supported feature bits and the active bitmap of features for a running guest. > Thanks, > drew > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-08-25 18:14 ` Oliver Upton @ 2021-08-26 8:37 ` Marc Zyngier 2021-08-26 18:49 ` Oliver Upton 2021-08-26 8:49 ` Andrew Jones 1 sibling, 1 reply; 25+ messages in thread From: Marc Zyngier @ 2021-08-26 8:37 UTC (permalink / raw) To: Oliver Upton Cc: Andrew Jones, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Wed, 25 Aug 2021 19:14:59 +0100, Oliver Upton <oupton@google.com> wrote: > > On Wed, Aug 25, 2021 at 8:07 AM Andrew Jones <drjones@redhat.com> wrote: [...] > > Thanks for including me Marc. I think you've mentioned all the examples > > of why we don't generally expect N+1 -> N migrations to work that I > > can think of. While some of the examples like get-reg-list could > > eventually be eliminated if we had CPU models to tighten our machine type > > state, I think N+1 -> N migrations will always be best effort at most. > > > > I agree with giving userspace control over the exposer of the hypercalls > > though. Using pseudo-registers for that purpose rather than a pile of > > CAPs also seems reasonable to me. > > > > And, while I don't think this patch is going to proceed, I thought I'd > > point out that the opt-out approach doesn't help much with expanding > > our migration support unless we require the VMM to be upgraded first. > > > > And, even then, the (N_kern, N+1_vmm) -> (N+1_kern, N_vmm) case won't > > work as expected, since the source enforce opt-out, but the destination > > won't. > > Right, there's going to need to be a fence in both kernel and VMM > versions. Before the fence, you can't rollback with either component. > Once on the other side of the fence, the user may freely migrate > between kernel + VMM combinations. > > > Also, since the VMM doesn't key off the kernel version, for the > > most part N+1 VMMs won't know when they're supposed to opt-out or not, > > leaving it to the user to ensure they consider everything. opt-in > > usually only needs the user to consider what machine type they want to > > launch. > > Going the register route will implicitly require opt-out for all old > hypercalls. We exposed them unconditionally to the guest before, and > we must uphold that behavior. The default value for the bitmap will > have those features set. Any hypercalls added after that register > interface will then require explicit opt-in from userspace. I disagree here. This makes the ABI inconsistent, and means that no feature can be implemented without changing userspace. If you can deal with the existing features, you should be able to deal with the next lot. > With regards to the pseudoregister interface, how would a VMM discover > new bits? From my perspective, you need to have two bitmaps that the > VMM can get at: the set of supported feature bits and the active > bitmap of features for a running guest. My proposal is that we have a single pseudo-register exposing the list of implemented by the kernel. Clear the bits you don't want, and write back the result. As long as you haven't written anything, you have the full feature set. That's pretty similar to the virtio feature negotiation. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-08-26 8:37 ` Marc Zyngier @ 2021-08-26 18:49 ` Oliver Upton 2021-08-27 7:40 ` Andrew Jones 0 siblings, 1 reply; 25+ messages in thread From: Oliver Upton @ 2021-08-26 18:49 UTC (permalink / raw) To: Marc Zyngier Cc: Andrew Jones, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Thu, Aug 26, 2021 at 09:37:42AM +0100, Marc Zyngier wrote: > On Wed, 25 Aug 2021 19:14:59 +0100, > Oliver Upton <oupton@google.com> wrote: > > > > On Wed, Aug 25, 2021 at 8:07 AM Andrew Jones <drjones@redhat.com> wrote: > > [...] > > > > Thanks for including me Marc. I think you've mentioned all the examples > > > of why we don't generally expect N+1 -> N migrations to work that I > > > can think of. While some of the examples like get-reg-list could > > > eventually be eliminated if we had CPU models to tighten our machine type > > > state, I think N+1 -> N migrations will always be best effort at most. > > > > > > I agree with giving userspace control over the exposer of the hypercalls > > > though. Using pseudo-registers for that purpose rather than a pile of > > > CAPs also seems reasonable to me. > > > > > > And, while I don't think this patch is going to proceed, I thought I'd > > > point out that the opt-out approach doesn't help much with expanding > > > our migration support unless we require the VMM to be upgraded first. > > > > > > And, even then, the (N_kern, N+1_vmm) -> (N+1_kern, N_vmm) case won't > > > work as expected, since the source enforce opt-out, but the destination > > > won't. > > > > Right, there's going to need to be a fence in both kernel and VMM > > versions. Before the fence, you can't rollback with either component. > > Once on the other side of the fence, the user may freely migrate > > between kernel + VMM combinations. > > > > > Also, since the VMM doesn't key off the kernel version, for the > > > most part N+1 VMMs won't know when they're supposed to opt-out or not, > > > leaving it to the user to ensure they consider everything. opt-in > > > usually only needs the user to consider what machine type they want to > > > launch. > > > > Going the register route will implicitly require opt-out for all old > > hypercalls. We exposed them unconditionally to the guest before, and > > we must uphold that behavior. The default value for the bitmap will > > have those features set. Any hypercalls added after that register > > interface will then require explicit opt-in from userspace. > > I disagree here. This makes the ABI inconsistent, and means that no > feature can be implemented without changing userspace. If you can deal > with the existing features, you should be able to deal with the next > lot. > > > With regards to the pseudoregister interface, how would a VMM discover > > new bits? From my perspective, you need to have two bitmaps that the > > VMM can get at: the set of supported feature bits and the active > > bitmap of features for a running guest. > > My proposal is that we have a single pseudo-register exposing the list > of implemented by the kernel. Clear the bits you don't want, and write > back the result. As long as you haven't written anything, you have the > full feature set. That's pretty similar to the virtio feature > negotiation. Ah, yes I agree. Thinking about it more we will not need something similar to KVM_GET_SUPPORTED_CPUID. So then, for any register where userspace/KVM need to negotiate features, the default value will return the maximum feature set that is supported. If userspace wants to constrain features, read out the register, make sure everything you want is there, and write it back blowing away the superfluous bits. Given this should we enforce ordering on feature registers, such that a VMM can only write to the registers before a VM is started? Also, Reiji is working on making the identity registers writable for the sake of feature restriction. The suggested negotiation interface would be applicable there too, IMO. Many thanks to both you and Drew for working this out with me. -- Best, Oliver ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-08-26 18:49 ` Oliver Upton @ 2021-08-27 7:40 ` Andrew Jones 2021-09-29 18:22 ` Oliver Upton 0 siblings, 1 reply; 25+ messages in thread From: Andrew Jones @ 2021-08-27 7:40 UTC (permalink / raw) To: Oliver Upton Cc: Marc Zyngier, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Thu, Aug 26, 2021 at 06:49:27PM +0000, Oliver Upton wrote: > On Thu, Aug 26, 2021 at 09:37:42AM +0100, Marc Zyngier wrote: > > On Wed, 25 Aug 2021 19:14:59 +0100, > > Oliver Upton <oupton@google.com> wrote: > > > > > > On Wed, Aug 25, 2021 at 8:07 AM Andrew Jones <drjones@redhat.com> wrote: > > > > [...] > > > > > > Thanks for including me Marc. I think you've mentioned all the examples > > > > of why we don't generally expect N+1 -> N migrations to work that I > > > > can think of. While some of the examples like get-reg-list could > > > > eventually be eliminated if we had CPU models to tighten our machine type > > > > state, I think N+1 -> N migrations will always be best effort at most. > > > > > > > > I agree with giving userspace control over the exposer of the hypercalls > > > > though. Using pseudo-registers for that purpose rather than a pile of > > > > CAPs also seems reasonable to me. > > > > > > > > And, while I don't think this patch is going to proceed, I thought I'd > > > > point out that the opt-out approach doesn't help much with expanding > > > > our migration support unless we require the VMM to be upgraded first. > > > > > > > > And, even then, the (N_kern, N+1_vmm) -> (N+1_kern, N_vmm) case won't > > > > work as expected, since the source enforce opt-out, but the destination > > > > won't. > > > > > > Right, there's going to need to be a fence in both kernel and VMM > > > versions. Before the fence, you can't rollback with either component. > > > Once on the other side of the fence, the user may freely migrate > > > between kernel + VMM combinations. > > > > > > > Also, since the VMM doesn't key off the kernel version, for the > > > > most part N+1 VMMs won't know when they're supposed to opt-out or not, > > > > leaving it to the user to ensure they consider everything. opt-in > > > > usually only needs the user to consider what machine type they want to > > > > launch. > > > > > > Going the register route will implicitly require opt-out for all old > > > hypercalls. We exposed them unconditionally to the guest before, and > > > we must uphold that behavior. The default value for the bitmap will > > > have those features set. Any hypercalls added after that register > > > interface will then require explicit opt-in from userspace. > > > > I disagree here. This makes the ABI inconsistent, and means that no > > feature can be implemented without changing userspace. If you can deal > > with the existing features, you should be able to deal with the next > > lot. > > > > > With regards to the pseudoregister interface, how would a VMM discover > > > new bits? From my perspective, you need to have two bitmaps that the > > > VMM can get at: the set of supported feature bits and the active > > > bitmap of features for a running guest. > > > > My proposal is that we have a single pseudo-register exposing the list > > of implemented by the kernel. Clear the bits you don't want, and write > > back the result. As long as you haven't written anything, you have the > > full feature set. That's pretty similar to the virtio feature > > negotiation. > > Ah, yes I agree. Thinking about it more we will not need something > similar to KVM_GET_SUPPORTED_CPUID. > > So then, for any register where userspace/KVM need to negotiate > features, the default value will return the maximum feature set that is > supported. If userspace wants to constrain features, read out the > register, make sure everything you want is there, and write it back > blowing away the superfluous bits. Given this should we enforce ordering > on feature registers, such that a VMM can only write to the registers > before a VM is started? That's a good idea. KVM_REG_ARM64_SVE_VLS has this type of constraint so we can model the feature register control off that. > > Also, Reiji is working on making the identity registers writable for the > sake of feature restriction. The suggested negotiation interface would > be applicable there too, IMO. This this interesting news. I'll look forward to the posting. > > Many thanks to both you and Drew for working this out with me. > Thanks, drew ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-08-27 7:40 ` Andrew Jones @ 2021-09-29 18:22 ` Oliver Upton 2021-09-30 7:32 ` Marc Zyngier 0 siblings, 1 reply; 25+ messages in thread From: Oliver Upton @ 2021-09-29 18:22 UTC (permalink / raw) To: Andrew Jones Cc: Marc Zyngier, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Fri, Aug 27, 2021 at 12:40 AM Andrew Jones <drjones@redhat.com> wrote: > > On Thu, Aug 26, 2021 at 06:49:27PM +0000, Oliver Upton wrote: > > On Thu, Aug 26, 2021 at 09:37:42AM +0100, Marc Zyngier wrote: > > > On Wed, 25 Aug 2021 19:14:59 +0100, > > > Oliver Upton <oupton@google.com> wrote: > > > > > > > > On Wed, Aug 25, 2021 at 8:07 AM Andrew Jones <drjones@redhat.com> wrote: > > > > > > [...] > > > > > > > > Thanks for including me Marc. I think you've mentioned all the examples > > > > > of why we don't generally expect N+1 -> N migrations to work that I > > > > > can think of. While some of the examples like get-reg-list could > > > > > eventually be eliminated if we had CPU models to tighten our machine type > > > > > state, I think N+1 -> N migrations will always be best effort at most. > > > > > > > > > > I agree with giving userspace control over the exposer of the hypercalls > > > > > though. Using pseudo-registers for that purpose rather than a pile of > > > > > CAPs also seems reasonable to me. > > > > > > > > > > And, while I don't think this patch is going to proceed, I thought I'd > > > > > point out that the opt-out approach doesn't help much with expanding > > > > > our migration support unless we require the VMM to be upgraded first. > > > > > > > > > > And, even then, the (N_kern, N+1_vmm) -> (N+1_kern, N_vmm) case won't > > > > > work as expected, since the source enforce opt-out, but the destination > > > > > won't. > > > > > > > > Right, there's going to need to be a fence in both kernel and VMM > > > > versions. Before the fence, you can't rollback with either component. > > > > Once on the other side of the fence, the user may freely migrate > > > > between kernel + VMM combinations. > > > > > > > > > Also, since the VMM doesn't key off the kernel version, for the > > > > > most part N+1 VMMs won't know when they're supposed to opt-out or not, > > > > > leaving it to the user to ensure they consider everything. opt-in > > > > > usually only needs the user to consider what machine type they want to > > > > > launch. > > > > > > > > Going the register route will implicitly require opt-out for all old > > > > hypercalls. We exposed them unconditionally to the guest before, and > > > > we must uphold that behavior. The default value for the bitmap will > > > > have those features set. Any hypercalls added after that register > > > > interface will then require explicit opt-in from userspace. > > > > > > I disagree here. This makes the ABI inconsistent, and means that no > > > feature can be implemented without changing userspace. If you can deal > > > with the existing features, you should be able to deal with the next > > > lot. > > > > > > > With regards to the pseudoregister interface, how would a VMM discover > > > > new bits? From my perspective, you need to have two bitmaps that the > > > > VMM can get at: the set of supported feature bits and the active > > > > bitmap of features for a running guest. > > > > > > My proposal is that we have a single pseudo-register exposing the list > > > of implemented by the kernel. Clear the bits you don't want, and write > > > back the result. As long as you haven't written anything, you have the > > > full feature set. That's pretty similar to the virtio feature > > > negotiation. > > > > Ah, yes I agree. Thinking about it more we will not need something > > similar to KVM_GET_SUPPORTED_CPUID. > > > > So then, for any register where userspace/KVM need to negotiate > > features, the default value will return the maximum feature set that is > > supported. If userspace wants to constrain features, read out the > > register, make sure everything you want is there, and write it back > > blowing away the superfluous bits. Given this should we enforce ordering > > on feature registers, such that a VMM can only write to the registers > > before a VM is started? > > That's a good idea. KVM_REG_ARM64_SVE_VLS has this type of constraint so > we can model the feature register control off that. > > > > > Also, Reiji is working on making the identity registers writable for the > > sake of feature restriction. The suggested negotiation interface would > > be applicable there too, IMO. > > This this interesting news. I'll look forward to the posting. > > > > > Many thanks to both you and Drew for working this out with me. > > > > Thanks, > drew > Hey folks, I have some lingering thoughts on this subject since we last spoke and wanted to discuss. I'm having a hard time figuring out how a VMM should handle a new hypercall identity register introduced on a newer kernel. In order to maintain guest ABI, the VMM would need to know about that register and zero it when restoring an older guest. Perhaps instead we could reserve a range of firmware registers as the 'hypercall identity' registers. Implement all of them as RAZ/WI by default, encouraging userspace to zero these registers away for older VMs but still allowing an old userspace to pick up new KVM features. Doing so would align the hypercall identity registers with the feature ID registers from the architecture. Thoughts? -- Thanks, Oliver ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-09-29 18:22 ` Oliver Upton @ 2021-09-30 7:32 ` Marc Zyngier 2021-09-30 17:24 ` Oliver Upton 0 siblings, 1 reply; 25+ messages in thread From: Marc Zyngier @ 2021-09-30 7:32 UTC (permalink / raw) To: Oliver Upton Cc: Andrew Jones, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell Hi Oliver, On Wed, 29 Sep 2021 19:22:05 +0100, Oliver Upton <oupton@google.com> wrote: > > I have some lingering thoughts on this subject since we last spoke and > wanted to discuss. > > I'm having a hard time figuring out how a VMM should handle a new > hypercall identity register introduced on a newer kernel. In order to > maintain guest ABI, the VMM would need to know about that register and > zero it when restoring an older guest. Just as it would need to be able to discover any new system register exposed by default, as it happens at all times. Which is why we have a way to discover all the registers, architected or not. > Perhaps instead we could reserve a range of firmware registers as the > 'hypercall identity' registers. Implement all of them as RAZ/WI by > default, encouraging userspace to zero these registers away for older > VMs but still allowing an old userspace to pick up new KVM features. > Doing so would align the hypercall identity registers with the feature > ID registers from the architecture. The range already exists in the form of the "coprocessor" 0x14. I don't see the need to expose it as RAZ/WI, however. If userspace doesn't know about how this pseudo-register works, it won't be able to program it anyway. I don't buy the parallel with the ID-regs either. They are RAZ/WI by default so that they don't UNDEF at runtime. The meaning of a RAZ id-register is also well defined (feature not implemented), and the CPU cannot write to them. In a way, the ID-regs *are* the enumeration mechanism. Our firmware registers don't follow the same rules. Userspace can write to them, and there is no such "not implemented" rule (case in point, PSCI). We also have a separate enumeration mechanism (GET_ONE_REG), which is (more or less) designed for userspace to find what is implemented. For these reasons, I don't immediately see the point of advertising a set of registers ahead of time, before userspace grows an understanding of what these registers mean. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-09-30 7:32 ` Marc Zyngier @ 2021-09-30 17:24 ` Oliver Upton 2021-10-01 11:43 ` Marc Zyngier 0 siblings, 1 reply; 25+ messages in thread From: Oliver Upton @ 2021-09-30 17:24 UTC (permalink / raw) To: Marc Zyngier Cc: Andrew Jones, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell Hey Marc, On Thu, Sep 30, 2021 at 12:32 AM Marc Zyngier <maz@kernel.org> wrote: > > Hi Oliver, > > On Wed, 29 Sep 2021 19:22:05 +0100, > Oliver Upton <oupton@google.com> wrote: > > > > I have some lingering thoughts on this subject since we last spoke and > > wanted to discuss. > > > > I'm having a hard time figuring out how a VMM should handle a new > > hypercall identity register introduced on a newer kernel. In order to > > maintain guest ABI, the VMM would need to know about that register and > > zero it when restoring an older guest. > > Just as it would need to be able to discover any new system register > exposed by default, as it happens at all times. Which is why we have a > way to discover all the registers, architected or not. > > > Perhaps instead we could reserve a range of firmware registers as the > > 'hypercall identity' registers. Implement all of them as RAZ/WI by > > default, encouraging userspace to zero these registers away for older > > VMs but still allowing an old userspace to pick up new KVM features. > > Doing so would align the hypercall identity registers with the feature > > ID registers from the architecture. > > The range already exists in the form of the "coprocessor" 0x14. I > don't see the need to expose it as RAZ/WI, however. If userspace > doesn't know about how this pseudo-register works, it won't be able to > program it anyway. > > I don't buy the parallel with the ID-regs either. They are RAZ/WI by > default so that they don't UNDEF at runtime. The meaning of a RAZ > id-register is also well defined (feature not implemented), and the > CPU cannot write to them. In a way, the ID-regs *are* the enumeration > mechanism. > > Our firmware registers don't follow the same rules. Userspace can > write to them, and there is no such "not implemented" rule (case in > point, PSCI). We also have a separate enumeration mechanism > (GET_ONE_REG), which is (more or less) designed for userspace to find > what is implemented. > > For these reasons, I don't immediately see the point of advertising a > set of registers ahead of time, before userspace grows an > understanding of what these registers mean. Supposing we don't preallocate some hypercall ID registers, how can we safely migrate a guest from an older kernel to newer kernel? Ideally, we would preserve the hypercall feature set across the migration which could work for a while with the first set of registers that get defined, but whenever a new hypercall firmware register comes along then the VMM will be clueless to the new ABI. Fundamentally, I don't think userspace should need a patch to preserve ABI on a newer kernel. Despite that, it would seem that userspace will need to learn of any firmware registers that control hypercall features which come after the initial set that gets proposed. If KVM_GET_REG_LIST were to disambiguate between ID registers (hypercall, architectural feature ID registers) from other parts of the vCPU state, it would be clear to what registers to zero on a newer kernel. Apologies if it is distracting to mention the feature ID registers here, but both are on my mind currently and want to make sure there is some consistency in how features get handled on newer kernels, architected or not. -- Thanks, Oliver ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-09-30 17:24 ` Oliver Upton @ 2021-10-01 11:43 ` Marc Zyngier 2021-10-01 15:38 ` Oliver Upton 0 siblings, 1 reply; 25+ messages in thread From: Marc Zyngier @ 2021-10-01 11:43 UTC (permalink / raw) To: Oliver Upton Cc: Andrew Jones, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Thu, 30 Sep 2021 18:24:23 +0100, Oliver Upton <oupton@google.com> wrote: > > Hey Marc, > > On Thu, Sep 30, 2021 at 12:32 AM Marc Zyngier <maz@kernel.org> wrote: > > > > Hi Oliver, > > > > On Wed, 29 Sep 2021 19:22:05 +0100, > > Oliver Upton <oupton@google.com> wrote: > > > > > > I have some lingering thoughts on this subject since we last spoke and > > > wanted to discuss. > > > > > > I'm having a hard time figuring out how a VMM should handle a new > > > hypercall identity register introduced on a newer kernel. In order to > > > maintain guest ABI, the VMM would need to know about that register and > > > zero it when restoring an older guest. > > > > Just as it would need to be able to discover any new system register > > exposed by default, as it happens at all times. Which is why we have a > > way to discover all the registers, architected or not. > > > > > Perhaps instead we could reserve a range of firmware registers as the > > > 'hypercall identity' registers. Implement all of them as RAZ/WI by > > > default, encouraging userspace to zero these registers away for older > > > VMs but still allowing an old userspace to pick up new KVM features. > > > Doing so would align the hypercall identity registers with the feature > > > ID registers from the architecture. > > > > The range already exists in the form of the "coprocessor" 0x14. I > > don't see the need to expose it as RAZ/WI, however. If userspace > > doesn't know about how this pseudo-register works, it won't be able to > > program it anyway. > > > > I don't buy the parallel with the ID-regs either. They are RAZ/WI by > > default so that they don't UNDEF at runtime. The meaning of a RAZ > > id-register is also well defined (feature not implemented), and the > > CPU cannot write to them. In a way, the ID-regs *are* the enumeration > > mechanism. > > > > Our firmware registers don't follow the same rules. Userspace can > > write to them, and there is no such "not implemented" rule (case in > > point, PSCI). We also have a separate enumeration mechanism > > (GET_ONE_REG), which is (more or less) designed for userspace to find > > what is implemented. > > > > For these reasons, I don't immediately see the point of advertising a > > set of registers ahead of time, before userspace grows an > > understanding of what these registers mean. > > Supposing we don't preallocate some hypercall ID registers, how can we > safely migrate a guest from an older kernel to newer kernel? Ideally, > we would preserve the hypercall feature set across the migration which > could work for a while with the first set of registers that get > defined, but whenever a new hypercall firmware register comes along > then the VMM will be clueless to the new ABI. My expectations were that whenever userspace writes a set of firmware register, all the defaults are invalidated. For example say that host-A know about a single hypercall register, while host-B knows about two. Upon writing to the first register, the host would clear the set of available services in the second one. If userspace eventually writes there, the value would stick if valid. Also, remember that pseudo-registers don't have to be 64bit. We could define a new class of hypercall-specific registers that would be much wider, and thus have a larger write granularity (KVM supports anything from 8 to 2048 bits). This would make it pretty easy to implement the above. > Fundamentally, I don't think userspace should need a patch to preserve > ABI on a newer kernel. Despite that, it would seem that userspace will > need to learn of any firmware registers that control hypercall > features which come after the initial set that gets proposed. If > KVM_GET_REG_LIST were to disambiguate between ID registers (hypercall, > architectural feature ID registers) from other parts of the vCPU > state, it would be clear to what registers to zero on a newer kernel. > Apologies if it is distracting to mention the feature ID registers > here, but both are on my mind currently and want to make sure there is > some consistency in how features get handled on newer kernels, > architected or not. The problem I see is that we will always need to grow the HC register space one way or another, no matter how many we reserve. Your approach only works if we don't exceed that particular range. Maybe it will never be a problem, but it remains that this is not scalable. If we wanted to be safe, we'd reserve the whole of the possible space as defined by the SMCCC spec. Given that we currently have two HC spaces (the ARM-controlled one, and the KVM-specific one), the function space being 16bits in each case, that's 16kB worth of zeroes that userspace has to save/restore at all times... I'm not overly enthusiastic. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-10-01 11:43 ` Marc Zyngier @ 2021-10-01 15:38 ` Oliver Upton 2022-01-25 3:47 ` Raghavendra Rao Ananta 0 siblings, 1 reply; 25+ messages in thread From: Oliver Upton @ 2021-10-01 15:38 UTC (permalink / raw) To: Marc Zyngier Cc: Andrew Jones, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Fri, Oct 1, 2021 at 4:43 AM Marc Zyngier <maz@kernel.org> wrote: > > On Thu, 30 Sep 2021 18:24:23 +0100, > Oliver Upton <oupton@google.com> wrote: > > > > Hey Marc, > > > > On Thu, Sep 30, 2021 at 12:32 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > > Hi Oliver, > > > > > > On Wed, 29 Sep 2021 19:22:05 +0100, > > > Oliver Upton <oupton@google.com> wrote: > > > > > > > > I have some lingering thoughts on this subject since we last spoke and > > > > wanted to discuss. > > > > > > > > I'm having a hard time figuring out how a VMM should handle a new > > > > hypercall identity register introduced on a newer kernel. In order to > > > > maintain guest ABI, the VMM would need to know about that register and > > > > zero it when restoring an older guest. > > > > > > Just as it would need to be able to discover any new system register > > > exposed by default, as it happens at all times. Which is why we have a > > > way to discover all the registers, architected or not. > > > > > > > Perhaps instead we could reserve a range of firmware registers as the > > > > 'hypercall identity' registers. Implement all of them as RAZ/WI by > > > > default, encouraging userspace to zero these registers away for older > > > > VMs but still allowing an old userspace to pick up new KVM features. > > > > Doing so would align the hypercall identity registers with the feature > > > > ID registers from the architecture. > > > > > > The range already exists in the form of the "coprocessor" 0x14. I > > > don't see the need to expose it as RAZ/WI, however. If userspace > > > doesn't know about how this pseudo-register works, it won't be able to > > > program it anyway. > > > > > > I don't buy the parallel with the ID-regs either. They are RAZ/WI by > > > default so that they don't UNDEF at runtime. The meaning of a RAZ > > > id-register is also well defined (feature not implemented), and the > > > CPU cannot write to them. In a way, the ID-regs *are* the enumeration > > > mechanism. > > > > > > Our firmware registers don't follow the same rules. Userspace can > > > write to them, and there is no such "not implemented" rule (case in > > > point, PSCI). We also have a separate enumeration mechanism > > > (GET_ONE_REG), which is (more or less) designed for userspace to find > > > what is implemented. > > > > > > For these reasons, I don't immediately see the point of advertising a > > > set of registers ahead of time, before userspace grows an > > > understanding of what these registers mean. > > > > Supposing we don't preallocate some hypercall ID registers, how can we > > safely migrate a guest from an older kernel to newer kernel? Ideally, > > we would preserve the hypercall feature set across the migration which > > could work for a while with the first set of registers that get > > defined, but whenever a new hypercall firmware register comes along > > then the VMM will be clueless to the new ABI. > > My expectations were that whenever userspace writes a set of firmware > register, all the defaults are invalidated. For example say that > host-A know about a single hypercall register, while host-B knows > about two. Upon writing to the first register, the host would clear > the set of available services in the second one. If userspace > eventually writes there, the value would stick if valid. > > Also, remember that pseudo-registers don't have to be 64bit. We could > define a new class of hypercall-specific registers that would be much > wider, and thus have a larger write granularity (KVM supports anything > from 8 to 2048 bits). This would make it pretty easy to implement the > above. > > > Fundamentally, I don't think userspace should need a patch to preserve > > ABI on a newer kernel. Despite that, it would seem that userspace will > > need to learn of any firmware registers that control hypercall > > features which come after the initial set that gets proposed. If > > KVM_GET_REG_LIST were to disambiguate between ID registers (hypercall, > > architectural feature ID registers) from other parts of the vCPU > > state, it would be clear to what registers to zero on a newer kernel. > > Apologies if it is distracting to mention the feature ID registers > > here, but both are on my mind currently and want to make sure there is > > some consistency in how features get handled on newer kernels, > > architected or not. > > The problem I see is that we will always need to grow the HC register > space one way or another, no matter how many we reserve. Your approach > only works if we don't exceed that particular range. Maybe it will > never be a problem, but it remains that this is not scalable. > > If we wanted to be safe, we'd reserve the whole of the possible space > as defined by the SMCCC spec. Given that we currently have two HC > spaces (the ARM-controlled one, and the KVM-specific one), the > function space being 16bits in each case, that's 16kB worth of zeroes > that userspace has to save/restore at all times... I'm not overly > enthusiastic. Yeah, that's definitely overkill. I agree with your position; we can solve the issue altogether by enforcing ordering on the hypercall registers. Userspace should read all hypercall registers to discover features, then write them. Given your suggestion, I don't know if there is much need for the guesswork to conclude on an appropriate register size for the hypercall bitmaps. 64 bits seems fine to me.. -- Thanks, Oliver ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-10-01 15:38 ` Oliver Upton @ 2022-01-25 3:47 ` Raghavendra Rao Ananta 2022-01-25 8:45 ` Marc Zyngier 0 siblings, 1 reply; 25+ messages in thread From: Raghavendra Rao Ananta @ 2022-01-25 3:47 UTC (permalink / raw) To: Marc Zyngier, Andrew Jones Cc: kvmarm, pshier, ricarkol, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, alexandru.elisei, suzuki.poulose, Peter Maydell, Sean Christopherson, Oliver Upton Hello all, Based on the recent discussion on the patch '[RFC PATCH v3 01/11] KVM: Capture VM start' [1], Reiji, I (and others in the team) were wondering, since the hypercall feature-map is technically per-VM and not per-vCPU, would it make more sense to present it as a kvm_device, rather than vCPU psuedo-registers to the userspace? If I understand correctly, the original motivation for going with pseudo-registers was to comply with QEMU, which uses KVM_GET_REG_LIST and KVM_[GET|SET]_ONE_REG interface, but I'm guessing the VMMs doing save/restore across migration might write the same values for every vCPU. Granted that we would be diverging from the existing implementation (psci versioning and spectre WA registers), but this can be a cleaner way forward for extending hypercall support. The kvm_device interface can be made backward compatible with the way hypercalls are exposed today, it can have the same registers/features discovery mechanisms that we discussed above, and majorly the userspace has to configure it only once per-VM. We can probably make the feature selection immutable just before any vCPU is created. Please let me know your thoughts or any disadvantages that I'm overlooking. Thanks, Raghavendra [1]: https://lore.kernel.org/kvmarm/CAJHc60zhRyOad7AqtEFn-Ptro5BGVkfpB2wXWGw5EZMxOHUc=w@mail.gmail.com/ On Fri, Oct 1, 2021 at 8:38 AM Oliver Upton <oupton@google.com> wrote: > > On Fri, Oct 1, 2021 at 4:43 AM Marc Zyngier <maz@kernel.org> wrote: > > > > On Thu, 30 Sep 2021 18:24:23 +0100, > > Oliver Upton <oupton@google.com> wrote: > > > > > > Hey Marc, > > > > > > On Thu, Sep 30, 2021 at 12:32 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > > > > Hi Oliver, > > > > > > > > On Wed, 29 Sep 2021 19:22:05 +0100, > > > > Oliver Upton <oupton@google.com> wrote: > > > > > > > > > > I have some lingering thoughts on this subject since we last spoke and > > > > > wanted to discuss. > > > > > > > > > > I'm having a hard time figuring out how a VMM should handle a new > > > > > hypercall identity register introduced on a newer kernel. In order to > > > > > maintain guest ABI, the VMM would need to know about that register and > > > > > zero it when restoring an older guest. > > > > > > > > Just as it would need to be able to discover any new system register > > > > exposed by default, as it happens at all times. Which is why we have a > > > > way to discover all the registers, architected or not. > > > > > > > > > Perhaps instead we could reserve a range of firmware registers as the > > > > > 'hypercall identity' registers. Implement all of them as RAZ/WI by > > > > > default, encouraging userspace to zero these registers away for older > > > > > VMs but still allowing an old userspace to pick up new KVM features. > > > > > Doing so would align the hypercall identity registers with the feature > > > > > ID registers from the architecture. > > > > > > > > The range already exists in the form of the "coprocessor" 0x14. I > > > > don't see the need to expose it as RAZ/WI, however. If userspace > > > > doesn't know about how this pseudo-register works, it won't be able to > > > > program it anyway. > > > > > > > > I don't buy the parallel with the ID-regs either. They are RAZ/WI by > > > > default so that they don't UNDEF at runtime. The meaning of a RAZ > > > > id-register is also well defined (feature not implemented), and the > > > > CPU cannot write to them. In a way, the ID-regs *are* the enumeration > > > > mechanism. > > > > > > > > Our firmware registers don't follow the same rules. Userspace can > > > > write to them, and there is no such "not implemented" rule (case in > > > > point, PSCI). We also have a separate enumeration mechanism > > > > (GET_ONE_REG), which is (more or less) designed for userspace to find > > > > what is implemented. > > > > > > > > For these reasons, I don't immediately see the point of advertising a > > > > set of registers ahead of time, before userspace grows an > > > > understanding of what these registers mean. > > > > > > Supposing we don't preallocate some hypercall ID registers, how can we > > > safely migrate a guest from an older kernel to newer kernel? Ideally, > > > we would preserve the hypercall feature set across the migration which > > > could work for a while with the first set of registers that get > > > defined, but whenever a new hypercall firmware register comes along > > > then the VMM will be clueless to the new ABI. > > > > My expectations were that whenever userspace writes a set of firmware > > register, all the defaults are invalidated. For example say that > > host-A know about a single hypercall register, while host-B knows > > about two. Upon writing to the first register, the host would clear > > the set of available services in the second one. If userspace > > eventually writes there, the value would stick if valid. > > > > Also, remember that pseudo-registers don't have to be 64bit. We could > > define a new class of hypercall-specific registers that would be much > > wider, and thus have a larger write granularity (KVM supports anything > > from 8 to 2048 bits). This would make it pretty easy to implement the > > above. > > > > > Fundamentally, I don't think userspace should need a patch to preserve > > > ABI on a newer kernel. Despite that, it would seem that userspace will > > > need to learn of any firmware registers that control hypercall > > > features which come after the initial set that gets proposed. If > > > KVM_GET_REG_LIST were to disambiguate between ID registers (hypercall, > > > architectural feature ID registers) from other parts of the vCPU > > > state, it would be clear to what registers to zero on a newer kernel. > > > Apologies if it is distracting to mention the feature ID registers > > > here, but both are on my mind currently and want to make sure there is > > > some consistency in how features get handled on newer kernels, > > > architected or not. > > > > The problem I see is that we will always need to grow the HC register > > space one way or another, no matter how many we reserve. Your approach > > only works if we don't exceed that particular range. Maybe it will > > never be a problem, but it remains that this is not scalable. > > > > If we wanted to be safe, we'd reserve the whole of the possible space > > as defined by the SMCCC spec. Given that we currently have two HC > > spaces (the ARM-controlled one, and the KVM-specific one), the > > function space being 16bits in each case, that's 16kB worth of zeroes > > that userspace has to save/restore at all times... I'm not overly > > enthusiastic. > > Yeah, that's definitely overkill. > > I agree with your position; we can solve the issue altogether by > enforcing ordering on the hypercall registers. Userspace should read > all hypercall registers to discover features, then write them. Given > your suggestion, I don't know if there is much need for the guesswork > to conclude on an appropriate register size for the hypercall bitmaps. > 64 bits seems fine to me.. > > -- > Thanks, > Oliver ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2022-01-25 3:47 ` Raghavendra Rao Ananta @ 2022-01-25 8:45 ` Marc Zyngier 2022-01-25 17:29 ` Oliver Upton 0 siblings, 1 reply; 25+ messages in thread From: Marc Zyngier @ 2022-01-25 8:45 UTC (permalink / raw) To: Raghavendra Rao Ananta Cc: Andrew Jones, kvmarm, pshier, ricarkol, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, alexandru.elisei, suzuki.poulose, Peter Maydell, Sean Christopherson, Oliver Upton On Tue, 25 Jan 2022 03:47:04 +0000, Raghavendra Rao Ananta <rananta@google.com> wrote: > > Hello all, > > Based on the recent discussion on the patch '[RFC PATCH v3 01/11] KVM: > Capture VM start' [1], Reiji, I (and others in the team) were > wondering, since the hypercall feature-map is technically per-VM and > not per-vCPU, would it make more sense to present it as a kvm_device, > rather than vCPU psuedo-registers to the userspace? I really don't think so. > If I understand correctly, the original motivation for going with > pseudo-registers was to comply with QEMU, which uses KVM_GET_REG_LIST > and KVM_[GET|SET]_ONE_REG interface, but I'm guessing the VMMs doing > save/restore across migration might write the same values for every > vCPU. KVM currently restricts the vcpu features to be unified across vcpus, but that's only an implementation choice. The ARM architecture doesn't mandate that these registers are all the same, and it isn't impossible that we'd allow for the feature set to become per-vcpu at some point in time. So this argument doesn't really hold. Furthermore, compatibility with QEMU's save/restore model is essential, and AFAICT, there is no open source alternative. > Granted that we would be diverging from the existing implementation > (psci versioning and spectre WA registers), but this can be a cleaner > way forward for extending hypercall support. Cleaner? Why? How? You'd have the exact same constraints, plus the disadvantages of having to maintain a new interface concurrently with the existing ones. > The kvm_device interface > can be made backward compatible with the way hypercalls are exposed > today, it can have the same registers/features discovery mechanisms > that we discussed above, and majorly the userspace has to configure it > only once per-VM. We can probably make the feature selection immutable > just before any vCPU is created. What is the problem with immutability on first run? Or even with a 'finalize' ioctl (we already have one)? > > Please let me know your thoughts or any disadvantages that I'm overlooking. A device means yet another configuration and migration API. Don't you think we have enough of those? The complexity of KVM/arm64 userspace API is already insane, and extremely fragile. Adding to it will be a validation nightmare (it already is, and I don't see anyone actively helping with it). So no, I have no plan to consider anything but the GET/SET_ONE_REG interface, and until someone writes another open source VMM that has the same save/restore capabilities and proves that this interface isn't fit for purpose, I see no incentive in deviating from a model that is known to work. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2022-01-25 8:45 ` Marc Zyngier @ 2022-01-25 17:29 ` Oliver Upton 2022-02-08 9:46 ` Marc Zyngier 0 siblings, 1 reply; 25+ messages in thread From: Oliver Upton @ 2022-01-25 17:29 UTC (permalink / raw) To: Marc Zyngier Cc: Raghavendra Rao Ananta, Andrew Jones, kvmarm, pshier, ricarkol, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell, Sean Christopherson Hi Marc, On Tue, Jan 25, 2022 at 12:46 AM Marc Zyngier <maz@kernel.org> wrote: > > If I understand correctly, the original motivation for going with > > pseudo-registers was to comply with QEMU, which uses KVM_GET_REG_LIST > > and KVM_[GET|SET]_ONE_REG interface, but I'm guessing the VMMs doing > > save/restore across migration might write the same values for every > > vCPU. > > KVM currently restricts the vcpu features to be unified across vcpus, > but that's only an implementation choice. But that implementation choice has become ABI, no? How could support for asymmetry be added without requiring userspace opt-in or breaking existing VMMs that depend on feature unification? > The ARM architecture doesn't > mandate that these registers are all the same, and it isn't impossible > that we'd allow for the feature set to become per-vcpu at some point > in time. So this argument doesn't really hold. Accessing per-VM state N times is bound to increase VM blackout time during migrations ~linearly as the number of vCPUs in a VM increases, since a VM scoped lock is necessary to serialize guest accesses. It could be tolerable at present scale, but seems like in the future it could become a real problem. > Furthermore, compatibility with QEMU's save/restore model is > essential, and AFAICT, there is no open source alternative. Agree fundamentally, but I believe it is entirely reasonable to require a userspace change to adopt a new KVM feature. Otherwise, we may be trying to shoehorn new features into existing UAPI that may not be a precise fit.. In order to cure the serialization mentioned above, two options are top of mind: accessing the VM state with the VM FD or informing userspace that a set of registers need only be written once for an entire VM. If we add support for asymmetry later down the road, that would become an opt-in such that userspace will do the access per-vCPU. > A device means yet another configuration and migration API. Don't you > think we have enough of those? The complexity of KVM/arm64 userspace > API is already insane, and extremely fragile. Adding to it will be a > validation nightmare (it already is, and I don't see anyone actively > helping with it). It seems equally fragile to introduce VM-wide serialization to vCPU UAPI that we know is in the live migration critical path for _any_ VMM. Without requiring userspace changes for all the new widgets under discussion we're effectively forcing VMMs to do something suboptimal. -- Thanks, Oliver ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2022-01-25 17:29 ` Oliver Upton @ 2022-02-08 9:46 ` Marc Zyngier 2022-02-08 9:56 ` Oliver Upton 0 siblings, 1 reply; 25+ messages in thread From: Marc Zyngier @ 2022-02-08 9:46 UTC (permalink / raw) To: Oliver Upton Cc: Raghavendra Rao Ananta, Andrew Jones, kvmarm, pshier, ricarkol, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell, Sean Christopherson Huh, somewhat missed that email, apologies for the delay. On Tue, 25 Jan 2022 17:29:13 +0000, Oliver Upton <oupton@google.com> wrote: > > Hi Marc, > > On Tue, Jan 25, 2022 at 12:46 AM Marc Zyngier <maz@kernel.org> wrote: > > > If I understand correctly, the original motivation for going with > > > pseudo-registers was to comply with QEMU, which uses KVM_GET_REG_LIST > > > and KVM_[GET|SET]_ONE_REG interface, but I'm guessing the VMMs doing > > > save/restore across migration might write the same values for every > > > vCPU. > > > > KVM currently restricts the vcpu features to be unified across vcpus, > > but that's only an implementation choice. > > But that implementation choice has become ABI, no? How could support > for asymmetry be added without requiring userspace opt-in or breaking > existing VMMs that depend on feature unification? Of course, you'd need some sort of advertising of this new behaviour. One thing I would like to add to the current state of thing is an indication of whether the effects of a sysreg being written from userspace are global or local to a vcpu. You'd need a new capability, and an extra flag added to the encoding of each register. > > > The ARM architecture doesn't > > mandate that these registers are all the same, and it isn't impossible > > that we'd allow for the feature set to become per-vcpu at some point > > in time. So this argument doesn't really hold. > > Accessing per-VM state N times is bound to increase VM blackout time > during migrations ~linearly as the number of vCPUs in a VM increases, > since a VM scoped lock is necessary to serialize guest accesses. It > could be tolerable at present scale, but seems like in the future it > could become a real problem. I don't disagree. But I doubt switching to a different API altogether is the solution to this. > > > Furthermore, compatibility with QEMU's save/restore model is > > essential, and AFAICT, there is no open source alternative. > > Agree fundamentally, but I believe it is entirely reasonable to > require a userspace change to adopt a new KVM feature. Otherwise, we > may be trying to shoehorn new features into existing UAPI that may not > be a precise fit.. But the very purpose of this API is to support discoverability. If we can't support it, then we might as well declare the whole API deprecated and restart from scratch. No, I'm not seriously suggesting this :-/. > In order to cure the serialization mentioned above, two options are > top of mind: accessing the VM state with the VM FD or informing > userspace that a set of registers need only be written once for an > entire VM. If we add support for asymmetry later down the road, that > would become an opt-in such that userspace will do the access > per-vCPU. It is the latter that I'm suggesting. > > > A device means yet another configuration and migration API. Don't you > > think we have enough of those? The complexity of KVM/arm64 userspace > > API is already insane, and extremely fragile. Adding to it will be a > > validation nightmare (it already is, and I don't see anyone actively > > helping with it). > > It seems equally fragile to introduce VM-wide serialization to vCPU > UAPI that we know is in the live migration critical path for _any_ > VMM. Without requiring userspace changes for all the new widgets under > discussion we're effectively forcing VMMs to do something suboptimal. I'm perfectly happy with suboptimal to start with, and find ways to make it better once we have a clear path forward. I just don't want to conflate the two. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2022-02-08 9:46 ` Marc Zyngier @ 2022-02-08 9:56 ` Oliver Upton 2022-02-08 16:58 ` Sean Christopherson 0 siblings, 1 reply; 25+ messages in thread From: Oliver Upton @ 2022-02-08 9:56 UTC (permalink / raw) To: Marc Zyngier Cc: Raghavendra Rao Ananta, Andrew Jones, kvmarm, pshier, ricarkol, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell, Sean Christopherson Hi Marc, On Tue, Feb 8, 2022 at 1:46 AM Marc Zyngier <maz@kernel.org> wrote: > > > KVM currently restricts the vcpu features to be unified across vcpus, > > > but that's only an implementation choice. > > > > But that implementation choice has become ABI, no? How could support > > for asymmetry be added without requiring userspace opt-in or breaking > > existing VMMs that depend on feature unification? > > Of course, you'd need some sort of advertising of this new behaviour. > > One thing I would like to add to the current state of thing is an > indication of whether the effects of a sysreg being written from > userspace are global or local to a vcpu. You'd need a new capability, > and an extra flag added to the encoding of each register. Ah. I think that is a much more reasonable fit then. VMMs unaware of this can continue to migrate new bits (albeit at the cost of potentially higher lock contention for the per-VM stuff), and those that do can reap the benefits of writing such attributes exactly once. [...] > > > A device means yet another configuration and migration API. Don't you > > > think we have enough of those? The complexity of KVM/arm64 userspace > > > API is already insane, and extremely fragile. Adding to it will be a > > > validation nightmare (it already is, and I don't see anyone actively > > > helping with it). > > > > It seems equally fragile to introduce VM-wide serialization to vCPU > > UAPI that we know is in the live migration critical path for _any_ > > VMM. Without requiring userspace changes for all the new widgets under > > discussion we're effectively forcing VMMs to do something suboptimal. > > I'm perfectly happy with suboptimal to start with, and find ways to > make it better once we have a clear path forward. I just don't want to > conflate the two. Fair. My initial concern was that it didn't seem as though there was much room for growth/improvement with the one reg UAPI, but your suggestion definitely provides a ramp out to handle VM state once per VM. Thanks for following up :) -- Oliver ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2022-02-08 9:56 ` Oliver Upton @ 2022-02-08 16:58 ` Sean Christopherson 2022-02-08 17:48 ` Marc Zyngier 0 siblings, 1 reply; 25+ messages in thread From: Sean Christopherson @ 2022-02-08 16:58 UTC (permalink / raw) To: Oliver Upton Cc: Marc Zyngier, Raghavendra Rao Ananta, Andrew Jones, kvmarm, pshier, ricarkol, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Tue, Feb 08, 2022, Oliver Upton wrote: > Hi Marc, > > On Tue, Feb 8, 2022 at 1:46 AM Marc Zyngier <maz@kernel.org> wrote: > > > > KVM currently restricts the vcpu features to be unified across vcpus, > > > > but that's only an implementation choice. > > > > > > But that implementation choice has become ABI, no? How could support > > > for asymmetry be added without requiring userspace opt-in or breaking > > > existing VMMs that depend on feature unification? > > > > Of course, you'd need some sort of advertising of this new behaviour. > > > > One thing I would like to add to the current state of thing is an > > indication of whether the effects of a sysreg being written from > > userspace are global or local to a vcpu. You'd need a new capability, > > and an extra flag added to the encoding of each register. > > Ah. I think that is a much more reasonable fit then. VMMs unaware of > this can continue to migrate new bits (albeit at the cost of > potentially higher lock contention for the per-VM stuff), and those > that do can reap the benefits of writing such attributes exactly once. But the "proper" usage is no different than adding support for VM-scoped variants of KVM_{G,S}ET_ONE_REG and friends, and a VM-scoped variant is conceptually a lot cleaner IMO. And making them truly VM-scoped means KVM can do things like support sysregs that are immutable after vCPUs are created. So long as KVM defaults to '0' for all such registers, lack of migration support in userspace that isn't aware of the new API, i.e. doesn't do KVM_GET_REG_LIST at a VM-scope, is a nop because said userspace also won't modify the registers in the first place. The only "unsolvable" problem that is avoided by usurping the per-vCPU ioctls is rollback to a userspace VMM that isn't aware of the per-VM ioctls, but it doesn't seem too onerous to tell userspace "don't use these unless your entire fleet has upgraded", especially since that requirement/advisement is true for the KVM side with respect to new registers regardless of how those registers are accessed. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2022-02-08 16:58 ` Sean Christopherson @ 2022-02-08 17:48 ` Marc Zyngier 0 siblings, 0 replies; 25+ messages in thread From: Marc Zyngier @ 2022-02-08 17:48 UTC (permalink / raw) To: Sean Christopherson Cc: Oliver Upton, Raghavendra Rao Ananta, Andrew Jones, kvmarm, pshier, ricarkol, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Tue, 08 Feb 2022 16:58:26 +0000, Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Feb 08, 2022, Oliver Upton wrote: > > Hi Marc, > > > > On Tue, Feb 8, 2022 at 1:46 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > KVM currently restricts the vcpu features to be unified across vcpus, > > > > > but that's only an implementation choice. > > > > > > > > But that implementation choice has become ABI, no? How could support > > > > for asymmetry be added without requiring userspace opt-in or breaking > > > > existing VMMs that depend on feature unification? > > > > > > Of course, you'd need some sort of advertising of this new behaviour. > > > > > > One thing I would like to add to the current state of thing is an > > > indication of whether the effects of a sysreg being written from > > > userspace are global or local to a vcpu. You'd need a new capability, > > > and an extra flag added to the encoding of each register. > > > > Ah. I think that is a much more reasonable fit then. VMMs unaware of > > this can continue to migrate new bits (albeit at the cost of > > potentially higher lock contention for the per-VM stuff), and those > > that do can reap the benefits of writing such attributes exactly once. > > But the "proper" usage is no different than adding support for > VM-scoped variants of KVM_{G,S}ET_ONE_REG and friends, and a > VM-scoped variant is conceptually a lot cleaner IMO. And making > them truly VM-scoped means KVM can do things like support sysregs > that are immutable after vCPUs are created. It is different, because your approach requires you to update all the existing VMMs to find out which register is of which kind. Not to mention that global sysregs are an absolute oddity in the face of the architecture (there is none in the base architecture). > So long as KVM defaults to '0' for all such registers, lack of > migration support in userspace that isn't aware of the new API, > i.e. doesn't do KVM_GET_REG_LIST at a VM-scope, is a nop because > said userspace also won't modify the registers in the first place. We want any VMM that correctly uses the API today to seamlessly be able to save/restore any new feature. QEMU does that, and it should continue to work no matter what new feature we add to the list. M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-08-25 18:14 ` Oliver Upton 2021-08-26 8:37 ` Marc Zyngier @ 2021-08-26 8:49 ` Andrew Jones 2021-08-26 8:54 ` Andrew Jones 1 sibling, 1 reply; 25+ messages in thread From: Andrew Jones @ 2021-08-26 8:49 UTC (permalink / raw) To: Oliver Upton Cc: Marc Zyngier, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Wed, Aug 25, 2021 at 11:14:59AM -0700, Oliver Upton wrote: > On Wed, Aug 25, 2021 at 8:07 AM Andrew Jones <drjones@redhat.com> wrote: > > > > On Wed, Aug 25, 2021 at 11:39:34AM +0100, Marc Zyngier wrote: > > > On Wed, 25 Aug 2021 11:02:28 +0100, > > > Oliver Upton <oupton@google.com> wrote: > > > > > > > > On Wed, Aug 25, 2021 at 2:27 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > > Exposing new hypercalls to guests in this manner seems very unsafe to > > > > > > me. Suppose an operator is trying to upgrade from kernel N to kernel > > > > > > N+1, which brings in the new 'widget' hypercall. Guests are live > > > > > > migrated onto the N+1 kernel, but the operator finds a defect that > > > > > > warrants a kernel rollback. VMs are then migrated from kernel N+1 -> N. > > > > > > Any guests that discovered the 'widget' hypercall are likely going to > > > > > > get fussy _very_ quickly on the old kernel. > > > > > > > > > > This goes against what we decided to support for the *only* publicly > > > > > available VMM that cares about save/restore, which is that we only > > > > > move forward and don't rollback. > > > > > > > > Ah, I was definitely missing this context. Current behavior makes much > > > > more sense then. > > > > > > > > > Hypercalls are the least of your > > > > > worries, and there is a whole range of other architectural features > > > > > that will have also appeared/disappeared (your own CNTPOFF series is a > > > > > glaring example of this). > > > > > > > > Isn't that a tad bit different though? I'll admit, I'm just as guilty > > > > with my own series forgetting to add a KVM_CAP (oops), but it is in my > > > > queue to kick out with the fix for nVHE/ptimer. Nonetheless, if a user > > > > takes up a new KVM UAPI, it is up to the user to run on a new kernel. > > > > > > The two are linked. Exposing a new register to userspace and/or guest > > > result in the same thing: you can't rollback. That's specially true in > > > the QEMU case, which *learns* from the kernel what registers are > > > available, and doesn't maintain a fixed list. > > > > > > > My concerns are explicitly with the 'under the nose' changes, where > > > > KVM modifies the guest feature set without userspace opting in. Based > > > > on your comment, though, it would appear that other parts of KVM are > > > > affected too. > > > > > > Any new system register that is exposed by a new kernel feature breaks > > > rollback. And so far, we only consider it a bug if the set of exposed > > > registers reduces. Anything can be added safely (as checked by one of > > > the selftests added by Drew). > > > > > > < It doesn't have to be rollback safety, either. There may > > > > simply be a hypercall which an operator doesn't want to give its > > > > guests, and it needs a way to tell KVM to hide it. > > > > > > Fair enough. But this has to be done in a scalable way, which > > > individual capability cannot provide. > > > > > > > > > Have I missed something blatantly obvious, or do others see this as an > > > > > > issue as well? I'll reply with an example of adding opt-out for PTP. > > > > > > I'm sure other hypercalls could be handled similarly. > > > > > > > > > > Why do we need this? For future hypercalls, we could have some buy-in > > > > > capabilities. For existing ones, it is too late, and negative features > > > > > are just too horrible. > > > > > > > > Oh, agreed on the nastiness. Lazy hack to realize the intended > > > > functional change.. > > > > > > Well, you definitely achieved your goal of attracting my attention :). > > > > > > > > For KVM-specific hypercalls, we could get the VMM to save/restore the > > > > > bitmap of supported functions. That would be "less horrible". This > > > > > could be implemented using extra "firmware pseudo-registers" such as > > > > > the ones described in Documentation/virt/kvm/arm/psci.rst. > > > > > > > > This seems more reasonable, especially since we do this for migrating > > > > the guest's PSCI version. > > > > > > > > Alternatively, I had thought about using a VM attribute, given the > > > > fact that it is non-architectural information and we avoid ABI issues > > > > in KVM_GET_REG_LIST without buy-in through a KVM_CAP. > > > > > > The whole point is that these settings get exposed by > > > KVM_GET_REG_LIST, as this is QEMU's way to dump a VM state. Given that > > > we already have this for things like the spectre management state, we > > > can just as well expose the bitmaps that deal with the KVM-specific > > > hypercalls. After all, this falls into the realm of "KVM as VM > > > firmware". > > > > > > For ARM-architected hypercalls (TRNG, stolen time), we may need a > > > similar extension. > > > > > > > Thanks for including me Marc. I think you've mentioned all the examples > > of why we don't generally expect N+1 -> N migrations to work that I > > can think of. While some of the examples like get-reg-list could > > eventually be eliminated if we had CPU models to tighten our machine type > > state, I think N+1 -> N migrations will always be best effort at most. > > > > I agree with giving userspace control over the exposer of the hypercalls > > though. Using pseudo-registers for that purpose rather than a pile of > > CAPs also seems reasonable to me. > > > > And, while I don't think this patch is going to proceed, I thought I'd > > point out that the opt-out approach doesn't help much with expanding > > our migration support unless we require the VMM to be upgraded first. > > > > And, even then, the (N_kern, N+1_vmm) -> (N+1_kern, N_vmm) case won't > > work as expected, since the source enforce opt-out, but the destination > > won't. > > Right, there's going to need to be a fence in both kernel and VMM > versions. Before the fence, you can't rollback with either component. > Once on the other side of the fence, the user may freely migrate > between kernel + VMM combinations. > > > Also, since the VMM doesn't key off the kernel version, for the > > most part N+1 VMMs won't know when they're supposed to opt-out or not, > > leaving it to the user to ensure they consider everything. opt-in > > usually only needs the user to consider what machine type they want to > > launch. > > Going the register route will implicitly require opt-out for all old > hypercalls. We exposed them unconditionally to the guest before, and > we must uphold that behavior. The default value for the bitmap will > have those features set. Any hypercalls added after that register > interface will then require explicit opt-in from userspace. > > With regards to the pseudoregister interface, how would a VMM discover > new bits? From my perspective, you need to have two bitmaps that the > VMM can get at: the set of supported feature bits and the active > bitmap of features for a running guest. > I think we should model the pseudo-register approach off of x86's CPUID approach. x86 has specific get/set ioctls for CPUIDs (KVM_GET/SET_CPUID2), but I think we should get by with just get/set-one-reg. However, it might be nice/necessary to have something like x86's KVM_GET_SUPPORTED_CPUID which returns all the registers at once as a bitmap and the set bits would inform userspace of what's supported by the hardware and KVM. So a new ioctl similar to KVM_GET_SUPPORTED_CPUID would be your first bitmap that shows what's supported and then userpace can determine what it wants to change and calculate the appropriate pseudo-registers to set/clear bits on with set-one-reg, as x86 would do with kvm-set-cpuid2. Thanks, drew ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-08-26 8:49 ` Andrew Jones @ 2021-08-26 8:54 ` Andrew Jones 2021-08-26 9:43 ` Marc Zyngier 0 siblings, 1 reply; 25+ messages in thread From: Andrew Jones @ 2021-08-26 8:54 UTC (permalink / raw) To: Oliver Upton Cc: Marc Zyngier, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Thu, Aug 26, 2021 at 10:49:45AM +0200, Andrew Jones wrote: > On Wed, Aug 25, 2021 at 11:14:59AM -0700, Oliver Upton wrote: > > On Wed, Aug 25, 2021 at 8:07 AM Andrew Jones <drjones@redhat.com> wrote: > > > > > > On Wed, Aug 25, 2021 at 11:39:34AM +0100, Marc Zyngier wrote: > > > > On Wed, 25 Aug 2021 11:02:28 +0100, > > > > Oliver Upton <oupton@google.com> wrote: > > > > > > > > > > On Wed, Aug 25, 2021 at 2:27 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > > > Exposing new hypercalls to guests in this manner seems very unsafe to > > > > > > > me. Suppose an operator is trying to upgrade from kernel N to kernel > > > > > > > N+1, which brings in the new 'widget' hypercall. Guests are live > > > > > > > migrated onto the N+1 kernel, but the operator finds a defect that > > > > > > > warrants a kernel rollback. VMs are then migrated from kernel N+1 -> N. > > > > > > > Any guests that discovered the 'widget' hypercall are likely going to > > > > > > > get fussy _very_ quickly on the old kernel. > > > > > > > > > > > > This goes against what we decided to support for the *only* publicly > > > > > > available VMM that cares about save/restore, which is that we only > > > > > > move forward and don't rollback. > > > > > > > > > > Ah, I was definitely missing this context. Current behavior makes much > > > > > more sense then. > > > > > > > > > > > Hypercalls are the least of your > > > > > > worries, and there is a whole range of other architectural features > > > > > > that will have also appeared/disappeared (your own CNTPOFF series is a > > > > > > glaring example of this). > > > > > > > > > > Isn't that a tad bit different though? I'll admit, I'm just as guilty > > > > > with my own series forgetting to add a KVM_CAP (oops), but it is in my > > > > > queue to kick out with the fix for nVHE/ptimer. Nonetheless, if a user > > > > > takes up a new KVM UAPI, it is up to the user to run on a new kernel. > > > > > > > > The two are linked. Exposing a new register to userspace and/or guest > > > > result in the same thing: you can't rollback. That's specially true in > > > > the QEMU case, which *learns* from the kernel what registers are > > > > available, and doesn't maintain a fixed list. > > > > > > > > > My concerns are explicitly with the 'under the nose' changes, where > > > > > KVM modifies the guest feature set without userspace opting in. Based > > > > > on your comment, though, it would appear that other parts of KVM are > > > > > affected too. > > > > > > > > Any new system register that is exposed by a new kernel feature breaks > > > > rollback. And so far, we only consider it a bug if the set of exposed > > > > registers reduces. Anything can be added safely (as checked by one of > > > > the selftests added by Drew). > > > > > > > > < It doesn't have to be rollback safety, either. There may > > > > > simply be a hypercall which an operator doesn't want to give its > > > > > guests, and it needs a way to tell KVM to hide it. > > > > > > > > Fair enough. But this has to be done in a scalable way, which > > > > individual capability cannot provide. > > > > > > > > > > > Have I missed something blatantly obvious, or do others see this as an > > > > > > > issue as well? I'll reply with an example of adding opt-out for PTP. > > > > > > > I'm sure other hypercalls could be handled similarly. > > > > > > > > > > > > Why do we need this? For future hypercalls, we could have some buy-in > > > > > > capabilities. For existing ones, it is too late, and negative features > > > > > > are just too horrible. > > > > > > > > > > Oh, agreed on the nastiness. Lazy hack to realize the intended > > > > > functional change.. > > > > > > > > Well, you definitely achieved your goal of attracting my attention :). > > > > > > > > > > For KVM-specific hypercalls, we could get the VMM to save/restore the > > > > > > bitmap of supported functions. That would be "less horrible". This > > > > > > could be implemented using extra "firmware pseudo-registers" such as > > > > > > the ones described in Documentation/virt/kvm/arm/psci.rst. > > > > > > > > > > This seems more reasonable, especially since we do this for migrating > > > > > the guest's PSCI version. > > > > > > > > > > Alternatively, I had thought about using a VM attribute, given the > > > > > fact that it is non-architectural information and we avoid ABI issues > > > > > in KVM_GET_REG_LIST without buy-in through a KVM_CAP. > > > > > > > > The whole point is that these settings get exposed by > > > > KVM_GET_REG_LIST, as this is QEMU's way to dump a VM state. Given that > > > > we already have this for things like the spectre management state, we > > > > can just as well expose the bitmaps that deal with the KVM-specific > > > > hypercalls. After all, this falls into the realm of "KVM as VM > > > > firmware". > > > > > > > > For ARM-architected hypercalls (TRNG, stolen time), we may need a > > > > similar extension. > > > > > > > > > > Thanks for including me Marc. I think you've mentioned all the examples > > > of why we don't generally expect N+1 -> N migrations to work that I > > > can think of. While some of the examples like get-reg-list could > > > eventually be eliminated if we had CPU models to tighten our machine type > > > state, I think N+1 -> N migrations will always be best effort at most. > > > > > > I agree with giving userspace control over the exposer of the hypercalls > > > though. Using pseudo-registers for that purpose rather than a pile of > > > CAPs also seems reasonable to me. > > > > > > And, while I don't think this patch is going to proceed, I thought I'd > > > point out that the opt-out approach doesn't help much with expanding > > > our migration support unless we require the VMM to be upgraded first. > > > > > > And, even then, the (N_kern, N+1_vmm) -> (N+1_kern, N_vmm) case won't > > > work as expected, since the source enforce opt-out, but the destination > > > won't. > > > > Right, there's going to need to be a fence in both kernel and VMM > > versions. Before the fence, you can't rollback with either component. > > Once on the other side of the fence, the user may freely migrate > > between kernel + VMM combinations. > > > > > Also, since the VMM doesn't key off the kernel version, for the > > > most part N+1 VMMs won't know when they're supposed to opt-out or not, > > > leaving it to the user to ensure they consider everything. opt-in > > > usually only needs the user to consider what machine type they want to > > > launch. > > > > Going the register route will implicitly require opt-out for all old > > hypercalls. We exposed them unconditionally to the guest before, and > > we must uphold that behavior. The default value for the bitmap will > > have those features set. Any hypercalls added after that register > > interface will then require explicit opt-in from userspace. > > > > With regards to the pseudoregister interface, how would a VMM discover > > new bits? From my perspective, you need to have two bitmaps that the > > VMM can get at: the set of supported feature bits and the active > > bitmap of features for a running guest. > > > > I think we should model the pseudo-register approach off of x86's > CPUID approach. x86 has specific get/set ioctls for CPUIDs > (KVM_GET/SET_CPUID2), but I think we should get by with just > get/set-one-reg. However, it might be nice/necessary to have something > like x86's KVM_GET_SUPPORTED_CPUID which returns all the registers > at once as a bitmap and the set bits would inform userspace of what's > supported by the hardware and KVM. > > So a new ioctl similar to KVM_GET_SUPPORTED_CPUID would be your > first bitmap that shows what's supported and then userpace can > determine what it wants to change and calculate the appropriate > pseudo-registers to set/clear bits on with set-one-reg, as x86 > would do with kvm-set-cpuid2. > I see Marc just replied stating we'll probably just have a single register. The KVM_GET_SUPPORTED_CPUID type of ioctl would be overkill in that case (get-one_reg is enough). Anyway it can always be added later if we expand into more registers. Thanks, drew ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: KVM/arm64: Guest ABI changes do not appear rollback-safe 2021-08-26 8:54 ` Andrew Jones @ 2021-08-26 9:43 ` Marc Zyngier 0 siblings, 0 replies; 25+ messages in thread From: Marc Zyngier @ 2021-08-26 9:43 UTC (permalink / raw) To: Andrew Jones Cc: Oliver Upton, kvmarm, pshier, ricarkol, rananta, reijiw, jingzhangos, kvm, linux-arm-kernel, james.morse, Alexandru.Elisei, suzuki.poulose, Peter Maydell On Thu, 26 Aug 2021 09:54:29 +0100, Andrew Jones <drjones@redhat.com> wrote: [...] > I see Marc just replied stating we'll probably just have a single > register. The KVM_GET_SUPPORTED_CPUID type of ioctl would be > overkill in that case (get-one_reg is enough). Anyway it can > always be added later if we expand into more registers. Here's what I propose for the KVM-specific hypercalls. Compile tested only, plenty of nits to sort out. But you'll get the general idea. Something similar could be implemented for ARM-architected hypercalls such as TRNG. M. diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index f8be56d5342b..3fcca7cc9668 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -136,6 +136,9 @@ struct kvm_arch { /* Memory Tagging Extension enabled for the guest */ bool mte_enabled; + + /* Bitmap of enabled hypercalls */ + unsigned long hc_bmap; }; struct kvm_vcpu_fault_info { diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index b3edde68bc3e..8d7a1a93f12c 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -281,6 +281,9 @@ struct kvm_arm_copy_mte_tags { #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED 3 #define KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED (1U << 4) +#define KVM_REG_ARM_KVM_SPECIFIC_HYPERCALLS_0 KVM_REG_ARM_FW_REG(3) +#define KVM_REG_ARM_KVM_SPECIFIC_HYPERCALLS_1 KVM_REG_ARM_FW_REG(4) + /* SVE registers */ #define KVM_REG_ARM64_SVE (0x15 << KVM_REG_ARM_COPROC_SHIFT) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index fe102cd2e518..dbc136675b02 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -130,6 +130,13 @@ static void set_default_spectre(struct kvm *kvm) kvm->arch.pfr0_csv3 = 1; } + +static void set_default_vendor_hypercalls(struct kvm *kvm) +{ + kvm->arch.hc_bmap = (BIT(ARM_SMCCC_KVM_FUNC_FEATURES) | + BIT(ARM_SMCCC_KVM_FUNC_PTP)); +} + /** * kvm_arch_init_vm - initializes a VM data structure * @kvm: pointer to the KVM struct @@ -156,6 +163,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) kvm->arch.max_vcpus = kvm_arm_default_max_vcpus(); set_default_spectre(kvm); + set_default_vendor_hypercalls(kvm); return ret; out_free_stage2_pgd: diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c index 30da78f72b3b..11f79e790f7d 100644 --- a/arch/arm64/kvm/hypercalls.c +++ b/arch/arm64/kvm/hypercalls.c @@ -127,8 +127,7 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu) val[3] = ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_3; break; case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID: - val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES); - val[0] |= BIT(ARM_SMCCC_KVM_FUNC_PTP); + val[0] = vcpu->kvm->arch.hc_bmap; break; case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID: kvm_ptp_get_time(vcpu, val); diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c index 74c47d420253..ceef0c706d50 100644 --- a/arch/arm64/kvm/psci.c +++ b/arch/arm64/kvm/psci.c @@ -404,21 +404,26 @@ int kvm_psci_call(struct kvm_vcpu *vcpu) }; } +static const u64 fw_reg_ids[] = { + KVM_REG_ARM_PSCI_VERSION, + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1, + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2, + KVM_REG_ARM_KVM_SPECIFIC_HYPERCALLS_0, +}; + int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu) { - return 3; /* PSCI version and two workaround registers */ + return ARRAY_SIZE(fw_reg_ids); } int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices) { - if (put_user(KVM_REG_ARM_PSCI_VERSION, uindices++)) - return -EFAULT; - - if (put_user(KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1, uindices++)) - return -EFAULT; + int i; - if (put_user(KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2, uindices++)) - return -EFAULT; + for (i = 0; i < ARRAY_SIZE(fw_reg_ids); i++) { + if (put_user(fw_reg_ids[i], uindices++)) + return -EFAULT; + } return 0; } @@ -477,6 +482,10 @@ int kvm_arm_get_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: val = get_kernel_wa_level(reg->id) & KVM_REG_FEATURE_LEVEL_MASK; break; + case KVM_REG_ARM_KVM_SPECIFIC_HYPERCALLS_0: + val = vcpu->kvm->arch.hc_bmap; + break; + case KVM_REG_ARM_KVM_SPECIFIC_HYPERCALLS_1: default: return -ENOENT; } @@ -515,8 +524,9 @@ int kvm_arm_set_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) return -EINVAL; vcpu->kvm->arch.psci_version = val; return 0; + default: + return -EINVAL; } - break; } case KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: @@ -563,9 +573,16 @@ int kvm_arm_set_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) return -EINVAL; return 0; + case KVM_REG_ARM_KVM_SPECIFIC_HYPERCALLS_0: + /* FIXME: no vcpu should have run */ + /* Exclude any unknown bit */ + if (val & ~(BIT(ARM_SMCCC_KVM_FUNC_FEATURES) | + BIT(ARM_SMCCC_KVM_FUNC_PTP))) + return -EINVAL; + vcpu->kvm->arch.hc_bmap = val; + return 0; + case KVM_REG_ARM_KVM_SPECIFIC_HYPERCALLS_1: default: return -ENOENT; } - - return -EINVAL; } -- Without deviation from the norm, progress is not possible. ^ permalink raw reply related [flat|nested] 25+ messages in thread
end of thread, other threads:[~2022-02-08 17:49 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-08-24 21:15 KVM/arm64: Guest ABI changes do not appear rollback-safe Oliver Upton 2021-08-24 21:34 ` [RFC PATCH] KVM: arm64: Allow VMMs to opt-out of KVM_CAP_PTP_KVM Oliver Upton 2021-08-25 9:27 ` KVM/arm64: Guest ABI changes do not appear rollback-safe Marc Zyngier 2021-08-25 10:02 ` Oliver Upton 2021-08-25 10:39 ` Marc Zyngier 2021-08-25 15:07 ` Andrew Jones 2021-08-25 18:14 ` Oliver Upton 2021-08-26 8:37 ` Marc Zyngier 2021-08-26 18:49 ` Oliver Upton 2021-08-27 7:40 ` Andrew Jones 2021-09-29 18:22 ` Oliver Upton 2021-09-30 7:32 ` Marc Zyngier 2021-09-30 17:24 ` Oliver Upton 2021-10-01 11:43 ` Marc Zyngier 2021-10-01 15:38 ` Oliver Upton 2022-01-25 3:47 ` Raghavendra Rao Ananta 2022-01-25 8:45 ` Marc Zyngier 2022-01-25 17:29 ` Oliver Upton 2022-02-08 9:46 ` Marc Zyngier 2022-02-08 9:56 ` Oliver Upton 2022-02-08 16:58 ` Sean Christopherson 2022-02-08 17:48 ` Marc Zyngier 2021-08-26 8:49 ` Andrew Jones 2021-08-26 8:54 ` Andrew Jones 2021-08-26 9:43 ` Marc Zyngier
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).