* [RFC PATCH 0/2] MTE support for KVM guest

From: Steven Price @ 2020-06-17 12:38 UTC
To: Catalin Marinas, Marc Zyngier, Will Deacon
Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm,
    linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner

These patches add support to KVM to enable MTE within a guest. It is
based on Catalin's v4 MTE user space series[1].

[1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com

Posting as an RFC as I'd like feedback on the approach taken. First a
little background on how MTE fits within the architecture:

The stage 2 page tables have limited scope for controlling the
availability of MTE. If a page is mapped as Normal and cached in stage 2
then it's the stage 1 tables that get to choose whether the memory is
tagged or not. So the only way of forbidding tags on a page from the
hypervisor is to change the cacheability (or make it device memory),
which would cause other problems. Note this restriction fits the
intention that a system should have all (general purpose) memory
supporting tags if it supports MTE, so it's not too surprising.

However, the upshot of this is that to enable MTE within a guest, all
pages of memory mapped into the guest as normal cached pages in stage 2
*must* support MTE (i.e. we must ensure the tags are appropriately
sanitised and save/restore the tags during swap etc).

My current approach is that KVM transparently upgrades any pages
provided by the VMM to be tag-enabled when they are faulted in (i.e.
sets the PG_mte_tagged flag on the page), which has the benefit of
requiring fewer changes in the VMM. However, save/restore of the VM
state still requires the VMM to have a PROT_MTE enabled mapping so that
it can access the tag values. A VMM which 'forgets' to enable PROT_MTE
would lose the tag values when saving/restoring (tags are RAZ/WI when
PROT_MTE isn't set).

An alternative approach would be to enforce that the VMM provides
PROT_MTE memory in the first place. This seems appealing as it would
prevent the potentially unexpected gotchas with save/restore described
above; however it would also extend to memory that you might not expect
to have PROT_MTE (e.g. a shared frame buffer for an emulated graphics
card).

Comments on the approach (or ideas for alternative approaches) are very
welcome.

Steven Price (2):
  arm64: kvm: Save/restore MTE registers
  arm64: kvm: Introduce MTE VCPU feature

 arch/arm64/include/asm/kvm_emulate.h |  3 +++
 arch/arm64/include/asm/kvm_host.h    |  9 ++++++++-
 arch/arm64/include/uapi/asm/kvm.h    |  1 +
 arch/arm64/kvm/hyp/sysreg-sr.c       | 12 ++++++++++++
 arch/arm64/kvm/reset.c               |  8 ++++++++
 arch/arm64/kvm/sys_regs.c            |  8 ++++++++
 virt/kvm/arm/mmu.c                   | 11 +++++++++++
 7 files changed, 51 insertions(+), 1 deletion(-)

--
2.20.1
* [RFC PATCH 1/2] arm64: kvm: Save/restore MTE registers

From: Steven Price @ 2020-06-17 12:38 UTC
To: Catalin Marinas, Marc Zyngier, Will Deacon
Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm,
    linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner

Define the new system registers that MTE introduces and context switch
them. Also hide the MTE feature from the ID register as it isn't
supported in a VM yet.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/kvm_host.h |  7 +++++++
 arch/arm64/kvm/hyp/sysreg-sr.c    | 12 ++++++++++++
 arch/arm64/kvm/sys_regs.c         |  7 +++++++
 3 files changed, 26 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 32c8a675e5a4..1f10e9dee2e0 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -92,6 +92,9 @@ struct kvm_arch {
 	 * supported.
 	 */
 	bool return_nisv_io_abort_to_user;
+
+	/* If any VCPU has MTE enabled then all memory must be MTE enabled */
+	bool vcpu_has_mte;
 };

 #define KVM_NR_MEM_OBJS 40
@@ -123,6 +126,8 @@ enum vcpu_sysreg {
 	SCTLR_EL1,	/* System Control Register */
 	ACTLR_EL1,	/* Auxiliary Control Register */
 	CPACR_EL1,	/* Coprocessor Access Control */
+	RGSR_EL1,	/* Random Allocation Tag Seed Register */
+	GCR_EL1,	/* Tag Control Register */
 	ZCR_EL1,	/* SVE Control */
 	TTBR0_EL1,	/* Translation Table Base Register 0 */
 	TTBR1_EL1,	/* Translation Table Base Register 1 */
@@ -139,6 +144,8 @@ enum vcpu_sysreg {
 	TPIDR_EL1,	/* Thread ID, Privileged */
 	AMAIR_EL1,	/* Aux Memory Attribute Indirection Register */
 	CNTKCTL_EL1,	/* Timer Control Register (EL1) */
+	TFSRE0_EL1,	/* Tag Fault Status Register (EL0) */
+	TFSR_EL1,	/* Tag Fault Status Register (EL1) */
 	PAR_EL1,	/* Physical Address Register */
 	MDSCR_EL1,	/* Monitor Debug System Control Register */
 	MDCCINT_EL1,	/* Monitor Debug Comms Channel Interrupt Enable Reg */
diff --git a/arch/arm64/kvm/hyp/sysreg-sr.c b/arch/arm64/kvm/hyp/sysreg-sr.c
index 75b1925763f1..6ecee1528566 100644
--- a/arch/arm64/kvm/hyp/sysreg-sr.c
+++ b/arch/arm64/kvm/hyp/sysreg-sr.c
@@ -26,6 +26,12 @@
 static void __hyp_text __sysreg_save_common_state(struct kvm_cpu_context *ctxt)
 {
 	ctxt->sys_regs[MDSCR_EL1] = read_sysreg(mdscr_el1);
+	if (system_supports_mte()) {
+		ctxt->sys_regs[RGSR_EL1] = read_sysreg_s(SYS_RGSR_EL1);
+		ctxt->sys_regs[GCR_EL1] = read_sysreg_s(SYS_GCR_EL1);
+		ctxt->sys_regs[TFSRE0_EL1] = read_sysreg_s(SYS_TFSRE0_EL1);
+		ctxt->sys_regs[TFSR_EL1] = read_sysreg_s(SYS_TFSR_EL1);
+	}

 	/*
 	 * The host arm64 Linux uses sp_el0 to point to 'current' and it must
@@ -99,6 +105,12 @@ NOKPROBE_SYMBOL(sysreg_save_guest_state_vhe);
 static void __hyp_text __sysreg_restore_common_state(struct kvm_cpu_context *ctxt)
 {
 	write_sysreg(ctxt->sys_regs[MDSCR_EL1], mdscr_el1);
+	if (system_supports_mte()) {
+		write_sysreg_s(ctxt->sys_regs[RGSR_EL1], SYS_RGSR_EL1);
+		write_sysreg_s(ctxt->sys_regs[GCR_EL1], SYS_GCR_EL1);
+		write_sysreg_s(ctxt->sys_regs[TFSRE0_EL1], SYS_TFSRE0_EL1);
+		write_sysreg_s(ctxt->sys_regs[TFSR_EL1], SYS_TFSR_EL1);
+	}

 	/*
 	 * The host arm64 Linux uses sp_el0 to point to 'current' and it must
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 51db934702b6..3ae008a9b0bd 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1095,6 +1095,8 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
 		if (!vcpu_has_sve(vcpu))
 			val &= ~(0xfUL << ID_AA64PFR0_SVE_SHIFT);
 		val &= ~(0xfUL << ID_AA64PFR0_AMU_SHIFT);
+	} else if (id == SYS_ID_AA64PFR1_EL1) {
+		val &= ~(0xfUL << ID_AA64PFR1_MTE_SHIFT);
 	} else if (id == SYS_ID_AA64ISAR1_EL1 && !vcpu_has_ptrauth(vcpu)) {
 		val &= ~((0xfUL << ID_AA64ISAR1_APA_SHIFT) |
 			 (0xfUL << ID_AA64ISAR1_API_SHIFT) |
@@ -1504,6 +1506,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_SCTLR_EL1), access_vm_reg, reset_val, SCTLR_EL1, 0x00C50078 },
 	{ SYS_DESC(SYS_CPACR_EL1), NULL, reset_val, CPACR_EL1, 0 },
+	{ SYS_DESC(SYS_RGSR_EL1), trap_raz_wi, reset_unknown, RGSR_EL1 },
+	{ SYS_DESC(SYS_GCR_EL1), trap_raz_wi, reset_unknown, GCR_EL1 },
 	{ SYS_DESC(SYS_ZCR_EL1), NULL, reset_val, ZCR_EL1, 0, .visibility = sve_visibility },
 	{ SYS_DESC(SYS_TTBR0_EL1), access_vm_reg, reset_unknown, TTBR0_EL1 },
 	{ SYS_DESC(SYS_TTBR1_EL1), access_vm_reg, reset_unknown, TTBR1_EL1 },
@@ -1528,6 +1532,9 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_ERXMISC0_EL1), trap_raz_wi },
 	{ SYS_DESC(SYS_ERXMISC1_EL1), trap_raz_wi },

+	{ SYS_DESC(SYS_TFSR_EL1), trap_raz_wi, reset_unknown, TFSR_EL1 },
+	{ SYS_DESC(SYS_TFSRE0_EL1), trap_raz_wi, reset_unknown, TFSRE0_EL1 },
+
 	{ SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 },
 	{ SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 },
--
2.20.1
* Re: [RFC PATCH 1/2] arm64: kvm: Save/restore MTE registers

From: Catalin Marinas @ 2020-06-17 14:05 UTC
To: Steven Price
Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose,
    kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland,
    Thomas Gleixner

On Wed, Jun 17, 2020 at 01:38:43PM +0100, Steven Price wrote:
> diff --git a/arch/arm64/kvm/hyp/sysreg-sr.c b/arch/arm64/kvm/hyp/sysreg-sr.c
> index 75b1925763f1..6ecee1528566 100644
> --- a/arch/arm64/kvm/hyp/sysreg-sr.c
> +++ b/arch/arm64/kvm/hyp/sysreg-sr.c
> @@ -26,6 +26,12 @@
>  static void __hyp_text __sysreg_save_common_state(struct kvm_cpu_context *ctxt)
>  {
>  	ctxt->sys_regs[MDSCR_EL1] = read_sysreg(mdscr_el1);
> +	if (system_supports_mte()) {
> +		ctxt->sys_regs[RGSR_EL1] = read_sysreg_s(SYS_RGSR_EL1);
> +		ctxt->sys_regs[GCR_EL1] = read_sysreg_s(SYS_GCR_EL1);
> +		ctxt->sys_regs[TFSRE0_EL1] = read_sysreg_s(SYS_TFSRE0_EL1);
> +		ctxt->sys_regs[TFSR_EL1] = read_sysreg_s(SYS_TFSR_EL1);
> +	}

TFSR_EL1 is not a common register, as we have TFSR_EL2 as well. So you'd
have to access it as read_sysreg_el1(SYS_TFSR) so that, in the VHE case,
it generates TFSR_EL12; otherwise you just save the host register.

Also, since TFSR*_EL1 can be set asynchronously, I think we need to set
the SCTLR_EL2.ITFSB bit so that the register update is synchronised on
entry to EL2. With VHE we get this automatically as part of
SCTLR_EL1_SET, but it turns out that we have another SCTLR_ELx_FLAGS
macro for the non-VHE case (why this isn't called SCTLR_EL2_* I have no
idea).

>  	/*
>  	 * The host arm64 Linux uses sp_el0 to point to 'current' and it must
> @@ -99,6 +105,12 @@ NOKPROBE_SYMBOL(sysreg_save_guest_state_vhe);
>  static void __hyp_text __sysreg_restore_common_state(struct kvm_cpu_context *ctxt)
>  {
>  	write_sysreg(ctxt->sys_regs[MDSCR_EL1], mdscr_el1);
> +	if (system_supports_mte()) {
> +		write_sysreg_s(ctxt->sys_regs[RGSR_EL1], SYS_RGSR_EL1);
> +		write_sysreg_s(ctxt->sys_regs[GCR_EL1], SYS_GCR_EL1);
> +		write_sysreg_s(ctxt->sys_regs[TFSRE0_EL1], SYS_TFSRE0_EL1);
> +		write_sysreg_s(ctxt->sys_regs[TFSR_EL1], SYS_TFSR_EL1);
> +	}

Similarly here, you override TFSR_EL2 with VHE enabled.

--
Catalin
* Re: [RFC PATCH 1/2] arm64: kvm: Save/restore MTE registers

From: Steven Price @ 2020-06-18 10:43 UTC
To: Catalin Marinas
Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki Poulose,
    kvmarm, linux-arm-kernel, linux-kernel, Dave P Martin, Mark Rutland,
    Thomas Gleixner

On 17/06/2020 15:05, Catalin Marinas wrote:
> On Wed, Jun 17, 2020 at 01:38:43PM +0100, Steven Price wrote:
>> diff --git a/arch/arm64/kvm/hyp/sysreg-sr.c b/arch/arm64/kvm/hyp/sysreg-sr.c
>> index 75b1925763f1..6ecee1528566 100644
>> --- a/arch/arm64/kvm/hyp/sysreg-sr.c
>> +++ b/arch/arm64/kvm/hyp/sysreg-sr.c
>> @@ -26,6 +26,12 @@
>>  static void __hyp_text __sysreg_save_common_state(struct kvm_cpu_context *ctxt)
>>  {
>>  	ctxt->sys_regs[MDSCR_EL1] = read_sysreg(mdscr_el1);
>> +	if (system_supports_mte()) {
>> +		ctxt->sys_regs[RGSR_EL1] = read_sysreg_s(SYS_RGSR_EL1);
>> +		ctxt->sys_regs[GCR_EL1] = read_sysreg_s(SYS_GCR_EL1);
>> +		ctxt->sys_regs[TFSRE0_EL1] = read_sysreg_s(SYS_TFSRE0_EL1);
>> +		ctxt->sys_regs[TFSR_EL1] = read_sysreg_s(SYS_TFSR_EL1);
>> +	}
>
> TFSR_EL1 is not a common register as we have the TFSR_EL2 as well. So
> you'd have to access it as read_sysreg_el1(SYS_TFSR) so that, in the VHE
> case, it generates TFSR_EL12, otherwise you just save the host register.

Ah, thanks for pointing that out - I'd got myself confused with the
whole VHE _EL12 registers. I'd managed to miss that TFSR is banked.

> Also, since TFSR*_EL1 can be set asynchronously, I think we need to set
> the SCTLR_EL2.ITFSB bit so that the register update is synchronised on
> entry to EL2. With VHE we get this automatically as part of
> SCTLR_EL1_SET but it turns out that we have another SCTLR_ELx_FLAGS
> macro for the non-VHE case (why not calling this SCTLR_EL2_* I have no
> idea).

I hadn't noticed that there was a different set for the non-VHE case
which was missing ITFSB - I'll update that.

Thanks,

Steve

>>  	/*
>>  	 * The host arm64 Linux uses sp_el0 to point to 'current' and it must
>> @@ -99,6 +105,12 @@ NOKPROBE_SYMBOL(sysreg_save_guest_state_vhe);
>>  static void __hyp_text __sysreg_restore_common_state(struct kvm_cpu_context *ctxt)
>>  {
>>  	write_sysreg(ctxt->sys_regs[MDSCR_EL1], mdscr_el1);
>> +	if (system_supports_mte()) {
>> +		write_sysreg_s(ctxt->sys_regs[RGSR_EL1], SYS_RGSR_EL1);
>> +		write_sysreg_s(ctxt->sys_regs[GCR_EL1], SYS_GCR_EL1);
>> +		write_sysreg_s(ctxt->sys_regs[TFSRE0_EL1], SYS_TFSRE0_EL1);
>> +		write_sysreg_s(ctxt->sys_regs[TFSR_EL1], SYS_TFSR_EL1);
>> +	}
>
> Similarly here, you override the TFSR_EL2 with VHE enabled.
* [RFC PATCH 2/2] arm64: kvm: Introduce MTE VCPU feature

From: Steven Price @ 2020-06-17 12:38 UTC
To: Catalin Marinas, Marc Zyngier, Will Deacon
Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm,
    linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner

Add a new VCPU feature 'KVM_ARM_VCPU_MTE' which enables memory tagging
on a VCPU. When enabled on any VCPU in the virtual machine, this causes
all pages that are faulted into the VM to have the PG_mte_tagged flag
set (and the tag storage cleared if this is the first use).

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/kvm_emulate.h |  3 +++
 arch/arm64/include/asm/kvm_host.h    |  2 +-
 arch/arm64/include/uapi/asm/kvm.h    |  1 +
 arch/arm64/kvm/reset.c               |  8 ++++++++
 arch/arm64/kvm/sys_regs.c            |  3 ++-
 virt/kvm/arm/mmu.c                   | 11 +++++++++++
 6 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index a30b4eec7cb4..b118f466a40b 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -79,6 +79,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
 	if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
 	    vcpu_el1_is_32bit(vcpu))
 		vcpu->arch.hcr_el2 |= HCR_TID2;
+
+	if (test_bit(KVM_ARM_VCPU_MTE, vcpu->arch.features))
+		vcpu->arch.hcr_el2 |= HCR_ATA;
 }

 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 1f10e9dee2e0..3461639bb08a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -37,7 +37,7 @@

 #define KVM_MAX_VCPUS VGIC_V3_MAX_CPUS

-#define KVM_VCPU_MAX_FEATURES 7
+#define KVM_VCPU_MAX_FEATURES 8

 #define KVM_REQ_SLEEP \
 	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index ba85bb23f060..2677e1ab8c16 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -106,6 +106,7 @@ struct kvm_regs {
 #define KVM_ARM_VCPU_SVE		4 /* enable SVE for this CPU */
 #define KVM_ARM_VCPU_PTRAUTH_ADDRESS	5 /* VCPU uses address authentication */
 #define KVM_ARM_VCPU_PTRAUTH_GENERIC	6 /* VCPU uses generic authentication */
+#define KVM_ARM_VCPU_MTE		7 /* VCPU supports Memory Tagging */

 struct kvm_vcpu_init {
 	__u32 target;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index ab76728e2742..f87a434c0849 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -287,6 +287,14 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
 		}
 	}

+	if (test_bit(KVM_ARM_VCPU_MTE, vcpu->arch.features)) {
+		if (!system_supports_mte()) {
+			ret = -EINVAL;
+			goto out;
+		}
+		vcpu->kvm->arch.vcpu_has_mte = true;
+	}
+
 	switch (vcpu->arch.target) {
 	default:
 		if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 3ae008a9b0bd..a6a9552d1233 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1096,7 +1096,8 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
 		val &= ~(0xfUL << ID_AA64PFR0_SVE_SHIFT);
 		val &= ~(0xfUL << ID_AA64PFR0_AMU_SHIFT);
 	} else if (id == SYS_ID_AA64PFR1_EL1) {
-		val &= ~(0xfUL << ID_AA64PFR1_MTE_SHIFT);
+		if (!test_bit(KVM_ARM_VCPU_MTE, vcpu->arch.features))
+			val &= ~(0xfUL << ID_AA64PFR1_MTE_SHIFT);
 	} else if (id == SYS_ID_AA64ISAR1_EL1 && !vcpu_has_ptrauth(vcpu)) {
 		val &= ~((0xfUL << ID_AA64ISAR1_APA_SHIFT) |
 			 (0xfUL << ID_AA64ISAR1_API_SHIFT) |
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index e3b9ee268823..040a7fffaa93 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1783,6 +1783,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			vma_pagesize = PMD_SIZE;
 	}

+	if (system_supports_mte() && kvm->arch.vcpu_has_mte) {
+		/*
+		 * VM will be able to see the page's tags, so we must ensure
+		 * they have been initialised.
+		 */
+		struct page *page = pfn_to_page(pfn);
+
+		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+			mte_clear_page_tags(page_address(page), page_size(page));
+	}
+
 	if (writable)
 		kvm_set_pfn_dirty(pfn);
--
2.20.1
* Re: [RFC PATCH 2/2] arm64: kvm: Introduce MTE VCPU feature

From: Catalin Marinas @ 2020-06-17 14:38 UTC
To: Steven Price
Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose,
    kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland,
    Thomas Gleixner

On Wed, Jun 17, 2020 at 01:38:44PM +0100, Steven Price wrote:
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index e3b9ee268823..040a7fffaa93 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1783,6 +1783,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			vma_pagesize = PMD_SIZE;
>  	}
>
> +	if (system_supports_mte() && kvm->arch.vcpu_has_mte) {
> +		/*
> +		 * VM will be able to see the page's tags, so we must ensure
> +		 * they have been initialised.
> +		 */
> +		struct page *page = pfn_to_page(pfn);
> +
> +		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> +			mte_clear_page_tags(page_address(page), page_size(page));
> +	}

Are all the guest pages always mapped via a Stage 2 fault? It may be
better if we did that via kvm_set_spte_hva().

--
Catalin
* Re: [RFC PATCH 2/2] arm64: kvm: Introduce MTE VCPU feature

From: Steven Price @ 2020-06-17 15:34 UTC
To: Catalin Marinas
Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki Poulose,
    kvmarm, linux-arm-kernel, linux-kernel, Dave P Martin, Mark Rutland,
    Thomas Gleixner

On 17/06/2020 15:38, Catalin Marinas wrote:
> On Wed, Jun 17, 2020 at 01:38:44PM +0100, Steven Price wrote:
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index e3b9ee268823..040a7fffaa93 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1783,6 +1783,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  			vma_pagesize = PMD_SIZE;
>>  	}
>>
>> +	if (system_supports_mte() && kvm->arch.vcpu_has_mte) {
>> +		/*
>> +		 * VM will be able to see the page's tags, so we must ensure
>> +		 * they have been initialised.
>> +		 */
>> +		struct page *page = pfn_to_page(pfn);
>> +
>> +		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>> +			mte_clear_page_tags(page_address(page), page_size(page));
>> +	}
>
> Are all the guest pages always mapped via a Stage 2 fault? It may be
> better if we did that via kvm_set_spte_hva().

I was under the impression that pages are always faulted into the stage
2, but I have to admit I'm not 100% sure about that.

kvm_set_spte_hva() may be more appropriate, although on first look I
don't understand how that function deals with huge pages. Is it actually
called for normal mappings or only for changes due to the likes of KSM?

Steve
* Re: [RFC PATCH 2/2] arm64: kvm: Introduce MTE VCPU feature

From: James Morse @ 2020-06-26 16:40 UTC
To: Steven Price
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, Julien Thierry, Suzuki Poulose,
    kvmarm, linux-arm-kernel, linux-kernel, Dave P Martin, Mark Rutland,
    Thomas Gleixner

Hi Steve,

On 17/06/2020 16:34, Steven Price wrote:
> On 17/06/2020 15:38, Catalin Marinas wrote:
>> On Wed, Jun 17, 2020 at 01:38:44PM +0100, Steven Price wrote:
>>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>>> index e3b9ee268823..040a7fffaa93 100644
>>> --- a/virt/kvm/arm/mmu.c
>>> +++ b/virt/kvm/arm/mmu.c
>>> @@ -1783,6 +1783,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>  			vma_pagesize = PMD_SIZE;
>>>  	}
>>>
>>> +	if (system_supports_mte() && kvm->arch.vcpu_has_mte) {
>>> +		/*
>>> +		 * VM will be able to see the page's tags, so we must ensure
>>> +		 * they have been initialised.
>>> +		 */
>>> +		struct page *page = pfn_to_page(pfn);
>>> +
>>> +		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>>> +			mte_clear_page_tags(page_address(page), page_size(page));
>>> +	}
>>
>> Are all the guest pages always mapped via a Stage 2 fault? It may be
>> better if we did that via kvm_set_spte_hva().
>
> I was under the impression that pages are always faulted into the stage
> 2, but I have to admit I'm not 100% sure about that.

I think there is only one case: a VMA with VM_PFNMAP set will get
pre-populated during kvm_arch_prepare_memory_region(), but those are
always made device at stage 2, so MTE isn't a concern there.

> kvm_set_spte_hva() may be more appropriate, although on first look I
> don't understand how that function deals with huge pages. Is it
> actually called for normal mappings or only for changes due to the
> likes of KSM?

It looks like it's only called through set_pte_at_notify(), which is
used by things like KSM/COW that change a mapping and really don't want
to fault it a second time. I guess they are only for PAGE_SIZE mappings.
Other mapping sizes would get faulted in by user_mem_abort().

I think this should happen in the same places as we clean new pages to
PoC, as that is also an additional piece of maintenance KVM has to do
that the host's stage 1 doesn't. You may be able to rename
clean_dcache_guest_page() to encompass maintenance that we need when a
page is accessible to a different EL1.

Thanks,

James
* Re: [RFC PATCH 0/2] MTE support for KVM guest

From: Catalin Marinas @ 2020-06-23 17:48 UTC
To: Steven Price
Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose,
    kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland,
    Thomas Gleixner, Peter Maydell

Hi Steven,

On Wed, Jun 17, 2020 at 01:38:42PM +0100, Steven Price wrote:
> These patches add support to KVM to enable MTE within a guest. It is
> based on Catalin's v4 MTE user space series[1].
>
> [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
>
> Posting as an RFC as I'd like feedback on the approach taken. First a
> little background on how MTE fits within the architecture:
>
> The stage 2 page tables have limited scope for controlling the
> availability of MTE. If a page is mapped as Normal and cached in stage 2
> then it's the stage 1 tables that get to choose whether the memory is
> tagged or not. So the only way of forbidding tags on a page from the
> hypervisor is to change the cacheability (or make it device memory)
> which would cause other problems. Note this restriction fits the
> intention that a system should have all (general purpose) memory
> supporting tags if it supports MTE, so it's not too surprising.
>
> However, the upshot of this is that to enable MTE within a guest all
> pages of memory mapped into the guest as normal cached pages in stage 2
> *must* support MTE (i.e. we must ensure the tags are appropriately
> sanitised and save/restore the tags during swap etc).
>
> My current approach is that KVM transparently upgrades any pages
> provided by the VMM to be tag-enabled when they are faulted in (i.e.
> sets the PG_mte_tagged flag on the page) which has the benefit of
> requiring fewer changes in the VMM. However, save/restore of the VM
> state still requires the VMM to have a PROT_MTE enabled mapping so that
> it can access the tag values. A VMM which 'forgets' to enable PROT_MTE
> would lose the tag values when saving/restoring (tags are RAZ/WI when
> PROT_MTE isn't set).
>
> An alternative approach would be to enforce the VMM provides PROT_MTE
> memory in the first place. This seems appealing to prevent the above
> potentially unexpected gotchas with save/restore, however this would
> also extend to memory that you might not expect to have PROT_MTE (e.g.
> a shared frame buffer for an emulated graphics card).

As you mentioned above, if memory is mapped as Normal Cacheable at Stage
2 (whether we use FWB or not), the guest is allowed to turn MTE on via
Stage 1. There is no way for KVM to prevent a guest from using MTE other
than the big HCR_EL2.ATA knob.

This causes potential issues since we can't guarantee that all the
Cacheable memory slots allocated by the VMM support MTE. If they do not,
the arch behaviour is "unpredictable". We also can't trust the guest to
not enable MTE on such Cacheable mappings.

On the host kernel, mmap'ing with PROT_MTE is only allowed for anonymous
mappings and shmem. So requiring the VMM to always pass PROT_MTE mapped
ranges to KVM, irrespective of whether it's guest RAM, emulated device,
virtio etc. (as long as they are Cacheable), filters unsafe ranges that
may be mapped into the guest.

Note that in the next revision of the MTE patches I'll drop the DT
memory nodes checking and rely only on the CPUID information (arch
update promised by the architects).

I see two possible ways to handle this (there may be more):

1. As in your current patches, assume anything Cacheable at Stage 2 can
   have MTE enabled at Stage 1. In addition, we need to check whether
   the physical memory supports MTE, and it could be something simple
   like pfn_valid(). Is there a way to reject a memory slot passed by
   the VMM?

2. Similar to 1 but instead of checking whether the pfn supports MTE, we
   require the VMM to only pass PROT_MTE ranges (filtering already done
   by the host kernel). We need a way to reject the slot and return an
   error to the VMM.

I think rejecting a slot at Stage 2 fault time is very late. You
probably won't be able to do much other than kill the guest.

Both 1 and 2 above risk breaking existing VMMs just because they happen
to start on an MTE-capable machine. So, can we also require the VMM to
explicitly opt in to MTE support in guests via some ioctl()? This in
turn would enable the additional checks in KVM for the MTE capability of
the memory slots (1 or 2 above).

As an alternative to an MTE enable ioctl(): if all the memory slots are
set up prior to the VM starting, KVM could check 1 or 2 above and decide
whether to expose MTE to guests (HCR_EL2.ATA).

More questions than solutions above, mostly for the KVM and Qemu
maintainers. Thanks.

--
Catalin
* Re: [RFC PATCH 0/2] MTE support for KVM guest

From: Steven Price @ 2020-06-24 11:16 UTC
To: Catalin Marinas
Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose,
    kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland,
    Thomas Gleixner, Peter Maydell

On 23/06/2020 18:48, Catalin Marinas wrote:
> Hi Steven,
>
> On Wed, Jun 17, 2020 at 01:38:42PM +0100, Steven Price wrote:
>> These patches add support to KVM to enable MTE within a guest. It is
>> based on Catalin's v4 MTE user space series[1].
>>
>> [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
>>
>> Posting as an RFC as I'd like feedback on the approach taken. First a
>> little background on how MTE fits within the architecture:
>>
>> The stage 2 page tables have limited scope for controlling the
>> availability of MTE. If a page is mapped as Normal and cached in stage 2
>> then it's the stage 1 tables that get to choose whether the memory is
>> tagged or not. So the only way of forbidding tags on a page from the
>> hypervisor is to change the cacheability (or make it device memory)
>> which would cause other problems. Note this restriction fits the
>> intention that a system should have all (general purpose) memory
>> supporting tags if it supports MTE, so it's not too surprising.
>>
>> However, the upshot of this is that to enable MTE within a guest all
>> pages of memory mapped into the guest as normal cached pages in stage 2
>> *must* support MTE (i.e. we must ensure the tags are appropriately
>> sanitised and save/restore the tags during swap etc).
>>
>> My current approach is that KVM transparently upgrades any pages
>> provided by the VMM to be tag-enabled when they are faulted in (i.e.
>> sets the PG_mte_tagged flag on the page) which has the benefit of
>> requiring fewer changes in the VMM. However, save/restore of the VM
>> state still requires the VMM to have a PROT_MTE enabled mapping so that
>> it can access the tag values. A VMM which 'forgets' to enable PROT_MTE
>> would lose the tag values when saving/restoring (tags are RAZ/WI when
>> PROT_MTE isn't set).
>>
>> An alternative approach would be to enforce the VMM provides PROT_MTE
>> memory in the first place. This seems appealing to prevent the above
>> potentially unexpected gotchas with save/restore, however this would
>> also extend to memory that you might not expect to have PROT_MTE (e.g.
>> a shared frame buffer for an emulated graphics card).
>
> As you mentioned above, if memory is mapped as Normal Cacheable at Stage
> 2 (whether we use FWB or not), the guest is allowed to turn MTE on via
> Stage 1. There is no way for KVM to prevent a guest from using MTE other
> than the big HCR_EL2.ATA knob.
>
> This causes potential issues since we can't guarantee that all the
> Cacheable memory slots allocated by the VMM support MTE. If they do not,
> the arch behaviour is "unpredictable". We also can't trust the guest to
> not enable MTE on such Cacheable mappings.

Architecturally it seems dodgy to export any address that isn't "normal
memory" (i.e. with tag storage) to the guest as Normal Cacheable,
although I'm a bit worried this might cause a regression in some
existing case.

> On the host kernel, mmap'ing with PROT_MTE is only allowed for anonymous
> mappings and shmem. So requiring the VMM to always pass PROT_MTE mapped
> ranges to KVM, irrespective of whether it's guest RAM, emulated device,
> virtio etc. (as long as they are Cacheable), filters unsafe ranges that
> may be mapped into the guest.

That would be an easy way of doing the filtering, but it's not clear
whether PROT_MTE is actually what the VMM wants (it most likely doesn't
want tag checking enabled on the memory in user space).

> Note that in the next revision of the MTE patches I'll drop the DT
> memory nodes checking and rely only on the CPUID information (arch
> update promised by the architects).
>
> I see two possible ways to handle this (there may be more):
>
> 1. As in your current patches, assume any Cacheable at Stage 2 can have
>    MTE enabled at Stage 1. In addition, we need to check whether the
>    physical memory supports MTE and it could be something simple like
>    pfn_valid(). Is there a way to reject a memory slot passed by the
>    VMM?

Yes, pfn_valid() should have been in there. At the moment pfn_to_page()
is called without any checks.

The problem with attempting to reject a memory slot is that the memory
backing that slot can change. So checking at the time the slot is
created isn't enough (although it might be a useful error checking
feature). It's not clear to me what we can do at fault time, when we
discover the memory isn't tag-capable but would have been mapped
cacheable, other than kill the VM.

> 2. Similar to 1 but instead of checking whether the pfn supports MTE, we
>    require the VMM to only pass PROT_MTE ranges (filtering already done
>    by the host kernel). We need a way to reject the slot and return an
>    error to the VMM.
>
> I think rejecting a slot at the Stage 2 fault time is very late. You
> probably won't be able to do much other than killing the guest.

As above, we will struggle to catch all cases during slot creation, so I
think we're going to have to deal with this late detection as well.

> Both 1 and 2 above risk breaking existing VMMs just because they happen
> to start on an MTE-capable machine. So, can we also require the VMM to
> explicitly opt in to MTE support in guests via some ioctl()? This in
> turn would enable the additional checks in KVM for the MTE capability of
> the memory slots (1 or 2 above).

Patch 2 introduces a VCPU feature which must be explicitly enabled for
the guest to have MTE, so it won't break existing VMMs. However, clearly
simply setting that bit will likely break some configurations where not
all memory is MTE capable.

> An alternative to an MTE enable ioctl(), if all the memory slots are set
> up prior to the VM starting, KVM could check 1 or 2 above and decide
> whether to expose MTE to guests (HCR_EL2.ATA).

The VMM needs to be fully aware of MTE before it's enabled by KVM,
otherwise it could lose the tags (e.g. during migration). So my
preference is to make it an explicit opt-in.

> More questions than solutions above, mostly for the KVM and Qemu
> maintainers.

Indeed, it would be useful to have input from those familiar with VMMs
as to what a good interface looks like.

Thanks,

Steve
* Re: [RFC PATCH 0/2] MTE support for KVM guest 2020-06-24 11:16 ` Steven Price @ 2020-06-24 14:21 ` Catalin Marinas 2020-06-24 14:59 ` Steven Price 0 siblings, 1 reply; 22+ messages in thread From: Catalin Marinas @ 2020-06-24 14:21 UTC (permalink / raw) To: Steven Price Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, Peter Maydell On Wed, Jun 24, 2020 at 12:16:28PM +0100, Steven Price wrote: > On 23/06/2020 18:48, Catalin Marinas wrote: > > On Wed, Jun 17, 2020 at 01:38:42PM +0100, Steven Price wrote: > > > These patches add support to KVM to enable MTE within a guest. It is > > > based on Catalin's v4 MTE user space series[1]. > > > > > > [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com > > > > > > Posting as an RFC as I'd like feedback on the approach taken. First a > > > little background on how MTE fits within the architecture: > > > > > > The stage 2 page tables have limited scope for controlling the > > > availability of MTE. If a page is mapped as Normal and cached in stage 2 > > > then it's the stage 1 tables that get to choose whether the memory is > > > tagged or not. So the only way of forbidding tags on a page from the > > > hypervisor is to change the cacheability (or make it device memory) > > > which would cause other problems. Note this restriction fits the > > > intention that a system should have all (general purpose) memory > > > supporting tags if it support MTE, so it's not too surprising. > > > > > > However, the upshot of this is that to enable MTE within a guest all > > > pages of memory mapped into the guest as normal cached pages in stage 2 > > > *must* support MTE (i.e. we must ensure the tags are appropriately > > > sanitised and save/restore the tags during swap etc). 
> > > > > > My current approach is that KVM transparently upgrades any pages > > > provided by the VMM to be tag-enabled when they are faulted in (i.e. > > > sets the PG_mte_tagged flag on the page) which has the benefit of > > > requiring fewer changes in the VMM. However, save/restore of the VM > > > state still requires the VMM to have a PROT_MTE enabled mapping so that > > > it can access the tag values. A VMM which 'forgets' to enable PROT_MTE > > > would lose the tag values when saving/restoring (tags are RAZ/WI when > > > PROT_MTE isn't set). > > > > > > An alternative approach would be to enforce the VMM provides PROT_MTE > > > memory in the first place. This seems appealing to prevent the above > > > potentially unexpected gotchas with save/restore, however this would > > > also extend to memory that you might not expect to have PROT_MTE (e.g. a > > > shared frame buffer for an emulated graphics card). > > > > As you mentioned above, if memory is mapped as Normal Cacheable at Stage > > 2 (whether we use FWB or not), the guest is allowed to turn MTE on via > > Stage 1. There is no way for KVM to prevent a guest from using MTE other > > than the big HCR_EL2.ATA knob. > > > > This causes potential issues since we can't guarantee that all the > > Cacheable memory slots allocated by the VMM support MTE. If they do not, > > the arch behaviour is "unpredictable". We also can't trust the guest to > > not enable MTE on such Cacheable mappings. > > Architecturally it seems dodgy to export any address that isn't "normal > memory" (i.e. with tag storage) to the guest as Normal Cacheable. Although > I'm a bit worried this might cause a regression in some existing case. What I had in mind is some persistent memory that may be given to the guest for direct access. This is allowed to be cacheable (write-back) but may not have tag storage. > > On the host kernel, mmap'ing with PROT_MTE is only allowed for anonymous > > mappings and shmem. 
So requiring the VMM to always pass PROT_MTE mapped > > ranges to KVM, irrespective of whether it's guest RAM, emulated device, > > virtio etc. (as long as they are Cacheable), filters unsafe ranges that > > may be mapped into guest. > > That would be an easy way of doing the filtering, but it's not clear whether > PROT_MTE is actually what the VMM wants (it most likely doesn't want to have > tag checking enabled on the memory in user space). From the other sub-thread, yeah, we probably don't want to mandate PROT_MTE because of potential inadvertent tag check faults in the VMM itself. > > Note that in the next revision of the MTE patches I'll drop the DT > > memory nodes checking and rely only on the CPUID information (arch > > updated promised by the architects). > > > > I see two possible ways to handle this (there may be more): > > > > 1. As in your current patches, assume any Cacheable at Stage 2 can have > > MTE enabled at Stage 1. In addition, we need to check whether the > > physical memory supports MTE and it could be something simple like > > pfn_valid(). Is there a way to reject a memory slot passed by the > > VMM? > > Yes pfn_valid() should have been in there. At the moment pfn_to_page() is > called without any checks. > > The problem with attempting to reject a memory slot is that the memory > backing that slot can change. So checking at the time the slot is created > isn't enough (although it might be a useful error checking feature). But isn't the slot changed as a result of another VMM call? So we could always have such check in place. > It's not clear to me what we can do at fault time when we discover the > memory isn't tag-capable and would have been mapped cacheable other than > kill the VM. Indeed, I don't have a better idea other than trying not to get in this situation. > > 2. Similar to 1 but instead of checking whether the pfn supports MTE, we > > require the VMM to only pass PROT_MTE ranges (filtering already done > > by the host kernel). 
We need a way to reject the slot and return an > > error to the VMM. > > > > I think rejecting a slot at the Stage 2 fault time is very late. You > > probably won't be able to do much other than killing the guest. > > As above, we will struggle to catch all cases during slot creation, so I > think we're going to have to deal with this late detection as well. We can leave it in place as a safety check, killing the VM. My hope is that we can detect slot creation subsequent changes. > > Both 1 and 2 above risk breaking existing VMMs just because they happen > > to start on an MTE-capable machine. So, can we also require the VMM to > > explicitly opt in to MTE support in guests via some ioctl()? This in > > turn would enable the additional checks in KVM for the MTE capability of > > the memory slots (1 or 2 above). > > Patch 2 introduces a VCPU feature which must be explicitly enabled for the > guest to have MTE. So it won't break existing VMMs. However clearly simply > setting that bit will likely break some configurations where not all memory > is MTE capable. Ah, I missed that. At least we won't break current, unaware VMMs. I suspect the CPUID is also conditioned by this explicit enabling. > > An alternative to an MTE enable ioctl(), if all the memory slots are set > > up prior to the VM starting, KVM could check 1 or 2 above and decide > > whether to expose MTE to guests (HCR_EL2.ATA). > > The VMM needs to be fully aware of MTE before it's enabled by KVM otherwise > it could lose the tags (e.g. during migration). So my preference is to make > it an explicit opt-in. I agree. -- Catalin ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/2] MTE support for KVM guest 2020-06-24 14:21 ` Catalin Marinas @ 2020-06-24 14:59 ` Steven Price 2020-06-24 16:24 ` Catalin Marinas 0 siblings, 1 reply; 22+ messages in thread From: Steven Price @ 2020-06-24 14:59 UTC (permalink / raw) To: Catalin Marinas Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, Peter Maydell On 24/06/2020 15:21, Catalin Marinas wrote: > On Wed, Jun 24, 2020 at 12:16:28PM +0100, Steven Price wrote: >> On 23/06/2020 18:48, Catalin Marinas wrote: >>> On Wed, Jun 17, 2020 at 01:38:42PM +0100, Steven Price wrote: >>>> These patches add support to KVM to enable MTE within a guest. It is >>>> based on Catalin's v4 MTE user space series[1]. >>>> >>>> [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com >>>> >>>> Posting as an RFC as I'd like feedback on the approach taken. First a >>>> little background on how MTE fits within the architecture: >>>> >>>> The stage 2 page tables have limited scope for controlling the >>>> availability of MTE. If a page is mapped as Normal and cached in stage 2 >>>> then it's the stage 1 tables that get to choose whether the memory is >>>> tagged or not. So the only way of forbidding tags on a page from the >>>> hypervisor is to change the cacheability (or make it device memory) >>>> which would cause other problems. Note this restriction fits the >>>> intention that a system should have all (general purpose) memory >>>> supporting tags if it support MTE, so it's not too surprising. >>>> >>>> However, the upshot of this is that to enable MTE within a guest all >>>> pages of memory mapped into the guest as normal cached pages in stage 2 >>>> *must* support MTE (i.e. we must ensure the tags are appropriately >>>> sanitised and save/restore the tags during swap etc). 
>>>> >>>> My current approach is that KVM transparently upgrades any pages >>>> provided by the VMM to be tag-enabled when they are faulted in (i.e. >>>> sets the PG_mte_tagged flag on the page) which has the benefit of >>>> requiring fewer changes in the VMM. However, save/restore of the VM >>>> state still requires the VMM to have a PROT_MTE enabled mapping so that >>>> it can access the tag values. A VMM which 'forgets' to enable PROT_MTE >>>> would lose the tag values when saving/restoring (tags are RAZ/WI when >>>> PROT_MTE isn't set). >>>> >>>> An alternative approach would be to enforce the VMM provides PROT_MTE >>>> memory in the first place. This seems appealing to prevent the above >>>> potentially unexpected gotchas with save/restore, however this would >>>> also extend to memory that you might not expect to have PROT_MTE (e.g. a >>>> shared frame buffer for an emulated graphics card). >>> >>> As you mentioned above, if memory is mapped as Normal Cacheable at Stage >>> 2 (whether we use FWB or not), the guest is allowed to turn MTE on via >>> Stage 1. There is no way for KVM to prevent a guest from using MTE other >>> than the big HCR_EL2.ATA knob. >>> >>> This causes potential issues since we can't guarantee that all the >>> Cacheable memory slots allocated by the VMM support MTE. If they do not, >>> the arch behaviour is "unpredictable". We also can't trust the guest to >>> not enable MTE on such Cacheable mappings. >> >> Architecturally it seems dodgy to export any address that isn't "normal >> memory" (i.e. with tag storage) to the guest as Normal Cacheable. Although >> I'm a bit worried this might cause a regression in some existing case. > > What I had in mind is some persistent memory that may be given to the > guest for direct access. This is allowed to be cacheable (write-back) > but may not have tag storage. At the moment we don't have a good idea what would happen if/when the guest (or host) attempts to use that memory as tagged. 
If we have a relatively safe hardware behaviour (e.g. the tags are silently dropped/read-as-zero) then that's not a big issue. But if the accesses cause some form of abort then we need to understand how that would be handled. >>> On the host kernel, mmap'ing with PROT_MTE is only allowed for anonymous >>> mappings and shmem. So requiring the VMM to always pass PROT_MTE mapped >>> ranges to KVM, irrespective of whether it's guest RAM, emulated device, >>> virtio etc. (as long as they are Cacheable), filters unsafe ranges that >>> may be mapped into guest. >> >> That would be an easy way of doing the filtering, but it's not clear whether >> PROT_MTE is actually what the VMM wants (it most likely doesn't want to have >> tag checking enabled on the memory in user space). > > From the other sub-thread, yeah, we probably don't want to mandate > PROT_MTE because of potential inadvertent tag check faults in the VMM > itself. > >>> Note that in the next revision of the MTE patches I'll drop the DT >>> memory nodes checking and rely only on the CPUID information (arch >>> updated promised by the architects). >>> >>> I see two possible ways to handle this (there may be more): >>> >>> 1. As in your current patches, assume any Cacheable at Stage 2 can have >>> MTE enabled at Stage 1. In addition, we need to check whether the >>> physical memory supports MTE and it could be something simple like >>> pfn_valid(). Is there a way to reject a memory slot passed by the >>> VMM? >> >> Yes pfn_valid() should have been in there. At the moment pfn_to_page() is >> called without any checks. >> >> The problem with attempting to reject a memory slot is that the memory >> backing that slot can change. So checking at the time the slot is created >> isn't enough (although it might be a useful error checking feature). > > But isn't the slot changed as a result of another VMM call? So we could > always have such check in place. 
Once you have created a memslot the guest's view of memory follows the user space's address space. This is the KVM_CAP_SYNC_MMU capability. So there's nothing stopping a VMM adding a memslot backed with perfectly reasonable memory then mmap()ing over the top of it some memory which isn't MTE compatible. KVM gets told the memory is being removed (via mmu notifiers) but I think it waits for the next fault before (re)creating the stage 2 entries. >> It's not clear to me what we can do at fault time when we discover the >> memory isn't tag-capable and would have been mapped cacheable other than >> kill the VM. > > Indeed, I don't have a better idea other than trying not to get in this > situation. Sadly to me it looks like it's not really possible to avoid a (malicious) VMM getting us there. But we can certainly try to avoid a VMM accidentally ending up in the situation. >>> 2. Similar to 1 but instead of checking whether the pfn supports MTE, we >>> require the VMM to only pass PROT_MTE ranges (filtering already done >>> by the host kernel). We need a way to reject the slot and return an >>> error to the VMM. >>> >>> I think rejecting a slot at the Stage 2 fault time is very late. You >>> probably won't be able to do much other than killing the guest. >> >> As above, we will struggle to catch all cases during slot creation, so I >> think we're going to have to deal with this late detection as well. > > We can leave it in place as a safety check, killing the VM. My hope is > that we can detect slot creation subsequent changes. > >>> Both 1 and 2 above risk breaking existing VMMs just because they happen >>> to start on an MTE-capable machine. So, can we also require the VMM to >>> explicitly opt in to MTE support in guests via some ioctl()? This in >>> turn would enable the additional checks in KVM for the MTE capability of >>> the memory slots (1 or 2 above). >> >> Patch 2 introduces a VCPU feature which must be explicitly enabled for the >> guest to have MTE. 
So it won't break existing VMMs. However clearly simply >> setting that bit will likely break some configurations where not all memory >> is MTE capable. > > Ah, I missed that. At least we won't break current, unaware VMMs. I > suspect the CPUID is also conditioned by this explicit enabling. Yes it should be. Indeed we might need at least part of the first patch in your initial series to avoid the ID register showing support before we're ready for it, see the below chunk: > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c > index 51db934702b6..3ae008a9b0bd 100644 > --- a/arch/arm64/kvm/sys_regs.c > +++ b/arch/arm64/kvm/sys_regs.c > @@ -1095,6 +1095,8 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu, > if (!vcpu_has_sve(vcpu)) > val &= ~(0xfUL << ID_AA64PFR0_SVE_SHIFT); > val &= ~(0xfUL << ID_AA64PFR0_AMU_SHIFT); > + } else if (id == SYS_ID_AA64PFR1_EL1) { > + val &= ~(0xfUL << ID_AA64PFR1_MTE_SHIFT); > } else if (id == SYS_ID_AA64ISAR1_EL1 && !vcpu_has_ptrauth(vcpu)) { > val &= ~((0xfUL << ID_AA64ISAR1_APA_SHIFT) | > (0xfUL << ID_AA64ISAR1_API_SHIFT) Steve >>> An alternative to an MTE enable ioctl(), if all the memory slots are set >>> up prior to the VM starting, KVM could check 1 or 2 above and decide >>> whether to expose MTE to guests (HCR_EL2.ATA). >> >> The VMM needs to be fully aware of MTE before it's enabled by KVM otherwise >> it could lose the tags (e.g. during migration). So my preference is to make >> it an explicit opt-in. > > I agree. > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/2] MTE support for KVM guest 2020-06-24 14:59 ` Steven Price @ 2020-06-24 16:24 ` Catalin Marinas 2020-06-26 17:24 ` James Morse 0 siblings, 1 reply; 22+ messages in thread From: Catalin Marinas @ 2020-06-24 16:24 UTC (permalink / raw) To: Steven Price Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, Peter Maydell On Wed, Jun 24, 2020 at 03:59:35PM +0100, Steven Price wrote: > On 24/06/2020 15:21, Catalin Marinas wrote: > > On Wed, Jun 24, 2020 at 12:16:28PM +0100, Steven Price wrote: > > > On 23/06/2020 18:48, Catalin Marinas wrote: > > > > On Wed, Jun 17, 2020 at 01:38:42PM +0100, Steven Price wrote: > > > > > These patches add support to KVM to enable MTE within a guest. It is > > > > > based on Catalin's v4 MTE user space series[1]. > > > > > > > > > > [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com > > > > > > > > > > Posting as an RFC as I'd like feedback on the approach taken. First a > > > > > little background on how MTE fits within the architecture: > > > > > > > > > > The stage 2 page tables have limited scope for controlling the > > > > > availability of MTE. If a page is mapped as Normal and cached in stage 2 > > > > > then it's the stage 1 tables that get to choose whether the memory is > > > > > tagged or not. So the only way of forbidding tags on a page from the > > > > > hypervisor is to change the cacheability (or make it device memory) > > > > > which would cause other problems. Note this restriction fits the > > > > > intention that a system should have all (general purpose) memory > > > > > supporting tags if it support MTE, so it's not too surprising. > > > > > > > > > > However, the upshot of this is that to enable MTE within a guest all > > > > > pages of memory mapped into the guest as normal cached pages in stage 2 > > > > > *must* support MTE (i.e. 
we must ensure the tags are appropriately > > > > > sanitised and save/restore the tags during swap etc). > > > > > > > > > > My current approach is that KVM transparently upgrades any pages > > > > > provided by the VMM to be tag-enabled when they are faulted in (i.e. > > > > > sets the PG_mte_tagged flag on the page) which has the benefit of > > > > > requiring fewer changes in the VMM. However, save/restore of the VM > > > > > state still requires the VMM to have a PROT_MTE enabled mapping so that > > > > > it can access the tag values. A VMM which 'forgets' to enable PROT_MTE > > > > > would lose the tag values when saving/restoring (tags are RAZ/WI when > > > > > PROT_MTE isn't set). > > > > > > > > > > An alternative approach would be to enforce the VMM provides PROT_MTE > > > > > memory in the first place. This seems appealing to prevent the above > > > > > potentially unexpected gotchas with save/restore, however this would > > > > > also extend to memory that you might not expect to have PROT_MTE (e.g. a > > > > > shared frame buffer for an emulated graphics card). > > > > > > > > As you mentioned above, if memory is mapped as Normal Cacheable at Stage > > > > 2 (whether we use FWB or not), the guest is allowed to turn MTE on via > > > > Stage 1. There is no way for KVM to prevent a guest from using MTE other > > > > than the big HCR_EL2.ATA knob. > > > > > > > > This causes potential issues since we can't guarantee that all the > > > > Cacheable memory slots allocated by the VMM support MTE. If they do not, > > > > the arch behaviour is "unpredictable". We also can't trust the guest to > > > > not enable MTE on such Cacheable mappings. > > > > > > Architecturally it seems dodgy to export any address that isn't "normal > > > memory" (i.e. with tag storage) to the guest as Normal Cacheable. Although > > > I'm a bit worried this might cause a regression in some existing case. 
> > > > What I had in mind is some persistent memory that may be given to the > > guest for direct access. This is allowed to be cacheable (write-back) > > but may not have tag storage. > > At the moment we don't have a good idea what would happen if/when the guest > (or host) attempts to use that memory as tagged. If we have a relatively > safe hardware behaviour (e.g. the tags are silently dropped/read-as-zero) > then that's not a big issue. But if the accesses cause some form of abort > then we need to understand how that would be handled. The architecture is not prescriptive here, the behaviour is "unpredictable". It could mean tags read-as-zero/write-ignored or an SError. > > > > 1. As in your current patches, assume any Cacheable at Stage 2 can have > > > > MTE enabled at Stage 1. In addition, we need to check whether the > > > > physical memory supports MTE and it could be something simple like > > > > pfn_valid(). Is there a way to reject a memory slot passed by the > > > > VMM? > > > > > > Yes pfn_valid() should have been in there. At the moment pfn_to_page() is > > > called without any checks. > > > > > > The problem with attempting to reject a memory slot is that the memory > > > backing that slot can change. So checking at the time the slot is created > > > isn't enough (although it might be a useful error checking feature). > > > > But isn't the slot changed as a result of another VMM call? So we could > > always have such check in place. > > Once you have created a memslot the guest's view of memory follows the user > space's address space. This is the KVM_CAP_SYNC_MMU capability. So there's > nothing stopping a VMM adding a memslot backed with perfectly reasonable > memory then mmap()ing over the top of it some memory which isn't MTE > compatible. KVM gets told the memory is being removed (via mmu notifiers) > but I think it waits for the next fault before (re)creating the stage 2 > entries. 
OK, so that's where we could kill the guest if the VMM doesn't play nicely. It means that we need the check when setting up the stage 2 entry. I guess it's fine if we only have the check at that point and ignore it on KVM_SET_USER_MEMORY_REGION. It would be nice if we returned on error on slot setup but we may not know (yet) whether the VMM intends to enable MTE for the guest. > > > It's not clear to me what we can do at fault time when we discover the > > > memory isn't tag-capable and would have been mapped cacheable other than > > > kill the VM. > > > > Indeed, I don't have a better idea other than trying not to get in this > > situation. > > Sadly to me it looks like it's not really possible to avoid a (malicious) > VMM getting us there. But we can certainly try to avoid a VMM accidentally > ending up in the situation. What we need to avoid is a malicious VMM affecting the host kernel. If the VMM wants to corrupt the guest, I think it has other ways already even without MTE. -- Catalin ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/2] MTE support for KVM guest 2020-06-24 16:24 ` Catalin Marinas @ 2020-06-26 17:24 ` James Morse 0 siblings, 0 replies; 22+ messages in thread From: James Morse @ 2020-06-26 17:24 UTC (permalink / raw) To: Catalin Marinas, Steven Price Cc: Marc Zyngier, Will Deacon, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, Peter Maydell Hi guys, On 24/06/2020 17:24, Catalin Marinas wrote: > On Wed, Jun 24, 2020 at 03:59:35PM +0100, Steven Price wrote: >> On 24/06/2020 15:21, Catalin Marinas wrote: >>> On Wed, Jun 24, 2020 at 12:16:28PM +0100, Steven Price wrote: >>>> On 23/06/2020 18:48, Catalin Marinas wrote: >>>>> This causes potential issues since we can't guarantee that all the >>>>> Cacheable memory slots allocated by the VMM support MTE. If they do not, >>>>> the arch behaviour is "unpredictable". We also can't trust the guest to >>>>> not enable MTE on such Cacheable mappings. >>>> >>>> Architecturally it seems dodgy to export any address that isn't "normal >>>> memory" (i.e. with tag storage) to the guest as Normal Cacheable. Although >>>> I'm a bit worried this might cause a regression in some existing case. >>> >>> What I had in mind is some persistent memory that may be given to the >>> guest for direct access. This is allowed to be cacheable (write-back) >>> but may not have tag storage. >> >> At the moment we don't have a good idea what would happen if/when the guest >> (or host) attempts to use that memory as tagged. If we have a relatively >> safe hardware behaviour (e.g. the tags are silently dropped/read-as-zero) >> then that's not a big issue. But if the accesses cause some form of abort >> then we need to understand how that would be handled. > > The architecture is not prescriptive here, the behaviour is > "unpredictable". It could mean tags read-as-zero/write-ignored or an > SError. 
This surely is the same as treating a VFIO device as memory and performing some unsupported operation on it. I thought the DT 'which memory ranges' description for MTE was removed. Wouldn't the rules for a guest be the same? If you enable MTE, everything described as memory must support MTE. Something like persistent memory then can't be described as memory, ... we have the same problem on the host. >>>>> 1. As in your current patches, assume any Cacheable at Stage 2 can have >>>>> MTE enabled at Stage 1. In addition, we need to check whether the >>>>> physical memory supports MTE and it could be something simple like >>>>> pfn_valid(). Is there a way to reject a memory slot passed by the >>>>> VMM? >>>> >>>> Yes pfn_valid() should have been in there. At the moment pfn_to_page() is >>>> called without any checks. >>>> >>>> The problem with attempting to reject a memory slot is that the memory >>>> backing that slot can change. So checking at the time the slot is created >>>> isn't enough (although it might be a useful error checking feature). >>> >>> But isn't the slot changed as a result of another VMM call? So we could >>> always have such check in place. >> >> Once you have created a memslot the guest's view of memory follows the user >> space's address space. This is the KVM_CAP_SYNC_MMU capability. So there's >> nothing stopping a VMM adding a memslot backed with perfectly reasonable >> memory then mmap()ing over the top of it some memory which isn't MTE >> compatible. KVM gets told the memory is being removed (via mmu notifiers) >> but I think it waits for the next fault before (re)creating the stage 2 >> entries. (indeed, stage2 is pretty lazy) > OK, so that's where we could kill the guest if the VMM doesn't play > nicely. It means that we need the check when setting up the stage 2 > entry. I guess it's fine if we only have the check at that point and > ignore it on KVM_SET_USER_MEMORY_REGION. 
It would be nice if we returned > on error on slot setup but > we may not know (yet) whether the VMM intends to enable MTE for the guest. We don't. Memory slots take the VM-fd, whereas the easy-to-add feature bits are per-vcpu. Packing features into the 'type' that create-vm takes is a problem once we run out, although the existing user is the IPA space size, and MTE is a property of the memory system. The meaning of the flag is then "I described this as memory, only let the guest access memory through this range that is MTE capable". What do we do when that is violated? Tell the VMM is the nicest, but its not something we ever expect to happen. I guess an abort is what real hardware would do, (if firmware magically turned off MTE while it was in use). This would need to be kvm's inject_abt64(), as otherwise the vcpu may take the stage2 fault again, forever. For kvm_set_spte_hva() we can't inject an abort (which vcpu?), so not mapping the page and waiting for the guest to access it is the only option... Thanks, James ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/2] MTE support for KVM guest
From: Peter Maydell @ 2020-06-23 18:05 UTC
To: Steven Price
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, Dave Martin, lkml - Kernel Mailing List, Thomas Gleixner, kvmarm, arm-mail-list

On Wed, 17 Jun 2020 at 13:39, Steven Price <steven.price@arm.com> wrote:
>
> These patches add support to KVM to enable MTE within a guest. It is
> based on Catalin's v4 MTE user space series[1].
>
> [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
>
> Posting as an RFC as I'd like feedback on the approach taken.

What's your plan for handling tags across VM migration? Will the kernel
expose the tag ram to userspace so we can copy it from the source machine
to the destination at the same time as we copy the actual ram contents?

thanks
-- PMM
* Re: [RFC PATCH 0/2] MTE support for KVM guest
  2020-06-23 18:05 ` Peter Maydell
@ 2020-06-24  9:38   ` Catalin Marinas
  2020-06-24 10:34     ` Dave Martin
  0 siblings, 1 reply; 22+ messages in thread
From: Catalin Marinas @ 2020-06-24 9:38 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Steven Price, Marc Zyngier, Will Deacon, Dave Martin,
	lkml - Kernel Mailing List, Thomas Gleixner, kvmarm, arm-mail-list

On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:
> On Wed, 17 Jun 2020 at 13:39, Steven Price <steven.price@arm.com> wrote:
> > These patches add support to KVM to enable MTE within a guest. It is
> > based on Catalin's v4 MTE user space series[1].
> >
> > [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
> >
> > Posting as an RFC as I'd like feedback on the approach taken.
>
> What's your plan for handling tags across VM migration?
> Will the kernel expose the tag ram to userspace so we
> can copy it from the source machine to the destination
> at the same time as we copy the actual ram contents?

Qemu can map the guest memory with PROT_MTE and access the tags directly
with LDG/STG instructions. Steven was actually asking in the cover
letter whether we should require that the VMM maps the guest memory with
PROT_MTE as a guarantee that it can access the guest tags.

There is no architecturally visible tag ram (tag storage), that's a
microarchitecture detail.

-- 
Catalin
* Re: [RFC PATCH 0/2] MTE support for KVM guest
  2020-06-24  9:38 ` Catalin Marinas
@ 2020-06-24 10:34   ` Dave Martin
  2020-06-24 11:03     ` Steven Price
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Martin @ 2020-06-24 10:34 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Peter Maydell, Marc Zyngier, lkml - Kernel Mailing List,
	Steven Price, kvmarm, Thomas Gleixner, Will Deacon, arm-mail-list

On Wed, Jun 24, 2020 at 10:38:48AM +0100, Catalin Marinas wrote:
> On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:
> > On Wed, 17 Jun 2020 at 13:39, Steven Price <steven.price@arm.com> wrote:
> > > These patches add support to KVM to enable MTE within a guest. It is
> > > based on Catalin's v4 MTE user space series[1].
> > >
> > > [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
> > >
> > > Posting as an RFC as I'd like feedback on the approach taken.
> >
> > What's your plan for handling tags across VM migration?
> > Will the kernel expose the tag ram to userspace so we
> > can copy it from the source machine to the destination
> > at the same time as we copy the actual ram contents?
>
> Qemu can map the guest memory with PROT_MTE and access the tags directly
> with LDG/STG instructions. Steven was actually asking in the cover
> letter whether we should require that the VMM maps the guest memory with
> PROT_MTE as a guarantee that it can access the guest tags.
>
> There is no architecturally visible tag ram (tag storage), that's a
> microarchitecture detail.

If userspace maps the guest memory with PROT_MTE for dump purposes,
isn't it going to get tag check faults when accessing the memory
(i.e., when dumping the regular memory content, not the tags
specifically)?

Does it need to map two aliases, one with PROT_MTE and one without,
and is that architecturally valid?

Cheers
---Dave
* Re: [RFC PATCH 0/2] MTE support for KVM guest
  2020-06-24 10:34 ` Dave Martin
@ 2020-06-24 11:03   ` Steven Price
  2020-06-24 11:09     ` Catalin Marinas
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Price @ 2020-06-24 11:03 UTC (permalink / raw)
  To: Dave Martin, Catalin Marinas
  Cc: Peter Maydell, Marc Zyngier, lkml - Kernel Mailing List, kvmarm,
	Thomas Gleixner, Will Deacon, arm-mail-list

On 24/06/2020 11:34, Dave Martin wrote:
> On Wed, Jun 24, 2020 at 10:38:48AM +0100, Catalin Marinas wrote:
>> On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:
>>> On Wed, 17 Jun 2020 at 13:39, Steven Price <steven.price@arm.com> wrote:
>>>> These patches add support to KVM to enable MTE within a guest. It is
>>>> based on Catalin's v4 MTE user space series[1].
>>>>
>>>> [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
>>>>
>>>> Posting as an RFC as I'd like feedback on the approach taken.
>>>
>>> What's your plan for handling tags across VM migration?
>>> Will the kernel expose the tag ram to userspace so we
>>> can copy it from the source machine to the destination
>>> at the same time as we copy the actual ram contents?
>>
>> Qemu can map the guest memory with PROT_MTE and access the tags directly
>> with LDG/STG instructions. Steven was actually asking in the cover
>> letter whether we should require that the VMM maps the guest memory with
>> PROT_MTE as a guarantee that it can access the guest tags.
>>
>> There is no architecturally visible tag ram (tag storage), that's a
>> microarchitecture detail.
>
> If userspace maps the guest memory with PROT_MTE for dump purposes,
> isn't it going to get tag check faults when accessing the memory
> (i.e., when dumping the regular memory content, not the tags
> specifically)?
>
> Does it need to map two aliases, one with PROT_MTE and one without,
> and is that architecturally valid?

Userspace would either need to have two mappings (I don't believe there
are any architectural issues with that - but this could be awkward to
arrange in some situations) or be careful to avoid faults. Basically
your choices with one mapping are:

1. Disable tag checking (using prctl) when touching the memory. This
works but means you lose tag checking for the VMM's own accesses during
this code sequence.

2. Read the tag values and ensure you use the correct tag. This suffers
from race conditions if the VM is still running.

3. Use one of the exceptions in the architecture that generates a Tag
Unchecked access. Sadly the only remotely useful thing I can see in the
v8 ARM is "A base register plus immediate offset addressing form, with
the SP as the base register." - but making sure SP is in range of where
you want to access would be a pain.

The kernel could provide a mechanism to do this, but I'm not sure that
would be better than 1.

This, however, is another argument for my current approach of
"upgrading" the pages automatically and not forcing the VMM to set
PROT_MTE. But in this case we probably would need a kernel interface to
fetch the tags as the VM sees them.

Steve
* Re: [RFC PATCH 0/2] MTE support for KVM guest
  2020-06-24 11:03 ` Steven Price
@ 2020-06-24 11:09   ` Catalin Marinas
  2020-06-24 11:18     ` Steven Price
  0 siblings, 1 reply; 22+ messages in thread
From: Catalin Marinas @ 2020-06-24 11:09 UTC (permalink / raw)
  To: Steven Price
  Cc: Dave Martin, Peter Maydell, Marc Zyngier,
	lkml - Kernel Mailing List, kvmarm, Thomas Gleixner, Will Deacon,
	arm-mail-list

On Wed, Jun 24, 2020 at 12:03:35PM +0100, Steven Price wrote:
> On 24/06/2020 11:34, Dave Martin wrote:
> > On Wed, Jun 24, 2020 at 10:38:48AM +0100, Catalin Marinas wrote:
> > > On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:
> > > > On Wed, 17 Jun 2020 at 13:39, Steven Price <steven.price@arm.com> wrote:
> > > > > These patches add support to KVM to enable MTE within a guest. It is
> > > > > based on Catalin's v4 MTE user space series[1].
> > > > >
> > > > > [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
> > > > >
> > > > > Posting as an RFC as I'd like feedback on the approach taken.
> > > >
> > > > What's your plan for handling tags across VM migration?
> > > > Will the kernel expose the tag ram to userspace so we
> > > > can copy it from the source machine to the destination
> > > > at the same time as we copy the actual ram contents?
> > >
> > > Qemu can map the guest memory with PROT_MTE and access the tags directly
> > > with LDG/STG instructions. Steven was actually asking in the cover
> > > letter whether we should require that the VMM maps the guest memory with
> > > PROT_MTE as a guarantee that it can access the guest tags.
> > >
> > > There is no architecturally visible tag ram (tag storage), that's a
> > > microarchitecture detail.
> >
> > If userspace maps the guest memory with PROT_MTE for dump purposes,
> > isn't it going to get tag check faults when accessing the memory
> > (i.e., when dumping the regular memory content, not the tags
> > specifically)?
> >
> > Does it need to map two aliases, one with PROT_MTE and one without,
> > and is that architecturally valid?
>
> Userspace would either need to have two mappings (I don't believe there are
> any architectural issues with that - but this could be awkward to arrange in
> some situations) or be careful to avoid faults. Basically your choices with
> one mapping are:
>
> 1. Disable tag checking (using prctl) when touching the memory. This works
> but means you lose tag checking for the VMM's own accesses during this code
> sequence.
>
> 2. Read the tag values and ensure you use the correct tag. This suffers
> from race conditions if the VM is still running.
>
> 3. Use one of the exceptions in the architecture that generates a Tag
> Unchecked access. Sadly the only remotely useful thing I can see in the v8
> ARM is "A base register plus immediate offset addressing form, with the SP
> as the base register." - but making sure SP is in range of where you want to
> access would be a pain.

Or:

4. Set PSTATE.TCO when accessing tagged memory in an unsafe way.

-- 
Catalin
* Re: [RFC PATCH 0/2] MTE support for KVM guest
  2020-06-24 11:09 ` Catalin Marinas
@ 2020-06-24 11:18   ` Steven Price
  2020-06-24 11:52     ` Catalin Marinas
  2020-06-24 13:16     ` Peter Maydell
  0 siblings, 2 replies; 22+ messages in thread
From: Steven Price @ 2020-06-24 11:18 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Dave P Martin, Peter Maydell, Marc Zyngier,
	lkml - Kernel Mailing List, kvmarm, Thomas Gleixner, Will Deacon,
	arm-mail-list

On 24/06/2020 12:09, Catalin Marinas wrote:
> On Wed, Jun 24, 2020 at 12:03:35PM +0100, Steven Price wrote:
>> On 24/06/2020 11:34, Dave Martin wrote:
>>> On Wed, Jun 24, 2020 at 10:38:48AM +0100, Catalin Marinas wrote:
>>>> On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:
>>>>> On Wed, 17 Jun 2020 at 13:39, Steven Price <steven.price@arm.com> wrote:
>>>>>> These patches add support to KVM to enable MTE within a guest. It is
>>>>>> based on Catalin's v4 MTE user space series[1].
>>>>>>
>>>>>> [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
>>>>>>
>>>>>> Posting as an RFC as I'd like feedback on the approach taken.
>>>>>
>>>>> What's your plan for handling tags across VM migration?
>>>>> Will the kernel expose the tag ram to userspace so we
>>>>> can copy it from the source machine to the destination
>>>>> at the same time as we copy the actual ram contents?
>>>>
>>>> Qemu can map the guest memory with PROT_MTE and access the tags directly
>>>> with LDG/STG instructions. Steven was actually asking in the cover
>>>> letter whether we should require that the VMM maps the guest memory with
>>>> PROT_MTE as a guarantee that it can access the guest tags.
>>>>
>>>> There is no architecturally visible tag ram (tag storage), that's a
>>>> microarchitecture detail.
>>>
>>> If userspace maps the guest memory with PROT_MTE for dump purposes,
>>> isn't it going to get tag check faults when accessing the memory
>>> (i.e., when dumping the regular memory content, not the tags
>>> specifically)?
>>>
>>> Does it need to map two aliases, one with PROT_MTE and one without,
>>> and is that architecturally valid?
>>
>> Userspace would either need to have two mappings (I don't believe there are
>> any architectural issues with that - but this could be awkward to arrange in
>> some situations) or be careful to avoid faults. Basically your choices with
>> one mapping are:
>>
>> 1. Disable tag checking (using prctl) when touching the memory. This works
>> but means you lose tag checking for the VMM's own accesses during this code
>> sequence.
>>
>> 2. Read the tag values and ensure you use the correct tag. This suffers
>> from race conditions if the VM is still running.
>>
>> 3. Use one of the exceptions in the architecture that generates a Tag
>> Unchecked access. Sadly the only remotely useful thing I can see in the v8
>> ARM is "A base register plus immediate offset addressing form, with the SP
>> as the base register." - but making sure SP is in range of where you want to
>> access would be a pain.
>
> Or:
>
> 4. Set PSTATE.TCO when accessing tagged memory in an unsafe way.

Ah yes, similar to (1) but much lower overhead ;) That's probably the
best option - it can be hidden in a memcpy_ignoring_tags() function.
However it still means that the VMM can't directly touch the guest's
memory which might cause issues for the VMM.

Steve
* Re: [RFC PATCH 0/2] MTE support for KVM guest
  2020-06-24 11:18 ` Steven Price
@ 2020-06-24 11:52   ` Catalin Marinas
  0 siblings, 0 replies; 22+ messages in thread
From: Catalin Marinas @ 2020-06-24 11:52 UTC (permalink / raw)
  To: Steven Price
  Cc: Dave P Martin, Peter Maydell, Marc Zyngier,
	lkml - Kernel Mailing List, kvmarm, Thomas Gleixner, Will Deacon,
	arm-mail-list

On Wed, Jun 24, 2020 at 12:18:46PM +0100, Steven Price wrote:
> On 24/06/2020 12:09, Catalin Marinas wrote:
> > On Wed, Jun 24, 2020 at 12:03:35PM +0100, Steven Price wrote:
> > > On 24/06/2020 11:34, Dave Martin wrote:
> > > > On Wed, Jun 24, 2020 at 10:38:48AM +0100, Catalin Marinas wrote:
> > > > > On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:
> > > > > > On Wed, 17 Jun 2020 at 13:39, Steven Price <steven.price@arm.com> wrote:
> > > > > > > These patches add support to KVM to enable MTE within a guest. It is
> > > > > > > based on Catalin's v4 MTE user space series[1].
> > > > > > >
> > > > > > > [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
> > > > > > >
> > > > > > > Posting as an RFC as I'd like feedback on the approach taken.
> > > > > >
> > > > > > What's your plan for handling tags across VM migration?
> > > > > > Will the kernel expose the tag ram to userspace so we
> > > > > > can copy it from the source machine to the destination
> > > > > > at the same time as we copy the actual ram contents?
> > > > >
> > > > > Qemu can map the guest memory with PROT_MTE and access the tags directly
> > > > > with LDG/STG instructions. Steven was actually asking in the cover
> > > > > letter whether we should require that the VMM maps the guest memory with
> > > > > PROT_MTE as a guarantee that it can access the guest tags.
> > > > >
> > > > > There is no architecturally visible tag ram (tag storage), that's a
> > > > > microarchitecture detail.
> > > >
> > > > If userspace maps the guest memory with PROT_MTE for dump purposes,
> > > > isn't it going to get tag check faults when accessing the memory
> > > > (i.e., when dumping the regular memory content, not the tags
> > > > specifically)?
> > > >
> > > > Does it need to map two aliases, one with PROT_MTE and one without,
> > > > and is that architecturally valid?
> > >
> > > Userspace would either need to have two mappings (I don't believe there are
> > > any architectural issues with that - but this could be awkward to arrange in
> > > some situations) or be careful to avoid faults. Basically your choices with
> > > one mapping are:
> > >
> > > 1. Disable tag checking (using prctl) when touching the memory. This works
> > > but means you lose tag checking for the VMM's own accesses during this code
> > > sequence.
> > >
> > > 2. Read the tag values and ensure you use the correct tag. This suffers
> > > from race conditions if the VM is still running.
> > >
> > > 3. Use one of the exceptions in the architecture that generates a Tag
> > > Unchecked access. Sadly the only remotely useful thing I can see in the v8
> > > ARM is "A base register plus immediate offset addressing form, with the SP
> > > as the base register." - but making sure SP is in range of where you want to
> > > access would be a pain.
> >
> > Or:
> >
> > 4. Set PSTATE.TCO when accessing tagged memory in an unsafe way.
>
> Ah yes, similar to (1) but much lower overhead ;) That's probably the best
> option - it can be hidden in a memcpy_ignoring_tags() function. However it
> still means that the VMM can't directly touch the guest's memory which might
> cause issues for the VMM.

You are right, I don't think it's safe for the VMM to access the guest
memory via a PROT_MTE mapping. If the guest is using memory tagging for
a buffer and then it is passed to qemu for virtio, the tag information
may have been lost already (does qemu only get the IPA in this case?)

So we may end up with two mappings after all, one for the normal
execution and a new one if migration is needed.

-- 
Catalin
* Re: [RFC PATCH 0/2] MTE support for KVM guest
  2020-06-24 11:18 ` Steven Price
  2020-06-24 11:52 ` Catalin Marinas
@ 2020-06-24 13:16 ` Peter Maydell
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Maydell @ 2020-06-24 13:16 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Dave P Martin, Marc Zyngier,
	lkml - Kernel Mailing List, kvmarm, Thomas Gleixner, Will Deacon,
	arm-mail-list

On Wed, 24 Jun 2020 at 12:18, Steven Price <steven.price@arm.com> wrote:
> Ah yes, similar to (1) but much lower overhead ;) That's probably the
> best option - it can be hidden in a memcpy_ignoring_tags() function.
> However it still means that the VMM can't directly touch the guest's
> memory which might cause issues for the VMM.

That's kind of awkward, since in general QEMU assumes it can naturally
just access guest RAM[*] (eg emulation of DMAing devices, virtio,
graphics display, gdb stub memory accesses). It would be nicer to be
able to do it the other way around, maybe, so that the current APIs
give you "just the memory", and if you really want to do tagged
accesses to guest RAM you can do it with tag-specific APIs.

I haven't thought about this very much though and haven't read enough
of the MTE spec recently enough to make much sensible comment. So
mostly what I'm trying to encourage here is that the people
implementing the KVM/kernel side of this API also think about the
userspace side of it, so we get one coherent design rather than a
half-a-product that turns out to be awkward to use :-)

[*] "guest ram is encrypted" also breaks this assumption, of course; I
haven't looked at the efforts in that direction that are already in
QEMU to see how they work, though.

thanks
-- PMM