Subject: [PATCH v10 0/6] MTE support for KVM guest
From: Steven Price
Date: 2021-03-12 15:18 UTC
To: Catalin Marinas, Marc Zyngier, Will Deacon
Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

This series adds support for using the Arm Memory Tagging Extensions
(MTE) in a KVM guest.

This version is rebased on v5.12-rc2.

Changes since v9[1]:

 * Check fault_status in user_mem_abort() to avoid unnecessarily
   checking whether tags need clearing when handling permission faults.
 * The MTE CPU feature exposed is now 0b10 explicitly rather than the
   host's CPU feature. This prevents problems when a newer MTE version
   is supported by the host CPU.
 * Add a couple of reserved u64s to struct kvm_arm_copy_mte_tags for
   potential future expansion (and check they are 0 for now).
 * Correctly hold slots_lock during the ioctl (rather than
   embarrassingly not doing any locking as before...).
 * Add the structure definition to the documentation, along with some
   improvements suggested by Peter.

[1] https://lore.kernel.org/r/20210301142315.30920-1-steven.price%40arm.com

Steven Price (6):
  arm64: mte: Sync tags for pages where PTE is untagged
  arm64: kvm: Introduce MTE VM feature
  arm64: kvm: Save/restore MTE registers
  arm64: kvm: Expose KVM_ARM_CAP_MTE
  KVM: arm64: ioctl to fetch/store tags in a guest
  KVM: arm64: Document MTE capability and ioctl

 Documentation/virt/kvm/api.rst             | 53 +++++++++++++++
 arch/arm64/include/asm/kvm_emulate.h       |  3 +
 arch/arm64/include/asm/kvm_host.h          |  9 +++
 arch/arm64/include/asm/kvm_mte.h           | 66 ++++++++++++++++++
 arch/arm64/include/asm/pgtable.h           |  2 +-
 arch/arm64/include/asm/sysreg.h            |  3 +-
 arch/arm64/include/uapi/asm/kvm.h          | 14 ++++
 arch/arm64/kernel/asm-offsets.c            |  3 +
 arch/arm64/kernel/mte.c                    | 16 +++--
 arch/arm64/kvm/arm.c                       | 78 ++++++++++++++++++++++
 arch/arm64/kvm/hyp/entry.S                 |  7 ++
 arch/arm64/kvm/hyp/exception.c             |  3 +-
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 ++++++
 arch/arm64/kvm/mmu.c                       | 16 +++++
 arch/arm64/kvm/sys_regs.c                  | 28 ++++++--
 include/uapi/linux/kvm.h                   |  2 +
 16 files changed, 313 insertions(+), 11 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

-- 
2.20.1
Subject: [PATCH v10 1/6] arm64: mte: Sync tags for pages where PTE is untagged
From: Steven Price
Date: 2021-03-12 15:18 UTC
To: Catalin Marinas, Marc Zyngier, Will Deacon
Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we will
need to check to see if there are any saved tags even if !pte_tagged().

However don't check pages which are !pte_valid_user() as these will
not have been swapped out.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/pgtable.h |  2 +-
 arch/arm64/kernel/mte.c          | 16 ++++++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e17b96d0e4b5..84166625c989 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 		__sync_icache_dcache(pte);
 
 	if (system_supports_mte() &&
-	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
+	    pte_present(pte) && pte_valid_user(pte) && !pte_special(pte))
 		mte_sync_tags(ptep, pte);
 
 	__check_racy_pte_update(mm, ptep, pte);
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index b3c70a612c7a..e016ab57ea36 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -26,17 +26,23 @@ u64 gcr_kernel_excl __ro_after_init;
 
 static bool report_fault_once = true;
 
-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
+static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
+			       bool pte_is_tagged)
 {
 	pte_t old_pte = READ_ONCE(*ptep);
 
 	if (check_swap && is_swap_pte(old_pte)) {
 		swp_entry_t entry = pte_to_swp_entry(old_pte);
 
-		if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
+		if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+			set_bit(PG_mte_tagged, &page->flags);
 			return;
+		}
 	}
 
+	if (!pte_is_tagged || test_and_set_bit(PG_mte_tagged, &page->flags))
+		return;
+
 	page_kasan_tag_reset(page);
 	/*
 	 * We need smp_wmb() in between setting the flags and clearing the
@@ -54,11 +60,13 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
 	struct page *page = pte_page(pte);
 	long i, nr_pages = compound_nr(page);
 	bool check_swap = nr_pages == 1;
+	bool pte_is_tagged = pte_tagged(pte);
 
 	/* if PG_mte_tagged is set, tags have already been initialised */
 	for (i = 0; i < nr_pages; i++, page++) {
-		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
-			mte_sync_page_tags(page, ptep, check_swap);
+		if (!test_bit(PG_mte_tagged, &page->flags))
+			mte_sync_page_tags(page, ptep, check_swap,
+					   pte_is_tagged);
 	}
 }

-- 
2.20.1
Subject: Re: [PATCH v10 1/6] arm64: mte: Sync tags for pages where PTE is untagged
From: Catalin Marinas
Date: 2021-03-26 18:56 UTC
To: Steven Price
Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

Hi Steven,

On Fri, Mar 12, 2021 at 03:18:57PM +0000, Steven Price wrote:
> A KVM guest could store tags in a page even if the VMM hasn't mapped
> the page with PROT_MTE. So when restoring pages from swap we will
> need to check to see if there are any saved tags even if !pte_tagged().
> 
> However don't check pages which are !pte_valid_user() as these will
> not have been swapped out.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/include/asm/pgtable.h |  2 +-
>  arch/arm64/kernel/mte.c          | 16 ++++++++++++----
>  2 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index e17b96d0e4b5..84166625c989 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>  		__sync_icache_dcache(pte);
>  
>  	if (system_supports_mte() &&
> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
> +	    pte_present(pte) && pte_valid_user(pte) && !pte_special(pte))
>  		mte_sync_tags(ptep, pte);

With the EPAN patches queued in for-next/epan, pte_valid_user()
disappeared as its semantics weren't very clear.

So this relies on the set_pte_at() being done on the VMM address space.
I wonder, if the VMM did an mprotect(PROT_NONE), can the VM still access
it via stage 2? If yes, the pte_valid_user() test wouldn't work. We need
something like pte_present() && addr <= user_addr_max().

BTW, ignoring virtualisation, can we ever bring a page in from swap on a
PROT_NONE mapping (say fault-around)? It's not too bad if we keep the
metadata around for when the pte becomes accessible but I suspect we
remove it if the page is removed from swap.

-- 
Catalin
* Re: [PATCH v10 1/6] arm64: mte: Sync tags for pages where PTE is untagged 2021-03-26 18:56 ` Catalin Marinas (?) (?) @ 2021-03-29 15:55 ` Steven Price -1 siblings, 0 replies; 112+ messages in thread From: Steven Price @ 2021-03-29 15:55 UTC (permalink / raw) To: Catalin Marinas Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela, Dr. David Alan Gilbert, Richard Henderson, Peter Maydell, Haibo Xu, Andrew Jones On 26/03/2021 18:56, Catalin Marinas wrote: > Hi Steven, > > On Fri, Mar 12, 2021 at 03:18:57PM +0000, Steven Price wrote: >> A KVM guest could store tags in a page even if the VMM hasn't mapped >> the page with PROT_MTE. So when restoring pages from swap we will >> need to check to see if there are any saved tags even if !pte_tagged(). >> >> However don't check pages which are !pte_valid_user() as these will >> not have been swapped out. >> >> Signed-off-by: Steven Price <steven.price@arm.com> >> --- >> arch/arm64/include/asm/pgtable.h | 2 +- >> arch/arm64/kernel/mte.c | 16 ++++++++++++---- >> 2 files changed, 13 insertions(+), 5 deletions(-) >> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >> index e17b96d0e4b5..84166625c989 100644 >> --- a/arch/arm64/include/asm/pgtable.h >> +++ b/arch/arm64/include/asm/pgtable.h >> @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr, >> __sync_icache_dcache(pte); >> >> if (system_supports_mte() && >> - pte_present(pte) && pte_tagged(pte) && !pte_special(pte)) >> + pte_present(pte) && pte_valid_user(pte) && !pte_special(pte)) >> mte_sync_tags(ptep, pte); > > With the EPAN patches queued in for-next/epan, pte_valid_user() > disappeared as its semantics weren't very clear. Thanks for pointing that out. > So this relies on the set_pte_at() being done on the VMM address space. 
> I wonder, if the VMM did an mprotect(PROT_NONE), can the VM still access > it via stage 2? If yes, the pte_valid_user() test wouldn't work. We need > something like pte_present() && addr <= user_addr_max(). AFAIUI the stage 2 matches the VMM's address space (for the subset that has memslots). So mprotect(PROT_NONE) would cause the stage 2 mapping to be invalidated and a subsequent fault would exit to the VMM to sort out. This sort of thing is done for the lazy migration use case (i.e. pages are fetched as the VM tries to access them). > BTW, ignoring virtualisation, can we ever bring a page in from swap on a > PROT_NONE mapping (say fault-around)? It's not too bad if we keep the > metadata around for when the pte becomes accessible but I suspect we > remove it if the page is removed from swap. There are two stages of bringing data from swap. First is populating the swap cache by doing the physical read from swap. The second is actually restoring the page table entries. Clearly the first part can happen even with PROT_NONE (the simple case is there's another mapping which is !PROT_NONE). For the second I'm a little hazy on exactly what happens when you do a 'swapoff' - that may cause a page to be re-inserted into a page table without a fault. If you follow the chain down from try_to_unuse() you end up at a call to set_pte_at(). So we need set_pte_at() to handle a PROT_NONE mapping. So I guess the test we really want here is just (pte_val() & PTE_USER). Steve ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 1/6] arm64: mte: Sync tags for pages where PTE is untagged 2021-03-29 15:55 ` Steven Price (?) (?) @ 2021-03-30 10:13 ` Catalin Marinas -1 siblings, 0 replies; 112+ messages in thread From: Catalin Marinas @ 2021-03-30 10:13 UTC (permalink / raw) To: Steven Price Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela, Dr. David Alan Gilbert, Richard Henderson, Peter Maydell, Haibo Xu, Andrew Jones On Mon, Mar 29, 2021 at 04:55:29PM +0100, Steven Price wrote: > On 26/03/2021 18:56, Catalin Marinas wrote: > > On Fri, Mar 12, 2021 at 03:18:57PM +0000, Steven Price wrote: > > > A KVM guest could store tags in a page even if the VMM hasn't mapped > > > the page with PROT_MTE. So when restoring pages from swap we will > > > need to check to see if there are any saved tags even if !pte_tagged(). > > > > > > However don't check pages which are !pte_valid_user() as these will > > > not have been swapped out. > > > > > > Signed-off-by: Steven Price <steven.price@arm.com> > > > --- > > > arch/arm64/include/asm/pgtable.h | 2 +- > > > arch/arm64/kernel/mte.c | 16 ++++++++++++---- > > > 2 files changed, 13 insertions(+), 5 deletions(-) > > > > > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > > > index e17b96d0e4b5..84166625c989 100644 > > > --- a/arch/arm64/include/asm/pgtable.h > > > +++ b/arch/arm64/include/asm/pgtable.h > > > @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr, > > > __sync_icache_dcache(pte); > > > if (system_supports_mte() && > > > - pte_present(pte) && pte_tagged(pte) && !pte_special(pte)) > > > + pte_present(pte) && pte_valid_user(pte) && !pte_special(pte)) > > > mte_sync_tags(ptep, pte); > > > > With the EPAN patches queued in for-next/epan, pte_valid_user() > > disappeared as its semantics weren't very clear. 
> > Thanks for pointing that out. > > > So this relies on the set_pte_at() being done on the VMM address space. > > I wonder, if the VMM did an mprotect(PROT_NONE), can the VM still access > > it via stage 2? If yes, the pte_valid_user() test wouldn't work. We need > > something like pte_present() && addr <= user_addr_max(). > > AFAIUI the stage 2 matches the VMM's address space (for the subset that has > memslots). So mprotect(PROT_NONE) would cause the stage 2 mapping to be > invalidated and a subsequent fault would exit to the VMM to sort out. This > sort of thing is done for the lazy migration use case (i.e. pages are > fetched as the VM tries to access them). There's also the protected KVM case which IIUC wouldn't provide any mapping of the guest memory to the host (or maybe the host still thinks it's there but cannot access it without a Stage 2 fault). At least in this case it wouldn't swap pages out and it would be the responsibility of the EL2 code to clear the tags when giving pages to the guest (user_mem_abort() must not touch the page). So basically we either have a valid, accessible mapping in the VMM and we can handle the tags via set_pte_at() or we leave it to whatever is running at EL2 in the pKVM case. I don't remember whether we had a clear conclusion in the past: have we ruled out requiring the VMM to map the guest memory with PROT_MTE entirely? IIRC a potential problem was the VMM using MTE itself and having to disable it when accessing the guest memory. Another potential issue (I haven't got my head around it yet) is a race in mte_sync_tags() as we now defer the PG_mte_tagged bit setting until after the tags had been restored. Can we have the same page mapped by two ptes, each attempting to restore it from swap and one gets it first and starts modifying it? Given that we set the actual pte after setting PG_mte_tagged, it's probably alright but I think we miss some barriers. 
Also, if a page is not a swap one, we currently clear the tags if mapped as pte_tagged() (prior to this patch). We'd need something similar when mapping it in the guest so that we don't leak tags but to avoid any page ending up with PG_mte_tagged, I think you moved the tag clearing to user_mem_abort() in the KVM code. I presume set_pte_at() in the VMM would be called first and then set in Stage 2. > > BTW, ignoring virtualisation, can we ever bring a page in from swap on a > > PROT_NONE mapping (say fault-around)? It's not too bad if we keep the > > metadata around for when the pte becomes accessible but I suspect we > > remove it if the page is removed from swap. > > There are two stages of bringing data from swap. First is populating the > swap cache by doing the physical read from swap. The second is actually > restoring the page table entries. When is the page metadata removed? I want to make sure we don't drop it for some pte attributes. -- Catalin ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 1/6] arm64: mte: Sync tags for pages where PTE is untagged 2021-03-30 10:13 ` Catalin Marinas (?) (?) @ 2021-03-31 10:09 ` Steven Price -1 siblings, 0 replies; 112+ messages in thread From: Steven Price @ 2021-03-31 10:09 UTC (permalink / raw) To: Catalin Marinas Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela, Dr. David Alan Gilbert, Richard Henderson, Peter Maydell, Haibo Xu, Andrew Jones On 30/03/2021 11:13, Catalin Marinas wrote: > On Mon, Mar 29, 2021 at 04:55:29PM +0100, Steven Price wrote: >> On 26/03/2021 18:56, Catalin Marinas wrote: >>> On Fri, Mar 12, 2021 at 03:18:57PM +0000, Steven Price wrote: >>>> A KVM guest could store tags in a page even if the VMM hasn't mapped >>>> the page with PROT_MTE. So when restoring pages from swap we will >>>> need to check to see if there are any saved tags even if !pte_tagged(). >>>> >>>> However don't check pages which are !pte_valid_user() as these will >>>> not have been swapped out. 
>>>> >>>> Signed-off-by: Steven Price <steven.price@arm.com> >>>> --- >>>> arch/arm64/include/asm/pgtable.h | 2 +- >>>> arch/arm64/kernel/mte.c | 16 ++++++++++++---- >>>> 2 files changed, 13 insertions(+), 5 deletions(-) >>>> >>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >>>> index e17b96d0e4b5..84166625c989 100644 >>>> --- a/arch/arm64/include/asm/pgtable.h >>>> +++ b/arch/arm64/include/asm/pgtable.h >>>> @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr, >>>> __sync_icache_dcache(pte); >>>> if (system_supports_mte() && >>>> - pte_present(pte) && pte_tagged(pte) && !pte_special(pte)) >>>> + pte_present(pte) && pte_valid_user(pte) && !pte_special(pte)) >>>> mte_sync_tags(ptep, pte); >>> >>> With the EPAN patches queued in for-next/epan, pte_valid_user() >>> disappeared as its semantics weren't very clear. >> >> Thanks for pointing that out. >> >>> So this relies on the set_pte_at() being done on the VMM address space. >>> I wonder, if the VMM did an mprotect(PROT_NONE), can the VM still access >>> it via stage 2? If yes, the pte_valid_user() test wouldn't work. We need >>> something like pte_present() && addr <= user_addr_max(). >> >> AFAIUI the stage 2 matches the VMM's address space (for the subset that has >> memslots). So mprotect(PROT_NONE) would cause the stage 2 mapping to be >> invalidated and a subsequent fault would exit to the VMM to sort out. This >> sort of thing is done for the lazy migration use case (i.e. pages are >> fetched as the VM tries to access them). > > There's also the protected KVM case which IIUC wouldn't provide any > mapping of the guest memory to the host (or maybe the host still thinks > it's there but cannot access it without a Stage 2 fault). At least in > this case it wouldn't swap pages out and it would be the responsibility > of the EL2 code to clear the tags when giving pages to the guest > (user_mem_abort() must not touch the page). 
>
> So basically we either have a valid, accessible mapping in the VMM and
> we can handle the tags via set_pte_at() or we leave it to whatever is
> running at EL2 in the pKVM case.

For the pKVM case it's up to the EL2 code to hand over suitably scrubbed pages to the guest, and the host doesn't have access to the pages so we (currently) don't have to worry about swap. If swap gets implemented it will presumably be up to the EL2 code to package up both the normal data and the MTE tags into an encrypted bundle for the host to stash somewhere.

> I don't remember whether we had a clear conclusion in the past: have we
> ruled out requiring the VMM to map the guest memory with PROT_MTE
> entirely? IIRC a potential problem was the VMM using MTE itself and
> having to disable it when accessing the guest memory.

Yes, there are some ugly corner cases if we require the VMM to map with PROT_MTE. Hence patch 5 - an ioctl to allow the VMM to access the tags without having to maintain a PROT_MTE mapping.

> Another potential issue (I haven't got my head around it yet) is a race
> in mte_sync_tags() as we now defer the PG_mte_tagged bit setting until
> after the tags had been restored. Can we have the same page mapped by
> two ptes, each attempting to restore it from swap and one gets it first
> and starts modifying it? Given that we set the actual pte after setting
> PG_mte_tagged, it's probably alright but I think we miss some barriers.

I'm not sure if I've got my head round this one yet either, but you could be right there's a race.
This exists without these patches:

 CPU 1                      | CPU 2
----------------------------+------------------------------
 set_pte_at()               |
 --> mte_sync_tags()        |
 --> test_and_set_bit()     |
 --> mte_sync_page_tags()   | set_pte_at()
     [stalls/sleeps]        | --> mte_sync_tags()
                            | --> test_and_set_bit()
                            |     [already set by CPU 1]
                            | set_pte()
                            |     [sees stale tags]
 [eventually wakes up       |
  and sets tags]            |

What I'm struggling to get my head around is whether there's always a sufficient lock held during the call to set_pte_at() to avoid the above. I suspect not because the two calls could be in completely separate processes. We potentially could stick a lock_page()/unlock_page() sequence in mte_sync_tags(). I just ran a basic test and didn't hit problems with that. Any thoughts?

> Also, if a page is not a swap one, we currently clear the tags if mapped
> as pte_tagged() (prior to this patch). We'd need something similar when
> mapping it in the guest so that we don't leak tags but to avoid any page
> ending up with PG_mte_tagged, I think you moved the tag clearing to
> user_mem_abort() in the KVM code. I presume set_pte_at() in the VMM
> would be called first and then set in Stage 2.

Yes - KVM will perform the equivalent of get_user_pages() before setting the entry in Stage 2, that should end up performing any set_pte_at() calls to populate the VMM's page tables. So the VMM 'sees' the memory before stage 2.

>>> BTW, ignoring virtualisation, can we ever bring a page in from swap on a
>>> PROT_NONE mapping (say fault-around)? It's not too bad if we keep the
>>> metadata around for when the pte becomes accessible but I suspect we
>>> remove it if the page is removed from swap.
>>
>> There are two stages of bringing data from swap. First is populating the
>> swap cache by doing the physical read from swap. The second is actually
>> restoring the page table entries.
>
> When is the page metadata removed? I want to make sure we don't drop it
> for some pte attributes.
The tag metadata for swapped pages lives for the same length of time as the swap metadata itself. The swap code already makes sure that the metadata hangs around as long as there are any swap PTEs in existence, so I think everything should be fine here. The arch_swap_invalidate_xxx() calls match up with the frontswap calls as they have the same lifetime requirements.

Steve

^ permalink raw reply	[flat|nested] 112+ messages in thread
* [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature
@ 2021-03-12 15:18 ` Steven Price
  0 siblings, 0 replies; 112+ messages in thread
From: Steven Price @ 2021-03-12 15:18 UTC (permalink / raw)
To: Catalin Marinas, Marc Zyngier, Will Deacon
Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland,
	Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/kvm_emulate.h |  3 +++
 arch/arm64/include/asm/kvm_host.h    |  3 +++
 arch/arm64/kvm/hyp/exception.c       |  3 ++-
 arch/arm64/kvm/mmu.c                 | 16 ++++++++++++++++
 arch/arm64/kvm/sys_regs.c            |  3 +++
 include/uapi/linux/kvm.h             |  1 +
 6 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index f612c090f2e4..6bf776c2399c 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
 	if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
 	    vcpu_el1_is_32bit(vcpu))
 		vcpu->arch.hcr_el2 |= HCR_TID2;
+
+	if (kvm_has_mte(vcpu->kvm))
+		vcpu->arch.hcr_el2 |= HCR_ATA;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 3d10e6527f7d..1170ee137096 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@ struct kvm_arch {
 
 	u8 pfr0_csv2;
 	u8 pfr0_csv3;
+	/* Memory Tagging Extension enabled for the guest */
+	bool mte_enabled;
 };
 
 struct kvm_vcpu_fault_info {
@@ -767,6 +769,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
 #define kvm_arm_vcpu_sve_finalized(vcpu) \
 	((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
 
+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
 #define kvm_vcpu_has_pmu(vcpu)				\
 	(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 73629094f903..56426565600c 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
 	new |= (old & PSR_C_BIT);
 	new |= (old & PSR_V_BIT);
 
-	// TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+	if (kvm_has_mte(vcpu->kvm))
+		new |= PSR_TCO_BIT;
 
 	new |= (old & PSR_DIT_BIT);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..b31b7a821f90 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (vma_pagesize == PAGE_SIZE && !force_pte)
 		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
 							   &pfn, &fault_ipa);
+
+	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {
+		/*
+		 * VM will be able to see the page's tags, so we must ensure
+		 * they have been initialised. if PG_mte_tagged is set, tags
+		 * have already been initialised.
+		 */
+		struct page *page = pfn_to_page(pfn);
+		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
+
+		for (i = 0; i < nr_pages; i++, page++) {
+			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+				mte_clear_page_tags(page_address(page));
+		}
+	}
+
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 4f2f1e3145de..18c87500a7a8 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1047,6 +1047,9 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
 		break;
 	case SYS_ID_AA64PFR1_EL1:
 		val &= ~FEATURE(ID_AA64PFR1_MTE);
+		if (kvm_has_mte(vcpu->kvm))
+			val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE),
+					  ID_AA64PFR1_MTE);
 		break;
 	case SYS_ID_AA64ISAR1_EL1:
 		if (!vcpu_has_ptrauth(vcpu))
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f6afee209620..6dc16c09a2d1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1078,6 +1078,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING 192
 #define KVM_CAP_X86_BUS_LOCK_EXIT 193
 #define KVM_CAP_PPC_DAWR1 194
+#define KVM_CAP_ARM_MTE 195
 
 #ifdef KVM_CAP_IRQ_ROUTING
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature 2021-03-12 15:18 ` Steven Price (?) (?) @ 2021-03-27 15:23 ` Catalin Marinas -1 siblings, 0 replies; 112+ messages in thread From: Catalin Marinas @ 2021-03-27 15:23 UTC (permalink / raw) To: Steven Price Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela, Dr. David Alan Gilbert, Richard Henderson, Peter Maydell, Haibo Xu, Andrew Jones On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > index 77cb2d28f2a4..b31b7a821f90 100644 > --- a/arch/arm64/kvm/mmu.c > +++ b/arch/arm64/kvm/mmu.c > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > if (vma_pagesize == PAGE_SIZE && !force_pte) > vma_pagesize = transparent_hugepage_adjust(memslot, hva, > &pfn, &fault_ipa); > + > + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) { This pfn_valid() check may be problematic. Following commit eeb0753ba27b ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns true for ZONE_DEVICE memory but such memory is allowed not to support MTE. I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn even without virtualisation. > + /* > + * VM will be able to see the page's tags, so we must ensure > + * they have been initialised. if PG_mte_tagged is set, tags > + * have already been initialised. > + */ > + struct page *page = pfn_to_page(pfn); > + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; > + > + for (i = 0; i < nr_pages; i++, page++) { > + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) > + mte_clear_page_tags(page_address(page)); > + } > + } > + > if (writable) > prot |= KVM_PGTABLE_PROT_W; > -- Catalin ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature 2021-03-27 15:23 ` Catalin Marinas (?) (?) @ 2021-03-28 12:21 ` Catalin Marinas -1 siblings, 0 replies; 112+ messages in thread From: Catalin Marinas @ 2021-03-28 12:21 UTC (permalink / raw) To: Steven Price Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela, Dr. David Alan Gilbert, Richard Henderson, Peter Maydell, Haibo Xu, Andrew Jones On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: > On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > > index 77cb2d28f2a4..b31b7a821f90 100644 > > --- a/arch/arm64/kvm/mmu.c > > +++ b/arch/arm64/kvm/mmu.c > > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > > if (vma_pagesize == PAGE_SIZE && !force_pte) > > vma_pagesize = transparent_hugepage_adjust(memslot, hva, > > &pfn, &fault_ipa); > > + > > + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) { > > + /* > > + * VM will be able to see the page's tags, so we must ensure > > + * they have been initialised. if PG_mte_tagged is set, tags > > + * have already been initialised. > > + */ > > + struct page *page = pfn_to_page(pfn); > > + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; > > + > > + for (i = 0; i < nr_pages; i++, page++) { > > + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) > > + mte_clear_page_tags(page_address(page)); > > + } > > + } > > This pfn_valid() check may be problematic. Following commit eeb0753ba27b > ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns > true for ZONE_DEVICE memory but such memory is allowed not to support > MTE. Some more thinking, this should be safe as any ZONE_DEVICE would be mapped as untagged memory in the kernel linear map. 
It could be slightly inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, untagged memory. Another overhead is pfn_valid() which will likely end up calling memblock_is_map_memory(). However, the bigger issue is that Stage 2 cannot disable tagging for Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a way to detect what gets mapped in the guest as Normal Cacheable memory and make sure it's only early memory or hotplug but no ZONE_DEVICE (or something else like on-chip memory)? If we can't guarantee that all Cacheable memory given to a guest supports tags, we should disable the feature altogether. > I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn > even without virtualisation. I haven't checked all the code paths but I don't think we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file descriptor. -- Catalin ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature @ 2021-03-28 12:21 ` Catalin Marinas 0 siblings, 0 replies; 112+ messages in thread From: Catalin Marinas @ 2021-03-28 12:21 UTC (permalink / raw) To: Steven Price Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela, Dr. David Alan Gilbert, Richard Henderson, Peter Maydell, Haibo Xu, Andrew Jones On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: > On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > > index 77cb2d28f2a4..b31b7a821f90 100644 > > --- a/arch/arm64/kvm/mmu.c > > +++ b/arch/arm64/kvm/mmu.c > > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > > if (vma_pagesize == PAGE_SIZE && !force_pte) > > vma_pagesize = transparent_hugepage_adjust(memslot, hva, > > &pfn, &fault_ipa); > > + > > + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) { > > + /* > > + * VM will be able to see the page's tags, so we must ensure > > + * they have been initialised. if PG_mte_tagged is set, tags > > + * have already been initialised. > > + */ > > + struct page *page = pfn_to_page(pfn); > > + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; > > + > > + for (i = 0; i < nr_pages; i++, page++) { > > + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) > > + mte_clear_page_tags(page_address(page)); > > + } > > + } > > This pfn_valid() check may be problematic. Following commit eeb0753ba27b > ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns > true for ZONE_DEVICE memory but such memory is allowed not to support > MTE. Some more thinking, this should be safe as any ZONE_DEVICE would be mapped as untagged memory in the kernel linear map. 
It could be slightly inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, untagged memory. Another overhead is pfn_valid() which will likely end up calling memblock_is_map_memory(). However, the bigger issue is that Stage 2 cannot disable tagging for Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a way to detect what gets mapped in the guest as Normal Cacheable memory and make sure it's only early memory or hotplug but no ZONE_DEVICE (or something else like on-chip memory)? If we can't guarantee that all Cacheable memory given to a guest supports tags, we should disable the feature altogether. > I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn > even without virtualisation. I haven't checked all the code paths but I don't think we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file descriptor. -- Catalin _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature @ 2021-03-28 12:21 ` Catalin Marinas 0 siblings, 0 replies; 112+ messages in thread From: Catalin Marinas @ 2021-03-28 12:21 UTC (permalink / raw) To: Steven Price Cc: Dr. David Alan Gilbert, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: > On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > > index 77cb2d28f2a4..b31b7a821f90 100644 > > --- a/arch/arm64/kvm/mmu.c > > +++ b/arch/arm64/kvm/mmu.c > > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > > if (vma_pagesize == PAGE_SIZE && !force_pte) > > vma_pagesize = transparent_hugepage_adjust(memslot, hva, > > &pfn, &fault_ipa); > > + > > + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) { > > + /* > > + * VM will be able to see the page's tags, so we must ensure > > + * they have been initialised. if PG_mte_tagged is set, tags > > + * have already been initialised. > > + */ > > + struct page *page = pfn_to_page(pfn); > > + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; > > + > > + for (i = 0; i < nr_pages; i++, page++) { > > + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) > > + mte_clear_page_tags(page_address(page)); > > + } > > + } > > This pfn_valid() check may be problematic. Following commit eeb0753ba27b > ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns > true for ZONE_DEVICE memory but such memory is allowed not to support > MTE. Some more thinking, this should be safe as any ZONE_DEVICE would be mapped as untagged memory in the kernel linear map. It could be slightly inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, untagged memory. 
Another overhead is pfn_valid() which will likely end up calling memblock_is_map_memory(). However, the bigger issue is that Stage 2 cannot disable tagging for Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a way to detect what gets mapped in the guest as Normal Cacheable memory and make sure it's only early memory or hotplug but no ZONE_DEVICE (or something else like on-chip memory)? If we can't guarantee that all Cacheable memory given to a guest supports tags, we should disable the feature altogether. > I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn > even without virtualisation. I haven't checked all the code paths but I don't think we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file descriptor. -- Catalin _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature 2021-03-28 12:21 ` Catalin Marinas (?) (?) @ 2021-03-29 16:06 ` Steven Price -1 siblings, 0 replies; 112+ messages in thread From: Steven Price @ 2021-03-29 16:06 UTC (permalink / raw) To: Catalin Marinas Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela, Dr. David Alan Gilbert, Richard Henderson, Peter Maydell, Haibo Xu, Andrew Jones On 28/03/2021 13:21, Catalin Marinas wrote: > On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: >> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: >>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>> index 77cb2d28f2a4..b31b7a821f90 100644 >>> --- a/arch/arm64/kvm/mmu.c >>> +++ b/arch/arm64/kvm/mmu.c >>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>> if (vma_pagesize == PAGE_SIZE && !force_pte) >>> vma_pagesize = transparent_hugepage_adjust(memslot, hva, >>> &pfn, &fault_ipa); >>> + >>> + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) { >>> + /* >>> + * VM will be able to see the page's tags, so we must ensure >>> + * they have been initialised. if PG_mte_tagged is set, tags >>> + * have already been initialised. >>> + */ >>> + struct page *page = pfn_to_page(pfn); >>> + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; >>> + >>> + for (i = 0; i < nr_pages; i++, page++) { >>> + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) >>> + mte_clear_page_tags(page_address(page)); >>> + } >>> + } >> >> This pfn_valid() check may be problematic. Following commit eeb0753ba27b >> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns >> true for ZONE_DEVICE memory but such memory is allowed not to support >> MTE. 
> > Some more thinking, this should be safe as any ZONE_DEVICE would be > mapped as untagged memory in the kernel linear map. It could be slightly > inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, > untagged memory. Another overhead is pfn_valid() which will likely end > up calling memblock_is_map_memory(). > > However, the bigger issue is that Stage 2 cannot disable tagging for > Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a > way to detect what gets mapped in the guest as Normal Cacheable memory > and make sure it's only early memory or hotplug but no ZONE_DEVICE (or > something else like on-chip memory)? If we can't guarantee that all > Cacheable memory given to a guest supports tags, we should disable the > feature altogether. In stage 2 I believe we only have two types of mapping - 'normal' or DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a case of checking the 'device' variable, and makes sense to avoid the overhead you describe. This should also guarantee that all stage-2 cacheable memory supports tags, as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only be true for memory that Linux considers "normal". >> I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn >> even without virtualisation. > > I haven't checked all the code paths but I don't think we can get a > MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file > descriptor. > I certainly hope this is the case - it's the weird corner cases of device drivers that worry me. E.g. I know i915 has a "hidden" mmap behind an ioctl (see i915_gem_mmap_ioctl(), although this case is fine - it's MAP_SHARED). Mali's kbase did something similar in the past. Steve
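[Editor's note: to make the filtering Steven describes concrete, the change would amount to something like the following in user_mem_abort() — a sketch only, assuming the local 'device' flag that the v10 function already derives from kvm_is_device_pfn():]

```c
/*
 * Sketch: skip the tag-initialisation loop for device mappings, which
 * are mapped Device_nGnRE at Stage 2 and cannot be tagged anyway.
 * 'device' is assumed to be the existing local set earlier in
 * user_mem_abort() via kvm_is_device_pfn(pfn).
 */
if (fault_status != FSC_PERM && kvm_has_mte(kvm) && !device &&
    pfn_valid(pfn)) {
	struct page *page = pfn_to_page(pfn);
	unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;

	for (i = 0; i < nr_pages; i++, page++) {
		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
			mte_clear_page_tags(page_address(page));
	}
}
```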
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature 2021-03-29 16:06 ` Steven Price (?) (?) @ 2021-03-30 10:30 ` Catalin Marinas -1 siblings, 0 replies; 112+ messages in thread From: Catalin Marinas @ 2021-03-30 10:30 UTC (permalink / raw) To: Steven Price Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela, Dr. David Alan Gilbert, Richard Henderson, Peter Maydell, Haibo Xu, Andrew Jones On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: > On 28/03/2021 13:21, Catalin Marinas wrote: > > On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: > > > On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > > > > index 77cb2d28f2a4..b31b7a821f90 100644 > > > > --- a/arch/arm64/kvm/mmu.c > > > > +++ b/arch/arm64/kvm/mmu.c > > > > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > > > > if (vma_pagesize == PAGE_SIZE && !force_pte) > > > > vma_pagesize = transparent_hugepage_adjust(memslot, hva, > > > > &pfn, &fault_ipa); > > > > + > > > > + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) { > > > > + /* > > > > + * VM will be able to see the page's tags, so we must ensure > > > > + * they have been initialised. if PG_mte_tagged is set, tags > > > > + * have already been initialised. > > > > + */ > > > > + struct page *page = pfn_to_page(pfn); > > > > + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; > > > > + > > > > + for (i = 0; i < nr_pages; i++, page++) { > > > > + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) > > > > + mte_clear_page_tags(page_address(page)); > > > > + } > > > > + } > > > > > > This pfn_valid() check may be problematic. 
Following commit eeb0753ba27b > > > ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns > > > true for ZONE_DEVICE memory but such memory is allowed not to support > > > MTE. > > > > Some more thinking, this should be safe as any ZONE_DEVICE would be > > mapped as untagged memory in the kernel linear map. It could be slightly > > inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, > > untagged memory. Another overhead is pfn_valid() which will likely end > > up calling memblock_is_map_memory(). > > > > However, the bigger issue is that Stage 2 cannot disable tagging for > > Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a > > way to detect what gets mapped in the guest as Normal Cacheable memory > > and make sure it's only early memory or hotplug but no ZONE_DEVICE (or > > something else like on-chip memory)? If we can't guarantee that all > > Cacheable memory given to a guest supports tags, we should disable the > > feature altogether. > > In stage 2 I believe we only have two types of mapping - 'normal' or > DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a > case of checking the 'device' variable, and makes sense to avoid the > overhead you describe. > > This should also guarantee that all stage-2 cacheable memory supports tags, > as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only > be true for memory that Linux considers "normal". That's the problem. With Anshuman's commit I mentioned above, pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent memory, not talking about some I/O mapping that requires Device_nGnRE). So kvm_is_device_pfn() is false for such memory and it may be mapped as Normal but it is not guaranteed to support tagging. For user MTE, we get away with this as the MAP_ANONYMOUS requirement would filter it out while arch_add_memory() will ensure it's mapped as untagged in the linear map. 
See another recent fix for hotplugged memory: d15dfd31384b ("arm64: mte: Map hotplugged memory as Normal Tagged"). We needed to ensure that ZONE_DEVICE doesn't end up as tagged, only hotplugged memory. Both handled via arch_add_memory() in the arch code with ZONE_DEVICE starting at devm_memremap_pages(). > > > I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn > > > even without virtualisation. > > > > I haven't checked all the code paths but I don't think we can get a > > MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file > > descriptor. > > I certainly hope this is the case - it's the weird corner cases of device > drivers that worry me. E.g. I know i915 has a "hidden" mmap behind an ioctl > (see i915_gem_mmap_ioctl(), although this case is fine - it's MAP_SHARED). > Mali's kbase did something similar in the past. I think this should be fine since it's not a MAP_ANONYMOUS (we do allow MAP_SHARED to be tagged). -- Catalin
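[Editor's note: one way to express the distinction Catalin raises — "valid pfn" vs "pfn guaranteed to support tags" — is pfn_to_online_page(), which returns NULL for ZONE_DEVICE pfns since those are never marked online. A sketch under that assumption, not a tested patch:]

```c
/*
 * Sketch: unlike pfn_valid(), pfn_to_online_page() returns NULL for
 * ZONE_DEVICE (and otherwise offline) pfns, so it only accepts memory
 * that went through the normal online path and is therefore mapped
 * Normal Tagged in the kernel linear map.
 */
static bool kvm_pfn_supports_mte_tags(kvm_pfn_t pfn)
{
	struct page *page = pfn_to_online_page(pfn);

	return page != NULL;
}
```

Using such a helper in place of pfn_valid() in the tag-clearing path would also avoid the memblock_is_map_memory() overhead mentioned above for pfns that cannot be tagged anyway.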
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature 2021-03-30 10:30 ` Catalin Marinas (?) (?) @ 2021-03-31 7:34 ` David Hildenbrand -1 siblings, 0 replies; 112+ messages in thread From: David Hildenbrand @ 2021-03-31 7:34 UTC (permalink / raw) To: Catalin Marinas, Steven Price Cc: Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry On 30.03.21 12:30, Catalin Marinas wrote: > On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: >> On 28/03/2021 13:21, Catalin Marinas wrote: >>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: >>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: >>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>>> index 77cb2d28f2a4..b31b7a821f90 100644 >>>>> --- a/arch/arm64/kvm/mmu.c >>>>> +++ b/arch/arm64/kvm/mmu.c >>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>>> if (vma_pagesize == PAGE_SIZE && !force_pte) >>>>> vma_pagesize = transparent_hugepage_adjust(memslot, hva, >>>>> &pfn, &fault_ipa); >>>>> + >>>>> + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) { >>>>> + /* >>>>> + * VM will be able to see the page's tags, so we must ensure >>>>> + * they have been initialised. if PG_mte_tagged is set, tags >>>>> + * have already been initialised. >>>>> + */ >>>>> + struct page *page = pfn_to_page(pfn); >>>>> + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; >>>>> + >>>>> + for (i = 0; i < nr_pages; i++, page++) { >>>>> + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) >>>>> + mte_clear_page_tags(page_address(page)); >>>>> + } >>>>> + } >>>> >>>> This pfn_valid() check may be problematic. 
Following commit eeb0753ba27b >>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns >>>> true for ZONE_DEVICE memory but such memory is allowed not to support >>>> MTE. >>> >>> Some more thinking, this should be safe as any ZONE_DEVICE would be >>> mapped as untagged memory in the kernel linear map. It could be slightly >>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, >>> untagged memory. Another overhead is pfn_valid() which will likely end >>> up calling memblock_is_map_memory(). >>> >>> However, the bigger issue is that Stage 2 cannot disable tagging for >>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a >>> way to detect what gets mapped in the guest as Normal Cacheable memory >>> and make sure it's only early memory or hotplug but no ZONE_DEVICE (or >>> something else like on-chip memory)? If we can't guarantee that all >>> Cacheable memory given to a guest supports tags, we should disable the >>> feature altogether. >> >> In stage 2 I believe we only have two types of mapping - 'normal' or >> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a >> case of checking the 'device' variable, and makes sense to avoid the >> overhead you describe. >> >> This should also guarantee that all stage-2 cacheable memory supports tags, >> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only >> be true for memory that Linux considers "normal". If you think "normal" == "normal System RAM", that's wrong; see below. > > That's the problem. With Anshuman's commit I mentioned above, > pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent > memory, not talking about some I/O mapping that requires Device_nGnRE). > So kvm_is_device_pfn() is false for such memory and it may be mapped as > Normal but it is not guaranteed to support tagging. pfn_valid() means "there is a struct page"; if you do pfn_to_page() and touch the page, you won't fault. 
So Anshuman's commit is correct. pfn_to_online_page() means, "there is a struct page and it's system RAM that's in use; the memmap has a sane content" > > For user MTE, we get away with this as the MAP_ANONYMOUS requirement > would filter it out while arch_add_memory() will ensure it's mapped as > untagged in the linear map. See another recent fix for hotplugged > memory: d15dfd31384b ("arm64: mte: Map hotplugged memory as Normal > Tagged"). We needed to ensure that ZONE_DEVICE doesn't end up as tagged, > only hoplugged memory. Both handled via arch_add_memory() in the arch > code with ZONE_DEVICE starting at devm_memremap_pages(). > >>>> I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn >>>> even without virtualisation. >>> >>> I haven't checked all the code paths but I don't think we can get a >>> MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file >>> descriptor. >> >> I certainly hope this is the case - it's the weird corner cases of device >> drivers that worry me. E.g. I know i915 has a "hidden" mmap behind an ioctl >> (see i915_gem_mmap_ioctl(), although this case is fine - it's MAP_SHARED). >> Mali's kbase did something similar in the past. > > I think this should be fine since it's not a MAP_ANONYMOUS (we do allow > MAP_SHARED to be tagged). > -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature 2021-03-31 7:34 ` David Hildenbrand (?) (?) @ 2021-03-31 9:21 ` Catalin Marinas -1 siblings, 0 replies; 112+ messages in thread From: Catalin Marinas @ 2021-03-31 9:21 UTC (permalink / raw) To: David Hildenbrand Cc: Steven Price, Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote: > On 30.03.21 12:30, Catalin Marinas wrote: > > On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: > > > On 28/03/2021 13:21, Catalin Marinas wrote: > > > > On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: > > > > > On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: > > > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > > > > > > index 77cb2d28f2a4..b31b7a821f90 100644 > > > > > > --- a/arch/arm64/kvm/mmu.c > > > > > > +++ b/arch/arm64/kvm/mmu.c > > > > > > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > > > > > > if (vma_pagesize == PAGE_SIZE && !force_pte) > > > > > > vma_pagesize = transparent_hugepage_adjust(memslot, hva, > > > > > > &pfn, &fault_ipa); > > > > > > + > > > > > > + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) { > > > > > > + /* > > > > > > + * VM will be able to see the page's tags, so we must ensure > > > > > > + * they have been initialised. if PG_mte_tagged is set, tags > > > > > > + * have already been initialised. 
> > > > > > + */ > > > > > > + struct page *page = pfn_to_page(pfn); > > > > > > + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; > > > > > > + > > > > > > + for (i = 0; i < nr_pages; i++, page++) { > > > > > > + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) > > > > > > + mte_clear_page_tags(page_address(page)); > > > > > > + } > > > > > > + } > > > > > > > > > > This pfn_valid() check may be problematic. Following commit eeb0753ba27b > > > > > ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns > > > > > true for ZONE_DEVICE memory but such memory is allowed not to support > > > > > MTE. > > > > > > > > Some more thinking, this should be safe as any ZONE_DEVICE would be > > > > mapped as untagged memory in the kernel linear map. It could be slightly > > > > inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, > > > > untagged memory. Another overhead is pfn_valid() which will likely end > > > > up calling memblock_is_map_memory(). > > > > > > > > However, the bigger issue is that Stage 2 cannot disable tagging for > > > > Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a > > > > way to detect what gets mapped in the guest as Normal Cacheable memory > > > > and make sure it's only early memory or hotplug but no ZONE_DEVICE (or > > > > something else like on-chip memory)? If we can't guarantee that all > > > > Cacheable memory given to a guest supports tags, we should disable the > > > > feature altogether. > > > > > > In stage 2 I believe we only have two types of mapping - 'normal' or > > > DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a > > > case of checking the 'device' variable, and makes sense to avoid the > > > overhead you describe. > > > > > > This should also guarantee that all stage-2 cacheable memory supports tags, > > > as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only > > > be true for memory that Linux considers "normal". 
> > If you think "normal" == "normal System RAM", that's wrong; see below. By "normal" I think both Steven and I meant the Normal Cacheable memory attribute (another being the Device memory attribute). > > That's the problem. With Anshuman's commit I mentioned above, > > pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent > > memory, not talking about some I/O mapping that requires Device_nGnRE). > > So kvm_is_device_pfn() is false for such memory and it may be mapped as > > Normal but it is not guaranteed to support tagging. > > pfn_valid() means "there is a struct page"; if you do pfn_to_page() and > touch the page, you won't fault. So Anshuman's commit is correct. I agree. > pfn_to_online_page() means, "there is a struct page and it's system RAM > that's in use; the memmap has a sane content" Does pfn_to_online_page() return a valid struct page pointer for ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for some definition of system RAM (I assume NVDIMM != system RAM). For example, pmem_attach_disk() calls devm_memremap_pages() and this would use the Normal Cacheable memory attribute without necessarily being system RAM. So if pfn_valid() is not equivalent to system RAM, we have a potential issue with MTE. Even if "system RAM" includes NVDIMMs, we still have this issue and we may need a new term to describe MTE-safe memory. In the kernel we assume all pages that can be mapped as MAP_ANONYMOUS are MTE-safe, and I don't think these include ZONE_DEVICE pages. Thanks. -- Catalin ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature
@ 2021-03-31  9:32 ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2021-03-31  9:32 UTC (permalink / raw)
To: Catalin Marinas
Cc: Steven Price, Mark Rutland, Peter Maydell, Dr. David Alan Gilbert,
	Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier,
	Juan Quintela, Richard Henderson, linux-kernel, Dave Martin,
	James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon,
	kvmarm, Julien Thierry

On 31.03.21 11:21, Catalin Marinas wrote:
> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
>> On 30.03.21 12:30, Catalin Marinas wrote:
>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
>>>> On 28/03/2021 13:21, Catalin Marinas wrote:
>>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote:
>>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
>>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644
>>>>>>> --- a/arch/arm64/kvm/mmu.c
>>>>>>> +++ b/arch/arm64/kvm/mmu.c
>>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>>>>  	if (vma_pagesize == PAGE_SIZE && !force_pte)
>>>>>>>  		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>>>>>>>  							   &pfn, &fault_ipa);
>>>>>>> +
>>>>>>> +	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {
>>>>>>> +		/*
>>>>>>> +		 * VM will be able to see the page's tags, so we must ensure
>>>>>>> +		 * they have been initialised. if PG_mte_tagged is set, tags
>>>>>>> +		 * have already been initialised.
>>>>>>> +		 */
>>>>>>> +		struct page *page = pfn_to_page(pfn);
>>>>>>> +		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
>>>>>>> +
>>>>>>> +		for (i = 0; i < nr_pages; i++, page++) {
>>>>>>> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>>>>>>> +				mte_clear_page_tags(page_address(page));
>>>>>>> +		}
>>>>>>> +	}
>>>>>>
>>>>>> This pfn_valid() check may be problematic. Following commit eeb0753ba27b
>>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
>>>>>> true for ZONE_DEVICE memory but such memory is allowed not to support
>>>>>> MTE.
>>>>>
>>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be
>>>>> mapped as untagged memory in the kernel linear map. It could be slightly
>>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
>>>>> untagged memory. Another overhead is pfn_valid() which will likely end
>>>>> up calling memblock_is_map_memory().
>>>>>
>>>>> However, the bigger issue is that Stage 2 cannot disable tagging for
>>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
>>>>> way to detect what gets mapped in the guest as Normal Cacheable memory
>>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
>>>>> something else like on-chip memory)? If we can't guarantee that all
>>>>> Cacheable memory given to a guest supports tags, we should disable the
>>>>> feature altogether.
>>>>
>>>> In stage 2 I believe we only have two types of mapping - 'normal' or
>>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a
>>>> case of checking the 'device' variable, and makes sense to avoid the
>>>> overhead you describe.
>>>>
>>>> This should also guarantee that all stage-2 cacheable memory supports tags,
>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only
>>>> be true for memory that Linux considers "normal".
>>
>> If you think "normal" == "normal System RAM", that's wrong; see below.
>
> By "normal" I think both Steven and I meant the Normal Cacheable memory
> attribute (another being the Device memory attribute).
>
>>> That's the problem. With Anshuman's commit I mentioned above,
>>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
>>> memory, not talking about some I/O mapping that requires Device_nGnRE).
>>> So kvm_is_device_pfn() is false for such memory and it may be mapped as
>>> Normal but it is not guaranteed to support tagging.
>>
>> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and
>> touch the page, you won't fault. So Anshuman's commit is correct.
>
> I agree.
>
>> pfn_to_online_page() means, "there is a struct page and it's system RAM
>> that's in use; the memmap has a sane content"
>
> Does pfn_to_online_page() return a valid struct page pointer for
> ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for
> some definition of system RAM (I assume NVDIMM != system RAM). For
> example, pmem_attach_disk() calls devm_memremap_pages() and this would
> use the Normal Cacheable memory attribute without necessarily being
> system RAM.

No, not for ZONE_DEVICE. However, if you expose PMEM via dax/kmem as
System RAM to the system (-> add_memory_driver_managed()), then PMEM
(managed via ZONE_NORMAL or ZONE_MOVABLE) would work with
pfn_to_online_page() -- as the system thinks it's "ordinary system RAM"
and the memory is managed by the buddy.

> So if pfn_valid() is not equivalent to system RAM, we have a potential
> issue with MTE. Even if "system RAM" includes NVDIMMs, we still have
> this issue and we may need a new term to describe MTE-safe memory. In
> the kernel we assume all pages that can be mapped as MAP_ANONYMOUS are
> MTE-safe, and I don't think these include ZONE_DEVICE pages.
>
> Thanks.

-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 112+ messages in thread
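[Editorial aside: the distinction David draws — pfn_valid() means "a struct page exists", while pfn_to_online_page() additionally filters out pages that are not online system RAM (e.g. ZONE_DEVICE/pmem) — can be illustrated with the toy userspace model below. The zone table, names and return conventions are hypothetical stand-ins chosen for illustration; the real kernel predicates involve memory sections, subsection maps and hotplug state, none of which is modelled here.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a 4-pfn address space. NOT kernel code. */
enum model_zone {
	MODEL_ZONE_NONE,    /* hole: no struct page at all */
	MODEL_ZONE_NORMAL,  /* ordinary online system RAM */
	MODEL_ZONE_DEVICE,  /* e.g. pmem mapped via devm_memremap_pages() */
};

struct model_page_meta {
	enum model_zone zone;
};

static struct model_page_meta pmap[4] = {
	{ MODEL_ZONE_NORMAL },
	{ MODEL_ZONE_DEVICE },
	{ MODEL_ZONE_NONE },
	{ MODEL_ZONE_NORMAL },
};

/* Models pfn_valid(): "there is a struct page" -- true for
 * ZONE_DEVICE memory as well, which is why it cannot tell the
 * caller whether the page is MTE-capable system RAM. */
int model_pfn_valid(unsigned long pfn)
{
	return pfn < 4 && pmap[pfn].zone != MODEL_ZONE_NONE;
}

/* Models pfn_to_online_page(): a struct page is returned only for
 * online system RAM; ZONE_DEVICE yields NULL, so it filters out
 * memory that is Normal Cacheable but possibly not tag-capable. */
struct model_page_meta *model_pfn_to_online_page(unsigned long pfn)
{
	if (!model_pfn_valid(pfn) || pmap[pfn].zone == MODEL_ZONE_DEVICE)
		return NULL;
	return &pmap[pfn];
}
```

In this model, pfn 1 (ZONE_DEVICE) passes the pfn_valid()-style check but fails the pfn_to_online_page()-style one, which is exactly the gap being debated for the MTE tag-clearing path.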
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature @ 2021-03-31 9:32 ` David Hildenbrand 0 siblings, 0 replies; 112+ messages in thread From: David Hildenbrand @ 2021-03-31 9:32 UTC (permalink / raw) To: Catalin Marinas Cc: Steven Price, Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry On 31.03.21 11:21, Catalin Marinas wrote: > On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote: >> On 30.03.21 12:30, Catalin Marinas wrote: >>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: >>>> On 28/03/2021 13:21, Catalin Marinas wrote: >>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: >>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: >>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644 >>>>>>> --- a/arch/arm64/kvm/mmu.c >>>>>>> +++ b/arch/arm64/kvm/mmu.c >>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>>>>> if (vma_pagesize == PAGE_SIZE && !force_pte) >>>>>>> vma_pagesize = transparent_hugepage_adjust(memslot, hva, >>>>>>> &pfn, &fault_ipa); >>>>>>> + >>>>>>> + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) { >>>>>>> + /* >>>>>>> + * VM will be able to see the page's tags, so we must ensure >>>>>>> + * they have been initialised. if PG_mte_tagged is set, tags >>>>>>> + * have already been initialised. 
>>>>>>> + */ >>>>>>> + struct page *page = pfn_to_page(pfn); >>>>>>> + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; >>>>>>> + >>>>>>> + for (i = 0; i < nr_pages; i++, page++) { >>>>>>> + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) >>>>>>> + mte_clear_page_tags(page_address(page)); >>>>>>> + } >>>>>>> + } >>>>>> >>>>>> This pfn_valid() check may be problematic. Following commit eeb0753ba27b >>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns >>>>>> true for ZONE_DEVICE memory but such memory is allowed not to support >>>>>> MTE. >>>>> >>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be >>>>> mapped as untagged memory in the kernel linear map. It could be slightly >>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, >>>>> untagged memory. Another overhead is pfn_valid() which will likely end >>>>> up calling memblock_is_map_memory(). >>>>> >>>>> However, the bigger issue is that Stage 2 cannot disable tagging for >>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a >>>>> way to detect what gets mapped in the guest as Normal Cacheable memory >>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE (or >>>>> something else like on-chip memory)? If we can't guarantee that all >>>>> Cacheable memory given to a guest supports tags, we should disable the >>>>> feature altogether. >>>> >>>> In stage 2 I believe we only have two types of mapping - 'normal' or >>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a >>>> case of checking the 'device' variable, and makes sense to avoid the >>>> overhead you describe. >>>> >>>> This should also guarantee that all stage-2 cacheable memory supports tags, >>>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only >>>> be true for memory that Linux considers "normal". >> >> If you think "normal" == "normal System RAM", that's wrong; see below. 
> > By "normal" I think both Steven and I meant the Normal Cacheable memory > attribute (another being the Device memory attribute). > >>> That's the problem. With Anshuman's commit I mentioned above, >>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent >>> memory, not talking about some I/O mapping that requires Device_nGnRE). >>> So kvm_is_device_pfn() is false for such memory and it may be mapped as >>> Normal but it is not guaranteed to support tagging. >> >> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and >> touch the page, you won't fault. So Anshuman's commit is correct. > > I agree. > >> pfn_to_online_page() means, "there is a struct page and it's system RAM >> that's in use; the memmap has a sane content" > > Does pfn_to_online_page() returns a valid struct page pointer for > ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for > some definition of system RAM (I assume NVDIMM != system RAM). For > example, pmem_attach_disk() calls devm_memremap_pages() and this would > use the Normal Cacheable memory attribute without necessarily being > system RAM. No, not for ZONE_DEVICE. However, if you expose PMEM via dax/kmem as System RAM to the system (-> add_memory_driver_managed()), then PMEM (managed via ZONE_NOMRAL or ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system thinks it's "ordinary system RAM" and the memory is managed by the buddy. > > So if pfn_valid() is not equivalent to system RAM, we have a potential > issue with MTE. Even if "system RAM" includes NVDIMMs, we still have > this issue and we may need a new term to describe MTE-safe memory. In > the kernel we assume MTE-safe all pages that can be mapped as > MAP_ANONYMOUS and I don't think these include ZONE_DEVICE pages. > > Thanks. 
> -- Thanks, David / dhildenb _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature @ 2021-03-31 9:32 ` David Hildenbrand 0 siblings, 0 replies; 112+ messages in thread From: David Hildenbrand @ 2021-03-31 9:32 UTC (permalink / raw) To: Catalin Marinas Cc: qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, Dr. David Alan Gilbert, Steven Price, linux-arm-kernel, kvmarm, Thomas Gleixner, Will Deacon, Dave Martin, linux-kernel On 31.03.21 11:21, Catalin Marinas wrote: > On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote: >> On 30.03.21 12:30, Catalin Marinas wrote: >>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: >>>> On 28/03/2021 13:21, Catalin Marinas wrote: >>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: >>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: >>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644 >>>>>>> --- a/arch/arm64/kvm/mmu.c >>>>>>> +++ b/arch/arm64/kvm/mmu.c >>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>>>>> if (vma_pagesize == PAGE_SIZE && !force_pte) >>>>>>> vma_pagesize = transparent_hugepage_adjust(memslot, hva, >>>>>>> &pfn, &fault_ipa); >>>>>>> + >>>>>>> + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) { >>>>>>> + /* >>>>>>> + * VM will be able to see the page's tags, so we must ensure >>>>>>> + * they have been initialised. if PG_mte_tagged is set, tags >>>>>>> + * have already been initialised. >>>>>>> + */ >>>>>>> + struct page *page = pfn_to_page(pfn); >>>>>>> + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; >>>>>>> + >>>>>>> + for (i = 0; i < nr_pages; i++, page++) { >>>>>>> + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) >>>>>>> + mte_clear_page_tags(page_address(page)); >>>>>>> + } >>>>>>> + } >>>>>> >>>>>> This pfn_valid() check may be problematic. 
Following commit eeb0753ba27b >>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns >>>>>> true for ZONE_DEVICE memory but such memory is allowed not to support >>>>>> MTE. >>>>> >>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be >>>>> mapped as untagged memory in the kernel linear map. It could be slightly >>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, >>>>> untagged memory. Another overhead is pfn_valid() which will likely end >>>>> up calling memblock_is_map_memory(). >>>>> >>>>> However, the bigger issue is that Stage 2 cannot disable tagging for >>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a >>>>> way to detect what gets mapped in the guest as Normal Cacheable memory >>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE (or >>>>> something else like on-chip memory)? If we can't guarantee that all >>>>> Cacheable memory given to a guest supports tags, we should disable the >>>>> feature altogether. >>>> >>>> In stage 2 I believe we only have two types of mapping - 'normal' or >>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a >>>> case of checking the 'device' variable, and makes sense to avoid the >>>> overhead you describe. >>>> >>>> This should also guarantee that all stage-2 cacheable memory supports tags, >>>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only >>>> be true for memory that Linux considers "normal". >> >> If you think "normal" == "normal System RAM", that's wrong; see below. > > By "normal" I think both Steven and I meant the Normal Cacheable memory > attribute (another being the Device memory attribute). > >>> That's the problem. With Anshuman's commit I mentioned above, >>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent >>> memory, not talking about some I/O mapping that requires Device_nGnRE). 
>>> So kvm_is_device_pfn() is false for such memory and it may be mapped as >>> Normal but it is not guaranteed to support tagging. >> >> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and >> touch the page, you won't fault. So Anshuman's commit is correct. > > I agree. > >> pfn_to_online_page() means, "there is a struct page and it's system RAM >> that's in use; the memmap has a sane content" > > Does pfn_to_online_page() returns a valid struct page pointer for > ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for > some definition of system RAM (I assume NVDIMM != system RAM). For > example, pmem_attach_disk() calls devm_memremap_pages() and this would > use the Normal Cacheable memory attribute without necessarily being > system RAM. No, not for ZONE_DEVICE. However, if you expose PMEM via dax/kmem as System RAM to the system (-> add_memory_driver_managed()), then PMEM (managed via ZONE_NOMRAL or ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system thinks it's "ordinary system RAM" and the memory is managed by the buddy. > > So if pfn_valid() is not equivalent to system RAM, we have a potential > issue with MTE. Even if "system RAM" includes NVDIMMs, we still have > this issue and we may need a new term to describe MTE-safe memory. In > the kernel we assume MTE-safe all pages that can be mapped as > MAP_ANONYMOUS and I don't think these include ZONE_DEVICE pages. > > Thanks. > -- Thanks, David / dhildenb _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature 2021-03-31 9:32 ` David Hildenbrand (?) (?) @ 2021-03-31 10:41 ` Steven Price -1 siblings, 0 replies; 112+ messages in thread From: Steven Price @ 2021-03-31 10:41 UTC (permalink / raw) To: David Hildenbrand, Catalin Marinas Cc: Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry On 31/03/2021 10:32, David Hildenbrand wrote: > On 31.03.21 11:21, Catalin Marinas wrote: >> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote: >>> On 30.03.21 12:30, Catalin Marinas wrote: >>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: >>>>> On 28/03/2021 13:21, Catalin Marinas wrote: >>>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: >>>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: >>>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644 >>>>>>>> --- a/arch/arm64/kvm/mmu.c >>>>>>>> +++ b/arch/arm64/kvm/mmu.c >>>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu >>>>>>>> *vcpu, phys_addr_t fault_ipa, >>>>>>>> if (vma_pagesize == PAGE_SIZE && !force_pte) >>>>>>>> vma_pagesize = transparent_hugepage_adjust(memslot, >>>>>>>> hva, >>>>>>>> &pfn, &fault_ipa); >>>>>>>> + >>>>>>>> + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && >>>>>>>> pfn_valid(pfn)) { >>>>>>>> + /* >>>>>>>> + * VM will be able to see the page's tags, so we must >>>>>>>> ensure >>>>>>>> + * they have been initialised. if PG_mte_tagged is set, >>>>>>>> tags >>>>>>>> + * have already been initialised. 
>>>>>>>> + */ >>>>>>>> + struct page *page = pfn_to_page(pfn); >>>>>>>> + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; >>>>>>>> + >>>>>>>> + for (i = 0; i < nr_pages; i++, page++) { >>>>>>>> + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) >>>>>>>> + mte_clear_page_tags(page_address(page)); >>>>>>>> + } >>>>>>>> + } >>>>>>> >>>>>>> This pfn_valid() check may be problematic. Following commit >>>>>>> eeb0753ba27b >>>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it >>>>>>> returns >>>>>>> true for ZONE_DEVICE memory but such memory is allowed not to >>>>>>> support >>>>>>> MTE. >>>>>> >>>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be >>>>>> mapped as untagged memory in the kernel linear map. It could be >>>>>> slightly >>>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, >>>>>> untagged memory. Another overhead is pfn_valid() which will likely >>>>>> end >>>>>> up calling memblock_is_map_memory(). >>>>>> >>>>>> However, the bigger issue is that Stage 2 cannot disable tagging for >>>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is >>>>>> there a >>>>>> way to detect what gets mapped in the guest as Normal Cacheable >>>>>> memory >>>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE >>>>>> (or >>>>>> something else like on-chip memory)? If we can't guarantee that all >>>>>> Cacheable memory given to a guest supports tags, we should disable >>>>>> the >>>>>> feature altogether. >>>>> >>>>> In stage 2 I believe we only have two types of mapping - 'normal' or >>>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the >>>>> latter is a >>>>> case of checking the 'device' variable, and makes sense to avoid the >>>>> overhead you describe. 
>>>>>
>>>>> This should also guarantee that all stage-2 cacheable memory
>>>>> supports tags, as kvm_is_device_pfn() is simply !pfn_valid(), and
>>>>> pfn_valid() should only be true for memory that Linux considers
>>>>> "normal".
>>>
>>> If you think "normal" == "normal System RAM", that's wrong; see below.
>>
>> By "normal" I think both Steven and I meant the Normal Cacheable memory
>> attribute (another being the Device memory attribute).

Sadly there's no good standardised terminology here. AArch64 provides
the "normal (cacheable)" definition. Memory which is mapped as "Normal
Cacheable" is implicitly MTE capable when shared with a guest (because
the stage 2 mappings don't allow restricting MTE other than mapping it
as Device memory).

So MTE also forces us to have a definition of memory which is "bog
standard memory"[1] separate from the mapping attributes. This is the
main memory which fully supports MTE.

Separate from the "bog standard" we have the "special"[1] memory, e.g.
ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but
that memory may not support MTE tags. This memory can only be safely
shared with a guest in the following situations:

 1. MTE is completely disabled for the guest
 2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE)
 3. We have some guarantee that guest MTE accesses are in some way safe.

(1) is the situation today (without this patch series). But it prevents
the guest from using MTE in any form.

(2) is pretty terrible for general memory, but is the get-out clause for
mapping devices into the guest.

(3) isn't something we have any architectural way of discovering. We'd
need to know what the device did with the MTE accesses (and any caches
between the CPU and the device) to ensure there aren't any side-channels
or h/w lockup issues. We'd also need some way of describing this memory
to the guest.

So at least for the time being the approach is to avoid letting a guest
with MTE enabled have access to this sort of memory.
[1] Neither "bog standard" nor "special" are real terms - like I said
there's a lack of standardised terminology.

>>>> That's the problem. With Anshuman's commit I mentioned above,
>>>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
>>>> memory, not talking about some I/O mapping that requires Device_nGnRE).
>>>> So kvm_is_device_pfn() is false for such memory and it may be mapped as
>>>> Normal but it is not guaranteed to support tagging.
>>>
>>> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and
>>> touch the page, you won't fault. So Anshuman's commit is correct.
>>
>> I agree.
>>
>>> pfn_to_online_page() means, "there is a struct page and it's system RAM
>>> that's in use; the memmap has a sane content"
>>
>> Does pfn_to_online_page() return a valid struct page pointer for
>> ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for
>> some definition of system RAM (I assume NVDIMM != system RAM). For
>> example, pmem_attach_disk() calls devm_memremap_pages() and this would
>> use the Normal Cacheable memory attribute without necessarily being
>> system RAM.
>
> No, not for ZONE_DEVICE.
>
> However, if you expose PMEM via dax/kmem as System RAM to the system (->
> add_memory_driver_managed()), then PMEM (managed via ZONE_NORMAL or
> ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system
> thinks it's "ordinary system RAM" and the memory is managed by the buddy.

So if I'm understanding this correctly, for KVM we need to use
pfn_to_online_page() and reject the page if NULL is returned? In the
case of dax/kmem there already needs to be validation that the memory
supports MTE (otherwise we break user space) before it's allowed into
the "ordinary system RAM" bucket.

Steve

^ permalink raw reply	[flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature 2021-03-31 10:41 ` Steven Price (?) (?) @ 2021-03-31 14:14 ` David Hildenbrand -1 siblings, 0 replies; 112+ messages in thread From: David Hildenbrand @ 2021-03-31 14:14 UTC (permalink / raw) To: Steven Price, Catalin Marinas Cc: Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry On 31.03.21 12:41, Steven Price wrote: > On 31/03/2021 10:32, David Hildenbrand wrote: >> On 31.03.21 11:21, Catalin Marinas wrote: >>> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote: >>>> On 30.03.21 12:30, Catalin Marinas wrote: >>>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: >>>>>> On 28/03/2021 13:21, Catalin Marinas wrote: >>>>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: >>>>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: >>>>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644 >>>>>>>>> --- a/arch/arm64/kvm/mmu.c >>>>>>>>> +++ b/arch/arm64/kvm/mmu.c >>>>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu >>>>>>>>> *vcpu, phys_addr_t fault_ipa, >>>>>>>>> if (vma_pagesize == PAGE_SIZE && !force_pte) >>>>>>>>> vma_pagesize = transparent_hugepage_adjust(memslot, >>>>>>>>> hva, >>>>>>>>> &pfn, &fault_ipa); >>>>>>>>> + >>>>>>>>> + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && >>>>>>>>> pfn_valid(pfn)) { >>>>>>>>> + /* >>>>>>>>> + * VM will be able to see the page's tags, so we must >>>>>>>>> ensure >>>>>>>>> + * they have been initialised. if PG_mte_tagged is set, >>>>>>>>> tags >>>>>>>>> + * have already been initialised. 
>>>>>>>>> + */ >>>>>>>>> + struct page *page = pfn_to_page(pfn); >>>>>>>>> + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; >>>>>>>>> + >>>>>>>>> + for (i = 0; i < nr_pages; i++, page++) { >>>>>>>>> + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) >>>>>>>>> + mte_clear_page_tags(page_address(page)); >>>>>>>>> + } >>>>>>>>> + } >>>>>>>> >>>>>>>> This pfn_valid() check may be problematic. Following commit >>>>>>>> eeb0753ba27b >>>>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it >>>>>>>> returns >>>>>>>> true for ZONE_DEVICE memory but such memory is allowed not to >>>>>>>> support >>>>>>>> MTE. >>>>>>> >>>>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be >>>>>>> mapped as untagged memory in the kernel linear map. It could be >>>>>>> slightly >>>>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, >>>>>>> untagged memory. Another overhead is pfn_valid() which will likely >>>>>>> end >>>>>>> up calling memblock_is_map_memory(). >>>>>>> >>>>>>> However, the bigger issue is that Stage 2 cannot disable tagging for >>>>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is >>>>>>> there a >>>>>>> way to detect what gets mapped in the guest as Normal Cacheable >>>>>>> memory >>>>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE >>>>>>> (or >>>>>>> something else like on-chip memory)? If we can't guarantee that all >>>>>>> Cacheable memory given to a guest supports tags, we should disable >>>>>>> the >>>>>>> feature altogether. >>>>>> >>>>>> In stage 2 I believe we only have two types of mapping - 'normal' or >>>>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the >>>>>> latter is a >>>>>> case of checking the 'device' variable, and makes sense to avoid the >>>>>> overhead you describe. 
>>>>>> >>>>>> This should also guarantee that all stage-2 cacheable memory >>>>>> supports tags, >>>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() >>>>>> should only >>>>>> be true for memory that Linux considers "normal". >>>> >>>> If you think "normal" == "normal System RAM", that's wrong; see below. >>> >>> By "normal" I think both Steven and I meant the Normal Cacheable memory >>> attribute (another being the Device memory attribute). > > Sadly there's no good standardised terminology here. AArch64 provides > the "normal (cacheable)" definition. Memory which is mapped as "Normal > Cacheable" is implicitly MTE capable when shared with a guest (because > the stage 2 mappings don't allow restricting MTE other than mapping it > as Device memory). > > So MTE also forces us to have a definition of memory which is "bog > standard memory"[1] separate from the mapping attributes. This is the > main memory which fully supports MTE. > > Separate from the "bog standard" we have the "special"[1] memory, e.g. > ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but > that memory may not support MTE tags. This memory can only be safely > shared with a guest in the following situations: > > 1. MTE is completely disabled for the guest > > 2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE) > > 3. We have some guarantee that guest MTE accesses are in some way safe. > > (1) is the situation today (without this patch series). But it prevents > the guest from using MTE in any form. > > (2) is pretty terrible for general memory, but is the get-out clause for > mapping devices into the guest. > > (3) isn't something we have any architectural way of discovering. We'd > need to know what the device did with the MTE accesses (and any caches > between the CPU and the device) to ensure there aren't any side-channels > or h/w lockup issues. We'd also need some way of describing this memory > to the guest. 
> > So at least for the time being the approach is to avoid letting a guest > with MTE enabled have access to this sort of memory. > > [1] Neither "bog standard" nor "special" are real terms - like I said > there's a lack of standardised terminology. > >>>>> That's the problem. With Anshuman's commit I mentioned above, >>>>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent >>>>> memory, not talking about some I/O mapping that requires Device_nGnRE). >>>>> So kvm_is_device_pfn() is false for such memory and it may be mapped as >>>>> Normal but it is not guaranteed to support tagging. >>>> >>>> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and >>>> touch the page, you won't fault. So Anshuman's commit is correct. >>> >>> I agree. >>> >>>> pfn_to_online_page() means, "there is a struct page and it's system RAM >>>> that's in use; the memmap has a sane content" >>> >>> Does pfn_to_online_page() return a valid struct page pointer for >>> ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for >>> some definition of system RAM (I assume NVDIMM != system RAM). For >>> example, pmem_attach_disk() calls devm_memremap_pages() and this would >>> use the Normal Cacheable memory attribute without necessarily being >>> system RAM. >> >> No, not for ZONE_DEVICE. >> >> However, if you expose PMEM via dax/kmem as System RAM to the system (-> >> add_memory_driver_managed()), then PMEM (managed via ZONE_NORMAL or >> ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system >> thinks it's "ordinary system RAM" and the memory is managed by the buddy. > > So if I'm understanding this correctly for KVM we need to use > pfn_to_online_page() and reject if NULL is returned? In the case of > dax/kmem there already needs to be validation that the memory supports > MTE (otherwise we break user space) before it's allowed into the > "ordinary system RAM" bucket. That should work. 1. 
One alternative is

if (!pfn_valid(pfn))
	return false;

#ifdef CONFIG_ZONE_DEVICE
page = pfn_to_page(pfn);
if (page_zonenum(page) == ZONE_DEVICE)
	return false;
#endif

return true;

Note that when you are dealing with random PFNs, this approach is in general not safe; the memmap could be uninitialized and contain garbage. You can have false positives for ZONE_DEVICE.

2. Yet another (slower?) variant to detect (some?) ZONE_DEVICE is

pgmap = get_dev_pagemap(pfn, NULL);
put_dev_pagemap(pgmap);
if (pgmap)
	return false;

return true;

I know that /dev/mem mappings can be problematic ... because the memmap could be in any state and actually we shouldn't even touch/rely on any "struct pages" at all, as we have a pure PFN mapping ... -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature @ 2021-03-31 14:14 ` David Hildenbrand 0 siblings, 0 replies; 112+ messages in thread From: David Hildenbrand @ 2021-03-31 14:14 UTC (permalink / raw) To: Steven Price, Catalin Marinas Cc: Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry On 31.03.21 12:41, Steven Price wrote: > On 31/03/2021 10:32, David Hildenbrand wrote: >> On 31.03.21 11:21, Catalin Marinas wrote: >>> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote: >>>> On 30.03.21 12:30, Catalin Marinas wrote: >>>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: >>>>>> On 28/03/2021 13:21, Catalin Marinas wrote: >>>>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: >>>>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: >>>>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644 >>>>>>>>> --- a/arch/arm64/kvm/mmu.c >>>>>>>>> +++ b/arch/arm64/kvm/mmu.c >>>>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu >>>>>>>>> *vcpu, phys_addr_t fault_ipa, >>>>>>>>> if (vma_pagesize == PAGE_SIZE && !force_pte) >>>>>>>>> vma_pagesize = transparent_hugepage_adjust(memslot, >>>>>>>>> hva, >>>>>>>>> &pfn, &fault_ipa); >>>>>>>>> + >>>>>>>>> + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && >>>>>>>>> pfn_valid(pfn)) { >>>>>>>>> + /* >>>>>>>>> + * VM will be able to see the page's tags, so we must >>>>>>>>> ensure >>>>>>>>> + * they have been initialised. if PG_mte_tagged is set, >>>>>>>>> tags >>>>>>>>> + * have already been initialised. 
>>>>>>>>> + */ >>>>>>>>> + struct page *page = pfn_to_page(pfn); >>>>>>>>> + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; >>>>>>>>> + >>>>>>>>> + for (i = 0; i < nr_pages; i++, page++) { >>>>>>>>> + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) >>>>>>>>> + mte_clear_page_tags(page_address(page)); >>>>>>>>> + } >>>>>>>>> + } >>>>>>>> >>>>>>>> This pfn_valid() check may be problematic. Following commit >>>>>>>> eeb0753ba27b >>>>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it >>>>>>>> returns >>>>>>>> true for ZONE_DEVICE memory but such memory is allowed not to >>>>>>>> support >>>>>>>> MTE. >>>>>>> >>>>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be >>>>>>> mapped as untagged memory in the kernel linear map. It could be >>>>>>> slightly >>>>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, >>>>>>> untagged memory. Another overhead is pfn_valid() which will likely >>>>>>> end >>>>>>> up calling memblock_is_map_memory(). >>>>>>> >>>>>>> However, the bigger issue is that Stage 2 cannot disable tagging for >>>>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is >>>>>>> there a >>>>>>> way to detect what gets mapped in the guest as Normal Cacheable >>>>>>> memory >>>>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE >>>>>>> (or >>>>>>> something else like on-chip memory)? If we can't guarantee that all >>>>>>> Cacheable memory given to a guest supports tags, we should disable >>>>>>> the >>>>>>> feature altogether. >>>>>> >>>>>> In stage 2 I believe we only have two types of mapping - 'normal' or >>>>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the >>>>>> latter is a >>>>>> case of checking the 'device' variable, and makes sense to avoid the >>>>>> overhead you describe. 
>>>>>> >>>>>> This should also guarantee that all stage-2 cacheable memory >>>>>> supports tags, >>>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() >>>>>> should only >>>>>> be true for memory that Linux considers "normal". >>>> >>>> If you think "normal" == "normal System RAM", that's wrong; see below. >>> >>> By "normal" I think both Steven and I meant the Normal Cacheable memory >>> attribute (another being the Device memory attribute). > > Sadly there's no good standardised terminology here. Aarch64 provides > the "normal (cacheable)" definition. Memory which is mapped as "Normal > Cacheable" is implicitly MTE capable when shared with a guest (because > the stage 2 mappings don't allow restricting MTE other than mapping it > as Device memory). > > So MTE also forces us to have a definition of memory which is "bog > standard memory"[1] separate from the mapping attributes. This is the > main memory which fully supports MTE. > > Separate from the "bog standard" we have the "special"[1] memory, e.g. > ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but > that memory may not support MTE tags. This memory can only be safely > shared with a guest in the following situations: > > 1. MTE is completely disabled for the guest > > 2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE) > > 3. We have some guarantee that guest MTE access are in some way safe. > > (1) is the situation today (without this patch series). But it prevents > the guest from using MTE in any form. > > (2) is pretty terrible for general memory, but is the get-out clause for > mapping devices into the guest. > > (3) isn't something we have any architectural way of discovering. We'd > need to know what the device did with the MTE accesses (and any caches > between the CPU and the device) to ensure there aren't any side-channels > or h/w lockup issues. We'd also need some way of describing this memory > to the guest. 
> > So at least for the time being the approach is to avoid letting a guest > with MTE enabled have access to this sort of memory. > > [1] Neither "bog standard" nor "special" are real terms - like I said > there's a lack of standardised terminology. > >>>>> That's the problem. With Anshuman's commit I mentioned above, >>>>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent >>>>> memory, not talking about some I/O mapping that requires Device_nGnRE). >>>>> So kvm_is_device_pfn() is false for such memory and it may be mapped as >>>>> Normal but it is not guaranteed to support tagging. >>>> >>>> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and >>>> touch the page, you won't fault. So Anshuman's commit is correct. >>> >>> I agree. >>> >>>> pfn_to_online_page() means, "there is a struct page and it's system RAM >>>> that's in use; the memmap has a sane content" >>> >>> Does pfn_to_online_page() returns a valid struct page pointer for >>> ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for >>> some definition of system RAM (I assume NVDIMM != system RAM). For >>> example, pmem_attach_disk() calls devm_memremap_pages() and this would >>> use the Normal Cacheable memory attribute without necessarily being >>> system RAM. >> >> No, not for ZONE_DEVICE. >> >> However, if you expose PMEM via dax/kmem as System RAM to the system (-> >> add_memory_driver_managed()), then PMEM (managed via ZONE_NOMRAL or >> ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system >> thinks it's "ordinary system RAM" and the memory is managed by the buddy. > > So if I'm understanding this correctly for KVM we need to use > pfn_to_online_pages() and reject if NULL is returned? In the case of > dax/kmem there already needs to be validation that the memory supports > MTE (otherwise we break user space) before it's allowed into the > "ordinary system RAM" bucket. That should work. 1. 
One alternative is if (!pfn_valid(pfn)) return false; #ifdef CONFIG_ZONE_DEVICE page = pfn_to_page(pfn); if (page_zonenum(page) == ZONE_DEVICE) return false; #endif return true; Note that when you are dealing with random PFNs, this approach is in general not safe; the memmap could be uninitialized and contain garbage. You can have false positives for ZONE_DEVICE. 2. Yet another (slower?) variant to detect (some?) ZONE_DEVICE is pgmap = get_dev_pagemap(pfn, NULL); put_dev_pagemap(pgmap); if (pgmap) return false; return true; I know that /dev/mem mappings can be problematic ... because the memmap could be in any state and actually we shouldn't even touch/rely on any "struct pages" at all, as we have a pure PFN mapping ... -- Thanks, David / dhildenb _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature @ 2021-03-31 14:14 ` David Hildenbrand 0 siblings, 0 replies; 112+ messages in thread From: David Hildenbrand @ 2021-03-31 14:14 UTC (permalink / raw) To: Steven Price, Catalin Marinas Cc: Marc Zyngier, Juan Quintela, Richard Henderson, Dr. David Alan Gilbert, qemu-devel, linux-arm-kernel, kvmarm, Thomas Gleixner, Will Deacon, Dave Martin, linux-kernel On 31.03.21 12:41, Steven Price wrote: > On 31/03/2021 10:32, David Hildenbrand wrote: >> On 31.03.21 11:21, Catalin Marinas wrote: >>> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote: >>>> On 30.03.21 12:30, Catalin Marinas wrote: >>>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: >>>>>> On 28/03/2021 13:21, Catalin Marinas wrote: >>>>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: >>>>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: >>>>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644 >>>>>>>>> --- a/arch/arm64/kvm/mmu.c >>>>>>>>> +++ b/arch/arm64/kvm/mmu.c >>>>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu >>>>>>>>> *vcpu, phys_addr_t fault_ipa, >>>>>>>>> if (vma_pagesize == PAGE_SIZE && !force_pte) >>>>>>>>> vma_pagesize = transparent_hugepage_adjust(memslot, >>>>>>>>> hva, >>>>>>>>> &pfn, &fault_ipa); >>>>>>>>> + >>>>>>>>> + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && >>>>>>>>> pfn_valid(pfn)) { >>>>>>>>> + /* >>>>>>>>> + * VM will be able to see the page's tags, so we must >>>>>>>>> ensure >>>>>>>>> + * they have been initialised. if PG_mte_tagged is set, >>>>>>>>> tags >>>>>>>>> + * have already been initialised. 
>>>>>>>>> + */ >>>>>>>>> + struct page *page = pfn_to_page(pfn); >>>>>>>>> + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; >>>>>>>>> + >>>>>>>>> + for (i = 0; i < nr_pages; i++, page++) { >>>>>>>>> + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) >>>>>>>>> + mte_clear_page_tags(page_address(page)); >>>>>>>>> + } >>>>>>>>> + } >>>>>>>> >>>>>>>> This pfn_valid() check may be problematic. Following commit >>>>>>>> eeb0753ba27b >>>>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it >>>>>>>> returns >>>>>>>> true for ZONE_DEVICE memory but such memory is allowed not to >>>>>>>> support >>>>>>>> MTE. >>>>>>> >>>>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be >>>>>>> mapped as untagged memory in the kernel linear map. It could be >>>>>>> slightly >>>>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, >>>>>>> untagged memory. Another overhead is pfn_valid() which will likely >>>>>>> end >>>>>>> up calling memblock_is_map_memory(). >>>>>>> >>>>>>> However, the bigger issue is that Stage 2 cannot disable tagging for >>>>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is >>>>>>> there a >>>>>>> way to detect what gets mapped in the guest as Normal Cacheable >>>>>>> memory >>>>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE >>>>>>> (or >>>>>>> something else like on-chip memory)? If we can't guarantee that all >>>>>>> Cacheable memory given to a guest supports tags, we should disable >>>>>>> the >>>>>>> feature altogether. >>>>>> >>>>>> In stage 2 I believe we only have two types of mapping - 'normal' or >>>>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the >>>>>> latter is a >>>>>> case of checking the 'device' variable, and makes sense to avoid the >>>>>> overhead you describe. 
>>>>>> >>>>>> This should also guarantee that all stage-2 cacheable memory >>>>>> supports tags, >>>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() >>>>>> should only >>>>>> be true for memory that Linux considers "normal". >>>> >>>> If you think "normal" == "normal System RAM", that's wrong; see below. >>> >>> By "normal" I think both Steven and I meant the Normal Cacheable memory >>> attribute (another being the Device memory attribute). > > Sadly there's no good standardised terminology here. Aarch64 provides > the "normal (cacheable)" definition. Memory which is mapped as "Normal > Cacheable" is implicitly MTE capable when shared with a guest (because > the stage 2 mappings don't allow restricting MTE other than mapping it > as Device memory). > > So MTE also forces us to have a definition of memory which is "bog > standard memory"[1] separate from the mapping attributes. This is the > main memory which fully supports MTE. > > Separate from the "bog standard" we have the "special"[1] memory, e.g. > ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but > that memory may not support MTE tags. This memory can only be safely > shared with a guest in the following situations: > > 1. MTE is completely disabled for the guest > > 2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE) > > 3. We have some guarantee that guest MTE access are in some way safe. > > (1) is the situation today (without this patch series). But it prevents > the guest from using MTE in any form. > > (2) is pretty terrible for general memory, but is the get-out clause for > mapping devices into the guest. > > (3) isn't something we have any architectural way of discovering. We'd > need to know what the device did with the MTE accesses (and any caches > between the CPU and the device) to ensure there aren't any side-channels > or h/w lockup issues. We'd also need some way of describing this memory > to the guest. 
> > So at least for the time being the approach is to avoid letting a guest > with MTE enabled have access to this sort of memory. > > [1] Neither "bog standard" nor "special" are real terms - like I said > there's a lack of standardised terminology. > >>>>> That's the problem. With Anshuman's commit I mentioned above, >>>>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent >>>>> memory, not talking about some I/O mapping that requires Device_nGnRE). >>>>> So kvm_is_device_pfn() is false for such memory and it may be mapped as >>>>> Normal but it is not guaranteed to support tagging. >>>> >>>> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and >>>> touch the page, you won't fault. So Anshuman's commit is correct. >>> >>> I agree. >>> >>>> pfn_to_online_page() means, "there is a struct page and it's system RAM >>>> that's in use; the memmap has a sane content" >>> >>> Does pfn_to_online_page() returns a valid struct page pointer for >>> ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for >>> some definition of system RAM (I assume NVDIMM != system RAM). For >>> example, pmem_attach_disk() calls devm_memremap_pages() and this would >>> use the Normal Cacheable memory attribute without necessarily being >>> system RAM. >> >> No, not for ZONE_DEVICE. >> >> However, if you expose PMEM via dax/kmem as System RAM to the system (-> >> add_memory_driver_managed()), then PMEM (managed via ZONE_NOMRAL or >> ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system >> thinks it's "ordinary system RAM" and the memory is managed by the buddy. > > So if I'm understanding this correctly for KVM we need to use > pfn_to_online_pages() and reject if NULL is returned? In the case of > dax/kmem there already needs to be validation that the memory supports > MTE (otherwise we break user space) before it's allowed into the > "ordinary system RAM" bucket. That should work. 1. 
One alternative is if (!pfn_valid(pfn)) return false; #ifdef CONFIG_ZONE_DEVICE page = pfn_to_page(pfn); if (page_zonenum(page) == ZONE_DEVICE) return false; #endif return true; Note that when you are dealing with random PFNs, this approach is in general not safe; the memmap could be uninitialized and contain garbage. You can have false positives for ZONE_DEVICE. 2. Yet another (slower?) variant to detect (some?) ZONE_DEVICE is pgmap = get_dev_pagemap(pfn, NULL); put_dev_pagemap(pgmap); if (pgmap) return false; return true; I know that /dev/mem mappings can be problematic ... because the memmap could be in any state and actually we shouldn't even touch/rely on any "struct pages" at all, as we have a pure PFN mapping ... -- Thanks, David / dhildenb _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature @ 2021-03-31 14:14 ` David Hildenbrand 0 siblings, 0 replies; 112+ messages in thread From: David Hildenbrand @ 2021-03-31 14:14 UTC (permalink / raw) To: Steven Price, Catalin Marinas Cc: Mark Rutland, Peter Maydell, Andrew Jones, Haibo Xu, Suzuki K Poulose, Marc Zyngier, Juan Quintela, Richard Henderson, Dr. David Alan Gilbert, qemu-devel, James Morse, linux-arm-kernel, kvmarm, Thomas Gleixner, Julien Thierry, Will Deacon, Dave Martin, linux-kernel On 31.03.21 12:41, Steven Price wrote: > On 31/03/2021 10:32, David Hildenbrand wrote: >> On 31.03.21 11:21, Catalin Marinas wrote: >>> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote: >>>> On 30.03.21 12:30, Catalin Marinas wrote: >>>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: >>>>>> On 28/03/2021 13:21, Catalin Marinas wrote: >>>>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote: >>>>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote: >>>>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644 >>>>>>>>> --- a/arch/arm64/kvm/mmu.c >>>>>>>>> +++ b/arch/arm64/kvm/mmu.c >>>>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu >>>>>>>>> *vcpu, phys_addr_t fault_ipa, >>>>>>>>> if (vma_pagesize == PAGE_SIZE && !force_pte) >>>>>>>>> vma_pagesize = transparent_hugepage_adjust(memslot, >>>>>>>>> hva, >>>>>>>>> &pfn, &fault_ipa); >>>>>>>>> + >>>>>>>>> + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && >>>>>>>>> pfn_valid(pfn)) { >>>>>>>>> + /* >>>>>>>>> + * VM will be able to see the page's tags, so we must >>>>>>>>> ensure >>>>>>>>> + * they have been initialised. if PG_mte_tagged is set, >>>>>>>>> tags >>>>>>>>> + * have already been initialised. 
>>>>>>>>> + */ >>>>>>>>> + struct page *page = pfn_to_page(pfn); >>>>>>>>> + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT; >>>>>>>>> + >>>>>>>>> + for (i = 0; i < nr_pages; i++, page++) { >>>>>>>>> + if (!test_and_set_bit(PG_mte_tagged, &page->flags)) >>>>>>>>> + mte_clear_page_tags(page_address(page)); >>>>>>>>> + } >>>>>>>>> + } >>>>>>>> >>>>>>>> This pfn_valid() check may be problematic. Following commit >>>>>>>> eeb0753ba27b >>>>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it >>>>>>>> returns >>>>>>>> true for ZONE_DEVICE memory but such memory is allowed not to >>>>>>>> support >>>>>>>> MTE. >>>>>>> >>>>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be >>>>>>> mapped as untagged memory in the kernel linear map. It could be >>>>>>> slightly >>>>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE, >>>>>>> untagged memory. Another overhead is pfn_valid() which will likely >>>>>>> end >>>>>>> up calling memblock_is_map_memory(). >>>>>>> >>>>>>> However, the bigger issue is that Stage 2 cannot disable tagging for >>>>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is >>>>>>> there a >>>>>>> way to detect what gets mapped in the guest as Normal Cacheable >>>>>>> memory >>>>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE >>>>>>> (or >>>>>>> something else like on-chip memory)? If we can't guarantee that all >>>>>>> Cacheable memory given to a guest supports tags, we should disable >>>>>>> the >>>>>>> feature altogether. >>>>>> >>>>>> In stage 2 I believe we only have two types of mapping - 'normal' or >>>>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the >>>>>> latter is a >>>>>> case of checking the 'device' variable, and makes sense to avoid the >>>>>> overhead you describe. 
>>>>>> >>>>>> This should also guarantee that all stage-2 cacheable memory >>>>>> supports tags, >>>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() >>>>>> should only >>>>>> be true for memory that Linux considers "normal". >>>> >>>> If you think "normal" == "normal System RAM", that's wrong; see below. >>> >>> By "normal" I think both Steven and I meant the Normal Cacheable memory >>> attribute (another being the Device memory attribute). > > Sadly there's no good standardised terminology here. Aarch64 provides > the "normal (cacheable)" definition. Memory which is mapped as "Normal > Cacheable" is implicitly MTE capable when shared with a guest (because > the stage 2 mappings don't allow restricting MTE other than mapping it > as Device memory). > > So MTE also forces us to have a definition of memory which is "bog > standard memory"[1] separate from the mapping attributes. This is the > main memory which fully supports MTE. > > Separate from the "bog standard" we have the "special"[1] memory, e.g. > ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but > that memory may not support MTE tags. This memory can only be safely > shared with a guest in the following situations: > > 1. MTE is completely disabled for the guest > > 2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE) > > 3. We have some guarantee that guest MTE access are in some way safe. > > (1) is the situation today (without this patch series). But it prevents > the guest from using MTE in any form. > > (2) is pretty terrible for general memory, but is the get-out clause for > mapping devices into the guest. > > (3) isn't something we have any architectural way of discovering. We'd > need to know what the device did with the MTE accesses (and any caches > between the CPU and the device) to ensure there aren't any side-channels > or h/w lockup issues. We'd also need some way of describing this memory > to the guest. 
> > So at least for the time being the approach is to avoid letting a guest > with MTE enabled have access to this sort of memory. > > [1] Neither "bog standard" nor "special" are real terms - like I said > there's a lack of standardised terminology. > >>>>> That's the problem. With Anshuman's commit I mentioned above, >>>>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent >>>>> memory, not talking about some I/O mapping that requires Device_nGnRE). >>>>> So kvm_is_device_pfn() is false for such memory and it may be mapped as >>>>> Normal but it is not guaranteed to support tagging. >>>> >>>> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and >>>> touch the page, you won't fault. So Anshuman's commit is correct. >>> >>> I agree. >>> >>>> pfn_to_online_page() means, "there is a struct page and it's system RAM >>>> that's in use; the memmap has a sane content" >>> >>> Does pfn_to_online_page() returns a valid struct page pointer for >>> ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for >>> some definition of system RAM (I assume NVDIMM != system RAM). For >>> example, pmem_attach_disk() calls devm_memremap_pages() and this would >>> use the Normal Cacheable memory attribute without necessarily being >>> system RAM. >> >> No, not for ZONE_DEVICE. >> >> However, if you expose PMEM via dax/kmem as System RAM to the system (-> >> add_memory_driver_managed()), then PMEM (managed via ZONE_NOMRAL or >> ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system >> thinks it's "ordinary system RAM" and the memory is managed by the buddy. > > So if I'm understanding this correctly for KVM we need to use > pfn_to_online_pages() and reject if NULL is returned? In the case of > dax/kmem there already needs to be validation that the memory supports > MTE (otherwise we break user space) before it's allowed into the > "ordinary system RAM" bucket. That should work. 1. 
One alternative is:

if (!pfn_valid(pfn))
	return false;

#ifdef CONFIG_ZONE_DEVICE
page = pfn_to_page(pfn);
if (page_zonenum(page) == ZONE_DEVICE)
	return false;
#endif

return true;

Note that when you are dealing with random PFNs, this approach is in general not safe; the memmap could be uninitialized and contain garbage. You can have false positives for ZONE_DEVICE.

2. Yet another (slower?) variant to detect (some?) ZONE_DEVICE is:

pgmap = get_dev_pagemap(pfn, NULL);
put_dev_pagemap(pgmap);
if (pgmap)
	return false;

return true;

I know that /dev/mem mappings can be problematic ... because the memmap could be in any state and actually we shouldn't even touch/rely on any "struct pages" at all, as we have a pure PFN mapping ...

--
Thanks,

David / dhildenb

^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature 2021-03-31 10:41 ` Steven Price (?) (?) @ 2021-03-31 18:43 ` Catalin Marinas -1 siblings, 0 replies; 112+ messages in thread From: Catalin Marinas @ 2021-03-31 18:43 UTC (permalink / raw) To: Steven Price Cc: David Hildenbrand, Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry On Wed, Mar 31, 2021 at 11:41:20AM +0100, Steven Price wrote: > On 31/03/2021 10:32, David Hildenbrand wrote: > > On 31.03.21 11:21, Catalin Marinas wrote: > > > On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote: > > > > On 30.03.21 12:30, Catalin Marinas wrote: > > > > > On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: > > > > > > On 28/03/2021 13:21, Catalin Marinas wrote: > > > > > > > However, the bigger issue is that Stage 2 cannot disable > > > > > > > tagging for Stage 1 unless the memory is Non-cacheable or > > > > > > > Device at S2. Is there a way to detect what gets mapped in > > > > > > > the guest as Normal Cacheable memory and make sure it's > > > > > > > only early memory or hotplug but no ZONE_DEVICE (or > > > > > > > something else like on-chip memory)? If we can't > > > > > > > guarantee that all Cacheable memory given to a guest > > > > > > > supports tags, we should disable the feature altogether. > > > > > > > > > > > > In stage 2 I believe we only have two types of mapping - > > > > > > 'normal' or DEVICE_nGnRE (see stage2_map_set_prot_attr()). > > > > > > Filtering out the latter is a case of checking the 'device' > > > > > > variable, and makes sense to avoid the overhead you > > > > > > describe. 
> > > > > > > > > > > > This should also guarantee that all stage-2 cacheable > > > > > > memory supports tags, > > > > > > as kvm_is_device_pfn() is simply !pfn_valid(), and > > > > > > pfn_valid() should only > > > > > > be true for memory that Linux considers "normal". > > > > > > > > If you think "normal" == "normal System RAM", that's wrong; see > > > > below. > > > > > > By "normal" I think both Steven and I meant the Normal Cacheable memory > > > attribute (another being the Device memory attribute). > > Sadly there's no good standardised terminology here. Aarch64 provides the > "normal (cacheable)" definition. Memory which is mapped as "Normal > Cacheable" is implicitly MTE capable when shared with a guest (because the > stage 2 mappings don't allow restricting MTE other than mapping it as Device > memory). > > So MTE also forces us to have a definition of memory which is "bog standard > memory"[1] separate from the mapping attributes. This is the main memory > which fully supports MTE. > > Separate from the "bog standard" we have the "special"[1] memory, e.g. > ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but that > memory may not support MTE tags. This memory can only be safely shared with > a guest in the following situations: > > 1. MTE is completely disabled for the guest > > 2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE) > > 3. We have some guarantee that guest MTE access are in some way safe. > > (1) is the situation today (without this patch series). But it prevents the > guest from using MTE in any form. > > (2) is pretty terrible for general memory, but is the get-out clause for > mapping devices into the guest. > > (3) isn't something we have any architectural way of discovering. We'd need > to know what the device did with the MTE accesses (and any caches between > the CPU and the device) to ensure there aren't any side-channels or h/w > lockup issues. 
We'd also need some way of describing this memory to the > guest. > > So at least for the time being the approach is to avoid letting a guest with > MTE enabled have access to this sort of memory. When a slot is added by the VMM, if it asked MTE in guest (I guess that's an opt-in by the VMM, haven't checked the other patches), can we reject it if it is going to be mapped as Normal Cacheable but it is a ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to check for ZONE_DEVICE)? This way we don't need to do more expensive checks in set_pte_at(). We could simplify the set_pte_at() further if we require that the VMM has a PROT_MTE mapping. This does not mean it cannot have two mappings, the other without PROT_MTE. But at least we get a set_pte_at() when swapping in which has PROT_MTE. We could add another PROT_TAGGED or something which means PG_mte_tagged set but still mapped as Normal Untagged. It's just that we are short of pte bits for another flag. Can we somehow identify when the S2 pte is set and can we get access to the prior swap pte? This way we could avoid changes to set_pte_at() for S2 faults. -- Catalin ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature 2021-03-31 18:43 ` Catalin Marinas (?) (?) @ 2021-04-07 10:20 ` Steven Price -1 siblings, 0 replies; 112+ messages in thread From: Steven Price @ 2021-04-07 10:20 UTC (permalink / raw) To: Catalin Marinas Cc: David Hildenbrand, Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry On 31/03/2021 19:43, Catalin Marinas wrote: > On Wed, Mar 31, 2021 at 11:41:20AM +0100, Steven Price wrote: >> On 31/03/2021 10:32, David Hildenbrand wrote: >>> On 31.03.21 11:21, Catalin Marinas wrote: >>>> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote: >>>>> On 30.03.21 12:30, Catalin Marinas wrote: >>>>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: >>>>>>> On 28/03/2021 13:21, Catalin Marinas wrote: >>>>>>>> However, the bigger issue is that Stage 2 cannot disable >>>>>>>> tagging for Stage 1 unless the memory is Non-cacheable or >>>>>>>> Device at S2. Is there a way to detect what gets mapped in >>>>>>>> the guest as Normal Cacheable memory and make sure it's >>>>>>>> only early memory or hotplug but no ZONE_DEVICE (or >>>>>>>> something else like on-chip memory)? If we can't >>>>>>>> guarantee that all Cacheable memory given to a guest >>>>>>>> supports tags, we should disable the feature altogether. >>>>>>> >>>>>>> In stage 2 I believe we only have two types of mapping - >>>>>>> 'normal' or DEVICE_nGnRE (see stage2_map_set_prot_attr()). >>>>>>> Filtering out the latter is a case of checking the 'device' >>>>>>> variable, and makes sense to avoid the overhead you >>>>>>> describe. 
>>>>>>> >>>>>>> This should also guarantee that all stage-2 cacheable >>>>>>> memory supports tags, >>>>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and >>>>>>> pfn_valid() should only >>>>>>> be true for memory that Linux considers "normal". >>>>> >>>>> If you think "normal" == "normal System RAM", that's wrong; see >>>>> below. >>>> >>>> By "normal" I think both Steven and I meant the Normal Cacheable memory >>>> attribute (another being the Device memory attribute). >> >> Sadly there's no good standardised terminology here. Aarch64 provides the >> "normal (cacheable)" definition. Memory which is mapped as "Normal >> Cacheable" is implicitly MTE capable when shared with a guest (because the >> stage 2 mappings don't allow restricting MTE other than mapping it as Device >> memory). >> >> So MTE also forces us to have a definition of memory which is "bog standard >> memory"[1] separate from the mapping attributes. This is the main memory >> which fully supports MTE. >> >> Separate from the "bog standard" we have the "special"[1] memory, e.g. >> ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but that >> memory may not support MTE tags. This memory can only be safely shared with >> a guest in the following situations: >> >> 1. MTE is completely disabled for the guest >> >> 2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE) >> >> 3. We have some guarantee that guest MTE access are in some way safe. >> >> (1) is the situation today (without this patch series). But it prevents the >> guest from using MTE in any form. >> >> (2) is pretty terrible for general memory, but is the get-out clause for >> mapping devices into the guest. >> >> (3) isn't something we have any architectural way of discovering. We'd need >> to know what the device did with the MTE accesses (and any caches between >> the CPU and the device) to ensure there aren't any side-channels or h/w >> lockup issues. 
We'd also need some way of describing this memory to the >> guest. >> >> So at least for the time being the approach is to avoid letting a guest with >> MTE enabled have access to this sort of memory. > > When a slot is added by the VMM, if it asked MTE in guest (I guess > that's an opt-in by the VMM, haven't checked the other patches), can we > reject it if it is going to be mapped as Normal Cacheable but it is a > ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to > check for ZONE_DEVICE)? This way we don't need to do more expensive > checks in set_pte_at(). The problem is that KVM allows the VMM to change the memory backing a slot while the guest is running. This is obviously useful for the likes of migration, but ultimately means that even if you were to do checks at the time of slot creation, you would need to repeat the checks at set_pte_at() time to ensure a mischievous VMM didn't swap the page for a problematic one. > We could simplify the set_pte_at() further if we require that the VMM > has a PROT_MTE mapping. This does not mean it cannot have two mappings, > the other without PROT_MTE. But at least we get a set_pte_at() when > swapping in which has PROT_MTE. That is certainly an option - but from what I've seen of trying to implement a VMM to support MTE, the PROT_MTE mapping is not what you actually want in user space. Two mappings is possible but is likely to complicate the VMM. > We could add another PROT_TAGGED or something which means PG_mte_tagged > set but still mapped as Normal Untagged. It's just that we are short of > pte bits for another flag. That could help here - although it's slightly odd as you're asking the kernel to track the tags, but not allowing user space (direct) access to them. Like you say, using up the precious bits for this seems like it might be short-sighted. > Can we somehow identify when the S2 pte is set and can we get access to > the prior swap pte? 
This way we could avoid changes to set_pte_at() for > S2 faults. > Unless I'm misunderstanding the code the swap information is (only) stored in the stage 1 user-space VMM PTE. When we get a stage 2 fault this triggers a corresponding access attempt to the VMM's address space. It's at this point when populating the VMM's page tables that the swap information is discovered. The problem at the moment is a mismatch regarding whether the page needs tags or not. The VMM's mapping can (currently) be !PROT_MTE which means we wouldn't normally require restoring/zeroing the tags. However the stage 2 access requires that the tags be preserved. Requiring PROT_MTE (or PROT_TAGGED as above) would certainly simplify the handling in the kernel. Of course I did propose the 'requiring PROT_MTE' approach before which led to a conversation[1] ending with a conclusion[2] that: I'd much rather the kernel just provided us with an API for what we want, which is (1) the guest RAM as just RAM with no tag checking and separately (2) some mechanism yet-to-be-designed which lets us bulk copy a page's worth of tags for migration. Which is what I've implemented ;) Do you think it's worth investigating the PROT_TAGGED approach as a middle ground? My gut feeling is that it's a waste of a VM flag, but I agree it would certainly make the code cleaner. Steve [1] https://lore.kernel.org/kvmarm/CAFEAcA85fiqA206FuFANKbV_3GkfY1F8Gv7MP58BgTT81bs9kA@mail.gmail.com/ [2] https://lore.kernel.org/kvmarm/CAFEAcA_K47jKSp46DFK-AKWv6MD1pkrEB6FNz=HNGdxmBDCSbw@mail.gmail.com/ ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature @ 2021-04-07 10:20 ` Steven Price 0 siblings, 0 replies; 112+ messages in thread From: Steven Price @ 2021-04-07 10:20 UTC (permalink / raw) To: Catalin Marinas Cc: David Hildenbrand, Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry On 31/03/2021 19:43, Catalin Marinas wrote: > On Wed, Mar 31, 2021 at 11:41:20AM +0100, Steven Price wrote: >> On 31/03/2021 10:32, David Hildenbrand wrote: >>> On 31.03.21 11:21, Catalin Marinas wrote: >>>> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote: >>>>> On 30.03.21 12:30, Catalin Marinas wrote: >>>>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote: >>>>>>> On 28/03/2021 13:21, Catalin Marinas wrote: >>>>>>>> However, the bigger issue is that Stage 2 cannot disable >>>>>>>> tagging for Stage 1 unless the memory is Non-cacheable or >>>>>>>> Device at S2. Is there a way to detect what gets mapped in >>>>>>>> the guest as Normal Cacheable memory and make sure it's >>>>>>>> only early memory or hotplug but no ZONE_DEVICE (or >>>>>>>> something else like on-chip memory)?� If we can't >>>>>>>> guarantee that all Cacheable memory given to a guest >>>>>>>> supports tags, we should disable the feature altogether. >>>>>>> >>>>>>> In stage 2 I believe we only have two types of mapping - >>>>>>> 'normal' or DEVICE_nGnRE (see stage2_map_set_prot_attr()). >>>>>>> Filtering out the latter is a case of checking the 'device' >>>>>>> variable, and makes sense to avoid the overhead you >>>>>>> describe. 
>>>>>>> >>>>>>> This should also guarantee that all stage-2 cacheable >>>>>>> memory supports tags, >>>>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and >>>>>>> pfn_valid() should only >>>>>>> be true for memory that Linux considers "normal". >>>>> >>>>> If you think "normal" == "normal System RAM", that's wrong; see >>>>> below. >>>> >>>> By "normal" I think both Steven and I meant the Normal Cacheable memory >>>> attribute (another being the Device memory attribute). >> >> Sadly there's no good standardised terminology here. Aarch64 provides the >> "normal (cacheable)" definition. Memory which is mapped as "Normal >> Cacheable" is implicitly MTE capable when shared with a guest (because the >> stage 2 mappings don't allow restricting MTE other than mapping it as Device >> memory). >> >> So MTE also forces us to have a definition of memory which is "bog standard >> memory"[1] separate from the mapping attributes. This is the main memory >> which fully supports MTE. >> >> Separate from the "bog standard" we have the "special"[1] memory, e.g. >> ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but that >> memory may not support MTE tags. This memory can only be safely shared with >> a guest in the following situations: >> >> 1. MTE is completely disabled for the guest >> >> 2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE) >> >> 3. We have some guarantee that guest MTE access are in some way safe. >> >> (1) is the situation today (without this patch series). But it prevents the >> guest from using MTE in any form. >> >> (2) is pretty terrible for general memory, but is the get-out clause for >> mapping devices into the guest. >> >> (3) isn't something we have any architectural way of discovering. We'd need >> to know what the device did with the MTE accesses (and any caches between >> the CPU and the device) to ensure there aren't any side-channels or h/w >> lockup issues. 
We'd also need some way of describing this memory to the >> guest. >> >> So at least for the time being the approach is to avoid letting a guest with >> MTE enabled have access to this sort of memory. > > When a slot is added by the VMM, if it asked MTE in guest (I guess > that's an opt-in by the VMM, haven't checked the other patches), can we > reject it if it's is going to be mapped as Normal Cacheable but it is a > ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to > check for ZONE_DEVICE)? This way we don't need to do more expensive > checks in set_pte_at(). The problem is that KVM allows the VMM to change the memory backing a slot while the guest is running. This is obviously useful for the likes of migration, but ultimately means that even if you were to do checks at the time of slot creation, you would need to repeat the checks at set_pte_at() time to ensure a mischievous VMM didn't swap the page for a problematic one. > We could simplify the set_pte_at() further if we require that the VMM > has a PROT_MTE mapping. This does not mean it cannot have two mappings, > the other without PROT_MTE. But at least we get a set_pte_at() when > swapping in which has PROT_MTE. That is certainly an option - but from what I've seen of trying to implement a VMM to support MTE, the PROT_MTE mapping is not what you actually want in user space. Two mappings is possible but is likely to complicate the VMM. > We could add another PROT_TAGGED or something which means PG_mte_tagged > set but still mapped as Normal Untagged. It's just that we are short of > pte bits for another flag. That could help here - although it's slightly odd as you're asking the kernel to track the tags, but not allowing user space (direct) access to them. Like you say using us the precious bits for this seems like it might be short-sighted. > Can we somehow identify when the S2 pte is set and can we get access to > the prior swap pte? 
> This way we could avoid changes to set_pte_at() for S2 faults.

Unless I'm misunderstanding the code, the swap information is (only)
stored in the stage 1 user-space VMM PTE. When we get a stage 2 fault
this triggers a corresponding access attempt to the VMM's address space,
and it's at that point, when populating the VMM's page tables, that the
swap information is discovered.

The problem at the moment is a mismatch regarding whether the page needs
tags or not. The VMM's mapping can (currently) be !PROT_MTE, which means
we wouldn't normally require restoring/zeroing the tags. However the
stage 2 access requires that the tags be preserved. Requiring PROT_MTE
(or PROT_TAGGED as above) would certainly simplify the handling in the
kernel.

Of course I did propose the 'requiring PROT_MTE' approach before, which
led to a conversation[1] ending with the conclusion[2] that:

  I'd much rather the kernel just provided us with an API for what we
  want, which is (1) the guest RAM as just RAM with no tag checking and
  separately (2) some mechanism yet-to-be-designed which lets us bulk
  copy a page's worth of tags for migration.

Which is what I've implemented ;)

Do you think it's worth investigating the PROT_TAGGED approach as a
middle ground? My gut feeling is that it's a waste of a VM flag, but I
agree it would certainly make the code cleaner.

Steve

[1] https://lore.kernel.org/kvmarm/CAFEAcA85fiqA206FuFANKbV_3GkfY1F8Gv7MP58BgTT81bs9kA@mail.gmail.com/
[2] https://lore.kernel.org/kvmarm/CAFEAcA_K47jKSp46DFK-AKWv6MD1pkrEB6FNz=HNGdxmBDCSbw@mail.gmail.com/

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature
  2021-04-07 10:20 ` Steven Price
@ 2021-04-07 15:14 ` Catalin Marinas
  0 siblings, 0 replies; 112+ messages in thread
From: Catalin Marinas @ 2021-04-07 15:14 UTC (permalink / raw)
To: Steven Price
Cc: David Hildenbrand, Mark Rutland, Peter Maydell, Dr. David Alan Gilbert,
  Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier,
  Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse,
  linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry

On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
> On 31/03/2021 19:43, Catalin Marinas wrote:
> > When a slot is added by the VMM, if it asked for MTE in the guest (I
> > guess that's an opt-in by the VMM, haven't checked the other
> > patches), can we reject it if it is going to be mapped as Normal
> > Cacheable but it is ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of
> > David's suggestions to check for ZONE_DEVICE)? This way we don't
> > need to do more expensive checks in set_pte_at().
>
> The problem is that KVM allows the VMM to change the memory backing a
> slot while the guest is running. This is obviously useful for the
> likes of migration, but ultimately means that even if you were to do
> checks at the time of slot creation, you would need to repeat the
> checks at set_pte_at() time to ensure a mischievous VMM didn't swap
> the page for a problematic one.

Does changing the slot require some KVM API call? Can we intercept it
and do the checks there?

Maybe a better alternative for the time being is to add a new
kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns
true _and_ the VMM asked for MTE in the guest. We can then only set
PG_mte_tagged if !device.

We can later relax this further to Normal Non-cacheable for ZONE_DEVICE
memory (via a new KVM_PGTABLE_PROT_NORMAL_NC) or even Normal Cacheable
if we manage to change the behaviour of the architecture.
> > We could add another PROT_TAGGED or something which means
> > PG_mte_tagged is set but the memory is still mapped as Normal
> > Untagged. It's just that we are short of pte bits for another flag.
>
> That could help here - although it's slightly odd as you're asking the
> kernel to track the tags, but not allowing user space (direct) access
> to them. Like you say, using up the precious bits for this seems like
> it might be short-sighted.

Yeah, let's scrap this idea. We set PG_mte_tagged in user_mem_abort(),
so we already know it's a page potentially containing tags. On restoring
from swap, we need to check for MTE metadata irrespective of whether the
user pte is tagged or not, as you already did in patch 1. I'll get back
to that and look at the potential races.

BTW, after a page is restored from swap, how long do we keep the
metadata around? I think we can delete it as soon as it was restored and
PG_mte_tagged was set. Currently it looks like we only do this when the
actual page was freed or at swapoff. I haven't convinced myself that
it's safe to do this for swapoff unless it guarantees that all the ptes
sharing a page have been restored.

-- 
Catalin

^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature
  2021-04-07 15:14 ` Catalin Marinas
@ 2021-04-07 15:30 ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2021-04-07 15:30 UTC (permalink / raw)
To: Catalin Marinas, Steven Price
Cc: Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones,
  Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela,
  Richard Henderson, linux-kernel, Dave Martin, James Morse,
  linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry

On 07.04.21 17:14, Catalin Marinas wrote:
> On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
>> On 31/03/2021 19:43, Catalin Marinas wrote:
>>> When a slot is added by the VMM, if it asked for MTE in the guest (I
>>> guess that's an opt-in by the VMM, haven't checked the other
>>> patches), can we reject it if it is going to be mapped as Normal
>>> Cacheable but it is ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of
>>> David's suggestions to check for ZONE_DEVICE)? This way we don't
>>> need to do more expensive checks in set_pte_at().
>>
>> The problem is that KVM allows the VMM to change the memory backing a
>> slot while the guest is running. This is obviously useful for the
>> likes of migration, but ultimately means that even if you were to do
>> checks at the time of slot creation, you would need to repeat the
>> checks at set_pte_at() time to ensure a mischievous VMM didn't swap
>> the page for a problematic one.
>
> Does changing the slot require some KVM API call? Can we intercept it
> and do the checks there?

User space can simply mmap(MAP_FIXED) over the user space area
registered in a KVM memory slot. You cannot really intercept that. You
can only check in the KVM MMU code, at fault time, that the VMA which
you hold in your hands is still in a proper state. The KVM MMU is
synchronous, which means that updates to the VMA layout are reflected in
the KVM MMU page tables -- e.g., via mmu notifier calls.
E.g., in s390x code we cannot handle VMAs with gigantic pages. We check
that when faulting (creating the links in the page table) via
__gmap_link().

You could either check the page itself (ZONE_DEVICE) or might even be
able to check via the VMA/file.

-- 
Thanks,

David / dhildenb

^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature 2021-04-07 15:14 ` Catalin Marinas (?) (?) @ 2021-04-07 15:52 ` Steven Price -1 siblings, 0 replies; 112+ messages in thread From: Steven Price @ 2021-04-07 15:52 UTC (permalink / raw) To: Catalin Marinas Cc: David Hildenbrand, Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry On 07/04/2021 16:14, Catalin Marinas wrote: > On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote: >> On 31/03/2021 19:43, Catalin Marinas wrote: >>> When a slot is added by the VMM, if it asked for MTE in guest (I guess >>> that's an opt-in by the VMM, haven't checked the other patches), can we >>> reject it if it's is going to be mapped as Normal Cacheable but it is a >>> ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to >>> check for ZONE_DEVICE)? This way we don't need to do more expensive >>> checks in set_pte_at(). >> >> The problem is that KVM allows the VMM to change the memory backing a slot >> while the guest is running. This is obviously useful for the likes of >> migration, but ultimately means that even if you were to do checks at the >> time of slot creation, you would need to repeat the checks at set_pte_at() >> time to ensure a mischievous VMM didn't swap the page for a problematic one. > > Does changing the slot require some KVM API call? Can we intercept it > and do the checks there? As David has already replied - KVM uses MMU notifiers, so there's not really a good place to intercept this before the fault. > Maybe a better alternative for the time being is to add a new > kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns > true _and_ the VMM asked for MTE in guest. We can then only set > PG_mte_tagged if !device. 
KVM already has a kvm_is_device_pfn(), and yes I agree restricting the MTE
checks to only !kvm_is_device_pfn() makes sense (I have the fix in my
branch locally).

> We can later relax this further to Normal Non-cacheable for ZONE_DEVICE
> memory (via a new KVM_PGTABLE_PROT_NORMAL_NC) or even Normal Cacheable
> if we manage to change the behaviour of the architecture.

Indeed, it'll be interesting to see whether people want to build MTE
capable systems with significant quantities of non-MTE capable memory. But
for a first stage let's stick with either all guest memory (except
devices) is MTE or you disable MTE for the guest.

>>> We could add another PROT_TAGGED or something which means PG_mte_tagged
>>> set but still mapped as Normal Untagged. It's just that we are short of
>>> pte bits for another flag.
>>
>> That could help here - although it's slightly odd as you're asking the
>> kernel to track the tags, but not allowing user space (direct) access to
>> them. Like you say, using up the precious bits for this seems like it
>> might be short-sighted.
>
> Yeah, let's scrap this idea. We set PG_mte_tagged in user_mem_abort(),
> so we already know it's a page potentially containing tags. On
> restoring from swap, we need to check for MTE metadata irrespective of
> whether the user pte is tagged or not, as you already did in patch 1.
> I'll get back to that and look at the potential races.
>
> BTW, after a page is restored from swap, how long do we keep the
> metadata around? I think we can delete it as soon as it was restored and
> PG_mte_tagged was set. Currently it looks like we only do this when the
> actual page was freed or swapoff. I haven't convinced myself that it's
> safe to do this for swapoff unless it guarantees that all the ptes
> sharing a page have been restored.

My initial thought was to free the metadata immediately. However it turns
out that the following sequence can happen:

1. Swap out a page
2. Swap the page in *read only*
3. Discard the page
4. Swap the page in again

So there's no writing of the swap data again before (3). This works nicely
with a swap device because after writing a page it stays there forever, so
if you know it hasn't been modified it's pointless rewriting it. Sadly
it's not quite so ideal with the MTE tags, which are currently kept in
RAM. Arguably it would make sense to modify the on-disk swap format to
include the tags - but that would open a whole new can of worms!

swapoff needs to ensure that all the PTEs have been restored because after
the swapoff has completed the PTEs will be pointing at a swap entry which
is no longer valid (and could even have been reallocated to point to a new
swap device). When you issue a swapoff, Linux will scan the mmlist and the
page tables of every process to search for swap entry PTEs relating to the
swap device which is being removed (see try_to_unuse()).

Steve

^ permalink raw reply	[flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature 2021-04-07 15:52 ` Steven Price (?) (?) @ 2021-04-08 14:18 ` Catalin Marinas -1 siblings, 0 replies; 112+ messages in thread From: Catalin Marinas @ 2021-04-08 14:18 UTC (permalink / raw) To: Steven Price Cc: David Hildenbrand, Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry On Wed, Apr 07, 2021 at 04:52:54PM +0100, Steven Price wrote: > On 07/04/2021 16:14, Catalin Marinas wrote: > > On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote: > > > On 31/03/2021 19:43, Catalin Marinas wrote: > > > > When a slot is added by the VMM, if it asked for MTE in guest (I guess > > > > that's an opt-in by the VMM, haven't checked the other patches), can we > > > > reject it if it's is going to be mapped as Normal Cacheable but it is a > > > > ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to > > > > check for ZONE_DEVICE)? This way we don't need to do more expensive > > > > checks in set_pte_at(). > > > > > > The problem is that KVM allows the VMM to change the memory backing a slot > > > while the guest is running. This is obviously useful for the likes of > > > migration, but ultimately means that even if you were to do checks at the > > > time of slot creation, you would need to repeat the checks at set_pte_at() > > > time to ensure a mischievous VMM didn't swap the page for a problematic one. > > > > Does changing the slot require some KVM API call? Can we intercept it > > and do the checks there? > > As David has already replied - KVM uses MMU notifiers, so there's not really > a good place to intercept this before the fault. 
> > > Maybe a better alternative for the time being is to add a new > > kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns > > true _and_ the VMM asked for MTE in guest. We can then only set > > PG_mte_tagged if !device. > > KVM already has a kvm_is_device_pfn(), and yes I agree restricting the MTE > checks to only !kvm_is_device_pfn() makes sense (I have the fix in my branch > locally). Indeed, you can skip it if kvm_is_device_pfn(). In addition, with MTE, I'd also mark a pfn as 'device' in user_mem_abort() if pfn_to_online_page() is NULL as we don't want to map it as Cacheable in Stage 2. It's unlikely that we'll trip over this path but just in case. (can we have a ZONE_DEVICE _online_ pfn or by definition they are considered offline?) > > BTW, after a page is restored from swap, how long do we keep the > > metadata around? I think we can delete it as soon as it was restored and > > PG_mte_tagged was set. Currently it looks like we only do this when the > > actual page was freed or swapoff. I haven't convinced myself that it's > > safe to do this for swapoff unless it guarantees that all the ptes > > sharing a page have been restored. > > My initial thought was to free the metadata immediately. However it turns > out that the following sequence can happen: > > 1. Swap out a page > 2. Swap the page in *read only* > 3. Discard the page > 4. Swap the page in again > > So there's no writing of the swap data again before (3). This works nicely > with a swap device because after writing a page it stays there forever, so > if you know it hasn't been modified it's pointless rewriting it. Sadly it's > not quite so ideal with the MTE tags which are currently kept in RAM. I missed this scenario. So we need to keep it around as long as the corresponding swap storage is still valid. -- Catalin ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature @ 2021-04-08 14:18 ` Catalin Marinas 0 siblings, 0 replies; 112+ messages in thread From: Catalin Marinas @ 2021-04-08 14:18 UTC (permalink / raw) To: Steven Price Cc: David Hildenbrand, Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry On Wed, Apr 07, 2021 at 04:52:54PM +0100, Steven Price wrote: > On 07/04/2021 16:14, Catalin Marinas wrote: > > On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote: > > > On 31/03/2021 19:43, Catalin Marinas wrote: > > > > When a slot is added by the VMM, if it asked for MTE in guest (I guess > > > > that's an opt-in by the VMM, haven't checked the other patches), can we > > > > reject it if it's is going to be mapped as Normal Cacheable but it is a > > > > ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to > > > > check for ZONE_DEVICE)? This way we don't need to do more expensive > > > > checks in set_pte_at(). > > > > > > The problem is that KVM allows the VMM to change the memory backing a slot > > > while the guest is running. This is obviously useful for the likes of > > > migration, but ultimately means that even if you were to do checks at the > > > time of slot creation, you would need to repeat the checks at set_pte_at() > > > time to ensure a mischievous VMM didn't swap the page for a problematic one. > > > > Does changing the slot require some KVM API call? Can we intercept it > > and do the checks there? > > As David has already replied - KVM uses MMU notifiers, so there's not really > a good place to intercept this before the fault. > > > Maybe a better alternative for the time being is to add a new > > kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns > > true _and_ the VMM asked for MTE in guest. 
We can then only set > > PG_mte_tagged if !device. > > KVM already has a kvm_is_device_pfn(), and yes I agree restricting the MTE > checks to only !kvm_is_device_pfn() makes sense (I have the fix in my branch > locally). Indeed, you can skip it if kvm_is_device_pfn(). In addition, with MTE, I'd also mark a pfn as 'device' in user_mem_abort() if pfn_to_online_page() is NULL as we don't want to map it as Cacheable in Stage 2. It's unlikely that we'll trip over this path but just in case. (can we have a ZONE_DEVICE _online_ pfn or by definition they are considered offline?) > > BTW, after a page is restored from swap, how long do we keep the > > metadata around? I think we can delete it as soon as it was restored and > > PG_mte_tagged was set. Currently it looks like we only do this when the > > actual page was freed or swapoff. I haven't convinced myself that it's > > safe to do this for swapoff unless it guarantees that all the ptes > > sharing a page have been restored. > > My initial thought was to free the metadata immediately. However it turns > out that the following sequence can happen: > > 1. Swap out a page > 2. Swap the page in *read only* > 3. Discard the page > 4. Swap the page in again > > So there's no writing of the swap data again before (3). This works nicely > with a swap device because after writing a page it stays there forever, so > if you know it hasn't been modified it's pointless rewriting it. Sadly it's > not quite so ideal with the MTE tags which are currently kept in RAM. I missed this scenario. So we need to keep it around as long as the corresponding swap storage is still valid. -- Catalin _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature
2021-04-08 14:18 ` Catalin Marinas
@ 2021-04-08 18:16 ` David Hildenbrand
-1 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2021-04-08 18:16 UTC (permalink / raw)
To: Catalin Marinas, Steven Price
Cc: Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry

On 08.04.21 16:18, Catalin Marinas wrote:
> On Wed, Apr 07, 2021 at 04:52:54PM +0100, Steven Price wrote:
>> On 07/04/2021 16:14, Catalin Marinas wrote:
>>> On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
>>>> On 31/03/2021 19:43, Catalin Marinas wrote:
>>>>> When a slot is added by the VMM, if it asked for MTE in guest (I guess
>>>>> that's an opt-in by the VMM, haven't checked the other patches), can we
>>>>> reject it if it is going to be mapped as Normal Cacheable but it is a
>>>>> ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
>>>>> check for ZONE_DEVICE)? This way we don't need to do more expensive
>>>>> checks in set_pte_at().
>>>>
>>>> The problem is that KVM allows the VMM to change the memory backing a slot
>>>> while the guest is running. This is obviously useful for the likes of
>>>> migration, but ultimately means that even if you were to do checks at the
>>>> time of slot creation, you would need to repeat the checks at set_pte_at()
>>>> time to ensure a mischievous VMM didn't swap the page for a problematic one.
>>>
>>> Does changing the slot require some KVM API call? Can we intercept it
>>> and do the checks there?
>>
>> As David has already replied - KVM uses MMU notifiers, so there's not really
>> a good place to intercept this before the fault.
>>
>>> Maybe a better alternative for the time being is to add a new
>>> kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns
>>> true _and_ the VMM asked for MTE in guest. We can then only set
>>> PG_mte_tagged if !device.
>>
>> KVM already has a kvm_is_device_pfn(), and yes I agree restricting the MTE
>> checks to only !kvm_is_device_pfn() makes sense (I have the fix in my branch
>> locally).
>
> Indeed, you can skip it if kvm_is_device_pfn(). In addition, with MTE,
> I'd also mark a pfn as 'device' in user_mem_abort() if
> pfn_to_online_page() is NULL as we don't want to map it as Cacheable in
> Stage 2. It's unlikely that we'll trip over this path but just in case.
>
> (can we have a ZONE_DEVICE _online_ pfn or by definition they are
> considered offline?)

By definition (and implementation) offline. When you get a page =
pfn_to_online_page() with page != NULL, that one should never be
ZONE_DEVICE (otherwise it would be a BUG).

As I said, things are different when exposing dax memory via dax/kmem
to the buddy. But then, we are no longer talking about ZONE_DEVICE.

-- 
Thanks,

David / dhildenb
* Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature
2021-04-08 18:16 ` David Hildenbrand
@ 2021-04-08 18:21 ` Catalin Marinas
-1 siblings, 0 replies; 112+ messages in thread
From: Catalin Marinas @ 2021-04-08 18:21 UTC (permalink / raw)
To: David Hildenbrand
Cc: Steven Price, Mark Rutland, Peter Maydell, Dr. David Alan Gilbert, Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel, Dave Martin, James Morse, linux-arm-kernel, Thomas Gleixner, Will Deacon, kvmarm, Julien Thierry

On Thu, Apr 08, 2021 at 08:16:17PM +0200, David Hildenbrand wrote:
> On 08.04.21 16:18, Catalin Marinas wrote:
> > On Wed, Apr 07, 2021 at 04:52:54PM +0100, Steven Price wrote:
> > > On 07/04/2021 16:14, Catalin Marinas wrote:
> > > > On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
> > > > > On 31/03/2021 19:43, Catalin Marinas wrote:
> > > > > > When a slot is added by the VMM, if it asked for MTE in guest (I guess
> > > > > > that's an opt-in by the VMM, haven't checked the other patches), can we
> > > > > > reject it if it is going to be mapped as Normal Cacheable but it is a
> > > > > > ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
> > > > > > check for ZONE_DEVICE)? This way we don't need to do more expensive
> > > > > > checks in set_pte_at().
> > > > >
> > > > > The problem is that KVM allows the VMM to change the memory backing a slot
> > > > > while the guest is running. This is obviously useful for the likes of
> > > > > migration, but ultimately means that even if you were to do checks at the
> > > > > time of slot creation, you would need to repeat the checks at set_pte_at()
> > > > > time to ensure a mischievous VMM didn't swap the page for a problematic one.
> > > >
> > > > Does changing the slot require some KVM API call? Can we intercept it
> > > > and do the checks there?
> > >
> > > As David has already replied - KVM uses MMU notifiers, so there's not really
> > > a good place to intercept this before the fault.
> > >
> > > > Maybe a better alternative for the time being is to add a new
> > > > kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns
> > > > true _and_ the VMM asked for MTE in guest. We can then only set
> > > > PG_mte_tagged if !device.
> > >
> > > KVM already has a kvm_is_device_pfn(), and yes I agree restricting the MTE
> > > checks to only !kvm_is_device_pfn() makes sense (I have the fix in my branch
> > > locally).
> >
> > Indeed, you can skip it if kvm_is_device_pfn(). In addition, with MTE,
> > I'd also mark a pfn as 'device' in user_mem_abort() if
> > pfn_to_online_page() is NULL as we don't want to map it as Cacheable in
> > Stage 2. It's unlikely that we'll trip over this path but just in case.
> >
> > (can we have a ZONE_DEVICE _online_ pfn or by definition they are
> > considered offline?)
>
> By definition (and implementation) offline. When you get a page =
> pfn_to_online_page() with page != NULL, that one should never be ZONE_DEVICE
> (otherwise it would be a BUG).
>
> As I said, things are different when exposing dax memory via dax/kmem to the
> buddy. But then, we are no longer talking about ZONE_DEVICE.

Thanks David, it's clear now.

-- 
Catalin
* [PATCH v10 3/6] arm64: kvm: Save/restore MTE registers 2021-03-12 15:18 ` Steven Price (?) (?) @ 2021-03-12 15:18 ` Steven Price -1 siblings, 0 replies; 112+ messages in thread From: Steven Price @ 2021-03-12 15:18 UTC (permalink / raw) To: Catalin Marinas, Marc Zyngier, Will Deacon Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela, Dr. David Alan Gilbert, Richard Henderson, Peter Maydell, Haibo Xu, Andrew Jones Define the new system registers that MTE introduces and context switch them. The MTE feature is still hidden from the ID register as it isn't supported in a VM yet. Signed-off-by: Steven Price <steven.price@arm.com> --- arch/arm64/include/asm/kvm_host.h | 6 ++ arch/arm64/include/asm/kvm_mte.h | 66 ++++++++++++++++++++++ arch/arm64/include/asm/sysreg.h | 3 +- arch/arm64/kernel/asm-offsets.c | 3 + arch/arm64/kvm/hyp/entry.S | 7 +++ arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++++++ arch/arm64/kvm/sys_regs.c | 22 ++++++-- 7 files changed, 123 insertions(+), 5 deletions(-) create mode 100644 arch/arm64/include/asm/kvm_mte.h diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 1170ee137096..d00cc3590f6e 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -208,6 +208,12 @@ enum vcpu_sysreg { CNTP_CVAL_EL0, CNTP_CTL_EL0, + /* Memory Tagging Extension registers */ + RGSR_EL1, /* Random Allocation Tag Seed Register */ + GCR_EL1, /* Tag Control Register */ + TFSR_EL1, /* Tag Fault Status Register (EL1) */ + TFSRE0_EL1, /* Tag Fault Status Register (EL0) */ + /* 32bit specific registers. 
Keep them at the end of the range */ DACR32_EL2, /* Domain Access Control Register */ IFSR32_EL2, /* Instruction Fault Status Register */ diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h new file mode 100644 index 000000000000..6541c7d6ce06 --- /dev/null +++ b/arch/arm64/include/asm/kvm_mte.h @@ -0,0 +1,66 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2020 ARM Ltd. + */ +#ifndef __ASM_KVM_MTE_H +#define __ASM_KVM_MTE_H + +#ifdef __ASSEMBLY__ + +#include <asm/sysreg.h> + +#ifdef CONFIG_ARM64_MTE + +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1 +alternative_if_not ARM64_MTE + b .L__skip_switch\@ +alternative_else_nop_endif + mrs \reg1, hcr_el2 + and \reg1, \reg1, #(HCR_ATA) + cbz \reg1, .L__skip_switch\@ + + mrs_s \reg1, SYS_RGSR_EL1 + str \reg1, [\h_ctxt, #CPU_RGSR_EL1] + mrs_s \reg1, SYS_GCR_EL1 + str \reg1, [\h_ctxt, #CPU_GCR_EL1] + + ldr \reg1, [\g_ctxt, #CPU_RGSR_EL1] + msr_s SYS_RGSR_EL1, \reg1 + ldr \reg1, [\g_ctxt, #CPU_GCR_EL1] + msr_s SYS_GCR_EL1, \reg1 + +.L__skip_switch\@: +.endm + +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1 +alternative_if_not ARM64_MTE + b .L__skip_switch\@ +alternative_else_nop_endif + mrs \reg1, hcr_el2 + and \reg1, \reg1, #(HCR_ATA) + cbz \reg1, .L__skip_switch\@ + + mrs_s \reg1, SYS_RGSR_EL1 + str \reg1, [\g_ctxt, #CPU_RGSR_EL1] + mrs_s \reg1, SYS_GCR_EL1 + str \reg1, [\g_ctxt, #CPU_GCR_EL1] + + ldr \reg1, [\h_ctxt, #CPU_RGSR_EL1] + msr_s SYS_RGSR_EL1, \reg1 + ldr \reg1, [\h_ctxt, #CPU_GCR_EL1] + msr_s SYS_GCR_EL1, \reg1 + +.L__skip_switch\@: +.endm + +#else /* CONFIG_ARM64_MTE */ + +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1 +.endm + +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1 +.endm + +#endif /* CONFIG_ARM64_MTE */ +#endif /* __ASSEMBLY__ */ +#endif /* __ASM_KVM_MTE_H */ diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h index dfd4edbfe360..5424d195cf96 100644 --- a/arch/arm64/include/asm/sysreg.h +++ b/arch/arm64/include/asm/sysreg.h @@ 
-580,7 +580,8 @@ #define SCTLR_ELx_M (BIT(0)) #define SCTLR_ELx_FLAGS (SCTLR_ELx_M | SCTLR_ELx_A | SCTLR_ELx_C | \ - SCTLR_ELx_SA | SCTLR_ELx_I | SCTLR_ELx_IESB) + SCTLR_ELx_SA | SCTLR_ELx_I | SCTLR_ELx_IESB | \ + SCTLR_ELx_ITFSB) /* SCTLR_EL2 specific flags. */ #define SCTLR_EL2_RES1 ((BIT(4)) | (BIT(5)) | (BIT(11)) | (BIT(16)) | \ diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c index a36e2fc330d4..944e4f1f45d9 100644 --- a/arch/arm64/kernel/asm-offsets.c +++ b/arch/arm64/kernel/asm-offsets.c @@ -108,6 +108,9 @@ int main(void) DEFINE(VCPU_WORKAROUND_FLAGS, offsetof(struct kvm_vcpu, arch.workaround_flags)); DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, arch.hcr_el2)); DEFINE(CPU_USER_PT_REGS, offsetof(struct kvm_cpu_context, regs)); + DEFINE(CPU_RGSR_EL1, offsetof(struct kvm_cpu_context, sys_regs[RGSR_EL1])); + DEFINE(CPU_GCR_EL1, offsetof(struct kvm_cpu_context, sys_regs[GCR_EL1])); + DEFINE(CPU_TFSRE0_EL1, offsetof(struct kvm_cpu_context, sys_regs[TFSRE0_EL1])); DEFINE(CPU_APIAKEYLO_EL1, offsetof(struct kvm_cpu_context, sys_regs[APIAKEYLO_EL1])); DEFINE(CPU_APIBKEYLO_EL1, offsetof(struct kvm_cpu_context, sys_regs[APIBKEYLO_EL1])); DEFINE(CPU_APDAKEYLO_EL1, offsetof(struct kvm_cpu_context, sys_regs[APDAKEYLO_EL1])); diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S index b0afad7a99c6..c67582c6dd55 100644 --- a/arch/arm64/kvm/hyp/entry.S +++ b/arch/arm64/kvm/hyp/entry.S @@ -13,6 +13,7 @@ #include <asm/kvm_arm.h> #include <asm/kvm_asm.h> #include <asm/kvm_mmu.h> +#include <asm/kvm_mte.h> #include <asm/kvm_ptrauth.h> .text @@ -51,6 +52,9 @@ alternative_else_nop_endif add x29, x0, #VCPU_CONTEXT + // mte_switch_to_guest(g_ctxt, h_ctxt, tmp1) + mte_switch_to_guest x29, x1, x2 + // Macro ptrauth_switch_to_guest format: // ptrauth_switch_to_guest(guest cxt, tmp1, tmp2, tmp3) // The below macro to restore guest keys is not implemented in C code @@ -140,6 +144,9 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL) // when this 
feature is enabled for kernel code. ptrauth_switch_to_hyp x1, x2, x3, x4, x5 + // mte_switch_to_hyp(g_ctxt, h_ctxt, reg1) + mte_switch_to_hyp x1, x2, x3 + // Restore hyp's sp_el0 restore_sp_el0 x2, x3 diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h index cce43bfe158f..de7e14c862e6 100644 --- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h +++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h @@ -14,6 +14,7 @@ #include <asm/kvm_asm.h> #include <asm/kvm_emulate.h> #include <asm/kvm_hyp.h> +#include <asm/kvm_mmu.h> static inline void __sysreg_save_common_state(struct kvm_cpu_context *ctxt) { @@ -26,6 +27,16 @@ static inline void __sysreg_save_user_state(struct kvm_cpu_context *ctxt) ctxt_sys_reg(ctxt, TPIDRRO_EL0) = read_sysreg(tpidrro_el0); } +static inline bool ctxt_has_mte(struct kvm_cpu_context *ctxt) +{ + struct kvm_vcpu *vcpu = ctxt->__hyp_running_vcpu; + + if (!vcpu) + vcpu = container_of(ctxt, struct kvm_vcpu, arch.ctxt); + + return kvm_has_mte(kern_hyp_va(vcpu->kvm)); +} + static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt) { ctxt_sys_reg(ctxt, CSSELR_EL1) = read_sysreg(csselr_el1); @@ -46,6 +57,11 @@ static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt) ctxt_sys_reg(ctxt, PAR_EL1) = read_sysreg_par(); ctxt_sys_reg(ctxt, TPIDR_EL1) = read_sysreg(tpidr_el1); + if (ctxt_has_mte(ctxt)) { + ctxt_sys_reg(ctxt, TFSR_EL1) = read_sysreg_el1(SYS_TFSR); + ctxt_sys_reg(ctxt, TFSRE0_EL1) = read_sysreg_s(SYS_TFSRE0_EL1); + } + ctxt_sys_reg(ctxt, SP_EL1) = read_sysreg(sp_el1); ctxt_sys_reg(ctxt, ELR_EL1) = read_sysreg_el1(SYS_ELR); ctxt_sys_reg(ctxt, SPSR_EL1) = read_sysreg_el1(SYS_SPSR); @@ -107,6 +123,11 @@ static inline void __sysreg_restore_el1_state(struct kvm_cpu_context *ctxt) write_sysreg(ctxt_sys_reg(ctxt, PAR_EL1), par_el1); write_sysreg(ctxt_sys_reg(ctxt, TPIDR_EL1), tpidr_el1); + if (ctxt_has_mte(ctxt)) { + write_sysreg_el1(ctxt_sys_reg(ctxt, TFSR_EL1), SYS_TFSR); + 
write_sysreg_s(ctxt_sys_reg(ctxt, TFSRE0_EL1), SYS_TFSRE0_EL1); + } + if (!has_vhe() && cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT) && ctxt->__hyp_running_vcpu) { diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 18c87500a7a8..377ae6efb0ef 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -1303,6 +1303,20 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p, return true; } +static unsigned int mte_visibility(const struct kvm_vcpu *vcpu, + const struct sys_reg_desc *rd) +{ + return REG_HIDDEN; +} + +#define MTE_REG(name) { \ + SYS_DESC(SYS_##name), \ + .access = undef_access, \ + .reset = reset_unknown, \ + .reg = name, \ + .visibility = mte_visibility, \ +} + /* sys_reg_desc initialiser for known cpufeature ID registers */ #define ID_SANITISED(name) { \ SYS_DESC(SYS_##name), \ @@ -1471,8 +1485,8 @@ static const struct sys_reg_desc sys_reg_descs[] = { { SYS_DESC(SYS_ACTLR_EL1), access_actlr, reset_actlr, ACTLR_EL1 }, { SYS_DESC(SYS_CPACR_EL1), NULL, reset_val, CPACR_EL1, 0 }, - { SYS_DESC(SYS_RGSR_EL1), undef_access }, - { SYS_DESC(SYS_GCR_EL1), undef_access }, + MTE_REG(RGSR_EL1), + MTE_REG(GCR_EL1), { SYS_DESC(SYS_ZCR_EL1), NULL, reset_val, ZCR_EL1, 0, .visibility = sve_visibility }, { SYS_DESC(SYS_TTBR0_EL1), access_vm_reg, reset_unknown, TTBR0_EL1 }, @@ -1498,8 +1512,8 @@ static const struct sys_reg_desc sys_reg_descs[] = { { SYS_DESC(SYS_ERXMISC0_EL1), trap_raz_wi }, { SYS_DESC(SYS_ERXMISC1_EL1), trap_raz_wi }, - { SYS_DESC(SYS_TFSR_EL1), undef_access }, - { SYS_DESC(SYS_TFSRE0_EL1), undef_access }, + MTE_REG(TFSR_EL1), + MTE_REG(TFSRE0_EL1), { SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 }, { SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 }, -- 2.20.1 ^ permalink raw reply related [flat|nested] 112+ messages in thread
write_sysreg_s(ctxt_sys_reg(ctxt, TFSRE0_EL1), SYS_TFSRE0_EL1); + } + if (!has_vhe() && cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT) && ctxt->__hyp_running_vcpu) { diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 18c87500a7a8..377ae6efb0ef 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -1303,6 +1303,20 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p, return true; } +static unsigned int mte_visibility(const struct kvm_vcpu *vcpu, + const struct sys_reg_desc *rd) +{ + return REG_HIDDEN; +} + +#define MTE_REG(name) { \ + SYS_DESC(SYS_##name), \ + .access = undef_access, \ + .reset = reset_unknown, \ + .reg = name, \ + .visibility = mte_visibility, \ +} + /* sys_reg_desc initialiser for known cpufeature ID registers */ #define ID_SANITISED(name) { \ SYS_DESC(SYS_##name), \ @@ -1471,8 +1485,8 @@ static const struct sys_reg_desc sys_reg_descs[] = { { SYS_DESC(SYS_ACTLR_EL1), access_actlr, reset_actlr, ACTLR_EL1 }, { SYS_DESC(SYS_CPACR_EL1), NULL, reset_val, CPACR_EL1, 0 }, - { SYS_DESC(SYS_RGSR_EL1), undef_access }, - { SYS_DESC(SYS_GCR_EL1), undef_access }, + MTE_REG(RGSR_EL1), + MTE_REG(GCR_EL1), { SYS_DESC(SYS_ZCR_EL1), NULL, reset_val, ZCR_EL1, 0, .visibility = sve_visibility }, { SYS_DESC(SYS_TTBR0_EL1), access_vm_reg, reset_unknown, TTBR0_EL1 }, @@ -1498,8 +1512,8 @@ static const struct sys_reg_desc sys_reg_descs[] = { { SYS_DESC(SYS_ERXMISC0_EL1), trap_raz_wi }, { SYS_DESC(SYS_ERXMISC1_EL1), trap_raz_wi }, - { SYS_DESC(SYS_TFSR_EL1), undef_access }, - { SYS_DESC(SYS_TFSRE0_EL1), undef_access }, + MTE_REG(TFSR_EL1), + MTE_REG(TFSRE0_EL1), { SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 }, { SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 }, -- 2.20.1 ^ permalink raw reply related [flat|nested] 112+ messages in thread
* [PATCH v10 4/6] arm64: kvm: Expose KVM_ARM_CAP_MTE 2021-03-12 15:18 ` Steven Price (?) (?) @ 2021-03-12 15:19 ` Steven Price -1 siblings, 0 replies; 112+ messages in thread From: Steven Price @ 2021-03-12 15:19 UTC (permalink / raw) To: Catalin Marinas, Marc Zyngier, Will Deacon Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela, Dr. David Alan Gilbert, Richard Henderson, Peter Maydell, Haibo Xu, Andrew Jones It's now safe for the VMM to enable MTE in a guest, so expose the capability to user space. Signed-off-by: Steven Price <steven.price@arm.com> --- arch/arm64/kvm/arm.c | 9 +++++++++ arch/arm64/kvm/sys_regs.c | 3 +++ 2 files changed, 12 insertions(+) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index fc4c95dd2d26..46bf319f6cb7 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, r = 0; kvm->arch.return_nisv_io_abort_to_user = true; break; + case KVM_CAP_ARM_MTE: + if (!system_supports_mte() || kvm->created_vcpus) + return -EINVAL; + r = 0; + kvm->arch.mte_enabled = true; + break; default: r = -EINVAL; break; @@ -234,6 +240,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) */ r = 1; break; + case KVM_CAP_ARM_MTE: + r = system_supports_mte(); + break; case KVM_CAP_STEAL_TIME: r = kvm_arm_pvtime_supported(); break; diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 377ae6efb0ef..46937bfaac8a 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -1306,6 +1306,9 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p, static unsigned int mte_visibility(const struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd) { + if (kvm_has_mte(vcpu->kvm)) + return 0; + return REG_HIDDEN; } -- 2.20.1 ^ permalink raw reply related [flat|nested] 112+ messages in thread
* [PATCH v10 5/6] KVM: arm64: ioctl to fetch/store tags in a guest 2021-03-12 15:18 ` Steven Price (?) (?) @ 2021-03-12 15:19 ` Steven Price -1 siblings, 0 replies; 112+ messages in thread From: Steven Price @ 2021-03-12 15:19 UTC (permalink / raw) To: Catalin Marinas, Marc Zyngier, Will Deacon Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela, Dr. David Alan Gilbert, Richard Henderson, Peter Maydell, Haibo Xu, Andrew Jones The VMM may not wish to have its own mapping of guest memory mapped with PROT_MTE because this causes problems if the VMM has tag checking enabled (the guest controls the tags in physical RAM and it's unlikely the tags are correct for the VMM). Instead add a new ioctl which allows the VMM to easily read/write the tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE while the VMM can still read/write the tags for the purpose of migration.
Signed-off-by: Steven Price <steven.price@arm.com> --- arch/arm64/include/uapi/asm/kvm.h | 14 +++++++ arch/arm64/kvm/arm.c | 69 +++++++++++++++++++++++++++++++ include/uapi/linux/kvm.h | 1 + 3 files changed, 84 insertions(+) diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index 24223adae150..2b85a047c37d 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -184,6 +184,20 @@ struct kvm_vcpu_events { __u32 reserved[12]; }; +struct kvm_arm_copy_mte_tags { + __u64 guest_ipa; + __u64 length; + union { + void __user *addr; + __u64 padding; + }; + __u64 flags; + __u64 reserved[2]; +}; + +#define KVM_ARM_TAGS_TO_GUEST 0 +#define KVM_ARM_TAGS_FROM_GUEST 1 + /* If you need to interpret the index values, here is the key: */ #define KVM_REG_ARM_COPROC_MASK 0x000000000FFF0000 #define KVM_REG_ARM_COPROC_SHIFT 16 diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 46bf319f6cb7..9a6b26d37236 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -1297,6 +1297,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm, } } +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm, + struct kvm_arm_copy_mte_tags *copy_tags) +{ + gpa_t guest_ipa = copy_tags->guest_ipa; + size_t length = copy_tags->length; + void __user *tags = copy_tags->addr; + gpa_t gfn; + bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST); + int ret = 0; + + if (copy_tags->reserved[0] || copy_tags->reserved[1]) + return -EINVAL; + + if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST) + return -EINVAL; + + if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK) + return -EINVAL; + + gfn = gpa_to_gfn(guest_ipa); + + mutex_lock(&kvm->slots_lock); + + while (length > 0) { + kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL); + void *maddr; + unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE; + + if (is_error_noslot_pfn(pfn)) { + ret = -EFAULT; + goto out; + } + + maddr = page_address(pfn_to_page(pfn)); + + if (!write) { 
+ num_tags = mte_copy_tags_to_user(tags, maddr, num_tags); + kvm_release_pfn_clean(pfn); + } else { + num_tags = mte_copy_tags_from_user(maddr, tags, + num_tags); + kvm_release_pfn_dirty(pfn); + } + + if (num_tags != PAGE_SIZE / MTE_GRANULE_SIZE) { + ret = -EFAULT; + goto out; + } + + gfn++; + tags += num_tags; + length -= PAGE_SIZE; + } + +out: + mutex_unlock(&kvm->slots_lock); + return ret; +} + long kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -1333,6 +1392,16 @@ long kvm_arch_vm_ioctl(struct file *filp, return 0; } + case KVM_ARM_MTE_COPY_TAGS: { + struct kvm_arm_copy_mte_tags copy_tags; + + if (!kvm_has_mte(kvm)) + return -EINVAL; + + if (copy_from_user(©_tags, argp, sizeof(copy_tags))) + return -EFAULT; + return kvm_vm_ioctl_mte_copy_tags(kvm, ©_tags); + } default: return -EINVAL; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 6dc16c09a2d1..470c122f4c2d 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1424,6 +1424,7 @@ struct kvm_s390_ucas_mapping { /* Available with KVM_CAP_PMU_EVENT_FILTER */ #define KVM_SET_PMU_EVENT_FILTER _IOW(KVMIO, 0xb2, struct kvm_pmu_event_filter) #define KVM_PPC_SVM_OFF _IO(KVMIO, 0xb3) +#define KVM_ARM_MTE_COPY_TAGS _IOR(KVMIO, 0xb4, struct kvm_arm_copy_mte_tags) /* ioctl for vm fd */ #define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device) -- 2.20.1 ^ permalink raw reply related [flat|nested] 112+ messages in thread
* [PATCH v10 6/6] KVM: arm64: Document MTE capability and ioctl
  2021-03-12 15:18 ` Steven Price
@ 2021-03-12 15:19 ` Steven Price
  0 siblings, 0 replies; 112+ messages in thread
From: Steven Price @ 2021-03-12 15:19 UTC (permalink / raw)
To: Catalin Marinas, Marc Zyngier, Will Deacon
Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose, kvmarm,
    linux-arm-kernel, linux-kernel, Dave Martin, Mark Rutland,
    Thomas Gleixner, qemu-devel, Juan Quintela, Dr. David Alan Gilbert,
    Richard Henderson, Peter Maydell, Haibo Xu, Andrew Jones

A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
granting a guest access to the tags, and provides a mechanism for the VMM
to enable it.

A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
access the tags of a guest without having to maintain a PROT_MTE mapping
in userspace. The above capability gates access to the ioctl.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 Documentation/virt/kvm/api.rst | 53 ++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 1a2b5210cdbf..ccc84f21ba5e 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4938,6 +4938,40 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used with the
 KVM_XEN_VCPU_GET_ATTR ioctl.
 
+4.131 KVM_ARM_MTE_COPY_TAGS
+---------------------------
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: 0 on success, < 0 on error
+
+::
+
+  struct kvm_arm_copy_mte_tags {
+	__u64 guest_ipa;
+	__u64 length;
+	union {
+		void __user *addr;
+		__u64 padding;
+	};
+	__u64 flags;
+	__u64 reserved[2];
+  };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+field must point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / MTE_GRANULE_SIZE)``
+bytes (i.e. 1/16th of the corresponding size). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
 5. The kvm_run structure
 ========================
 
@@ -6227,6 +6261,25 @@ KVM_RUN_BUS_LOCK flag is used to distinguish between them.
 This capability can be used to check / enable 2nd DAWR feature provided
 by POWER10 processor.
 
+7.23 KVM_CAP_ARM_MTE
+--------------------
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before the guest will be granted access.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` so
+that the tags are maintained during swap or hibernation of the host; however
+the VMM needs to manually save/restore the tags as appropriate if the VM is
+migrated.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
 8. Other capabilities.
 ======================
-- 
2.20.1
end of thread, other threads:[~2021-04-08 18:22 UTC | newest]

Thread overview: 112+ messages -- links below jump to the message on this page --
2021-03-12 15:18 [PATCH v10 0/6] MTE support for KVM guest Steven Price
2021-03-12 15:18 ` [PATCH v10 1/6] arm64: mte: Sync tags for pages where PTE is untagged Steven Price
2021-03-26 18:56 ` Catalin Marinas
2021-03-29 15:55 ` Steven Price
2021-03-30 10:13 ` Catalin Marinas
2021-03-31 10:09 ` Steven Price
2021-03-12 15:18 ` [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature Steven Price
2021-03-27 15:23 ` Catalin Marinas
2021-03-28 12:21 ` Catalin Marinas
2021-03-29 16:06 ` Steven Price
2021-03-30 10:30 ` Catalin Marinas
2021-03-31  7:34 ` David Hildenbrand
2021-03-31  9:21 ` Catalin Marinas
2021-03-31  9:32 ` David Hildenbrand
2021-03-31 10:41 ` Steven Price
2021-03-31 14:14 ` David Hildenbrand
2021-03-31 18:43 ` Catalin Marinas
2021-04-07 10:20 ` Steven Price
2021-04-07 15:14 ` Catalin Marinas
2021-04-07 15:30 ` David Hildenbrand
2021-04-07 15:52 ` Steven Price
2021-04-08 14:18 ` Catalin Marinas
2021-04-08 18:16 ` David Hildenbrand
2021-04-08 18:21 ` Catalin Marinas
2021-03-12 15:18 ` [PATCH v10 3/6] arm64: kvm: Save/restore MTE registers Steven Price
2021-03-12 15:19 ` [PATCH v10 4/6] arm64: kvm: Expose KVM_ARM_CAP_MTE Steven Price
2021-03-12 15:19 ` [PATCH v10 5/6] KVM: arm64: ioctl to fetch/store tags in a guest Steven Price
2021-03-12 15:19 ` [PATCH v10 6/6] KVM: arm64: Document MTE capability and ioctl Steven Price