linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v12 0/8] MTE support for KVM guest
@ 2021-05-17 12:32 Steven Price
  2021-05-17 12:32 ` [PATCH v12 1/8] arm64: mte: Handle race when synchronising tags Steven Price
                   ` (7 more replies)
  0 siblings, 8 replies; 49+ messages in thread
From: Steven Price @ 2021-05-17 12:32 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

This series adds support for using the Arm Memory Tagging Extensions
(MTE) in a KVM guest.

Changes since v11[1]:

 * Series is prefixed with a bug fix for a potential race synchronising
   tags. This is basically race as was recently[2] fixed for
   PG_dcache_clean where the update of the page flag cannot be done
   atomically with the work that flag represents.

   For the PG_dcache_clean case the problem is easier because extra
   cache maintenance isn't a problem, but here restoring the tags twice
   could cause data loss.

   The current solution is a global spinlock for mte_sync_page_tags().
   If we hit scalability problems that other solutions such as
   potentially using another page flag as a lock will need to be
   investigated.

 * The second patch is from Catalin to mitigate the performance impact
   of the first - by handling the page zeroing case explicitly we can
   avoid entering mte_sync_page_tags() at all in most cases. Peter
   Collingbourne has a patch which similarly improves this case using
   the DC GZVA instruction. So this patch may be dropped in favour of
   Peter's, however Catalin's is likely easier to backport.

 * Use pte_access_permitted() in set_pte_at() to identify pages which
   may be accessed by the user rather than open-coding a check for
   PTE_USER. Also add a comment documenting what's going on.
   There's also some short-cuts added in mte_sync_tags() compared to the
   previous post, to again mitigate the performance impact of the first
   patch.

 * Move the code to sanitise tags out of user_mem_abort() into its own
   function. Also call this new function from kvm_set_spte_gfn() as that
   path was missing the sanitising.

   Originally I was going to move the code all the way down to
   kvm_pgtable_stage2_map(). Sadly as that also part of the EL2
   hypervisor this breaks nVHE as the code needs to perform actions in
   the host.

 * Drop the union in struct kvm_vcpu_events - it served no purpose and
   was confusing.

 * Update CAP number (again) and other minor conflict resolutions.

[1] https://lore.kernel.org/r/20210416154309.22129-1-steven.price@arm.com/
[2] https://lore.kernel.org/r/20210514095001.13236-1-catalin.marinas@arm.com/
[3] https://lore.kernel.org/r/de812a02fd94a0dba07d43606bd893c564aa4528.1620849613.git.pcc@google.com/

Catalin Marinas (1):
  arm64: Handle MTE tags zeroing in __alloc_zeroed_user_highpage()

Steven Price (7):
  arm64: mte: Handle race when synchronising tags
  arm64: mte: Sync tags for pages where PTE is untagged
  arm64: kvm: Introduce MTE VM feature
  arm64: kvm: Save/restore MTE registers
  arm64: kvm: Expose KVM_ARM_CAP_MTE
  KVM: arm64: ioctl to fetch/store tags in a guest
  KVM: arm64: Document MTE capability and ioctl

 Documentation/virt/kvm/api.rst             | 53 +++++++++++++++
 arch/arm64/include/asm/kvm_emulate.h       |  3 +
 arch/arm64/include/asm/kvm_host.h          |  9 +++
 arch/arm64/include/asm/kvm_mte.h           | 66 ++++++++++++++++++
 arch/arm64/include/asm/page.h              |  6 +-
 arch/arm64/include/asm/pgtable.h           |  9 ++-
 arch/arm64/include/asm/sysreg.h            |  3 +-
 arch/arm64/include/uapi/asm/kvm.h          | 11 +++
 arch/arm64/kernel/asm-offsets.c            |  3 +
 arch/arm64/kernel/mte.c                    | 37 ++++++++--
 arch/arm64/kvm/arm.c                       | 78 ++++++++++++++++++++++
 arch/arm64/kvm/hyp/entry.S                 |  7 ++
 arch/arm64/kvm/hyp/exception.c             |  3 +-
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 ++++++
 arch/arm64/kvm/mmu.c                       | 37 +++++++++-
 arch/arm64/kvm/sys_regs.c                  | 28 ++++++--
 arch/arm64/mm/fault.c                      | 21 ++++++
 include/uapi/linux/kvm.h                   |  2 +
 18 files changed, 381 insertions(+), 16 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

-- 
2.20.1


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v12 1/8] arm64: mte: Handle race when synchronising tags
  2021-05-17 12:32 [PATCH v12 0/8] MTE support for KVM guest Steven Price
@ 2021-05-17 12:32 ` Steven Price
  2021-05-17 14:03   ` Marc Zyngier
  2021-05-19 17:32   ` Catalin Marinas
  2021-05-17 12:32 ` [PATCH v12 2/8] arm64: Handle MTE tags zeroing in __alloc_zeroed_user_highpage() Steven Price
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Steven Price @ 2021-05-17 12:32 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
before restoring/zeroing the MTE tags. However if another thread were to
race and attempt to sync the tags on the same page before the first
thread had completed restoring/zeroing then it would see the flag is
already set and continue without waiting. This would potentially expose
the previous contents of the tags to user space, and cause any updates
that user space makes before the restoring/zeroing has completed to
potentially be lost.

Since this code is run from atomic contexts we can't just lock the page
during the process. Instead implement a new (global) spinlock to protect
the mte_sync_page_tags() function.

Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in user-space with PROT_MTE")
Signed-off-by: Steven Price <steven.price@arm.com>
---
---
 arch/arm64/kernel/mte.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 125a10e413e9..c88e778c2fa9 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -25,6 +25,7 @@
 u64 gcr_kernel_excl __ro_after_init;
 
 static bool report_fault_once = true;
+static spinlock_t tag_sync_lock;
 
 #ifdef CONFIG_KASAN_HW_TAGS
 /* Whether the MTE asynchronous mode is enabled. */
@@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
 
 static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
 {
+	unsigned long flags;
 	pte_t old_pte = READ_ONCE(*ptep);
 
+	spin_lock_irqsave(&tag_sync_lock, flags);
+
+	/* Recheck with the lock held */
+	if (test_bit(PG_mte_tagged, &page->flags))
+		goto out;
+
 	if (check_swap && is_swap_pte(old_pte)) {
 		swp_entry_t entry = pte_to_swp_entry(old_pte);
 
-		if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
-			return;
+		if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+			set_bit(PG_mte_tagged, &page->flags);
+			goto out;
+		}
 	}
 
 	page_kasan_tag_reset(page);
@@ -53,6 +63,10 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
 	 */
 	smp_wmb();
 	mte_clear_page_tags(page_address(page));
+	set_bit(PG_mte_tagged, &page->flags);
+
+out:
+	spin_unlock_irqrestore(&tag_sync_lock, flags);
 }
 
 void mte_sync_tags(pte_t *ptep, pte_t pte)
@@ -60,10 +74,11 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
 	struct page *page = pte_page(pte);
 	long i, nr_pages = compound_nr(page);
 	bool check_swap = nr_pages == 1;
+	bool pte_is_tagged = pte_tagged(pte);
 
 	/* if PG_mte_tagged is set, tags have already been initialised */
 	for (i = 0; i < nr_pages; i++, page++) {
-		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+		if (!test_bit(PG_mte_tagged, &page->flags))
 			mte_sync_page_tags(page, ptep, check_swap);
 	}
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v12 2/8] arm64: Handle MTE tags zeroing in __alloc_zeroed_user_highpage()
  2021-05-17 12:32 [PATCH v12 0/8] MTE support for KVM guest Steven Price
  2021-05-17 12:32 ` [PATCH v12 1/8] arm64: mte: Handle race when synchronising tags Steven Price
@ 2021-05-17 12:32 ` Steven Price
  2021-05-17 12:32 ` [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged Steven Price
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 49+ messages in thread
From: Steven Price @ 2021-05-17 12:32 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

From: Catalin Marinas <catalin.marinas@arm.com>

Currently, on an anonymous page fault, the kernel allocates a zeroed
page and maps it in user space. If the mapping is tagged (PROT_MTE),
set_pte_at() additionally clears the tags under a spinlock to avoid a
race on the page->flags. In order to optimise the lock, clear the page
tags on allocation in __alloc_zeroed_user_highpage() if the vma flags
have VM_MTE set.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/page.h |  6 ++++--
 arch/arm64/mm/fault.c         | 21 +++++++++++++++++++++
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..97853570d0f1 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -13,6 +13,7 @@
 #ifndef __ASSEMBLY__
 
 #include <linux/personality.h> /* for READ_IMPLIES_EXEC */
+#include <linux/types.h>
 #include <asm/pgtable-types.h>
 
 struct page;
@@ -28,8 +29,9 @@ void copy_user_highpage(struct page *to, struct page *from,
 void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE
 
-#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
-	alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
+struct page *__alloc_zeroed_user_highpage(gfp_t movableflags,
+					  struct vm_area_struct *vma,
+					  unsigned long vaddr);
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 871c82ab0a30..5a03428e97f3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -921,3 +921,24 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned int esr,
 	debug_exception_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug_exception);
+
+/*
+ * Used during anonymous page fault handling.
+ */
+struct page *__alloc_zeroed_user_highpage(gfp_t movableflags,
+					  struct vm_area_struct *vma,
+					  unsigned long vaddr)
+{
+	struct page *page;
+	bool tagged = system_supports_mte() && (vma->vm_flags & VM_MTE);
+
+	page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma,
+			      vaddr);
+	if (tagged && page) {
+		mte_clear_page_tags(page_address(page));
+		page_kasan_tag_reset(page);
+		set_bit(PG_mte_tagged, &page->flags);
+	}
+
+	return page;
+}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged
  2021-05-17 12:32 [PATCH v12 0/8] MTE support for KVM guest Steven Price
  2021-05-17 12:32 ` [PATCH v12 1/8] arm64: mte: Handle race when synchronising tags Steven Price
  2021-05-17 12:32 ` [PATCH v12 2/8] arm64: Handle MTE tags zeroing in __alloc_zeroed_user_highpage() Steven Price
@ 2021-05-17 12:32 ` Steven Price
  2021-05-17 16:14   ` Marc Zyngier
  2021-05-19 18:06   ` Catalin Marinas
  2021-05-17 12:32 ` [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature Steven Price
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Steven Price @ 2021-05-17 12:32 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we will
need to check to see if there are any saved tags even if !pte_tagged().

However don't check pages for which pte_access_permitted() returns false
as these will not have been swapped out.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/pgtable.h |  9 +++++++--
 arch/arm64/kernel/mte.c          | 16 ++++++++++++++--
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b10204e72fc..275178a810c1 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -314,8 +314,13 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
 		__sync_icache_dcache(pte);
 
-	if (system_supports_mte() &&
-	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
+	/*
+	 * If the PTE would provide user space access to the tags associated
+	 * with it then ensure that the MTE tags are synchronised.  Exec-only
+	 * mappings don't expose tags (instruction fetches don't check tags).
+	 */
+	if (system_supports_mte() && pte_present(pte) &&
+	    pte_access_permitted(pte, false) && !pte_special(pte))
 		mte_sync_tags(ptep, pte);
 
 	__check_racy_pte_update(mm, ptep, pte);
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index c88e778c2fa9..a604818c52c1 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -33,11 +33,15 @@ DEFINE_STATIC_KEY_FALSE(mte_async_mode);
 EXPORT_SYMBOL_GPL(mte_async_mode);
 #endif
 
-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
+static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
+			       bool pte_is_tagged)
 {
 	unsigned long flags;
 	pte_t old_pte = READ_ONCE(*ptep);
 
+	if (!is_swap_pte(old_pte) && !pte_is_tagged)
+		return;
+
 	spin_lock_irqsave(&tag_sync_lock, flags);
 
 	/* Recheck with the lock held */
@@ -53,6 +57,9 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
 		}
 	}
 
+	if (!pte_is_tagged)
+		goto out;
+
 	page_kasan_tag_reset(page);
 	/*
 	 * We need smp_wmb() in between setting the flags and clearing the
@@ -76,10 +83,15 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
 	bool check_swap = nr_pages == 1;
 	bool pte_is_tagged = pte_tagged(pte);
 
+	/* Early out if there's nothing to do */
+	if (!check_swap && !pte_is_tagged)
+		return;
+
 	/* if PG_mte_tagged is set, tags have already been initialised */
 	for (i = 0; i < nr_pages; i++, page++) {
 		if (!test_bit(PG_mte_tagged, &page->flags))
-			mte_sync_page_tags(page, ptep, check_swap);
+			mte_sync_page_tags(page, ptep, check_swap,
+					   pte_is_tagged);
 	}
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature
  2021-05-17 12:32 [PATCH v12 0/8] MTE support for KVM guest Steven Price
                   ` (2 preceding siblings ...)
  2021-05-17 12:32 ` [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged Steven Price
@ 2021-05-17 12:32 ` Steven Price
  2021-05-17 16:45   ` Marc Zyngier
  2021-05-20 11:54   ` Catalin Marinas
  2021-05-17 12:32 ` [PATCH v12 5/8] arm64: kvm: Save/restore MTE registers Steven Price
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Steven Price @ 2021-05-17 12:32 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/kvm_emulate.h |  3 +++
 arch/arm64/include/asm/kvm_host.h    |  3 +++
 arch/arm64/kvm/hyp/exception.c       |  3 ++-
 arch/arm64/kvm/mmu.c                 | 37 +++++++++++++++++++++++++++-
 arch/arm64/kvm/sys_regs.c            |  3 +++
 include/uapi/linux/kvm.h             |  1 +
 6 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index f612c090f2e4..6bf776c2399c 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
 	if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
 	    vcpu_el1_is_32bit(vcpu))
 		vcpu->arch.hcr_el2 |= HCR_TID2;
+
+	if (kvm_has_mte(vcpu->kvm))
+		vcpu->arch.hcr_el2 |= HCR_ATA;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 7cd7d5c8c4bc..afaa5333f0e4 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@ struct kvm_arch {
 
 	u8 pfr0_csv2;
 	u8 pfr0_csv3;
+	/* Memory Tagging Extension enabled for the guest */
+	bool mte_enabled;
 };
 
 struct kvm_vcpu_fault_info {
@@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
 #define kvm_arm_vcpu_sve_finalized(vcpu) \
 	((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
 
+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
 #define kvm_vcpu_has_pmu(vcpu)					\
 	(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
 
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 73629094f903..56426565600c 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
 	new |= (old & PSR_C_BIT);
 	new |= (old & PSR_V_BIT);
 
-	// TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+	if (kvm_has_mte(vcpu->kvm))
+		new |= PSR_TCO_BIT;
 
 	new |= (old & PSR_DIT_BIT);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c5d1f3c87dbd..8660f6a03f51 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -822,6 +822,31 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
 	return PAGE_SIZE;
 }
 
+static int sanitise_mte_tags(struct kvm *kvm, unsigned long size,
+			     kvm_pfn_t pfn)
+{
+	if (kvm_has_mte(kvm)) {
+		/*
+		 * The page will be mapped in stage 2 as Normal Cacheable, so
+		 * the VM will be able to see the page's tags and therefore
+		 * they must be initialised first. If PG_mte_tagged is set,
+		 * tags have already been initialised.
+		 */
+		unsigned long i, nr_pages = size >> PAGE_SHIFT;
+		struct page *page = pfn_to_online_page(pfn);
+
+		if (!page)
+			return -EFAULT;
+
+		for (i = 0; i < nr_pages; i++, page++) {
+			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+				mte_clear_page_tags(page_address(page));
+		}
+	}
+
+	return 0;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  unsigned long fault_status)
@@ -971,8 +996,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
-	if (fault_status != FSC_PERM && !device)
+	if (fault_status != FSC_PERM && !device) {
+		ret = sanitise_mte_tags(kvm, vma_pagesize, pfn);
+		if (ret)
+			goto out_unlock;
+
 		clean_dcache_guest_page(pfn, vma_pagesize);
+	}
 
 	if (exec_fault) {
 		prot |= KVM_PGTABLE_PROT_X;
@@ -1168,12 +1198,17 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	kvm_pfn_t pfn = pte_pfn(range->pte);
+	int ret;
 
 	if (!kvm->arch.mmu.pgt)
 		return 0;
 
 	WARN_ON(range->end - range->start != 1);
 
+	ret = sanitise_mte_tags(kvm, PAGE_SIZE, pfn);
+	if (ret)
+		return ret;
+
 	/*
 	 * We've moved a page around, probably through CoW, so let's treat it
 	 * just like a translation fault and clean the cache to the PoC.
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 76ea2800c33e..24a844cb79ca 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1047,6 +1047,9 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
 		break;
 	case SYS_ID_AA64PFR1_EL1:
 		val &= ~FEATURE(ID_AA64PFR1_MTE);
+		if (kvm_has_mte(vcpu->kvm))
+			val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE),
+					  ID_AA64PFR1_MTE);
 		break;
 	case SYS_ID_AA64ISAR1_EL1:
 		if (!vcpu_has_ptrauth(vcpu))
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3fd9a7e9d90c..8c95ba0fadda 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1082,6 +1082,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_SGX_ATTRIBUTE 196
 #define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
 #define KVM_CAP_PTP_KVM 198
+#define KVM_CAP_ARM_MTE 199
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v12 5/8] arm64: kvm: Save/restore MTE registers
  2021-05-17 12:32 [PATCH v12 0/8] MTE support for KVM guest Steven Price
                   ` (3 preceding siblings ...)
  2021-05-17 12:32 ` [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature Steven Price
@ 2021-05-17 12:32 ` Steven Price
  2021-05-17 17:17   ` Marc Zyngier
  2021-05-17 12:32 ` [PATCH v12 6/8] arm64: kvm: Expose KVM_ARM_CAP_MTE Steven Price
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-17 12:32 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

Define the new system registers that MTE introduces and context switch
them. The MTE feature is still hidden from the ID register as it isn't
supported in a VM yet.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/kvm_host.h          |  6 ++
 arch/arm64/include/asm/kvm_mte.h           | 66 ++++++++++++++++++++++
 arch/arm64/include/asm/sysreg.h            |  3 +-
 arch/arm64/kernel/asm-offsets.c            |  3 +
 arch/arm64/kvm/hyp/entry.S                 |  7 +++
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++++++
 arch/arm64/kvm/sys_regs.c                  | 22 ++++++--
 7 files changed, 123 insertions(+), 5 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index afaa5333f0e4..309e36cc1b42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -208,6 +208,12 @@ enum vcpu_sysreg {
 	CNTP_CVAL_EL0,
 	CNTP_CTL_EL0,
 
+	/* Memory Tagging Extension registers */
+	RGSR_EL1,	/* Random Allocation Tag Seed Register */
+	GCR_EL1,	/* Tag Control Register */
+	TFSR_EL1,	/* Tag Fault Status Register (EL1) */
+	TFSRE0_EL1,	/* Tag Fault Status Register (EL0) */
+
 	/* 32bit specific registers. Keep them at the end of the range */
 	DACR32_EL2,	/* Domain Access Control Register */
 	IFSR32_EL2,	/* Instruction Fault Status Register */
diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
new file mode 100644
index 000000000000..6541c7d6ce06
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_mte.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020 ARM Ltd.
+ */
+#ifndef __ASM_KVM_MTE_H
+#define __ASM_KVM_MTE_H
+
+#ifdef __ASSEMBLY__
+
+#include <asm/sysreg.h>
+
+#ifdef CONFIG_ARM64_MTE
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+	b	.L__skip_switch\@
+alternative_else_nop_endif
+	mrs	\reg1, hcr_el2
+	and	\reg1, \reg1, #(HCR_ATA)
+	cbz	\reg1, .L__skip_switch\@
+
+	mrs_s	\reg1, SYS_RGSR_EL1
+	str	\reg1, [\h_ctxt, #CPU_RGSR_EL1]
+	mrs_s	\reg1, SYS_GCR_EL1
+	str	\reg1, [\h_ctxt, #CPU_GCR_EL1]
+
+	ldr	\reg1, [\g_ctxt, #CPU_RGSR_EL1]
+	msr_s	SYS_RGSR_EL1, \reg1
+	ldr	\reg1, [\g_ctxt, #CPU_GCR_EL1]
+	msr_s	SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+	b	.L__skip_switch\@
+alternative_else_nop_endif
+	mrs	\reg1, hcr_el2
+	and	\reg1, \reg1, #(HCR_ATA)
+	cbz	\reg1, .L__skip_switch\@
+
+	mrs_s	\reg1, SYS_RGSR_EL1
+	str	\reg1, [\g_ctxt, #CPU_RGSR_EL1]
+	mrs_s	\reg1, SYS_GCR_EL1
+	str	\reg1, [\g_ctxt, #CPU_GCR_EL1]
+
+	ldr	\reg1, [\h_ctxt, #CPU_RGSR_EL1]
+	msr_s	SYS_RGSR_EL1, \reg1
+	ldr	\reg1, [\h_ctxt, #CPU_GCR_EL1]
+	msr_s	SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+#else /* CONFIG_ARM64_MTE */
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+.endm
+
+#endif /* CONFIG_ARM64_MTE */
+#endif /* __ASSEMBLY__ */
+#endif /* __ASM_KVM_MTE_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 65d15700a168..347ccac2341e 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -651,7 +651,8 @@
 
 #define INIT_SCTLR_EL2_MMU_ON						\
 	(SCTLR_ELx_M  | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I |	\
-	 SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | SCTLR_EL2_RES1)
+	 SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 |		\
+	 SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)
 
 #define INIT_SCTLR_EL2_MMU_OFF \
 	(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 0cb34ccb6e73..6b489a8462f0 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -111,6 +111,9 @@ int main(void)
   DEFINE(VCPU_WORKAROUND_FLAGS,	offsetof(struct kvm_vcpu, arch.workaround_flags));
   DEFINE(VCPU_HCR_EL2,		offsetof(struct kvm_vcpu, arch.hcr_el2));
   DEFINE(CPU_USER_PT_REGS,	offsetof(struct kvm_cpu_context, regs));
+  DEFINE(CPU_RGSR_EL1,		offsetof(struct kvm_cpu_context, sys_regs[RGSR_EL1]));
+  DEFINE(CPU_GCR_EL1,		offsetof(struct kvm_cpu_context, sys_regs[GCR_EL1]));
+  DEFINE(CPU_TFSRE0_EL1,	offsetof(struct kvm_cpu_context, sys_regs[TFSRE0_EL1]));
   DEFINE(CPU_APIAKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APIAKEYLO_EL1]));
   DEFINE(CPU_APIBKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APIBKEYLO_EL1]));
   DEFINE(CPU_APDAKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APDAKEYLO_EL1]));
diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
index e831d3dfd50d..435346ea1504 100644
--- a/arch/arm64/kvm/hyp/entry.S
+++ b/arch/arm64/kvm/hyp/entry.S
@@ -13,6 +13,7 @@
 #include <asm/kvm_arm.h>
 #include <asm/kvm_asm.h>
 #include <asm/kvm_mmu.h>
+#include <asm/kvm_mte.h>
 #include <asm/kvm_ptrauth.h>
 
 	.text
@@ -51,6 +52,9 @@ alternative_else_nop_endif
 
 	add	x29, x0, #VCPU_CONTEXT
 
+	// mte_switch_to_guest(g_ctxt, h_ctxt, tmp1)
+	mte_switch_to_guest x29, x1, x2
+
 	// Macro ptrauth_switch_to_guest format:
 	// 	ptrauth_switch_to_guest(guest cxt, tmp1, tmp2, tmp3)
 	// The below macro to restore guest keys is not implemented in C code
@@ -142,6 +146,9 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL)
 	// when this feature is enabled for kernel code.
 	ptrauth_switch_to_hyp x1, x2, x3, x4, x5
 
+	// mte_switch_to_hyp(g_ctxt, h_ctxt, reg1)
+	mte_switch_to_hyp x1, x2, x3
+
 	// Restore hyp's sp_el0
 	restore_sp_el0 x2, x3
 
diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
index cce43bfe158f..de7e14c862e6 100644
--- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
+++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
@@ -14,6 +14,7 @@
 #include <asm/kvm_asm.h>
 #include <asm/kvm_emulate.h>
 #include <asm/kvm_hyp.h>
+#include <asm/kvm_mmu.h>
 
 static inline void __sysreg_save_common_state(struct kvm_cpu_context *ctxt)
 {
@@ -26,6 +27,16 @@ static inline void __sysreg_save_user_state(struct kvm_cpu_context *ctxt)
 	ctxt_sys_reg(ctxt, TPIDRRO_EL0)	= read_sysreg(tpidrro_el0);
 }
 
+static inline bool ctxt_has_mte(struct kvm_cpu_context *ctxt)
+{
+	struct kvm_vcpu *vcpu = ctxt->__hyp_running_vcpu;
+
+	if (!vcpu)
+		vcpu = container_of(ctxt, struct kvm_vcpu, arch.ctxt);
+
+	return kvm_has_mte(kern_hyp_va(vcpu->kvm));
+}
+
 static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
 {
 	ctxt_sys_reg(ctxt, CSSELR_EL1)	= read_sysreg(csselr_el1);
@@ -46,6 +57,11 @@ static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
 	ctxt_sys_reg(ctxt, PAR_EL1)	= read_sysreg_par();
 	ctxt_sys_reg(ctxt, TPIDR_EL1)	= read_sysreg(tpidr_el1);
 
+	if (ctxt_has_mte(ctxt)) {
+		ctxt_sys_reg(ctxt, TFSR_EL1) = read_sysreg_el1(SYS_TFSR);
+		ctxt_sys_reg(ctxt, TFSRE0_EL1) = read_sysreg_s(SYS_TFSRE0_EL1);
+	}
+
 	ctxt_sys_reg(ctxt, SP_EL1)	= read_sysreg(sp_el1);
 	ctxt_sys_reg(ctxt, ELR_EL1)	= read_sysreg_el1(SYS_ELR);
 	ctxt_sys_reg(ctxt, SPSR_EL1)	= read_sysreg_el1(SYS_SPSR);
@@ -107,6 +123,11 @@ static inline void __sysreg_restore_el1_state(struct kvm_cpu_context *ctxt)
 	write_sysreg(ctxt_sys_reg(ctxt, PAR_EL1),	par_el1);
 	write_sysreg(ctxt_sys_reg(ctxt, TPIDR_EL1),	tpidr_el1);
 
+	if (ctxt_has_mte(ctxt)) {
+		write_sysreg_el1(ctxt_sys_reg(ctxt, TFSR_EL1), SYS_TFSR);
+		write_sysreg_s(ctxt_sys_reg(ctxt, TFSRE0_EL1), SYS_TFSRE0_EL1);
+	}
+
 	if (!has_vhe() &&
 	    cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT) &&
 	    ctxt->__hyp_running_vcpu) {
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 24a844cb79ca..88adbc2286f2 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1305,6 +1305,20 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
 	return true;
 }
 
+static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
+				   const struct sys_reg_desc *rd)
+{
+	return REG_HIDDEN;
+}
+
+#define MTE_REG(name) {				\
+	SYS_DESC(SYS_##name),			\
+	.access = undef_access,			\
+	.reset = reset_unknown,			\
+	.reg = name,				\
+	.visibility = mte_visibility,		\
+}
+
 /* sys_reg_desc initialiser for known cpufeature ID registers */
 #define ID_SANITISED(name) {			\
 	SYS_DESC(SYS_##name),			\
@@ -1473,8 +1487,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_ACTLR_EL1), access_actlr, reset_actlr, ACTLR_EL1 },
 	{ SYS_DESC(SYS_CPACR_EL1), NULL, reset_val, CPACR_EL1, 0 },
 
-	{ SYS_DESC(SYS_RGSR_EL1), undef_access },
-	{ SYS_DESC(SYS_GCR_EL1), undef_access },
+	MTE_REG(RGSR_EL1),
+	MTE_REG(GCR_EL1),
 
 	{ SYS_DESC(SYS_ZCR_EL1), NULL, reset_val, ZCR_EL1, 0, .visibility = sve_visibility },
 	{ SYS_DESC(SYS_TRFCR_EL1), undef_access },
@@ -1501,8 +1515,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_ERXMISC0_EL1), trap_raz_wi },
 	{ SYS_DESC(SYS_ERXMISC1_EL1), trap_raz_wi },
 
-	{ SYS_DESC(SYS_TFSR_EL1), undef_access },
-	{ SYS_DESC(SYS_TFSRE0_EL1), undef_access },
+	MTE_REG(TFSR_EL1),
+	MTE_REG(TFSRE0_EL1),
 
 	{ SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 },
 	{ SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 },
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v12 6/8] arm64: kvm: Expose KVM_ARM_CAP_MTE
  2021-05-17 12:32 [PATCH v12 0/8] MTE support for KVM guest Steven Price
                   ` (4 preceding siblings ...)
  2021-05-17 12:32 ` [PATCH v12 5/8] arm64: kvm: Save/restore MTE registers Steven Price
@ 2021-05-17 12:32 ` Steven Price
  2021-05-17 17:40   ` Marc Zyngier
  2021-05-17 12:32 ` [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest Steven Price
  2021-05-17 12:32 ` [PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl Steven Price
  7 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-17 12:32 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

It's now safe for the VMM to enable MTE in a guest, so expose the
capability to user space.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/kvm/arm.c      | 9 +++++++++
 arch/arm64/kvm/sys_regs.c | 3 +++
 2 files changed, 12 insertions(+)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 1cb39c0803a4..e89a5e275e25 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		r = 0;
 		kvm->arch.return_nisv_io_abort_to_user = true;
 		break;
+	case KVM_CAP_ARM_MTE:
+		if (!system_supports_mte() || kvm->created_vcpus)
+			return -EINVAL;
+		r = 0;
+		kvm->arch.mte_enabled = true;
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -237,6 +243,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		 */
 		r = 1;
 		break;
+	case KVM_CAP_ARM_MTE:
+		r = system_supports_mte();
+		break;
 	case KVM_CAP_STEAL_TIME:
 		r = kvm_arm_pvtime_supported();
 		break;
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 88adbc2286f2..3a749fa0779b 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1308,6 +1308,9 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
 static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
 				   const struct sys_reg_desc *rd)
 {
+	if (kvm_has_mte(vcpu->kvm))
+		return 0;
+
 	return REG_HIDDEN;
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-05-17 12:32 [PATCH v12 0/8] MTE support for KVM guest Steven Price
                   ` (5 preceding siblings ...)
  2021-05-17 12:32 ` [PATCH v12 6/8] arm64: kvm: Expose KVM_ARM_CAP_MTE Steven Price
@ 2021-05-17 12:32 ` Steven Price
  2021-05-17 18:04   ` Marc Zyngier
  2021-05-20 12:05   ` Catalin Marinas
  2021-05-17 12:32 ` [PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl Steven Price
  7 siblings, 2 replies; 49+ messages in thread
From: Steven Price @ 2021-05-17 12:32 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

The VMM may not wish to have it's own mapping of guest memory mapped
with PROT_MTE because this causes problems if the VMM has tag checking
enabled (the guest controls the tags in physical RAM and it's unlikely
the tags are correct for the VMM).

Instead add a new ioctl which allows the VMM to easily read/write the
tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
while the VMM can still read/write the tags for the purpose of
migration.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/uapi/asm/kvm.h | 11 +++++
 arch/arm64/kvm/arm.c              | 69 +++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h          |  1 +
 3 files changed, 81 insertions(+)

diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 24223adae150..b3edde68bc3e 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -184,6 +184,17 @@ struct kvm_vcpu_events {
 	__u32 reserved[12];
 };
 
+struct kvm_arm_copy_mte_tags {
+	__u64 guest_ipa;
+	__u64 length;
+	void __user *addr;
+	__u64 flags;
+	__u64 reserved[2];
+};
+
+#define KVM_ARM_TAGS_TO_GUEST		0
+#define KVM_ARM_TAGS_FROM_GUEST		1
+
 /* If you need to interpret the index values, here is the key: */
 #define KVM_REG_ARM_COPROC_MASK		0x000000000FFF0000
 #define KVM_REG_ARM_COPROC_SHIFT	16
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e89a5e275e25..4b6c83beb75d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
 	}
 }
 
+static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+				      struct kvm_arm_copy_mte_tags *copy_tags)
+{
+	gpa_t guest_ipa = copy_tags->guest_ipa;
+	size_t length = copy_tags->length;
+	void __user *tags = copy_tags->addr;
+	gpa_t gfn;
+	bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
+	int ret = 0;
+
+	if (copy_tags->reserved[0] || copy_tags->reserved[1])
+		return -EINVAL;
+
+	if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
+		return -EINVAL;
+
+	if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
+		return -EINVAL;
+
+	gfn = gpa_to_gfn(guest_ipa);
+
+	mutex_lock(&kvm->slots_lock);
+
+	while (length > 0) {
+		kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
+		void *maddr;
+		unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
+
+		if (is_error_noslot_pfn(pfn)) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		maddr = page_address(pfn_to_page(pfn));
+
+		if (!write) {
+			num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
+			kvm_release_pfn_clean(pfn);
+		} else {
+			num_tags = mte_copy_tags_from_user(maddr, tags,
+							   num_tags);
+			kvm_release_pfn_dirty(pfn);
+		}
+
+		if (num_tags != PAGE_SIZE / MTE_GRANULE_SIZE) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		gfn++;
+		tags += num_tags;
+		length -= PAGE_SIZE;
+	}
+
+out:
+	mutex_unlock(&kvm->slots_lock);
+	return ret;
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
 		       unsigned int ioctl, unsigned long arg)
 {
@@ -1345,6 +1404,16 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 		return 0;
 	}
+	case KVM_ARM_MTE_COPY_TAGS: {
+		struct kvm_arm_copy_mte_tags copy_tags;
+
+		if (!kvm_has_mte(kvm))
+			return -EINVAL;
+
+		if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
+			return -EFAULT;
+		return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
+	}
 	default:
 		return -EINVAL;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 8c95ba0fadda..4c011c60d468 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1428,6 +1428,7 @@ struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_PMU_EVENT_FILTER */
 #define KVM_SET_PMU_EVENT_FILTER  _IOW(KVMIO,  0xb2, struct kvm_pmu_event_filter)
 #define KVM_PPC_SVM_OFF		  _IO(KVMIO,  0xb3)
+#define KVM_ARM_MTE_COPY_TAGS	  _IOR(KVMIO,  0xb4, struct kvm_arm_copy_mte_tags)
 
 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl
  2021-05-17 12:32 [PATCH v12 0/8] MTE support for KVM guest Steven Price
                   ` (6 preceding siblings ...)
  2021-05-17 12:32 ` [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest Steven Price
@ 2021-05-17 12:32 ` Steven Price
  2021-05-17 18:09   ` Marc Zyngier
  7 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-17 12:32 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
granting a guest access to the tags, and provides a mechanism for the
VMM to enable it.

A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
access the tags of a guest without having to maintain a PROT_MTE mapping
in userspace. The above capability gates access to the ioctl.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 Documentation/virt/kvm/api.rst | 53 ++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 22d077562149..a31661b870ba 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5034,6 +5034,40 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
 with the KVM_XEN_VCPU_GET_ATTR ioctl.
 
+4.130 KVM_ARM_MTE_COPY_TAGS
+---------------------------
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: 0 on success, < 0 on error
+
+::
+
+  struct kvm_arm_copy_mte_tags {
+	__u64 guest_ipa;
+	__u64 length;
+	union {
+		void __user *addr;
+		__u64 padding;
+	};
+	__u64 flags;
+	__u64 reserved[2];
+  };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+fieldmust point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / MTE_GRANULE_SIZE)``
+bytes (i.e. 1/16th of the corresponding size). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
 5. The kvm_run structure
 ========================
 
@@ -6362,6 +6396,25 @@ default.
 
 See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
 
+7.26 KVM_CAP_ARM_MTE
+--------------------
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before the guest will be granted access.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` so
+that the tags are maintained during swap or hibernation of the host; however
+the VMM needs to manually save/restore the tags as appropriate if the VM is
+migrated.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
 8. Other capabilities.
 ======================
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 1/8] arm64: mte: Handle race when synchronising tags
  2021-05-17 12:32 ` [PATCH v12 1/8] arm64: mte: Handle race when synchronising tags Steven Price
@ 2021-05-17 14:03   ` Marc Zyngier
  2021-05-17 14:56     ` Steven Price
  2021-05-19 17:32   ` Catalin Marinas
  1 sibling, 1 reply; 49+ messages in thread
From: Marc Zyngier @ 2021-05-17 14:03 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

Hi Steven,

On Mon, 17 May 2021 13:32:32 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
> before restoring/zeroing the MTE tags. However if another thread were to
> race and attempt to sync the tags on the same page before the first
> thread had completed restoring/zeroing then it would see the flag is
> already set and continue without waiting. This would potentially expose
> the previous contents of the tags to user space, and cause any updates
> that user space makes before the restoring/zeroing has completed to
> potentially be lost.
> 
> Since this code is run from atomic contexts we can't just lock the page
> during the process. Instead implement a new (global) spinlock to protect
> the mte_sync_page_tags() function.
> 
> Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in user-space with PROT_MTE")
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> ---
>  arch/arm64/kernel/mte.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index 125a10e413e9..c88e778c2fa9 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -25,6 +25,7 @@
>  u64 gcr_kernel_excl __ro_after_init;
>  
>  static bool report_fault_once = true;
> +static spinlock_t tag_sync_lock;

What initialises this spinlock? Have you tried this with lockdep? I'd
expect it to be defined with DEFINE_SPINLOCK(), which always does the
right thing.

>  
>  #ifdef CONFIG_KASAN_HW_TAGS
>  /* Whether the MTE asynchronous mode is enabled. */
> @@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
>  
>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
>  {
> +	unsigned long flags;
>  	pte_t old_pte = READ_ONCE(*ptep);
>  
> +	spin_lock_irqsave(&tag_sync_lock, flags);
> +
> +	/* Recheck with the lock held */
> +	if (test_bit(PG_mte_tagged, &page->flags))
> +		goto out;
> +
>  	if (check_swap && is_swap_pte(old_pte)) {
>  		swp_entry_t entry = pte_to_swp_entry(old_pte);
>  
> -		if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
> -			return;
> +		if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
> +			set_bit(PG_mte_tagged, &page->flags);
> +			goto out;
> +		}
>  	}
>  
>  	page_kasan_tag_reset(page);
> @@ -53,6 +63,10 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
>  	 */
>  	smp_wmb();
>  	mte_clear_page_tags(page_address(page));
> +	set_bit(PG_mte_tagged, &page->flags);
> +
> +out:
> +	spin_unlock_irqrestore(&tag_sync_lock, flags);
>  }
>  
>  void mte_sync_tags(pte_t *ptep, pte_t pte)
> @@ -60,10 +74,11 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
>  	struct page *page = pte_page(pte);
>  	long i, nr_pages = compound_nr(page);
>  	bool check_swap = nr_pages == 1;
> +	bool pte_is_tagged = pte_tagged(pte);
>  
>  	/* if PG_mte_tagged is set, tags have already been initialised */
>  	for (i = 0; i < nr_pages; i++, page++) {
> -		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> +		if (!test_bit(PG_mte_tagged, &page->flags))
>  			mte_sync_page_tags(page, ptep, check_swap);
>  	}
>  }

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 1/8] arm64: mte: Handle race when synchronising tags
  2021-05-17 14:03   ` Marc Zyngier
@ 2021-05-17 14:56     ` Steven Price
  0 siblings, 0 replies; 49+ messages in thread
From: Steven Price @ 2021-05-17 14:56 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 17/05/2021 15:03, Marc Zyngier wrote:
> Hi Steven,

Hi Marc,

> On Mon, 17 May 2021 13:32:32 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
>> before restoring/zeroing the MTE tags. However if another thread were to
>> race and attempt to sync the tags on the same page before the first
>> thread had completed restoring/zeroing then it would see the flag is
>> already set and continue without waiting. This would potentially expose
>> the previous contents of the tags to user space, and cause any updates
>> that user space makes before the restoring/zeroing has completed to
>> potentially be lost.
>>
>> Since this code is run from atomic contexts we can't just lock the page
>> during the process. Instead implement a new (global) spinlock to protect
>> the mte_sync_page_tags() function.
>>
>> Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in user-space with PROT_MTE")
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> ---
>>  arch/arm64/kernel/mte.c | 21 ++++++++++++++++++---
>>  1 file changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> index 125a10e413e9..c88e778c2fa9 100644
>> --- a/arch/arm64/kernel/mte.c
>> +++ b/arch/arm64/kernel/mte.c
>> @@ -25,6 +25,7 @@
>>  u64 gcr_kernel_excl __ro_after_init;
>>  
>>  static bool report_fault_once = true;
>> +static spinlock_t tag_sync_lock;
> 
> What initialises this spinlock? Have you tried this with lockdep? I'd
> expect it to be defined with DEFINE_SPINLOCK(), which always does the
> right thing.

You of course are absolute right, and this will blow up with lockdep.
Sorry about that. DEFINE_SPINLOCK() solves the problem.

Thanks,

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged
  2021-05-17 12:32 ` [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged Steven Price
@ 2021-05-17 16:14   ` Marc Zyngier
  2021-05-19  9:32     ` Steven Price
  2021-05-19 18:06   ` Catalin Marinas
  1 sibling, 1 reply; 49+ messages in thread
From: Marc Zyngier @ 2021-05-17 16:14 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Mon, 17 May 2021 13:32:34 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> A KVM guest could store tags in a page even if the VMM hasn't mapped
> the page with PROT_MTE. So when restoring pages from swap we will
> need to check to see if there are any saved tags even if !pte_tagged().
> 
> However don't check pages for which pte_access_permitted() returns false
> as these will not have been swapped out.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/include/asm/pgtable.h |  9 +++++++--
>  arch/arm64/kernel/mte.c          | 16 ++++++++++++++--
>  2 files changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0b10204e72fc..275178a810c1 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -314,8 +314,13 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>  	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
>  		__sync_icache_dcache(pte);
>  
> -	if (system_supports_mte() &&
> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
> +	/*
> +	 * If the PTE would provide user space access to the tags associated
> +	 * with it then ensure that the MTE tags are synchronised.  Exec-only
> +	 * mappings don't expose tags (instruction fetches don't check tags).

I'm not sure I understand this comment. Of course, execution doesn't
match tags. But the memory could still have tags associated with
it. Does this mean such a page would lose its tags is swapped out?

Thanks,

	M.

> +	 */
> +	if (system_supports_mte() && pte_present(pte) &&
> +	    pte_access_permitted(pte, false) && !pte_special(pte))
>  		mte_sync_tags(ptep, pte);
>  
>  	__check_racy_pte_update(mm, ptep, pte);
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index c88e778c2fa9..a604818c52c1 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -33,11 +33,15 @@ DEFINE_STATIC_KEY_FALSE(mte_async_mode);
>  EXPORT_SYMBOL_GPL(mte_async_mode);
>  #endif
>  
> -static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
> +static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
> +			       bool pte_is_tagged)
>  {
>  	unsigned long flags;
>  	pte_t old_pte = READ_ONCE(*ptep);
>  
> +	if (!is_swap_pte(old_pte) && !pte_is_tagged)
> +		return;
> +
>  	spin_lock_irqsave(&tag_sync_lock, flags);
>  
>  	/* Recheck with the lock held */
> @@ -53,6 +57,9 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
>  		}
>  	}
>  
> +	if (!pte_is_tagged)
> +		goto out;
> +
>  	page_kasan_tag_reset(page);
>  	/*
>  	 * We need smp_wmb() in between setting the flags and clearing the
> @@ -76,10 +83,15 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
>  	bool check_swap = nr_pages == 1;
>  	bool pte_is_tagged = pte_tagged(pte);
>  
> +	/* Early out if there's nothing to do */
> +	if (!check_swap && !pte_is_tagged)
> +		return;
> +
>  	/* if PG_mte_tagged is set, tags have already been initialised */
>  	for (i = 0; i < nr_pages; i++, page++) {
>  		if (!test_bit(PG_mte_tagged, &page->flags))
> -			mte_sync_page_tags(page, ptep, check_swap);
> +			mte_sync_page_tags(page, ptep, check_swap,
> +					   pte_is_tagged);
>  	}
>  }
>  
> -- 
> 2.20.1
> 
> 

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature
  2021-05-17 12:32 ` [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature Steven Price
@ 2021-05-17 16:45   ` Marc Zyngier
  2021-05-19 10:48     ` Steven Price
  2021-05-20 11:54   ` Catalin Marinas
  1 sibling, 1 reply; 49+ messages in thread
From: Marc Zyngier @ 2021-05-17 16:45 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Mon, 17 May 2021 13:32:35 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
> for a VM. This will expose the feature to the guest and automatically
> tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
> storage) to ensure that the guest cannot see stale tags, and so that
> the tags are correctly saved/restored across swap.
> 
> Actually exposing the new capability to user space happens in a later
> patch.

uber nit in $SUBJECT: "KVM: arm64:" is the preferred prefix (just like
patches 7 and 8).

> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/include/asm/kvm_emulate.h |  3 +++
>  arch/arm64/include/asm/kvm_host.h    |  3 +++
>  arch/arm64/kvm/hyp/exception.c       |  3 ++-
>  arch/arm64/kvm/mmu.c                 | 37 +++++++++++++++++++++++++++-
>  arch/arm64/kvm/sys_regs.c            |  3 +++
>  include/uapi/linux/kvm.h             |  1 +
>  6 files changed, 48 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index f612c090f2e4..6bf776c2399c 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
>  	if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
>  	    vcpu_el1_is_32bit(vcpu))
>  		vcpu->arch.hcr_el2 |= HCR_TID2;
> +
> +	if (kvm_has_mte(vcpu->kvm))
> +		vcpu->arch.hcr_el2 |= HCR_ATA;
>  }
>  
>  static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 7cd7d5c8c4bc..afaa5333f0e4 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -132,6 +132,8 @@ struct kvm_arch {
>  
>  	u8 pfr0_csv2;
>  	u8 pfr0_csv3;
> +	/* Memory Tagging Extension enabled for the guest */
> +	bool mte_enabled;
>  };
>  
>  struct kvm_vcpu_fault_info {
> @@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
>  #define kvm_arm_vcpu_sve_finalized(vcpu) \
>  	((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
>  
> +#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
>  #define kvm_vcpu_has_pmu(vcpu)					\
>  	(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
>  
> diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
> index 73629094f903..56426565600c 100644
> --- a/arch/arm64/kvm/hyp/exception.c
> +++ b/arch/arm64/kvm/hyp/exception.c
> @@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
>  	new |= (old & PSR_C_BIT);
>  	new |= (old & PSR_V_BIT);
>  
> -	// TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
> +	if (kvm_has_mte(vcpu->kvm))
> +		new |= PSR_TCO_BIT;
>  
>  	new |= (old & PSR_DIT_BIT);
>  
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c5d1f3c87dbd..8660f6a03f51 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -822,6 +822,31 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>  	return PAGE_SIZE;
>  }
>  
> +static int sanitise_mte_tags(struct kvm *kvm, unsigned long size,
> +			     kvm_pfn_t pfn)

Nit: please order the parameters as address, then size.

> +{
> +	if (kvm_has_mte(kvm)) {
> +		/*
> +		 * The page will be mapped in stage 2 as Normal Cacheable, so
> +		 * the VM will be able to see the page's tags and therefore
> +		 * they must be initialised first. If PG_mte_tagged is set,
> +		 * tags have already been initialised.
> +		 */
> +		unsigned long i, nr_pages = size >> PAGE_SHIFT;
> +		struct page *page = pfn_to_online_page(pfn);
> +
> +		if (!page)
> +			return -EFAULT;

Under which circumstances can this happen? We already have done a GUP
on the page, so I really can't see how the page can vanish from under
our feet.

> +
> +		for (i = 0; i < nr_pages; i++, page++) {
> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> +				mte_clear_page_tags(page_address(page));

You seem to be doing this irrespective of the VMA being created with
PROT_MTE. This is fine form a guest perspective (all its memory should
be MTE capable). However, I can't see any guarantee that the VMM will
actually allocate memslots with PROT_MTE.

Aren't we missing some sanity checks at memslot registration time?

> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -971,8 +996,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (writable)
>  		prot |= KVM_PGTABLE_PROT_W;
>  
> -	if (fault_status != FSC_PERM && !device)
> +	if (fault_status != FSC_PERM && !device) {
> +		ret = sanitise_mte_tags(kvm, vma_pagesize, pfn);
> +		if (ret)
> +			goto out_unlock;
> +
>  		clean_dcache_guest_page(pfn, vma_pagesize);
> +	}
>  
>  	if (exec_fault) {
>  		prot |= KVM_PGTABLE_PROT_X;
> @@ -1168,12 +1198,17 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	kvm_pfn_t pfn = pte_pfn(range->pte);
> +	int ret;
>  
>  	if (!kvm->arch.mmu.pgt)
>  		return 0;
>  
>  	WARN_ON(range->end - range->start != 1);
>  
> +	ret = sanitise_mte_tags(kvm, PAGE_SIZE, pfn);
> +	if (ret)
> +		return ret;

Notice the change in return type?

> +
>  	/*
>  	 * We've moved a page around, probably through CoW, so let's treat it
>  	 * just like a translation fault and clean the cache to the PoC.
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 76ea2800c33e..24a844cb79ca 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -1047,6 +1047,9 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
>  		break;
>  	case SYS_ID_AA64PFR1_EL1:
>  		val &= ~FEATURE(ID_AA64PFR1_MTE);
> +		if (kvm_has_mte(vcpu->kvm))
> +			val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE),
> +					  ID_AA64PFR1_MTE);

Shouldn't this be consistent with what the HW is capable of
(i.e. FEAT_MTE3 if available), and extracted from the sanitised view
of the feature set?

>  		break;
>  	case SYS_ID_AA64ISAR1_EL1:
>  		if (!vcpu_has_ptrauth(vcpu))
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 3fd9a7e9d90c..8c95ba0fadda 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1082,6 +1082,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_SGX_ATTRIBUTE 196
>  #define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
>  #define KVM_CAP_PTP_KVM 198
> +#define KVM_CAP_ARM_MTE 199
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 5/8] arm64: kvm: Save/restore MTE registers
  2021-05-17 12:32 ` [PATCH v12 5/8] arm64: kvm: Save/restore MTE registers Steven Price
@ 2021-05-17 17:17   ` Marc Zyngier
  2021-05-19 13:04     ` Steven Price
  0 siblings, 1 reply; 49+ messages in thread
From: Marc Zyngier @ 2021-05-17 17:17 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Mon, 17 May 2021 13:32:36 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> Define the new system registers that MTE introduces and context switch
> them. The MTE feature is still hidden from the ID register as it isn't
> supported in a VM yet.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/include/asm/kvm_host.h          |  6 ++
>  arch/arm64/include/asm/kvm_mte.h           | 66 ++++++++++++++++++++++
>  arch/arm64/include/asm/sysreg.h            |  3 +-
>  arch/arm64/kernel/asm-offsets.c            |  3 +
>  arch/arm64/kvm/hyp/entry.S                 |  7 +++
>  arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++++++
>  arch/arm64/kvm/sys_regs.c                  | 22 ++++++--
>  7 files changed, 123 insertions(+), 5 deletions(-)
>  create mode 100644 arch/arm64/include/asm/kvm_mte.h
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index afaa5333f0e4..309e36cc1b42 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -208,6 +208,12 @@ enum vcpu_sysreg {
>  	CNTP_CVAL_EL0,
>  	CNTP_CTL_EL0,
>  
> +	/* Memory Tagging Extension registers */
> +	RGSR_EL1,	/* Random Allocation Tag Seed Register */
> +	GCR_EL1,	/* Tag Control Register */
> +	TFSR_EL1,	/* Tag Fault Status Register (EL1) */
> +	TFSRE0_EL1,	/* Tag Fault Status Register (EL0) */
> +
>  	/* 32bit specific registers. Keep them at the end of the range */
>  	DACR32_EL2,	/* Domain Access Control Register */
>  	IFSR32_EL2,	/* Instruction Fault Status Register */
> diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
> new file mode 100644
> index 000000000000..6541c7d6ce06
> --- /dev/null
> +++ b/arch/arm64/include/asm/kvm_mte.h
> @@ -0,0 +1,66 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2020 ARM Ltd.
> + */
> +#ifndef __ASM_KVM_MTE_H
> +#define __ASM_KVM_MTE_H
> +
> +#ifdef __ASSEMBLY__
> +
> +#include <asm/sysreg.h>
> +
> +#ifdef CONFIG_ARM64_MTE
> +
> +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
> +alternative_if_not ARM64_MTE
> +	b	.L__skip_switch\@
> +alternative_else_nop_endif
> +	mrs	\reg1, hcr_el2
> +	and	\reg1, \reg1, #(HCR_ATA)
> +	cbz	\reg1, .L__skip_switch\@
> +
> +	mrs_s	\reg1, SYS_RGSR_EL1
> +	str	\reg1, [\h_ctxt, #CPU_RGSR_EL1]
> +	mrs_s	\reg1, SYS_GCR_EL1
> +	str	\reg1, [\h_ctxt, #CPU_GCR_EL1]
> +
> +	ldr	\reg1, [\g_ctxt, #CPU_RGSR_EL1]
> +	msr_s	SYS_RGSR_EL1, \reg1
> +	ldr	\reg1, [\g_ctxt, #CPU_GCR_EL1]
> +	msr_s	SYS_GCR_EL1, \reg1
> +
> +.L__skip_switch\@:
> +.endm
> +
> +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
> +alternative_if_not ARM64_MTE
> +	b	.L__skip_switch\@
> +alternative_else_nop_endif
> +	mrs	\reg1, hcr_el2
> +	and	\reg1, \reg1, #(HCR_ATA)
> +	cbz	\reg1, .L__skip_switch\@
> +
> +	mrs_s	\reg1, SYS_RGSR_EL1
> +	str	\reg1, [\g_ctxt, #CPU_RGSR_EL1]
> +	mrs_s	\reg1, SYS_GCR_EL1
> +	str	\reg1, [\g_ctxt, #CPU_GCR_EL1]
> +
> +	ldr	\reg1, [\h_ctxt, #CPU_RGSR_EL1]
> +	msr_s	SYS_RGSR_EL1, \reg1
> +	ldr	\reg1, [\h_ctxt, #CPU_GCR_EL1]
> +	msr_s	SYS_GCR_EL1, \reg1

What is the rational for not having any synchronisation here? It is
quite uncommon to allocate memory at EL2, but VHE can perform all kind
of tricks.

> +
> +.L__skip_switch\@:
> +.endm
> +
> +#else /* CONFIG_ARM64_MTE */
> +
> +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
> +.endm
> +
> +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
> +.endm
> +
> +#endif /* CONFIG_ARM64_MTE */
> +#endif /* __ASSEMBLY__ */
> +#endif /* __ASM_KVM_MTE_H */
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index 65d15700a168..347ccac2341e 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -651,7 +651,8 @@
>  
>  #define INIT_SCTLR_EL2_MMU_ON						\
>  	(SCTLR_ELx_M  | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I |	\
> -	 SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | SCTLR_EL2_RES1)
> +	 SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 |		\
> +	 SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)
>  
>  #define INIT_SCTLR_EL2_MMU_OFF \
>  	(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> index 0cb34ccb6e73..6b489a8462f0 100644
> --- a/arch/arm64/kernel/asm-offsets.c
> +++ b/arch/arm64/kernel/asm-offsets.c
> @@ -111,6 +111,9 @@ int main(void)
>    DEFINE(VCPU_WORKAROUND_FLAGS,	offsetof(struct kvm_vcpu, arch.workaround_flags));
>    DEFINE(VCPU_HCR_EL2,		offsetof(struct kvm_vcpu, arch.hcr_el2));
>    DEFINE(CPU_USER_PT_REGS,	offsetof(struct kvm_cpu_context, regs));
> +  DEFINE(CPU_RGSR_EL1,		offsetof(struct kvm_cpu_context, sys_regs[RGSR_EL1]));
> +  DEFINE(CPU_GCR_EL1,		offsetof(struct kvm_cpu_context, sys_regs[GCR_EL1]));
> +  DEFINE(CPU_TFSRE0_EL1,	offsetof(struct kvm_cpu_context, sys_regs[TFSRE0_EL1]));

TFSRE0_EL1 is never accessed from assembly code. Leftover from a
previous version?

>    DEFINE(CPU_APIAKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APIAKEYLO_EL1]));
>    DEFINE(CPU_APIBKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APIBKEYLO_EL1]));
>    DEFINE(CPU_APDAKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APDAKEYLO_EL1]));
> diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
> index e831d3dfd50d..435346ea1504 100644
> --- a/arch/arm64/kvm/hyp/entry.S
> +++ b/arch/arm64/kvm/hyp/entry.S
> @@ -13,6 +13,7 @@
>  #include <asm/kvm_arm.h>
>  #include <asm/kvm_asm.h>
>  #include <asm/kvm_mmu.h>
> +#include <asm/kvm_mte.h>
>  #include <asm/kvm_ptrauth.h>
>  
>  	.text
> @@ -51,6 +52,9 @@ alternative_else_nop_endif
>  
>  	add	x29, x0, #VCPU_CONTEXT
>  
> +	// mte_switch_to_guest(g_ctxt, h_ctxt, tmp1)
> +	mte_switch_to_guest x29, x1, x2
> +
>  	// Macro ptrauth_switch_to_guest format:
>  	// 	ptrauth_switch_to_guest(guest cxt, tmp1, tmp2, tmp3)
>  	// The below macro to restore guest keys is not implemented in C code
> @@ -142,6 +146,9 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL)
>  	// when this feature is enabled for kernel code.
>  	ptrauth_switch_to_hyp x1, x2, x3, x4, x5
>  
> +	// mte_switch_to_hyp(g_ctxt, h_ctxt, reg1)
> +	mte_switch_to_hyp x1, x2, x3
> +
>  	// Restore hyp's sp_el0
>  	restore_sp_el0 x2, x3
>  
> diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
> index cce43bfe158f..de7e14c862e6 100644
> --- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
> +++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
> @@ -14,6 +14,7 @@
>  #include <asm/kvm_asm.h>
>  #include <asm/kvm_emulate.h>
>  #include <asm/kvm_hyp.h>
> +#include <asm/kvm_mmu.h>
>  
>  static inline void __sysreg_save_common_state(struct kvm_cpu_context *ctxt)
>  {
> @@ -26,6 +27,16 @@ static inline void __sysreg_save_user_state(struct kvm_cpu_context *ctxt)
>  	ctxt_sys_reg(ctxt, TPIDRRO_EL0)	= read_sysreg(tpidrro_el0);
>  }
>  
> +static inline bool ctxt_has_mte(struct kvm_cpu_context *ctxt)
> +{
> +	struct kvm_vcpu *vcpu = ctxt->__hyp_running_vcpu;
> +
> +	if (!vcpu)
> +		vcpu = container_of(ctxt, struct kvm_vcpu, arch.ctxt);
> +
> +	return kvm_has_mte(kern_hyp_va(vcpu->kvm));
> +}
> +
>  static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
>  {
>  	ctxt_sys_reg(ctxt, CSSELR_EL1)	= read_sysreg(csselr_el1);
> @@ -46,6 +57,11 @@ static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
>  	ctxt_sys_reg(ctxt, PAR_EL1)	= read_sysreg_par();
>  	ctxt_sys_reg(ctxt, TPIDR_EL1)	= read_sysreg(tpidr_el1);
>  
> +	if (ctxt_has_mte(ctxt)) {
> +		ctxt_sys_reg(ctxt, TFSR_EL1) = read_sysreg_el1(SYS_TFSR);
> +		ctxt_sys_reg(ctxt, TFSRE0_EL1) = read_sysreg_s(SYS_TFSRE0_EL1);
> +	}

I remember suggesting that this is slightly heavier than necessary.

On nVHE, TFSRE0_EL1 could be moved to load/put, as we never run
userspace with a vcpu loaded. The same holds of course for VHE, but we
also can move TFSR_EL1 to load/put, as the host uses TFSR_EL2.

Do you see any issue with that?

> +
>  	ctxt_sys_reg(ctxt, SP_EL1)	= read_sysreg(sp_el1);
>  	ctxt_sys_reg(ctxt, ELR_EL1)	= read_sysreg_el1(SYS_ELR);
>  	ctxt_sys_reg(ctxt, SPSR_EL1)	= read_sysreg_el1(SYS_SPSR);
> @@ -107,6 +123,11 @@ static inline void __sysreg_restore_el1_state(struct kvm_cpu_context *ctxt)
>  	write_sysreg(ctxt_sys_reg(ctxt, PAR_EL1),	par_el1);
>  	write_sysreg(ctxt_sys_reg(ctxt, TPIDR_EL1),	tpidr_el1);
>  
> +	if (ctxt_has_mte(ctxt)) {
> +		write_sysreg_el1(ctxt_sys_reg(ctxt, TFSR_EL1), SYS_TFSR);
> +		write_sysreg_s(ctxt_sys_reg(ctxt, TFSRE0_EL1), SYS_TFSRE0_EL1);
> +	}
> +
>  	if (!has_vhe() &&
>  	    cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT) &&
>  	    ctxt->__hyp_running_vcpu) {
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 24a844cb79ca..88adbc2286f2 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -1305,6 +1305,20 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
>  	return true;
>  }
>  
> +static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
> +				   const struct sys_reg_desc *rd)
> +{
> +	return REG_HIDDEN;
> +}
> +
> +#define MTE_REG(name) {				\
> +	SYS_DESC(SYS_##name),			\
> +	.access = undef_access,			\
> +	.reset = reset_unknown,			\
> +	.reg = name,				\
> +	.visibility = mte_visibility,		\
> +}
> +
>  /* sys_reg_desc initialiser for known cpufeature ID registers */
>  #define ID_SANITISED(name) {			\
>  	SYS_DESC(SYS_##name),			\
> @@ -1473,8 +1487,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>  	{ SYS_DESC(SYS_ACTLR_EL1), access_actlr, reset_actlr, ACTLR_EL1 },
>  	{ SYS_DESC(SYS_CPACR_EL1), NULL, reset_val, CPACR_EL1, 0 },
>  
> -	{ SYS_DESC(SYS_RGSR_EL1), undef_access },
> -	{ SYS_DESC(SYS_GCR_EL1), undef_access },
> +	MTE_REG(RGSR_EL1),
> +	MTE_REG(GCR_EL1),
>  
>  	{ SYS_DESC(SYS_ZCR_EL1), NULL, reset_val, ZCR_EL1, 0, .visibility = sve_visibility },
>  	{ SYS_DESC(SYS_TRFCR_EL1), undef_access },
> @@ -1501,8 +1515,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>  	{ SYS_DESC(SYS_ERXMISC0_EL1), trap_raz_wi },
>  	{ SYS_DESC(SYS_ERXMISC1_EL1), trap_raz_wi },
>  
> -	{ SYS_DESC(SYS_TFSR_EL1), undef_access },
> -	{ SYS_DESC(SYS_TFSRE0_EL1), undef_access },
> +	MTE_REG(TFSR_EL1),
> +	MTE_REG(TFSRE0_EL1),
>  
>  	{ SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 },
>  	{ SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 },

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 6/8] arm64: kvm: Expose KVM_ARM_CAP_MTE
  2021-05-17 12:32 ` [PATCH v12 6/8] arm64: kvm: Expose KVM_ARM_CAP_MTE Steven Price
@ 2021-05-17 17:40   ` Marc Zyngier
  2021-05-19 13:26     ` Steven Price
  0 siblings, 1 reply; 49+ messages in thread
From: Marc Zyngier @ 2021-05-17 17:40 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Mon, 17 May 2021 13:32:37 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> It's now safe for the VMM to enable MTE in a guest, so expose the
> capability to user space.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/kvm/arm.c      | 9 +++++++++
>  arch/arm64/kvm/sys_regs.c | 3 +++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 1cb39c0803a4..e89a5e275e25 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  		r = 0;
>  		kvm->arch.return_nisv_io_abort_to_user = true;
>  		break;
> +	case KVM_CAP_ARM_MTE:
> +		if (!system_supports_mte() || kvm->created_vcpus)
> +			return -EINVAL;
> +		r = 0;
> +		kvm->arch.mte_enabled = true;

As far as I can tell from the architecture, this isn't valid for a
32bit guest.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-05-17 12:32 ` [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest Steven Price
@ 2021-05-17 18:04   ` Marc Zyngier
  2021-05-19 13:51     ` Steven Price
  2021-05-20 12:05   ` Catalin Marinas
  1 sibling, 1 reply; 49+ messages in thread
From: Marc Zyngier @ 2021-05-17 18:04 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Mon, 17 May 2021 13:32:38 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> The VMM may not wish to have it's own mapping of guest memory mapped
> with PROT_MTE because this causes problems if the VMM has tag checking
> enabled (the guest controls the tags in physical RAM and it's unlikely
> the tags are correct for the VMM).
> 
> Instead add a new ioctl which allows the VMM to easily read/write the
> tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
> while the VMM can still read/write the tags for the purpose of
> migration.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/include/uapi/asm/kvm.h | 11 +++++
>  arch/arm64/kvm/arm.c              | 69 +++++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h          |  1 +
>  3 files changed, 81 insertions(+)
> 
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 24223adae150..b3edde68bc3e 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
>  	__u32 reserved[12];
>  };
>  
> +struct kvm_arm_copy_mte_tags {
> +	__u64 guest_ipa;
> +	__u64 length;
> +	void __user *addr;
> +	__u64 flags;
> +	__u64 reserved[2];
> +};
> +
> +#define KVM_ARM_TAGS_TO_GUEST		0
> +#define KVM_ARM_TAGS_FROM_GUEST		1
> +
>  /* If you need to interpret the index values, here is the key: */
>  #define KVM_REG_ARM_COPROC_MASK		0x000000000FFF0000
>  #define KVM_REG_ARM_COPROC_SHIFT	16
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e89a5e275e25..4b6c83beb75d 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>  	}
>  }
>  
> +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
> +				      struct kvm_arm_copy_mte_tags *copy_tags)
> +{
> +	gpa_t guest_ipa = copy_tags->guest_ipa;
> +	size_t length = copy_tags->length;
> +	void __user *tags = copy_tags->addr;
> +	gpa_t gfn;
> +	bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
> +	int ret = 0;
> +
> +	if (copy_tags->reserved[0] || copy_tags->reserved[1])
> +		return -EINVAL;
> +
> +	if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
> +		return -EINVAL;
> +
> +	if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
> +		return -EINVAL;
> +
> +	gfn = gpa_to_gfn(guest_ipa);
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	while (length > 0) {
> +		kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
> +		void *maddr;
> +		unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;

nit: this is a compile time constant, make it a #define. This will
avoid the confusing overloading of "num_tags" as both an input and an
output for the mte_copy_tags-* functions.

> +
> +		if (is_error_noslot_pfn(pfn)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +
> +		maddr = page_address(pfn_to_page(pfn));
> +
> +		if (!write) {
> +			num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
> +			kvm_release_pfn_clean(pfn);
> +		} else {
> +			num_tags = mte_copy_tags_from_user(maddr, tags,
> +							   num_tags);
> +			kvm_release_pfn_dirty(pfn);
> +		}
> +
> +		if (num_tags != PAGE_SIZE / MTE_GRANULE_SIZE) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +
> +		gfn++;
> +		tags += num_tags;
> +		length -= PAGE_SIZE;
> +	}
> +
> +out:
> +	mutex_unlock(&kvm->slots_lock);
> +	return ret;
> +}
> +

nit again: I'd really prefer it if you moved this to guest.c, where we
already have a bunch of the save/restore stuff.

>  long kvm_arch_vm_ioctl(struct file *filp,
>  		       unsigned int ioctl, unsigned long arg)
>  {
> @@ -1345,6 +1404,16 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  
>  		return 0;
>  	}
> +	case KVM_ARM_MTE_COPY_TAGS: {
> +		struct kvm_arm_copy_mte_tags copy_tags;
> +
> +		if (!kvm_has_mte(kvm))
> +			return -EINVAL;
> +
> +		if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
> +			return -EFAULT;
> +		return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
> +	}
>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 8c95ba0fadda..4c011c60d468 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1428,6 +1428,7 @@ struct kvm_s390_ucas_mapping {
>  /* Available with KVM_CAP_PMU_EVENT_FILTER */
>  #define KVM_SET_PMU_EVENT_FILTER  _IOW(KVMIO,  0xb2, struct kvm_pmu_event_filter)
>  #define KVM_PPC_SVM_OFF		  _IO(KVMIO,  0xb3)
> +#define KVM_ARM_MTE_COPY_TAGS	  _IOR(KVMIO,  0xb4, struct kvm_arm_copy_mte_tags)
>  
>  /* ioctl for vm fd */
>  #define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl
  2021-05-17 12:32 ` [PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl Steven Price
@ 2021-05-17 18:09   ` Marc Zyngier
  2021-05-19 14:09     ` Steven Price
  0 siblings, 1 reply; 49+ messages in thread
From: Marc Zyngier @ 2021-05-17 18:09 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Mon, 17 May 2021 13:32:39 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
> granting a guest access to the tags, and provides a mechanism for the
> VMM to enable it.
> 
> A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
> access the tags of a guest without having to maintain a PROT_MTE mapping
> in userspace. The above capability gates access to the ioctl.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  Documentation/virt/kvm/api.rst | 53 ++++++++++++++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 22d077562149..a31661b870ba 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5034,6 +5034,40 @@ see KVM_XEN_VCPU_SET_ATTR above.
>  The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
>  with the KVM_XEN_VCPU_GET_ATTR ioctl.
>  
> +4.130 KVM_ARM_MTE_COPY_TAGS
> +---------------------------
> +
> +:Capability: KVM_CAP_ARM_MTE
> +:Architectures: arm64
> +:Type: vm ioctl
> +:Parameters: struct kvm_arm_copy_mte_tags
> +:Returns: 0 on success, < 0 on error
> +
> +::
> +
> +  struct kvm_arm_copy_mte_tags {
> +	__u64 guest_ipa;
> +	__u64 length;
> +	union {
> +		void __user *addr;
> +		__u64 padding;
> +	};
> +	__u64 flags;
> +	__u64 reserved[2];
> +  };

This doesn't exactly match the structure in the previous patch :-(.

> +
> +Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
> +``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
> +fieldmust point to a buffer which the tags will be copied to or from.
> +
> +``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
> +``KVM_ARM_TAGS_FROM_GUEST``.
> +
> +The size of the buffer to store the tags is ``(length / MTE_GRANULE_SIZE)``

Should we add a UAPI definition for MTE_GRANULE_SIZE?

> +bytes (i.e. 1/16th of the corresponding size). Each byte contains a single tag
> +value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
> +``PTRACE_POKEMTETAGS``.
> +
>  5. The kvm_run structure
>  ========================
>  
> @@ -6362,6 +6396,25 @@ default.
>  
>  See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
>  
> +7.26 KVM_CAP_ARM_MTE
> +--------------------
> +
> +:Architectures: arm64
> +:Parameters: none
> +
> +This capability indicates that KVM (and the hardware) supports exposing the
> +Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
> +VMM before the guest will be granted access.
> +
> +When enabled the guest is able to access tags associated with any memory given
> +to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` so
> +that the tags are maintained during swap or hibernation of the host; however
> +the VMM needs to manually save/restore the tags as appropriate if the VM is
> +migrated.
> +
> +When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
> +perform a bulk copy of tags to/from the guest.
> +

Missing limitation to AArch64 guests.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged
  2021-05-17 16:14   ` Marc Zyngier
@ 2021-05-19  9:32     ` Steven Price
  2021-05-19 17:48       ` Catalin Marinas
  0 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-19  9:32 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 17/05/2021 17:14, Marc Zyngier wrote:
> On Mon, 17 May 2021 13:32:34 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> A KVM guest could store tags in a page even if the VMM hasn't mapped
>> the page with PROT_MTE. So when restoring pages from swap we will
>> need to check to see if there are any saved tags even if !pte_tagged().
>>
>> However don't check pages for which pte_access_permitted() returns false
>> as these will not have been swapped out.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/include/asm/pgtable.h |  9 +++++++--
>>  arch/arm64/kernel/mte.c          | 16 ++++++++++++++--
>>  2 files changed, 21 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 0b10204e72fc..275178a810c1 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -314,8 +314,13 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>>  	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
>>  		__sync_icache_dcache(pte);
>>  
>> -	if (system_supports_mte() &&
>> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
>> +	/*
>> +	 * If the PTE would provide user space access to the tags associated
>> +	 * with it then ensure that the MTE tags are synchronised.  Exec-only
>> +	 * mappings don't expose tags (instruction fetches don't check tags).
> 
> I'm not sure I understand this comment. Of course, execution doesn't
> match tags. But the memory could still have tags associated with
> it. Does this mean such a page would lose its tags is swapped out?

Hmm, I probably should have reread that - the context of the comment is
lost.

I added the comment when changing to pte_access_permitted(), and the
comment on pte_access_permitted() explains a potential gotcha:

 * p??_access_permitted() is true for valid user mappings (PTE_USER
 * bit set, subject to the write permission check). For execute-only
 * mappings, like PROT_EXEC with EPAN (both PTE_USER and PTE_UXN bits
 * not set) must return false. PROT_NONE mappings do not have the
 * PTE_VALID bit set.

So execute-only mappings return false even though that is effectively a
type of user access. However, because MTE checks are not performed by
the PE for instruction fetches this doesn't matter. I'll update the
comment, how about:

/*
 * If the PTE would provide user space access to the tags associated
 * with it then ensure that the MTE tags are synchronised.  Although
 * pte_access_permitted() returns false for exec only mappings, they
 * don't expose tags (instruction fetches don't check tags).
 */

Thanks,

Steve

> Thanks,
> 
> 	M.
> 
>> +	 */
>> +	if (system_supports_mte() && pte_present(pte) &&
>> +	    pte_access_permitted(pte, false) && !pte_special(pte))
>>  		mte_sync_tags(ptep, pte);
>>  
>>  	__check_racy_pte_update(mm, ptep, pte);
>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> index c88e778c2fa9..a604818c52c1 100644
>> --- a/arch/arm64/kernel/mte.c
>> +++ b/arch/arm64/kernel/mte.c
>> @@ -33,11 +33,15 @@ DEFINE_STATIC_KEY_FALSE(mte_async_mode);
>>  EXPORT_SYMBOL_GPL(mte_async_mode);
>>  #endif
>>  
>> -static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
>> +static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
>> +			       bool pte_is_tagged)
>>  {
>>  	unsigned long flags;
>>  	pte_t old_pte = READ_ONCE(*ptep);
>>  
>> +	if (!is_swap_pte(old_pte) && !pte_is_tagged)
>> +		return;
>> +
>>  	spin_lock_irqsave(&tag_sync_lock, flags);
>>  
>>  	/* Recheck with the lock held */
>> @@ -53,6 +57,9 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
>>  		}
>>  	}
>>  
>> +	if (!pte_is_tagged)
>> +		goto out;
>> +
>>  	page_kasan_tag_reset(page);
>>  	/*
>>  	 * We need smp_wmb() in between setting the flags and clearing the
>> @@ -76,10 +83,15 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
>>  	bool check_swap = nr_pages == 1;
>>  	bool pte_is_tagged = pte_tagged(pte);
>>  
>> +	/* Early out if there's nothing to do */
>> +	if (!check_swap && !pte_is_tagged)
>> +		return;
>> +
>>  	/* if PG_mte_tagged is set, tags have already been initialised */
>>  	for (i = 0; i < nr_pages; i++, page++) {
>>  		if (!test_bit(PG_mte_tagged, &page->flags))
>> -			mte_sync_page_tags(page, ptep, check_swap);
>> +			mte_sync_page_tags(page, ptep, check_swap,
>> +					   pte_is_tagged);
>>  	}
>>  }
>>  
>> -- 
>> 2.20.1
>>
>>
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature
  2021-05-17 16:45   ` Marc Zyngier
@ 2021-05-19 10:48     ` Steven Price
  2021-05-20  8:51       ` Marc Zyngier
  0 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-19 10:48 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 17/05/2021 17:45, Marc Zyngier wrote:
> On Mon, 17 May 2021 13:32:35 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
>> for a VM. This will expose the feature to the guest and automatically
>> tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
>> storage) to ensure that the guest cannot see stale tags, and so that
>> the tags are correctly saved/restored across swap.
>>
>> Actually exposing the new capability to user space happens in a later
>> patch.
> 
> uber nit in $SUBJECT: "KVM: arm64:" is the preferred prefix (just like
> patches 7 and 8).

Good spot - I obviously got carried away with the "arm64:" prefix ;)

>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/include/asm/kvm_emulate.h |  3 +++
>>  arch/arm64/include/asm/kvm_host.h    |  3 +++
>>  arch/arm64/kvm/hyp/exception.c       |  3 ++-
>>  arch/arm64/kvm/mmu.c                 | 37 +++++++++++++++++++++++++++-
>>  arch/arm64/kvm/sys_regs.c            |  3 +++
>>  include/uapi/linux/kvm.h             |  1 +
>>  6 files changed, 48 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index f612c090f2e4..6bf776c2399c 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
>>  	if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
>>  	    vcpu_el1_is_32bit(vcpu))
>>  		vcpu->arch.hcr_el2 |= HCR_TID2;
>> +
>> +	if (kvm_has_mte(vcpu->kvm))
>> +		vcpu->arch.hcr_el2 |= HCR_ATA;
>>  }
>>  
>>  static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 7cd7d5c8c4bc..afaa5333f0e4 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -132,6 +132,8 @@ struct kvm_arch {
>>  
>>  	u8 pfr0_csv2;
>>  	u8 pfr0_csv3;
>> +	/* Memory Tagging Extension enabled for the guest */
>> +	bool mte_enabled;
>>  };
>>  
>>  struct kvm_vcpu_fault_info {
>> @@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
>>  #define kvm_arm_vcpu_sve_finalized(vcpu) \
>>  	((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
>>  
>> +#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
>>  #define kvm_vcpu_has_pmu(vcpu)					\
>>  	(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
>>  
>> diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
>> index 73629094f903..56426565600c 100644
>> --- a/arch/arm64/kvm/hyp/exception.c
>> +++ b/arch/arm64/kvm/hyp/exception.c
>> @@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
>>  	new |= (old & PSR_C_BIT);
>>  	new |= (old & PSR_V_BIT);
>>  
>> -	// TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
>> +	if (kvm_has_mte(vcpu->kvm))
>> +		new |= PSR_TCO_BIT;
>>  
>>  	new |= (old & PSR_DIT_BIT);
>>  
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c5d1f3c87dbd..8660f6a03f51 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -822,6 +822,31 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>>  	return PAGE_SIZE;
>>  }
>>  
>> +static int sanitise_mte_tags(struct kvm *kvm, unsigned long size,
>> +			     kvm_pfn_t pfn)
> 
> Nit: please order the parameters as address, then size.

Sure

>> +{
>> +	if (kvm_has_mte(kvm)) {
>> +		/*
>> +		 * The page will be mapped in stage 2 as Normal Cacheable, so
>> +		 * the VM will be able to see the page's tags and therefore
>> +		 * they must be initialised first. If PG_mte_tagged is set,
>> +		 * tags have already been initialised.
>> +		 */
>> +		unsigned long i, nr_pages = size >> PAGE_SHIFT;
>> +		struct page *page = pfn_to_online_page(pfn);
>> +
>> +		if (!page)
>> +			return -EFAULT;
> 
> Under which circumstances can this happen? We already have done a GUP
> on the page, so I really can't see how the page can vanish from under
> our feet.

It's less about the page vanishing and more that pfn_to_online_page()
will reject some pages. Specifically in this case we want to reject any
sort of device memory (e.g. graphics card memory or other memory on the
end of a bus like PCIe) as it is unlikely to support MTE.

>> +
>> +		for (i = 0; i < nr_pages; i++, page++) {
>> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>> +				mte_clear_page_tags(page_address(page));
> 
> You seem to be doing this irrespective of the VMA being created with
> PROT_MTE. This is fine form a guest perspective (all its memory should
> be MTE capable). However, I can't see any guarantee that the VMM will
> actually allocate memslots with PROT_MTE.
> 
> Aren't we missing some sanity checks at memslot registration time?

I've taken the decision not to require that the VMM allocates with
PROT_MTE, there are two main reasons for this:

 1. The VMM generally doesn't actually want a PROT_MTE mapping as the
tags from the guest are almost certainly wrong for most usages (e.g.
device emulation). So a PROT_MTE mapping actively gets in the way of the
VMM using MTE for it's own purposes. However this then leads to the
requirement for the new ioctl in patch 7.

 2. Because the VMM can change the pages in a memslot at any time and
KVM relies on mmu notifiers to spot the change it's hard and ugly to
enforce that the memslot VMAs keep having the PROT_MTE flag. When I
tried this it meant we've discover that a page doesn't have the MTE flag
at fault time and have no other option that to kill the VM at that time.

So the model is that non-PROT_MTE memory can be supplied to the memslots
and KVM will automatically upgrade it to PG_mte_tagged if you supply it
to a VM with MTE enabled. This makes the VMM implementation easier for
most cases, and the new ioctl helps for migration. I think the kernel
code is tidier too.

Of course even better would be a stage 2 flag to control MTE
availability on a page-by-page basis, but that doesn't exist in the
architecture at the moment.

>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>>  			  unsigned long fault_status)
>> @@ -971,8 +996,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  	if (writable)
>>  		prot |= KVM_PGTABLE_PROT_W;
>>  
>> -	if (fault_status != FSC_PERM && !device)
>> +	if (fault_status != FSC_PERM && !device) {
>> +		ret = sanitise_mte_tags(kvm, vma_pagesize, pfn);
>> +		if (ret)
>> +			goto out_unlock;
>> +
>>  		clean_dcache_guest_page(pfn, vma_pagesize);
>> +	}
>>  
>>  	if (exec_fault) {
>>  		prot |= KVM_PGTABLE_PROT_X;
>> @@ -1168,12 +1198,17 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>>  {
>>  	kvm_pfn_t pfn = pte_pfn(range->pte);
>> +	int ret;
>>  
>>  	if (!kvm->arch.mmu.pgt)
>>  		return 0;
>>  
>>  	WARN_ON(range->end - range->start != 1);
>>  
>> +	ret = sanitise_mte_tags(kvm, PAGE_SIZE, pfn);
>> +	if (ret)
>> +		return ret;
> 
> Notice the change in return type?

I do now - I was tricked by the use of '0' as false. Looks like false
('0') is actually the correct return here to avoid an unnecessary
kvm_flush_remote_tlbs().

>> +
>>  	/*
>>  	 * We've moved a page around, probably through CoW, so let's treat it
>>  	 * just like a translation fault and clean the cache to the PoC.
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index 76ea2800c33e..24a844cb79ca 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -1047,6 +1047,9 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
>>  		break;
>>  	case SYS_ID_AA64PFR1_EL1:
>>  		val &= ~FEATURE(ID_AA64PFR1_MTE);
>> +		if (kvm_has_mte(vcpu->kvm))
>> +			val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE),
>> +					  ID_AA64PFR1_MTE);
> 
> Shouldn't this be consistent with what the HW is capable of
> (i.e. FEAT_MTE3 if available), and extracted from the sanitised view
> of the feature set?

Yes - however at the moment our sanitised view is either FEAT_MTE2 or
nothing:

	{
		.desc = "Memory Tagging Extension",
		.capability = ARM64_MTE,
		.type = ARM64_CPUCAP_STRICT_BOOT_CPU_FEATURE,
		.matches = has_cpuid_feature,
		.sys_reg = SYS_ID_AA64PFR1_EL1,
		.field_pos = ID_AA64PFR1_MTE_SHIFT,
		.min_field_value = ID_AA64PFR1_MTE,
		.sign = FTR_UNSIGNED,
		.cpu_enable = cpu_enable_mte,
	},

When host support for FEAT_MTE3 is added then the KVM code will need
revisiting to expose that down to the guest safely (AFAICS there's
nothing extra to do here, but I haven't tested any of the MTE3
features). I don't think we want to expose newer versions to the guest
than the host is aware of. (Or indeed expose FEAT_MTE if the host has
MTE disabled because Linux requires at least FEAT_MTE2).

Thanks,

Steve

>>  		break;
>>  	case SYS_ID_AA64ISAR1_EL1:
>>  		if (!vcpu_has_ptrauth(vcpu))
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 3fd9a7e9d90c..8c95ba0fadda 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1082,6 +1082,7 @@ struct kvm_ppc_resize_hpt {
>>  #define KVM_CAP_SGX_ATTRIBUTE 196
>>  #define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
>>  #define KVM_CAP_PTP_KVM 198
>> +#define KVM_CAP_ARM_MTE 199
>>  
>>  #ifdef KVM_CAP_IRQ_ROUTING
>>  
> 
> Thanks,
> 
> 	M.
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 5/8] arm64: kvm: Save/restore MTE registers
  2021-05-17 17:17   ` Marc Zyngier
@ 2021-05-19 13:04     ` Steven Price
  2021-05-20  9:46       ` Marc Zyngier
  0 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-19 13:04 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 17/05/2021 18:17, Marc Zyngier wrote:
> On Mon, 17 May 2021 13:32:36 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> Define the new system registers that MTE introduces and context switch
>> them. The MTE feature is still hidden from the ID register as it isn't
>> supported in a VM yet.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/include/asm/kvm_host.h          |  6 ++
>>  arch/arm64/include/asm/kvm_mte.h           | 66 ++++++++++++++++++++++
>>  arch/arm64/include/asm/sysreg.h            |  3 +-
>>  arch/arm64/kernel/asm-offsets.c            |  3 +
>>  arch/arm64/kvm/hyp/entry.S                 |  7 +++
>>  arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++++++
>>  arch/arm64/kvm/sys_regs.c                  | 22 ++++++--
>>  7 files changed, 123 insertions(+), 5 deletions(-)
>>  create mode 100644 arch/arm64/include/asm/kvm_mte.h
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index afaa5333f0e4..309e36cc1b42 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -208,6 +208,12 @@ enum vcpu_sysreg {
>>  	CNTP_CVAL_EL0,
>>  	CNTP_CTL_EL0,
>>  
>> +	/* Memory Tagging Extension registers */
>> +	RGSR_EL1,	/* Random Allocation Tag Seed Register */
>> +	GCR_EL1,	/* Tag Control Register */
>> +	TFSR_EL1,	/* Tag Fault Status Register (EL1) */
>> +	TFSRE0_EL1,	/* Tag Fault Status Register (EL0) */
>> +
>>  	/* 32bit specific registers. Keep them at the end of the range */
>>  	DACR32_EL2,	/* Domain Access Control Register */
>>  	IFSR32_EL2,	/* Instruction Fault Status Register */
>> diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
>> new file mode 100644
>> index 000000000000..6541c7d6ce06
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/kvm_mte.h
>> @@ -0,0 +1,66 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright (C) 2020 ARM Ltd.
>> + */
>> +#ifndef __ASM_KVM_MTE_H
>> +#define __ASM_KVM_MTE_H
>> +
>> +#ifdef __ASSEMBLY__
>> +
>> +#include <asm/sysreg.h>
>> +
>> +#ifdef CONFIG_ARM64_MTE
>> +
>> +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
>> +alternative_if_not ARM64_MTE
>> +	b	.L__skip_switch\@
>> +alternative_else_nop_endif
>> +	mrs	\reg1, hcr_el2
>> +	and	\reg1, \reg1, #(HCR_ATA)
>> +	cbz	\reg1, .L__skip_switch\@
>> +
>> +	mrs_s	\reg1, SYS_RGSR_EL1
>> +	str	\reg1, [\h_ctxt, #CPU_RGSR_EL1]
>> +	mrs_s	\reg1, SYS_GCR_EL1
>> +	str	\reg1, [\h_ctxt, #CPU_GCR_EL1]
>> +
>> +	ldr	\reg1, [\g_ctxt, #CPU_RGSR_EL1]
>> +	msr_s	SYS_RGSR_EL1, \reg1
>> +	ldr	\reg1, [\g_ctxt, #CPU_GCR_EL1]
>> +	msr_s	SYS_GCR_EL1, \reg1
>> +
>> +.L__skip_switch\@:
>> +.endm
>> +
>> +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
>> +alternative_if_not ARM64_MTE
>> +	b	.L__skip_switch\@
>> +alternative_else_nop_endif
>> +	mrs	\reg1, hcr_el2
>> +	and	\reg1, \reg1, #(HCR_ATA)
>> +	cbz	\reg1, .L__skip_switch\@
>> +
>> +	mrs_s	\reg1, SYS_RGSR_EL1
>> +	str	\reg1, [\g_ctxt, #CPU_RGSR_EL1]
>> +	mrs_s	\reg1, SYS_GCR_EL1
>> +	str	\reg1, [\g_ctxt, #CPU_GCR_EL1]
>> +
>> +	ldr	\reg1, [\h_ctxt, #CPU_RGSR_EL1]
>> +	msr_s	SYS_RGSR_EL1, \reg1
>> +	ldr	\reg1, [\h_ctxt, #CPU_GCR_EL1]
>> +	msr_s	SYS_GCR_EL1, \reg1
> 
> What is the rational for not having any synchronisation here? It is
> quite uncommon to allocate memory at EL2, but VHE can perform all kind
> of tricks.

I don't follow. This is part of the __guest_exit path and there's an ISB
at the end of that - is that not sufficient? I don't see any possibility
for allocating memory before that. What am I missing?

>> +
>> +.L__skip_switch\@:
>> +.endm
>> +
>> +#else /* CONFIG_ARM64_MTE */
>> +
>> +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
>> +.endm
>> +
>> +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
>> +.endm
>> +
>> +#endif /* CONFIG_ARM64_MTE */
>> +#endif /* __ASSEMBLY__ */
>> +#endif /* __ASM_KVM_MTE_H */
>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>> index 65d15700a168..347ccac2341e 100644
>> --- a/arch/arm64/include/asm/sysreg.h
>> +++ b/arch/arm64/include/asm/sysreg.h
>> @@ -651,7 +651,8 @@
>>  
>>  #define INIT_SCTLR_EL2_MMU_ON						\
>>  	(SCTLR_ELx_M  | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I |	\
>> -	 SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | SCTLR_EL2_RES1)
>> +	 SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 |		\
>> +	 SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)
>>  
>>  #define INIT_SCTLR_EL2_MMU_OFF \
>>  	(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
>> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
>> index 0cb34ccb6e73..6b489a8462f0 100644
>> --- a/arch/arm64/kernel/asm-offsets.c
>> +++ b/arch/arm64/kernel/asm-offsets.c
>> @@ -111,6 +111,9 @@ int main(void)
>>    DEFINE(VCPU_WORKAROUND_FLAGS,	offsetof(struct kvm_vcpu, arch.workaround_flags));
>>    DEFINE(VCPU_HCR_EL2,		offsetof(struct kvm_vcpu, arch.hcr_el2));
>>    DEFINE(CPU_USER_PT_REGS,	offsetof(struct kvm_cpu_context, regs));
>> +  DEFINE(CPU_RGSR_EL1,		offsetof(struct kvm_cpu_context, sys_regs[RGSR_EL1]));
>> +  DEFINE(CPU_GCR_EL1,		offsetof(struct kvm_cpu_context, sys_regs[GCR_EL1]));
>> +  DEFINE(CPU_TFSRE0_EL1,	offsetof(struct kvm_cpu_context, sys_regs[TFSRE0_EL1]));
> 
> TFSRE0_EL1 is never accessed from assembly code. Leftover from a
> previous version?

Indeed, I will drop it.

>>    DEFINE(CPU_APIAKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APIAKEYLO_EL1]));
>>    DEFINE(CPU_APIBKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APIBKEYLO_EL1]));
>>    DEFINE(CPU_APDAKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APDAKEYLO_EL1]));
>> diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
>> index e831d3dfd50d..435346ea1504 100644
>> --- a/arch/arm64/kvm/hyp/entry.S
>> +++ b/arch/arm64/kvm/hyp/entry.S
>> @@ -13,6 +13,7 @@
>>  #include <asm/kvm_arm.h>
>>  #include <asm/kvm_asm.h>
>>  #include <asm/kvm_mmu.h>
>> +#include <asm/kvm_mte.h>
>>  #include <asm/kvm_ptrauth.h>
>>  
>>  	.text
>> @@ -51,6 +52,9 @@ alternative_else_nop_endif
>>  
>>  	add	x29, x0, #VCPU_CONTEXT
>>  
>> +	// mte_switch_to_guest(g_ctxt, h_ctxt, tmp1)
>> +	mte_switch_to_guest x29, x1, x2
>> +
>>  	// Macro ptrauth_switch_to_guest format:
>>  	// 	ptrauth_switch_to_guest(guest cxt, tmp1, tmp2, tmp3)
>>  	// The below macro to restore guest keys is not implemented in C code
>> @@ -142,6 +146,9 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL)
>>  	// when this feature is enabled for kernel code.
>>  	ptrauth_switch_to_hyp x1, x2, x3, x4, x5
>>  
>> +	// mte_switch_to_hyp(g_ctxt, h_ctxt, reg1)
>> +	mte_switch_to_hyp x1, x2, x3
>> +
>>  	// Restore hyp's sp_el0
>>  	restore_sp_el0 x2, x3
>>  
>> diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
>> index cce43bfe158f..de7e14c862e6 100644
>> --- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
>> +++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
>> @@ -14,6 +14,7 @@
>>  #include <asm/kvm_asm.h>
>>  #include <asm/kvm_emulate.h>
>>  #include <asm/kvm_hyp.h>
>> +#include <asm/kvm_mmu.h>
>>  
>>  static inline void __sysreg_save_common_state(struct kvm_cpu_context *ctxt)
>>  {
>> @@ -26,6 +27,16 @@ static inline void __sysreg_save_user_state(struct kvm_cpu_context *ctxt)
>>  	ctxt_sys_reg(ctxt, TPIDRRO_EL0)	= read_sysreg(tpidrro_el0);
>>  }
>>  
>> +static inline bool ctxt_has_mte(struct kvm_cpu_context *ctxt)
>> +{
>> +	struct kvm_vcpu *vcpu = ctxt->__hyp_running_vcpu;
>> +
>> +	if (!vcpu)
>> +		vcpu = container_of(ctxt, struct kvm_vcpu, arch.ctxt);
>> +
>> +	return kvm_has_mte(kern_hyp_va(vcpu->kvm));
>> +}
>> +
>>  static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
>>  {
>>  	ctxt_sys_reg(ctxt, CSSELR_EL1)	= read_sysreg(csselr_el1);
>> @@ -46,6 +57,11 @@ static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
>>  	ctxt_sys_reg(ctxt, PAR_EL1)	= read_sysreg_par();
>>  	ctxt_sys_reg(ctxt, TPIDR_EL1)	= read_sysreg(tpidr_el1);
>>  
>> +	if (ctxt_has_mte(ctxt)) {
>> +		ctxt_sys_reg(ctxt, TFSR_EL1) = read_sysreg_el1(SYS_TFSR);
>> +		ctxt_sys_reg(ctxt, TFSRE0_EL1) = read_sysreg_s(SYS_TFSRE0_EL1);
>> +	}
> 
> I remember suggesting that this is slightly heavier than necessary.
> 
> On nVHE, TFSRE0_EL1 could be moved to load/put, as we never run
> userspace with a vcpu loaded. The same holds of course for VHE, but we
> also can move TFSR_EL1 to load/put, as the host uses TFSR_EL2.
> 
> Do you see any issue with that?

The comment[1] I made before was:

  For TFSR_EL1 + VHE I believe it is synchronised only on vcpu_load/put -
  __sysreg_save_el1_state() is called from kvm_vcpu_load_sysregs_vhe().

  TFSRE0_EL1 potentially could be improved. I have to admit I was unsure
  if it should be in __sysreg_save_user_state() instead. However AFAICT
  that is called at the same time as __sysreg_save_el1_state() and there's
  no optimisation for nVHE. And given it's an _EL1 register this seemed
  like the logic place.

  Am I missing something here? Potentially there are other registers to be
  optimised (TPIDRRO_EL0 looks like a possiblity), but IMHO that doesn't
  belong in this series.

For VHE TFSR_EL1 is already only saved/restored on load/put
(__sysreg_save_el1_state() is called from kvm_vcpu_put_sysregs_vhe()).

TFSRE0_EL1 could be moved, but I'm not sure where it should live as I
mentioned above.

[1] https://lore.kernel.org/kvmarm/b16b65b5-d27f-7f86-fe0c-38a951e7d3ae@arm.com/

Thanks,

Steve

>> +
>>  	ctxt_sys_reg(ctxt, SP_EL1)	= read_sysreg(sp_el1);
>>  	ctxt_sys_reg(ctxt, ELR_EL1)	= read_sysreg_el1(SYS_ELR);
>>  	ctxt_sys_reg(ctxt, SPSR_EL1)	= read_sysreg_el1(SYS_SPSR);
>> @@ -107,6 +123,11 @@ static inline void __sysreg_restore_el1_state(struct kvm_cpu_context *ctxt)
>>  	write_sysreg(ctxt_sys_reg(ctxt, PAR_EL1),	par_el1);
>>  	write_sysreg(ctxt_sys_reg(ctxt, TPIDR_EL1),	tpidr_el1);
>>  
>> +	if (ctxt_has_mte(ctxt)) {
>> +		write_sysreg_el1(ctxt_sys_reg(ctxt, TFSR_EL1), SYS_TFSR);
>> +		write_sysreg_s(ctxt_sys_reg(ctxt, TFSRE0_EL1), SYS_TFSRE0_EL1);
>> +	}
>> +
>>  	if (!has_vhe() &&
>>  	    cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT) &&
>>  	    ctxt->__hyp_running_vcpu) {
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index 24a844cb79ca..88adbc2286f2 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -1305,6 +1305,20 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
>>  	return true;
>>  }
>>  
>> +static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
>> +				   const struct sys_reg_desc *rd)
>> +{
>> +	return REG_HIDDEN;
>> +}
>> +
>> +#define MTE_REG(name) {				\
>> +	SYS_DESC(SYS_##name),			\
>> +	.access = undef_access,			\
>> +	.reset = reset_unknown,			\
>> +	.reg = name,				\
>> +	.visibility = mte_visibility,		\
>> +}
>> +
>>  /* sys_reg_desc initialiser for known cpufeature ID registers */
>>  #define ID_SANITISED(name) {			\
>>  	SYS_DESC(SYS_##name),			\
>> @@ -1473,8 +1487,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>>  	{ SYS_DESC(SYS_ACTLR_EL1), access_actlr, reset_actlr, ACTLR_EL1 },
>>  	{ SYS_DESC(SYS_CPACR_EL1), NULL, reset_val, CPACR_EL1, 0 },
>>  
>> -	{ SYS_DESC(SYS_RGSR_EL1), undef_access },
>> -	{ SYS_DESC(SYS_GCR_EL1), undef_access },
>> +	MTE_REG(RGSR_EL1),
>> +	MTE_REG(GCR_EL1),
>>  
>>  	{ SYS_DESC(SYS_ZCR_EL1), NULL, reset_val, ZCR_EL1, 0, .visibility = sve_visibility },
>>  	{ SYS_DESC(SYS_TRFCR_EL1), undef_access },
>> @@ -1501,8 +1515,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>>  	{ SYS_DESC(SYS_ERXMISC0_EL1), trap_raz_wi },
>>  	{ SYS_DESC(SYS_ERXMISC1_EL1), trap_raz_wi },
>>  
>> -	{ SYS_DESC(SYS_TFSR_EL1), undef_access },
>> -	{ SYS_DESC(SYS_TFSRE0_EL1), undef_access },
>> +	MTE_REG(TFSR_EL1),
>> +	MTE_REG(TFSRE0_EL1),
>>  
>>  	{ SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 },
>>  	{ SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 },
> 
> Thanks,
> 
> 	M.
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 6/8] arm64: kvm: Expose KVM_ARM_CAP_MTE
  2021-05-17 17:40   ` Marc Zyngier
@ 2021-05-19 13:26     ` Steven Price
  2021-05-20 10:09       ` Marc Zyngier
  0 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-19 13:26 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 17/05/2021 18:40, Marc Zyngier wrote:
> On Mon, 17 May 2021 13:32:37 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> It's now safe for the VMM to enable MTE in a guest, so expose the
>> capability to user space.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/kvm/arm.c      | 9 +++++++++
>>  arch/arm64/kvm/sys_regs.c | 3 +++
>>  2 files changed, 12 insertions(+)
>>
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 1cb39c0803a4..e89a5e275e25 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>  		r = 0;
>>  		kvm->arch.return_nisv_io_abort_to_user = true;
>>  		break;
>> +	case KVM_CAP_ARM_MTE:
>> +		if (!system_supports_mte() || kvm->created_vcpus)
>> +			return -EINVAL;
>> +		r = 0;
>> +		kvm->arch.mte_enabled = true;
> 
> As far as I can tell from the architecture, this isn't valid for a
> 32bit guest.

Indeed, however the MTE flag is a property of the VM not of the vCPU.
And, unless I'm mistaken, it's technically possible to create a VM where
some CPUs are 32 bit and some 64 bit. Not that I can see much use of a
configuration like that.

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-05-17 18:04   ` Marc Zyngier
@ 2021-05-19 13:51     ` Steven Price
  0 siblings, 0 replies; 49+ messages in thread
From: Steven Price @ 2021-05-19 13:51 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 17/05/2021 19:04, Marc Zyngier wrote:
> On Mon, 17 May 2021 13:32:38 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> The VMM may not wish to have it's own mapping of guest memory mapped
>> with PROT_MTE because this causes problems if the VMM has tag checking
>> enabled (the guest controls the tags in physical RAM and it's unlikely
>> the tags are correct for the VMM).
>>
>> Instead add a new ioctl which allows the VMM to easily read/write the
>> tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
>> while the VMM can still read/write the tags for the purpose of
>> migration.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/include/uapi/asm/kvm.h | 11 +++++
>>  arch/arm64/kvm/arm.c              | 69 +++++++++++++++++++++++++++++++
>>  include/uapi/linux/kvm.h          |  1 +
>>  3 files changed, 81 insertions(+)
>>
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 24223adae150..b3edde68bc3e 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
>>  	__u32 reserved[12];
>>  };
>>  
>> +struct kvm_arm_copy_mte_tags {
>> +	__u64 guest_ipa;
>> +	__u64 length;
>> +	void __user *addr;
>> +	__u64 flags;
>> +	__u64 reserved[2];
>> +};
>> +
>> +#define KVM_ARM_TAGS_TO_GUEST		0
>> +#define KVM_ARM_TAGS_FROM_GUEST		1
>> +
>>  /* If you need to interpret the index values, here is the key: */
>>  #define KVM_REG_ARM_COPROC_MASK		0x000000000FFF0000
>>  #define KVM_REG_ARM_COPROC_SHIFT	16
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index e89a5e275e25..4b6c83beb75d 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>>  	}
>>  }
>>  
>> +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>> +				      struct kvm_arm_copy_mte_tags *copy_tags)
>> +{
>> +	gpa_t guest_ipa = copy_tags->guest_ipa;
>> +	size_t length = copy_tags->length;
>> +	void __user *tags = copy_tags->addr;
>> +	gpa_t gfn;
>> +	bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
>> +	int ret = 0;
>> +
>> +	if (copy_tags->reserved[0] || copy_tags->reserved[1])
>> +		return -EINVAL;
>> +
>> +	if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
>> +		return -EINVAL;
>> +
>> +	if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
>> +		return -EINVAL;
>> +
>> +	gfn = gpa_to_gfn(guest_ipa);
>> +
>> +	mutex_lock(&kvm->slots_lock);
>> +
>> +	while (length > 0) {
>> +		kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
>> +		void *maddr;
>> +		unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
> 
> nit: this is a compile time constant, make it a #define. This will
> avoid the confusing overloading of "num_tags" as both an input and an
> output for the mte_copy_tags-* functions.

No problem, I agree my usage of num_tags wasn't very clear.

>> +
>> +		if (is_error_noslot_pfn(pfn)) {
>> +			ret = -EFAULT;
>> +			goto out;
>> +		}
>> +
>> +		maddr = page_address(pfn_to_page(pfn));
>> +
>> +		if (!write) {
>> +			num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
>> +			kvm_release_pfn_clean(pfn);
>> +		} else {
>> +			num_tags = mte_copy_tags_from_user(maddr, tags,
>> +							   num_tags);
>> +			kvm_release_pfn_dirty(pfn);
>> +		}
>> +
>> +		if (num_tags != PAGE_SIZE / MTE_GRANULE_SIZE) {
>> +			ret = -EFAULT;
>> +			goto out;
>> +		}
>> +
>> +		gfn++;
>> +		tags += num_tags;
>> +		length -= PAGE_SIZE;
>> +	}
>> +
>> +out:
>> +	mutex_unlock(&kvm->slots_lock);
>> +	return ret;
>> +}
>> +
> 
> nit again: I'd really prefer it if you moved this to guest.c, where we
> already have a bunch of the save/restore stuff.

Sure - I'll move it across.

Thanks,

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl
  2021-05-17 18:09   ` Marc Zyngier
@ 2021-05-19 14:09     ` Steven Price
  2021-05-20 10:24       ` Marc Zyngier
  0 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-19 14:09 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 17/05/2021 19:09, Marc Zyngier wrote:
> On Mon, 17 May 2021 13:32:39 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
>> granting a guest access to the tags, and provides a mechanism for the
>> VMM to enable it.
>>
>> A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
>> access the tags of a guest without having to maintain a PROT_MTE mapping
>> in userspace. The above capability gates access to the ioctl.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  Documentation/virt/kvm/api.rst | 53 ++++++++++++++++++++++++++++++++++
>>  1 file changed, 53 insertions(+)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 22d077562149..a31661b870ba 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -5034,6 +5034,40 @@ see KVM_XEN_VCPU_SET_ATTR above.
>>  The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
>>  with the KVM_XEN_VCPU_GET_ATTR ioctl.
>>  
>> +4.130 KVM_ARM_MTE_COPY_TAGS
>> +---------------------------
>> +
>> +:Capability: KVM_CAP_ARM_MTE
>> +:Architectures: arm64
>> +:Type: vm ioctl
>> +:Parameters: struct kvm_arm_copy_mte_tags
>> +:Returns: 0 on success, < 0 on error
>> +
>> +::
>> +
>> +  struct kvm_arm_copy_mte_tags {
>> +	__u64 guest_ipa;
>> +	__u64 length;
>> +	union {
>> +		void __user *addr;
>> +		__u64 padding;
>> +	};
>> +	__u64 flags;
>> +	__u64 reserved[2];
>> +  };
> 
> This doesn't exactly match the structure in the previous patch :-(.

:( I knew there was a reason I didn't include it in the documentation
for the first 9 versions... I'll fix this up, thanks for spotting it.

>> +
>> +Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
>> +``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
>> +fieldmust point to a buffer which the tags will be copied to or from.
>> +
>> +``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
>> +``KVM_ARM_TAGS_FROM_GUEST``.
>> +
>> +The size of the buffer to store the tags is ``(length / MTE_GRANULE_SIZE)``
> 
> Should we add a UAPI definition for MTE_GRANULE_SIZE?

I wasn't sure whether to export this or not. The ioctl is based around
the existing ptrace interface (PTRACE_{PEEK,POKE}MTETAGS) which doesn't
expose a UAPI definition. Admittedly the documentation there also just
says "16-byte granule" rather than MTE_GRANULE_SIZE.

So I'll just remove the reference to MTE_GRANULE_SIZE in the
documentation unless you feel that we should have a UAPI definition.

>> +bytes (i.e. 1/16th of the corresponding size). Each byte contains a single tag
>> +value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
>> +``PTRACE_POKEMTETAGS``.
>> +
>>  5. The kvm_run structure
>>  ========================
>>  
>> @@ -6362,6 +6396,25 @@ default.
>>  
>>  See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
>>  
>> +7.26 KVM_CAP_ARM_MTE
>> +--------------------
>> +
>> +:Architectures: arm64
>> +:Parameters: none
>> +
>> +This capability indicates that KVM (and the hardware) supports exposing the
>> +Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
>> +VMM before the guest will be granted access.
>> +
>> +When enabled the guest is able to access tags associated with any memory given
>> +to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` so
>> +that the tags are maintained during swap or hibernation of the host; however
>> +the VMM needs to manually save/restore the tags as appropriate if the VM is
>> +migrated.
>> +
>> +When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
>> +perform a bulk copy of tags to/from the guest.
>> +
> 
> Missing limitation to AArch64 guests.

As mentioned previously it's not technically limited to AArch64, but
I'll expand this to make it clear that MTE isn't usable from a AArch32 VCPU.

Thanks,

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 1/8] arm64: mte: Handle race when synchronising tags
  2021-05-17 12:32 ` [PATCH v12 1/8] arm64: mte: Handle race when synchronising tags Steven Price
  2021-05-17 14:03   ` Marc Zyngier
@ 2021-05-19 17:32   ` Catalin Marinas
  1 sibling, 0 replies; 49+ messages in thread
From: Catalin Marinas @ 2021-05-19 17:32 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Mon, May 17, 2021 at 01:32:32PM +0100, Steven Price wrote:
> mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
> before restoring/zeroing the MTE tags. However if another thread were to
> race and attempt to sync the tags on the same page before the first
> thread had completed restoring/zeroing then it would see the flag is
> already set and continue without waiting. This would potentially expose
> the previous contents of the tags to user space, and cause any updates
> that user space makes before the restoring/zeroing has completed to
> potentially be lost.
> 
> Since this code is run from atomic contexts we can't just lock the page
> during the process. Instead implement a new (global) spinlock to protect
> the mte_sync_page_tags() function.
> 
> Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in user-space with PROT_MTE")
> Signed-off-by: Steven Price <steven.price@arm.com>

Other than the missing spinlock initialisation, the patch looks fine to
me.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

(though I'll probably queue it as a fix, waiting a couple of days for
comments)

-- 
Catalin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged
  2021-05-19  9:32     ` Steven Price
@ 2021-05-19 17:48       ` Catalin Marinas
  0 siblings, 0 replies; 49+ messages in thread
From: Catalin Marinas @ 2021-05-19 17:48 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Wed, May 19, 2021 at 10:32:01AM +0100, Steven Price wrote:
> On 17/05/2021 17:14, Marc Zyngier wrote:
> > On Mon, 17 May 2021 13:32:34 +0100,
> > Steven Price <steven.price@arm.com> wrote:
> >>
> >> A KVM guest could store tags in a page even if the VMM hasn't mapped
> >> the page with PROT_MTE. So when restoring pages from swap we will
> >> need to check to see if there are any saved tags even if !pte_tagged().
> >>
> >> However don't check pages for which pte_access_permitted() returns false
> >> as these will not have been swapped out.
> >>
> >> Signed-off-by: Steven Price <steven.price@arm.com>
> >> ---
> >>  arch/arm64/include/asm/pgtable.h |  9 +++++++--
> >>  arch/arm64/kernel/mte.c          | 16 ++++++++++++++--
> >>  2 files changed, 21 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index 0b10204e72fc..275178a810c1 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -314,8 +314,13 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
> >>  	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
> >>  		__sync_icache_dcache(pte);
> >>  
> >> -	if (system_supports_mte() &&
> >> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
> >> +	/*
> >> +	 * If the PTE would provide user space access to the tags associated
> >> +	 * with it then ensure that the MTE tags are synchronised.  Exec-only
> >> +	 * mappings don't expose tags (instruction fetches don't check tags).
> > 
> > I'm not sure I understand this comment. Of course, execution doesn't
> > match tags. But the memory could still have tags associated with
> > it. Does this mean such a page would lose its tags is swapped out?
> 
> Hmm, I probably should have reread that - the context of the comment is
> lost.
> 
> I added the comment when changing to pte_access_permitted(), and the
> comment on pte_access_permitted() explains a potential gotcha:
> 
>  * p??_access_permitted() is true for valid user mappings (PTE_USER
>  * bit set, subject to the write permission check). For execute-only
>  * mappings, like PROT_EXEC with EPAN (both PTE_USER and PTE_UXN bits
>  * not set) must return false. PROT_NONE mappings do not have the
>  * PTE_VALID bit set.
> 
> So execute-only mappings return false even though that is effectively a
> type of user access. However, because MTE checks are not performed by
> the PE for instruction fetches this doesn't matter. I'll update the
> comment, how about:
> 
> /*
>  * If the PTE would provide user space access to the tags associated
>  * with it then ensure that the MTE tags are synchronised.  Although
>  * pte_access_permitted() returns false for exec only mappings, they
>  * don't expose tags (instruction fetches don't check tags).
>  */

This looks fine to me. We basically want to check the PTE_VALID and
PTE_USER bits and pte_access_permitted() does this (we could come up
with a new macro name like pte_valid_user() but since we don't care
about execute-only, it gets unnecessarily complicated).

-- 
Catalin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged
  2021-05-17 12:32 ` [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged Steven Price
  2021-05-17 16:14   ` Marc Zyngier
@ 2021-05-19 18:06   ` Catalin Marinas
  2021-05-20 11:55     ` Steven Price
  1 sibling, 1 reply; 49+ messages in thread
From: Catalin Marinas @ 2021-05-19 18:06 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Mon, May 17, 2021 at 01:32:34PM +0100, Steven Price wrote:
> A KVM guest could store tags in a page even if the VMM hasn't mapped
> the page with PROT_MTE. So when restoring pages from swap we will
> need to check to see if there are any saved tags even if !pte_tagged().
> 
> However don't check pages for which pte_access_permitted() returns false
> as these will not have been swapped out.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  arch/arm64/include/asm/pgtable.h |  9 +++++++--
>  arch/arm64/kernel/mte.c          | 16 ++++++++++++++--
>  2 files changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0b10204e72fc..275178a810c1 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -314,8 +314,13 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>  	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
>  		__sync_icache_dcache(pte);
>  
> -	if (system_supports_mte() &&
> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
> +	/*
> +	 * If the PTE would provide user space access to the tags associated
> +	 * with it then ensure that the MTE tags are synchronised.  Exec-only
> +	 * mappings don't expose tags (instruction fetches don't check tags).
> +	 */
> +	if (system_supports_mte() && pte_present(pte) &&
> +	    pte_access_permitted(pte, false) && !pte_special(pte))
>  		mte_sync_tags(ptep, pte);

Looking at the mte_sync_page_tags() logic, we bail out early if it's the
old pte is not a swap one and the new pte is not tagged. So we only need
to call mte_sync_tags() if it's a tagged new pte or the old one is swap.
What about changing the set_pte_at() test to:

	if (system_supports_mte() && pte_present(pte) && !pte_special(pte) &&
	    (pte_tagged(pte) || is_swap_pte(READ_ONCE(*ptep))))
		mte_sync_tags(ptep, pte);

We can even change mte_sync_tags() to take the old pte directly:

	if (system_supports_mte() && pte_present(pte) && !pte_special(pte)) {
		pte_t old_pte = READ_ONCE(*ptep);
		if (pte_tagged(pte) || is_swap_pte(old_pte))
			mte_sync_tags(old_pte, pte);
	}

It would save a function call in most cases where the page is not
tagged.

-- 
Catalin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature
  2021-05-19 10:48     ` Steven Price
@ 2021-05-20  8:51       ` Marc Zyngier
  2021-05-20 14:46         ` Steven Price
  0 siblings, 1 reply; 49+ messages in thread
From: Marc Zyngier @ 2021-05-20  8:51 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Wed, 19 May 2021 11:48:21 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> On 17/05/2021 17:45, Marc Zyngier wrote:
> > On Mon, 17 May 2021 13:32:35 +0100,
> > Steven Price <steven.price@arm.com> wrote:
> >>
> >> Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
> >> for a VM. This will expose the feature to the guest and automatically
> >> tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
> >> storage) to ensure that the guest cannot see stale tags, and so that
> >> the tags are correctly saved/restored across swap.
> >>
> >> Actually exposing the new capability to user space happens in a later
> >> patch.
> > 
> > uber nit in $SUBJECT: "KVM: arm64:" is the preferred prefix (just like
> > patches 7 and 8).
> 
> Good spot - I obviously got carried away with the "arm64:" prefix ;)
> 
> >>
> >> Signed-off-by: Steven Price <steven.price@arm.com>
> >> ---
> >>  arch/arm64/include/asm/kvm_emulate.h |  3 +++
> >>  arch/arm64/include/asm/kvm_host.h    |  3 +++
> >>  arch/arm64/kvm/hyp/exception.c       |  3 ++-
> >>  arch/arm64/kvm/mmu.c                 | 37 +++++++++++++++++++++++++++-
> >>  arch/arm64/kvm/sys_regs.c            |  3 +++
> >>  include/uapi/linux/kvm.h             |  1 +
> >>  6 files changed, 48 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> >> index f612c090f2e4..6bf776c2399c 100644
> >> --- a/arch/arm64/include/asm/kvm_emulate.h
> >> +++ b/arch/arm64/include/asm/kvm_emulate.h
> >> @@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
> >>  	if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
> >>  	    vcpu_el1_is_32bit(vcpu))
> >>  		vcpu->arch.hcr_el2 |= HCR_TID2;
> >> +
> >> +	if (kvm_has_mte(vcpu->kvm))
> >> +		vcpu->arch.hcr_el2 |= HCR_ATA;
> >>  }
> >>  
> >>  static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
> >> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> >> index 7cd7d5c8c4bc..afaa5333f0e4 100644
> >> --- a/arch/arm64/include/asm/kvm_host.h
> >> +++ b/arch/arm64/include/asm/kvm_host.h
> >> @@ -132,6 +132,8 @@ struct kvm_arch {
> >>  
> >>  	u8 pfr0_csv2;
> >>  	u8 pfr0_csv3;
> >> +	/* Memory Tagging Extension enabled for the guest */
> >> +	bool mte_enabled;
> >>  };
> >>  
> >>  struct kvm_vcpu_fault_info {
> >> @@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
> >>  #define kvm_arm_vcpu_sve_finalized(vcpu) \
> >>  	((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
> >>  
> >> +#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
> >>  #define kvm_vcpu_has_pmu(vcpu)					\
> >>  	(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
> >>  
> >> diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
> >> index 73629094f903..56426565600c 100644
> >> --- a/arch/arm64/kvm/hyp/exception.c
> >> +++ b/arch/arm64/kvm/hyp/exception.c
> >> @@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
> >>  	new |= (old & PSR_C_BIT);
> >>  	new |= (old & PSR_V_BIT);
> >>  
> >> -	// TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
> >> +	if (kvm_has_mte(vcpu->kvm))
> >> +		new |= PSR_TCO_BIT;
> >>  
> >>  	new |= (old & PSR_DIT_BIT);
> >>  
> >> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> >> index c5d1f3c87dbd..8660f6a03f51 100644
> >> --- a/arch/arm64/kvm/mmu.c
> >> +++ b/arch/arm64/kvm/mmu.c
> >> @@ -822,6 +822,31 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
> >>  	return PAGE_SIZE;
> >>  }
> >>  
> >> +static int sanitise_mte_tags(struct kvm *kvm, unsigned long size,
> >> +			     kvm_pfn_t pfn)
> > 
> > Nit: please order the parameters as address, then size.
> 
> Sure
> 
> >> +{
> >> +	if (kvm_has_mte(kvm)) {
> >> +		/*
> >> +		 * The page will be mapped in stage 2 as Normal Cacheable, so
> >> +		 * the VM will be able to see the page's tags and therefore
> >> +		 * they must be initialised first. If PG_mte_tagged is set,
> >> +		 * tags have already been initialised.
> >> +		 */
> >> +		unsigned long i, nr_pages = size >> PAGE_SHIFT;
> >> +		struct page *page = pfn_to_online_page(pfn);
> >> +
> >> +		if (!page)
> >> +			return -EFAULT;
> > 
> > Under which circumstances can this happen? We already have done a GUP
> > on the page, so I really can't see how the page can vanish from under
> > our feet.
> 
> It's less about the page vanishing and more that pfn_to_online_page()
> will reject some pages. Specifically in this case we want to reject any
> sort of device memory (e.g. graphics card memory or other memory on the
> end of a bus like PCIe) as it is unlikely to support MTE.

OK. We really never should see this error as we check for device
mappings right before calling this, but I guess it doesn't hurt.

> 
> >> +
> >> +		for (i = 0; i < nr_pages; i++, page++) {
> >> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> >> +				mte_clear_page_tags(page_address(page));
> > 
> > You seem to be doing this irrespective of the VMA being created with
> > PROT_MTE. This is fine form a guest perspective (all its memory should
> > be MTE capable). However, I can't see any guarantee that the VMM will
> > actually allocate memslots with PROT_MTE.
> > 
> > Aren't we missing some sanity checks at memslot registration time?
> 
> I've taken the decision not to require that the VMM allocates with
> PROT_MTE, there are two main reasons for this:
> 
>  1. The VMM generally doesn't actually want a PROT_MTE mapping as the
> tags from the guest are almost certainly wrong for most usages (e.g.
> device emulation). So a PROT_MTE mapping actively gets in the way of the
> VMM using MTE for it's own purposes. However this then leads to the
> requirement for the new ioctl in patch 7.
> 
>  2. Because the VMM can change the pages in a memslot at any time and
> KVM relies on mmu notifiers to spot the change it's hard and ugly to
> enforce that the memslot VMAs keep having the PROT_MTE flag. When I
> tried this it meant we've discover that a page doesn't have the MTE flag
> at fault time and have no other option that to kill the VM at that time.
> 
> So the model is that non-PROT_MTE memory can be supplied to the memslots
> and KVM will automatically upgrade it to PG_mte_tagged if you supply it
> to a VM with MTE enabled. This makes the VMM implementation easier for
> most cases, and the new ioctl helps for migration. I think the kernel
> code is tidier too.

OK, I see your point. I guess we rely on the implicit requirement that
all the available memory is MTE-capable, although I'm willing to bet
that someone will eventually break this requirement.

> Of course even better would be a stage 2 flag to control MTE
> availability on a page-by-page basis, but that doesn't exist in the
> architecture at the moment.

Nah, that'd be way too good. Let's not do that.

> 
> >> +		}
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>  			  struct kvm_memory_slot *memslot, unsigned long hva,
> >>  			  unsigned long fault_status)
> >> @@ -971,8 +996,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>  	if (writable)
> >>  		prot |= KVM_PGTABLE_PROT_W;
> >>  
> >> -	if (fault_status != FSC_PERM && !device)
> >> +	if (fault_status != FSC_PERM && !device) {
> >> +		ret = sanitise_mte_tags(kvm, vma_pagesize, pfn);
> >> +		if (ret)
> >> +			goto out_unlock;
> >> +
> >>  		clean_dcache_guest_page(pfn, vma_pagesize);
> >> +	}
> >>  
> >>  	if (exec_fault) {
> >>  		prot |= KVM_PGTABLE_PROT_X;
> >> @@ -1168,12 +1198,17 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> >>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >>  {
> >>  	kvm_pfn_t pfn = pte_pfn(range->pte);
> >> +	int ret;
> >>  
> >>  	if (!kvm->arch.mmu.pgt)
> >>  		return 0;
> >>  
> >>  	WARN_ON(range->end - range->start != 1);
> >>  
> >> +	ret = sanitise_mte_tags(kvm, PAGE_SIZE, pfn);
> >> +	if (ret)
> >> +		return ret;
> > 
> > Notice the change in return type?
> 
> I do now - I was tricked by the use of '0' as false. Looks like false
> ('0') is actually the correct return here to avoid an unnecessary
> kvm_flush_remote_tlbs().

Yup. BTW, the return values have been fixed to proper boolean types in
the latest set of fixes.

> 
> >> +
> >>  	/*
> >>  	 * We've moved a page around, probably through CoW, so let's treat it
> >>  	 * just like a translation fault and clean the cache to the PoC.
> >> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> >> index 76ea2800c33e..24a844cb79ca 100644
> >> --- a/arch/arm64/kvm/sys_regs.c
> >> +++ b/arch/arm64/kvm/sys_regs.c
> >> @@ -1047,6 +1047,9 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
> >>  		break;
> >>  	case SYS_ID_AA64PFR1_EL1:
> >>  		val &= ~FEATURE(ID_AA64PFR1_MTE);
> >> +		if (kvm_has_mte(vcpu->kvm))
> >> +			val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE),
> >> +					  ID_AA64PFR1_MTE);
> > 
> > Shouldn't this be consistent with what the HW is capable of
> > (i.e. FEAT_MTE3 if available), and extracted from the sanitised view
> > of the feature set?
> 
> Yes - however at the moment our sanitised view is either FEAT_MTE2 or
> nothing:
> 
> 	{
> 		.desc = "Memory Tagging Extension",
> 		.capability = ARM64_MTE,
> 		.type = ARM64_CPUCAP_STRICT_BOOT_CPU_FEATURE,
> 		.matches = has_cpuid_feature,
> 		.sys_reg = SYS_ID_AA64PFR1_EL1,
> 		.field_pos = ID_AA64PFR1_MTE_SHIFT,
> 		.min_field_value = ID_AA64PFR1_MTE,
> 		.sign = FTR_UNSIGNED,
> 		.cpu_enable = cpu_enable_mte,
> 	},
> 
> When host support for FEAT_MTE3 is added then the KVM code will need
> revisiting to expose that down to the guest safely (AFAICS there's
> nothing extra to do here, but I haven't tested any of the MTE3
> features). I don't think we want to expose newer versions to the guest
> than the host is aware of. (Or indeed expose FEAT_MTE if the host has
> MTE disabled because Linux requires at least FEAT_MTE2).

What I was suggesting is to have something like this:

     pfr = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1);
     mte = cpuid_feature_extract_unsigned_field(pfr, ID_AA64PFR1_MTE_SHIFT);
     val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE), mte);

which does the trick nicely, and doesn't expose more than the host
supports.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 5/8] arm64: kvm: Save/restore MTE registers
  2021-05-19 13:04     ` Steven Price
@ 2021-05-20  9:46       ` Marc Zyngier
  2021-05-20 15:21         ` Steven Price
  0 siblings, 1 reply; 49+ messages in thread
From: Marc Zyngier @ 2021-05-20  9:46 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Wed, 19 May 2021 14:04:20 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> On 17/05/2021 18:17, Marc Zyngier wrote:
> > On Mon, 17 May 2021 13:32:36 +0100,
> > Steven Price <steven.price@arm.com> wrote:
> >>
> >> Define the new system registers that MTE introduces and context switch
> >> them. The MTE feature is still hidden from the ID register as it isn't
> >> supported in a VM yet.
> >>
> >> Signed-off-by: Steven Price <steven.price@arm.com>
> >> ---
> >>  arch/arm64/include/asm/kvm_host.h          |  6 ++
> >>  arch/arm64/include/asm/kvm_mte.h           | 66 ++++++++++++++++++++++
> >>  arch/arm64/include/asm/sysreg.h            |  3 +-
> >>  arch/arm64/kernel/asm-offsets.c            |  3 +
> >>  arch/arm64/kvm/hyp/entry.S                 |  7 +++
> >>  arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++++++
> >>  arch/arm64/kvm/sys_regs.c                  | 22 ++++++--
> >>  7 files changed, 123 insertions(+), 5 deletions(-)
> >>  create mode 100644 arch/arm64/include/asm/kvm_mte.h
> >>
> >> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> >> index afaa5333f0e4..309e36cc1b42 100644
> >> --- a/arch/arm64/include/asm/kvm_host.h
> >> +++ b/arch/arm64/include/asm/kvm_host.h
> >> @@ -208,6 +208,12 @@ enum vcpu_sysreg {
> >>  	CNTP_CVAL_EL0,
> >>  	CNTP_CTL_EL0,
> >>  
> >> +	/* Memory Tagging Extension registers */
> >> +	RGSR_EL1,	/* Random Allocation Tag Seed Register */
> >> +	GCR_EL1,	/* Tag Control Register */
> >> +	TFSR_EL1,	/* Tag Fault Status Register (EL1) */
> >> +	TFSRE0_EL1,	/* Tag Fault Status Register (EL0) */
> >> +
> >>  	/* 32bit specific registers. Keep them at the end of the range */
> >>  	DACR32_EL2,	/* Domain Access Control Register */
> >>  	IFSR32_EL2,	/* Instruction Fault Status Register */
> >> diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
> >> new file mode 100644
> >> index 000000000000..6541c7d6ce06
> >> --- /dev/null
> >> +++ b/arch/arm64/include/asm/kvm_mte.h
> >> @@ -0,0 +1,66 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +/*
> >> + * Copyright (C) 2020 ARM Ltd.
> >> + */
> >> +#ifndef __ASM_KVM_MTE_H
> >> +#define __ASM_KVM_MTE_H
> >> +
> >> +#ifdef __ASSEMBLY__
> >> +
> >> +#include <asm/sysreg.h>
> >> +
> >> +#ifdef CONFIG_ARM64_MTE
> >> +
> >> +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
> >> +alternative_if_not ARM64_MTE
> >> +	b	.L__skip_switch\@
> >> +alternative_else_nop_endif
> >> +	mrs	\reg1, hcr_el2
> >> +	and	\reg1, \reg1, #(HCR_ATA)
> >> +	cbz	\reg1, .L__skip_switch\@
> >> +
> >> +	mrs_s	\reg1, SYS_RGSR_EL1
> >> +	str	\reg1, [\h_ctxt, #CPU_RGSR_EL1]
> >> +	mrs_s	\reg1, SYS_GCR_EL1
> >> +	str	\reg1, [\h_ctxt, #CPU_GCR_EL1]
> >> +
> >> +	ldr	\reg1, [\g_ctxt, #CPU_RGSR_EL1]
> >> +	msr_s	SYS_RGSR_EL1, \reg1
> >> +	ldr	\reg1, [\g_ctxt, #CPU_GCR_EL1]
> >> +	msr_s	SYS_GCR_EL1, \reg1
> >> +
> >> +.L__skip_switch\@:
> >> +.endm
> >> +
> >> +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
> >> +alternative_if_not ARM64_MTE
> >> +	b	.L__skip_switch\@
> >> +alternative_else_nop_endif
> >> +	mrs	\reg1, hcr_el2
> >> +	and	\reg1, \reg1, #(HCR_ATA)
> >> +	cbz	\reg1, .L__skip_switch\@
> >> +
> >> +	mrs_s	\reg1, SYS_RGSR_EL1
> >> +	str	\reg1, [\g_ctxt, #CPU_RGSR_EL1]
> >> +	mrs_s	\reg1, SYS_GCR_EL1
> >> +	str	\reg1, [\g_ctxt, #CPU_GCR_EL1]
> >> +
> >> +	ldr	\reg1, [\h_ctxt, #CPU_RGSR_EL1]
> >> +	msr_s	SYS_RGSR_EL1, \reg1
> >> +	ldr	\reg1, [\h_ctxt, #CPU_GCR_EL1]
> >> +	msr_s	SYS_GCR_EL1, \reg1
> > 
> > What is the rational for not having any synchronisation here? It is
> > quite uncommon to allocate memory at EL2, but VHE can perform all kind
> > of tricks.
> 
> I don't follow. This is part of the __guest_exit path and there's an ISB
> at the end of that - is that not sufficient? I don't see any possibility
> for allocating memory before that. What am I missing?

Which ISB?  We have a few in the SError handling code, but that's
conditioned on not having RAS. With any RAS-enabled CPU, we return to
C code early, since we don't need any extra synchronisation (see the
comment about the absence of ISB on this path).

I would really like to ensure that we return to C code in the exact
state we left it.

> 
> >> +
> >> +.L__skip_switch\@:
> >> +.endm
> >> +
> >> +#else /* CONFIG_ARM64_MTE */
> >> +
> >> +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
> >> +.endm
> >> +
> >> +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
> >> +.endm
> >> +
> >> +#endif /* CONFIG_ARM64_MTE */
> >> +#endif /* __ASSEMBLY__ */
> >> +#endif /* __ASM_KVM_MTE_H */
> >> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> >> index 65d15700a168..347ccac2341e 100644
> >> --- a/arch/arm64/include/asm/sysreg.h
> >> +++ b/arch/arm64/include/asm/sysreg.h
> >> @@ -651,7 +651,8 @@
> >>  
> >>  #define INIT_SCTLR_EL2_MMU_ON						\
> >>  	(SCTLR_ELx_M  | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I |	\
> >> -	 SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | SCTLR_EL2_RES1)
> >> +	 SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 |		\
> >> +	 SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)
> >>  
> >>  #define INIT_SCTLR_EL2_MMU_OFF \
> >>  	(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
> >> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> >> index 0cb34ccb6e73..6b489a8462f0 100644
> >> --- a/arch/arm64/kernel/asm-offsets.c
> >> +++ b/arch/arm64/kernel/asm-offsets.c
> >> @@ -111,6 +111,9 @@ int main(void)
> >>    DEFINE(VCPU_WORKAROUND_FLAGS,	offsetof(struct kvm_vcpu, arch.workaround_flags));
> >>    DEFINE(VCPU_HCR_EL2,		offsetof(struct kvm_vcpu, arch.hcr_el2));
> >>    DEFINE(CPU_USER_PT_REGS,	offsetof(struct kvm_cpu_context, regs));
> >> +  DEFINE(CPU_RGSR_EL1,		offsetof(struct kvm_cpu_context, sys_regs[RGSR_EL1]));
> >> +  DEFINE(CPU_GCR_EL1,		offsetof(struct kvm_cpu_context, sys_regs[GCR_EL1]));
> >> +  DEFINE(CPU_TFSRE0_EL1,	offsetof(struct kvm_cpu_context, sys_regs[TFSRE0_EL1]));
> > 
> > TFSRE0_EL1 is never accessed from assembly code. Leftover from a
> > previous version?
> 
> Indeed, I will drop it.
> 
> >>    DEFINE(CPU_APIAKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APIAKEYLO_EL1]));
> >>    DEFINE(CPU_APIBKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APIBKEYLO_EL1]));
> >>    DEFINE(CPU_APDAKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APDAKEYLO_EL1]));
> >> diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
> >> index e831d3dfd50d..435346ea1504 100644
> >> --- a/arch/arm64/kvm/hyp/entry.S
> >> +++ b/arch/arm64/kvm/hyp/entry.S
> >> @@ -13,6 +13,7 @@
> >>  #include <asm/kvm_arm.h>
> >>  #include <asm/kvm_asm.h>
> >>  #include <asm/kvm_mmu.h>
> >> +#include <asm/kvm_mte.h>
> >>  #include <asm/kvm_ptrauth.h>
> >>  
> >>  	.text
> >> @@ -51,6 +52,9 @@ alternative_else_nop_endif
> >>  
> >>  	add	x29, x0, #VCPU_CONTEXT
> >>  
> >> +	// mte_switch_to_guest(g_ctxt, h_ctxt, tmp1)
> >> +	mte_switch_to_guest x29, x1, x2
> >> +
> >>  	// Macro ptrauth_switch_to_guest format:
> >>  	// 	ptrauth_switch_to_guest(guest cxt, tmp1, tmp2, tmp3)
> >>  	// The below macro to restore guest keys is not implemented in C code
> >> @@ -142,6 +146,9 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL)
> >>  	// when this feature is enabled for kernel code.
> >>  	ptrauth_switch_to_hyp x1, x2, x3, x4, x5
> >>  
> >> +	// mte_switch_to_hyp(g_ctxt, h_ctxt, reg1)
> >> +	mte_switch_to_hyp x1, x2, x3
> >> +
> >>  	// Restore hyp's sp_el0
> >>  	restore_sp_el0 x2, x3
> >>  
> >> diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
> >> index cce43bfe158f..de7e14c862e6 100644
> >> --- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
> >> +++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
> >> @@ -14,6 +14,7 @@
> >>  #include <asm/kvm_asm.h>
> >>  #include <asm/kvm_emulate.h>
> >>  #include <asm/kvm_hyp.h>
> >> +#include <asm/kvm_mmu.h>
> >>  
> >>  static inline void __sysreg_save_common_state(struct kvm_cpu_context *ctxt)
> >>  {
> >> @@ -26,6 +27,16 @@ static inline void __sysreg_save_user_state(struct kvm_cpu_context *ctxt)
> >>  	ctxt_sys_reg(ctxt, TPIDRRO_EL0)	= read_sysreg(tpidrro_el0);
> >>  }
> >>  
> >> +static inline bool ctxt_has_mte(struct kvm_cpu_context *ctxt)
> >> +{
> >> +	struct kvm_vcpu *vcpu = ctxt->__hyp_running_vcpu;
> >> +
> >> +	if (!vcpu)
> >> +		vcpu = container_of(ctxt, struct kvm_vcpu, arch.ctxt);
> >> +
> >> +	return kvm_has_mte(kern_hyp_va(vcpu->kvm));
> >> +}
> >> +
> >>  static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
> >>  {
> >>  	ctxt_sys_reg(ctxt, CSSELR_EL1)	= read_sysreg(csselr_el1);
> >> @@ -46,6 +57,11 @@ static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
> >>  	ctxt_sys_reg(ctxt, PAR_EL1)	= read_sysreg_par();
> >>  	ctxt_sys_reg(ctxt, TPIDR_EL1)	= read_sysreg(tpidr_el1);
> >>  
> >> +	if (ctxt_has_mte(ctxt)) {
> >> +		ctxt_sys_reg(ctxt, TFSR_EL1) = read_sysreg_el1(SYS_TFSR);
> >> +		ctxt_sys_reg(ctxt, TFSRE0_EL1) = read_sysreg_s(SYS_TFSRE0_EL1);
> >> +	}
> > 
> > I remember suggesting that this is slightly heavier than necessary.
> > 
> > On nVHE, TFSRE0_EL1 could be moved to load/put, as we never run
> > userspace with a vcpu loaded. The same holds of course for VHE, but we
> > also can move TFSR_EL1 to load/put, as the host uses TFSR_EL2.
> > 
> > Do you see any issue with that?
> 
> The comment[1] I made before was:

Ah, I totally missed this email (or can't remember reading it, which
amounts to the same thing). Apologies for that.

>   For TFSR_EL1 + VHE I believe it is synchronised only on vcpu_load/put -
>   __sysreg_save_el1_state() is called from kvm_vcpu_load_sysregs_vhe().
> 
>   TFSRE0_EL1 potentially could be improved. I have to admit I was unsure
>   if it should be in __sysreg_save_user_state() instead. However AFAICT
>   that is called at the same time as __sysreg_save_el1_state() and there's
>   no optimisation for nVHE. And given it's an _EL1 register this seemed
>   like the logic place.
>
>   Am I missing something here? Potentially there are other registers to be
>   optimised (TPIDRRO_EL0 looks like a possiblity), but IMHO that doesn't
>   belong in this series.
> 
> For VHE TFSR_EL1 is already only saved/restored on load/put
> (__sysreg_save_el1_state() is called from kvm_vcpu_put_sysregs_vhe()).
> 
> TFSRE0_EL1 could be moved, but I'm not sure where it should live as I
> mentioned above.

Yeah, this looks fine, please ignore my rambling.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 6/8] arm64: kvm: Expose KVM_ARM_CAP_MTE
  2021-05-19 13:26     ` Steven Price
@ 2021-05-20 10:09       ` Marc Zyngier
  2021-05-20 10:51         ` Steven Price
  0 siblings, 1 reply; 49+ messages in thread
From: Marc Zyngier @ 2021-05-20 10:09 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Wed, 19 May 2021 14:26:31 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> On 17/05/2021 18:40, Marc Zyngier wrote:
> > On Mon, 17 May 2021 13:32:37 +0100,
> > Steven Price <steven.price@arm.com> wrote:
> >>
> >> It's now safe for the VMM to enable MTE in a guest, so expose the
> >> capability to user space.
> >>
> >> Signed-off-by: Steven Price <steven.price@arm.com>
> >> ---
> >>  arch/arm64/kvm/arm.c      | 9 +++++++++
> >>  arch/arm64/kvm/sys_regs.c | 3 +++
> >>  2 files changed, 12 insertions(+)
> >>
> >> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> >> index 1cb39c0803a4..e89a5e275e25 100644
> >> --- a/arch/arm64/kvm/arm.c
> >> +++ b/arch/arm64/kvm/arm.c
> >> @@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> >>  		r = 0;
> >>  		kvm->arch.return_nisv_io_abort_to_user = true;
> >>  		break;
> >> +	case KVM_CAP_ARM_MTE:
> >> +		if (!system_supports_mte() || kvm->created_vcpus)
> >> +			return -EINVAL;
> >> +		r = 0;
> >> +		kvm->arch.mte_enabled = true;
> > 
> > As far as I can tell from the architecture, this isn't valid for a
> > 32bit guest.
> 
> Indeed, however the MTE flag is a property of the VM not of the vCPU.
> And, unless I'm mistaken, it's technically possible to create a VM where
> some CPUs are 32 bit and some 64 bit. Not that I can see much use of a
> configuration like that.

It looks that this is indeed a bug, and I'm on my way to squash it.
Can't believe we allowed that for so long...

But the architecture clearly states:

<quote>
These features are supported in AArch64 state only.
</quote>

So I'd expect something like:

diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 956cdc240148..50635eacfa43 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -220,7 +220,8 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
 	switch (vcpu->arch.target) {
 	default:
 		if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
-			if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1)) {
+			if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1) ||
+			    vcpu->kvm->arch.mte_enabled) {
 				ret = -EINVAL;
 				goto out;
 			}

that makes it completely impossible to create 32bit CPUs within a
MTE-enabled guest.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl
  2021-05-19 14:09     ` Steven Price
@ 2021-05-20 10:24       ` Marc Zyngier
  2021-05-20 10:52         ` Steven Price
  0 siblings, 1 reply; 49+ messages in thread
From: Marc Zyngier @ 2021-05-20 10:24 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Wed, 19 May 2021 15:09:23 +0100,
Steven Price <steven.price@arm.com> wrote:
> 
> On 17/05/2021 19:09, Marc Zyngier wrote:
> > On Mon, 17 May 2021 13:32:39 +0100,
> > Steven Price <steven.price@arm.com> wrote:
> >>
> >> A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
> >> granting a guest access to the tags, and provides a mechanism for the
> >> VMM to enable it.
> >>
> >> A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
> >> access the tags of a guest without having to maintain a PROT_MTE mapping
> >> in userspace. The above capability gates access to the ioctl.
> >>
> >> Signed-off-by: Steven Price <steven.price@arm.com>
> >> ---
> >>  Documentation/virt/kvm/api.rst | 53 ++++++++++++++++++++++++++++++++++
> >>  1 file changed, 53 insertions(+)
> >>
> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> index 22d077562149..a31661b870ba 100644
> >> --- a/Documentation/virt/kvm/api.rst
> >> +++ b/Documentation/virt/kvm/api.rst
> >> @@ -5034,6 +5034,40 @@ see KVM_XEN_VCPU_SET_ATTR above.
> >>  The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
> >>  with the KVM_XEN_VCPU_GET_ATTR ioctl.
> >>  
> >> +4.130 KVM_ARM_MTE_COPY_TAGS
> >> +---------------------------
> >> +
> >> +:Capability: KVM_CAP_ARM_MTE
> >> +:Architectures: arm64
> >> +:Type: vm ioctl
> >> +:Parameters: struct kvm_arm_copy_mte_tags
> >> +:Returns: 0 on success, < 0 on error
> >> +
> >> +::
> >> +
> >> +  struct kvm_arm_copy_mte_tags {
> >> +	__u64 guest_ipa;
> >> +	__u64 length;
> >> +	union {
> >> +		void __user *addr;
> >> +		__u64 padding;
> >> +	};
> >> +	__u64 flags;
> >> +	__u64 reserved[2];
> >> +  };
> > 
> > This doesn't exactly match the structure in the previous patch :-(.
> 
> :( I knew there was a reason I didn't include it in the documentation
> for the first 9 versions... I'll fix this up, thanks for spotting it.
> 
> >> +
> >> +Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
> >> +``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
> >> +fieldmust point to a buffer which the tags will be copied to or from.
> >> +
> >> +``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
> >> +``KVM_ARM_TAGS_FROM_GUEST``.
> >> +
> >> +The size of the buffer to store the tags is ``(length / MTE_GRANULE_SIZE)``
> > 
> > Should we add a UAPI definition for MTE_GRANULE_SIZE?
> 
> I wasn't sure whether to export this or not. The ioctl is based around
> the existing ptrace interface (PTRACE_{PEEK,POKE}MTETAGS) which doesn't
> expose a UAPI definition. Admittedly the documentation there also just
> says "16-byte granule" rather than MTE_GRANULE_SIZE.
> 
> So I'll just remove the reference to MTE_GRANULE_SIZE in the
> documentation unless you feel that we should have a UAPI definition.

Dropping the mention of this symbol and replacing it by the value 16
matches the architecture and doesn't require any extra UAPI
definition, so let's just do that.

> 
> >> +bytes (i.e. 1/16th of the corresponding size). Each byte contains a single tag
> >> +value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
> >> +``PTRACE_POKEMTETAGS``.
> >> +
> >>  5. The kvm_run structure
> >>  ========================
> >>  
> >> @@ -6362,6 +6396,25 @@ default.
> >>  
> >>  See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
> >>  
> >> +7.26 KVM_CAP_ARM_MTE
> >> +--------------------
> >> +
> >> +:Architectures: arm64
> >> +:Parameters: none
> >> +
> >> +This capability indicates that KVM (and the hardware) supports exposing the
> >> +Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
> >> +VMM before the guest will be granted access.
> >> +
> >> +When enabled the guest is able to access tags associated with any memory given
> >> +to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` so
> >> +that the tags are maintained during swap or hibernation of the host; however
> >> +the VMM needs to manually save/restore the tags as appropriate if the VM is
> >> +migrated.
> >> +
> >> +When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
> >> +perform a bulk copy of tags to/from the guest.
> >> +
> > 
> > Missing limitation to AArch64 guests.
> 
> As mentioned previously it's not technically limited to AArch64, but
> I'll expand this to make it clear that MTE isn't usable from a AArch32 VCPU.

I believe the architecture is quite clear that it *is* limited to
AArch64. The clarification is welcome though.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 6/8] arm64: kvm: Expose KVM_ARM_CAP_MTE
  2021-05-20 10:09       ` Marc Zyngier
@ 2021-05-20 10:51         ` Steven Price
  0 siblings, 0 replies; 49+ messages in thread
From: Steven Price @ 2021-05-20 10:51 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 20/05/2021 11:09, Marc Zyngier wrote:
> On Wed, 19 May 2021 14:26:31 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> On 17/05/2021 18:40, Marc Zyngier wrote:
>>> On Mon, 17 May 2021 13:32:37 +0100,
>>> Steven Price <steven.price@arm.com> wrote:
>>>>
>>>> It's now safe for the VMM to enable MTE in a guest, so expose the
>>>> capability to user space.
>>>>
>>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>>> ---
>>>>  arch/arm64/kvm/arm.c      | 9 +++++++++
>>>>  arch/arm64/kvm/sys_regs.c | 3 +++
>>>>  2 files changed, 12 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index 1cb39c0803a4..e89a5e275e25 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>>  		r = 0;
>>>>  		kvm->arch.return_nisv_io_abort_to_user = true;
>>>>  		break;
>>>> +	case KVM_CAP_ARM_MTE:
>>>> +		if (!system_supports_mte() || kvm->created_vcpus)
>>>> +			return -EINVAL;
>>>> +		r = 0;
>>>> +		kvm->arch.mte_enabled = true;
>>>
>>> As far as I can tell from the architecture, this isn't valid for a
>>> 32bit guest.
>>
>> Indeed, however the MTE flag is a property of the VM not of the vCPU.
>> And, unless I'm mistaken, it's technically possible to create a VM where
>> some CPUs are 32 bit and some 64 bit. Not that I can see much use of a
>> configuration like that.
> 
> It looks that this is indeed a bug, and I'm on my way to squash it.
> Can't believe we allowed that for so long...

Ah, well if you're going to kill off mixed 32bit/64bit VMs then...

> But the architecture clearly states:
> 
> <quote>
> These features are supported in AArch64 state only.
> </quote>
> 
> So I'd expect something like:
> 
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 956cdc240148..50635eacfa43 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -220,7 +220,8 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
>  	switch (vcpu->arch.target) {
>  	default:
>  		if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
> -			if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1)) {
> +			if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1) ||
> +			    vcpu->kvm->arch.mte_enabled) {
>  				ret = -EINVAL;
>  				goto out;
>  			}
> 
> that makes it completely impossible to create 32bit CPUs within a
> MTE-enabled guest.

... that makes complete sense, and I'll include this hunk in my next
posting.

Thanks,

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl
  2021-05-20 10:24       ` Marc Zyngier
@ 2021-05-20 10:52         ` Steven Price
  0 siblings, 0 replies; 49+ messages in thread
From: Steven Price @ 2021-05-20 10:52 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 20/05/2021 11:24, Marc Zyngier wrote:
> On Wed, 19 May 2021 15:09:23 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> On 17/05/2021 19:09, Marc Zyngier wrote:
>>> On Mon, 17 May 2021 13:32:39 +0100,
>>> Steven Price <steven.price@arm.com> wrote:
[...]>>>> +bytes (i.e. 1/16th of the corresponding size). Each byte
contains a single tag
>>>> +value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
>>>> +``PTRACE_POKEMTETAGS``.
>>>> +
>>>>  5. The kvm_run structure
>>>>  ========================
>>>>  
>>>> @@ -6362,6 +6396,25 @@ default.
>>>>  
>>>>  See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
>>>>  
>>>> +7.26 KVM_CAP_ARM_MTE
>>>> +--------------------
>>>> +
>>>> +:Architectures: arm64
>>>> +:Parameters: none
>>>> +
>>>> +This capability indicates that KVM (and the hardware) supports exposing the
>>>> +Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
>>>> +VMM before the guest will be granted access.
>>>> +
>>>> +When enabled the guest is able to access tags associated with any memory given
>>>> +to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` so
>>>> +that the tags are maintained during swap or hibernation of the host; however
>>>> +the VMM needs to manually save/restore the tags as appropriate if the VM is
>>>> +migrated.
>>>> +
>>>> +When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
>>>> +perform a bulk copy of tags to/from the guest.
>>>> +
>>>
>>> Missing limitation to AArch64 guests.
>>
>> As mentioned previously it's not technically limited to AArch64, but
>> I'll expand this to make it clear that MTE isn't usable from a AArch32 VCPU.
> 
> I believe the architecture is quite clear that it *is* limited to
> AArch64. The clarification is welcome though.

I explained that badly. A system supporting MTE doesn't have to have all
CPUs running AArch64 - fairly obviously you can boot a 32 bit OS on a
system supporting AArch64.

Since the KVM capability is a VM capability it's not architecturally
inconsistent to enable it even if all your CPUs are running AArch32 (at
EL1 and lower) - just a bit pointless.

However, given your comment that a mixture of AArch32/AArch64 VCPUs is a
bug - we can fail creation of AArch32 VCPUs and I'll explicitly document
this is a AArch64 only feature.

Thanks,

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature
  2021-05-17 12:32 ` [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature Steven Price
  2021-05-17 16:45   ` Marc Zyngier
@ 2021-05-20 11:54   ` Catalin Marinas
  2021-05-20 15:05     ` Steven Price
  1 sibling, 1 reply; 49+ messages in thread
From: Catalin Marinas @ 2021-05-20 11:54 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Mon, May 17, 2021 at 01:32:35PM +0100, Steven Price wrote:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c5d1f3c87dbd..8660f6a03f51 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -822,6 +822,31 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>  	return PAGE_SIZE;
>  }
>  
> +static int sanitise_mte_tags(struct kvm *kvm, unsigned long size,
> +			     kvm_pfn_t pfn)
> +{
> +	if (kvm_has_mte(kvm)) {
> +		/*
> +		 * The page will be mapped in stage 2 as Normal Cacheable, so
> +		 * the VM will be able to see the page's tags and therefore
> +		 * they must be initialised first. If PG_mte_tagged is set,
> +		 * tags have already been initialised.
> +		 */
> +		unsigned long i, nr_pages = size >> PAGE_SHIFT;
> +		struct page *page = pfn_to_online_page(pfn);
> +
> +		if (!page)
> +			return -EFAULT;

IIRC we ended up with pfn_to_online_page() to reject ZONE_DEVICE pages
that may be mapped into a guest and we have no idea whether they support
MTE. It may be worth adding a comment, otherwise, as Marc said, the page
wouldn't disappear.

> +
> +		for (i = 0; i < nr_pages; i++, page++) {
> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> +				mte_clear_page_tags(page_address(page));

We started the page->flags thread and ended up fixing it for the host
set_pte_at() as per the first patch:

https://lore.kernel.org/r/c3293d47-a5f2-ea4a-6730-f5cae26d8a7e@arm.com

Now, can we have a race between the stage 2 kvm_set_spte_gfn() and a
stage 1 set_pte_at()? Only the latter takes a lock. Or between two
kvm_set_spte_gfn() in different VMs? I think in the above thread we
concluded that there's only a problem if the page is shared between
multiple VMMs (MAP_SHARED). How realistic is this and what's the
workaround?

Either way, I think it's worth adding a comment here on the race on
page->flags as it looks strange that here it's just a test_and_set_bit()
while set_pte_at() uses a spinlock.

-- 
Catalin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged
  2021-05-19 18:06   ` Catalin Marinas
@ 2021-05-20 11:55     ` Steven Price
  2021-05-20 12:25       ` Catalin Marinas
  0 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-20 11:55 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 19/05/2021 19:06, Catalin Marinas wrote:
> On Mon, May 17, 2021 at 01:32:34PM +0100, Steven Price wrote:
>> A KVM guest could store tags in a page even if the VMM hasn't mapped
>> the page with PROT_MTE. So when restoring pages from swap we will
>> need to check to see if there are any saved tags even if !pte_tagged().
>>
>> However don't check pages for which pte_access_permitted() returns false
>> as these will not have been swapped out.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/include/asm/pgtable.h |  9 +++++++--
>>  arch/arm64/kernel/mte.c          | 16 ++++++++++++++--
>>  2 files changed, 21 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 0b10204e72fc..275178a810c1 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -314,8 +314,13 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>>  	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
>>  		__sync_icache_dcache(pte);
>>  
>> -	if (system_supports_mte() &&
>> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
>> +	/*
>> +	 * If the PTE would provide user space access to the tags associated
>> +	 * with it then ensure that the MTE tags are synchronised.  Exec-only
>> +	 * mappings don't expose tags (instruction fetches don't check tags).
>> +	 */
>> +	if (system_supports_mte() && pte_present(pte) &&
>> +	    pte_access_permitted(pte, false) && !pte_special(pte))
>>  		mte_sync_tags(ptep, pte);
> 
> Looking at the mte_sync_page_tags() logic, we bail out early if it's the
> old pte is not a swap one and the new pte is not tagged. So we only need
> to call mte_sync_tags() if it's a tagged new pte or the old one is swap.
> What about changing the set_pte_at() test to:
> 
> 	if (system_supports_mte() && pte_present(pte) && !pte_special(pte) &&
> 	    (pte_tagged(pte) || is_swap_pte(READ_ONCE(*ptep))))
> 		mte_sync_tags(ptep, pte);
> 
> We can even change mte_sync_tags() to take the old pte directly:
> 
> 	if (system_supports_mte() && pte_present(pte) && !pte_special(pte)) {
> 		pte_t old_pte = READ_ONCE(*ptep);
> 		if (pte_tagged(pte) || is_swap_pte(old_pte))
> 			mte_sync_tags(old_pte, pte);
> 	}
> 
> It would save a function call in most cases where the page is not
> tagged.
> 

Yes that looks like a good optimisation - although you've missed the
pte_access_permitted() part of the check ;) The problem I hit is one of
include dependencies:

is_swap_pte() is defined (as a static inline) in
include/linux/swapops.h. However the definition depends on
pte_none()/pte_present() which are defined in pgtable.h - so there's a
circular dependency.

Open coding is_swap_pte() in set_pte_at() works, but it's a bit ugly.
Any ideas on how to improve on the below?

	if (system_supports_mte() && pte_present(pte) &&
	    pte_access_permitted(pte, false) && !pte_special(pte)) {
		pte_t old_pte = READ_ONCE(*ptep);
		/*
		 * We only need to synchronise if the new PTE has tags enabled
		 * or if swapping in (in which case another mapping may have
		 * set tags in the past even if this PTE isn't tagged).
		 * (!pte_none() && !pte_present()) is an open coded version of
		 * is_swap_pte()
		 */
		if (pte_tagged(pte) || (!pte_none(pte) && !pte_present(pte)))
			mte_sync_tags(old_pte, pte);
	}

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-05-17 12:32 ` [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest Steven Price
  2021-05-17 18:04   ` Marc Zyngier
@ 2021-05-20 12:05   ` Catalin Marinas
  2021-05-20 15:58     ` Steven Price
  1 sibling, 1 reply; 49+ messages in thread
From: Catalin Marinas @ 2021-05-20 12:05 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Mon, May 17, 2021 at 01:32:38PM +0100, Steven Price wrote:
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 24223adae150..b3edde68bc3e 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
>  	__u32 reserved[12];
>  };
>  
> +struct kvm_arm_copy_mte_tags {
> +	__u64 guest_ipa;
> +	__u64 length;
> +	void __user *addr;
> +	__u64 flags;
> +	__u64 reserved[2];

I forgot the past discussions, what's the reserved for? Future
expansion?

> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e89a5e275e25..4b6c83beb75d 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>  	}
>  }
>  
> +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
> +				      struct kvm_arm_copy_mte_tags *copy_tags)
> +{
> +	gpa_t guest_ipa = copy_tags->guest_ipa;
> +	size_t length = copy_tags->length;
> +	void __user *tags = copy_tags->addr;
> +	gpa_t gfn;
> +	bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
> +	int ret = 0;
> +
> +	if (copy_tags->reserved[0] || copy_tags->reserved[1])
> +		return -EINVAL;
> +
> +	if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
> +		return -EINVAL;
> +
> +	if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
> +		return -EINVAL;
> +
> +	gfn = gpa_to_gfn(guest_ipa);
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	while (length > 0) {
> +		kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
> +		void *maddr;
> +		unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
> +
> +		if (is_error_noslot_pfn(pfn)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +
> +		maddr = page_address(pfn_to_page(pfn));
> +
> +		if (!write) {
> +			num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
> +			kvm_release_pfn_clean(pfn);

Do we need to check if PG_mte_tagged is set? If the page was not faulted
into the guest address space but the VMM has the page, does the
gfn_to_pfn_prot() guarantee that a kvm_set_spte_gfn() was called? If
not, this may read stale tags.

> +		} else {
> +			num_tags = mte_copy_tags_from_user(maddr, tags,
> +							   num_tags);
> +			kvm_release_pfn_dirty(pfn);
> +		}

Same question here, if the we can't guarantee the stage 2 pte being set,
we'd need to set PG_mte_tagged.

> +
> +		if (num_tags != PAGE_SIZE / MTE_GRANULE_SIZE) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +
> +		gfn++;
> +		tags += num_tags;
> +		length -= PAGE_SIZE;
> +	}
> +
> +out:
> +	mutex_unlock(&kvm->slots_lock);
> +	return ret;
> +}
> +

-- 
Catalin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged
  2021-05-20 11:55     ` Steven Price
@ 2021-05-20 12:25       ` Catalin Marinas
  2021-05-20 13:02         ` Catalin Marinas
  2021-05-20 13:03         ` Steven Price
  0 siblings, 2 replies; 49+ messages in thread
From: Catalin Marinas @ 2021-05-20 12:25 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Thu, May 20, 2021 at 12:55:21PM +0100, Steven Price wrote:
> On 19/05/2021 19:06, Catalin Marinas wrote:
> > On Mon, May 17, 2021 at 01:32:34PM +0100, Steven Price wrote:
> >> A KVM guest could store tags in a page even if the VMM hasn't mapped
> >> the page with PROT_MTE. So when restoring pages from swap we will
> >> need to check to see if there are any saved tags even if !pte_tagged().
> >>
> >> However don't check pages for which pte_access_permitted() returns false
> >> as these will not have been swapped out.
> >>
> >> Signed-off-by: Steven Price <steven.price@arm.com>
> >> ---
> >>  arch/arm64/include/asm/pgtable.h |  9 +++++++--
> >>  arch/arm64/kernel/mte.c          | 16 ++++++++++++++--
> >>  2 files changed, 21 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index 0b10204e72fc..275178a810c1 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -314,8 +314,13 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
> >>  	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
> >>  		__sync_icache_dcache(pte);
> >>  
> >> -	if (system_supports_mte() &&
> >> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
> >> +	/*
> >> +	 * If the PTE would provide user space access to the tags associated
> >> +	 * with it then ensure that the MTE tags are synchronised.  Exec-only
> >> +	 * mappings don't expose tags (instruction fetches don't check tags).
> >> +	 */
> >> +	if (system_supports_mte() && pte_present(pte) &&
> >> +	    pte_access_permitted(pte, false) && !pte_special(pte))
> >>  		mte_sync_tags(ptep, pte);
> > 
> > Looking at the mte_sync_page_tags() logic, we bail out early if it's the
> > old pte is not a swap one and the new pte is not tagged. So we only need
> > to call mte_sync_tags() if it's a tagged new pte or the old one is swap.
> > What about changing the set_pte_at() test to:
> > 
> > 	if (system_supports_mte() && pte_present(pte) && !pte_special(pte) &&
> > 	    (pte_tagged(pte) || is_swap_pte(READ_ONCE(*ptep))))
> > 		mte_sync_tags(ptep, pte);
> > 
> > We can even change mte_sync_tags() to take the old pte directly:
> > 
> > 	if (system_supports_mte() && pte_present(pte) && !pte_special(pte)) {
> > 		pte_t old_pte = READ_ONCE(*ptep);
> > 		if (pte_tagged(pte) || is_swap_pte(old_pte))
> > 			mte_sync_tags(old_pte, pte);
> > 	}
> > 
> > It would save a function call in most cases where the page is not
> > tagged.
> 
> Yes that looks like a good optimisation - although you've missed the
> pte_access_permitted() part of the check ;)

I was actually wondering if we could remove it. I don't think it buys us
much as we have a pte_present() check already, so we know it is pointing
to a valid page. Currently we'd only get a tagged pte on user mappings,
same with swap entries.

When vmalloc kasan_hw will be added, I think we have a set_pte_at() with
a tagged pte but init_mm and high address (we might as well add a
warning if addr > TASK_SIZE_64 on the mte_sync_tags path so that we
don't forget).

> The problem I hit is one of include dependencies:
> 
> is_swap_pte() is defined (as a static inline) in
> include/linux/swapops.h. However the definition depends on
> pte_none()/pte_present() which are defined in pgtable.h - so there's a
> circular dependency.
> 
> Open coding is_swap_pte() in set_pte_at() works, but it's a bit ugly.
> Any ideas on how to improve on the below?
> 
> 	if (system_supports_mte() && pte_present(pte) &&
> 	    pte_access_permitted(pte, false) && !pte_special(pte)) {
> 		pte_t old_pte = READ_ONCE(*ptep);
> 		/*
> 		 * We only need to synchronise if the new PTE has tags enabled
> 		 * or if swapping in (in which case another mapping may have
> 		 * set tags in the past even if this PTE isn't tagged).
> 		 * (!pte_none() && !pte_present()) is an open coded version of
> 		 * is_swap_pte()
> 		 */
> 		if (pte_tagged(pte) || (!pte_none(pte) && !pte_present(pte)))
> 			mte_sync_tags(old_pte, pte);
> 	}

That's why I avoided testing my suggestion ;). I think we should just
add !pte_none() in there with a comment that it may be a swap pte and
use the is_swap_pte() again on the mte_sync_tags() path. We already have
the pte_present() check.

-- 
Catalin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged
  2021-05-20 12:25       ` Catalin Marinas
@ 2021-05-20 13:02         ` Catalin Marinas
  2021-05-20 13:03         ` Steven Price
  1 sibling, 0 replies; 49+ messages in thread
From: Catalin Marinas @ 2021-05-20 13:02 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Thu, May 20, 2021 at 01:25:50PM +0100, Catalin Marinas wrote:
> On Thu, May 20, 2021 at 12:55:21PM +0100, Steven Price wrote:
> > The problem I hit is one of include dependencies:
> > 
> > is_swap_pte() is defined (as a static inline) in
> > include/linux/swapops.h. However the definition depends on
> > pte_none()/pte_present() which are defined in pgtable.h - so there's a
> > circular dependency.
> > 
> > Open coding is_swap_pte() in set_pte_at() works, but it's a bit ugly.
> > Any ideas on how to improve on the below?
> > 
> > 	if (system_supports_mte() && pte_present(pte) &&
> > 	    pte_access_permitted(pte, false) && !pte_special(pte)) {
> > 		pte_t old_pte = READ_ONCE(*ptep);
> > 		/*
> > 		 * We only need to synchronise if the new PTE has tags enabled
> > 		 * or if swapping in (in which case another mapping may have
> > 		 * set tags in the past even if this PTE isn't tagged).
> > 		 * (!pte_none() && !pte_present()) is an open coded version of
> > 		 * is_swap_pte()
> > 		 */
> > 		if (pte_tagged(pte) || (!pte_none(pte) && !pte_present(pte)))
> > 			mte_sync_tags(old_pte, pte);
> > 	}
> 
> That's why I avoided testing my suggestion ;). I think we should just
> add !pte_none() in there with a comment that it may be a swap pte and
> use the is_swap_pte() again on the mte_sync_tags() path. We already have
> the pte_present() check.

Correction - pte_present() checks the new pte only, we need another for
the old pte. So it looks like we'll open-code is_swap_pte().

-- 
Catalin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged
  2021-05-20 12:25       ` Catalin Marinas
  2021-05-20 13:02         ` Catalin Marinas
@ 2021-05-20 13:03         ` Steven Price
  1 sibling, 0 replies; 49+ messages in thread
From: Steven Price @ 2021-05-20 13:03 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 20/05/2021 13:25, Catalin Marinas wrote:
> On Thu, May 20, 2021 at 12:55:21PM +0100, Steven Price wrote:
>> On 19/05/2021 19:06, Catalin Marinas wrote:
>>> On Mon, May 17, 2021 at 01:32:34PM +0100, Steven Price wrote:
>>>> A KVM guest could store tags in a page even if the VMM hasn't mapped
>>>> the page with PROT_MTE. So when restoring pages from swap we will
>>>> need to check to see if there are any saved tags even if !pte_tagged().
>>>>
>>>> However don't check pages for which pte_access_permitted() returns false
>>>> as these will not have been swapped out.
>>>>
>>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>>> ---
>>>>  arch/arm64/include/asm/pgtable.h |  9 +++++++--
>>>>  arch/arm64/kernel/mte.c          | 16 ++++++++++++++--
>>>>  2 files changed, 21 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 0b10204e72fc..275178a810c1 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -314,8 +314,13 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>>>>  	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
>>>>  		__sync_icache_dcache(pte);
>>>>  
>>>> -	if (system_supports_mte() &&
>>>> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
>>>> +	/*
>>>> +	 * If the PTE would provide user space access to the tags associated
>>>> +	 * with it then ensure that the MTE tags are synchronised.  Exec-only
>>>> +	 * mappings don't expose tags (instruction fetches don't check tags).
>>>> +	 */
>>>> +	if (system_supports_mte() && pte_present(pte) &&
>>>> +	    pte_access_permitted(pte, false) && !pte_special(pte))
>>>>  		mte_sync_tags(ptep, pte);
>>>
>>> Looking at the mte_sync_page_tags() logic, we bail out early if it's the
>>> old pte is not a swap one and the new pte is not tagged. So we only need
>>> to call mte_sync_tags() if it's a tagged new pte or the old one is swap.
>>> What about changing the set_pte_at() test to:
>>>
>>> 	if (system_supports_mte() && pte_present(pte) && !pte_special(pte) &&
>>> 	    (pte_tagged(pte) || is_swap_pte(READ_ONCE(*ptep))))
>>> 		mte_sync_tags(ptep, pte);
>>>
>>> We can even change mte_sync_tags() to take the old pte directly:
>>>
>>> 	if (system_supports_mte() && pte_present(pte) && !pte_special(pte)) {
>>> 		pte_t old_pte = READ_ONCE(*ptep);
>>> 		if (pte_tagged(pte) || is_swap_pte(old_pte))
>>> 			mte_sync_tags(old_pte, pte);
>>> 	}
>>>
>>> It would save a function call in most cases where the page is not
>>> tagged.
>>
>> Yes that looks like a good optimisation - although you've missed the
>> pte_access_permitted() part of the check ;)
> 
> I was actually wondering if we could remove it. I don't think it buys us
> much as we have a pte_present() check already, so we know it is pointing
> to a valid page. Currently we'd only get a tagged pte on user mappings,
> same with swap entries.

Actually the other way round makes more sense surely?
pte_access_permitted() is true if both PTE_VALID & PTE_USER are set.
pte_present() is true if *either* PTE_VALID or PTE_PROT_NONE are set. So
the pte_present() is actually redundant.

> When vmalloc kasan_hw will be added, I think we have a set_pte_at() with
> a tagged pte but init_mm and high address (we might as well add a
> warning if addr > TASK_SIZE_64 on the mte_sync_tags path so that we
> don't forget).

While we might not yet have tagged kernel pages - I'm not sure there's
much point weakening the check to have to then check addr as well in the
future.

>> The problem I hit is one of include dependencies:
>>
>> is_swap_pte() is defined (as a static inline) in
>> include/linux/swapops.h. However the definition depends on
>> pte_none()/pte_present() which are defined in pgtable.h - so there's a
>> circular dependency.
>>
>> Open coding is_swap_pte() in set_pte_at() works, but it's a bit ugly.
>> Any ideas on how to improve on the below?
>>
>> 	if (system_supports_mte() && pte_present(pte) &&
>> 	    pte_access_permitted(pte, false) && !pte_special(pte)) {
>> 		pte_t old_pte = READ_ONCE(*ptep);
>> 		/*
>> 		 * We only need to synchronise if the new PTE has tags enabled
>> 		 * or if swapping in (in which case another mapping may have
>> 		 * set tags in the past even if this PTE isn't tagged).
>> 		 * (!pte_none() && !pte_present()) is an open coded version of
>> 		 * is_swap_pte()
>> 		 */
>> 		if (pte_tagged(pte) || (!pte_none(pte) && !pte_present(pte)))
>> 			mte_sync_tags(old_pte, pte);
>> 	}
> 
> That's why I avoided testing my suggestion ;). I think we should just
> add !pte_none() in there with a comment that it may be a swap pte and
> use the is_swap_pte() again on the mte_sync_tags() path. We already have
> the pte_present() check.

Well of course I didn't test the above beyond building - and I've
screwed up because the open coded is_swap_pte() should have been called
on old_pte not pte!

So the pte_present() check above (which I've just removed...) is for the
*new* PTE. So I think we need to keep both here.

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature
  2021-05-20  8:51       ` Marc Zyngier
@ 2021-05-20 14:46         ` Steven Price
  0 siblings, 0 replies; 49+ messages in thread
From: Steven Price @ 2021-05-20 14:46 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 20/05/2021 09:51, Marc Zyngier wrote:
> On Wed, 19 May 2021 11:48:21 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> On 17/05/2021 17:45, Marc Zyngier wrote:
>>> On Mon, 17 May 2021 13:32:35 +0100,
>>> Steven Price <steven.price@arm.com> wrote:
[...]
>>>> +		}
>>>> +	}
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>>>>  			  unsigned long fault_status)
>>>> @@ -971,8 +996,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>  	if (writable)
>>>>  		prot |= KVM_PGTABLE_PROT_W;
>>>>  
>>>> -	if (fault_status != FSC_PERM && !device)
>>>> +	if (fault_status != FSC_PERM && !device) {
>>>> +		ret = sanitise_mte_tags(kvm, vma_pagesize, pfn);
>>>> +		if (ret)
>>>> +			goto out_unlock;
>>>> +
>>>>  		clean_dcache_guest_page(pfn, vma_pagesize);
>>>> +	}
>>>>  
>>>>  	if (exec_fault) {
>>>>  		prot |= KVM_PGTABLE_PROT_X;
>>>> @@ -1168,12 +1198,17 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>>>>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>>>>  {
>>>>  	kvm_pfn_t pfn = pte_pfn(range->pte);
>>>> +	int ret;
>>>>  
>>>>  	if (!kvm->arch.mmu.pgt)
>>>>  		return 0;
>>>>  
>>>>  	WARN_ON(range->end - range->start != 1);
>>>>  
>>>> +	ret = sanitise_mte_tags(kvm, PAGE_SIZE, pfn);
>>>> +	if (ret)
>>>> +		return ret;
>>>
>>> Notice the change in return type?
>>
>> I do now - I was tricked by the use of '0' as false. Looks like false
>> ('0') is actually the correct return here to avoid an unnecessary
>> kvm_flush_remote_tlbs().
> 
> Yup. BTW, the return values have been fixed to proper boolean types in
> the latest set of fixes.

Thanks for the heads up - I'll return 'false' to avoid regressing that.

>>
>>>> +
>>>>  	/*
>>>>  	 * We've moved a page around, probably through CoW, so let's treat it
>>>>  	 * just like a translation fault and clean the cache to the PoC.
>>>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>>>> index 76ea2800c33e..24a844cb79ca 100644
>>>> --- a/arch/arm64/kvm/sys_regs.c
>>>> +++ b/arch/arm64/kvm/sys_regs.c
>>>> @@ -1047,6 +1047,9 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
>>>>  		break;
>>>>  	case SYS_ID_AA64PFR1_EL1:
>>>>  		val &= ~FEATURE(ID_AA64PFR1_MTE);
>>>> +		if (kvm_has_mte(vcpu->kvm))
>>>> +			val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE),
>>>> +					  ID_AA64PFR1_MTE);
>>>
>>> Shouldn't this be consistent with what the HW is capable of
>>> (i.e. FEAT_MTE3 if available), and extracted from the sanitised view
>>> of the feature set?
>>
>> Yes - however at the moment our sanitised view is either FEAT_MTE2 or
>> nothing:
>>
>> 	{
>> 		.desc = "Memory Tagging Extension",
>> 		.capability = ARM64_MTE,
>> 		.type = ARM64_CPUCAP_STRICT_BOOT_CPU_FEATURE,
>> 		.matches = has_cpuid_feature,
>> 		.sys_reg = SYS_ID_AA64PFR1_EL1,
>> 		.field_pos = ID_AA64PFR1_MTE_SHIFT,
>> 		.min_field_value = ID_AA64PFR1_MTE,
>> 		.sign = FTR_UNSIGNED,
>> 		.cpu_enable = cpu_enable_mte,
>> 	},
>>
>> When host support for FEAT_MTE3 is added then the KVM code will need
>> revisiting to expose that down to the guest safely (AFAICS there's
>> nothing extra to do here, but I haven't tested any of the MTE3
>> features). I don't think we want to expose newer versions to the guest
>> than the host is aware of. (Or indeed expose FEAT_MTE if the host has
>> MTE disabled because Linux requires at least FEAT_MTE2).
> 
> What I was suggesting is to have something like this:
> 
>      pfr = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1);
>      mte = cpuid_feature_extract_unsigned_field(pfr, ID_AA64PFR1_MTE_SHIFT);
>      val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE), mte);
> 
> which does the trick nicely, and doesn't expose more than the host
> supports.

Ok, I have to admit to not fully understanding the sanitised register
code - but wouldn't this expose higher MTE values if all CPUs support
it, even though the host doesn't know what a hypothetical 'MTE4' adds?
Or is there some magic in the sanitising that caps the value to what the
host knows about?

Thanks,

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature
  2021-05-20 11:54   ` Catalin Marinas
@ 2021-05-20 15:05     ` Steven Price
  2021-05-20 17:50       ` Catalin Marinas
  0 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-20 15:05 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 20/05/2021 12:54, Catalin Marinas wrote:
> On Mon, May 17, 2021 at 01:32:35PM +0100, Steven Price wrote:
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c5d1f3c87dbd..8660f6a03f51 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -822,6 +822,31 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>>  	return PAGE_SIZE;
>>  }
>>  
>> +static int sanitise_mte_tags(struct kvm *kvm, unsigned long size,
>> +			     kvm_pfn_t pfn)
>> +{
>> +	if (kvm_has_mte(kvm)) {
>> +		/*
>> +		 * The page will be mapped in stage 2 as Normal Cacheable, so
>> +		 * the VM will be able to see the page's tags and therefore
>> +		 * they must be initialised first. If PG_mte_tagged is set,
>> +		 * tags have already been initialised.
>> +		 */
>> +		unsigned long i, nr_pages = size >> PAGE_SHIFT;
>> +		struct page *page = pfn_to_online_page(pfn);
>> +
>> +		if (!page)
>> +			return -EFAULT;
> 
> IIRC we ended up with pfn_to_online_page() to reject ZONE_DEVICE pages
> that may be mapped into a guest and we have no idea whether they support
> MTE. It may be worth adding a comment, otherwise, as Marc said, the page
> wouldn't disappear.

I'll add a comment.

>> +
>> +		for (i = 0; i < nr_pages; i++, page++) {
>> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>> +				mte_clear_page_tags(page_address(page));
> 
> We started the page->flags thread and ended up fixing it for the host
> set_pte_at() as per the first patch:
> 
> https://lore.kernel.org/r/c3293d47-a5f2-ea4a-6730-f5cae26d8a7e@arm.com
> 
> Now, can we have a race between the stage 2 kvm_set_spte_gfn() and a
> stage 1 set_pte_at()? Only the latter takes a lock. Or between two
> kvm_set_spte_gfn() in different VMs? I think in the above thread we
> concluded that there's only a problem if the page is shared between
> multiple VMMs (MAP_SHARED). How realistic is this and what's the
> workaround?
> 
> Either way, I think it's worth adding a comment here on the race on
> page->flags as it looks strange that here it's just a test_and_set_bit()
> while set_pte_at() uses a spinlock.
> 

Very good point! I should have thought about that. I think splitting the
test_and_set_bit() in two (as with the cache flush) is sufficient. While
there technically still is a race which could lead to user space tags
being clobbered:

a) It's very odd for a VMM to be doing an mprotect() after the fact to
add PROT_MTE, or to be sharing the memory with another process which
sets PROT_MTE.

b) The window for the race is incredibly small and the VMM (generally)
needs to be robust against the guest changing tags anyway.

But I'll add a comment here as well:

	/*
	 * There is a potential race between sanitising the
	 * flags here and user space using mprotect() to add
	 * PROT_MTE to access the tags, however by splitting
	 * the test/set the only risk is user space tags
	 * being overwritten by the mte_clear_page_tags() call.
	 */

Thanks,

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 5/8] arm64: kvm: Save/restore MTE registers
  2021-05-20  9:46       ` Marc Zyngier
@ 2021-05-20 15:21         ` Steven Price
  0 siblings, 0 replies; 49+ messages in thread
From: Steven Price @ 2021-05-20 15:21 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 20/05/2021 10:46, Marc Zyngier wrote:
> On Wed, 19 May 2021 14:04:20 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> On 17/05/2021 18:17, Marc Zyngier wrote:
>>> On Mon, 17 May 2021 13:32:36 +0100,
>>> Steven Price <steven.price@arm.com> wrote:
>>>>
>>>> Define the new system registers that MTE introduces and context switch
>>>> them. The MTE feature is still hidden from the ID register as it isn't
>>>> supported in a VM yet.
>>>>
>>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>>> ---
>>>>  arch/arm64/include/asm/kvm_host.h          |  6 ++
>>>>  arch/arm64/include/asm/kvm_mte.h           | 66 ++++++++++++++++++++++
>>>>  arch/arm64/include/asm/sysreg.h            |  3 +-
>>>>  arch/arm64/kernel/asm-offsets.c            |  3 +
>>>>  arch/arm64/kvm/hyp/entry.S                 |  7 +++
>>>>  arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++++++
>>>>  arch/arm64/kvm/sys_regs.c                  | 22 ++++++--
>>>>  7 files changed, 123 insertions(+), 5 deletions(-)
>>>>  create mode 100644 arch/arm64/include/asm/kvm_mte.h
>>>>
>>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>>> index afaa5333f0e4..309e36cc1b42 100644
>>>> --- a/arch/arm64/include/asm/kvm_host.h
>>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>>> @@ -208,6 +208,12 @@ enum vcpu_sysreg {
>>>>  	CNTP_CVAL_EL0,
>>>>  	CNTP_CTL_EL0,
>>>>  
>>>> +	/* Memory Tagging Extension registers */
>>>> +	RGSR_EL1,	/* Random Allocation Tag Seed Register */
>>>> +	GCR_EL1,	/* Tag Control Register */
>>>> +	TFSR_EL1,	/* Tag Fault Status Register (EL1) */
>>>> +	TFSRE0_EL1,	/* Tag Fault Status Register (EL0) */
>>>> +
>>>>  	/* 32bit specific registers. Keep them at the end of the range */
>>>>  	DACR32_EL2,	/* Domain Access Control Register */
>>>>  	IFSR32_EL2,	/* Instruction Fault Status Register */
>>>> diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
>>>> new file mode 100644
>>>> index 000000000000..6541c7d6ce06
>>>> --- /dev/null
>>>> +++ b/arch/arm64/include/asm/kvm_mte.h
>>>> @@ -0,0 +1,66 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +/*
>>>> + * Copyright (C) 2020 ARM Ltd.
>>>> + */
>>>> +#ifndef __ASM_KVM_MTE_H
>>>> +#define __ASM_KVM_MTE_H
>>>> +
>>>> +#ifdef __ASSEMBLY__
>>>> +
>>>> +#include <asm/sysreg.h>
>>>> +
>>>> +#ifdef CONFIG_ARM64_MTE
>>>> +
>>>> +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
>>>> +alternative_if_not ARM64_MTE
>>>> +	b	.L__skip_switch\@
>>>> +alternative_else_nop_endif
>>>> +	mrs	\reg1, hcr_el2
>>>> +	and	\reg1, \reg1, #(HCR_ATA)
>>>> +	cbz	\reg1, .L__skip_switch\@
>>>> +
>>>> +	mrs_s	\reg1, SYS_RGSR_EL1
>>>> +	str	\reg1, [\h_ctxt, #CPU_RGSR_EL1]
>>>> +	mrs_s	\reg1, SYS_GCR_EL1
>>>> +	str	\reg1, [\h_ctxt, #CPU_GCR_EL1]
>>>> +
>>>> +	ldr	\reg1, [\g_ctxt, #CPU_RGSR_EL1]
>>>> +	msr_s	SYS_RGSR_EL1, \reg1
>>>> +	ldr	\reg1, [\g_ctxt, #CPU_GCR_EL1]
>>>> +	msr_s	SYS_GCR_EL1, \reg1
>>>> +
>>>> +.L__skip_switch\@:
>>>> +.endm
>>>> +
>>>> +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
>>>> +alternative_if_not ARM64_MTE
>>>> +	b	.L__skip_switch\@
>>>> +alternative_else_nop_endif
>>>> +	mrs	\reg1, hcr_el2
>>>> +	and	\reg1, \reg1, #(HCR_ATA)
>>>> +	cbz	\reg1, .L__skip_switch\@
>>>> +
>>>> +	mrs_s	\reg1, SYS_RGSR_EL1
>>>> +	str	\reg1, [\g_ctxt, #CPU_RGSR_EL1]
>>>> +	mrs_s	\reg1, SYS_GCR_EL1
>>>> +	str	\reg1, [\g_ctxt, #CPU_GCR_EL1]
>>>> +
>>>> +	ldr	\reg1, [\h_ctxt, #CPU_RGSR_EL1]
>>>> +	msr_s	SYS_RGSR_EL1, \reg1
>>>> +	ldr	\reg1, [\h_ctxt, #CPU_GCR_EL1]
>>>> +	msr_s	SYS_GCR_EL1, \reg1
>>>
>>> What is the rational for not having any synchronisation here? It is
>>> quite uncommon to allocate memory at EL2, but VHE can perform all kind
>>> of tricks.
>>
>> I don't follow. This is part of the __guest_exit path and there's an ISB
>> at the end of that - is that not sufficient? I don't see any possibility
>> for allocating memory before that. What am I missing?
> 
> Which ISB?  We have a few in the SError handling code, but that's
> conditioned on not having RAS. With any RAS-enabled CPU, we return to
> C code early, since we don't need any extra synchronisation (see the
> comment about the absence of ISB on this path).

Ah, I clearly didn't read the code (or comment) carefully enough -
indeed with RAS we're potentially skipping the ISB.

> I would really like to ensure that we return to C code in the exact
> state we left it.

Agreed, I'll stick an ISB at the end of mte_switch_to_hyp. Although
there's clearly room for optimisation here as ptrauth_switch_to_hyp has
a similar ISB.

>>
>>>> +
>>>> +.L__skip_switch\@:
>>>> +.endm
>>>> +
>>>> +#else /* CONFIG_ARM64_MTE */
>>>> +
>>>> +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
>>>> +.endm
>>>> +
>>>> +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
>>>> +.endm
>>>> +
>>>> +#endif /* CONFIG_ARM64_MTE */
>>>> +#endif /* __ASSEMBLY__ */
>>>> +#endif /* __ASM_KVM_MTE_H */
>>>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>>>> index 65d15700a168..347ccac2341e 100644
>>>> --- a/arch/arm64/include/asm/sysreg.h
>>>> +++ b/arch/arm64/include/asm/sysreg.h
>>>> @@ -651,7 +651,8 @@
>>>>  
>>>>  #define INIT_SCTLR_EL2_MMU_ON						\
>>>>  	(SCTLR_ELx_M  | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I |	\
>>>> -	 SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | SCTLR_EL2_RES1)
>>>> +	 SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 |		\
>>>> +	 SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)
>>>>  
>>>>  #define INIT_SCTLR_EL2_MMU_OFF \
>>>>  	(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
>>>> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
>>>> index 0cb34ccb6e73..6b489a8462f0 100644
>>>> --- a/arch/arm64/kernel/asm-offsets.c
>>>> +++ b/arch/arm64/kernel/asm-offsets.c
>>>> @@ -111,6 +111,9 @@ int main(void)
>>>>    DEFINE(VCPU_WORKAROUND_FLAGS,	offsetof(struct kvm_vcpu, arch.workaround_flags));
>>>>    DEFINE(VCPU_HCR_EL2,		offsetof(struct kvm_vcpu, arch.hcr_el2));
>>>>    DEFINE(CPU_USER_PT_REGS,	offsetof(struct kvm_cpu_context, regs));
>>>> +  DEFINE(CPU_RGSR_EL1,		offsetof(struct kvm_cpu_context, sys_regs[RGSR_EL1]));
>>>> +  DEFINE(CPU_GCR_EL1,		offsetof(struct kvm_cpu_context, sys_regs[GCR_EL1]));
>>>> +  DEFINE(CPU_TFSRE0_EL1,	offsetof(struct kvm_cpu_context, sys_regs[TFSRE0_EL1]));
>>>
>>> TFSRE0_EL1 is never accessed from assembly code. Leftover from a
>>> previous version?
>>
>> Indeed, I will drop it.
>>
>>>>    DEFINE(CPU_APIAKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APIAKEYLO_EL1]));
>>>>    DEFINE(CPU_APIBKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APIBKEYLO_EL1]));
>>>>    DEFINE(CPU_APDAKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APDAKEYLO_EL1]));
>>>> diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
>>>> index e831d3dfd50d..435346ea1504 100644
>>>> --- a/arch/arm64/kvm/hyp/entry.S
>>>> +++ b/arch/arm64/kvm/hyp/entry.S
>>>> @@ -13,6 +13,7 @@
>>>>  #include <asm/kvm_arm.h>
>>>>  #include <asm/kvm_asm.h>
>>>>  #include <asm/kvm_mmu.h>
>>>> +#include <asm/kvm_mte.h>
>>>>  #include <asm/kvm_ptrauth.h>
>>>>  
>>>>  	.text
>>>> @@ -51,6 +52,9 @@ alternative_else_nop_endif
>>>>  
>>>>  	add	x29, x0, #VCPU_CONTEXT
>>>>  
>>>> +	// mte_switch_to_guest(g_ctxt, h_ctxt, tmp1)
>>>> +	mte_switch_to_guest x29, x1, x2
>>>> +
>>>>  	// Macro ptrauth_switch_to_guest format:
>>>>  	// 	ptrauth_switch_to_guest(guest cxt, tmp1, tmp2, tmp3)
>>>>  	// The below macro to restore guest keys is not implemented in C code
>>>> @@ -142,6 +146,9 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL)
>>>>  	// when this feature is enabled for kernel code.
>>>>  	ptrauth_switch_to_hyp x1, x2, x3, x4, x5
>>>>  
>>>> +	// mte_switch_to_hyp(g_ctxt, h_ctxt, reg1)
>>>> +	mte_switch_to_hyp x1, x2, x3
>>>> +
>>>>  	// Restore hyp's sp_el0
>>>>  	restore_sp_el0 x2, x3
>>>>  
>>>> diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
>>>> index cce43bfe158f..de7e14c862e6 100644
>>>> --- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
>>>> +++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
>>>> @@ -14,6 +14,7 @@
>>>>  #include <asm/kvm_asm.h>
>>>>  #include <asm/kvm_emulate.h>
>>>>  #include <asm/kvm_hyp.h>
>>>> +#include <asm/kvm_mmu.h>
>>>>  
>>>>  static inline void __sysreg_save_common_state(struct kvm_cpu_context *ctxt)
>>>>  {
>>>> @@ -26,6 +27,16 @@ static inline void __sysreg_save_user_state(struct kvm_cpu_context *ctxt)
>>>>  	ctxt_sys_reg(ctxt, TPIDRRO_EL0)	= read_sysreg(tpidrro_el0);
>>>>  }
>>>>  
>>>> +static inline bool ctxt_has_mte(struct kvm_cpu_context *ctxt)
>>>> +{
>>>> +	struct kvm_vcpu *vcpu = ctxt->__hyp_running_vcpu;
>>>> +
>>>> +	if (!vcpu)
>>>> +		vcpu = container_of(ctxt, struct kvm_vcpu, arch.ctxt);
>>>> +
>>>> +	return kvm_has_mte(kern_hyp_va(vcpu->kvm));
>>>> +}
>>>> +
>>>>  static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
>>>>  {
>>>>  	ctxt_sys_reg(ctxt, CSSELR_EL1)	= read_sysreg(csselr_el1);
>>>> @@ -46,6 +57,11 @@ static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
>>>>  	ctxt_sys_reg(ctxt, PAR_EL1)	= read_sysreg_par();
>>>>  	ctxt_sys_reg(ctxt, TPIDR_EL1)	= read_sysreg(tpidr_el1);
>>>>  
>>>> +	if (ctxt_has_mte(ctxt)) {
>>>> +		ctxt_sys_reg(ctxt, TFSR_EL1) = read_sysreg_el1(SYS_TFSR);
>>>> +		ctxt_sys_reg(ctxt, TFSRE0_EL1) = read_sysreg_s(SYS_TFSRE0_EL1);
>>>> +	}
>>>
>>> I remember suggesting that this is slightly heavier than necessary.
>>>
>>> On nVHE, TFSRE0_EL1 could be moved to load/put, as we never run
>>> userspace with a vcpu loaded. The same holds of course for VHE, but we
>>> also can move TFSR_EL1 to load/put, as the host uses TFSR_EL2.
>>>
>>> Do you see any issue with that?
>>
>> The comment[1] I made before was:
> 
> Ah, I totally missed this email (or can't remember reading it, which
> amounts to the same thing). Apologies for that.
> 
>>   For TFSR_EL1 + VHE I believe it is synchronised only on vcpu_load/put -
>>   __sysreg_save_el1_state() is called from kvm_vcpu_load_sysregs_vhe().
>>
>>   TFSRE0_EL1 potentially could be improved. I have to admit I was unsure
>>   if it should be in __sysreg_save_user_state() instead. However AFAICT
>>   that is called at the same time as __sysreg_save_el1_state() and there's
>>   no optimisation for nVHE. And given it's an _EL1 register this seemed
>>   like the logic place.
>>
>>   Am I missing something here? Potentially there are other registers to be
>>   optimised (TPIDRRO_EL0 looks like a possiblity), but IMHO that doesn't
>>   belong in this series.
>>
>> For VHE TFSR_EL1 is already only saved/restored on load/put
>> (__sysreg_save_el1_state() is called from kvm_vcpu_put_sysregs_vhe()).
>>
>> TFSRE0_EL1 could be moved, but I'm not sure where it should live as I
>> mentioned above.
> 
> Yeah, this looks fine, please ignore my rambling.

No problem!

Thanks,

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-05-20 12:05   ` Catalin Marinas
@ 2021-05-20 15:58     ` Steven Price
  2021-05-20 17:27       ` Catalin Marinas
  0 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-20 15:58 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 20/05/2021 13:05, Catalin Marinas wrote:
> On Mon, May 17, 2021 at 01:32:38PM +0100, Steven Price wrote:
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 24223adae150..b3edde68bc3e 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
>>  	__u32 reserved[12];
>>  };
>>  
>> +struct kvm_arm_copy_mte_tags {
>> +	__u64 guest_ipa;
>> +	__u64 length;
>> +	void __user *addr;
>> +	__u64 flags;
>> +	__u64 reserved[2];
> 
> I forgot the past discussions, what's the reserved for? Future
> expansion?

Yes - for future expansion. Marc asked for them[1]:

> I'd be keen on a couple of reserved __64s. Just in case...

[1] https://lore.kernel.org/r/87ft14xl9e.wl-maz%40kernel.org

>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index e89a5e275e25..4b6c83beb75d 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>>  	}
>>  }
>>  
>> +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>> +				      struct kvm_arm_copy_mte_tags *copy_tags)
>> +{
>> +	gpa_t guest_ipa = copy_tags->guest_ipa;
>> +	size_t length = copy_tags->length;
>> +	void __user *tags = copy_tags->addr;
>> +	gpa_t gfn;
>> +	bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
>> +	int ret = 0;
>> +
>> +	if (copy_tags->reserved[0] || copy_tags->reserved[1])
>> +		return -EINVAL;
>> +
>> +	if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
>> +		return -EINVAL;
>> +
>> +	if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
>> +		return -EINVAL;
>> +
>> +	gfn = gpa_to_gfn(guest_ipa);
>> +
>> +	mutex_lock(&kvm->slots_lock);
>> +
>> +	while (length > 0) {
>> +		kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
>> +		void *maddr;
>> +		unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
>> +
>> +		if (is_error_noslot_pfn(pfn)) {
>> +			ret = -EFAULT;
>> +			goto out;
>> +		}
>> +
>> +		maddr = page_address(pfn_to_page(pfn));
>> +
>> +		if (!write) {
>> +			num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
>> +			kvm_release_pfn_clean(pfn);
> 
> Do we need to check if PG_mte_tagged is set? If the page was not faulted
> into the guest address space but the VMM has the page, does the
> gfn_to_pfn_prot() guarantee that a kvm_set_spte_gfn() was called? If
> not, this may read stale tags.

Ah, I hadn't thought about that... No I don't believe gfn_to_pfn_prot()
will fault it into the guest.

>> +		} else {
>> +			num_tags = mte_copy_tags_from_user(maddr, tags,
>> +							   num_tags);
>> +			kvm_release_pfn_dirty(pfn);
>> +		}
> 
> Same question here, if the we can't guarantee the stage 2 pte being set,
> we'd need to set PG_mte_tagged.

This is arguably worse as we'll be writing tags into the guest but
without setting PG_mte_tagged - so they'll be lost when the guest then
faults the pages in. Which sounds like it should break migration.

I think the below should be safe, and avoids the overhead of setting the
flag just for reads.

Thanks,

Steve

----8<----
		page = pfn_to_page(pfn);
		maddr = page_address(page);

		if (!write) {
			if (test_bit(PG_mte_tagged, &page->flags))
				num_tags = mte_copy_tags_to_user(tags, maddr,
							MTE_GRANULES_PER_PAGE);
			else
				/* No tags in memory, so write zeros */
				num_tags = MTE_GRANULES_PER_PAGE -
					clear_user(tag, MTE_GRANULES_PER_PAGE);
			kvm_release_pfn_clean(pfn);
		} else {
			num_tags = mte_copy_tags_from_user(maddr, tags,
							MTE_GRANULES_PER_PAGE);
			kvm_release_pfn_dirty(pfn);
		}

		if (num_tags != MTE_GRANULES_PER_PAGE) {
			ret = -EFAULT;
			goto out;
		}

		if (write)
			test_and_set_bit(PG_mte_tagged, &page->flags);

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-05-20 15:58     ` Steven Price
@ 2021-05-20 17:27       ` Catalin Marinas
  2021-05-21  9:42         ` Steven Price
  0 siblings, 1 reply; 49+ messages in thread
From: Catalin Marinas @ 2021-05-20 17:27 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Thu, May 20, 2021 at 04:58:01PM +0100, Steven Price wrote:
> On 20/05/2021 13:05, Catalin Marinas wrote:
> > On Mon, May 17, 2021 at 01:32:38PM +0100, Steven Price wrote:
> >> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> >> index e89a5e275e25..4b6c83beb75d 100644
> >> --- a/arch/arm64/kvm/arm.c
> >> +++ b/arch/arm64/kvm/arm.c
> >> @@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> >>  	}
> >>  }
> >>  
> >> +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
> >> +				      struct kvm_arm_copy_mte_tags *copy_tags)
> >> +{
> >> +	gpa_t guest_ipa = copy_tags->guest_ipa;
> >> +	size_t length = copy_tags->length;
> >> +	void __user *tags = copy_tags->addr;
> >> +	gpa_t gfn;
> >> +	bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
> >> +	int ret = 0;
> >> +
> >> +	if (copy_tags->reserved[0] || copy_tags->reserved[1])
> >> +		return -EINVAL;
> >> +
> >> +	if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
> >> +		return -EINVAL;
> >> +
> >> +	if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
> >> +		return -EINVAL;
> >> +
> >> +	gfn = gpa_to_gfn(guest_ipa);
> >> +
> >> +	mutex_lock(&kvm->slots_lock);
> >> +
> >> +	while (length > 0) {
> >> +		kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
> >> +		void *maddr;
> >> +		unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
> >> +
> >> +		if (is_error_noslot_pfn(pfn)) {
> >> +			ret = -EFAULT;
> >> +			goto out;
> >> +		}
> >> +
> >> +		maddr = page_address(pfn_to_page(pfn));
> >> +
> >> +		if (!write) {
> >> +			num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
> >> +			kvm_release_pfn_clean(pfn);
> > 
> > Do we need to check if PG_mte_tagged is set? If the page was not faulted
> > into the guest address space but the VMM has the page, does the
> > gfn_to_pfn_prot() guarantee that a kvm_set_spte_gfn() was called? If
> > not, this may read stale tags.
> 
> Ah, I hadn't thought about that... No I don't believe gfn_to_pfn_prot()
> will fault it into the guest.

It doesn't indeed. What it does is a get_user_pages() but it's not of
much help since the VMM pte wouldn't be tagged (we would have solved
lots of problems if we required PROT_MTE in the VMM...)

> >> +		} else {
> >> +			num_tags = mte_copy_tags_from_user(maddr, tags,
> >> +							   num_tags);
> >> +			kvm_release_pfn_dirty(pfn);
> >> +		}
> > 
> > Same question here, if the we can't guarantee the stage 2 pte being set,
> > we'd need to set PG_mte_tagged.
> 
> This is arguably worse as we'll be writing tags into the guest but
> without setting PG_mte_tagged - so they'll be lost when the guest then
> faults the pages in. Which sounds like it should break migration.
> 
> I think the below should be safe, and avoids the overhead of setting the
> flag just for reads.
> 
> Thanks,
> 
> Steve
> 
> ----8<----
> 		page = pfn_to_page(pfn);
> 		maddr = page_address(page);
> 
> 		if (!write) {
> 			if (test_bit(PG_mte_tagged, &page->flags))
> 				num_tags = mte_copy_tags_to_user(tags, maddr,
> 							MTE_GRANULES_PER_PAGE);
> 			else
> 				/* No tags in memory, so write zeros */
> 				num_tags = MTE_GRANULES_PER_PAGE -
> 					clear_user(tag, MTE_GRANULES_PER_PAGE);
> 			kvm_release_pfn_clean(pfn);

For ptrace we return a -EOPNOTSUPP if the address doesn't have VM_MTE
but I don't think it makes sense here, so I'm fine with clearing the
destination and assuming that the tags are zero (as they'd be on
faulting into the guest.

Another thing I forgot to ask, what's guaranteeing that the page
supports tags? Does this ioctl ensure that it would attempt the tag
copying from some device mapping? Do we need some kvm_is_device_pfn()
check? I guess ZONE_DEVICE memory we just refuse to map in an earlier
patch.

> 		} else {
> 			num_tags = mte_copy_tags_from_user(maddr, tags,
> 							MTE_GRANULES_PER_PAGE);
> 			kvm_release_pfn_dirty(pfn);
> 		}
> 
> 		if (num_tags != MTE_GRANULES_PER_PAGE) {
> 			ret = -EFAULT;
> 			goto out;
> 		}
> 
> 		if (write)
> 			test_and_set_bit(PG_mte_tagged, &page->flags);

I think a set_bit() would do, I doubt it's any more efficient. But why
not add it in the 'else' block above where we actually wrote the tags?
The copy function may have failed part-way through. Maybe your logic is
correct though, there are invalid tags in the page. Just add a comment.

-- 
Catalin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature
  2021-05-20 15:05     ` Steven Price
@ 2021-05-20 17:50       ` Catalin Marinas
  2021-05-21  9:28         ` Steven Price
  0 siblings, 1 reply; 49+ messages in thread
From: Catalin Marinas @ 2021-05-20 17:50 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Thu, May 20, 2021 at 04:05:46PM +0100, Steven Price wrote:
> On 20/05/2021 12:54, Catalin Marinas wrote:
> > On Mon, May 17, 2021 at 01:32:35PM +0100, Steven Price wrote:
> >> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> >> index c5d1f3c87dbd..8660f6a03f51 100644
> >> --- a/arch/arm64/kvm/mmu.c
> >> +++ b/arch/arm64/kvm/mmu.c
> >> @@ -822,6 +822,31 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
> >>  	return PAGE_SIZE;
> >>  }
> >>  
> >> +static int sanitise_mte_tags(struct kvm *kvm, unsigned long size,
> >> +			     kvm_pfn_t pfn)
> >> +{
> >> +	if (kvm_has_mte(kvm)) {
> >> +		/*
> >> +		 * The page will be mapped in stage 2 as Normal Cacheable, so
> >> +		 * the VM will be able to see the page's tags and therefore
> >> +		 * they must be initialised first. If PG_mte_tagged is set,
> >> +		 * tags have already been initialised.
> >> +		 */
> >> +		unsigned long i, nr_pages = size >> PAGE_SHIFT;
> >> +		struct page *page = pfn_to_online_page(pfn);
> >> +
> >> +		if (!page)
> >> +			return -EFAULT;
> > 
> > IIRC we ended up with pfn_to_online_page() to reject ZONE_DEVICE pages
> > that may be mapped into a guest and we have no idea whether they support
> > MTE. It may be worth adding a comment, otherwise, as Marc said, the page
> > wouldn't disappear.
> 
> I'll add a comment.
> 
> >> +
> >> +		for (i = 0; i < nr_pages; i++, page++) {
> >> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> >> +				mte_clear_page_tags(page_address(page));
> > 
> > We started the page->flags thread and ended up fixing it for the host
> > set_pte_at() as per the first patch:
> > 
> > https://lore.kernel.org/r/c3293d47-a5f2-ea4a-6730-f5cae26d8a7e@arm.com
> > 
> > Now, can we have a race between the stage 2 kvm_set_spte_gfn() and a
> > stage 1 set_pte_at()? Only the latter takes a lock. Or between two
> > kvm_set_spte_gfn() in different VMs? I think in the above thread we
> > concluded that there's only a problem if the page is shared between
> > multiple VMMs (MAP_SHARED). How realistic is this and what's the
> > workaround?
> > 
> > Either way, I think it's worth adding a comment here on the race on
> > page->flags as it looks strange that here it's just a test_and_set_bit()
> > while set_pte_at() uses a spinlock.
> > 
> 
> Very good point! I should have thought about that. I think splitting the
> test_and_set_bit() in two (as with the cache flush) is sufficient. While
> there technically still is a race which could lead to user space tags
> being clobbered:
> 
> a) It's very odd for a VMM to be doing an mprotect() after the fact to
> add PROT_MTE, or to be sharing the memory with another process which
> sets PROT_MTE.
> 
> b) The window for the race is incredibly small and the VMM (generally)
> needs to be robust against the guest changing tags anyway.
> 
> But I'll add a comment here as well:
> 
> 	/*
> 	 * There is a potential race between sanitising the
> 	 * flags here and user space using mprotect() to add
> 	 * PROT_MTE to access the tags, however by splitting
> 	 * the test/set the only risk is user space tags
> 	 * being overwritten by the mte_clear_page_tags() call.
> 	 */

I think (well, I haven't re-checked), an mprotect() in the VMM ends up
calling set_pte_at_notify() which would call kvm_set_spte_gfn() and that
will map the page in the guest. So the problem only appears between
different VMMs sharing the same page. In principle they can be
MAP_PRIVATE but they'd be CoW so the race wouldn't matter. So it's left
with MAP_SHARED between multiple VMMs.

I think we should just state that this is unsafe and they can delete
each-others tags. If we are really worried, we can export that lock you
added in mte.c.

-- 
Catalin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature
  2021-05-20 17:50       ` Catalin Marinas
@ 2021-05-21  9:28         ` Steven Price
  0 siblings, 0 replies; 49+ messages in thread
From: Steven Price @ 2021-05-21  9:28 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 20/05/2021 18:50, Catalin Marinas wrote:
> On Thu, May 20, 2021 at 04:05:46PM +0100, Steven Price wrote:
>> On 20/05/2021 12:54, Catalin Marinas wrote:
>>> On Mon, May 17, 2021 at 01:32:35PM +0100, Steven Price wrote:
>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>> index c5d1f3c87dbd..8660f6a03f51 100644
>>>> --- a/arch/arm64/kvm/mmu.c
>>>> +++ b/arch/arm64/kvm/mmu.c
>>>> @@ -822,6 +822,31 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>>>>  	return PAGE_SIZE;
>>>>  }
>>>>  
>>>> +static int sanitise_mte_tags(struct kvm *kvm, unsigned long size,
>>>> +			     kvm_pfn_t pfn)
>>>> +{
>>>> +	if (kvm_has_mte(kvm)) {
>>>> +		/*
>>>> +		 * The page will be mapped in stage 2 as Normal Cacheable, so
>>>> +		 * the VM will be able to see the page's tags and therefore
>>>> +		 * they must be initialised first. If PG_mte_tagged is set,
>>>> +		 * tags have already been initialised.
>>>> +		 */
>>>> +		unsigned long i, nr_pages = size >> PAGE_SHIFT;
>>>> +		struct page *page = pfn_to_online_page(pfn);
>>>> +
>>>> +		if (!page)
>>>> +			return -EFAULT;
>>>
>>> IIRC we ended up with pfn_to_online_page() to reject ZONE_DEVICE pages
>>> that may be mapped into a guest and we have no idea whether they support
>>> MTE. It may be worth adding a comment, otherwise, as Marc said, the page
>>> wouldn't disappear.
>>
>> I'll add a comment.
>>
>>>> +
>>>> +		for (i = 0; i < nr_pages; i++, page++) {
>>>> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>>>> +				mte_clear_page_tags(page_address(page));
>>>
>>> We started the page->flags thread and ended up fixing it for the host
>>> set_pte_at() as per the first patch:
>>>
>>> https://lore.kernel.org/r/c3293d47-a5f2-ea4a-6730-f5cae26d8a7e@arm.com
>>>
>>> Now, can we have a race between the stage 2 kvm_set_spte_gfn() and a
>>> stage 1 set_pte_at()? Only the latter takes a lock. Or between two
>>> kvm_set_spte_gfn() in different VMs? I think in the above thread we
>>> concluded that there's only a problem if the page is shared between
>>> multiple VMMs (MAP_SHARED). How realistic is this and what's the
>>> workaround?
>>>
>>> Either way, I think it's worth adding a comment here on the race on
>>> page->flags as it looks strange that here it's just a test_and_set_bit()
>>> while set_pte_at() uses a spinlock.
>>>
>>
>> Very good point! I should have thought about that. I think splitting the
>> test_and_set_bit() in two (as with the cache flush) is sufficient. While
>> there technically still is a race which could lead to user space tags
>> being clobbered:
>>
>> a) It's very odd for a VMM to be doing an mprotect() after the fact to
>> add PROT_MTE, or to be sharing the memory with another process which
>> sets PROT_MTE.
>>
>> b) The window for the race is incredibly small and the VMM (generally)
>> needs to be robust against the guest changing tags anyway.
>>
>> But I'll add a comment here as well:
>>
>> 	/*
>> 	 * There is a potential race between sanitising the
>> 	 * flags here and user space using mprotect() to add
>> 	 * PROT_MTE to access the tags, however by splitting
>> 	 * the test/set the only risk is user space tags
>> 	 * being overwritten by the mte_clear_page_tags() call.
>> 	 */
> 
> I think (well, I haven't re-checked), an mprotect() in the VMM ends up
> calling set_pte_at_notify() which would call kvm_set_spte_gfn() and that
> will map the page in the guest. So the problem only appears between
> different VMMs sharing the same page. In principle they can be
> MAP_PRIVATE but they'd be CoW so the race wouldn't matter. So it's left
> with MAP_SHARED between multiple VMMs.

mprotect.c only has a call to set_pte_at() not set_pte_at_notify(). And
AFAICT the MMU notifiers are called to invalidate only in
change_pmd_range(). So the stage 2 mappings would be invalidated rather
than populated. However I believe this should cause synchronisation
because of the KVM mmu_lock. So from my reading you are right an
mprotect() can't race.

MAP_SHARED between multiple VMs is then the only potential problem.

> I think we should just state that this is unsafe and they can delete
> each-others tags. If we are really worried, we can export that lock you
> added in mte.c.
> 

I'll just update the comment for now.

Thanks,

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-05-20 17:27       ` Catalin Marinas
@ 2021-05-21  9:42         ` Steven Price
  2021-05-24 18:11           ` Catalin Marinas
  0 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-21  9:42 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 20/05/2021 18:27, Catalin Marinas wrote:
> On Thu, May 20, 2021 at 04:58:01PM +0100, Steven Price wrote:
>> On 20/05/2021 13:05, Catalin Marinas wrote:
>>> On Mon, May 17, 2021 at 01:32:38PM +0100, Steven Price wrote:
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index e89a5e275e25..4b6c83beb75d 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>>>>  	}
>>>>  }
>>>>  
>>>> +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>>>> +				      struct kvm_arm_copy_mte_tags *copy_tags)
>>>> +{
>>>> +	gpa_t guest_ipa = copy_tags->guest_ipa;
>>>> +	size_t length = copy_tags->length;
>>>> +	void __user *tags = copy_tags->addr;
>>>> +	gpa_t gfn;
>>>> +	bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
>>>> +	int ret = 0;
>>>> +
>>>> +	if (copy_tags->reserved[0] || copy_tags->reserved[1])
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
>>>> +		return -EINVAL;
>>>> +
>>>> +	gfn = gpa_to_gfn(guest_ipa);
>>>> +
>>>> +	mutex_lock(&kvm->slots_lock);
>>>> +
>>>> +	while (length > 0) {
>>>> +		kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
>>>> +		void *maddr;
>>>> +		unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
>>>> +
>>>> +		if (is_error_noslot_pfn(pfn)) {
>>>> +			ret = -EFAULT;
>>>> +			goto out;
>>>> +		}
>>>> +
>>>> +		maddr = page_address(pfn_to_page(pfn));
>>>> +
>>>> +		if (!write) {
>>>> +			num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
>>>> +			kvm_release_pfn_clean(pfn);
>>>
>>> Do we need to check if PG_mte_tagged is set? If the page was not faulted
>>> into the guest address space but the VMM has the page, does the
>>> gfn_to_pfn_prot() guarantee that a kvm_set_spte_gfn() was called? If
>>> not, this may read stale tags.
>>
>> Ah, I hadn't thought about that... No I don't believe gfn_to_pfn_prot()
>> will fault it into the guest.
> 
> It doesn't indeed. What it does is a get_user_pages() but it's not of
> much help since the VMM pte wouldn't be tagged (we would have solved
> lots of problems if we required PROT_MTE in the VMM...)

Sadly it solves some problems and creates others :(

>>>> +		} else {
>>>> +			num_tags = mte_copy_tags_from_user(maddr, tags,
>>>> +							   num_tags);
>>>> +			kvm_release_pfn_dirty(pfn);
>>>> +		}
>>>
>>> Same question here, if the we can't guarantee the stage 2 pte being set,
>>> we'd need to set PG_mte_tagged.
>>
>> This is arguably worse as we'll be writing tags into the guest but
>> without setting PG_mte_tagged - so they'll be lost when the guest then
>> faults the pages in. Which sounds like it should break migration.
>>
>> I think the below should be safe, and avoids the overhead of setting the
>> flag just for reads.
>>
>> Thanks,
>>
>> Steve
>>
>> ----8<----
>> 		page = pfn_to_page(pfn);
>> 		maddr = page_address(page);
>>
>> 		if (!write) {
>> 			if (test_bit(PG_mte_tagged, &page->flags))
>> 				num_tags = mte_copy_tags_to_user(tags, maddr,
>> 							MTE_GRANULES_PER_PAGE);
>> 			else
>> 				/* No tags in memory, so write zeros */
>> 				num_tags = MTE_GRANULES_PER_PAGE -
>> 					clear_user(tag, MTE_GRANULES_PER_PAGE);
>> 			kvm_release_pfn_clean(pfn);
> 
> For ptrace we return a -EOPNOTSUPP if the address doesn't have VM_MTE
> but I don't think it makes sense here, so I'm fine with clearing the
> destination and assuming that the tags are zero (as they'd be on
> faulting into the guest.

Yeah - conceptually all pages in an MTE-enabled guest have tags. It's
just we don't actually populate the physical memory until the guest
tries to touch them. So it makes sense to just return zeros here.
Alternatively we could populate the physical tags but that seems
unnecessary.

> Another thing I forgot to ask, what's guaranteeing that the page
> supports tags? Does this ioctl ensure that it would attempt the tag
> copying from some device mapping? Do we need some kvm_is_device_pfn()
> check? I guess ZONE_DEVICE memory we just refuse to map in an earlier
> patch.

Hmm, nothing much. While reads are now fine (the memory won't have
PG_mte_tagged), writes could potentially happen on ZONE_DEVICE memory.

The fix is to just replace pfn_to_page() with pfn_to_online_page() and
handle the error.

>> 		} else {
>> 			num_tags = mte_copy_tags_from_user(maddr, tags,
>> 							MTE_GRANULES_PER_PAGE);
>> 			kvm_release_pfn_dirty(pfn);
>> 		}
>>
>> 		if (num_tags != MTE_GRANULES_PER_PAGE) {
>> 			ret = -EFAULT;
>> 			goto out;
>> 		}
>>
>> 		if (write)
>> 			test_and_set_bit(PG_mte_tagged, &page->flags);
> 
> I think a set_bit() would do, I doubt it's any more efficient. But why

I'd seen test_and_set_bit() used elsewhere (I forget where now) as a
slightly more efficient approach. It complies down to a READ_ONCE and a
conditional atomic, vs a single non-conditional atomic. But I don't have
any actual data on the performance and this isn't a hot path, so I'll
switch to the more obvious set_bit().

> not add it in the 'else' block above where we actually wrote the tags?
> The copy function may have failed part-way through. Maybe your logic is
> correct though, there are invalid tags in the page. Just add a comment.

Yeah it's in case the write fails part way through - we don't want to
expose the tags which were not written. I'll add a comment to explain that.

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-05-21  9:42         ` Steven Price
@ 2021-05-24 18:11           ` Catalin Marinas
  2021-05-27  7:50             ` Steven Price
  0 siblings, 1 reply; 49+ messages in thread
From: Catalin Marinas @ 2021-05-24 18:11 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Fri, May 21, 2021 at 10:42:09AM +0100, Steven Price wrote:
> On 20/05/2021 18:27, Catalin Marinas wrote:
> > On Thu, May 20, 2021 at 04:58:01PM +0100, Steven Price wrote:
> >> On 20/05/2021 13:05, Catalin Marinas wrote:
> >>> On Mon, May 17, 2021 at 01:32:38PM +0100, Steven Price wrote:
> >>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> >>>> index e89a5e275e25..4b6c83beb75d 100644
> >>>> --- a/arch/arm64/kvm/arm.c
> >>>> +++ b/arch/arm64/kvm/arm.c
> >>>> @@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> >>>>  	}
> >>>>  }
> >>>>  
> >>>> +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
> >>>> +				      struct kvm_arm_copy_mte_tags *copy_tags)
> >>>> +{
> >>>> +	gpa_t guest_ipa = copy_tags->guest_ipa;
> >>>> +	size_t length = copy_tags->length;
> >>>> +	void __user *tags = copy_tags->addr;
> >>>> +	gpa_t gfn;
> >>>> +	bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
> >>>> +	int ret = 0;
> >>>> +
> >>>> +	if (copy_tags->reserved[0] || copy_tags->reserved[1])
> >>>> +		return -EINVAL;
> >>>> +
> >>>> +	if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
> >>>> +		return -EINVAL;
> >>>> +
> >>>> +	if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
> >>>> +		return -EINVAL;
> >>>> +
> >>>> +	gfn = gpa_to_gfn(guest_ipa);
> >>>> +
> >>>> +	mutex_lock(&kvm->slots_lock);
> >>>> +
> >>>> +	while (length > 0) {
> >>>> +		kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
> >>>> +		void *maddr;
> >>>> +		unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
> >>>> +
> >>>> +		if (is_error_noslot_pfn(pfn)) {
> >>>> +			ret = -EFAULT;
> >>>> +			goto out;
> >>>> +		}
> >>>> +
> >>>> +		maddr = page_address(pfn_to_page(pfn));
> >>>> +
> >>>> +		if (!write) {
> >>>> +			num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
> >>>> +			kvm_release_pfn_clean(pfn);
> >>>
> >>> Do we need to check if PG_mte_tagged is set? If the page was not faulted
> >>> into the guest address space but the VMM has the page, does the
> >>> gfn_to_pfn_prot() guarantee that a kvm_set_spte_gfn() was called? If
> >>> not, this may read stale tags.
> >>
> >> Ah, I hadn't thought about that... No I don't believe gfn_to_pfn_prot()
> >> will fault it into the guest.
> > 
> > It doesn't indeed. What it does is a get_user_pages() but it's not of
> > much help since the VMM pte wouldn't be tagged (we would have solved
> > lots of problems if we required PROT_MTE in the VMM...)
> 
> Sadly it solves some problems and creates others :(

I had some (random) thoughts on how to make things simpler, maybe. I
think most of these races would have been solved if we required PROT_MTE
in the VMM but this has an impact on the VMM if it wants to use MTE
itself. If such requirement was in place, all KVM needed to do is check
PG_mte_tagged.

So what we actually need is a set_pte_at() in the VMM to clear the tags
and set PG_mte_tagged. Currently, we only do this if the memory type is
tagged (PROT_MTE) but it's not strictly necessary.

As an optimisation for normal programs, we don't want to do this all the
time but the visible behaviour wouldn't change (well, maybe for ptrace
slightly). However, it doesn't mean we couldn't for a VMM, with an
opt-in via prctl(). This would add a MMCF_MTE_TAG_INIT bit (couldn't
think of a better name) to mm_context_t.flags and set_pte_at() would
behave as if the pte was tagged without actually mapping the memory in
user space as tagged (protection flags not changed). Pages that don't
support tagging are still safe, just some unnecessary ignored tag
writes. This would need to be set before the mmap() for the guest
memory.

If we want finer-grained control we'd have to store this information in
the vma flags, in addition to VM_MTE (e.g. VM_MTE_TAG_INIT) but without
affecting the actual memory type. The easiest would be another pte bit,
though we are short on them. A more intrusive (not too bad) approach is
to introduce a set_pte_at_vma() and read the flags directly in the arch
code. In most places where set_pte_at() is called on a user mm, the vma
is also available.

Anyway, I'm not saying we go this route, just thinking out loud, get
some opinions.

> > Another thing I forgot to ask, what's guaranteeing that the page
> > supports tags? Does this ioctl ensure that it would attempt the tag
> > copying from some device mapping? Do we need some kvm_is_device_pfn()
> > check? I guess ZONE_DEVICE memory we just refuse to map in an earlier
> > patch.
> 
> Hmm, nothing much. While reads are now fine (the memory won't have
> PG_mte_tagged), writes could potentially happen on ZONE_DEVICE memory.

I don't think it's a problem for writes either as the host wouldn't map
such memory as tagged. It's just that it returns zeros and writes are
ignored, so we could instead return an error (I haven't checked your
latest series yet).

> >> 		} else {
> >> 			num_tags = mte_copy_tags_from_user(maddr, tags,
> >> 							MTE_GRANULES_PER_PAGE);
> >> 			kvm_release_pfn_dirty(pfn);
> >> 		}
> >>
> >> 		if (num_tags != MTE_GRANULES_PER_PAGE) {
> >> 			ret = -EFAULT;
> >> 			goto out;
> >> 		}
> >>
> >> 		if (write)
> >> 			test_and_set_bit(PG_mte_tagged, &page->flags);
> > 
> > I think a set_bit() would do, I doubt it's any more efficient. But why
> 
> I'd seen test_and_set_bit() used elsewhere (I forget where now) as a
> slightly more efficient approach. It complies down to a READ_ONCE and a
> conditional atomic, vs a single non-conditional atomic. But I don't have
> any actual data on the performance and this isn't a hot path, so I'll
> switch to the more obvious set_bit().

Yeah, I think I've seen this as well. Anyway, it's probably lost in the
noise of tag writing here.

-- 
Catalin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-05-24 18:11           ` Catalin Marinas
@ 2021-05-27  7:50             ` Steven Price
  2021-05-27 13:08               ` Catalin Marinas
  0 siblings, 1 reply; 49+ messages in thread
From: Steven Price @ 2021-05-27  7:50 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 24/05/2021 19:11, Catalin Marinas wrote:
> On Fri, May 21, 2021 at 10:42:09AM +0100, Steven Price wrote:
>> On 20/05/2021 18:27, Catalin Marinas wrote:
>>> On Thu, May 20, 2021 at 04:58:01PM +0100, Steven Price wrote:
>>>> On 20/05/2021 13:05, Catalin Marinas wrote:
>>>>> On Mon, May 17, 2021 at 01:32:38PM +0100, Steven Price wrote:
>>>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>>>> index e89a5e275e25..4b6c83beb75d 100644
>>>>>> --- a/arch/arm64/kvm/arm.c
>>>>>> +++ b/arch/arm64/kvm/arm.c
>>>>>> @@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>>>>>>  	}
>>>>>>  }
>>>>>>  
>>>>>> +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>>>>>> +				      struct kvm_arm_copy_mte_tags *copy_tags)
>>>>>> +{
>>>>>> +	gpa_t guest_ipa = copy_tags->guest_ipa;
>>>>>> +	size_t length = copy_tags->length;
>>>>>> +	void __user *tags = copy_tags->addr;
>>>>>> +	gpa_t gfn;
>>>>>> +	bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
>>>>>> +	int ret = 0;
>>>>>> +
>>>>>> +	if (copy_tags->reserved[0] || copy_tags->reserved[1])
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	gfn = gpa_to_gfn(guest_ipa);
>>>>>> +
>>>>>> +	mutex_lock(&kvm->slots_lock);
>>>>>> +
>>>>>> +	while (length > 0) {
>>>>>> +		kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
>>>>>> +		void *maddr;
>>>>>> +		unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
>>>>>> +
>>>>>> +		if (is_error_noslot_pfn(pfn)) {
>>>>>> +			ret = -EFAULT;
>>>>>> +			goto out;
>>>>>> +		}
>>>>>> +
>>>>>> +		maddr = page_address(pfn_to_page(pfn));
>>>>>> +
>>>>>> +		if (!write) {
>>>>>> +			num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
>>>>>> +			kvm_release_pfn_clean(pfn);
>>>>>
>>>>> Do we need to check if PG_mte_tagged is set? If the page was not faulted
>>>>> into the guest address space but the VMM has the page, does the
>>>>> gfn_to_pfn_prot() guarantee that a kvm_set_spte_gfn() was called? If
>>>>> not, this may read stale tags.
>>>>
>>>> Ah, I hadn't thought about that... No I don't believe gfn_to_pfn_prot()
>>>> will fault it into the guest.
>>>
>>> It doesn't indeed. What it does is a get_user_pages() but it's not of
>>> much help since the VMM pte wouldn't be tagged (we would have solved
>>> lots of problems if we required PROT_MTE in the VMM...)
>>
>> Sadly it solves some problems and creates others :(
> 
> I had some (random) thoughts on how to make things simpler, maybe. I
> think most of these races would have been solved if we required PROT_MTE
> in the VMM but this has an impact on the VMM if it wants to use MTE
> itself. If such requirement was in place, all KVM needed to do is check
> PG_mte_tagged.
> 
> So what we actually need is a set_pte_at() in the VMM to clear the tags
> and set PG_mte_tagged. Currently, we only do this if the memory type is
> tagged (PROT_MTE) but it's not strictly necessary.
> 
> As an optimisation for normal programs, we don't want to do this all the
> time but the visible behaviour wouldn't change (well, maybe for ptrace
> slightly). However, it doesn't mean we couldn't for a VMM, with an
> opt-in via prctl(). This would add a MMCF_MTE_TAG_INIT bit (couldn't
> think of a better name) to mm_context_t.flags and set_pte_at() would
> behave as if the pte was tagged without actually mapping the memory in
> user space as tagged (protection flags not changed). Pages that don't
> support tagging are still safe, just some unnecessary ignored tag
> writes. This would need to be set before the mmap() for the guest
> memory.
> 
> If we want finer-grained control we'd have to store this information in
> the vma flags, in addition to VM_MTE (e.g. VM_MTE_TAG_INIT) but without
> affecting the actual memory type. The easiest would be another pte bit,
> though we are short on them. A more intrusive (not too bad) approach is
> to introduce a set_pte_at_vma() and read the flags directly in the arch
> code. In most places where set_pte_at() is called on a user mm, the vma
> is also available.
> 
> Anyway, I'm not saying we go this route, just thinking out loud, get
> some opinions.

Does get_user_pages() actually end up calling set_pte_at() normally? If
not then on the normal user_mem_abort() route although we can easily
check VM_MTE_TAG_INIT there's no obvious place to hook in to ensure that
the pages actually allocated have the PG_mte_tagged flag.

I'm also not sure how well this would work with the MMU notifiers path
in KVM. With MMU notifiers (i.e. the VMM replacing a page in the
memslot) there's not even an obvious hook to enforce the VMA flag. So I
think we'd end up with something like the sanitise_mte_tags() function
to at least check that the PG_mte_tagged flag is set on the pages
(assuming that the trigger for the MMU notifier has done the
corresponding set_pte_at()). Admittedly this might close the current
race documented there.

It also feels wrong to me to tie this to a process with prctl(), it
seems much more normal to implement this as a new mprotect() flag as
this is really a memory property not a process property. And I think
we'll find some scary corner cases if we try to associate everything
back to a process - although I can't instantly think of anything that
will actually break.

>>> Another thing I forgot to ask, what's guaranteeing that the page
>>> supports tags? Does this ioctl ensure that it would attempt the tag
>>> copying from some device mapping? Do we need some kvm_is_device_pfn()
>>> check? I guess ZONE_DEVICE memory we just refuse to map in an earlier
>>> patch.
>>
>> Hmm, nothing much. While reads are now fine (the memory won't have
>> PG_mte_tagged), writes could potentially happen on ZONE_DEVICE memory.
> 
> I don't think it's a problem for writes either as the host wouldn't map
> such memory as tagged. It's just that it returns zeros and writes are
> ignored, so we could instead return an error (I haven't checked your
> latest series yet).

The latest series uses pfn_to_online_page() to reject ZONE_DEVICE early.

>>>> 		} else {
>>>> 			num_tags = mte_copy_tags_from_user(maddr, tags,
>>>> 							MTE_GRANULES_PER_PAGE);
>>>> 			kvm_release_pfn_dirty(pfn);
>>>> 		}
>>>>
>>>> 		if (num_tags != MTE_GRANULES_PER_PAGE) {
>>>> 			ret = -EFAULT;
>>>> 			goto out;
>>>> 		}
>>>>
>>>> 		if (write)
>>>> 			test_and_set_bit(PG_mte_tagged, &page->flags);
>>>
>>> I think a set_bit() would do, I doubt it's any more efficient. But why
>>
>> I'd seen test_and_set_bit() used elsewhere (I forget where now) as a
>> slightly more efficient approach. It complies down to a READ_ONCE and a
>> conditional atomic, vs a single non-conditional atomic. But I don't have
>> any actual data on the performance and this isn't a hot path, so I'll
>> switch to the more obvious set_bit().
> 
> Yeah, I think I've seen this as well. Anyway, it's probably lost in the
> noise of tag writing here.
> 

Agreed.

Thanks,

Steve

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-05-27  7:50             ` Steven Price
@ 2021-05-27 13:08               ` Catalin Marinas
  0 siblings, 0 replies; 49+ messages in thread
From: Catalin Marinas @ 2021-05-27 13:08 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Thu, May 27, 2021 at 08:50:30AM +0100, Steven Price wrote:
> On 24/05/2021 19:11, Catalin Marinas wrote:
> > I had some (random) thoughts on how to make things simpler, maybe. I
> > think most of these races would have been solved if we required PROT_MTE
> > in the VMM but this has an impact on the VMM if it wants to use MTE
> > itself. If such requirement was in place, all KVM needed to do is check
> > PG_mte_tagged.
> > 
> > So what we actually need is a set_pte_at() in the VMM to clear the tags
> > and set PG_mte_tagged. Currently, we only do this if the memory type is
> > tagged (PROT_MTE) but it's not strictly necessary.
> > 
> > As an optimisation for normal programs, we don't want to do this all the
> > time but the visible behaviour wouldn't change (well, maybe for ptrace
> > slightly). However, it doesn't mean we couldn't for a VMM, with an
> > opt-in via prctl(). This would add a MMCF_MTE_TAG_INIT bit (couldn't
> > think of a better name) to mm_context_t.flags and set_pte_at() would
> > behave as if the pte was tagged without actually mapping the memory in
> > user space as tagged (protection flags not changed). Pages that don't
> > support tagging are still safe, just some unnecessary ignored tag
> > writes. This would need to be set before the mmap() for the guest
> > memory.
> > 
> > If we want finer-grained control we'd have to store this information in
> > the vma flags, in addition to VM_MTE (e.g. VM_MTE_TAG_INIT) but without
> > affecting the actual memory type. The easiest would be another pte bit,
> > though we are short on them. A more intrusive (not too bad) approach is
> > to introduce a set_pte_at_vma() and read the flags directly in the arch
> > code. In most places where set_pte_at() is called on a user mm, the vma
> > is also available.
> > 
> > Anyway, I'm not saying we go this route, just thinking out loud, get
> > some opinions.
> 
> Does get_user_pages() actually end up calling set_pte_at() normally?

Not always, at least as how it's called from hva_to_pfn(). My reading of
the get_user_page_fast_only() is that it doesn't touch the pte, just
walks the page tables and pins the page. Of course, it expects a valid
pte to have been set in the VMM already, otherwise it doesn't pin any
page and the caller falls back to the slow path.

The slow path, get_user_pages_unlocked(), passes FOLL_TOUCH and
set_pte_at() will be called either in follow_pfn_pte() if it was valid
or via faultin_page() -> handle_mm_fault().

> If not then on the normal user_mem_abort() route although we can
> easily check VM_MTE_TAG_INIT there's no obvious place to hook in to
> ensure that the pages actually allocated have the PG_mte_tagged flag.

I don't think it helps if we checked such vma flag in user_mem_abort(),
we'd still have the race with set_pte_at() on the page flags. What I was
trying to avoid is touching the page flags in too many places, so
deferring this always to set_pte_at() in the VMM.

> I'm also not sure how well this would work with the MMU notifiers path
> in KVM. With MMU notifiers (i.e. the VMM replacing a page in the
> memslot) there's not even an obvious hook to enforce the VMA flag. So I
> think we'd end up with something like the sanitise_mte_tags() function
> to at least check that the PG_mte_tagged flag is set on the pages
> (assuming that the trigger for the MMU notifier has done the
> corresponding set_pte_at()). Admittedly this might close the current
> race documented there.

If we kept this check to the VMM set_pte_at(), I think we can ignore the
notifiers.

> It also feels wrong to me to tie this to a process with prctl(), it
> seems much more normal to implement this as a new mprotect() flag as
> this is really a memory property not a process property. And I think
> we'll find some scary corner cases if we try to associate everything
> back to a process - although I can't instantly think of anything that
> will actually break.

I agree, tying it to the process looks wrong, only that it's less
intrusive. I don't think it would break anything, only potential
performance regression. A process would still need to pass PROT_MTE to
be able to get tag checking. That's basically what I had in an early MTE
implementation with clear_user_page() always zeroing the tags.

I agree with you that a vma flag would be better but it's more
complicated without an additional pte bit. We could also miss some
updates as mprotect() for example checks for pte_same() before calling
set_pte_at() (it would need to check the updated vma flags).

I'll review the latest series but I'm tempted to move the logic in
santise_mte_tags() to mte.c and take the big lock in there if
PG_mte_tagged is not already set. If we hit performance issues, we can
optimise this later to have the page flag set already on creation (new
PROT flag, prctl etc.).

-- 
Catalin

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2021-05-27 13:08 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-17 12:32 [PATCH v12 0/8] MTE support for KVM guest Steven Price
2021-05-17 12:32 ` [PATCH v12 1/8] arm64: mte: Handle race when synchronising tags Steven Price
2021-05-17 14:03   ` Marc Zyngier
2021-05-17 14:56     ` Steven Price
2021-05-19 17:32   ` Catalin Marinas
2021-05-17 12:32 ` [PATCH v12 2/8] arm64: Handle MTE tags zeroing in __alloc_zeroed_user_highpage() Steven Price
2021-05-17 12:32 ` [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged Steven Price
2021-05-17 16:14   ` Marc Zyngier
2021-05-19  9:32     ` Steven Price
2021-05-19 17:48       ` Catalin Marinas
2021-05-19 18:06   ` Catalin Marinas
2021-05-20 11:55     ` Steven Price
2021-05-20 12:25       ` Catalin Marinas
2021-05-20 13:02         ` Catalin Marinas
2021-05-20 13:03         ` Steven Price
2021-05-17 12:32 ` [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature Steven Price
2021-05-17 16:45   ` Marc Zyngier
2021-05-19 10:48     ` Steven Price
2021-05-20  8:51       ` Marc Zyngier
2021-05-20 14:46         ` Steven Price
2021-05-20 11:54   ` Catalin Marinas
2021-05-20 15:05     ` Steven Price
2021-05-20 17:50       ` Catalin Marinas
2021-05-21  9:28         ` Steven Price
2021-05-17 12:32 ` [PATCH v12 5/8] arm64: kvm: Save/restore MTE registers Steven Price
2021-05-17 17:17   ` Marc Zyngier
2021-05-19 13:04     ` Steven Price
2021-05-20  9:46       ` Marc Zyngier
2021-05-20 15:21         ` Steven Price
2021-05-17 12:32 ` [PATCH v12 6/8] arm64: kvm: Expose KVM_ARM_CAP_MTE Steven Price
2021-05-17 17:40   ` Marc Zyngier
2021-05-19 13:26     ` Steven Price
2021-05-20 10:09       ` Marc Zyngier
2021-05-20 10:51         ` Steven Price
2021-05-17 12:32 ` [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest Steven Price
2021-05-17 18:04   ` Marc Zyngier
2021-05-19 13:51     ` Steven Price
2021-05-20 12:05   ` Catalin Marinas
2021-05-20 15:58     ` Steven Price
2021-05-20 17:27       ` Catalin Marinas
2021-05-21  9:42         ` Steven Price
2021-05-24 18:11           ` Catalin Marinas
2021-05-27  7:50             ` Steven Price
2021-05-27 13:08               ` Catalin Marinas
2021-05-17 12:32 ` [PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl Steven Price
2021-05-17 18:09   ` Marc Zyngier
2021-05-19 14:09     ` Steven Price
2021-05-20 10:24       ` Marc Zyngier
2021-05-20 10:52         ` Steven Price

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).