linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v11 0/6] MTE support for KVM guest
@ 2021-04-16 15:43 Steven Price
  2021-04-16 15:43 ` [PATCH v11 1/6] arm64: mte: Sync tags for pages where PTE is untagged Steven Price
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Steven Price @ 2021-04-16 15:43 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

I know it's likely to be the merge window next week, but since there
were a couple of changes from Catalin's review I thought I'd send
another version out - there are some minor conflicts with what's
currently in -next so I'll rebase after -rc1.

This series adds support for using the Arm Memory Tagging Extensions
(MTE) in a KVM guest.
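
In case it helps review, the intended VMM usage is roughly as in the
sketch below. This is illustrative only and not part of the series: the
helper names, vm_fd and buffer handling are placeholders, error handling
is omitted, and the MTE definitions come from the uAPI added later in
this series (patches 2, 5 and 6); KVM_ENABLE_CAP / struct kvm_enable_cap
are the existing KVM interfaces.

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  static int enable_mte(int vm_fd)
  {
          struct kvm_enable_cap cap = { .cap = KVM_CAP_ARM_MTE };

          /* VM-scoped capability: must be enabled before creating vCPUs */
          return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }

  static int save_tags(int vm_fd, __u64 guest_ipa, __u64 len, void *buf)
  {
          struct kvm_arm_copy_mte_tags copy = {
                  .guest_ipa = guest_ipa,  /* must be PAGE_SIZE aligned */
                  .length    = len,        /* must be PAGE_SIZE aligned */
                  .addr      = buf,        /* len / MTE_GRANULE_SIZE bytes */
                  .flags     = KVM_ARM_TAGS_FROM_GUEST,
          };

          /* e.g. used by the VMM during migration to read out guest tags */
          return ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &copy);
  }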

This version is rebased on v5.12-rc2.

Changes since v10[1]:

 * Replace pte_valid_user() with (pte_val(pte) & PTE_USER) in
   set_pte_at() as the former has been removed with the EPAN patches.
 * Don't attempt to check tags on memory which is going to be mapped in
   stage 2 as DEVICE as the guest won't be able to see the tags.
 * Use pfn_to_online_page() instead of pfn_to_page() in user_mem_abort()
   to prevent ZONE_DEVICE memory being populated in stage 2 if tags are
   enabled.

[1] https://lore.kernel.org/r/20210312151902.17853-1-steven.price%40arm.com

Steven Price (6):
  arm64: mte: Sync tags for pages where PTE is untagged
  arm64: kvm: Introduce MTE VM feature
  arm64: kvm: Save/restore MTE registers
  arm64: kvm: Expose KVM_ARM_CAP_MTE
  KVM: arm64: ioctl to fetch/store tags in a guest
  KVM: arm64: Document MTE capability and ioctl

 Documentation/virt/kvm/api.rst             | 53 +++++++++++++++
 arch/arm64/include/asm/kvm_emulate.h       |  3 +
 arch/arm64/include/asm/kvm_host.h          |  9 +++
 arch/arm64/include/asm/kvm_mte.h           | 66 ++++++++++++++++++
 arch/arm64/include/asm/pgtable.h           |  2 +-
 arch/arm64/include/asm/sysreg.h            |  3 +-
 arch/arm64/include/uapi/asm/kvm.h          | 14 ++++
 arch/arm64/kernel/asm-offsets.c            |  3 +
 arch/arm64/kernel/mte.c                    | 16 +++--
 arch/arm64/kvm/arm.c                       | 78 ++++++++++++++++++++++
 arch/arm64/kvm/hyp/entry.S                 |  7 ++
 arch/arm64/kvm/hyp/exception.c             |  3 +-
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 ++++++
 arch/arm64/kvm/mmu.c                       | 20 ++++++
 arch/arm64/kvm/sys_regs.c                  | 28 ++++++--
 include/uapi/linux/kvm.h                   |  2 +
 16 files changed, 317 insertions(+), 11 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

-- 
2.20.1



* [PATCH v11 1/6] arm64: mte: Sync tags for pages where PTE is untagged
  2021-04-16 15:43 [PATCH v11 0/6] MTE support for KVM guest Steven Price
@ 2021-04-16 15:43 ` Steven Price
  2021-04-27 17:43   ` Catalin Marinas
  2021-04-16 15:43 ` [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature Steven Price
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Steven Price @ 2021-04-16 15:43 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we will
need to check to see if there are any saved tags even if !pte_tagged().

However don't check pages which are !pte_valid_user() as these will
not have been swapped out.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/pgtable.h |  2 +-
 arch/arm64/kernel/mte.c          | 16 ++++++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e17b96d0e4b5..cf4b52a33b3c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 		__sync_icache_dcache(pte);
 
 	if (system_supports_mte() &&
-	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
+	    pte_present(pte) && (pte_val(pte) & PTE_USER) && !pte_special(pte))
 		mte_sync_tags(ptep, pte);
 
 	__check_racy_pte_update(mm, ptep, pte);
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index b3c70a612c7a..e016ab57ea36 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -26,17 +26,23 @@ u64 gcr_kernel_excl __ro_after_init;
 
 static bool report_fault_once = true;
 
-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
+static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
+			       bool pte_is_tagged)
 {
 	pte_t old_pte = READ_ONCE(*ptep);
 
 	if (check_swap && is_swap_pte(old_pte)) {
 		swp_entry_t entry = pte_to_swp_entry(old_pte);
 
-		if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
+		if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+			set_bit(PG_mte_tagged, &page->flags);
 			return;
+		}
 	}
 
+	if (!pte_is_tagged || test_and_set_bit(PG_mte_tagged, &page->flags))
+		return;
+
 	page_kasan_tag_reset(page);
 	/*
 	 * We need smp_wmb() in between setting the flags and clearing the
@@ -54,11 +60,13 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
 	struct page *page = pte_page(pte);
 	long i, nr_pages = compound_nr(page);
 	bool check_swap = nr_pages == 1;
+	bool pte_is_tagged = pte_tagged(pte);
 
 	/* if PG_mte_tagged is set, tags have already been initialised */
 	for (i = 0; i < nr_pages; i++, page++) {
-		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
-			mte_sync_page_tags(page, ptep, check_swap);
+		if (!test_bit(PG_mte_tagged, &page->flags))
+			mte_sync_page_tags(page, ptep, check_swap,
+					   pte_is_tagged);
 	}
 }
 
-- 
2.20.1



* [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
  2021-04-16 15:43 [PATCH v11 0/6] MTE support for KVM guest Steven Price
  2021-04-16 15:43 ` [PATCH v11 1/6] arm64: mte: Sync tags for pages where PTE is untagged Steven Price
@ 2021-04-16 15:43 ` Steven Price
  2021-04-28 17:07   ` Catalin Marinas
  2021-04-16 15:43 ` [PATCH v11 3/6] arm64: kvm: Save/restore MTE registers Steven Price
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Steven Price @ 2021-04-16 15:43 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/kvm_emulate.h |  3 +++
 arch/arm64/include/asm/kvm_host.h    |  3 +++
 arch/arm64/kvm/hyp/exception.c       |  3 ++-
 arch/arm64/kvm/mmu.c                 | 20 ++++++++++++++++++++
 arch/arm64/kvm/sys_regs.c            |  3 +++
 include/uapi/linux/kvm.h             |  1 +
 6 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index f612c090f2e4..6bf776c2399c 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
 	if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
 	    vcpu_el1_is_32bit(vcpu))
 		vcpu->arch.hcr_el2 |= HCR_TID2;
+
+	if (kvm_has_mte(vcpu->kvm))
+		vcpu->arch.hcr_el2 |= HCR_ATA;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 3d10e6527f7d..1170ee137096 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@ struct kvm_arch {
 
 	u8 pfr0_csv2;
 	u8 pfr0_csv3;
+	/* Memory Tagging Extension enabled for the guest */
+	bool mte_enabled;
 };
 
 struct kvm_vcpu_fault_info {
@@ -767,6 +769,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
 #define kvm_arm_vcpu_sve_finalized(vcpu) \
 	((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
 
+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
 #define kvm_vcpu_has_pmu(vcpu)					\
 	(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
 
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 73629094f903..56426565600c 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
 	new |= (old & PSR_C_BIT);
 	new |= (old & PSR_V_BIT);
 
-	// TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+	if (kvm_has_mte(vcpu->kvm))
+		new |= PSR_TCO_BIT;
 
 	new |= (old & PSR_DIT_BIT);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..5f8e165ea053 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -879,6 +879,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (vma_pagesize == PAGE_SIZE && !force_pte)
 		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
 							   &pfn, &fault_ipa);
+
+	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && !device &&
+	    pfn_valid(pfn)) {
+		/*
+		 * VM will be able to see the page's tags, so we must ensure
+		 * they have been initialised. if PG_mte_tagged is set, tags
+		 * have already been initialised.
+		 */
+		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
+		struct page *page = pfn_to_online_page(pfn);
+
+		if (!page)
+			return -EFAULT;
+
+		for (i = 0; i < nr_pages; i++, page++) {
+			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+				mte_clear_page_tags(page_address(page));
+		}
+	}
+
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 4f2f1e3145de..18c87500a7a8 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1047,6 +1047,9 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
 		break;
 	case SYS_ID_AA64PFR1_EL1:
 		val &= ~FEATURE(ID_AA64PFR1_MTE);
+		if (kvm_has_mte(vcpu->kvm))
+			val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE),
+					  ID_AA64PFR1_MTE);
 		break;
 	case SYS_ID_AA64ISAR1_EL1:
 		if (!vcpu_has_ptrauth(vcpu))
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f6afee209620..6dc16c09a2d1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1078,6 +1078,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING 192
 #define KVM_CAP_X86_BUS_LOCK_EXIT 193
 #define KVM_CAP_PPC_DAWR1 194
+#define KVM_CAP_ARM_MTE 195
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.20.1



* [PATCH v11 3/6] arm64: kvm: Save/restore MTE registers
  2021-04-16 15:43 [PATCH v11 0/6] MTE support for KVM guest Steven Price
  2021-04-16 15:43 ` [PATCH v11 1/6] arm64: mte: Sync tags for pages where PTE is untagged Steven Price
  2021-04-16 15:43 ` [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature Steven Price
@ 2021-04-16 15:43 ` Steven Price
  2021-04-16 15:43 ` [PATCH v11 4/6] arm64: kvm: Expose KVM_ARM_CAP_MTE Steven Price
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Steven Price @ 2021-04-16 15:43 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

Define the new system registers that MTE introduces and context switch
them. The MTE feature is still hidden from the ID register as it isn't
supported in a VM yet.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/asm/kvm_host.h          |  6 ++
 arch/arm64/include/asm/kvm_mte.h           | 66 ++++++++++++++++++++++
 arch/arm64/include/asm/sysreg.h            |  3 +-
 arch/arm64/kernel/asm-offsets.c            |  3 +
 arch/arm64/kvm/hyp/entry.S                 |  7 +++
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++++++
 arch/arm64/kvm/sys_regs.c                  | 22 ++++++--
 7 files changed, 123 insertions(+), 5 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 1170ee137096..d00cc3590f6e 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -208,6 +208,12 @@ enum vcpu_sysreg {
 	CNTP_CVAL_EL0,
 	CNTP_CTL_EL0,
 
+	/* Memory Tagging Extension registers */
+	RGSR_EL1,	/* Random Allocation Tag Seed Register */
+	GCR_EL1,	/* Tag Control Register */
+	TFSR_EL1,	/* Tag Fault Status Register (EL1) */
+	TFSRE0_EL1,	/* Tag Fault Status Register (EL0) */
+
 	/* 32bit specific registers. Keep them at the end of the range */
 	DACR32_EL2,	/* Domain Access Control Register */
 	IFSR32_EL2,	/* Instruction Fault Status Register */
diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
new file mode 100644
index 000000000000..6541c7d6ce06
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_mte.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020 ARM Ltd.
+ */
+#ifndef __ASM_KVM_MTE_H
+#define __ASM_KVM_MTE_H
+
+#ifdef __ASSEMBLY__
+
+#include <asm/sysreg.h>
+
+#ifdef CONFIG_ARM64_MTE
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+	b	.L__skip_switch\@
+alternative_else_nop_endif
+	mrs	\reg1, hcr_el2
+	and	\reg1, \reg1, #(HCR_ATA)
+	cbz	\reg1, .L__skip_switch\@
+
+	mrs_s	\reg1, SYS_RGSR_EL1
+	str	\reg1, [\h_ctxt, #CPU_RGSR_EL1]
+	mrs_s	\reg1, SYS_GCR_EL1
+	str	\reg1, [\h_ctxt, #CPU_GCR_EL1]
+
+	ldr	\reg1, [\g_ctxt, #CPU_RGSR_EL1]
+	msr_s	SYS_RGSR_EL1, \reg1
+	ldr	\reg1, [\g_ctxt, #CPU_GCR_EL1]
+	msr_s	SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+	b	.L__skip_switch\@
+alternative_else_nop_endif
+	mrs	\reg1, hcr_el2
+	and	\reg1, \reg1, #(HCR_ATA)
+	cbz	\reg1, .L__skip_switch\@
+
+	mrs_s	\reg1, SYS_RGSR_EL1
+	str	\reg1, [\g_ctxt, #CPU_RGSR_EL1]
+	mrs_s	\reg1, SYS_GCR_EL1
+	str	\reg1, [\g_ctxt, #CPU_GCR_EL1]
+
+	ldr	\reg1, [\h_ctxt, #CPU_RGSR_EL1]
+	msr_s	SYS_RGSR_EL1, \reg1
+	ldr	\reg1, [\h_ctxt, #CPU_GCR_EL1]
+	msr_s	SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+#else /* CONFIG_ARM64_MTE */
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+.endm
+
+#endif /* CONFIG_ARM64_MTE */
+#endif /* __ASSEMBLY__ */
+#endif /* __ASM_KVM_MTE_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index dfd4edbfe360..5424d195cf96 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -580,7 +580,8 @@
 #define SCTLR_ELx_M	(BIT(0))
 
 #define SCTLR_ELx_FLAGS	(SCTLR_ELx_M  | SCTLR_ELx_A | SCTLR_ELx_C | \
-			 SCTLR_ELx_SA | SCTLR_ELx_I | SCTLR_ELx_IESB)
+			 SCTLR_ELx_SA | SCTLR_ELx_I | SCTLR_ELx_IESB | \
+			 SCTLR_ELx_ITFSB)
 
 /* SCTLR_EL2 specific flags. */
 #define SCTLR_EL2_RES1	((BIT(4))  | (BIT(5))  | (BIT(11)) | (BIT(16)) | \
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index a36e2fc330d4..944e4f1f45d9 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -108,6 +108,9 @@ int main(void)
   DEFINE(VCPU_WORKAROUND_FLAGS,	offsetof(struct kvm_vcpu, arch.workaround_flags));
   DEFINE(VCPU_HCR_EL2,		offsetof(struct kvm_vcpu, arch.hcr_el2));
   DEFINE(CPU_USER_PT_REGS,	offsetof(struct kvm_cpu_context, regs));
+  DEFINE(CPU_RGSR_EL1,		offsetof(struct kvm_cpu_context, sys_regs[RGSR_EL1]));
+  DEFINE(CPU_GCR_EL1,		offsetof(struct kvm_cpu_context, sys_regs[GCR_EL1]));
+  DEFINE(CPU_TFSRE0_EL1,	offsetof(struct kvm_cpu_context, sys_regs[TFSRE0_EL1]));
   DEFINE(CPU_APIAKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APIAKEYLO_EL1]));
   DEFINE(CPU_APIBKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APIBKEYLO_EL1]));
   DEFINE(CPU_APDAKEYLO_EL1,	offsetof(struct kvm_cpu_context, sys_regs[APDAKEYLO_EL1]));
diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
index b0afad7a99c6..c67582c6dd55 100644
--- a/arch/arm64/kvm/hyp/entry.S
+++ b/arch/arm64/kvm/hyp/entry.S
@@ -13,6 +13,7 @@
 #include <asm/kvm_arm.h>
 #include <asm/kvm_asm.h>
 #include <asm/kvm_mmu.h>
+#include <asm/kvm_mte.h>
 #include <asm/kvm_ptrauth.h>
 
 	.text
@@ -51,6 +52,9 @@ alternative_else_nop_endif
 
 	add	x29, x0, #VCPU_CONTEXT
 
+	// mte_switch_to_guest(g_ctxt, h_ctxt, tmp1)
+	mte_switch_to_guest x29, x1, x2
+
 	// Macro ptrauth_switch_to_guest format:
 	// 	ptrauth_switch_to_guest(guest cxt, tmp1, tmp2, tmp3)
 	// The below macro to restore guest keys is not implemented in C code
@@ -140,6 +144,9 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL)
 	// when this feature is enabled for kernel code.
 	ptrauth_switch_to_hyp x1, x2, x3, x4, x5
 
+	// mte_switch_to_hyp(g_ctxt, h_ctxt, reg1)
+	mte_switch_to_hyp x1, x2, x3
+
 	// Restore hyp's sp_el0
 	restore_sp_el0 x2, x3
 
diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
index cce43bfe158f..de7e14c862e6 100644
--- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
+++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
@@ -14,6 +14,7 @@
 #include <asm/kvm_asm.h>
 #include <asm/kvm_emulate.h>
 #include <asm/kvm_hyp.h>
+#include <asm/kvm_mmu.h>
 
 static inline void __sysreg_save_common_state(struct kvm_cpu_context *ctxt)
 {
@@ -26,6 +27,16 @@ static inline void __sysreg_save_user_state(struct kvm_cpu_context *ctxt)
 	ctxt_sys_reg(ctxt, TPIDRRO_EL0)	= read_sysreg(tpidrro_el0);
 }
 
+static inline bool ctxt_has_mte(struct kvm_cpu_context *ctxt)
+{
+	struct kvm_vcpu *vcpu = ctxt->__hyp_running_vcpu;
+
+	if (!vcpu)
+		vcpu = container_of(ctxt, struct kvm_vcpu, arch.ctxt);
+
+	return kvm_has_mte(kern_hyp_va(vcpu->kvm));
+}
+
 static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
 {
 	ctxt_sys_reg(ctxt, CSSELR_EL1)	= read_sysreg(csselr_el1);
@@ -46,6 +57,11 @@ static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
 	ctxt_sys_reg(ctxt, PAR_EL1)	= read_sysreg_par();
 	ctxt_sys_reg(ctxt, TPIDR_EL1)	= read_sysreg(tpidr_el1);
 
+	if (ctxt_has_mte(ctxt)) {
+		ctxt_sys_reg(ctxt, TFSR_EL1) = read_sysreg_el1(SYS_TFSR);
+		ctxt_sys_reg(ctxt, TFSRE0_EL1) = read_sysreg_s(SYS_TFSRE0_EL1);
+	}
+
 	ctxt_sys_reg(ctxt, SP_EL1)	= read_sysreg(sp_el1);
 	ctxt_sys_reg(ctxt, ELR_EL1)	= read_sysreg_el1(SYS_ELR);
 	ctxt_sys_reg(ctxt, SPSR_EL1)	= read_sysreg_el1(SYS_SPSR);
@@ -107,6 +123,11 @@ static inline void __sysreg_restore_el1_state(struct kvm_cpu_context *ctxt)
 	write_sysreg(ctxt_sys_reg(ctxt, PAR_EL1),	par_el1);
 	write_sysreg(ctxt_sys_reg(ctxt, TPIDR_EL1),	tpidr_el1);
 
+	if (ctxt_has_mte(ctxt)) {
+		write_sysreg_el1(ctxt_sys_reg(ctxt, TFSR_EL1), SYS_TFSR);
+		write_sysreg_s(ctxt_sys_reg(ctxt, TFSRE0_EL1), SYS_TFSRE0_EL1);
+	}
+
 	if (!has_vhe() &&
 	    cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT) &&
 	    ctxt->__hyp_running_vcpu) {
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 18c87500a7a8..377ae6efb0ef 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1303,6 +1303,20 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
 	return true;
 }
 
+static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
+				   const struct sys_reg_desc *rd)
+{
+	return REG_HIDDEN;
+}
+
+#define MTE_REG(name) {				\
+	SYS_DESC(SYS_##name),			\
+	.access = undef_access,			\
+	.reset = reset_unknown,			\
+	.reg = name,				\
+	.visibility = mte_visibility,		\
+}
+
 /* sys_reg_desc initialiser for known cpufeature ID registers */
 #define ID_SANITISED(name) {			\
 	SYS_DESC(SYS_##name),			\
@@ -1471,8 +1485,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_ACTLR_EL1), access_actlr, reset_actlr, ACTLR_EL1 },
 	{ SYS_DESC(SYS_CPACR_EL1), NULL, reset_val, CPACR_EL1, 0 },
 
-	{ SYS_DESC(SYS_RGSR_EL1), undef_access },
-	{ SYS_DESC(SYS_GCR_EL1), undef_access },
+	MTE_REG(RGSR_EL1),
+	MTE_REG(GCR_EL1),
 
 	{ SYS_DESC(SYS_ZCR_EL1), NULL, reset_val, ZCR_EL1, 0, .visibility = sve_visibility },
 	{ SYS_DESC(SYS_TTBR0_EL1), access_vm_reg, reset_unknown, TTBR0_EL1 },
@@ -1498,8 +1512,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_ERXMISC0_EL1), trap_raz_wi },
 	{ SYS_DESC(SYS_ERXMISC1_EL1), trap_raz_wi },
 
-	{ SYS_DESC(SYS_TFSR_EL1), undef_access },
-	{ SYS_DESC(SYS_TFSRE0_EL1), undef_access },
+	MTE_REG(TFSR_EL1),
+	MTE_REG(TFSRE0_EL1),
 
 	{ SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 },
 	{ SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 },
-- 
2.20.1



* [PATCH v11 4/6] arm64: kvm: Expose KVM_ARM_CAP_MTE
  2021-04-16 15:43 [PATCH v11 0/6] MTE support for KVM guest Steven Price
                   ` (2 preceding siblings ...)
  2021-04-16 15:43 ` [PATCH v11 3/6] arm64: kvm: Save/restore MTE registers Steven Price
@ 2021-04-16 15:43 ` Steven Price
  2021-04-16 15:43 ` [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest Steven Price
  2021-04-16 15:43 ` [PATCH v11 6/6] KVM: arm64: Document MTE capability and ioctl Steven Price
  5 siblings, 0 replies; 26+ messages in thread
From: Steven Price @ 2021-04-16 15:43 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

It's now safe for the VMM to enable MTE in a guest, so expose the
capability to user space.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/kvm/arm.c      | 9 +++++++++
 arch/arm64/kvm/sys_regs.c | 3 +++
 2 files changed, 12 insertions(+)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index fc4c95dd2d26..46bf319f6cb7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		r = 0;
 		kvm->arch.return_nisv_io_abort_to_user = true;
 		break;
+	case KVM_CAP_ARM_MTE:
+		if (!system_supports_mte() || kvm->created_vcpus)
+			return -EINVAL;
+		r = 0;
+		kvm->arch.mte_enabled = true;
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -234,6 +240,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		 */
 		r = 1;
 		break;
+	case KVM_CAP_ARM_MTE:
+		r = system_supports_mte();
+		break;
 	case KVM_CAP_STEAL_TIME:
 		r = kvm_arm_pvtime_supported();
 		break;
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 377ae6efb0ef..46937bfaac8a 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1306,6 +1306,9 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
 static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
 				   const struct sys_reg_desc *rd)
 {
+	if (kvm_has_mte(vcpu->kvm))
+		return 0;
+
 	return REG_HIDDEN;
 }
 
-- 
2.20.1



* [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-04-16 15:43 [PATCH v11 0/6] MTE support for KVM guest Steven Price
                   ` (3 preceding siblings ...)
  2021-04-16 15:43 ` [PATCH v11 4/6] arm64: kvm: Expose KVM_ARM_CAP_MTE Steven Price
@ 2021-04-16 15:43 ` Steven Price
  2021-04-27 17:58   ` Catalin Marinas
  2021-04-16 15:43 ` [PATCH v11 6/6] KVM: arm64: Document MTE capability and ioctl Steven Price
  5 siblings, 1 reply; 26+ messages in thread
From: Steven Price @ 2021-04-16 15:43 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

The VMM may not wish to have its own mapping of guest memory mapped
with PROT_MTE because this causes problems if the VMM has tag checking
enabled (the guest controls the tags in physical RAM and it's unlikely
the tags are correct for the VMM).

Instead add a new ioctl which allows the VMM to easily read/write the
tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
while the VMM can still read/write the tags for the purpose of
migration.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 arch/arm64/include/uapi/asm/kvm.h | 14 +++++++
 arch/arm64/kvm/arm.c              | 69 +++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h          |  1 +
 3 files changed, 84 insertions(+)

diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 24223adae150..2b85a047c37d 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -184,6 +184,20 @@ struct kvm_vcpu_events {
 	__u32 reserved[12];
 };
 
+struct kvm_arm_copy_mte_tags {
+	__u64 guest_ipa;
+	__u64 length;
+	union {
+		void __user *addr;
+		__u64 padding;
+	};
+	__u64 flags;
+	__u64 reserved[2];
+};
+
+#define KVM_ARM_TAGS_TO_GUEST		0
+#define KVM_ARM_TAGS_FROM_GUEST		1
+
 /* If you need to interpret the index values, here is the key: */
 #define KVM_REG_ARM_COPROC_MASK		0x000000000FFF0000
 #define KVM_REG_ARM_COPROC_SHIFT	16
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 46bf319f6cb7..9a6b26d37236 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1297,6 +1297,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
 	}
 }
 
+static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+				      struct kvm_arm_copy_mte_tags *copy_tags)
+{
+	gpa_t guest_ipa = copy_tags->guest_ipa;
+	size_t length = copy_tags->length;
+	void __user *tags = copy_tags->addr;
+	gpa_t gfn;
+	bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
+	int ret = 0;
+
+	if (copy_tags->reserved[0] || copy_tags->reserved[1])
+		return -EINVAL;
+
+	if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
+		return -EINVAL;
+
+	if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
+		return -EINVAL;
+
+	gfn = gpa_to_gfn(guest_ipa);
+
+	mutex_lock(&kvm->slots_lock);
+
+	while (length > 0) {
+		kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
+		void *maddr;
+		unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
+
+		if (is_error_noslot_pfn(pfn)) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		maddr = page_address(pfn_to_page(pfn));
+
+		if (!write) {
+			num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
+			kvm_release_pfn_clean(pfn);
+		} else {
+			num_tags = mte_copy_tags_from_user(maddr, tags,
+							   num_tags);
+			kvm_release_pfn_dirty(pfn);
+		}
+
+		if (num_tags != PAGE_SIZE / MTE_GRANULE_SIZE) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		gfn++;
+		tags += num_tags;
+		length -= PAGE_SIZE;
+	}
+
+out:
+	mutex_unlock(&kvm->slots_lock);
+	return ret;
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
 		       unsigned int ioctl, unsigned long arg)
 {
@@ -1333,6 +1392,16 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 		return 0;
 	}
+	case KVM_ARM_MTE_COPY_TAGS: {
+		struct kvm_arm_copy_mte_tags copy_tags;
+
+		if (!kvm_has_mte(kvm))
+			return -EINVAL;
+
+		if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
+			return -EFAULT;
+		return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
+	}
 	default:
 		return -EINVAL;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6dc16c09a2d1..470c122f4c2d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1424,6 +1424,7 @@ struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_PMU_EVENT_FILTER */
 #define KVM_SET_PMU_EVENT_FILTER  _IOW(KVMIO,  0xb2, struct kvm_pmu_event_filter)
 #define KVM_PPC_SVM_OFF		  _IO(KVMIO,  0xb3)
+#define KVM_ARM_MTE_COPY_TAGS	  _IOR(KVMIO,  0xb4, struct kvm_arm_copy_mte_tags)
 
 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
-- 
2.20.1



* [PATCH v11 6/6] KVM: arm64: Document MTE capability and ioctl
  2021-04-16 15:43 [PATCH v11 0/6] MTE support for KVM guest Steven Price
                   ` (4 preceding siblings ...)
  2021-04-16 15:43 ` [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest Steven Price
@ 2021-04-16 15:43 ` Steven Price
  5 siblings, 0 replies; 26+ messages in thread
From: Steven Price @ 2021-04-16 15:43 UTC (permalink / raw)
  To: Catalin Marinas, Marc Zyngier, Will Deacon
  Cc: Steven Price, James Morse, Julien Thierry, Suzuki K Poulose,
	kvmarm, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Rutland, Thomas Gleixner, qemu-devel, Juan Quintela,
	Dr. David Alan Gilbert, Richard Henderson, Peter Maydell,
	Haibo Xu, Andrew Jones

A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
granting a guest access to the tags, and provides a mechanism for the
VMM to enable it.

A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
access the tags of a guest without having to maintain a PROT_MTE mapping
in userspace. The above capability gates access to the ioctl.

Signed-off-by: Steven Price <steven.price@arm.com>
---
 Documentation/virt/kvm/api.rst | 53 ++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 1a2b5210cdbf..ccc84f21ba5e 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4938,6 +4938,40 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
 with the KVM_XEN_VCPU_GET_ATTR ioctl.
 
+4.131 KVM_ARM_MTE_COPY_TAGS
+---------------------------
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: 0 on success, < 0 on error
+
+::
+
+  struct kvm_arm_copy_mte_tags {
+	__u64 guest_ipa;
+	__u64 length;
+	union {
+		void __user *addr;
+		__u64 padding;
+	};
+	__u64 flags;
+	__u64 reserved[2];
+  };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+field must point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / MTE_GRANULE_SIZE)``
+bytes (i.e. 1/16th of the corresponding size). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
 5. The kvm_run structure
 ========================
 
@@ -6227,6 +6261,25 @@ KVM_RUN_BUS_LOCK flag is used to distinguish between them.
 This capability can be used to check / enable 2nd DAWR feature provided
 by POWER10 processor.
 
+7.23 KVM_CAP_ARM_MTE
+--------------------
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before the guest will be granted access.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` so
+that the tags are maintained during swap or hibernation of the host; however
+the VMM needs to manually save/restore the tags as appropriate if the VM is
+migrated.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
 8. Other capabilities.
 ======================
 
-- 
2.20.1



* Re: [PATCH v11 1/6] arm64: mte: Sync tags for pages where PTE is untagged
  2021-04-16 15:43 ` [PATCH v11 1/6] arm64: mte: Sync tags for pages where PTE is untagged Steven Price
@ 2021-04-27 17:43   ` Catalin Marinas
  2021-04-29 16:06     ` Steven Price
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2021-04-27 17:43 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Fri, Apr 16, 2021 at 04:43:04PM +0100, Steven Price wrote:
> A KVM guest could store tags in a page even if the VMM hasn't mapped
> the page with PROT_MTE. So when restoring pages from swap we will
> need to check to see if there are any saved tags even if !pte_tagged().
> 
> However don't check pages which are !pte_valid_user() as these will
> not have been swapped out.

You should remove the pte_valid_user() mention from the commit log as
well.

> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index e17b96d0e4b5..cf4b52a33b3c 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>  		__sync_icache_dcache(pte);
>  
>  	if (system_supports_mte() &&
> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
> +	    pte_present(pte) && (pte_val(pte) & PTE_USER) && !pte_special(pte))

I would add a pte_user() macro here or, if we restore the tags only when
the page is readable, use pte_access_permitted(pte, false). Also add a
comment why we do this.
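
E.g. for the first option, something like this (untested, and the
pte_user() name is only a suggestion since arm64 doesn't currently have
such a macro):

  #define pte_user(pte)	(!!(pte_val(pte) & PTE_USER))

  	/*
  	 * If user space can access the page (and therefore its tags),
  	 * make sure any tags saved to swap are restored, even if the
  	 * mapping is not PROT_MTE.
  	 */
  	if (system_supports_mte() &&
  	    pte_present(pte) && pte_user(pte) && !pte_special(pte))
  		mte_sync_tags(ptep, pte);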

There's also the pte_user_exec() case which may not have the PTE_USER
set (exec-only permission) but I don't think it matters. We don't do tag
checking on instruction fetches, so if the user adds a PROT_READ to it,
it would go through set_pte_at() again. I'm not sure KVM does anything
special with exec-only mappings at stage 2, I suspect they won't be
accessible by the guest (but needs checking).

>  		mte_sync_tags(ptep, pte);
>  
>  	__check_racy_pte_update(mm, ptep, pte);
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index b3c70a612c7a..e016ab57ea36 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -26,17 +26,23 @@ u64 gcr_kernel_excl __ro_after_init;
>  
>  static bool report_fault_once = true;
>  
> -static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
> +static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
> +			       bool pte_is_tagged)
>  {
>  	pte_t old_pte = READ_ONCE(*ptep);
>  
>  	if (check_swap && is_swap_pte(old_pte)) {
>  		swp_entry_t entry = pte_to_swp_entry(old_pte);
>  
> -		if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
> +		if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
> +			set_bit(PG_mte_tagged, &page->flags);
>  			return;
> +		}
>  	}
>  
> +	if (!pte_is_tagged || test_and_set_bit(PG_mte_tagged, &page->flags))
> +		return;

I don't think we need another test_bit() here, it was done in the
caller (bar potential races which need more thought).

> +
>  	page_kasan_tag_reset(page);
>  	/*
>  	 * We need smp_wmb() in between setting the flags and clearing the
> @@ -54,11 +60,13 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
>  	struct page *page = pte_page(pte);
>  	long i, nr_pages = compound_nr(page);
>  	bool check_swap = nr_pages == 1;
> +	bool pte_is_tagged = pte_tagged(pte);
>  
>  	/* if PG_mte_tagged is set, tags have already been initialised */
>  	for (i = 0; i < nr_pages; i++, page++) {
> -		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> -			mte_sync_page_tags(page, ptep, check_swap);
> +		if (!test_bit(PG_mte_tagged, &page->flags))
> +			mte_sync_page_tags(page, ptep, check_swap,
> +					   pte_is_tagged);
>  	}
>  }

You were right in the previous thread that if we have a race, it's
already there even without your KVM patches.

If it's the same pte in a multithreaded app, we should be ok as the core
code holds the ptl (the arch code also holds the mmap_lock during
exception handling but only as a reader, so you can have multiple
holders).

If there are multiple ptes to the same page, for example mapped with
MAP_ANONYMOUS | MAP_SHARED, metadata recovery is done via
arch_swap_restore() before we even set the pte and with the page locked.
So calling lock_page() again in mte_restore_tags() would deadlock.

I can see that do_swap_page() also holds the page lock around
set_pte_at(), so I think we are covered.

Any other scenario I may have missed? My understanding is that if the
pte is the same, we have the ptl. Otherwise we have the page lock for
shared pages.

-- 
Catalin


* Re: [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-04-16 15:43 ` [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest Steven Price
@ 2021-04-27 17:58   ` Catalin Marinas
  2021-04-29 16:06     ` Steven Price
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2021-04-27 17:58 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Fri, Apr 16, 2021 at 04:43:08PM +0100, Steven Price wrote:
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 24223adae150..2b85a047c37d 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -184,6 +184,20 @@ struct kvm_vcpu_events {
>  	__u32 reserved[12];
>  };
>  
> +struct kvm_arm_copy_mte_tags {
> +	__u64 guest_ipa;
> +	__u64 length;
> +	union {
> +		void __user *addr;
> +		__u64 padding;
> +	};
> +	__u64 flags;
> +	__u64 reserved[2];
> +};

I know Marc asked for some reserved space in here but I'm not sure it's
the right place. And what's with the union of a 64-bit pointer and
64-bit padding, it doesn't change any layout? Maybe add the two reserved
values to the union in case we want to store something else in the
future.

Or maybe I'm missing something, I haven't checked how other KVM ioctls
work.

-- 
Catalin


* Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
  2021-04-16 15:43 ` [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature Steven Price
@ 2021-04-28 17:07   ` Catalin Marinas
  2021-04-29 16:06     ` Steven Price
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2021-04-28 17:07 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Fri, Apr 16, 2021 at 04:43:05PM +0100, Steven Price wrote:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 77cb2d28f2a4..5f8e165ea053 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -879,6 +879,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (vma_pagesize == PAGE_SIZE && !force_pte)
>  		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>  							   &pfn, &fault_ipa);
> +
> +	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && !device &&
> +	    pfn_valid(pfn)) {

In the current implementation, device == !pfn_valid(), so we could skip
the latter check.

> +		/*
> +		 * VM will be able to see the page's tags, so we must ensure
> +		 * they have been initialised. if PG_mte_tagged is set, tags
> +		 * have already been initialised.
> +		 */
> +		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
> +		struct page *page = pfn_to_online_page(pfn);
> +
> +		if (!page)
> +			return -EFAULT;

I think that's fine, though maybe adding a comment that otherwise it
would be mapped at stage 2 as Normal Cacheable and we cannot guarantee
that the memory supports MTE tags.
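> 
> Something along these lines maybe (the wording is just a suggestion):
> 
>   		struct page *page = pfn_to_online_page(pfn);
> 
>   		/*
>   		 * Without a struct page (e.g. ZONE_DEVICE) the memory would
>   		 * still be mapped at stage 2 as Normal Cacheable, but we
>   		 * cannot guarantee that it supports MTE tags, so reject
>   		 * the fault.
>   		 */
>   		if (!page)
>   			return -EFAULT;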

> +
> +		for (i = 0; i < nr_pages; i++, page++) {
> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> +				mte_clear_page_tags(page_address(page));
> +		}
> +	}
> +
>  	if (writable)
>  		prot |= KVM_PGTABLE_PROT_W;

I probably asked already but is the only way to map a standard RAM page
(not device) in stage 2 via the fault handler? One case I had in mind
was something like get_user_pages() but it looks like that one doesn't
call set_pte_at_notify(). There are a few other places where
set_pte_at_notify() is called and these may happen before we got a
chance to fault on stage 2, effectively populating the entry (IIUC). If
that's an issue, we could move the above loop and check closer to the
actual pte setting like kvm_pgtable_stage2_map().

While the set_pte_at() race on the page flags is somewhat clearer, we
may still have a race here with the VMM's set_pte_at() if the page is
mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
handling the VMM page tables (well, not always, see below).

gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
sure whether get_user_page_fast_only() does the same.

The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
KVM mmu notifier is invoked before set_pte_at() and racing with another
user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
set_pte_at() would see the PG_mte_tagged set either by the current CPU
or by the one it was racing with.

-- 
Catalin


* Re: [PATCH v11 1/6] arm64: mte: Sync tags for pages where PTE is untagged
  2021-04-27 17:43   ` Catalin Marinas
@ 2021-04-29 16:06     ` Steven Price
  2021-05-04 15:29       ` Catalin Marinas
  0 siblings, 1 reply; 26+ messages in thread
From: Steven Price @ 2021-04-29 16:06 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 27/04/2021 18:43, Catalin Marinas wrote:
> On Fri, Apr 16, 2021 at 04:43:04PM +0100, Steven Price wrote:
>> A KVM guest could store tags in a page even if the VMM hasn't mapped
>> the page with PROT_MTE. So when restoring pages from swap we will
>> need to check to see if there are any saved tags even if !pte_tagged().
>>
>> However don't check pages which are !pte_valid_user() as these will
>> not have been swapped out.
> 
> You should remove the pte_valid_user() mention from the commit log as
> well.

Good spot - sorry about that. I really must get better at reading my own 
commit messages.

>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index e17b96d0e4b5..cf4b52a33b3c 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>>   		__sync_icache_dcache(pte);
>>   
>>   	if (system_supports_mte() &&
>> -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
>> +	    pte_present(pte) && (pte_val(pte) & PTE_USER) && !pte_special(pte))
> 
> I would add a pte_user() macro here or, if we restore the tags only when
> the page is readable, use pte_access_permitted(pte, false). Also add a
> comment why we do this.

pte_access_permitted() looks like it describes what we want (user space 
can access the memory). I'll add the following comment:

  /*
   * If the PTE would provide user space access to the tags
   * associated with it then ensure that the MTE tags are synchronised.
   * Exec-only mappings don't expose tags (instruction fetches don't
   * check tags).
   */
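
With pte_access_permitted() the check itself would then end up as
something like (untested):

  	if (system_supports_mte() && pte_present(pte) &&
  	    pte_access_permitted(pte, false) && !pte_special(pte))
  		mte_sync_tags(ptep, pte);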

> There's also the pte_user_exec() case which may not have the PTE_USER
> set (exec-only permission) but I don't think it matters. We don't do tag
> checking on instruction fetches, so if the user adds a PROT_READ to it,
> it would go through set_pte_at() again. I'm not sure KVM does anything
> special with exec-only mappings at stage 2, I suspect they won't be
> accessible by the guest (but needs checking).

It comes down to the behaviour of get_user_pages(). AFAICT that will 
fail if the memory is exec-only, so no stage 2 mapping will be created. 
Which of course means the guest can't do anything with that memory. That 
certainly seems like the only sane behaviour even without MTE.

>>   		mte_sync_tags(ptep, pte);
>>   
>>   	__check_racy_pte_update(mm, ptep, pte);
>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> index b3c70a612c7a..e016ab57ea36 100644
>> --- a/arch/arm64/kernel/mte.c
>> +++ b/arch/arm64/kernel/mte.c
>> @@ -26,17 +26,23 @@ u64 gcr_kernel_excl __ro_after_init;
>>   
>>   static bool report_fault_once = true;
>>   
>> -static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
>> +static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
>> +			       bool pte_is_tagged)
>>   {
>>   	pte_t old_pte = READ_ONCE(*ptep);
>>   
>>   	if (check_swap && is_swap_pte(old_pte)) {
>>   		swp_entry_t entry = pte_to_swp_entry(old_pte);
>>   
>> -		if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
>> +		if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
>> +			set_bit(PG_mte_tagged, &page->flags);
>>   			return;
>> +		}
>>   	}
>>   
>> +	if (!pte_is_tagged || test_and_set_bit(PG_mte_tagged, &page->flags))
>> +		return;
> 
> I don't think we need another test_bit() here, it was done in the
> caller (bar potential races which need more thought).

Good point - I'll change that to a straight set_bit().
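So the tail of mte_sync_page_tags() would just become (untested):

  	if (!pte_is_tagged)
  		return;

  	set_bit(PG_mte_tagged, &page->flags);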

>> +
>>   	page_kasan_tag_reset(page);
>>   	/*
>>   	 * We need smp_wmb() in between setting the flags and clearing the
>> @@ -54,11 +60,13 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
>>   	struct page *page = pte_page(pte);
>>   	long i, nr_pages = compound_nr(page);
>>   	bool check_swap = nr_pages == 1;
>> +	bool pte_is_tagged = pte_tagged(pte);
>>   
>>   	/* if PG_mte_tagged is set, tags have already been initialised */
>>   	for (i = 0; i < nr_pages; i++, page++) {
>> -		if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>> -			mte_sync_page_tags(page, ptep, check_swap);
>> +		if (!test_bit(PG_mte_tagged, &page->flags))
>> +			mte_sync_page_tags(page, ptep, check_swap,
>> +					   pte_is_tagged);
>>   	}
>>   }
> 
> You were right in the previous thread that if we have a race, it's
> already there even without your KVM patches.
> 
> If it's the same pte in a multithreaded app, we should be ok as the core
> code holds the ptl (the arch code also holds the mmap_lock during
> exception handling but only as a reader, so you can have multiple
> holders).
> 
> If there are multiple ptes to the same page, for example mapped with
> MAP_ANONYMOUS | MAP_SHARED, metadata recovery is done via
> arch_swap_restore() before we even set the pte and with the page locked.
> So calling lock_page() again in mte_restore_tags() would deadlock.
> 
> I can see that do_swap_page() also holds the page lock around
> set_pte_at(), so I think we are covered.
> 
> Any other scenario I may have missed? My understanding is that if the
> pte is the same, we have the ptl. Otherwise we have the page lock for
> shared pages.

That is my understanding - either the PTL is held or the page is locked. 
But I am aware I was partly basing that on an assumption that the 
existing code is correct. If there's a way that a new PTE can be created 
which races with the arch_swap_restore() path then there is a problem. 
I'm not aware of how that would happen though.

Steve


* Re: [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-04-27 17:58   ` Catalin Marinas
@ 2021-04-29 16:06     ` Steven Price
  2021-05-04 17:44       ` Catalin Marinas
  0 siblings, 1 reply; 26+ messages in thread
From: Steven Price @ 2021-04-29 16:06 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 27/04/2021 18:58, Catalin Marinas wrote:
> On Fri, Apr 16, 2021 at 04:43:08PM +0100, Steven Price wrote:
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 24223adae150..2b85a047c37d 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -184,6 +184,20 @@ struct kvm_vcpu_events {
>>   	__u32 reserved[12];
>>   };
>>   
>> +struct kvm_arm_copy_mte_tags {
>> +	__u64 guest_ipa;
>> +	__u64 length;
>> +	union {
>> +		void __user *addr;
>> +		__u64 padding;
>> +	};
>> +	__u64 flags;
>> +	__u64 reserved[2];
>> +};
> 
> I know Marc asked for some reserved space in here but I'm not sure it's
> the right place. And what's with the union of a 64-bit pointer and
> 64-bit padding, it doesn't change any layout?

Yes it's unnecessary here - habits die hard. This would ensure that the 
layout is the same for 32 bit and 64 bit. But it's irrelevant here as 
(a) we don't support 32 bit, and (b) flags has 64 bit alignment anyway. 
I'll drop the union (and 'padding').
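So (assuming nothing else changes) the structure would simply become:

  struct kvm_arm_copy_mte_tags {
  	__u64 guest_ipa;
  	__u64 length;
  	void __user *addr;
  	__u64 flags;
  	__u64 reserved[2];
  };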

> Maybe add the two reserved
> values to the union in case we want to store something else in the
> future.

I'm not sure what you mean here. What would the reserved fields be 
unioned with? And surely they are no longer reserved in that case?

> Or maybe I'm missing something, I haven't checked how other KVM ioctls
> work.

KVM ioctls seem to (sometimes) have some reserved space at the end of 
the structure for expansion without the ioctl number changing (since the 
structure size is encoded into the ioctl).

Steve


* Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
  2021-04-28 17:07   ` Catalin Marinas
@ 2021-04-29 16:06     ` Steven Price
  2021-05-04 17:40       ` Catalin Marinas
  0 siblings, 1 reply; 26+ messages in thread
From: Steven Price @ 2021-04-29 16:06 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 28/04/2021 18:07, Catalin Marinas wrote:
> On Fri, Apr 16, 2021 at 04:43:05PM +0100, Steven Price wrote:
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 77cb2d28f2a4..5f8e165ea053 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -879,6 +879,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>   	if (vma_pagesize == PAGE_SIZE && !force_pte)
>>   		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>>   							   &pfn, &fault_ipa);
>> +
>> +	if (fault_status != FSC_PERM && kvm_has_mte(kvm) && !device &&
>> +	    pfn_valid(pfn)) {
> 
> In the current implementation, device == !pfn_valid(), so we could skip
> the latter check.

Thanks, I'll drop that check.

>> +		/*
>> +		 * VM will be able to see the page's tags, so we must ensure
>> +		 * they have been initialised. if PG_mte_tagged is set, tags
>> +		 * have already been initialised.
>> +		 */
>> +		unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
>> +		struct page *page = pfn_to_online_page(pfn);
>> +
>> +		if (!page)
>> +			return -EFAULT;
> 
> I think that's fine, though maybe adding a comment that otherwise it
> would be mapped at stage 2 as Normal Cacheable and we cannot guarantee
> that the memory supports MTE tags.

That's what I intended by "be able to see the page's tags", but I'll 
reword to be explicit about it being Normal Cacheable.

>> +
>> +		for (i = 0; i < nr_pages; i++, page++) {
>> +			if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>> +				mte_clear_page_tags(page_address(page));
>> +		}
>> +	}
>> +
>>   	if (writable)
>>   		prot |= KVM_PGTABLE_PROT_W;
> 
> I probably asked already but is the only way to map a standard RAM page
> (not device) in stage 2 via the fault handler? One case I had in mind
> was something like get_user_pages() but it looks like that one doesn't
> call set_pte_at_notify(). There are a few other places where
> set_pte_at_notify() is called and these may happen before we got a
> chance to fault on stage 2, effectively populating the entry (IIUC). If
> that's an issue, we could move the above loop and check closer to the
> actual pte setting like kvm_pgtable_stage2_map().

The only call sites of kvm_pgtable_stage2_map() are in mmu.c:

  * kvm_phys_addr_ioremap() - maps as device in stage 2

  * user_mem_abort() - handled above

  * kvm_set_spte_handler() - ultimately called from the .change_pte() 
callback of the MMU notifier

So the last one is potentially a problem. It's called via the MMU 
notifiers in the case of set_pte_at_notify(). The users of that are:

  * uprobe_write_opcode(): Allocates a new page and performs a 
copy_highpage() to copy the data to the new page (which with MTE 
includes the tags and will copy across the PG_mte_tagged flag).

  * write_protect_page() (KSM): Changes the permissions on the PTE but 
it's still the same page, so nothing to do regarding MTE.

  * replace_page() (KSM): If the page has MTE tags then the MTE version 
of memcmp_pages() will return false, so the only caller 
(try_to_merge_one_page()) will never call this on a page with tags.

  * wp_page_copy(): This one is more interesting - if we go down the 
cow_user_page() path with an old page then everything is safe (tags are 
copied over). The is_zero_pfn() case worries me a bit - a new page is 
allocated, but I can't instantly see anything to zero out the tags (and 
set PG_mte_tagged).

  * migrate_vma_insert_page(): I think migration should be safe as the 
tags should be copied.

So wp_page_copy() looks suspicious.

kvm_pgtable_stage2_map() looks like it could be a good place for the 
checks, it looks like it should work and is probably a more obvious 
place for the checks.

> While the set_pte_at() race on the page flags is somewhat clearer, we
> may still have a race here with the VMM's set_pte_at() if the page is
> mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
> handling the VMM page tables (well, not always, see below).
> 
> gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
> path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
> would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
> sure whether get_user_page_fast_only() does the same.
> 
> The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
> KVM mmu notifier is invoked before set_pte_at() and racing with another
> user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
> set_pte_at() would see the PG_mte_tagged set either by the current CPU
> or by the one it was racing with.
> 

Given the changes to set_pte_at() which means that tags are restored 
from swap even if !PROT_MTE, the only race I can see remaining is the 
creation of new PROT_MTE mappings. As you mention an attempt to change 
mappings in the VMM memory space should involve a mmu notifier call 
which I think serialises this. So the remaining issue is doing this in a 
separate address space.

So I guess the potential problem is:

  * allocate memory MAP_SHARED but !PROT_MTE
  * fork()
  * VM causes a fault in parent address space
  * child does a mprotect(PROT_MTE)

With the last two potentially racing. Sadly I can't see a good way of 
handling that.
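
Spelling that out as a hypothetical (untested) userspace sequence - p, SZ 
and vcpu_fd below are placeholders, not code from the series:

  void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
  	       MAP_SHARED | MAP_ANONYMOUS, -1, 0);	/* no PROT_MTE */

  if (fork() == 0) {
  	/* child: makes the shared pages tagged */
  	mprotect(p, SZ, PROT_READ | PROT_WRITE | PROT_MTE);
  	_exit(0);
  }

  /* parent (VMM): a guest access faults the same page into stage 2 */
  ioctl(vcpu_fd, KVM_RUN, 0);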

Steve


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v11 1/6] arm64: mte: Sync tags for pages where PTE is untagged
  2021-04-29 16:06     ` Steven Price
@ 2021-05-04 15:29       ` Catalin Marinas
  0 siblings, 0 replies; 26+ messages in thread
From: Catalin Marinas @ 2021-05-04 15:29 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Thu, Apr 29, 2021 at 05:06:05PM +0100, Steven Price wrote:
> On 27/04/2021 18:43, Catalin Marinas wrote:
> > On Fri, Apr 16, 2021 at 04:43:04PM +0100, Steven Price wrote:
> > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > index e17b96d0e4b5..cf4b52a33b3c 100644
> > > --- a/arch/arm64/include/asm/pgtable.h
> > > +++ b/arch/arm64/include/asm/pgtable.h
> > > @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
> > >   		__sync_icache_dcache(pte);
> > >   	if (system_supports_mte() &&
> > > -	    pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
> > > +	    pte_present(pte) && (pte_val(pte) & PTE_USER) && !pte_special(pte))
> > 
> > I would add a pte_user() macro here or, if we restore the tags only when
> > the page is readable, use pte_access_permitted(pte, false). Also add a
> > comment why we do this.
> 
> pte_access_permitted() looks like it describes what we want (user space can
> access the memory). I'll add the following comment:
> 
>  /*
>   * If the PTE would provide user space will access to the tags

I think drop "will".

>   * associated with it then ensure that the MTE tags are synchronised.
>   * Exec-only mappings don't expose tags (instruction fetches don't
>   * check tags).
>   */

Sounds fine.

> > There's also the pte_user_exec() case which may not have the PTE_USER
> > set (exec-only permission) but I don't think it matters. We don't do tag
> > checking on instruction fetches, so if the user adds a PROT_READ to it,
> > it would go through set_pte_at() again. I'm not sure KVM does anything
> > special with exec-only mappings at stage 2, I suspect they won't be
> > accessible by the guest (but needs checking).
> 
> It comes down to the behaviour of get_user_pages(). AFAICT that will fail if
> the memory is exec-only, so no stage 2 mapping will be created. Which of
> course means the guest can't do anything with that memory. That certainly
> seems like the only sane behaviour even without MTE.

That's my understanding as well. The get_user_pages_fast() path uses
pte_access_permitted() and should return false. The slower
get_user_pages() relies on checking the vma flags and it checks for
VM_READ.

-- 
Catalin


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
  2021-04-29 16:06     ` Steven Price
@ 2021-05-04 17:40       ` Catalin Marinas
  2021-05-06 16:15         ` Steven Price
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2021-05-04 17:40 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> On 28/04/2021 18:07, Catalin Marinas wrote:
> > I probably asked already but is the only way to map a standard RAM page
> > (not device) in stage 2 via the fault handler? One case I had in mind
> > was something like get_user_pages() but it looks like that one doesn't
> > call set_pte_at_notify(). There are a few other places where
> > set_pte_at_notify() is called and these may happen before we got a
> > chance to fault on stage 2, effectively populating the entry (IIUC). If
> > that's an issue, we could move the above loop and check closer to the
> > actual pte setting like kvm_pgtable_stage2_map().
> 
> The only call sites of kvm_pgtable_stage2_map() are in mmu.c:
> 
>  * kvm_phys_addr_ioremap() - maps as device in stage 2
> 
>  * user_mem_abort() - handled above
> 
>  * kvm_set_spte_handler() - ultimately called from the .change_pte()
> callback of the MMU notifier
> 
> So the last one is potentially a problem. It's called via the MMU notifiers
> in the case of set_pte_at_notify(). The users of that are:
> 
>  * uprobe_write_opcode(): Allocates a new page and performs a
> copy_highpage() to copy the data to the new page (which with MTE includes
> the tags and will copy across the PG_mte_tagged flag).
> 
>  * write_protect_page() (KSM): Changes the permissions on the PTE but it's
> still the same page, so nothing to do regarding MTE.

My concern here is that the VMM had a stage 1 pte but we haven't yet
faulted in at stage 2 via user_mem_abort(), so we don't have any stage 2
pte set. write_protect_page() comes in and sets the new stage 2 pte via
the callback. I couldn't find any check in kvm_pgtable_stage2_map() for
the old pte, so it will set the new stage 2 pte regardless. A subsequent
guest read would no longer fault at stage 2.

>  * replace_page() (KSM): If the page has MTE tags then the MTE version of
> memcmp_pages() will return false, so the only caller
> (try_to_merge_one_page()) will never call this on a page with tags.
> 
>  * wp_page_copy(): This one is more interesting - if we go down the
> cow_user_page() path with an old page then everything is safe (tags are
> copied over). The is_zero_pfn() case worries me a bit - a new page is
> allocated, but I can't instantly see anything to zero out the tags (and set
> PG_mte_tagged).

True, I think tag zeroing happens only if we map it as PROT_MTE in the
VMM.

>  * migrate_vma_insert_page(): I think migration should be safe as the tags
> should be copied.
> 
> So wp_page_copy() looks suspicious.
> 
> kvm_pgtable_stage2_map() looks like it could be a good place for the checks,
> it looks like it should work and is probably a more obvious place for the
> checks.

That would be my preference. It also matches the stage 1 set_pte_at().

> > While the set_pte_at() race on the page flags is somewhat clearer, we
> > may still have a race here with the VMM's set_pte_at() if the page is
> > mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
> > handling the VMM page tables (well, not always, see below).
> > 
> > gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
> > path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
> > would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
> > sure whether get_user_page_fast_only() does the same.
> > 
> > The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
> > KVM mmu notifier is invoked before set_pte_at() and racing with another
> > user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
> > set_pte_at() would see the PG_mte_tagged set either by the current CPU
> > or by the one it was racing with.
> 
> Given the changes to set_pte_at() which means that tags are restored from
> swap even if !PROT_MTE, the only race I can see remaining is the creation of
> new PROT_MTE mappings. As you mention an attempt to change mappings in the
> VMM memory space should involve a mmu notifier call which I think serialises
> this. So the remaining issue is doing this in a separate address space.
> 
> So I guess the potential problem is:
> 
>  * allocate memory MAP_SHARED but !PROT_MTE
>  * fork()
>  * VM causes a fault in parent address space
>  * child does a mprotect(PROT_MTE)
> 
> With the last two potentially racing. Sadly I can't see a good way of
> handling that.

Ah, the mmap lock doesn't help as they are different processes
(mprotect() acquires it as a writer).

I wonder whether this is racy even in the absence of KVM. If both parent
and child do an mprotect(PROT_MTE), one of them may be reading stale
tags for a brief period.

Maybe we should revisit whether shared MTE pages are of any use, though
it's an ABI change (not bad if no-one is relying on this). However...

Thinking about this, we have a similar problem with the PG_dcache_clean
and two processes doing mprotect(PROT_EXEC). One of them could see the
flag set and skip the I-cache maintenance while the other executes
stale instructions. change_pte_range() could acquire the page lock if
the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
solve the MTE/KVM case but we could at least take the page lock via
user_mem_abort().

Or maybe we just document this behaviour as racy both for PROT_EXEC and
PROT_MTE mappings and be done with this. The minor issue with PROT_MTE
is the potential leaking of tags (it's harder to leak information
through the I-cache).

-- 
Catalin


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-04-29 16:06     ` Steven Price
@ 2021-05-04 17:44       ` Catalin Marinas
  2021-05-07  9:44         ` Steven Price
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2021-05-04 17:44 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Thu, Apr 29, 2021 at 05:06:07PM +0100, Steven Price wrote:
> On 27/04/2021 18:58, Catalin Marinas wrote:
> > On Fri, Apr 16, 2021 at 04:43:08PM +0100, Steven Price wrote:
> > > diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> > > index 24223adae150..2b85a047c37d 100644
> > > --- a/arch/arm64/include/uapi/asm/kvm.h
> > > +++ b/arch/arm64/include/uapi/asm/kvm.h
> > > @@ -184,6 +184,20 @@ struct kvm_vcpu_events {
> > >   	__u32 reserved[12];
> > >   };
> > > +struct kvm_arm_copy_mte_tags {
> > > +	__u64 guest_ipa;
> > > +	__u64 length;
> > > +	union {
> > > +		void __user *addr;
> > > +		__u64 padding;
> > > +	};
> > > +	__u64 flags;
> > > +	__u64 reserved[2];
> > > +};
[...]
> > Maybe add the two reserved
> > values to the union in case we want to store something else in the
> > future.
> 
> I'm not sure what you mean here. What would the reserved fields be unioned
> with? And surely they are no longer reserved in that case?

In case you want to keep the structure size the same for future
expansion and the expansion only happens via the union, you'd add some
padding in there just in case. We do this for struct siginfo with an
_si_pad[] array in the union.
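
One way to read that suggestion (illustrative only, not a proposal for 
the actual layout):

  struct kvm_arm_copy_mte_tags {
  	__u64 guest_ipa;
  	__u64 length;
  	union {
  		void __user *addr;
  		__u64 _pad[3];	/* fixes the union size for future fields */
  	};
  	__u64 flags;
  };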

-- 
Catalin


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
  2021-05-04 17:40       ` Catalin Marinas
@ 2021-05-06 16:15         ` Steven Price
  2021-05-07 18:25           ` Catalin Marinas
  0 siblings, 1 reply; 26+ messages in thread
From: Steven Price @ 2021-05-06 16:15 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 04/05/2021 18:40, Catalin Marinas wrote:
> On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
>> On 28/04/2021 18:07, Catalin Marinas wrote:
>>> I probably asked already but is the only way to map a standard RAM page
>>> (not device) in stage 2 via the fault handler? One case I had in mind
>>> was something like get_user_pages() but it looks like that one doesn't
>>> call set_pte_at_notify(). There are a few other places where
>>> set_pte_at_notify() is called and these may happen before we got a
>>> chance to fault on stage 2, effectively populating the entry (IIUC). If
>>> that's an issue, we could move the above loop and check closer to the
>>> actual pte setting like kvm_pgtable_stage2_map().
>>
>> The only call sites of kvm_pgtable_stage2_map() are in mmu.c:
>>
>>   * kvm_phys_addr_ioremap() - maps as device in stage 2
>>
>>   * user_mem_abort() - handled above
>>
>>   * kvm_set_spte_handler() - ultimately called from the .change_pte()
>> callback of the MMU notifier
>>
>> So the last one is potentially a problem. It's called via the MMU notifiers
>> in the case of set_pte_at_notify(). The users of that are:
>>
>>   * uprobe_write_opcode(): Allocates a new page and performs a
>> copy_highpage() to copy the data to the new page (which with MTE includes
>> the tags and will copy across the PG_mte_tagged flag).
>>
>>   * write_protect_page() (KSM): Changes the permissions on the PTE but it's
>> still the same page, so nothing to do regarding MTE.
> 
> My concern here is that the VMM had a stage 1 pte but we haven't yet
> faulted in at stage 2 via user_mem_abort(), so we don't have any stage 2
> pte set. write_protect_page() comes in and sets the new stage 2 pte via
> the callback. I couldn't find any check in kvm_pgtable_stage2_map() for
> the old pte, so it will set the new stage 2 pte regardless. A subsequent
> guest read would no longer fault at stage 2.
> 
>>   * replace_page() (KSM): If the page has MTE tags then the MTE version of
>> memcmp_pages() will return false, so the only caller
>> (try_to_merge_one_page()) will never call this on a page with tags.
>>
>>   * wp_page_copy(): This one is more interesting - if we go down the
>> cow_user_page() path with an old page then everything is safe (tags are
>> copied over). The is_zero_pfn() case worries me a bit - a new page is
>> allocated, but I can't instantly see anything to zero out the tags (and set
>> PG_mte_tagged).
> 
> True, I think tag zeroing happens only if we map it as PROT_MTE in the
> VMM.
> 
>>   * migrate_vma_insert_page(): I think migration should be safe as the tags
>> should be copied.
>>
>> So wp_page_copy() looks suspicious.
>>
>> kvm_pgtable_stage2_map() looks like it could be a good place for the checks,
>> it looks like it should work and is probably a more obvious place for the
>> checks.
> 
> That would be my preference. It also matches the stage 1 set_pte_at().
> 
>>> While the set_pte_at() race on the page flags is somewhat clearer, we
>>> may still have a race here with the VMM's set_pte_at() if the page is
>>> mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
>>> handling the VMM page tables (well, not always, see below).
>>>
>>> gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
>>> path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
>>> would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
>>> sure whether get_user_page_fast_only() does the same.
>>>
>>> The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
>>> KVM mmu notifier is invoked before set_pte_at() and racing with another
>>> user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
>>> set_pte_at() would see the PG_mte_tagged set either by the current CPU
>>> or by the one it was racing with.
>>
>> Given the changes to set_pte_at() which means that tags are restored from
>> swap even if !PROT_MTE, the only race I can see remaining is the creation of
>> new PROT_MTE mappings. As you mention an attempt to change mappings in the
>> VMM memory space should involve a mmu notifier call which I think serialises
>> this. So the remaining issue is doing this in a separate address space.
>>
>> So I guess the potential problem is:
>>
>>   * allocate memory MAP_SHARED but !PROT_MTE
>>   * fork()
>>   * VM causes a fault in parent address space
>>   * child does a mprotect(PROT_MTE)
>>
>> With the last two potentially racing. Sadly I can't see a good way of
>> handling that.
> 
> Ah, the mmap lock doesn't help as they are different processes
> (mprotect() acquires it as a writer).
> 
> I wonder whether this is racy even in the absence of KVM. If both parent
> and child do an mprotect(PROT_MTE), one of them may be reading stale
> tags for a brief period.
> 
> Maybe we should revisit whether shared MTE pages are of any use, though
> it's an ABI change (not bad if no-one is relying on this). However...

Shared MTE pages are certainly hard to use correctly (e.g. see the 
discussions with the VMM accessing guest memory). But I guess that boat 
has sailed.

> Thinking about this, we have a similar problem with the PG_dcache_clean
> and two processes doing mprotect(PROT_EXEC). One of them could see the
> flag set and skip the I-cache maintenance while the other executes
> stale instructions. change_pte_range() could acquire the page lock if
> the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
> solve the MTE/KVM case but we could at least take the page lock via
> user_mem_abort().

For PG_dcache_clean AFAICS the solution is actually simple: split the 
test and set parts. i.e..:

  if (!test_bit(PG_dcache_clean, &page->flags)) {
	sync_icache_aliases(page_address(page), page_size(page));
	set_bit(PG_dcache_clean, &page->flags);
  }

There isn't a problem with repeating the sync_icache_aliases() call in 
the case of a race. Or am I missing something?

> Or maybe we just document this behaviour as racy both for PROT_EXEC and
> PROT_MTE mappings and be done with this. The minor issue with PROT_MTE
> is the potential leaking of tags (it's harder to leak information
> through the I-cache).
> 

This is the real issue I see - the race in PROT_MTE case is either a 
data leak (syncing after setting the bit) or data loss (syncing before 
setting the bit).

But without serialising through a spinlock (in mte_sync_tags()) I 
haven't been able to come up with any way of closing the race. But with 
the change to set_pte_at() to call mte_sync_tags() even if the PTE isn't 
PROT_MTE that is likely to seriously hurt performance.

Steve


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-05-04 17:44       ` Catalin Marinas
@ 2021-05-07  9:44         ` Steven Price
  2021-05-07  9:59           ` David Laight
  0 siblings, 1 reply; 26+ messages in thread
From: Steven Price @ 2021-05-07  9:44 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 04/05/2021 18:44, Catalin Marinas wrote:
> On Thu, Apr 29, 2021 at 05:06:07PM +0100, Steven Price wrote:
>> On 27/04/2021 18:58, Catalin Marinas wrote:
>>> On Fri, Apr 16, 2021 at 04:43:08PM +0100, Steven Price wrote:
>>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>>>> index 24223adae150..2b85a047c37d 100644
>>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>>> @@ -184,6 +184,20 @@ struct kvm_vcpu_events {
>>>>    	__u32 reserved[12];
>>>>    };
>>>> +struct kvm_arm_copy_mte_tags {
>>>> +	__u64 guest_ipa;
>>>> +	__u64 length;
>>>> +	union {
>>>> +		void __user *addr;
>>>> +		__u64 padding;
>>>> +	};
>>>> +	__u64 flags;
>>>> +	__u64 reserved[2];
>>>> +};
> [...]
>>> Maybe add the two reserved
>>> values to the union in case we want to store something else in the
>>> future.
>>
>> I'm not sure what you mean here. What would the reserved fields be unioned
>> with? And surely they are no longer reserved in that case?
> 
> In case you want to keep the structure size the same for future
> expansion and the expansion only happens via the union, you'd add some
> padding in there just in case. We do this for struct siginfo with an
> _si_pad[] array in the union.
> 

Ah I see what you mean. In this case "padding" is just a sizer to ensure 
that flags is always the same alignment - it's not intended to be used. 
As I noted previously though it's completely pointless as this is only on 
arm64 and even 32 bit Arm would naturally align the following __u64.

reserved[] is for expansion and I guess we could have a union over the 
whole struct (like siginfo) but I think it's generally clearer to just 
spell out the reserved fields at the end of the struct.

TLDR; the union will be gone along with "padding" in the next version. 
"reserved" remains at the end of the struct for future use.

Thanks,

Steve


^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest
  2021-05-07  9:44         ` Steven Price
@ 2021-05-07  9:59           ` David Laight
  0 siblings, 0 replies; 26+ messages in thread
From: David Laight @ 2021-05-07  9:59 UTC (permalink / raw)
  To: 'Steven Price', Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

From: Steven Price <steven.price@arm.com>
> Sent: 07 May 2021 10:45
> 
> On 04/05/2021 18:44, Catalin Marinas wrote:
> > On Thu, Apr 29, 2021 at 05:06:07PM +0100, Steven Price wrote:
> >> On 27/04/2021 18:58, Catalin Marinas wrote:
> >>> On Fri, Apr 16, 2021 at 04:43:08PM +0100, Steven Price wrote:
> >>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> >>>> index 24223adae150..2b85a047c37d 100644
> >>>> --- a/arch/arm64/include/uapi/asm/kvm.h
> >>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
> >>>> @@ -184,6 +184,20 @@ struct kvm_vcpu_events {
> >>>>    	__u32 reserved[12];
> >>>>    };
> >>>> +struct kvm_arm_copy_mte_tags {
> >>>> +	__u64 guest_ipa;
> >>>> +	__u64 length;
> >>>> +	union {
> >>>> +		void __user *addr;
> >>>> +		__u64 padding;
> >>>> +	};
> >>>> +	__u64 flags;
> >>>> +	__u64 reserved[2];
> >>>> +};
> > [...]
> >>> Maybe add the two reserved
> >>> values to the union in case we want to store something else in the
> >>> future.
> >>
> >> I'm not sure what you mean here. What would the reserved fields be unioned
> >> with? And surely they are no longer reserved in that case?
> >
> > In case you want to keep the structure size the same for future
> > expansion and the expansion only happens via the union, you'd add some
> > padding in there just in case. We do this for struct siginfo with an
> > _si_pad[] array in the union.
> >
> 
> Ah I see what you mean. In this case "padding" is just a sizer to ensure
> that flags is always the same alignment - it's not intended to be used.
> As I noted previously though it's completely pointless as this is only on
> arm64 and even 32 bit Arm would naturally align the following __u64.

It is nice to be explicit though.
You also have the problem that a 32bit (LE) application would leave the
high bits of the user address undefined.

All moot and pointless if 64bit only though.

	David


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
  2021-05-06 16:15         ` Steven Price
@ 2021-05-07 18:25           ` Catalin Marinas
  2021-05-10 18:35             ` Catalin Marinas
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2021-05-07 18:25 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> On 04/05/2021 18:40, Catalin Marinas wrote:
> > On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> > > On 28/04/2021 18:07, Catalin Marinas wrote:
> > > > While the set_pte_at() race on the page flags is somewhat clearer, we
> > > > may still have a race here with the VMM's set_pte_at() if the page is
> > > > mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
> > > > handling the VMM page tables (well, not always, see below).
> > > > 
> > > > gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
> > > > path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
> > > > would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
> > > > sure whether get_user_page_fast_only() does the same.
> > > > 
> > > > The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
> > > > KVM mmu notifier is invoked before set_pte_at() and racing with another
> > > > user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
> > > > set_pte_at() would see the PG_mte_tagged set either by the current CPU
> > > > or by the one it was racing with.
> > > 
> > > Given the changes to set_pte_at() which means that tags are restored from
> > > swap even if !PROT_MTE, the only race I can see remaining is the creation of
> > > new PROT_MTE mappings. As you mention an attempt to change mappings in the
> > > VMM memory space should involve a mmu notifier call which I think serialises
> > > this. So the remaining issue is doing this in a separate address space.
> > > 
> > > So I guess the potential problem is:
> > > 
> > >   * allocate memory MAP_SHARED but !PROT_MTE
> > >   * fork()
> > >   * VM causes a fault in parent address space
> > >   * child does a mprotect(PROT_MTE)
> > > 
> > > With the last two potentially racing. Sadly I can't see a good way of
> > > handling that.
> > 
> > Ah, the mmap lock doesn't help as they are different processes
> > (mprotect() acquires it as a writer).
> > 
> > I wonder whether this is racy even in the absence of KVM. If both parent
> > and child do an mprotect(PROT_MTE), one of them may be reading stale
> > tags for a brief period.
> > 
> > Maybe we should revisit whether shared MTE pages are of any use, though
> > it's an ABI change (not bad if no-one is relying on this). However...
> 
> Shared MTE pages are certainly hard to use correctly (e.g. see the
> discussions with the VMM accessing guest memory). But I guess that boat has
> sailed.

Digging out some old emails (two years ago), the Chrome people may have
found a use for MTE in shared mappings (with memfd_create), though not
sure they took advantage of this yet.

> > Thinking about this, we have a similar problem with the PG_dcache_clean
> > and two processes doing mprotect(PROT_EXEC). One of them could see the
> > flag set and skip the I-cache maintenance while the other executes
> > stale instructions. change_pte_range() could acquire the page lock if
> > the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
> > solve the MTE/KVM case but we could at least take the page lock via
> > user_mem_abort().
> 
> For PG_dcache_clean AFAICS the solution is actually simple: split the test
> and set parts. i.e..:
> 
>  if (!test_bit(PG_dcache_clean, &page->flags)) {
> 	sync_icache_aliases(page_address(page), page_size(page));
> 	set_bit(PG_dcache_clean, &page->flags);
>  }
> 
> There isn't a problem with repeating the sync_icache_aliases() call in the
> case of a race. Or am I missing something?

No, the fix is simple as you said.

> > Or maybe we just document this behaviour as racy both for PROT_EXEC and
> > PROT_MTE mappings and be done with this. The minor issue with PROT_MTE
> > is the potential leaking of tags (it's harder to leak information
> > through the I-cache).
> 
> This is the real issue I see - the race in PROT_MTE case is either a data
> leak (syncing after setting the bit) or data loss (syncing before setting
> the bit).

For a moment I thought an mmap(PROT_MTE, MAP_SHARED) on a memfd/tmpfs
file may lead to the same situation, but the mmap() itself won't directly
cause the page to be allocated. The subsequent fault via
filemap_map_pages() seems to take the page lock.

> But without serialising through a spinlock (in mte_sync_tags()) I haven't
> been able to come up with any way of closing the race. But with the change
> to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
> is likely to seriously hurt performance.

Yeah. We could add another page flag as a lock though I think it should
be the core code that prevents the race.

If we are to do it in the arch code, maybe easier with a custom
ptep_modify_prot_start/end() where we check if it's VM_SHARED and
VM_MTE, take a (big) lock.

In the core code, something like below (well, a partial hack, not tested
and it doesn't handle huge pages but just to give an idea):

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..6ba96ff141a6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -76,14 +76,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		if (pte_present(oldpte)) {
 			pte_t ptent;
 			bool preserve_write = prot_numa && pte_write(oldpte);
+			struct page *page = NULL;
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
 			 * pages. See similar comment in change_huge_pmd.
 			 */
 			if (prot_numa) {
-				struct page *page;
-
 				/* Avoid TLB flush if possible */
 				if (pte_protnone(oldpte))
 					continue;
@@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			}
 
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
+			if (vma->vm_flags & VM_SHARED) {
+				page = vm_normal_page(vma, addr, oldpte);
+				lock_page(page);
+			}
 			ptent = pte_modify(oldpte, newprot);
 			if (preserve_write)
 				ptent = pte_mk_savedwrite(ptent);
@@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				ptent = pte_mkwrite(ptent);
 			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+			if (page)
+				unlock_page(page);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);

-- 
Catalin


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
  2021-05-07 18:25           ` Catalin Marinas
@ 2021-05-10 18:35             ` Catalin Marinas
  2021-05-12 15:46               ` Steven Price
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2021-05-10 18:35 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:
> On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> > On 04/05/2021 18:40, Catalin Marinas wrote:
> > > On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> > > > Given the changes to set_pte_at() which means that tags are restored from
> > > > swap even if !PROT_MTE, the only race I can see remaining is the creation of
> > > > new PROT_MTE mappings. As you mention an attempt to change mappings in the
> > > > VMM memory space should involve a mmu notifier call which I think serialises
> > > > this. So the remaining issue is doing this in a separate address space.
> > > > 
> > > > So I guess the potential problem is:
> > > > 
> > > >   * allocate memory MAP_SHARED but !PROT_MTE
> > > >   * fork()
> > > >   * VM causes a fault in parent address space
> > > >   * child does a mprotect(PROT_MTE)
> > > > 
> > > > With the last two potentially racing. Sadly I can't see a good way of
> > > > handling that.
> > > 
> > > Ah, the mmap lock doesn't help as they are different processes
> > > (mprotect() acquires it as a writer).
> > > 
> > > I wonder whether this is racy even in the absence of KVM. If both parent
> > > and child do an mprotect(PROT_MTE), one of them may be reading stale
> > > tags for a brief period.
> > > 
> > > Maybe we should revisit whether shared MTE pages are of any use, though
> > > it's an ABI change (not bad if no-one is relying on this). However...
[...]
> > > Thinking about this, we have a similar problem with the PG_dcache_clean
> > > and two processes doing mprotect(PROT_EXEC). One of them could see the
> > > flag set and skip the I-cache maintenance while the other executes
> > > stale instructions. change_pte_range() could acquire the page lock if
> > > the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
> > > solve the MTE/KVM case but we could at least take the page lock via
> > > user_mem_abort().
[...]
> > This is the real issue I see - the race in PROT_MTE case is either a data
> > leak (syncing after setting the bit) or data loss (syncing before setting
> > the bit).
[...]
> > But without serialising through a spinlock (in mte_sync_tags()) I haven't
> > been able to come up with any way of closing the race. But with the change
> > to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
> > is likely to seriously hurt performance.
> 
> Yeah. We could add another page flag as a lock though I think it should
> be the core code that prevents the race.
> 
> If we are to do it in the arch code, maybe easier with a custom
> ptep_modify_prot_start/end() where we check if it's VM_SHARED and
> VM_MTE, take a (big) lock.

I think in the general case we don't even need VM_SHARED. For example,
we have two processes mapping a file, read-only. An mprotect() call in
both processes will race on the page->flags via the corresponding
set_pte_at(). I think an mprotect() with a page fault in different
processes can also race.

The PROT_EXEC case can be easily fixed, as you said already. The
PROT_MTE with MAP_PRIVATE I think can be made safe by a similar
approach: test flag, clear tags, set flag. A subsequent write would
trigger a CoW, so different page anyway.
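
i.e. the same shape as the PG_dcache_clean fix, something like (untested
sketch):

  if (!test_bit(PG_mte_tagged, &page->flags)) {
  	mte_clear_page_tags(page_address(page));
  	/* a racing thread may clear the tags twice, but never sees the
  	 * flag set before the tags are in a defined state */
  	set_bit(PG_mte_tagged, &page->flags);
  }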

Anyway, I don't think ptep_modify_prot_start/end would buy us much, it
probably makes the code even harder to read.

> In the core code, something like below (well, a partial hack, not tested
> and it doesn't handle huge pages but just to give an idea):
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 94188df1ee55..6ba96ff141a6 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  			}
>  
>  			oldpte = ptep_modify_prot_start(vma, addr, pte);
> +			if (vma->vm_flags & VM_SHARED) {
> +				page = vm_normal_page(vma, addr, oldpte);
> +				lock_page(page);
> +			}
>  			ptent = pte_modify(oldpte, newprot);
>  			if (preserve_write)
>  				ptent = pte_mk_savedwrite(ptent);
> @@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				ptent = pte_mkwrite(ptent);
>  			}
>  			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> +			if (page)
> +				unlock_page(page);
>  			pages++;
>  		} else if (is_swap_pte(oldpte)) {
>  			swp_entry_t entry = pte_to_swp_entry(oldpte);

That's bogus: lock_page() might sleep but this whole code sequence is
under the ptl spinlock. There are some lock_page_* variants but that
would involve either a busy loop on this path or some bailing out,
waiting for a release.

Options:

1. Change the mte_sync_tags() code path to set the flag after clearing
   and avoid reading stale tags. We document that mprotect() on
   MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
   arch code and return an error.

2. Figure out some other locking in the core code. However, if
   mprotect() in one process can race with a handle_pte_fault() in
   another, on the same shared mapping, it's not trivial.
   filemap_map_pages() would take the page lock before calling
   do_set_pte(), so mprotect() would need the same page lock.

3. Use another PG_arch_3 bit as a lock to spin on in the arch code (i.e.
   set it around the other PG_arch_* bit setting).

I ran out of ideas.

-- 
Catalin


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
  2021-05-10 18:35             ` Catalin Marinas
@ 2021-05-12 15:46               ` Steven Price
  2021-05-12 17:45                 ` Catalin Marinas
  0 siblings, 1 reply; 26+ messages in thread
From: Steven Price @ 2021-05-12 15:46 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 10/05/2021 19:35, Catalin Marinas wrote:
> On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:
>> On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
>>> On 04/05/2021 18:40, Catalin Marinas wrote:
>>>> On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
>>>>> Given the changes to set_pte_at() which means that tags are restored from
>>>>> swap even if !PROT_MTE, the only race I can see remaining is the creation of
>>>>> new PROT_MTE mappings. As you mention an attempt to change mappings in the
>>>>> VMM memory space should involve a mmu notifier call which I think serialises
>>>>> this. So the remaining issue is doing this in a separate address space.
>>>>>
>>>>> So I guess the potential problem is:
>>>>>
>>>>>    * allocate memory MAP_SHARED but !PROT_MTE
>>>>>    * fork()
>>>>>    * VM causes a fault in parent address space
>>>>>    * child does a mprotect(PROT_MTE)
>>>>>
>>>>> With the last two potentially racing. Sadly I can't see a good way of
>>>>> handling that.
>>>>
>>>> Ah, the mmap lock doesn't help as they are different processes
>>>> (mprotect() acquires it as a writer).
>>>>
>>>> I wonder whether this is racy even in the absence of KVM. If both parent
>>>> and child do an mprotect(PROT_MTE), one of them may be reading stale
>>>> tags for a brief period.
>>>>
>>>> Maybe we should revisit whether shared MTE pages are of any use, though
>>>> it's an ABI change (not bad if no-one is relying on this). However...
> [...]
>>>> Thinking about this, we have a similar problem with the PG_dcache_clean
>>>> and two processes doing mprotect(PROT_EXEC). One of them could see the
>>>> flag set and skip the I-cache maintenance while the other executes
>>>> stale instructions. change_pte_range() could acquire the page lock if
>>>> the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
>>>> solve the MTE/KVM case but we could at least take the page lock via
>>>> user_mem_abort().
> [...]
>>> This is the real issue I see - the race in PROT_MTE case is either a data
>>> leak (syncing after setting the bit) or data loss (syncing before setting
>>> the bit).
> [...]
>>> But without serialising through a spinlock (in mte_sync_tags()) I haven't
>>> been able to come up with any way of closing the race. But with the change
>>> to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
>>> is likely to seriously hurt performance.
>>
>> Yeah. We could add another page flag as a lock though I think it should
>> be the core code that prevents the race.
>>
>> If we are to do it in the arch code, maybe easier with a custom
>> ptep_modify_prot_start/end() where we check if it's VM_SHARED and
>> VM_MTE, take a (big) lock.
> 
> I think in the general case we don't even need VM_SHARED. For example,
> we have two processes mapping a file, read-only. An mprotect() call in
> both processes will race on the page->flags via the corresponding
> set_pte_at(). I think an mprotect() with a page fault in different
> processes can also race.
> 
> The PROT_EXEC case can be easily fixed, as you said already. The
> PROT_MTE with MAP_PRIVATE I think can be made safe by a similar
> approach: test flag, clear tags, set flag. A subsequent write would
> trigger a CoW, so different page anyway.
> 
> Anyway, I don't think ptep_modify_prot_start/end would buy us much, it
> probably makes the code even harder to read.
> 
>> In the core code, something like below (well, a partial hack, not tested
>> and it doesn't handle huge pages but just to give an idea):
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 94188df1ee55..6ba96ff141a6 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>>   			}
>>   
>>   			oldpte = ptep_modify_prot_start(vma, addr, pte);
>> +			if (vma->vm_flags & VM_SHARED) {
>> +				page = vm_normal_page(vma, addr, oldpte);
>> +				lock_page(page);
>> +			}
>>   			ptent = pte_modify(oldpte, newprot);
>>   			if (preserve_write)
>>   				ptent = pte_mk_savedwrite(ptent);
>> @@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>>   				ptent = pte_mkwrite(ptent);
>>   			}
>>   			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>> +			if (page)
>> +				unlock_page(page);
>>   			pages++;
>>   		} else if (is_swap_pte(oldpte)) {
>>   			swp_entry_t entry = pte_to_swp_entry(oldpte);
> 
> That's bogus: lock_page() might sleep but this whole code sequence is
> under the ptl spinlock. There are some lock_page_* variants but that
> would involve either a busy loop on this path or some bailing out,
> waiting for a release.
> 
> Options:
> 
> 1. Change the mte_sync_tags() code path to set the flag after clearing
>     and avoid reading stale tags. We document that mprotect() on
>     MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
>     arch code and return an error.

This is the best option I've come up with so far - but it's not a good
one! We can replace the set_bit() with a test_and_set_bit() to catch the
race after it has occurred - but I'm not sure what we can do about it
then (we've already wiped the data). Returning an error doesn't seem
particularly useful at that point, a message in dmesg is about the best
I can come up with.
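
i.e. something like this (sketch only), which detects the race but can't 
undo it:

  mte_clear_page_tags(page_address(page));
  if (test_and_set_bit(PG_mte_tagged, &page->flags))
  	/* another CPU thought the tags were already initialised, so we
  	 * may just have wiped its tags */
  	pr_warn_ratelimited("mte: racing tag initialisation on pfn %lx\n",
  			    page_to_pfn(page));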

> 2. Figure out some other locking in the core code. However, if
>     mprotect() in one process can race with a handle_pte_fault() in
>     another, on the same shared mapping, it's not trivial.
>     filemap_map_pages() would take the page lock before calling
>     do_set_pte(), so mprotect() would need the same page lock.

I can't see how this is going to work without harming the performance of
non-MTE work. Ultimately we're trying to add some sort of locking for
two (mostly) unrelated processes doing page table operations, which will
hurt scalability.

> 3. Use another PG_arch_3 bit as a lock to spin on in the arch code (i.e.
>     set it around the other PG_arch_* bit setting).

This is certainly tempting, although sadly the existing
wait_on_page_bit() is sleeping - so this would either be a literal spin,
or we'd need to implement a new non-sleeping wait mechanism.

> I ran out of ideas.
> 

4. Sledgehammer locking in mte_sync_page_tags(), add a spinlock only for
the MTE case where we have to sync tags (see below). What the actual
performance impact of this is I've no idea. It could certainly be bad
if there are a lot of pages with MTE enabled, which sadly is exactly
the case if KVM is used with MTE :(

Steve

--->8----
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 0d320c060ebe..389ad40256f6 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -25,23 +25,33 @@
  u64 gcr_kernel_excl __ro_after_init;
  
  static bool report_fault_once = true;
+static spinlock_t tag_sync_lock;
  
  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
  			       bool pte_is_tagged)
  {
  	pte_t old_pte = READ_ONCE(*ptep);
  
+	if (!is_swap_pte(old_pte) && !pte_is_tagged)
+		return;
+
+	spin_lock_irqsave(&tag_sync_lock, flags);
+
+	/* Recheck with the lock held */
+	if (test_bit(PG_mte_tagged, &page->flags))
+		goto out;
+
  	if (check_swap && is_swap_pte(old_pte)) {
  		swp_entry_t entry = pte_to_swp_entry(old_pte);
  
  		if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
  			set_bit(PG_mte_tagged, &page->flags);
-			return;
+			goto out;
  		}
  	}
  
  	if (!pte_is_tagged)
-		return;
+		goto out;
  
  	page_kasan_tag_reset(page);
  	/*
@@ -55,6 +65,9 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
  	mte_clear_page_tags(page_address(page));
  
  	set_bit(PG_mte_tagged, &page->flags);
+
+out:
+	spin_unlock_irqrestore(&tag_sync_lock, flags);
  }
  
  void mte_sync_tags(pte_t *ptep, pte_t pte)
@@ -63,6 +76,11 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
  	long i, nr_pages = compound_nr(page);
  	bool check_swap = nr_pages == 1;
  	bool pte_is_tagged = pte_tagged(pte);
+	unsigned long flags;
+
+	/* Early out if there's nothing to do */
+	if (!check_swap && !pte_is_tagged)
+		return;
  
  	/* if PG_mte_tagged is set, tags have already been initialised */
  	for (i = 0; i < nr_pages; i++, page++) {


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
  2021-05-12 15:46               ` Steven Price
@ 2021-05-12 17:45                 ` Catalin Marinas
  2021-05-13 10:57                   ` Steven Price
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2021-05-12 17:45 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Wed, May 12, 2021 at 04:46:48PM +0100, Steven Price wrote:
> On 10/05/2021 19:35, Catalin Marinas wrote:
> > On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:
> > > On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> > > > On 04/05/2021 18:40, Catalin Marinas wrote:
> > > > > On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> > > > > > Given the changes to set_pte_at() which means that tags are restored from
> > > > > > swap even if !PROT_MTE, the only race I can see remaining is the creation of
> > > > > > new PROT_MTE mappings. As you mention an attempt to change mappings in the
> > > > > > VMM memory space should involve a mmu notifier call which I think serialises
> > > > > > this. So the remaining issue is doing this in a separate address space.
> > > > > > 
> > > > > > So I guess the potential problem is:
> > > > > > 
> > > > > >    * allocate memory MAP_SHARED but !PROT_MTE
> > > > > >    * fork()
> > > > > >    * VM causes a fault in parent address space
> > > > > >    * child does a mprotect(PROT_MTE)
> > > > > > 
> > > > > > With the last two potentially racing. Sadly I can't see a good way of
> > > > > > handling that.
[...]
> > Options:
> > 
> > 1. Change the mte_sync_tags() code path to set the flag after clearing
> >     and avoid reading stale tags. We document that mprotect() on
> >     MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
> >     arch code and return an error.
> 
> This is the best option I've come up with so far - but it's not a good
> one! We can replace the set_bit() with a test_and_set_bit() to catch the
> race after it has occurred - but I'm not sure what we can do about it
> then (we've already wiped the data). Returning an error doesn't seem
> particularly useful at that point, a message in dmesg is about the best
> I can come up with.

What I meant about intercepting is on something like
arch_validate_flags() to prevent VM_SHARED and VM_MTE together but only
for mprotect(), not mmap(). However, arch_validate_flags() is currently
called on both mmap() and mprotect() paths.
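
Roughly something like the below (sketch only, with the VM_SHARED check
being the new part; and as said, this can't currently tell the mmap()
and mprotect() callers apart):

  static inline bool arch_validate_flags(unsigned long vm_flags)
  {
  	if (!system_supports_mte())
  		return true;

  	/* reject shared tagged mappings outright */
  	if ((vm_flags & VM_MTE) && (vm_flags & VM_SHARED))
  		return false;

  	/* VM_MTE only where VM_MTE_ALLOWED has been set */
  	return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
  }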

We can't do much in set_pte_at() to prevent the race with only a single
bit.

> > 2. Figure out some other locking in the core code. However, if
> >     mprotect() in one process can race with a handle_pte_fault() in
> >     another, on the same shared mapping, it's not trivial.
> >     filemap_map_pages() would take the page lock before calling
> >     do_set_pte(), so mprotect() would need the same page lock.
> 
> I can't see how this is going to work without harming the performance of
> non-MTE work. Ultimately we're trying to add some sort of locking for
> two (mostly) unrelated processes doing page table operations, which will
> hurt scalability.

Another option is to have an arch callback to force re-faulting on the
pte. That means we don't populate it back after the invalidation in the
change_protection() path. We could do this only if the new pte is tagged
and the page doesn't have PG_mte_tagged. The faulting path takes the
page lock IIUC.

Well, at least for stage 1, I haven't thought much about stage 2.

> > 3. Use another PG_arch_3 bit as a lock to spin on in the arch code (i.e.
> >     set it around the other PG_arch_* bit setting).
> 
> This is certainly tempting, although sadly the existing
> wait_on_page_bit() is sleeping - so this would either be a literal spin,
> or we'd need to implement a new non-sleeping wait mechanism.

Yeah, it would have to be a custom spinning mechanism, something like:

	/* lock the page */
	while (test_and_set_bit(PG_arch_3, &page->flags))
		smp_cond_load_relaxed(&page->flags, !(VAL & PG_arch_3));
	...
	/* unlock the page */
	clear_bit(PG_arch_3, &page->flags);

> 4. Sledgehammer locking in mte_sync_page_tags(), add a spinlock only for
> the MTE case where we have to sync tags (see below). What the actual
> performance impact of this is I've no idea. It could certainly be bad
> if there are a lot of pages with MTE enabled, which sadly is exactly
> the case if KVM is used with MTE :(
> 
> --->8----
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index 0d320c060ebe..389ad40256f6 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -25,23 +25,33 @@
>  u64 gcr_kernel_excl __ro_after_init;
>  static bool report_fault_once = true;
> +static spinlock_t tag_sync_lock;
>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
>  			       bool pte_is_tagged)
>  {
>  	pte_t old_pte = READ_ONCE(*ptep);
> +	if (!is_swap_pte(old_pte) && !pte_is_tagged)
> +		return;
> +
> +	spin_lock_irqsave(&tag_sync_lock, flags);
> +
> +	/* Recheck with the lock held */
> +	if (test_bit(PG_mte_tagged, &page->flags))
> +		goto out;

Can we skip the lock if the page already has the PG_mte_tagged flag set?
That's assuming that we set the flag only after clearing the tags. The
normal case where mprotect() is called on a page already mapped with
PROT_MTE would not be affected.

-- 
Catalin


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
  2021-05-12 17:45                 ` Catalin Marinas
@ 2021-05-13 10:57                   ` Steven Price
  2021-05-13 15:08                     ` Catalin Marinas
  2021-05-13 15:21                     ` Catalin Marinas
  0 siblings, 2 replies; 26+ messages in thread
From: Steven Price @ 2021-05-13 10:57 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On 12/05/2021 18:45, Catalin Marinas wrote:
> On Wed, May 12, 2021 at 04:46:48PM +0100, Steven Price wrote:
>> On 10/05/2021 19:35, Catalin Marinas wrote:
>>> On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:
>>>> On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
>>>>> On 04/05/2021 18:40, Catalin Marinas wrote:
>>>>>> On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
>>>>>>> Given the changes to set_pte_at() which means that tags are restored from
>>>>>>> swap even if !PROT_MTE, the only race I can see remaining is the creation of
>>>>>>> new PROT_MTE mappings. As you mention an attempt to change mappings in the
>>>>>>> VMM memory space should involve a mmu notifier call which I think serialises
>>>>>>> this. So the remaining issue is doing this in a separate address space.
>>>>>>>
>>>>>>> So I guess the potential problem is:
>>>>>>>
>>>>>>>    * allocate memory MAP_SHARED but !PROT_MTE
>>>>>>>    * fork()
>>>>>>>    * VM causes a fault in parent address space
>>>>>>>    * child does a mprotect(PROT_MTE)
>>>>>>>
>>>>>>> With the last two potentially racing. Sadly I can't see a good way of
>>>>>>> handling that.
> [...]
>>> Options:
>>>
>>> 1. Change the mte_sync_tags() code path to set the flag after clearing
>>>     and avoid reading stale tags. We document that mprotect() on
>>>     MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
>>>     arch code and return an error.
>>
>> This is the best option I've come up with so far - but it's not a good
>> one! We can replace the set_bit() with a test_and_set_bit() to catch the
>> race after it has occurred - but I'm not sure what we can do about it
>> then (we've already wiped the data). Returning an error doesn't seem
>> particularly useful at that point, a message in dmesg is about the best
>> I can come up with.
> 
> What I meant about intercepting is on something like
> arch_validate_flags() to prevent VM_SHARED and VM_MTE together but only
> for mprotect(), not mmap(). However, arch_validate_flags() is currently
> called on both mmap() and mprotect() paths.

I think even if we were to restrict mprotect() there would be corner
cases around swapping in. For example, if a page mapped VM_SHARED|VM_MTE
is faulted in simultaneously by both processes then we have the same
situation:

 * with test_and_set_bit() one process could potentially see the tags
before they have been restored - i.e. a data leak.

 * with a separate test and set, one process could write to the tags
before the second restore has completed, causing a lost update.

Obviously completely banning VM_SHARED|VM_MTE might work, but I don't
think that's a good idea.
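
For reference, the blunt ban would only be a couple of lines in
arch_validate_flags() - untested sketch, and not something I'm
actually proposing:

	/* reject tagged shared mappings outright */
	if ((vm_flags & (VM_MTE | VM_SHARED)) == (VM_MTE | VM_SHARED))
		return false;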

> We can't do much in set_pte_at() to prevent the race with only a single
> bit.
> 
>>> 2. Figure out some other locking in the core code. However, if
>>>     mprotect() in one process can race with a handle_pte_fault() in
>>>     another, on the same shared mapping, it's not trivial.
>>>     filemap_map_pages() would take the page lock before calling
>>>     do_set_pte(), so mprotect() would need the same page lock.
>>
>> I can't see how this is going to work without harming the performance of
>> non-MTE work. Ultimately we're trying to add some sort of locking for
>> two (mostly) unrelated processes doing page table operations, which will
>> hurt scalability.
> 
> Another option is to have an arch callback to force re-faulting on the
> pte. That means we don't populate it back after the invalidation in the
> change_protection() path. We could do this only if the new pte is tagged
> and the page doesn't have PG_mte_tagged. The faulting path takes the
> page lock IIUC.

As above - I don't think this race is just on the change_protection() path.

> Well, at least for stage 1, I haven't thought much about stage 2.
> 
>>> 3. Use another PG_arch_3 bit as a lock to spin on in the arch code (i.e.
>>>     set it around the other PG_arch_* bit setting).
>>
>> This is certainly tempting, although sadly the existing
>> wait_on_page_bit() is sleeping - so this would either be a literal spin,
>> or we'd need to implement a new non-sleeping wait mechanism.
> 
> Yeah, it would have to be a custom spinning mechanism, something like:
> 
> 	/* lock the page (test_and_set_bit_lock() gives acquire ordering) */
> 	while (test_and_set_bit_lock(PG_arch_3, &page->flags))
> 		smp_cond_load_relaxed(&page->flags, !(VAL & BIT(PG_arch_3)));
> 	...
> 	/* unlock the page with release ordering */
> 	clear_bit_unlock(PG_arch_3, &page->flags);

Presumably we'd also need to disable interrupts to make sure we can't
be preempted in the middle and potentially deadlock. It's doable, but
I'd prefer not to invent a new lock type if possible.
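
If we did go down this route, what I mean is something like the below
(untested sketch; whether anything really takes this bit lock from
interrupt context is an assumption on my part):

	unsigned long flags;

	/*
	 * lock the page with interrupts off so we can't deadlock against
	 * anything on this CPU that tries to take the same bit lock
	 */
	local_irq_save(flags);
	while (test_and_set_bit_lock(PG_arch_3, &page->flags))
		smp_cond_load_relaxed(&page->flags, !(VAL & BIT(PG_arch_3)));
	...
	/* unlock the page */
	clear_bit_unlock(PG_arch_3, &page->flags);
	local_irq_restore(flags);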

>> 4. Sledgehammer locking in mte_sync_page_tags(), add a spinlock only for
>> the MTE case where we have to sync tags (see below). What the actual
>> performance impact of this is I've no idea. It could certainly be bad
>> if there are a lot of pages with MTE enabled, which sadly is exactly
>> the case if KVM is used with MTE :(
>>
>> --->8----
>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> index 0d320c060ebe..389ad40256f6 100644
>> --- a/arch/arm64/kernel/mte.c
>> +++ b/arch/arm64/kernel/mte.c
>> @@ -25,23 +25,33 @@
>>  u64 gcr_kernel_excl __ro_after_init;
>>  static bool report_fault_once = true;
>> +static spinlock_t tag_sync_lock;
>>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
>>  			       bool pte_is_tagged)
>>  {
>>  	pte_t old_pte = READ_ONCE(*ptep);
>> +	if (!is_swap_pte(old_pte) && !pte_is_tagged)
>> +		return;
>> +
>> +	spin_lock_irqsave(&tag_sync_lock, flags);
>> +
>> +	/* Recheck with the lock held */
>> +	if (test_bit(PG_mte_tagged, &page->flags))
>> +		goto out;
> 
> Can we skip the lock if the page already has the PG_mte_tagged set?
> That's assuming that we set the flag only after clearing the tags. The
> normal case where mprotect() is called on a page already mapped with
> PROT_MTE would not be affected.
> 

It was missing from the diff context (sorry, should have double checked
that), but I was keeping the check in mte_sync_tags():

  void mte_sync_tags(pte_t *ptep, pte_t pte)
  {
	struct page *page = pte_page(pte);
	long i, nr_pages = compound_nr(page);
	bool check_swap = nr_pages == 1;
	bool pte_is_tagged = pte_tagged(pte);
	unsigned long flags;

	/* Early out if there's nothing to do */
	if (!check_swap && !pte_is_tagged)
		return;

	/* if PG_mte_tagged is set, tags have already been initialised */
	for (i = 0; i < nr_pages; i++, page++) {
		if (!test_bit(PG_mte_tagged, &page->flags))
			mte_sync_page_tags(page, ptep, check_swap,
					   pte_is_tagged);
	}
  }

So the hit is only taken if !PG_mte_tagged - hence the "recheck" comment
in mte_sync_page_tags() once the lock is held. I guess if I'm going this
route it would make sense to refactor this to be a bit clearer.
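
FWIW the refactor I have in mind is something along these lines
(untested sketch, written against my memory of the current mte.c, so
the swap-restore details may not be exact):

static DEFINE_SPINLOCK(tag_sync_lock);

static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
			       bool pte_is_tagged)
{
	pte_t old_pte = READ_ONCE(*ptep);
	unsigned long flags;

	/* avoid the lock entirely if there's definitely nothing to do */
	if (!is_swap_pte(old_pte) && !pte_is_tagged)
		return;

	spin_lock_irqsave(&tag_sync_lock, flags);

	/* recheck with the lock held */
	if (test_bit(PG_mte_tagged, &page->flags))
		goto out;

	if (check_swap && is_swap_pte(old_pte)) {
		swp_entry_t entry = pte_to_swp_entry(old_pte);

		if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
			goto out_set;
	}

	if (!pte_is_tagged)
		goto out;

	page_kasan_tag_reset(page);
	/* keep the existing barrier between the flags update and the tag clear */
	smp_wmb();
	mte_clear_page_tags(page_address(page));

out_set:
	/* only set the flag once the tags are known to be initialised */
	set_bit(PG_mte_tagged, &page->flags);
out:
	spin_unlock_irqrestore(&tag_sync_lock, flags);
}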

Steve


* Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
  2021-05-13 10:57                   ` Steven Price
@ 2021-05-13 15:08                     ` Catalin Marinas
  2021-05-13 15:21                     ` Catalin Marinas
  1 sibling, 0 replies; 26+ messages in thread
From: Catalin Marinas @ 2021-05-13 15:08 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Thu, May 13, 2021 at 11:57:39AM +0100, Steven Price wrote:
> On 12/05/2021 18:45, Catalin Marinas wrote:
> > On Wed, May 12, 2021 at 04:46:48PM +0100, Steven Price wrote:
> >>>>>> On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> >>>>>>> Given the changes to set_pte_at() which means that tags are restored from
> >>>>>>> swap even if !PROT_MTE, the only race I can see remaining is the creation of
> >>>>>>> new PROT_MTE mappings. As you mention an attempt to change mappings in the
> >>>>>>> VMM memory space should involve a mmu notifier call which I think serialises
> >>>>>>> this. So the remaining issue is doing this in a separate address space.
> >>>>>>>
> >>>>>>> So I guess the potential problem is:
> >>>>>>>
> >>>>>>>    * allocate memory MAP_SHARED but !PROT_MTE
> >>>>>>>    * fork()
> >>>>>>>    * VM causes a fault in parent address space
> >>>>>>>    * child does a mprotect(PROT_MTE)
> >>>>>>>
> >>>>>>> With the last two potentially racing. Sadly I can't see a good way of
> >>>>>>> handling that.
[...]
> >> 4. Sledgehammer locking in mte_sync_page_tags(), add a spinlock only for
> >> the MTE case where we have to sync tags (see below). What the actual
> >> performance impact of this is I've no idea. It could certainly be bad
> >> if there are a lot of pages with MTE enabled, which sadly is exactly
> >> the case if KVM is used with MTE :(
> >>
> >> --->8----
> >> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> >> index 0d320c060ebe..389ad40256f6 100644
> >> --- a/arch/arm64/kernel/mte.c
> >> +++ b/arch/arm64/kernel/mte.c
> >> @@ -25,23 +25,33 @@
> >>  u64 gcr_kernel_excl __ro_after_init;
> >>  static bool report_fault_once = true;
> >> +static spinlock_t tag_sync_lock;
> >>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
> >>  			       bool pte_is_tagged)
> >>  {
> >>  	pte_t old_pte = READ_ONCE(*ptep);
> >> +	if (!is_swap_pte(old_pte) && !pte_is_tagged)
> >> +		return;
> >> +
> >> +	spin_lock_irqsave(&tag_sync_lock, flags);
> >> +
> >> +	/* Recheck with the lock held */
> >> +	if (test_bit(PG_mte_tagged, &page->flags))
> >> +		goto out;
> > 
> > Can we skip the lock if the page already has the PG_mte_tagged set?
> > That's assuming that we set the flag only after clearing the tags. The
> > normal case where mprotect() is called on a page already mapped with
> > PROT_MTE would not be affected.
> 
> It was missing from the diff context (sorry, should have double checked
> that), but I was keeping the check in mte_sync_tags():
> 
>   void mte_sync_tags(pte_t *ptep, pte_t pte)
>   {
> 	struct page *page = pte_page(pte);
> 	long i, nr_pages = compound_nr(page);
> 	bool check_swap = nr_pages == 1;
> 	bool pte_is_tagged = pte_tagged(pte);
> 	unsigned long flags;
> 
> 	/* Early out if there's nothing to do */
> 	if (!check_swap && !pte_is_tagged)
> 		return;
> 
> 	/* if PG_mte_tagged is set, tags have already been initialised */
> 	for (i = 0; i < nr_pages; i++, page++) {
> 		if (!test_bit(PG_mte_tagged, &page->flags))
> 			mte_sync_page_tags(page, ptep, check_swap,
> 					   pte_is_tagged);
> 	}
>   }
> 
> So the hit is only taken if !PG_mte_tagged - hence the "recheck" comment
> in mte_sync_page_tags() once the lock is held. I guess if I'm going this
> route it would make sense to refactor this to be a bit clearer.

I think we can go with this for now, but we should still raise it with
the mm folk; maybe they have a better idea of how to avoid the race in
the core code. Other architectures are affected too, namely those that
use PG_arch_1.

As the kernel stands currently, we'd take the lock on any set_pte_at()
for a tagged page when first mapped. With Peter's patches to use DC
GZVA, the tag zeroing is done on allocation. Until those are merged, we
could do something similar in the arch code but without the DC GZVA
optimisation (useful if we need to cc this fix to stable):

----------8<--------------------------
From 9f445f794454cf139c0953391e6c30fa3f075dc1 Mon Sep 17 00:00:00 2001
From: Catalin Marinas <catalin.marinas@arm.com>
Date: Thu, 13 May 2021 14:15:37 +0100
Subject: [PATCH] arm64: Handle MTE tags zeroing in
 __alloc_zeroed_user_highpage()

Currently, on an anonymous page fault, the kernel allocates a zeroed
page and maps it in user space. If the mapping is tagged (PROT_MTE),
set_pte_at() additionally clears the tags under a spinlock to avoid a
race on the page->flags. In order to optimise the lock, clear the page
tags on allocation in __alloc_zeroed_user_highpage() if the vma flags
have VM_MTE set.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm64/include/asm/page.h |  6 ++++--
 arch/arm64/mm/fault.c         | 21 +++++++++++++++++++++
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..97853570d0f1 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -13,6 +13,7 @@
 #ifndef __ASSEMBLY__
 
 #include <linux/personality.h> /* for READ_IMPLIES_EXEC */
+#include <linux/types.h>
 #include <asm/pgtable-types.h>
 
 struct page;
@@ -28,8 +29,9 @@ void copy_user_highpage(struct page *to, struct page *from,
 void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE
 
-#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
-	alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
+struct page *__alloc_zeroed_user_highpage(gfp_t movableflags,
+					  struct vm_area_struct *vma,
+					  unsigned long vaddr);
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 871c82ab0a30..5a03428e97f3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -921,3 +921,24 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned int esr,
 	debug_exception_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug_exception);
+
+/*
+ * Used during anonymous page fault handling.
+ */
+struct page *__alloc_zeroed_user_highpage(gfp_t movableflags,
+					  struct vm_area_struct *vma,
+					  unsigned long vaddr)
+{
+	struct page *page;
+	bool tagged = system_supports_mte() && (vma->vm_flags & VM_MTE);
+
+	page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma,
+			      vaddr);
+	if (tagged && page) {
+		mte_clear_page_tags(page_address(page));
+		page_kasan_tag_reset(page);
+		set_bit(PG_mte_tagged, &page->flags);
+	}
+
+	return page;
+}



* Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature
  2021-05-13 10:57                   ` Steven Price
  2021-05-13 15:08                     ` Catalin Marinas
@ 2021-05-13 15:21                     ` Catalin Marinas
  1 sibling, 0 replies; 26+ messages in thread
From: Catalin Marinas @ 2021-05-13 15:21 UTC (permalink / raw)
  To: Steven Price
  Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
	Suzuki K Poulose, kvmarm, linux-arm-kernel, linux-kernel,
	Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel,
	Juan Quintela, Dr. David Alan Gilbert, Richard Henderson,
	Peter Maydell, Haibo Xu, Andrew Jones

On Thu, May 13, 2021 at 11:57:39AM +0100, Steven Price wrote:
> On 12/05/2021 18:45, Catalin Marinas wrote:
> > On Wed, May 12, 2021 at 04:46:48PM +0100, Steven Price wrote:
> >> On 10/05/2021 19:35, Catalin Marinas wrote:
> >>>> On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> >>>>>> On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> >>>>>>> Given the changes to set_pte_at() which means that tags are restored from
> >>>>>>> swap even if !PROT_MTE, the only race I can see remaining is the creation of
> >>>>>>> new PROT_MTE mappings. As you mention an attempt to change mappings in the
> >>>>>>> VMM memory space should involve a mmu notifier call which I think serialises
> >>>>>>> this. So the remaining issue is doing this in a separate address space.
> >>>>>>>
> >>>>>>> So I guess the potential problem is:
> >>>>>>>
> >>>>>>>    * allocate memory MAP_SHARED but !PROT_MTE
> >>>>>>>    * fork()
> >>>>>>>    * VM causes a fault in parent address space
> >>>>>>>    * child does a mprotect(PROT_MTE)
> >>>>>>>
> >>>>>>> With the last two potentially racing. Sadly I can't see a good way of
> >>>>>>> handling that.
> > [...]
> >>> Options:
> >>>
> >>> 1. Change the mte_sync_tags() code path to set the flag after clearing
> >>>     and avoid reading stale tags. We document that mprotect() on
> >>>     MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
> >>>     arch code and return an error.
> >>
> >> This is the best option I've come up with so far - but it's not a good
> >> one! We can replace the set_bit() with a test_and_set_bit() to catch the
> >> race after it has occurred - but I'm not sure what we can do about it
> >> then (we've already wiped the data). Returning an error doesn't seem
> >> particularly useful at that point, a message in dmesg is about the best
> >> I can come up with.
> > 
> > What I meant about intercepting is on something like
> > arch_validate_flags() to prevent VM_SHARED and VM_MTE together but only
> > for mprotect(), not mmap(). However, arch_validate_flags() is currently
> > called on both mmap() and mprotect() paths.
> 
> I think even if we were to restrict mprotect() there would be corner
> cases around swapping in. For example, if a page mapped VM_SHARED|VM_MTE
> is faulted in simultaneously by both processes then we have the same
> situation:
> 
>  * with test_and_set_bit() one process could potentially see the tags
> before they have been restored - i.e. a data leak.
> 
>  * with a separate test and set, one process could write to the tags
> before the second restore has completed, causing a lost update.

I don't think this can happen. We have shmem_swapin_page(), which I
think would be called for any faulting pte that maps such a shared
page, and it takes the page lock around arch_swap_restore().

-- 
Catalin


end of thread

Thread overview: 26+ messages
2021-04-16 15:43 [PATCH v11 0/6] MTE support for KVM guest Steven Price
2021-04-16 15:43 ` [PATCH v11 1/6] arm64: mte: Sync tags for pages where PTE is untagged Steven Price
2021-04-27 17:43   ` Catalin Marinas
2021-04-29 16:06     ` Steven Price
2021-05-04 15:29       ` Catalin Marinas
2021-04-16 15:43 ` [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature Steven Price
2021-04-28 17:07   ` Catalin Marinas
2021-04-29 16:06     ` Steven Price
2021-05-04 17:40       ` Catalin Marinas
2021-05-06 16:15         ` Steven Price
2021-05-07 18:25           ` Catalin Marinas
2021-05-10 18:35             ` Catalin Marinas
2021-05-12 15:46               ` Steven Price
2021-05-12 17:45                 ` Catalin Marinas
2021-05-13 10:57                   ` Steven Price
2021-05-13 15:08                     ` Catalin Marinas
2021-05-13 15:21                     ` Catalin Marinas
2021-04-16 15:43 ` [PATCH v11 3/6] arm64: kvm: Save/restore MTE registers Steven Price
2021-04-16 15:43 ` [PATCH v11 4/6] arm64: kvm: Expose KVM_ARM_CAP_MTE Steven Price
2021-04-16 15:43 ` [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest Steven Price
2021-04-27 17:58   ` Catalin Marinas
2021-04-29 16:06     ` Steven Price
2021-05-04 17:44       ` Catalin Marinas
2021-05-07  9:44         ` Steven Price
2021-05-07  9:59           ` David Laight
2021-04-16 15:43 ` [PATCH v11 6/6] KVM: arm64: Document MTE capability and ioctl Steven Price
