linux-kernel.vger.kernel.org archive mirror
* [RFCv2 00/10] Linear Address Masking enabling
@ 2022-05-11  2:27 Kirill A. Shutemov
  2022-05-11  2:27 ` [PATCH] x86: Implement Linear Address Masking support Kirill A. Shutemov
                   ` (11 more replies)
  0 siblings, 12 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-11  2:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

Hi all. Here's a long overdue update on LAM enabling.

# Description #

Linear Address Masking[1] (LAM) modifies the checking that is applied to
64-bit linear addresses, allowing software to use the untranslated
address bits for metadata.

The patchset brings support for LAM for userspace addresses.

The most sensitive part of the enabling is the change in tlb.c, where the
CR3 flags get set. Please check that what I'm doing there makes sense.

The feature competes for bits with 5-level paging: LAM_U48 makes it
impossible to map anything above 47 bits. The patchset makes these
capabilities mutually exclusive: whichever is used first wins. LAM_U57 can
be combined with mappings above 47 bits.

[1] ISE, Chapter 14.
https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf
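
As an illustration (a minimal userspace sketch, not part of this series),
this is how a 6-bit tag could be packed into and stripped from a pointer
once LAM_U57 is active; the tag lives in bits 62:57 as described above:

	#include <stdint.h>

	#define LAM_U57_TAG_SHIFT	57
	#define LAM_U57_TAG_MASK	(0x3fULL << LAM_U57_TAG_SHIFT)

	/* Store a 6-bit tag in bits 62:57 of a user pointer. */
	static inline void *tag_pointer(void *ptr, uint64_t tag)
	{
		uint64_t addr = (uint64_t)ptr & ~LAM_U57_TAG_MASK;

		return (void *)(addr | ((tag & 0x3f) << LAM_U57_TAG_SHIFT));
	}

	/* Recover the canonical (untagged) pointer. */
	static inline void *untag_pointer(void *ptr)
	{
		return (void *)((uint64_t)ptr & ~LAM_U57_TAG_MASK);
	}

With LAM_U57 enabled, dereferencing the tagged and the untagged pointer
accesses the same memory: the hardware ignores bits 62:57.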

# What's new #

The main change is the interface rework. It is now arch_prctl(2)-based and
is supposed to be extendable to CET.
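
As a usage sketch (assuming the ARCH_THREAD_FEATURE_ENABLE and
X86_THREAD_LAM_U57 constants defined by the patches below), enabling LAM
for the current thread would look roughly like this:

	#include <errno.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define ARCH_THREAD_FEATURE_ENABLE	0x3002
	#define X86_THREAD_LAM_U57		0x2

	/* Returns 0 on success; the prctl reports the resulting feature mask. */
	static int enable_lam_u57(void)
	{
		long ret = syscall(SYS_arch_prctl, ARCH_THREAD_FEATURE_ENABLE,
				   X86_THREAD_LAM_U57);

		if (ret < 0)
			return -errno;
		return (ret & X86_THREAD_LAM_U57) ? 0 : -ENXIO;
	}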

The QEMU implementation is also updated. It can now be applied on top of
the current master branch. The QEMU patch as-is was rejected by upstream,
but it is functional and can be used for testing.

Please take a look. Any suggestions are welcome.

v2:
  - Rebased onto v5.18-rc1
  - New arch_prctl(2)-based API
  - Expose status of LAM (or other thread features) in
    /proc/$PID/arch_status.

Kirill A. Shutemov (10):
  x86/mm: Fix CR3_ADDR_MASK
  x86: CPUID and CR3/CR4 flags for Linear Address Masking
  x86: Introduce userspace API to handle per-thread features
  x86/mm: Introduce X86_THREAD_LAM_U48 and X86_THREAD_LAM_U57
  x86/mm: Provide untagged_addr() helper
  x86/uaccess: Remove tags from the address before checking
  x86/mm: Handle tagged memory accesses from kernel threads
  x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive
  x86/mm: Add userspace API to enable Linear Address Masking
  x86: Expose thread features status in /proc/$PID/arch_status

 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/elf.h                  |   3 +-
 arch/x86/include/asm/mmu.h                  |   1 +
 arch/x86/include/asm/mmu_context.h          |  13 +++
 arch/x86/include/asm/page_32.h              |   3 +
 arch/x86/include/asm/page_64.h              |  20 ++++
 arch/x86/include/asm/processor-flags.h      |   2 +-
 arch/x86/include/asm/processor.h            |   3 +
 arch/x86/include/asm/tlbflush.h             |   5 +
 arch/x86/include/asm/uaccess.h              |  15 ++-
 arch/x86/include/uapi/asm/prctl.h           |   8 ++
 arch/x86/include/uapi/asm/processor-flags.h |   6 +
 arch/x86/kernel/Makefile                    |   2 +
 arch/x86/kernel/fpu/xstate.c                |  47 --------
 arch/x86/kernel/proc.c                      |  63 ++++++++++
 arch/x86/kernel/process.c                   |  56 +++++++++
 arch/x86/kernel/process.h                   |   2 +
 arch/x86/kernel/process_64.c                |  46 ++++++++
 arch/x86/kernel/sys_x86_64.c                |   5 +-
 arch/x86/mm/hugetlbpage.c                   |   6 +-
 arch/x86/mm/mmap.c                          |   9 +-
 arch/x86/mm/tlb.c                           | 123 +++++++++++++++++---
 22 files changed, 367 insertions(+), 72 deletions(-)
 create mode 100644 arch/x86/kernel/proc.c

-- 
2.35.1



* [PATCH] x86: Implement Linear Address Masking support
  2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
@ 2022-05-11  2:27 ` Kirill A. Shutemov
  2022-05-12 13:01   ` David Laight
  2022-05-11  2:27 ` [RFCv2 01/10] x86/mm: Fix CR3_ADDR_MASK Kirill A. Shutemov
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-11  2:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

The Linear Address Masking feature makes the CPU ignore some bits of the
virtual address. These bits can be used to encode metadata.

The feature is enumerated with CPUID.(EAX=07H, ECX=01H):EAX.LAM[bit 26].

CR3.LAM_U57[bit 61] allows encoding 6 bits of metadata in bits 62:57 of
user pointers.

CR3.LAM_U48[bit 62] allows encoding 15 bits of metadata in bits 62:48
of user pointers.

CR4.LAM_SUP[bit 28] allows encoding metadata in supervisor pointers.
If 5-level paging is in use, 6 bits of metadata can be encoded in bits 62:57.
For 4-level paging, 15 bits of metadata can be encoded in bits 62:48.

QEMU strips the metadata bits from the address and brings it back to
canonical shape before handling the memory access. This has to be done very
early, before the TLB lookup.
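
As a concrete sketch of the canonicalization (matching the sign_extend64()
helper added below), a user pointer with CR3.LAM_U57 set is sign-extended
from bit 56; for example, with a 6-bit tag of 0x2a in bits 62:57:

	/*
	 *   tagged    = 0x54007ffff7a12340
	 *   canonical = sign_extend64(tagged, 56) = 0x00007ffff7a12340
	 *
	 * The canonical address is then used for the TLB lookup and the
	 * page walk.
	 */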

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 accel/tcg/cputlb.c                   | 20 +++++++++++++++++---
 include/hw/core/tcg-cpu-ops.h        |  5 +++++
 target/i386/cpu.c                    |  4 ++--
 target/i386/cpu.h                    | 26 +++++++++++++++++++++++++-
 target/i386/helper.c                 |  2 +-
 target/i386/tcg/helper-tcg.h         |  1 +
 target/i386/tcg/sysemu/excp_helper.c | 28 +++++++++++++++++++++++++++-
 target/i386/tcg/sysemu/misc_helper.c |  3 +--
 target/i386/tcg/sysemu/svm_helper.c  |  3 +--
 target/i386/tcg/tcg-cpu.c            |  1 +
 10 files changed, 81 insertions(+), 12 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 2035b2ac0ac0..15eff0df39c1 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1295,6 +1295,17 @@ static inline ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr)
     return ram_addr;
 }
 
+static vaddr clean_addr(CPUArchState *env, vaddr addr)
+{
+    CPUClass *cc = CPU_GET_CLASS(env_cpu(env));
+
+    if (cc->tcg_ops->do_clean_addr) {
+        addr = cc->tcg_ops->do_clean_addr(env_cpu(env), addr);
+    }
+
+    return addr;
+}
+
 /*
  * Note: tlb_fill() can trigger a resize of the TLB. This means that all of the
  * caller's prior references to the TLB table (e.g. CPUTLBEntry pointers) must
@@ -1757,10 +1768,11 @@ bool tlb_plugin_lookup(CPUState *cpu, target_ulong addr, int mmu_idx,
  *
  * @prot may be PAGE_READ, PAGE_WRITE, or PAGE_READ|PAGE_WRITE.
  */
-static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
+static void *atomic_mmu_lookup(CPUArchState *env, target_ulong address,
                                MemOpIdx oi, int size, int prot,
                                uintptr_t retaddr)
 {
+    target_ulong addr = clean_addr(env, address);
     size_t mmu_idx = get_mmuidx(oi);
     MemOp mop = get_memop(oi);
     int a_bits = get_alignment_bits(mop);
@@ -1904,10 +1916,11 @@ load_memop(const void *haddr, MemOp op)
 }
 
 static inline uint64_t QEMU_ALWAYS_INLINE
-load_helper(CPUArchState *env, target_ulong addr, MemOpIdx oi,
+load_helper(CPUArchState *env, target_ulong address, MemOpIdx oi,
             uintptr_t retaddr, MemOp op, bool code_read,
             FullLoadHelper *full_load)
 {
+    target_ulong addr = clean_addr(env, address);
     uintptr_t mmu_idx = get_mmuidx(oi);
     uintptr_t index = tlb_index(env, mmu_idx, addr);
     CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
@@ -2307,9 +2320,10 @@ store_helper_unaligned(CPUArchState *env, target_ulong addr, uint64_t val,
 }
 
 static inline void QEMU_ALWAYS_INLINE
-store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
+store_helper(CPUArchState *env, target_ulong address, uint64_t val,
              MemOpIdx oi, uintptr_t retaddr, MemOp op)
 {
+    target_ulong addr = clean_addr(env, address);
     uintptr_t mmu_idx = get_mmuidx(oi);
     uintptr_t index = tlb_index(env, mmu_idx, addr);
     CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
diff --git a/include/hw/core/tcg-cpu-ops.h b/include/hw/core/tcg-cpu-ops.h
index e13898553aff..8e81f45510bf 100644
--- a/include/hw/core/tcg-cpu-ops.h
+++ b/include/hw/core/tcg-cpu-ops.h
@@ -82,6 +82,11 @@ struct TCGCPUOps {
                                 MMUAccessType access_type,
                                 int mmu_idx, uintptr_t retaddr) QEMU_NORETURN;
 
+    /**
+     * @do_clean_addr: Callback for clearing metadata/tags from the address.
+     */
+    vaddr (*do_clean_addr)(CPUState *cpu, vaddr addr);
+
     /**
      * @adjust_watchpoint_address: hack for cpu_check_watchpoint used by ARM
      */
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index cb6b5467d067..6e3e8473bf04 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -662,7 +662,7 @@ void x86_cpu_vendor_words2str(char *dst, uint32_t vendor1,
           /* CPUID_7_0_ECX_OSPKE is dynamic */ \
           CPUID_7_0_ECX_LA57 | CPUID_7_0_ECX_PKS)
 #define TCG_7_0_EDX_FEATURES 0
-#define TCG_7_1_EAX_FEATURES 0
+#define TCG_7_1_EAX_FEATURES CPUID_7_1_EAX_LAM
 #define TCG_APM_FEATURES 0
 #define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
 #define TCG_XSAVE_FEATURES (CPUID_XSAVE_XSAVEOPT | CPUID_XSAVE_XGETBV1)
@@ -876,7 +876,7 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
             NULL, NULL, NULL, NULL,
             NULL, NULL, NULL, NULL,
             NULL, NULL, NULL, NULL,
-            NULL, NULL, NULL, NULL,
+            NULL, NULL, "lam", NULL,
             NULL, NULL, NULL, NULL,
         },
         .cpuid = {
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 982c5323537c..5d6cc8efb7da 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -232,6 +232,9 @@ typedef enum X86Seg {
 #define CR0_CD_MASK  (1U << 30)
 #define CR0_PG_MASK  (1U << 31)
 
+#define CR3_LAM_U57  (1ULL << 61)
+#define CR3_LAM_U48  (1ULL << 62)
+
 #define CR4_VME_MASK  (1U << 0)
 #define CR4_PVI_MASK  (1U << 1)
 #define CR4_TSD_MASK  (1U << 2)
@@ -255,6 +258,7 @@ typedef enum X86Seg {
 #define CR4_SMAP_MASK   (1U << 21)
 #define CR4_PKE_MASK   (1U << 22)
 #define CR4_PKS_MASK   (1U << 24)
+#define CR4_LAM_SUP    (1U << 28)
 
 #define CR4_RESERVED_MASK \
 (~(target_ulong)(CR4_VME_MASK | CR4_PVI_MASK | CR4_TSD_MASK \
@@ -263,7 +267,8 @@ typedef enum X86Seg {
                 | CR4_OSFXSR_MASK | CR4_OSXMMEXCPT_MASK | CR4_UMIP_MASK \
                 | CR4_LA57_MASK \
                 | CR4_FSGSBASE_MASK | CR4_PCIDE_MASK | CR4_OSXSAVE_MASK \
-                | CR4_SMEP_MASK | CR4_SMAP_MASK | CR4_PKE_MASK | CR4_PKS_MASK))
+                | CR4_SMEP_MASK | CR4_SMAP_MASK | CR4_PKE_MASK | CR4_PKS_MASK \
+                | CR4_LAM_SUP))
 
 #define DR6_BD          (1 << 13)
 #define DR6_BS          (1 << 14)
@@ -877,6 +882,8 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 #define CPUID_7_1_EAX_AVX_VNNI          (1U << 4)
 /* AVX512 BFloat16 Instruction */
 #define CPUID_7_1_EAX_AVX512_BF16       (1U << 5)
+/* Linear Address Masking */
+#define CPUID_7_1_EAX_LAM               (1U << 26)
 /* XFD Extend Feature Disabled */
 #define CPUID_D_1_EAX_XFD               (1U << 4)
 
@@ -2287,6 +2294,23 @@ static inline bool hyperv_feat_enabled(X86CPU *cpu, int feat)
     return !!(cpu->hyperv_features & BIT(feat));
 }
 
+static inline uint64_t cr3_reserved_bits(CPUX86State *env)
+{
+    uint64_t reserved_bits;
+
+    if (!(env->efer & MSR_EFER_LMA)) {
+        return 0;
+    }
+
+    reserved_bits = (~0ULL) << env_archcpu(env)->phys_bits;
+
+    if (env->features[FEAT_7_1_EAX] & CPUID_7_1_EAX_LAM) {
+        reserved_bits &= ~(CR3_LAM_U48 | CR3_LAM_U57);
+    }
+
+    return reserved_bits;
+}
+
 static inline uint64_t cr4_reserved_bits(CPUX86State *env)
 {
     uint64_t reserved_bits = CR4_RESERVED_MASK;
diff --git a/target/i386/helper.c b/target/i386/helper.c
index fa409e9c44a8..f91ebab840d6 100644
--- a/target/i386/helper.c
+++ b/target/i386/helper.c
@@ -247,7 +247,7 @@ hwaddr x86_cpu_get_phys_page_attrs_debug(CPUState *cs, vaddr addr,
             }
 
             if (la57) {
-                pml5e_addr = ((env->cr[3] & ~0xfff) +
+                pml5e_addr = ((env->cr[3] & PG_ADDRESS_MASK) +
                         (((addr >> 48) & 0x1ff) << 3)) & a20_mask;
                 pml5e = x86_ldq_phys(cs, pml5e_addr);
                 if (!(pml5e & PG_PRESENT_MASK)) {
diff --git a/target/i386/tcg/helper-tcg.h b/target/i386/tcg/helper-tcg.h
index 0a4401e917f9..03ab858598d2 100644
--- a/target/i386/tcg/helper-tcg.h
+++ b/target/i386/tcg/helper-tcg.h
@@ -51,6 +51,7 @@ void x86_cpu_record_sigsegv(CPUState *cs, vaddr addr,
 bool x86_cpu_tlb_fill(CPUState *cs, vaddr address, int size,
                       MMUAccessType access_type, int mmu_idx,
                       bool probe, uintptr_t retaddr);
+vaddr x86_cpu_clean_addr(CPUState *cpu, vaddr addr);
 #endif
 
 void breakpoint_handler(CPUState *cs);
diff --git a/target/i386/tcg/sysemu/excp_helper.c b/target/i386/tcg/sysemu/excp_helper.c
index e1b6d8868338..caaab413381b 100644
--- a/target/i386/tcg/sysemu/excp_helper.c
+++ b/target/i386/tcg/sysemu/excp_helper.c
@@ -64,7 +64,7 @@ static int mmu_translate(CPUState *cs, hwaddr addr, MMUTranslateFunc get_hphys_f
             uint64_t pml4e_addr, pml4e;
 
             if (la57) {
-                pml5e_addr = ((cr3 & ~0xfff) +
+                pml5e_addr = ((cr3 & PG_ADDRESS_MASK) +
                         (((addr >> 48) & 0x1ff) << 3)) & a20_mask;
                 pml5e_addr = GET_HPHYS(cs, pml5e_addr, MMU_DATA_STORE, NULL);
                 pml5e = x86_ldq_phys(cs, pml5e_addr);
@@ -437,3 +437,29 @@ bool x86_cpu_tlb_fill(CPUState *cs, vaddr addr, int size,
     }
     return true;
 }
+
+static inline int64_t sign_extend64(uint64_t value, int index)
+{
+    int shift = 63 - index;
+    return (int64_t)(value << shift) >> shift;
+}
+
+vaddr x86_cpu_clean_addr(CPUState *cs, vaddr addr)
+{
+    CPUX86State *env = &X86_CPU(cs)->env;
+    bool la57 = env->cr[4] & CR4_LA57_MASK;
+
+    if (addr >> 63) {
+        if (env->cr[4] & CR4_LAM_SUP) {
+            return sign_extend64(addr, la57 ? 56 : 47);
+        }
+    } else {
+        if (env->cr[3] & CR3_LAM_U57) {
+            return sign_extend64(addr, 56);
+        } else if (env->cr[3] & CR3_LAM_U48) {
+            return sign_extend64(addr, 47);
+        }
+    }
+
+    return addr;
+}
diff --git a/target/i386/tcg/sysemu/misc_helper.c b/target/i386/tcg/sysemu/misc_helper.c
index 3715c1e2625b..faeb4a16383c 100644
--- a/target/i386/tcg/sysemu/misc_helper.c
+++ b/target/i386/tcg/sysemu/misc_helper.c
@@ -97,8 +97,7 @@ void helper_write_crN(CPUX86State *env, int reg, target_ulong t0)
         cpu_x86_update_cr0(env, t0);
         break;
     case 3:
-        if ((env->efer & MSR_EFER_LMA) &&
-                (t0 & ((~0ULL) << env_archcpu(env)->phys_bits))) {
+        if (t0 & cr3_reserved_bits(env)) {
             cpu_vmexit(env, SVM_EXIT_ERR, 0, GETPC());
         }
         if (!(env->efer & MSR_EFER_LMA)) {
diff --git a/target/i386/tcg/sysemu/svm_helper.c b/target/i386/tcg/sysemu/svm_helper.c
index 2b6f450af959..cbd99f240bb8 100644
--- a/target/i386/tcg/sysemu/svm_helper.c
+++ b/target/i386/tcg/sysemu/svm_helper.c
@@ -287,8 +287,7 @@ void helper_vmrun(CPUX86State *env, int aflag, int next_eip_addend)
         cpu_vmexit(env, SVM_EXIT_ERR, 0, GETPC());
     }
     new_cr3 = x86_ldq_phys(cs, env->vm_vmcb + offsetof(struct vmcb, save.cr3));
-    if ((env->efer & MSR_EFER_LMA) &&
-            (new_cr3 & ((~0ULL) << cpu->phys_bits))) {
+    if (new_cr3 & cr3_reserved_bits(env)) {
         cpu_vmexit(env, SVM_EXIT_ERR, 0, GETPC());
     }
     new_cr4 = x86_ldq_phys(cs, env->vm_vmcb + offsetof(struct vmcb, save.cr4));
diff --git a/target/i386/tcg/tcg-cpu.c b/target/i386/tcg/tcg-cpu.c
index 6fdfdf959899..754454d19041 100644
--- a/target/i386/tcg/tcg-cpu.c
+++ b/target/i386/tcg/tcg-cpu.c
@@ -77,6 +77,7 @@ static const struct TCGCPUOps x86_tcg_ops = {
     .record_sigsegv = x86_cpu_record_sigsegv,
 #else
     .tlb_fill = x86_cpu_tlb_fill,
+    .do_clean_addr = x86_cpu_clean_addr,
     .do_interrupt = x86_cpu_do_interrupt,
     .cpu_exec_interrupt = x86_cpu_exec_interrupt,
     .debug_excp_handler = breakpoint_handler,
-- 
2.35.1



* [RFCv2 01/10] x86/mm: Fix CR3_ADDR_MASK
  2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
  2022-05-11  2:27 ` [PATCH] x86: Implement Linear Address Masking support Kirill A. Shutemov
@ 2022-05-11  2:27 ` Kirill A. Shutemov
  2022-05-11  2:27 ` [RFCv2 02/10] x86: CPUID and CR3/CR4 flags for Linear Address Masking Kirill A. Shutemov
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-11  2:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

The mask must not include bits above the physical address mask. These bits
are reserved and can be used for other things. Bits 61 and 62 are used
for Linear Address Masking.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/processor-flags.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 02c2cbda4a74..a7f3d9100adb 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -35,7 +35,7 @@
  */
 #ifdef CONFIG_X86_64
 /* Mask off the address space ID and SME encryption bits. */
-#define CR3_ADDR_MASK	__sme_clr(0x7FFFFFFFFFFFF000ull)
+#define CR3_ADDR_MASK	__sme_clr(PHYSICAL_PAGE_MASK)
 #define CR3_PCID_MASK	0xFFFull
 #define CR3_NOFLUSH	BIT_ULL(63)
 
-- 
2.35.1



* [RFCv2 02/10] x86: CPUID and CR3/CR4 flags for Linear Address Masking
  2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
  2022-05-11  2:27 ` [PATCH] x86: Implement Linear Address Masking support Kirill A. Shutemov
  2022-05-11  2:27 ` [RFCv2 01/10] x86/mm: Fix CR3_ADDR_MASK Kirill A. Shutemov
@ 2022-05-11  2:27 ` Kirill A. Shutemov
  2022-05-11  2:27 ` [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features Kirill A. Shutemov
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-11  2:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

Enumerate Linear Address Masking and provide defines for CR3 and CR4
flags.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h          | 1 +
 arch/x86/include/uapi/asm/processor-flags.h | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 73e643ae94b6..d443d1ba231a 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -299,6 +299,7 @@
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
 #define X86_FEATURE_AVX512_BF16		(12*32+ 5) /* AVX512 BFLOAT16 instructions */
+#define X86_FEATURE_LAM			(12*32+26) /* Linear Address Masking */
 
 /* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 13 */
 #define X86_FEATURE_CLZERO		(13*32+ 0) /* CLZERO instruction */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index c47cc7f2feeb..d898432947ff 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -82,6 +82,10 @@
 #define X86_CR3_PCID_BITS	12
 #define X86_CR3_PCID_MASK	(_AC((1UL << X86_CR3_PCID_BITS) - 1, UL))
 
+#define X86_CR3_LAM_U57_BIT	61 /* Activate LAM for userspace, 62:57 bits masked */
+#define X86_CR3_LAM_U57		_BITULL(X86_CR3_LAM_U57_BIT)
+#define X86_CR3_LAM_U48_BIT	62 /* Activate LAM for userspace, 62:48 bits masked */
+#define X86_CR3_LAM_U48		_BITULL(X86_CR3_LAM_U48_BIT)
 #define X86_CR3_PCID_NOFLUSH_BIT 63 /* Preserve old PCID */
 #define X86_CR3_PCID_NOFLUSH    _BITULL(X86_CR3_PCID_NOFLUSH_BIT)
 
@@ -132,6 +136,8 @@
 #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
 #define X86_CR4_CET_BIT		23 /* enable Control-flow Enforcement Technology */
 #define X86_CR4_CET		_BITUL(X86_CR4_CET_BIT)
+#define X86_CR4_LAM_SUP_BIT	28 /* LAM for supervisor pointers */
+#define X86_CR4_LAM_SUP		_BITUL(X86_CR4_LAM_SUP_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
-- 
2.35.1



* [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features
  2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2022-05-11  2:27 ` [RFCv2 02/10] x86: CPUID and CR3/CR4 flags for Linear Address Masking Kirill A. Shutemov
@ 2022-05-11  2:27 ` Kirill A. Shutemov
  2022-05-12 12:02   ` Thomas Gleixner
  2022-05-13 14:09   ` [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features Alexander Potapenko
  2022-05-11  2:27 ` [RFCv2 04/10] x86/mm: Introduce X86_THREAD_LAM_U48 and X86_THREAD_LAM_U57 Kirill A. Shutemov
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-11  2:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

Add three new arch_prctl() handles:

 - ARCH_THREAD_FEATURE_ENABLE/DISABLE enables or disables the specified
   features. Returns what features are enabled after the operation.

 - ARCH_THREAD_FEATURE_LOCK prevents future disabling or enabling of the
   specified features. Returns the new set of locked features.

The features are handled per-thread and inherited over fork(2)/clone(2),
but reset on exec().

This is a preparation patch. It does not implement any features.
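
A rough usage sketch of the locking semantics (the feature bits themselves
come from later patches; X86_THREAD_LAM_U57 is used here only as an
example):

	/* Enable a feature, then lock it for the rest of the thread's life. */
	syscall(SYS_arch_prctl, ARCH_THREAD_FEATURE_ENABLE, X86_THREAD_LAM_U57);
	syscall(SYS_arch_prctl, ARCH_THREAD_FEATURE_LOCK, X86_THREAD_LAM_U57);

	/* Further attempts to disable the locked feature fail with -EPERM. */
	syscall(SYS_arch_prctl, ARCH_THREAD_FEATURE_DISABLE, X86_THREAD_LAM_U57);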

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/processor.h  |  3 +++
 arch/x86/include/uapi/asm/prctl.h |  5 +++++
 arch/x86/kernel/process.c         | 37 +++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 91d0f93a00c7..ff0c34e18cc6 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -530,6 +530,9 @@ struct thread_struct {
 	 */
 	u32			pkru;
 
+	unsigned long		features;
+	unsigned long		features_locked;
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 500b96e71f18..67fc30d36c73 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,4 +20,9 @@
 #define ARCH_MAP_VDSO_32		0x2002
 #define ARCH_MAP_VDSO_64		0x2003
 
+/* Never implement 0x3001, it will confuse old glibc's */
+#define ARCH_THREAD_FEATURE_ENABLE	0x3002
+#define ARCH_THREAD_FEATURE_DISABLE	0x3003
+#define ARCH_THREAD_FEATURE_LOCK	0x3004
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b370767f5b19..cb8fc28f2eae 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -367,6 +367,10 @@ void arch_setup_new_exec(void)
 		task_clear_spec_ssb_noexec(current);
 		speculation_ctrl_update(read_thread_flags());
 	}
+
+	/* Reset thread features on exec */
+	current->thread.features = 0;
+	current->thread.features_locked = 0;
 }
 
 #ifdef CONFIG_X86_IOPL_IOPERM
@@ -985,6 +989,35 @@ unsigned long __get_wchan(struct task_struct *p)
 	return addr;
 }
 
+static long thread_feature_prctl(struct task_struct *task, int option,
+				 unsigned long features)
+{
+	const unsigned long known_features = 0;
+
+	if (features & ~known_features)
+		return -EINVAL;
+
+	if (option == ARCH_THREAD_FEATURE_LOCK) {
+		task->thread.features_locked |= features;
+		return task->thread.features_locked;
+	}
+
+	/* Do not allow to change locked features */
+	if (features & task->thread.features_locked)
+		return -EPERM;
+
+	if (option == ARCH_THREAD_FEATURE_DISABLE) {
+		task->thread.features &= ~features;
+		goto out;
+	}
+
+	/* Handle ARCH_THREAD_FEATURE_ENABLE */
+
+	task->thread.features |= features;
+out:
+	return task->thread.features;
+}
+
 long do_arch_prctl_common(struct task_struct *task, int option,
 			  unsigned long arg2)
 {
@@ -999,6 +1032,10 @@ long do_arch_prctl_common(struct task_struct *task, int option,
 	case ARCH_GET_XCOMP_GUEST_PERM:
 	case ARCH_REQ_XCOMP_GUEST_PERM:
 		return fpu_xstate_prctl(task, option, arg2);
+	case ARCH_THREAD_FEATURE_ENABLE:
+	case ARCH_THREAD_FEATURE_DISABLE:
+	case ARCH_THREAD_FEATURE_LOCK:
+		return thread_feature_prctl(task, option, arg2);
 	}
 
 	return -EINVAL;
-- 
2.35.1



* [RFCv2 04/10] x86/mm: Introduce X86_THREAD_LAM_U48 and X86_THREAD_LAM_U57
  2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2022-05-11  2:27 ` [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features Kirill A. Shutemov
@ 2022-05-11  2:27 ` Kirill A. Shutemov
  2022-05-11  7:02   ` Peter Zijlstra
  2022-05-11  2:27 ` [RFCv2 05/10] x86/mm: Provide untagged_addr() helper Kirill A. Shutemov
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-11  2:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

The Linear Address Masking mode for userspace pointers is encoded in CR3
bits. The mode is selected per-thread. Add new thread features that
indicate that the thread has Linear Address Masking enabled.

switch_mm_irqs_off() now respects these flags and constructs CR3
accordingly.

The active LAM mode gets recorded in the tlb_state.

The thread features are not yet exposed via the userspace API.
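
Roughly, this is the CR3 value that build_cr3() in the diff below now
produces (a sketch, assuming PCID is available; the LAM bit positions come
from the earlier processor-flags patch):

	/*
	 *   bit 63     : CR3_NOFLUSH (build_cr3_noflush() only)
	 *   bit 62     : X86_CR3_LAM_U48
	 *   bit 61     : X86_CR3_LAM_U57
	 *   bits 51:12 : physical address of the PGD
	 *   bits 11:0  : PCID
	 */
	cr3 = __sme_pa(pgd) | kern_pcid(asid) | lam_to_cr3(lam);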

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/tlbflush.h   |  5 ++
 arch/x86/include/uapi/asm/prctl.h |  3 +
 arch/x86/mm/tlb.c                 | 95 ++++++++++++++++++++++++++-----
 3 files changed, 88 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 98fa0a114074..77cae8623858 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -17,6 +17,10 @@ void __flush_tlb_all(void);
 
 #define TLB_FLUSH_ALL	-1UL
 
+#define LAM_NONE	0
+#define LAM_U57		1
+#define LAM_U48		2
+
 void cr4_update_irqsoff(unsigned long set, unsigned long clear);
 unsigned long cr4_read_shadow(void);
 
@@ -88,6 +92,7 @@ struct tlb_state {
 
 	u16 loaded_mm_asid;
 	u16 next_asid;
+	u8 lam;
 
 	/*
 	 * If set we changed the page tables in such a way that we
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 67fc30d36c73..2dd16472d078 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -25,4 +25,7 @@
 #define ARCH_THREAD_FEATURE_DISABLE	0x3003
 #define ARCH_THREAD_FEATURE_LOCK	0x3004
 
+#define X86_THREAD_LAM_U48		0x1
+#define X86_THREAD_LAM_U57		0x2
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6eb4d91d5365..f9fe71d1f42c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -18,6 +18,7 @@
 #include <asm/cacheflush.h>
 #include <asm/apic.h>
 #include <asm/perf_event.h>
+#include <uapi/asm/prctl.h>
 
 #include "mm_internal.h"
 
@@ -154,17 +155,72 @@ static inline u16 user_pcid(u16 asid)
 	return ret;
 }
 
-static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+#ifdef CONFIG_X86_64
+static inline unsigned long lam_to_cr3(u8 lam)
+{
+	switch (lam) {
+	case LAM_NONE:
+		return 0;
+	case LAM_U57:
+		return X86_CR3_LAM_U57;
+	case LAM_U48:
+		return X86_CR3_LAM_U48;
+	default:
+		WARN_ON_ONCE(1);
+		return 0;
+	}
+}
+
+static inline u8 cr3_to_lam(unsigned long cr3)
+{
+	if (cr3 & X86_CR3_LAM_U57)
+		return LAM_U57;
+	if (cr3 & X86_CR3_LAM_U48)
+		return LAM_U48;
+	return 0;
+}
+
+static u8 gen_lam(struct task_struct *tsk, struct mm_struct *mm)
+{
+	if (!tsk)
+		return LAM_NONE;
+
+	if (tsk->thread.features & X86_THREAD_LAM_U57)
+		return LAM_U57;
+	if (tsk->thread.features & X86_THREAD_LAM_U48)
+		return LAM_U48;
+	return LAM_NONE;
+}
+
+#else
+
+static inline unsigned long lam_to_cr3(u8 lam)
+{
+	return 0;
+}
+
+static inline u8 cr3_to_lam(unsigned long cr3)
+{
+	return LAM_NONE;
+}
+
+static u8 gen_lam(struct task_struct *tsk, struct mm_struct *mm)
+{
+	return LAM_NONE;
+}
+#endif
+
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid, u8 lam)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		return __sme_pa(pgd) | kern_pcid(asid);
+		return __sme_pa(pgd) | kern_pcid(asid) | lam_to_cr3(lam);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
-		return __sme_pa(pgd);
+		return __sme_pa(pgd) | lam_to_cr3(lam);
 	}
 }
 
-static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid, u8 lam)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
 	/*
@@ -173,7 +229,7 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 	 * boot because all CPU's the have same capabilities:
 	 */
 	VM_WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_PCID));
-	return __sme_pa(pgd) | kern_pcid(asid) | CR3_NOFLUSH;
+	return __sme_pa(pgd) | kern_pcid(asid) | CR3_NOFLUSH | lam_to_cr3(lam);
 }
 
 /*
@@ -274,15 +330,15 @@ static inline void invalidate_user_asid(u16 asid)
 		  (unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
 }
 
-static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, u8 lam, bool need_flush)
 {
 	unsigned long new_mm_cr3;
 
 	if (need_flush) {
 		invalidate_user_asid(new_asid);
-		new_mm_cr3 = build_cr3(pgdir, new_asid);
+		new_mm_cr3 = build_cr3(pgdir, new_asid, lam);
 	} else {
-		new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+		new_mm_cr3 = build_cr3_noflush(pgdir, new_asid, lam);
 	}
 
 	/*
@@ -491,6 +547,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 {
 	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	u8 prev_lam = this_cpu_read(cpu_tlbstate.lam);
+	u8 new_lam = gen_lam(tsk, next);
 	bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
 	unsigned cpu = smp_processor_id();
 	u64 next_tlb_gen;
@@ -504,6 +562,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	 * cpu_tlbstate.loaded_mm) matches next.
 	 *
 	 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
+	 *
+	 * NB: Initial LAM enabling calls us with prev == next. We must update
+	 * CR3 if prev_lam doesn't match the new one.
 	 */
 
 	/* We don't want flush_tlb_func() to run concurrently with us. */
@@ -520,7 +581,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	 * isn't free.
 	 */
 #ifdef CONFIG_DEBUG_VM
-	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
+	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid, prev_lam))) {
 		/*
 		 * If we were to BUG here, we'd be very likely to kill
 		 * the system so hard that we don't see the call trace.
@@ -551,7 +612,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	 * provides that full memory barrier and core serializing
 	 * instruction.
 	 */
-	if (real_prev == next) {
+	if (real_prev == next && prev_lam == new_lam) {
 		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			   next->context.ctx_id);
 
@@ -622,15 +683,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		barrier();
 	}
 
+	this_cpu_write(cpu_tlbstate.lam, new_lam);
 	if (need_flush) {
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-		load_new_mm_cr3(next->pgd, new_asid, true);
+		load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
 
 		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 	} else {
 		/* The new ASID is already up to date. */
-		load_new_mm_cr3(next->pgd, new_asid, false);
+		load_new_mm_cr3(next->pgd, new_asid, new_lam, false);
 
 		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
 	}
@@ -687,6 +749,7 @@ void initialize_tlbstate_and_flush(void)
 	struct mm_struct *mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u64 tlb_gen = atomic64_read(&init_mm.context.tlb_gen);
 	unsigned long cr3 = __read_cr3();
+	u8 lam = cr3_to_lam(cr3);
 
 	/* Assert that CR3 already references the right mm. */
 	WARN_ON((cr3 & CR3_ADDR_MASK) != __pa(mm->pgd));
@@ -700,7 +763,7 @@ void initialize_tlbstate_and_flush(void)
 		!(cr4_read_shadow() & X86_CR4_PCIDE));
 
 	/* Force ASID 0 and force a TLB flush. */
-	write_cr3(build_cr3(mm->pgd, 0));
+	write_cr3(build_cr3(mm->pgd, 0, lam));
 
 	/* Reinitialize tlbstate. */
 	this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_INIT);
@@ -1074,8 +1137,10 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
  */
 unsigned long __get_current_cr3_fast(void)
 {
-	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
-		this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+	unsigned long cr3 =
+		build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
+		this_cpu_read(cpu_tlbstate.loaded_mm_asid),
+		this_cpu_read(cpu_tlbstate.lam));
 
 	/* For now, be very restrictive about when this can be called. */
 	VM_WARN_ON(in_nmi() || preemptible());
-- 
2.35.1



* [RFCv2 05/10] x86/mm: Provide untagged_addr() helper
  2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2022-05-11  2:27 ` [RFCv2 04/10] x86/mm: Introduce X86_THREAD_LAM_U48 and X86_THREAD_LAM_U57 Kirill A. Shutemov
@ 2022-05-11  2:27 ` Kirill A. Shutemov
  2022-05-11  7:21   ` Peter Zijlstra
  2022-05-12 13:06   ` Thomas Gleixner
  2022-05-11  2:27 ` [RFCv2 06/10] x86/uaccess: Remove tags from the address before checking Kirill A. Shutemov
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-11  2:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

The helper is used by the core-mm to strip tag bits and get the address to
its canonical shape. It only handles userspace addresses.

For LAM, the address gets sanitized according to the thread features.
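
For example, a sketch of the intended behaviour for a thread with
X86_THREAD_LAM_U57 enabled (bit 56 of the pointer clear):

	/* Tagged user pointer: 6-bit tag 0x2a in bits 62:57. */
	void __user *tagged = (void __user *)0x54007ffff7a12340UL;

	/* untagged_addr() strips the tag and returns the canonical address. */
	void __user *clean = untagged_addr(tagged);	/* 0x00007ffff7a12340 */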

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/page_32.h |  3 +++
 arch/x86/include/asm/page_64.h | 20 ++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index df42f8aa99e4..2d35059b90c1 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -15,6 +15,9 @@ extern unsigned long __phys_addr(unsigned long);
 #define __phys_addr_symbol(x)	__phys_addr(x)
 #define __phys_reloc_hide(x)	RELOC_HIDE((x), 0)
 
+#define untagged_addr(addr)	(addr)
+#define untagged_ptr(ptr)	(ptr)
+
 #ifdef CONFIG_FLATMEM
 #define pfn_valid(pfn)		((pfn) < max_mapnr)
 #endif /* CONFIG_FLATMEM */
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index e9c86299b835..3a40c958b24a 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -7,6 +7,7 @@
 #ifndef __ASSEMBLY__
 #include <asm/cpufeatures.h>
 #include <asm/alternative.h>
+#include <uapi/asm/prctl.h>
 
 /* duplicated to the one in bootmem.h */
 extern unsigned long max_pfn;
@@ -90,6 +91,25 @@ static __always_inline unsigned long task_size_max(void)
 }
 #endif	/* CONFIG_X86_5LEVEL */
 
+#define __untagged_addr(addr, n)	\
+	((__force __typeof__(addr))sign_extend64((__force u64)(addr), n))
+
+#define untagged_addr(addr)	({					\
+	u64 __addr = (__force u64)(addr);				\
+	if (__addr >> 63 == 0) {					\
+		if (current->thread.features & X86_THREAD_LAM_U57)	\
+			__addr &= __untagged_addr(__addr, 56);		\
+		else if (current->thread.features & X86_THREAD_LAM_U48)	\
+			__addr &= __untagged_addr(__addr, 47);		\
+	}								\
+	(__force __typeof__(addr))__addr;				\
+})
+
+#define untagged_ptr(ptr)	({					\
+	u64 __ptrval = (__force u64)(ptr);				\
+	__ptrval = untagged_addr(__ptrval);				\
+	(__force __typeof__(*(ptr)) *)__ptrval;				\
+})
 #endif	/* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
-- 
2.35.1



* [RFCv2 06/10] x86/uaccess: Remove tags from the address before checking
  2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2022-05-11  2:27 ` [RFCv2 05/10] x86/mm: Provide untagged_addr() helper Kirill A. Shutemov
@ 2022-05-11  2:27 ` Kirill A. Shutemov
  2022-05-12 13:02   ` David Laight
  2022-05-11  2:27 ` [RFCv2 07/10] x86/mm: Handle tagged memory accesses from kernel threads Kirill A. Shutemov
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-11  2:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

The tags must not be included in the check of whether it is okay to access
the userspace address.

Strip tags in access_ok().

get_user() and put_user() don't use access_ok(), but check access
against TASK_SIZE directly in assembly. Strip tags before calling into
the assembly helper.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/uaccess.h | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index f78e2b3501a1..0f5bf7db4ec9 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -40,7 +40,7 @@ static inline bool pagefault_disabled(void);
 #define access_ok(addr, size)					\
 ({									\
 	WARN_ON_IN_IRQ();						\
-	likely(__access_ok(addr, size));				\
+	likely(__access_ok(untagged_addr(addr), size));			\
 })
 
 #include <asm-generic/access_ok.h>
@@ -125,7 +125,12 @@ extern int __get_user_bad(void);
  * Return: zero on success, or -EFAULT on error.
  * On error, the variable @x is set to zero.
  */
-#define get_user(x,ptr) ({ might_fault(); do_get_user_call(get_user,x,ptr); })
+#define get_user(x,ptr)							\
+({									\
+	__typeof__(*(ptr)) __user *__ptr_clean = untagged_ptr(ptr);	\
+	might_fault();							\
+	do_get_user_call(get_user,x,__ptr_clean);			\
+})
 
 /**
  * __get_user - Get a simple variable from user space, with less checking.
@@ -222,7 +227,11 @@ extern void __put_user_nocheck_8(void);
  *
  * Return: zero on success, or -EFAULT on error.
  */
-#define put_user(x, ptr) ({ might_fault(); do_put_user_call(put_user,x,ptr); })
+#define put_user(x, ptr) ({						\
+	__typeof__(*(ptr)) __user *__ptr_clean = untagged_ptr(ptr);	\
+	might_fault();							\
+	do_put_user_call(put_user,x,__ptr_clean);			\
+})
 
 /**
  * __put_user - Write a simple value into user space, with less checking.
-- 
2.35.1



* [RFCv2 07/10] x86/mm: Handle tagged memory accesses from kernel threads
  2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2022-05-11  2:27 ` [RFCv2 06/10] x86/uaccess: Remove tags from the address before checking Kirill A. Shutemov
@ 2022-05-11  2:27 ` Kirill A. Shutemov
  2022-05-11  7:23   ` Peter Zijlstra
  2022-05-12 13:30   ` Thomas Gleixner
  2022-05-11  2:27 ` [RFCv2 08/10] x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive Kirill A. Shutemov
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-11  2:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

When a kernel thread performs memory access on behalf of a process (like
in async I/O, io_uring, etc.) it has to respect the tagging setup of the
process, as user addresses can include tags.

Normally, LAM setup is per-thread and recorded in thread features, but
for this use case the kernel also tracks LAM setup per-mm. mm->context.lam
records the LAM mode that allows the most tag bits among the threads of
the mm.

This info is used by switch_mm_irqs_off() to construct CR3 if the task is
a kernel thread. The thread features of the kernel thread get updated
according to mm->context.lam. This allows untagged_addr() to work
correctly.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/mmu.h |  1 +
 arch/x86/mm/tlb.c          | 28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5d7494631ea9..52f3749f14e8 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -40,6 +40,7 @@ typedef struct {
 
 #ifdef CONFIG_X86_64
 	unsigned short flags;
+	u8 lam;
 #endif
 
 	struct mutex lock;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f9fe71d1f42c..b320556e1c22 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -185,6 +185,34 @@ static u8 gen_lam(struct task_struct *tsk, struct mm_struct *mm)
 	if (!tsk)
 		return LAM_NONE;
 
+	if (tsk->flags & PF_KTHREAD) {
+		/*
+		 * For kernel thread use the most permissive LAM
+		 * used by the mm. It's required to handle kernel thread
+		 * memory accesses on behalf of a process.
+		 *
+		 * Adjust thread flags accordingly, so untagged_addr() would
+		 * work correctly.
+		 */
+
+		tsk->thread.features &= ~(X86_THREAD_LAM_U48 |
+					  X86_THREAD_LAM_U57);
+
+		switch (mm->context.lam) {
+		case LAM_NONE:
+			return LAM_NONE;
+		case LAM_U57:
+			tsk->thread.features |= X86_THREAD_LAM_U57;
+			return LAM_U57;
+		case LAM_U48:
+			tsk->thread.features |= X86_THREAD_LAM_U48;
+			return LAM_U48;
+		default:
+			WARN_ON_ONCE(1);
+			return LAM_NONE;
+		}
+	}
+
 	if (tsk->thread.features & X86_THREAD_LAM_U57)
 		return LAM_U57;
 	if (tsk->thread.features & X86_THREAD_LAM_U48)
-- 
2.35.1



* [RFCv2 08/10] x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive
  2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2022-05-11  2:27 ` [RFCv2 07/10] x86/mm: Handle tagged memory accesses from kernel threads Kirill A. Shutemov
@ 2022-05-11  2:27 ` Kirill A. Shutemov
  2022-05-12 13:36   ` Thomas Gleixner
  2022-05-18  8:43   ` Bharata B Rao
  2022-05-11  2:27 ` [RFCv2 09/10] x86/mm: Add userspace API to enable Linear Address Masking Kirill A. Shutemov
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-11  2:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

LAM_U48 steals bits above bit 47 for tags and makes it impossible for
userspace to use the full address space on a 5-level paging machine.

Make these features mutually exclusive: whichever gets enabled first
blocks the other one.
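
The intended interaction, as a sketch (once LAM_U48 is enabled for the mm,
high address hints are no longer honoured):

	#include <sys/mman.h>

	void *hint = (void *)(1UL << 48);
	void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/*
	 * mmap_address_hint_valid() and task_size_64bit() now cap the
	 * usable address space at DEFAULT_MAP_WINDOW, so the mapping is
	 * placed below the 47-bit boundary instead of at the hint.
	 */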

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/elf.h         |  3 ++-
 arch/x86/include/asm/mmu_context.h | 13 +++++++++++++
 arch/x86/kernel/sys_x86_64.c       |  5 +++--
 arch/x86/mm/hugetlbpage.c          |  6 ++++--
 arch/x86/mm/mmap.c                 |  9 ++++++++-
 5 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 29fea180a665..53b96b0c8cc3 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -328,7 +328,8 @@ static inline int mmap_is_ia32(void)
 extern unsigned long task_size_32bit(void);
 extern unsigned long task_size_64bit(int full_addr_space);
 extern unsigned long get_mmap_base(int is_legacy);
-extern bool mmap_address_hint_valid(unsigned long addr, unsigned long len);
+extern bool mmap_address_hint_valid(struct mm_struct *mm,
+				    unsigned long addr, unsigned long len);
 extern unsigned long get_sigframe_size(void);
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 27516046117a..c8a6d80dfec3 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -218,6 +218,19 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
 
 unsigned long __get_current_cr3_fast(void);
 
+#ifdef CONFIG_X86_5LEVEL
+static inline bool full_va_allowed(struct mm_struct *mm)
+{
+	/* LAM_U48 steals VA bits above 47-bit for tags */
+	return mm->context.lam != LAM_U48;
+}
+#else
+static inline bool full_va_allowed(struct mm_struct *mm)
+{
+	return false;
+}
+#endif
+
 #include <asm-generic/mmu_context.h>
 
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 660b78827638..4526e8fadfd2 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 
 #include <asm/elf.h>
 #include <asm/ia32.h>
+#include <asm/mmu_context.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -185,7 +186,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	/* requesting a specific address */
 	if (addr) {
 		addr &= PAGE_MASK;
-		if (!mmap_address_hint_valid(addr, len))
+		if (!mmap_address_hint_valid(mm, addr, len))
 			goto get_unmapped_area;
 
 		vma = find_vma(mm, addr);
@@ -206,7 +207,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	 * !in_32bit_syscall() check to avoid high addresses for x32
 	 * (and make it no op on native i386).
 	 */
-	if (addr > DEFAULT_MAP_WINDOW && !in_32bit_syscall())
+	if (addr > DEFAULT_MAP_WINDOW && !in_32bit_syscall() && full_va_allowed(mm))
 		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
 
 	info.align_mask = 0;
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index a0d023cb4292..9fdc8db42365 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/elf.h>
+#include <asm/mmu_context.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -103,6 +104,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
+	struct mm_struct *mm = current->mm;
 	struct vm_unmapped_area_info info;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
@@ -114,7 +116,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
 	 * in the full address space.
 	 */
-	if (addr > DEFAULT_MAP_WINDOW && !in_32bit_syscall())
+	if (addr > DEFAULT_MAP_WINDOW && !in_32bit_syscall() && full_va_allowed(mm))
 		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
 
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
@@ -161,7 +163,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (addr) {
 		addr &= huge_page_mask(h);
-		if (!mmap_address_hint_valid(addr, len))
+		if (!mmap_address_hint_valid(mm, addr, len))
 			goto get_unmapped_area;
 
 		vma = find_vma(mm, addr);
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index c90c20904a60..f9ca824729de 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -21,6 +21,7 @@
 #include <linux/elf-randomize.h>
 #include <asm/elf.h>
 #include <asm/io.h>
+#include <asm/mmu_context.h>
 
 #include "physaddr.h"
 
@@ -35,6 +36,8 @@ unsigned long task_size_32bit(void)
 
 unsigned long task_size_64bit(int full_addr_space)
 {
+	if (!full_va_allowed(current->mm))
+		return DEFAULT_MAP_WINDOW;
 	return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
 }
 
@@ -206,11 +209,15 @@ const char *arch_vma_name(struct vm_area_struct *vma)
  * the failure of such a fixed mapping request, so the restriction is not
  * applied.
  */
-bool mmap_address_hint_valid(unsigned long addr, unsigned long len)
+bool mmap_address_hint_valid(struct mm_struct *mm,
+			     unsigned long addr, unsigned long len)
 {
 	if (TASK_SIZE - len < addr)
 		return false;
 
+	if (addr + len > DEFAULT_MAP_WINDOW && !full_va_allowed(mm))
+		return false;
+
 	return (addr > DEFAULT_MAP_WINDOW) == (addr + len > DEFAULT_MAP_WINDOW);
 }
 
-- 
2.35.1



* [RFCv2 09/10] x86/mm: Add userspace API to enable Linear Address Masking
  2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2022-05-11  2:27 ` [RFCv2 08/10] x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive Kirill A. Shutemov
@ 2022-05-11  2:27 ` Kirill A. Shutemov
  2022-05-11  7:26   ` Peter Zijlstra
  2022-05-11 14:15   ` H.J. Lu
  2022-05-11  2:27 ` [RFCv2 10/10] x86: Expose thread features status in /proc/$PID/arch_status Kirill A. Shutemov
  2022-05-11  6:49 ` [RFCv2 00/10] Linear Address Masking enabling Peter Zijlstra
  11 siblings, 2 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-11  2:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

Allow enabling Linear Address Masking via the ARCH_THREAD_FEATURE_ENABLE
arch_prctl(2).
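
A rough end-to-end sketch of the intended usage from userspace (prctl
constants as defined earlier in this series):

	#include <stdint.h>
	#include <stdlib.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define ARCH_THREAD_FEATURE_ENABLE	0x3002
	#define X86_THREAD_LAM_U57		0x2

	int main(void)
	{
		uint64_t *p, *tagged;

		if (syscall(SYS_arch_prctl, ARCH_THREAD_FEATURE_ENABLE,
			    X86_THREAD_LAM_U57) < 0)
			return 1;	/* LAM not available */

		p = malloc(sizeof(*p));
		*p = 42;

		/* Put a 6-bit tag into bits 62:57; hardware ignores them. */
		tagged = (uint64_t *)((uint64_t)p | (0x2aULL << 57));

		return *tagged == 42 ? 0 : 1;
	}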

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/process.c    | 21 +++++++++++++++-
 arch/x86/kernel/process.h    |  2 ++
 arch/x86/kernel/process_64.c | 46 ++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index cb8fc28f2eae..911c24321312 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -46,6 +46,8 @@
 #include <asm/proto.h>
 #include <asm/frame.h>
 #include <asm/unwind.h>
+#include <asm/mmu_context.h>
+#include <asm/compat.h>
 
 #include "process.h"
 
@@ -992,7 +994,9 @@ unsigned long __get_wchan(struct task_struct *p)
 static long thread_feature_prctl(struct task_struct *task, int option,
 				 unsigned long features)
 {
-	const unsigned long known_features = 0;
+	const unsigned long known_features =
+		X86_THREAD_LAM_U48 |
+		X86_THREAD_LAM_U57;
 
 	if (features & ~known_features)
 		return -EINVAL;
@@ -1013,8 +1017,23 @@ static long thread_feature_prctl(struct task_struct *task, int option,
 
 	/* Handle ARCH_THREAD_FEATURE_ENABLE */
 
+	if (features & (X86_THREAD_LAM_U48 | X86_THREAD_LAM_U57)) {
+		long ret;
+
+		/* LAM is only available in long mode */
+		if (in_32bit_syscall())
+			return -EINVAL;
+
+		ret = enable_lam(task, features);
+		if (ret)
+			return ret;
+	}
+
 	task->thread.features |= features;
 out:
+	/* Update CR3 to get LAM active */
+	switch_mm(task->mm, task->mm, task);
+
 	return task->thread.features;
 }
 
diff --git a/arch/x86/kernel/process.h b/arch/x86/kernel/process.h
index 76b547b83232..b8fa0e599c6e 100644
--- a/arch/x86/kernel/process.h
+++ b/arch/x86/kernel/process.h
@@ -4,6 +4,8 @@
 
 #include <asm/spec-ctrl.h>
 
+long enable_lam(struct task_struct *task, unsigned long features);
+
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p);
 
 /*
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index e459253649be..a25c51da7005 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -729,6 +729,52 @@ void set_personality_ia32(bool x32)
 }
 EXPORT_SYMBOL_GPL(set_personality_ia32);
 
+static bool lam_u48_allowed(void)
+{
+	struct mm_struct *mm = current->mm;
+
+	if (!full_va_allowed(mm))
+		return true;
+
+	return find_vma(mm, DEFAULT_MAP_WINDOW) == NULL;
+}
+
+long enable_lam(struct task_struct *task, unsigned long features)
+{
+	features |= task->thread.features;
+
+	/* LAM_U48 and LAM_U57 are mutually exclusive */
+	if ((features & X86_THREAD_LAM_U48) && (features & X86_THREAD_LAM_U57))
+		return -EINVAL;
+
+	if (!cpu_feature_enabled(X86_FEATURE_LAM))
+		return -ENXIO;
+
+	if (mmap_write_lock_killable(task->mm))
+		return -EINTR;
+
+	if ((features & X86_THREAD_LAM_U48) && !lam_u48_allowed()) {
+		mmap_write_unlock(task->mm);
+		return -EINVAL;
+	}
+
+	/*
+	 * Record the most permissive (allowing the widest tags) LAM
+	 * mode to the mm context. It determines if mappings above
+	 * 47 bits are allowed for the process.
+	 *
+	 * The mode is also used by a kernel thread when it does work
+	 * on behalf of the process (like async I/O, io_uring, etc.)
+	 */
+	if (features & X86_THREAD_LAM_U48)
+		current->mm->context.lam = LAM_U48;
+	else if (current->mm->context.lam == LAM_NONE)
+		current->mm->context.lam = LAM_U57;
+
+	mmap_write_unlock(task->mm);
+	return 0;
+}
+
 #ifdef CONFIG_CHECKPOINT_RESTORE
 static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
 {
-- 
2.35.1



* [RFCv2 10/10] x86: Expose thread features status in /proc/$PID/arch_status
  2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2022-05-11  2:27 ` [RFCv2 09/10] x86/mm: Add userspace API to enable Linear Address Masking Kirill A. Shutemov
@ 2022-05-11  2:27 ` Kirill A. Shutemov
  2022-05-11  6:49 ` [RFCv2 00/10] Linear Address Masking enabling Peter Zijlstra
  11 siblings, 0 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-11  2:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

Add two lines to /proc/$PID/arch_status to report the enabled and locked
thread features.
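
For a thread that has enabled and locked LAM_U57, the new lines would look
something like this:

	$ cat /proc/$PID/arch_status
	...
	Thread_features:        lam_u57
	Thread_features_locked: lam_u57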

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/Makefile     |  2 ++
 arch/x86/kernel/fpu/xstate.c | 47 ---------------------------
 arch/x86/kernel/proc.c       | 63 ++++++++++++++++++++++++++++++++++++
 3 files changed, 65 insertions(+), 47 deletions(-)
 create mode 100644 arch/x86/kernel/proc.c

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index c41ef42adbe8..19dae7a4201b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -149,6 +149,8 @@ obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)		+= sev.o
 
+obj-$(CONFIG_PROC_FS)			+= proc.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 39e1c8626ab9..789a7a1429df 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -10,8 +10,6 @@
 #include <linux/mman.h>
 #include <linux/nospec.h>
 #include <linux/pkeys.h>
-#include <linux/seq_file.h>
-#include <linux/proc_fs.h>
 #include <linux/vmalloc.h>
 
 #include <asm/fpu/api.h>
@@ -1730,48 +1728,3 @@ long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2)
 		return -EINVAL;
 	}
 }
-
-#ifdef CONFIG_PROC_PID_ARCH_STATUS
-/*
- * Report the amount of time elapsed in millisecond since last AVX512
- * use in the task.
- */
-static void avx512_status(struct seq_file *m, struct task_struct *task)
-{
-	unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
-	long delta;
-
-	if (!timestamp) {
-		/*
-		 * Report -1 if no AVX512 usage
-		 */
-		delta = -1;
-	} else {
-		delta = (long)(jiffies - timestamp);
-		/*
-		 * Cap to LONG_MAX if time difference > LONG_MAX
-		 */
-		if (delta < 0)
-			delta = LONG_MAX;
-		delta = jiffies_to_msecs(delta);
-	}
-
-	seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
-	seq_putc(m, '\n');
-}
-
-/*
- * Report architecture specific information
- */
-int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
-			struct pid *pid, struct task_struct *task)
-{
-	/*
-	 * Report AVX512 state if the processor and build option supported.
-	 */
-	if (cpu_feature_enabled(X86_FEATURE_AVX512F))
-		avx512_status(m, task);
-
-	return 0;
-}
-#endif /* CONFIG_PROC_PID_ARCH_STATUS */
diff --git a/arch/x86/kernel/proc.c b/arch/x86/kernel/proc.c
new file mode 100644
index 000000000000..7b2f39031d8a
--- /dev/null
+++ b/arch/x86/kernel/proc.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <uapi/asm/prctl.h>
+
+/*
+ * Report the amount of time elapsed in milliseconds since the last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+	unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+	long delta;
+
+	if (!timestamp) {
+		/*
+		 * Report -1 if no AVX512 usage
+		 */
+		delta = -1;
+	} else {
+		delta = (long)(jiffies - timestamp);
+		/*
+		 * Cap to LONG_MAX if time difference > LONG_MAX
+		 */
+		if (delta < 0)
+			delta = LONG_MAX;
+		delta = jiffies_to_msecs(delta);
+	}
+
+	seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+	seq_putc(m, '\n');
+}
+
+static void dump_features(struct seq_file *m, unsigned long features)
+{
+	if (features & X86_THREAD_LAM_U48)
+		seq_puts(m, "lam_u48 ");
+	if (features & X86_THREAD_LAM_U57)
+		seq_puts(m, "lam_u57 ");
+}
+
+/*
+ * Report architecture specific information
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+			struct pid *pid, struct task_struct *task)
+{
+	/*
+	 * Report AVX512 state if the processor and build option supported.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+		avx512_status(m, task);
+
+	seq_puts(m, "Thread_features:\t");
+	dump_features(m, task->thread.features);
+	seq_putc(m, '\n');
+
+	seq_puts(m, "Thread_features_locked:\t");
+	dump_features(m, task->thread.features_locked);
+	seq_putc(m, '\n');
+
+	return 0;
+}
-- 
2.35.1
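
For illustration only: on a task that has enabled LAM_U57, has nothing
locked and never used AVX512, the relevant part of /proc/$PID/arch_status
would look roughly like this (values made up; note the trailing space that
dump_features() leaves after each feature name):

AVX512_elapsed_ms:	-1
Thread_features:	lam_u57 
Thread_features_locked:	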


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2022-05-11  2:27 ` [RFCv2 10/10] x86: Expose thread features status in /proc/$PID/arch_status Kirill A. Shutemov
@ 2022-05-11  6:49 ` Peter Zijlstra
  2022-05-12 15:42   ` Thomas Gleixner
  2022-05-12 17:22   ` Dave Hansen
  11 siblings, 2 replies; 82+ messages in thread
From: Peter Zijlstra @ 2022-05-11  6:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, H . J . Lu, Andi Kleen,
	Rick Edgecombe, linux-mm, linux-kernel

On Wed, May 11, 2022 at 05:27:40AM +0300, Kirill A. Shutemov wrote:
> Hi all. Here's long overdue update on LAM enabling.
> 
> # Description #
> 
> Linear Address Masking[1] (LAM) modifies the checking that is applied to
> 64-bit linear addresses, allowing software to use of the untranslated
> address bits for metadata.
> 
> The patchset brings support for LAM for userspace addresses.
> 
> The most sensitive part of enabling is change in tlb.c, where CR3 flags
> get set. Please take a look that what I'm doing makes sense.
> 
> The feature competes for bits with 5-level paging: LAM_U48 makes it
> impossible to map anything about 47-bits. The patchset made these
> capability mutually exclusive: whatever used first wins. LAM_U57 can be
> combined with mappings above 47-bits.

So aren't we creating a problem with LAM_U48 where programs relying on
it are of limited sustainability?

Any such program simply *cannot* run on 5 level pagetables. Why do we
want to do this?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 04/10] x86/mm: Introduce X86_THREAD_LAM_U48 and X86_THREAD_LAM_U57
  2022-05-11  2:27 ` [RFCv2 04/10] x86/mm: Introduce X86_THREAD_LAM_U48 and X86_THREAD_LAM_U57 Kirill A. Shutemov
@ 2022-05-11  7:02   ` Peter Zijlstra
  2022-05-12 12:24     ` Thomas Gleixner
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2022-05-11  7:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, H . J . Lu, Andi Kleen,
	Rick Edgecombe, linux-mm, linux-kernel

On Wed, May 11, 2022 at 05:27:45AM +0300, Kirill A. Shutemov wrote:

> +#define LAM_NONE	0
> +#define LAM_U57		1
> +#define LAM_U48		2

> +#define X86_THREAD_LAM_U48		0x1
> +#define X86_THREAD_LAM_U57		0x2

Seriously pick an order and stick with it. I would suggest keeping the
hardware order and then you can do:

> +static inline unsigned long lam_to_cr3(u8 lam)
> +{
> +	switch (lam) {
> +	case LAM_NONE:
> +		return 0;
> +	case LAM_U57:
> +		return X86_CR3_LAM_U57;
> +	case LAM_U48:
> +		return X86_CR3_LAM_U48;
> +	default:
> +		WARN_ON_ONCE(1);
> +		return 0;
> +	}

	return (lam & 0x3) << X86_CR3_LAM_U57;

> +}
> +
> +static inline u8 cr3_to_lam(unsigned long cr3)
> +{
> +	if (cr3 & X86_CR3_LAM_U57)
> +		return LAM_U57;
> +	if (cr3 & X86_CR3_LAM_U48)
> +		return LAM_U48;
> +	return 0;


	return (cr3 >> X86_CR3_LAM_U57) & 0x3;

> +}

and call it a day, or something.

I'm still not liking LAM(e), I'm thinking it's going to create more
problems than it solves.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 05/10] x86/mm: Provide untagged_addr() helper
  2022-05-11  2:27 ` [RFCv2 05/10] x86/mm: Provide untagged_addr() helper Kirill A. Shutemov
@ 2022-05-11  7:21   ` Peter Zijlstra
  2022-05-11  7:45     ` Peter Zijlstra
  2022-05-12 13:06   ` Thomas Gleixner
  1 sibling, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2022-05-11  7:21 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, H . J . Lu, Andi Kleen,
	Rick Edgecombe, linux-mm, linux-kernel

On Wed, May 11, 2022 at 05:27:46AM +0300, Kirill A. Shutemov wrote:
> +#define __untagged_addr(addr, n)	\
> +	((__force __typeof__(addr))sign_extend64((__force u64)(addr), n))
> +
> +#define untagged_addr(addr)	({					\
> +	u64 __addr = (__force u64)(addr);				\
> +	if (__addr >> 63 == 0) {					\
> +		if (current->thread.features & X86_THREAD_LAM_U57)	\
> +			__addr &= __untagged_addr(__addr, 56);		\
> +		else if (current->thread.features & X86_THREAD_LAM_U48)	\
> +			__addr &= __untagged_addr(__addr, 47);		\
> +	}								\
> +	(__force __typeof__(addr))__addr;				\
> +})

Assuming you got your bits in hardware order:

	u64 __addr = addr;
	if ((s64)__addr >= 0) {
		int lam = (current->thread.features >> X86_THREAD_LAM_U57) & 3;
		if (lam)
			__addr &= sign_extend64(__addr, 65 - 9*lam);
	}
	__addr;

has fewer branches and should definitely result in better code (or I
need more morning juice).

> +
> +#define untagged_ptr(ptr)	({					\
> +	u64 __ptrval = (__force u64)(ptr);				\
> +	__ptrval = untagged_addr(__ptrval);				\
> +	(__force __typeof__(*(ptr)) *)__ptrval;				\
> +})
>  #endif	/* !__ASSEMBLY__ */
>  
>  #ifdef CONFIG_X86_VSYSCALL_EMULATION
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 07/10] x86/mm: Handle tagged memory accesses from kernel threads
  2022-05-11  2:27 ` [RFCv2 07/10] x86/mm: Handle tagged memory accesses from kernel threads Kirill A. Shutemov
@ 2022-05-11  7:23   ` Peter Zijlstra
  2022-05-12 13:30   ` Thomas Gleixner
  1 sibling, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2022-05-11  7:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, H . J . Lu, Andi Kleen,
	Rick Edgecombe, linux-mm, linux-kernel

On Wed, May 11, 2022 at 05:27:48AM +0300, Kirill A. Shutemov wrote:
> When a kernel thread performs memory access on behalf of a process (like
> in async I/O, io_uring, etc.) it has to respect tagging setup of the
> process as user addresses can include tags.
> 
> Normally, LAM setup is per-thread and recorded in thread features, but
> for this use case kernel also tracks LAM setup per-mm. mm->context.lam
> would record LAM that allows the most tag bits among the threads of
> the mm.

Then why does it *ever* make sense to track it per thread? It's not like
it makes heaps of sense to allow one thread in a process to use LAM but
not the others.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 09/10] x86/mm: Add userspace API to enable Linear Address Masking
  2022-05-11  2:27 ` [RFCv2 09/10] x86/mm: Add userspace API to enable Linear Address Masking Kirill A. Shutemov
@ 2022-05-11  7:26   ` Peter Zijlstra
  2022-05-12 14:46     ` Thomas Gleixner
  2022-05-11 14:15   ` H.J. Lu
  1 sibling, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2022-05-11  7:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, H . J . Lu, Andi Kleen,
	Rick Edgecombe, linux-mm, linux-kernel

On Wed, May 11, 2022 at 05:27:50AM +0300, Kirill A. Shutemov wrote:
> @@ -1013,8 +1017,23 @@ static long thread_feature_prctl(struct task_struct *task, int option,
>  
>  	/* Handle ARCH_THREAD_FEATURE_ENABLE */
>  
> +	if (features & (X86_THREAD_LAM_U48 | X86_THREAD_LAM_U57)) {
> +		long ret;
> +
> +		/* LAM is only available in long mode */
> +		if (in_32bit_syscall())
> +			return -EINVAL;

So what happens if userspace sets up a 32bit code entry in the LDT and
does the LAM thing as a 64bit syscall but then goes and runs 32bit code?

> +
> +		ret = enable_lam(task, features);
> +		if (ret)
> +			return ret;
> +	}
> +
>  	task->thread.features |= features;
>  out:
> +	/* Update CR3 to get LAM active */
> +	switch_mm(task->mm, task->mm, task);
> +
>  	return task->thread.features;
>  }
>  

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 05/10] x86/mm: Provide untagged_addr() helper
  2022-05-11  7:21   ` Peter Zijlstra
@ 2022-05-11  7:45     ` Peter Zijlstra
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2022-05-11  7:45 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, H . J . Lu, Andi Kleen,
	Rick Edgecombe, linux-mm, linux-kernel

On Wed, May 11, 2022 at 09:21:16AM +0200, Peter Zijlstra wrote:
> On Wed, May 11, 2022 at 05:27:46AM +0300, Kirill A. Shutemov wrote:
> > +#define __untagged_addr(addr, n)	\
> > +	((__force __typeof__(addr))sign_extend64((__force u64)(addr), n))
> > +
> > +#define untagged_addr(addr)	({					\
> > +	u64 __addr = (__force u64)(addr);				\
> > +	if (__addr >> 63 == 0) {					\
> > +		if (current->thread.features & X86_THREAD_LAM_U57)	\
> > +			__addr &= __untagged_addr(__addr, 56);		\
> > +		else if (current->thread.features & X86_THREAD_LAM_U48)	\
> > +			__addr &= __untagged_addr(__addr, 47);		\
> > +	}								\
> > +	(__force __typeof__(addr))__addr;				\
> > +})
> 
> Assuming you got your bits in hardware order:
> 
> 	u64 __addr = addr;
> 	if ((s64)__addr >= 0) {
> 		int lam = (current->thread.features >> X86_THREAD_LAM_U57) & 3;

That needs a _BIT suffix or something, same in the previous reply.

> 		if (lam)
> 			__addr &= sign_extend64(__addr, 65 - 9*lam);
> 	}
> 	__addr;
> 
> has less branches on and should definitely result in better code (or I
> need more morning juice).

I definitely need more morning juice :-)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 09/10] x86/mm: Add userspace API to enable Linear Address Masking
  2022-05-11  2:27 ` [RFCv2 09/10] x86/mm: Add userspace API to enable Linear Address Masking Kirill A. Shutemov
  2022-05-11  7:26   ` Peter Zijlstra
@ 2022-05-11 14:15   ` H.J. Lu
  2022-05-12 14:21     ` Thomas Gleixner
  1 sibling, 1 reply; 82+ messages in thread
From: H.J. Lu @ 2022-05-11 14:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Andrey Ryabinin, Alexander Potapenko,
	Dmitry Vyukov, Andi Kleen, Rick Edgecombe, Linux-MM, LKML

On Tue, May 10, 2022 at 7:29 PM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> Allow to enable Linear Address Masking via ARCH_THREAD_FEATURE_ENABLE
> arch_prctl(2).
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/kernel/process.c    | 21 +++++++++++++++-
>  arch/x86/kernel/process.h    |  2 ++
>  arch/x86/kernel/process_64.c | 46 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 68 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index cb8fc28f2eae..911c24321312 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -46,6 +46,8 @@
>  #include <asm/proto.h>
>  #include <asm/frame.h>
>  #include <asm/unwind.h>
> +#include <asm/mmu_context.h>
> +#include <asm/compat.h>
>
>  #include "process.h"
>
> @@ -992,7 +994,9 @@ unsigned long __get_wchan(struct task_struct *p)
>  static long thread_feature_prctl(struct task_struct *task, int option,
>                                  unsigned long features)

Since this arch_prctl will also be used for CET,  which supports
32-bit processes,
shouldn't int, instead of long, be used?

>  {
> -       const unsigned long known_features = 0;
> +       const unsigned long known_features =
> +               X86_THREAD_LAM_U48 |
> +               X86_THREAD_LAM_U57;
>
>         if (features & ~known_features)
>                 return -EINVAL;
> @@ -1013,8 +1017,23 @@ static long thread_feature_prctl(struct task_struct *task, int option,
>
>         /* Handle ARCH_THREAD_FEATURE_ENABLE */
>
> +       if (features & (X86_THREAD_LAM_U48 | X86_THREAD_LAM_U57)) {
> +               long ret;
> +
> +               /* LAM is only available in long mode */
> +               if (in_32bit_syscall())
> +                       return -EINVAL;
> +
> +               ret = enable_lam(task, features);
> +               if (ret)
> +                       return ret;
> +       }
> +
>         task->thread.features |= features;
>  out:
> +       /* Update CR3 to get LAM active */
> +       switch_mm(task->mm, task->mm, task);
> +
>         return task->thread.features;
>  }
>
> diff --git a/arch/x86/kernel/process.h b/arch/x86/kernel/process.h
> index 76b547b83232..b8fa0e599c6e 100644
> --- a/arch/x86/kernel/process.h
> +++ b/arch/x86/kernel/process.h
> @@ -4,6 +4,8 @@
>
>  #include <asm/spec-ctrl.h>
>
> +long enable_lam(struct task_struct *task, unsigned long features);
> +
>  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p);
>
>  /*
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index e459253649be..a25c51da7005 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -729,6 +729,52 @@ void set_personality_ia32(bool x32)
>  }
>  EXPORT_SYMBOL_GPL(set_personality_ia32);
>
> +static bool lam_u48_allowed(void)
> +{
> +       struct mm_struct *mm = current->mm;
> +
> +       if (!full_va_allowed(mm))
> +               return true;
> +
> +       return find_vma(mm, DEFAULT_MAP_WINDOW) == NULL;
> +}
> +
> +long enable_lam(struct task_struct *task, unsigned long features)
> +{
> +       features |= task->thread.features;
> +
> +       /* LAM_U48 and LAM_U57 are mutually exclusive */
> +       if ((features & X86_THREAD_LAM_U48) && (features & X86_THREAD_LAM_U57))
> +               return -EINVAL;
> +
> +       if (!cpu_feature_enabled(X86_FEATURE_LAM))
> +               return -ENXIO;
> +
> +       if (mmap_write_lock_killable(task->mm))
> +               return -EINTR;
> +
> +       if ((features & X86_THREAD_LAM_U48) && !lam_u48_allowed()) {
> +               mmap_write_unlock(task->mm);
> +               return -EINVAL;
> +       }
> +
> +       /*
> +        * Record the most permissive (allowing the widest tags) LAM
> +        * mode in the mm context. It determines whether mappings above
> +        * 47 bits are allowed for the process.
> +        *
> +        * The mode is also used by a kernel thread when it does work
> +        * on behalf of the process (like async I/O, io_uring, etc.)
> +        */
> +       if (features & X86_THREAD_LAM_U48)
> +               current->mm->context.lam = LAM_U48;
> +       else if (current->mm->context.lam == LAM_NONE)
> +               current->mm->context.lam = LAM_U57;
> +
> +       mmap_write_unlock(task->mm);
> +       return 0;
> +}
> +
>  #ifdef CONFIG_CHECKPOINT_RESTORE
>  static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
>  {
> --
> 2.35.1
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features
  2022-05-11  2:27 ` [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features Kirill A. Shutemov
@ 2022-05-12 12:02   ` Thomas Gleixner
  2022-05-12 12:04     ` [PATCH] x86/prctl: Remove pointless task argument Thomas Gleixner
  2022-05-13 14:09   ` [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features Alexander Potapenko
  1 sibling, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 12:02 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Wed, May 11 2022 at 05:27, Kirill A. Shutemov wrote:
> Add three new arch_prctl() handles:
>  
> +static long thread_feature_prctl(struct task_struct *task, int option,
> +				 unsigned long features)

Bah. I really hate the task pointer on all these x86 prctls. @task must
always be current, so this @task argument is just confusion.

> +{
> +	const unsigned long known_features = 0;
> +
> +	if (features & ~known_features)
> +		return -EINVAL;

This implementation allows reading task->features[_locked] with
@features == 0. That should be documented somewhere.

> +	if (option == ARCH_THREAD_FEATURE_LOCK) {
> +		task->thread.features_locked |= features;
> +		return task->thread.features_locked;
> +	}


> +	/* Do not allow to change locked features */
> +	if (features & task->thread.features_locked)
> +		return -EPERM;
> +
> +	if (option == ARCH_THREAD_FEATURE_DISABLE) {
> +		task->thread.features &= ~features;
> +		goto out;
> +	}
> +
> +	/* Handle ARCH_THREAD_FEATURE_ENABLE */
> +
> +	task->thread.features |= features;
> +out:
> +	return task->thread.features;

Eyes bleed.

	if (option == ARCH_THREAD_FEATURE_ENABLE)
        	task->thread.features |= features;
        else
        	task->thread.features &= ~features;

	return task->thread.features;

It's bloody obvious from the code that the counterpart of enable is
disable, no? So neither the goto nor the 'comment the obvious' is useful
in any way.

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH] x86/prctl: Remove pointless task argument
  2022-05-12 12:02   ` Thomas Gleixner
@ 2022-05-12 12:04     ` Thomas Gleixner
  2022-05-13 12:30       ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
  0 siblings, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 12:04 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J. Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

The functions invoked via do_arch_prctl_common() can only operate on
the current task and none of these function uses the task argument.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/fpu/api.h |    3 +--
 arch/x86/include/asm/proto.h   |    3 +--
 arch/x86/kernel/fpu/xstate.c   |    5 +----
 arch/x86/kernel/process.c      |    9 ++++-----
 arch/x86/kernel/process_32.c   |    2 +-
 arch/x86/kernel/process_64.c   |    4 ++--
 6 files changed, 10 insertions(+), 16 deletions(-)

--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -162,7 +162,6 @@ static inline bool fpstate_is_confidenti
 }
 
 /* prctl */
-struct task_struct;
-extern long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2);
+extern long fpu_xstate_prctl(int option, unsigned long arg2);
 
 #endif /* _ASM_X86_FPU_API_H */
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -38,7 +38,6 @@ void x86_configure_nx(void);
 
 extern int reboot_force;
 
-long do_arch_prctl_common(struct task_struct *task, int option,
-			  unsigned long arg2);
+long do_arch_prctl_common(int option, unsigned long arg2);
 
 #endif /* _ASM_X86_PROTO_H */
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1705,16 +1705,13 @@ EXPORT_SYMBOL_GPL(xstate_get_guest_group
  * e.g. for AMX which requires XFEATURE_XTILE_CFG(17) and
  * XFEATURE_XTILE_DATA(18) this would be XFEATURE_XTILE_DATA(18).
  */
-long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2)
+long fpu_xstate_prctl(int option, unsigned long arg2)
 {
 	u64 __user *uptr = (u64 __user *)arg2;
 	u64 permitted, supported;
 	unsigned long idx = arg2;
 	bool guest = false;
 
-	if (tsk != current)
-		return -EPERM;
-
 	switch (option) {
 	case ARCH_GET_XCOMP_SUPP:
 		supported = fpu_user_cfg.max_features |	fpu_user_cfg.legacy_features;
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -332,7 +332,7 @@ static int get_cpuid_mode(void)
 	return !test_thread_flag(TIF_NOCPUID);
 }
 
-static int set_cpuid_mode(struct task_struct *task, unsigned long cpuid_enabled)
+static int set_cpuid_mode(unsigned long cpuid_enabled)
 {
 	if (!boot_cpu_has(X86_FEATURE_CPUID_FAULT))
 		return -ENODEV;
@@ -983,20 +983,19 @@ unsigned long __get_wchan(struct task_st
 	return addr;
 }
 
-long do_arch_prctl_common(struct task_struct *task, int option,
-			  unsigned long arg2)
+long do_arch_prctl_common(int option, unsigned long arg2)
 {
 	switch (option) {
 	case ARCH_GET_CPUID:
 		return get_cpuid_mode();
 	case ARCH_SET_CPUID:
-		return set_cpuid_mode(task, arg2);
+		return set_cpuid_mode(arg2);
 	case ARCH_GET_XCOMP_SUPP:
 	case ARCH_GET_XCOMP_PERM:
 	case ARCH_REQ_XCOMP_PERM:
 	case ARCH_GET_XCOMP_GUEST_PERM:
 	case ARCH_REQ_XCOMP_GUEST_PERM:
-		return fpu_xstate_prctl(task, option, arg2);
+		return fpu_xstate_prctl(option, arg2);
 	}
 
 	return -EINVAL;
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -219,5 +219,5 @@ EXPORT_SYMBOL_GPL(start_thread);
 
 SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
 {
-	return do_arch_prctl_common(current, option, arg2);
+	return do_arch_prctl_common(option, arg2);
 }
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -844,7 +844,7 @@ SYSCALL_DEFINE2(arch_prctl, int, option,
 
 	ret = do_arch_prctl_64(current, option, arg2);
 	if (ret == -EINVAL)
-		ret = do_arch_prctl_common(current, option, arg2);
+		ret = do_arch_prctl_common(option, arg2);
 
 	return ret;
 }
@@ -852,7 +852,7 @@ SYSCALL_DEFINE2(arch_prctl, int, option,
 #ifdef CONFIG_IA32_EMULATION
 COMPAT_SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
 {
-	return do_arch_prctl_common(current, option, arg2);
+	return do_arch_prctl_common(option, arg2);
 }
 #endif
 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 04/10] x86/mm: Introduce X86_THREAD_LAM_U48 and X86_THREAD_LAM_U57
  2022-05-11  7:02   ` Peter Zijlstra
@ 2022-05-12 12:24     ` Thomas Gleixner
  2022-05-12 14:37       ` Peter Zijlstra
  0 siblings, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 12:24 UTC (permalink / raw)
  To: Peter Zijlstra, Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, H . J . Lu, Andi Kleen,
	Rick Edgecombe, linux-mm, linux-kernel

On Wed, May 11 2022 at 09:02, Peter Zijlstra wrote:
> On Wed, May 11, 2022 at 05:27:45AM +0300, Kirill A. Shutemov wrote:
>
>> +#define LAM_NONE	0
>> +#define LAM_U57		1
>> +#define LAM_U48		2
>
>> +#define X86_THREAD_LAM_U48		0x1
>> +#define X86_THREAD_LAM_U57		0x2
>
> Seriously pick an order and stick with it. I would suggest keeping the
> hardware order and then you can do:
>
>> +static inline unsigned long lam_to_cr3(u8 lam)
>> +{
>
> 	return (lam & 0x3) << X86_CR3_LAM_U57;

This "works" because the hardware ignores LAM_U48 if LAM_U57 is set, but
I'd rather make that exclusive in the prctl() as setting both does not
make any sense.

> I'm still not liking LAM(e), I'm thikning it's going to create more
> problems than it solves.

Isn't that true for most new hardware features?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [PATCH] x86: Implement Linear Address Masking support
  2022-05-11  2:27 ` [PATCH] x86: Implement Linear Address Masking support Kirill A. Shutemov
@ 2022-05-12 13:01   ` David Laight
  2022-05-12 14:07     ` Matthew Wilcox
                       ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: David Laight @ 2022-05-12 13:01 UTC (permalink / raw)
  To: 'Kirill A. Shutemov',
	Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

From: Kirill A. Shutemov
> Sent: 11 May 2022 03:28
> 
> Linear Address Masking feature makes CPU ignore some bits of the virtual
> address. These bits can be used to encode metadata.
> 
> The feature is enumerated with CPUID.(EAX=07H, ECX=01H):EAX.LAM[bit 26].
> 
> CR3.LAM_U57[bit 62] allows to encode 6 bits of metadata in bits 62:57 of
> user pointers.
> 
> CR3.LAM_U48[bit 61] allows to encode 15 bits of metadata in bits 62:48
> of user pointers.
> 
> CR4.LAM_SUP[bit 28] allows to encode metadata of supervisor pointers.
> If 5-level paging is in use, 6 bits of metadata can be encoded in 62:57.
> For 4-level paging, 15 bits of metadata can be encoded in bits 62:48.
> 
...
> +static vaddr clean_addr(CPUArchState *env, vaddr addr)
> +{
> +    CPUClass *cc = CPU_GET_CLASS(env_cpu(env));
> +
> +    if (cc->tcg_ops->do_clean_addr) {
> +        addr = cc->tcg_ops->do_clean_addr(env_cpu(env), addr);

The performance of a conditional indirect call will be horrid.
Over-engineered when there is only one possible function.

....
> +
> +static inline int64_t sign_extend64(uint64_t value, int index)
> +{
> +    int shift = 63 - index;
> +    return (int64_t)(value << shift) >> shift;
> +}

Shifts of signed integers are UB.

> +vaddr x86_cpu_clean_addr(CPUState *cs, vaddr addr)
> +{
> +    CPUX86State *env = &X86_CPU(cs)->env;
> +    bool la57 = env->cr[4] & CR4_LA57_MASK;
> +
> +    if (addr >> 63) {
> +        if (env->cr[4] & CR4_LAM_SUP) {
> +            return sign_extend64(addr, la57 ? 56 : 47);
> +        }
> +    } else {
> +        if (env->cr[3] & CR3_LAM_U57) {
> +            return sign_extend64(addr, 56);
> +        } else if (env->cr[3] & CR3_LAM_U48) {
> +            return sign_extend64(addr, 47);
> +        }
> +    }

That is completely horrid.
Surely it can be just:
	if (addr & (1ull << 63))
		return addr | env->address_mask;
	else
		return addr & ~env->address_mask;
Where 'address_mask' is 0x7ff....
although since you really want a big gap between valid user and
valid kernel addresses, allowing masked kernel addresses adds
costs elsewhere.

I've no idea how often the address masking is required?
Hopefully almost never?

copy_to/from_user() (etc) need to be able to use user addresses
without having to mask them.

	David



^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [RFCv2 06/10] x86/uaccess: Remove tags from the address before checking
  2022-05-11  2:27 ` [RFCv2 06/10] x86/uaccess: Remove tags from the address before checking Kirill A. Shutemov
@ 2022-05-12 13:02   ` David Laight
  0 siblings, 0 replies; 82+ messages in thread
From: David Laight @ 2022-05-12 13:02 UTC (permalink / raw)
  To: 'Kirill A. Shutemov',
	Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

From: Kirill A. Shutemov
> Sent: 11 May 2022 03:28
> 
> The tags must not be included into check whether it's okay to access the
> userspace address.
> 
> Strip tags in  access_ok().

Why?
Just change the test to check the buffer doesn't cross 1ull << 63.


	David



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 05/10] x86/mm: Provide untagged_addr() helper
  2022-05-11  2:27 ` [RFCv2 05/10] x86/mm: Provide untagged_addr() helper Kirill A. Shutemov
  2022-05-11  7:21   ` Peter Zijlstra
@ 2022-05-12 13:06   ` Thomas Gleixner
  2022-05-12 14:23     ` Peter Zijlstra
  1 sibling, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 13:06 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Wed, May 11 2022 at 05:27, Kirill A. Shutemov wrote:
> +#define __untagged_addr(addr, n)	\
> +	((__force __typeof__(addr))sign_extend64((__force u64)(addr), n))

How is this supposed to be correct? This sign extends based on bit 47
resp. 56, i.e. the topmost bit of the userspace address space for the
LAM mode.

So if that bit _is_ set, then the result has bit 48-63 resp. 57-63 set
as well. Not really what you want, right?

This has to mask out bit 48-62 resp. 57-62 and leave all other bits
alone.

> +#define untagged_addr(addr)	({					\
> +	u64 __addr = (__force u64)(addr);				\
> +	if (__addr >> 63 == 0) {					\
> +		if (current->thread.features & X86_THREAD_LAM_U57)	\
> +			__addr &= __untagged_addr(__addr, 56);		\
> +		else if (current->thread.features & X86_THREAD_LAM_U48)	\
> +			__addr &= __untagged_addr(__addr, 47);		\
> +	}								\
> +	(__force __typeof__(addr))__addr;				\
> +})

So this wants something like this:

#define untagged_addr(addr)	({			\
	u64 __addr = (__force u64)(addr);		\
							\
	__addr &= current->thread.lam_untag_mask;	\
	(__force __typeof__(addr))__addr;		\
})

No conditionals, fast _and_ correct. Setting this untag mask up once
when LAM is enabled is not rocket science.
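
For example, the once-per-enable mask could simply clear the metadata bits
and leave everything else, including bit 63, alone (constant names below are
illustrative, not from the patchset):

/* Illustrative untag masks, stored in current->thread.lam_untag_mask */
#define LAM_U57_UNTAG_MASK	(~GENMASK_ULL(62, 57))
#define LAM_U48_UNTAG_MASK	(~GENMASK_ULL(62, 48))
#define LAM_NONE_UNTAG_MASK	(~0ULL)

with lam_untag_mask initialised to ~0ULL so the AND is a no-op for tasks
that never enabled LAM.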

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 07/10] x86/mm: Handle tagged memory accesses from kernel threads
  2022-05-11  2:27 ` [RFCv2 07/10] x86/mm: Handle tagged memory accesses from kernel threads Kirill A. Shutemov
  2022-05-11  7:23   ` Peter Zijlstra
@ 2022-05-12 13:30   ` Thomas Gleixner
  1 sibling, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 13:30 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Wed, May 11 2022 at 05:27, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index f9fe71d1f42c..b320556e1c22 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -185,6 +185,34 @@ static u8 gen_lam(struct task_struct *tsk, struct mm_struct *mm)
>  	if (!tsk)
>  		return LAM_NONE;
>  
> +	if (tsk->flags & PF_KTHREAD) {
> +		/*
> +		 * For kernel thread use the most permissive LAM
> +		 * used by the mm. It's required to handle kernel thread
> +		 * memory accesses on behalf of a process.
> +		 *
> +		 * Adjust thread flags accordingly, so untagged_addr() would
> +		 * work correctly.
> +		 */
> +
> +		tsk->thread.features &= ~(X86_THREAD_LAM_U48 |
> +					  X86_THREAD_LAM_U57);
> +
> +		switch (mm->context.lam) {
> +		case LAM_NONE:
> +			return LAM_NONE;
> +		case LAM_U57:
> +			tsk->thread.features |= X86_THREAD_LAM_U57;
> +			return LAM_U57;
> +		case LAM_U48:
> +			tsk->thread.features |= X86_THREAD_LAM_U48;
> +			return LAM_U48;

Pretending that LAM is configurable per thread and then having a magic
override in the per process mm when accessing that process' memory from
a kernel thread is inconsistent, a horrible hack and a recipe for
hard to diagnose problems.

LAM has to be enabled by the process _before_ creating threads and then
stay enabled until the whole thing dies. That's the only sensible use
case.

I understand that tsk->thread.features is convenient for the untagging
mechanism, but the whole setup should be:

prctl(ENABLE, which)
     if (can_enable_lam(which)) {
        mm->lam.cr3_mask = CR3_LAM(which);
        mm->lam.untag_mask = UNTAG_LAM(which);
        current->thread.lam_untag_mask = mm->lam.untag_mask;
     }

and

can_enable_lam(which)
    if (current_is_multithreaded())
    	return -ETOOLATE;
    if (current->mm->lam.cr3_mask)
    	return -EBUSY;
    ....
    	

Now vs. kernel threads. Doing this like the above is just the wrong
place. If a kernel thread accesses user space memory of a process then
it has to invoke kthread_use_mm(), right? So the obvious point to cache
that setting is in kthread_use_mm() and kthread_unuse_mm() clears it:

kthread_use_mm()
     current->thread.lam_untag_mask = mm->lam.untag_mask;

kthread_unuse_mm()
     current->thread.lam_untag_mask = 0;

This makes all of the mechanics trivial because CR3 switch then simply
does:

     new_cr3 |= mm->lam.cr3_mask;

No conditionals and evaluations, nothing. Just straight forward and
comprehensible code.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 08/10] x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive
  2022-05-11  2:27 ` [RFCv2 08/10] x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive Kirill A. Shutemov
@ 2022-05-12 13:36   ` Thomas Gleixner
  2022-05-13 23:22     ` Kirill A. Shutemov
  2022-05-18  8:43   ` Bharata B Rao
  1 sibling, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 13:36 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel,
	Kirill A. Shutemov

On Wed, May 11 2022 at 05:27, Kirill A. Shutemov wrote:
> LAM_U48 steals bits above 47-bit for tags and makes it impossible for
> userspace to use the full address space on a 5-level paging machine.

> Make these features mutually exclusive: whichever gets enabled first
> blocks the other one.

So this patch prevents a mapping above 47bit when LAM48 is enabled, but
I fail to spot how an already existing mapping above 47bit would prevent
LAM48 from being enabled.

Maybe I'm missing something which makes this magically mutually
exclusive.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86: Implement Linear Address Masking support
  2022-05-12 13:01   ` David Laight
@ 2022-05-12 14:07     ` Matthew Wilcox
  2022-05-12 15:06       ` Thomas Gleixner
  2022-05-12 14:35     ` Peter Zijlstra
  2022-05-12 17:00     ` Kirill A. Shutemov
  2 siblings, 1 reply; 82+ messages in thread
From: Matthew Wilcox @ 2022-05-12 14:07 UTC (permalink / raw)
  To: David Laight
  Cc: 'Kirill A. Shutemov',
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, H . J . Lu,
	Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

On Thu, May 12, 2022 at 01:01:07PM +0000, David Laight wrote:
> > +static inline int64_t sign_extend64(uint64_t value, int index)
> > +{
> > +    int shift = 63 - index;
> > +    return (int64_t)(value << shift) >> shift;
> > +}
> 
> Shifts of signed integers are UB.

Citation needed.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 09/10] x86/mm: Add userspace API to enable Linear Address Masking
  2022-05-11 14:15   ` H.J. Lu
@ 2022-05-12 14:21     ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 14:21 UTC (permalink / raw)
  To: H.J. Lu, Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Andrey Ryabinin, Alexander Potapenko,
	Dmitry Vyukov, Andi Kleen, Rick Edgecombe, Linux-MM, LKML

On Wed, May 11 2022 at 07:15, H. J. Lu wrote:
>> @@ -992,7 +994,9 @@ unsigned long __get_wchan(struct task_struct *p)
>>  static long thread_feature_prctl(struct task_struct *task, int option,
>>                                  unsigned long features)
>
> Since this arch_prctl will also be used for CET,  which supports
> 32-bit processes,
> shouldn't int, instead of long, be used?

This is kernel internal code and the compat syscall takes care of the
int to long conversion. So yes, that could use unsigned int, but it does
not matter.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 05/10] x86/mm: Provide untagged_addr() helper
  2022-05-12 13:06   ` Thomas Gleixner
@ 2022-05-12 14:23     ` Peter Zijlstra
  2022-05-12 15:16       ` Thomas Gleixner
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2022-05-12 14:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, x86,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, H . J . Lu,
	Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

On Thu, May 12, 2022 at 03:06:38PM +0200, Thomas Gleixner wrote:

> #define untagged_addr(addr)	({			\
> 	u64 __addr = (__force u64)(addr);		\
> 							\
> 	__addr &= current->thread.lam_untag_mask;	\
> 	(__force __typeof__(addr))__addr;		\
> })
> 
> No conditionals, fast _and_ correct. Setting this untag mask up once
> when LAM is enabled is not rocket science.

But that goes wrong if someone ever wants to untag a kernel address and
not use the result for access_ok().

I'd feel better about something like:

	s64 __addr = (addr);
	s64 __sign = __addr;

	__sign >>= 63;
	__sign &= lam_untag_mask;
	__addr &= lam_untag_mask;
	__addr |= __sign;

	__addr;

Which simply extends bit 63 downwards -- although possibly there's an
easier way to do that, this is pretty gross.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86: Implement Linear Address Masking support
  2022-05-12 13:01   ` David Laight
  2022-05-12 14:07     ` Matthew Wilcox
@ 2022-05-12 14:35     ` Peter Zijlstra
  2022-05-12 17:00     ` Kirill A. Shutemov
  2 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2022-05-12 14:35 UTC (permalink / raw)
  To: David Laight
  Cc: 'Kirill A. Shutemov',
	Dave Hansen, Andy Lutomirski, x86, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, H . J . Lu, Andi Kleen,
	Rick Edgecombe, linux-mm, linux-kernel

On Thu, May 12, 2022 at 01:01:07PM +0000, David Laight wrote:

> > +static inline int64_t sign_extend64(uint64_t value, int index)
> > +{
> > +    int shift = 63 - index;
> > +    return (int64_t)(value << shift) >> shift;
> > +}
> 
> Shifts of signed integers are UB.

The kernel uses -fno-strict-overflow, all the signed UB is gone.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 04/10] x86/mm: Introduce X86_THREAD_LAM_U48 and X86_THREAD_LAM_U57
  2022-05-12 12:24     ` Thomas Gleixner
@ 2022-05-12 14:37       ` Peter Zijlstra
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2022-05-12 14:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, x86,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, H . J . Lu,
	Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

On Thu, May 12, 2022 at 02:24:22PM +0200, Thomas Gleixner wrote:
> On Wed, May 11 2022 at 09:02, Peter Zijlstra wrote:
> > On Wed, May 11, 2022 at 05:27:45AM +0300, Kirill A. Shutemov wrote:
> >
> >> +#define LAM_NONE	0
> >> +#define LAM_U57		1
> >> +#define LAM_U48		2
> >
> >> +#define X86_THREAD_LAM_U48		0x1
> >> +#define X86_THREAD_LAM_U57		0x2
> >
> > Seriously pick an order and stick with it. I would suggest keeping the
> > hardware order and then you can do:
> >
> >> +static inline unsigned long lam_to_cr3(u8 lam)
> >> +{
> >
> > 	return (lam & 0x3) << X86_CR3_LAM_U57;
> 
> This "works" because the hardware ignores LAM_U48 if LAM_U57 is set, but
> I'd rather make that exclusive in the prctl() as setting both does not
> make any sense.

Yes, I was very much assuming the interface would not allow setting
both, that would be daft.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 09/10] x86/mm: Add userspace API to enable Linear Address Masking
  2022-05-11  7:26   ` Peter Zijlstra
@ 2022-05-12 14:46     ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 14:46 UTC (permalink / raw)
  To: Peter Zijlstra, Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Alexander Potapenko,
	Dmitry Vyukov, H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm,
	linux-kernel

On Wed, May 11 2022 at 09:26, Peter Zijlstra wrote:

> On Wed, May 11, 2022 at 05:27:50AM +0300, Kirill A. Shutemov wrote:
>> @@ -1013,8 +1017,23 @@ static long thread_feature_prctl(struct task_struct *task, int option,
>>  
>>  	/* Handle ARCH_THREAD_FEATURE_ENABLE */
>>  
>> +	if (features & (X86_THREAD_LAM_U48 | X86_THREAD_LAM_U57)) {
>> +		long ret;
>> +
>> +		/* LAM is only available in long mode */
>> +		if (in_32bit_syscall())
>> +			return -EINVAL;
>
> So what happens if userspace sets up a 32bit code entry in the LDT and
> does the LAM thing as a 64bit syscamm but then goes run 32bit code?

AFAICS, nothing happens. The only requirements are CR4.PAE = 1,
IA32_EFER.LME = 1. Those are unaffected from user space running 32bit
code, no?

32bit code can't use 64bit pointers so it can't have metadata bits
set. But x32 can and is excluded by the above too.

So the whole muck must be conditional on CONFIG_X86_64=y and does not
need any other restrictions IMO.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86: Implement Linear Address Masking support
  2022-05-12 14:07     ` Matthew Wilcox
@ 2022-05-12 15:06       ` Thomas Gleixner
  2022-05-12 15:33         ` David Laight
  0 siblings, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 15:06 UTC (permalink / raw)
  To: Matthew Wilcox, David Laight
  Cc: 'Kirill A. Shutemov',
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, H . J . Lu,
	Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

On Thu, May 12 2022 at 15:07, Matthew Wilcox wrote:
> On Thu, May 12, 2022 at 01:01:07PM +0000, David Laight wrote:
>> > +static inline int64_t sign_extend64(uint64_t value, int index)
>> > +{
>> > +    int shift = 63 - index;
>> > +    return (int64_t)(value << shift) >> shift;
>> > +}
>> 
>> Shifts of signed integers are UB.
>
> Citation needed.

I'll bite :)

C11/19: 6.5.7 Bitwise shift operators

  4 The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated
    bits are filled with zeros. If E1 has an unsigned type, the value of
    the result is E1 × 2^E2, reduced modulo one more than the maximum
    value representable in the result type. If E1 has a signed type and
    nonnegative value, and E1 × 2^E2 is representable in the result type,
    then that is the resulting value; otherwise, the behavior is
    undefined.

This is irrelevant for the case above because the left shift is on an
unsigned integer. The interesting part is this:

  5 The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1
    has an unsigned type or if E1 has a signed type and a nonnegative
    value, the value of the result is the integral part of the quotient
    of E1/2^E2.  If E1 has a signed type and a negative value, the
    resulting value is implementation-defined.

So it's not UB, it's implementation defined. The obvious choice is to
replicate the sign bit, i.e. arithmetic shift, which is what both GCC and
clang do.
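
To make that concrete, a small self-contained sketch of the sign-extension
idiom the QEMU patch builds on, relying on exactly that implementation-defined
arithmetic right shift (the example values are mine, picked to show a
LAM_U48-style tag being stripped):

#include <assert.h>
#include <stdint.h>

static inline int64_t sign_extend64(uint64_t value, int index)
{
        int shift = 63 - index;
        /* left shift on an unsigned type, then an arithmetic right shift
         * of the signed result replicates bit 'index' into the upper bits */
        return (int64_t)(value << shift) >> shift;
}

int main(void)
{
        /* metadata in bits 62:48 is stripped when bit 47 is clear */
        assert((uint64_t)sign_extend64(0x1234000012345678ULL, 47) ==
               0x0000000012345678ULL);
        /* with bit 47 set, the upper bits are filled with ones instead */
        assert((uint64_t)sign_extend64(0x0000800000000000ULL, 47) ==
               0xffff800000000000ULL);
        return 0;
}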

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 05/10] x86/mm: Provide untagged_addr() helper
  2022-05-12 14:23     ` Peter Zijlstra
@ 2022-05-12 15:16       ` Thomas Gleixner
  2022-05-12 23:14         ` Thomas Gleixner
  0 siblings, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, x86,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, H . J . Lu,
	Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

On Thu, May 12 2022 at 16:23, Peter Zijlstra wrote:
> On Thu, May 12, 2022 at 03:06:38PM +0200, Thomas Gleixner wrote:
>
>> #define untagged_addr(addr)	({			\
>> 	u64 __addr = (__force u64)(addr);		\
>> 							\
>> 	__addr &= current->thread.lam_untag_mask;	\
>> 	(__force __typeof__(addr))__addr;		\
>> })
>> 
>> No conditionals, fast _and_ correct. Setting this untag mask up once
>> when LAM is enabled is not rocket science.
>
> But that goes wrong if someone ever wants to untag a kernel address and
> not use the result for access_ok().
>
> I'd feel better about something like:
>
> 	s64 __addr = (addr);
> 	s64 __sign = __addr;
>
> 	__sign >>= 63;
> 	__sign &= lam_untag_mask;

that needs to be

 	__sign &= ~lam_untag_mask;

> 	__addr &= lam_untag_mask;
> 	__addr |= __sign;
>
> 	__addr;
>
> Which simply extends bit 63 downwards -- although possibly there's an
> easier way to do that, this is pretty gross.

For the price of a conditional:

    __addr &= lam_untag_mask;
    if (__addr & BIT(63))
        __addr |= ~lam_untag_mask;

Now you have the choice between gross and ugly.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [PATCH] x86: Implement Linear Address Masking support
  2022-05-12 15:06       ` Thomas Gleixner
@ 2022-05-12 15:33         ` David Laight
  0 siblings, 0 replies; 82+ messages in thread
From: David Laight @ 2022-05-12 15:33 UTC (permalink / raw)
  To: 'Thomas Gleixner', Matthew Wilcox
  Cc: 'Kirill A. Shutemov',
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, H . J . Lu,
	Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

From: Thomas Gleixner
> Sent: 12 May 2022 16:07
> 
> On Thu, May 12 2022 at 15:07, Matthew Wilcox wrote:
> > On Thu, May 12, 2022 at 01:01:07PM +0000, David Laight wrote:
> >> > +static inline int64_t sign_extend64(uint64_t value, int index)
> >> > +{
> >> > +    int shift = 63 - index;
> >> > +    return (int64_t)(value << shift) >> shift;
> >> > +}
> >>
> >> Shifts of signed integers are UB.
> >
> > Citation needed.
> 
> I'll bite :)
> 
> C11/19: 6.5.7 Bitwise shift operators
> 
>   4 The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated
>     bits are filled with zeros. If E1 has an unsigned type, the value of
>     the result is E1 × 2^E2, reduced modulo one more than the maximum
>     value representable in the result type. If E1 has a signed type and
>     nonnegative value, and E1 × 2^E2 is representable in the result type,
>     then that is the resulting value; otherwise, the behavior is
>     undefined.
> 
> This is irrelevant for the case above because the left shift is on an
> unsigned integer. The interesting part is this:
> 
>   5 The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1
>     has an unsigned type or if E1 has a signed type and a nonnegative
>     value, the value of the result is the integral part of the quotient
>     of E1/2^E2.  If E1 has a signed type and a negative value, the
>     resulting value is implementation-defined.
> 
> So it's not UB, it's implementation defined. The obvious choice is to
> replicate the sign bit, i.e. arithmetic shift, which is what both GCC and
> clang do.

I'm sure someone recently said one of the standards had made them UB.

In any case, given the caller seems to know whether the top bit is set
(and does a different call) using |= or &= is distinctly better.
Especially since the required constant can be computed in a slow path.

	David


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-11  6:49 ` [RFCv2 00/10] Linear Address Masking enabling Peter Zijlstra
@ 2022-05-12 15:42   ` Thomas Gleixner
  2022-05-12 16:56     ` Kirill A. Shutemov
  2022-05-12 17:22   ` Dave Hansen
  1 sibling, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 15:42 UTC (permalink / raw)
  To: Peter Zijlstra, Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, H . J . Lu, Andi Kleen,
	Rick Edgecombe, linux-mm, linux-kernel

On Wed, May 11 2022 at 08:49, Peter Zijlstra wrote:
> On Wed, May 11, 2022 at 05:27:40AM +0300, Kirill A. Shutemov wrote:
>> Hi all. Here's long overdue update on LAM enabling.
>> 
>> # Description #
>> 
>> Linear Address Masking[1] (LAM) modifies the checking that is applied to
>> 64-bit linear addresses, allowing software to use of the untranslated
>> address bits for metadata.
>> 
>> The patchset brings support for LAM for userspace addresses.
>> 
>> The most sensitive part of enabling is change in tlb.c, where CR3 flags
>> get set. Please take a look that what I'm doing makes sense.
>> 
>> The feature competes for bits with 5-level paging: LAM_U48 makes it
>> impossible to map anything about 47-bits. The patchset made these
>> capability mutually exclusive: whatever used first wins. LAM_U57 can be
>> combined with mappings above 47-bits.
>
> So aren't we creating a problem with LAM_U48 where programs relying on
> it are of limited sustainability?
>
> Any such program simply *cannot* run on 5 level pagetables. Why do we
> want to do this?

More bits are better :)

Seriously, I agree that restricting it to LAM57, which gives us 6 bits,
makes a lot of sense _and_ makes the whole thing way simpler.

So supporting both needs a truly good justification and a real world use
case.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-12 15:42   ` Thomas Gleixner
@ 2022-05-12 16:56     ` Kirill A. Shutemov
  2022-05-12 19:31       ` Thomas Gleixner
  0 siblings, 1 reply; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-12 16:56 UTC (permalink / raw)
  To: Thomas Gleixner, Dmitry Vyukov
  Cc: Peter Zijlstra, Dave Hansen, Andy Lutomirski, x86,
	Andrey Ryabinin, Alexander Potapenko, H . J . Lu, Andi Kleen,
	Rick Edgecombe, linux-mm, linux-kernel

On Thu, May 12, 2022 at 05:42:58PM +0200, Thomas Gleixner wrote:
> On Wed, May 11 2022 at 08:49, Peter Zijlstra wrote:
> > On Wed, May 11, 2022 at 05:27:40AM +0300, Kirill A. Shutemov wrote:
> >> Hi all. Here's long overdue update on LAM enabling.
> >> 
> >> # Description #
> >> 
> >> Linear Address Masking[1] (LAM) modifies the checking that is applied to
> >> 64-bit linear addresses, allowing software to use of the untranslated
> >> address bits for metadata.
> >> 
> >> The patchset brings support for LAM for userspace addresses.
> >> 
> >> The most sensitive part of enabling is change in tlb.c, where CR3 flags
> >> get set. Please take a look that what I'm doing makes sense.
> >> 
> >> The feature competes for bits with 5-level paging: LAM_U48 makes it
> >> impossible to map anything about 47-bits. The patchset made these
> >> capability mutually exclusive: whatever used first wins. LAM_U57 can be
> >> combined with mappings above 47-bits.
> >
> > So aren't we creating a problem with LAM_U48 where programs relying on
> > it are of limited sustainability?
> >
> > Any such program simply *cannot* run on 5 level pagetables. Why do we
> > want to do this?
> 
> More bits are better :)
> 
> Seriously, I agree that restricting it to LAM57, which gives us 6 bits,
> makes a lot of sense _and_ makes the whole thing way simpler.
> 
> So supporting both needs a truly good justification and a real world use
> case.

I asked the question before[1]. Basically, the more bits the better:

	For HWASAN #bits == detection probability.
	For MarkUS #bits == exponential cost reduction

I would really like to have only LAM_U57, but IIUC 6 bits is not always
enough.

Dmitry, could you elaborate?

[1] https://mobile.twitter.com/dvyukov/status/1342019823400837120
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH] x86: Implement Linear Address Masking support
  2022-05-12 13:01   ` David Laight
  2022-05-12 14:07     ` Matthew Wilcox
  2022-05-12 14:35     ` Peter Zijlstra
@ 2022-05-12 17:00     ` Kirill A. Shutemov
  2 siblings, 0 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-12 17:00 UTC (permalink / raw)
  To: David Laight
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, H . J . Lu,
	Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

On Thu, May 12, 2022 at 01:01:07PM +0000, David Laight wrote:
> From: Kirill A. Shutemov
> > Sent: 11 May 2022 03:28
> > 
> > Linear Address Masking feature makes CPU ignore some bits of the virtual
> > address. These bits can be used to encode metadata.
> > 
> > The feature is enumerated with CPUID.(EAX=07H, ECX=01H):EAX.LAM[bit 26].
> > 
> > CR3.LAM_U57[bit 62] allows to encode 6 bits of metadata in bits 62:57 of
> > user pointers.
> > 
> > CR3.LAM_U48[bit 61] allows to encode 15 bits of metadata in bits 62:48
> > of user pointers.
> > 
> > CR4.LAM_SUP[bit 28] allows to encode metadata of supervisor pointers.
> > If 5-level paging is in use, 6 bits of metadata can be encoded in 62:57.
> > For 4-level paging, 15 bits of metadata can be encoded in bits 62:48.
> > 
> ...
> > +static vaddr clean_addr(CPUArchState *env, vaddr addr)
> > +{
> > +    CPUClass *cc = CPU_GET_CLASS(env_cpu(env));
> > +
> > +    if (cc->tcg_ops->do_clean_addr) {
> > +        addr = cc->tcg_ops->do_clean_addr(env_cpu(env), addr);
> 
> The performance of a conditional indirect call will be horrid.
> Over-engineered when there is only one possible function.

It is a QEMU patch. As I mentioned in the cover letter, it was rejected by
upstream, but it is functional and can be used for testing before HW
arrives.

I may give it another try when I get time to look deeper at TCG.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-11  6:49 ` [RFCv2 00/10] Linear Address Masking enabling Peter Zijlstra
  2022-05-12 15:42   ` Thomas Gleixner
@ 2022-05-12 17:22   ` Dave Hansen
  2022-05-12 19:39     ` Thomas Gleixner
  1 sibling, 1 reply; 82+ messages in thread
From: Dave Hansen @ 2022-05-12 17:22 UTC (permalink / raw)
  To: Peter Zijlstra, Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, H . J . Lu, Andi Kleen,
	Rick Edgecombe, linux-mm, linux-kernel

On 5/10/22 23:49, Peter Zijlstra wrote:
>> The feature competes for bits with 5-level paging: LAM_U48 makes it
>> impossible to map anything about 47-bits. The patchset made these
>> capability mutually exclusive: whatever used first wins. LAM_U57 can be
>> combined with mappings above 47-bits.
> So aren't we creating a problem with LAM_U48 where programs relying on
> it are of limited sustainability?

I think allowing apps to say, "It's LAM_U48 or bust!" is a mistake.
It's OK for a debugging build that runs on one kind of hardware.  But,
if we want LAM-using binaries to be portable, we have to do something
different.

One of the stated reasons for adding LAM hardware is that folks want to
use sanitizers outside of debugging environments.  To me, that means
that LAM is something that the same binary might run with or without.

It's totally fine with me if the kernel only initially supports LAM_U57.
 But, I'd ideally like to make sure that the ABI can support LAM_U57,
LAM_U48, AMD's UAI (in whatever form it settles), or other masks.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-12 16:56     ` Kirill A. Shutemov
@ 2022-05-12 19:31       ` Thomas Gleixner
  2022-05-12 23:21         ` Thomas Gleixner
  0 siblings, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 19:31 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dmitry Vyukov
  Cc: Peter Zijlstra, Dave Hansen, Andy Lutomirski, x86,
	Alexander Potapenko, H . J . Lu, Andi Kleen, Rick Edgecombe,
	linux-mm, linux-kernel, Dmitry Vyukov

On Thu, May 12 2022 at 19:56, Kirill A. Shutemov wrote:
> On Thu, May 12, 2022 at 05:42:58PM +0200, Thomas Gleixner wrote:
>> On Wed, May 11 2022 at 08:49, Peter Zijlstra wrote:
>> > On Wed, May 11, 2022 at 05:27:40AM +0300, Kirill A. Shutemov wrote:
>> > So aren't we creating a problem with LAM_U48 where programs relying on
>> > it are of limited sustainability?
>> >
>> > Any such program simply *cannot* run on 5 level pagetables. Why do we
>> > want to do this?
>> 
>> More bits are better :)
>> 
>> Seriously, I agree that restricting it to LAM57, which gives us 6 bits,
>> makes a lot of sense _and_ makes the whole thing way simpler.
>> 
>> So supporting both needs a truly good justification and a real world use
>> case.
>
> I asked the question before[1]. Basically, more bits more better:
>
> 	For HWASAN #bits == detection probability.
> 	For MarkUS #bits == exponential cost reduction

What is MarkUS? It's not really helpful to provide acronyms which are
not decodable.

> I would really like to have only LAM_U57, but IIUC 6 bits is not always
> enough.
>
> Dmitry, could you elaborate?
>
> [1] https://mobile.twitter.com/dvyukov/status/1342019823400837120

I don't know whether he reacts to a link being posted to his twitter
account. I've CC'ed him now. Maybe that works better.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-12 17:22   ` Dave Hansen
@ 2022-05-12 19:39     ` Thomas Gleixner
  2022-05-12 21:24       ` Thomas Gleixner
  2022-05-12 21:51       ` Dave Hansen
  0 siblings, 2 replies; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 19:39 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Alexander Potapenko,
	Dmitry Vyukov, H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm,
	linux-kernel

On Thu, May 12 2022 at 10:22, Dave Hansen wrote:

> On 5/10/22 23:49, Peter Zijlstra wrote:
>>> The feature competes for bits with 5-level paging: LAM_U48 makes it
>>> impossible to map anything about 47-bits. The patchset made these
>>> capability mutually exclusive: whatever used first wins. LAM_U57 can be
>>> combined with mappings above 47-bits.
>> So aren't we creating a problem with LAM_U48 where programs relying on
>> it are of limited sustainability?
>
> I think allowing apps to say, "It's LAM_U48 or bust!" is a mistake.

That'd be outright stupid.

> It's OK for a debugging build that runs on one kind of hardware.  But,
> if we want LAM-using binaries to be portable, we have to do something
> different.
>
> One of the stated reasons for adding LAM hardware is that folks want to
> use sanitizers outside of debugging environments.  To me, that means
> that LAM is something that the same binary might run with or without.

On/off yes, but is there an actual use case where such a mechanism would
at start time dynamically choose the number of bits?

> It's totally fine with me if the kernel only initially supports LAM_U57.
>  But, I'd ideally like to make sure that the ABI can support LAM_U57,
> LAM_U48, AMD's UAI (in whatever form it settles), or other masks.

Sure. No argument here.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-12 19:39     ` Thomas Gleixner
@ 2022-05-12 21:24       ` Thomas Gleixner
  2022-05-13 14:43         ` Matthew Wilcox
  2022-05-13 22:59         ` Kirill A. Shutemov
  2022-05-12 21:51       ` Dave Hansen
  1 sibling, 2 replies; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 21:24 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Alexander Potapenko,
	Dmitry Vyukov, H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm,
	linux-kernel

On Thu, May 12 2022 at 21:39, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 10:22, Dave Hansen wrote:
>> One of the stated reasons for adding LAM hardware is that folks want to
>> use sanitizers outside of debugging environments.  To me, that means
>> that LAM is something that the same binary might run with or without.
>
> On/off yes, but is there an actual use case where such a mechanism would
> at start time dynamically chose the number of bits?

This would need cooperation from the application because it has to tell
the magic facility whether it intends to utilize the large VA space on a
5-level paging system or not.

I have no idea how that is supposed to work, but what do I know about
magic.

>> It's totally fine with me if the kernel only initially supports LAM_U57.
>>  But, I'd ideally like to make sure that the ABI can support LAM_U57,
>> LAM_U48, AMD's UAI (in whatever form it settles), or other masks.
>
> Sure. No argument here.

Assuming that the acronym of the day, which uses this, has a real benefit
from the larger number of bits, we can support it.

But we are not going to make this a per thread selectable thing.

It's a process-wide decision at startup simply because it does not buy
thread A anything to select U57 if thread B selects U48 before thread A
was able to map something into the U48-covered address space. Same issue
the other way round, as then B has to fall back to U57 or NONE. That does
not make any sense at all.

I'm all for flexible, but not just because we can. It has to make sense.

Making it process-wide and once at startup puts the 'complexity' into
the prctl(), but keeps the runtime overhead as small as it gets:

  - CR3 switching needs just an OR with mm->lam_cr3_mask (sketched below)

  - untagging needs one of the uglies Peter and I came up with

Making U48/U57 hardcoded would not buy much.
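
Purely as an illustration of that CR3 point (the names here are
assumptions, not the patchset's actual helpers): with a process-wide
mode, the per-switch cost boils down to OR-ing a cached per-mm value
into the new CR3 value.

	/* Minimal sketch: lam_cr3_mask caches the CR3 LAM_U48/LAM_U57 bit
	 * (or 0) once at enable time; the context switch just ORs it in. */
	static inline unsigned long build_cr3_with_lam(unsigned long cr3,
						       unsigned long lam_cr3_mask)
	{
		return cr3 | lam_cr3_mask;
	}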

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-12 19:39     ` Thomas Gleixner
  2022-05-12 21:24       ` Thomas Gleixner
@ 2022-05-12 21:51       ` Dave Hansen
  2022-05-12 22:10         ` H.J. Lu
  2022-05-13 11:07         ` Alexander Potapenko
  1 sibling, 2 replies; 82+ messages in thread
From: Dave Hansen @ 2022-05-12 21:51 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, x86, Alexander Potapenko,
	Dmitry Vyukov, H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm,
	linux-kernel

On 5/12/22 12:39, Thomas Gleixner wrote:
>> It's OK for a debugging build that runs on one kind of hardware.  But,
>> if we want LAM-using binaries to be portable, we have to do something
>> different.
>>
>> One of the stated reasons for adding LAM hardware is that folks want to
>> use sanitizers outside of debugging environments.  To me, that means
>> that LAM is something that the same binary might run with or without.
> On/off yes, but is there an actual use case where such a mechanism would
> at start time dynamically chose the number of bits?

I'd love to hear from folks doing the userspace side of this.  Will
userspace be saying: "Give me all the bits you can!".  Or, will it
really just be looking for 6 bits only, and it doesn't care whether it
gets 6 or 15, it will use only 6?

Do the sanitizers have more overhead with more bits?  Or *less* overhead
because they can store more metadata in the pointers?

Will anyone care about the difference between potentially missing 1/64
of issues with U57 versus 1/32768 with U48?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-12 21:51       ` Dave Hansen
@ 2022-05-12 22:10         ` H.J. Lu
  2022-05-12 23:35           ` Thomas Gleixner
  2022-05-13 11:07         ` Alexander Potapenko
  1 sibling, 1 reply; 82+ messages in thread
From: H.J. Lu @ 2022-05-12 22:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Peter Zijlstra, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, the arch/x86 maintainers, Alexander Potapenko,
	Dmitry Vyukov, Andi Kleen, Rick Edgecombe, Linux-MM, LKML

On Thu, May 12, 2022 at 2:51 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/12/22 12:39, Thomas Gleixner wrote:
> >> It's OK for a debugging build that runs on one kind of hardware.  But,
> >> if we want LAM-using binaries to be portable, we have to do something
> >> different.
> >>
> >> One of the stated reasons for adding LAM hardware is that folks want to
> >> use sanitizers outside of debugging environments.  To me, that means
> >> that LAM is something that the same binary might run with or without.
> > On/off yes, but is there an actual use case where such a mechanism would
> > at start time dynamically chose the number of bits?
>
> I'd love to hear from folks doing the userspace side of this.  Will
> userspace be saying: "Give me all the bits you can!".  Or, will it
> really just be looking for 6 bits only, and it doesn't care whether it
> gets 6 or 15, it will use only 6?
>
> Do the sanitizers have more overhead with more bits?  Or *less* overhead
> because they can store more metadata in the pointers?
>
> Will anyone care about the difference about potentially missing 1/64
> issues with U57 versus 1/32768 with U48?

The only LAM usage I know so far is LAM_U57 in HWASAN.   An application
can ask for LAM_U48 or LAM_U57.  But the decision should be made by the
application.  When an application asks for LAM_U57, I expect it will store
tags in the upper 6 bits, even if the kernel enables LAM_U48.

-- 
H.J.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 05/10] x86/mm: Provide untagged_addr() helper
  2022-05-12 15:16       ` Thomas Gleixner
@ 2022-05-12 23:14         ` Thomas Gleixner
  2022-05-13 10:14           ` David Laight
  0 siblings, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 23:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, x86,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, H . J . Lu,
	Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

On Thu, May 12 2022 at 17:16, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 16:23, Peter Zijlstra wrote:
>> On Thu, May 12, 2022 at 03:06:38PM +0200, Thomas Gleixner wrote:
>>
>>> #define untagged_addr(addr)	({			\
>>> 	u64 __addr = (__force u64)(addr);		\
>>> 							\
>>> 	__addr &= current->thread.lam_untag_mask;	\
>>> 	(__force __typeof__(addr))__addr;		\
>>> })
>>> 
>>> No conditionals, fast _and_ correct. Setting this untag mask up once
>>> when LAM is enabled is not rocket science.
>>
>> But that goes wrong if someone ever wants to untag a kernel address and
>> not use the result for access_ok().
>>
>> I'd feel better about something like:
>>
>> 	s64 __addr = (addr);
>> 	s64 __sign = __addr;
>>
>> 	__sign >>= 63;
>> 	__sign &= lam_untag_mask;
>
> that needs to be
>
>  	__sign &= ~lam_untag_mask;
>
>> 	__addr &= lam_untag_mask;
>> 	__addr |= __sign;
>>
>> 	__addr;
>>
>> Which simply extends bit 63 downwards -- although possibly there's an
>> easier way to do that, this is pretty gross.
>
> For the price of a conditional:
>
>     __addr &= lam_untag_mask;
>     if (__addr & BIT(63))
>         __addr |= ~lam_untag_mask;
>
> Now you have the choice between gross and ugly.

Though we can also replace your flavour of gross with a different
flavour of gross:

	s64 sign = (s64)(addr) >> 63;

	addr ^= sign;
	addr &= mask;
	addr ^= sign;

After twisting my brain around replacing gross by something differently
gross and coming up with the gem above, I actually did compile the
variants and discovered that GCC compiles your flavour of gross exactly
to this:

        mov    %rdi,%rax
        sar    $0x3f,%rax
        xor    %rax,%rdi
        and    %rsi,%rdi
        xor    %rdi,%rax

I have to admit that compilers are sometimes pretty smart. I might have
to rethink my prejudice. :)

But then clang converts your flavour of 'gross' to:

     	mov    %rsi,%rax
     	mov    %rsi,%rcx
     	and    %rdi,%rax
     	sar    $0x3f,%rdi
     	not    %rcx
     	and    %rdi,%rcx
     	or     %rcx,%rax

and my explicit flavour to:

      	mov    %rdi,%rax
      	mov    %rdi,%rcx
      	sar    $0x3f,%rcx
      	xor    %rcx,%rax
      	and    %rsi,%rax
      	xor    %rcx,%rax

which is at least slightly less clumsy, but still has a pointless mov
there. Note that this was compiled in user space with noinline
functions. I did some inlined variants as well and clang still insists
on using an extra register for no obvious reason. This might be more
efficient in reality, but I haven't bothered to write a test which
might give an answer via perf.

The ugly with the conditional resolves for both compilers to:

       	mov    %rsi,%rax
       	mov    %rsi,%rcx
       	not    %rcx
       	or     %rdi,%rcx
       	and    %rdi,%rax
       	test   %rdi,%rdi
       	cmovs  %rcx,%rax

At least they agree on that one.
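
Pulled together as a self-contained userspace sketch (illustration only,
not the kernel helper; for LAM_U57 the untag mask would be
~0x7e00000000000000ULL, i.e. bits 62:57 cleared), the xor/and/xor variant
looks like this:

	#include <stdint.h>

	/* Sign-preserving untagging: clears the tag bits for user pointers
	 * and sets them for kernel pointers, i.e. extends bit 63 downwards. */
	static inline uint64_t untag_addr(uint64_t addr, uint64_t untag_mask)
	{
		int64_t sign = (int64_t)addr >> 63;	/* 0 for user, ~0 for kernel */

		addr ^= sign;
		addr &= untag_mask;
		addr ^= sign;
		return addr;
	}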

But whatever we choose, it's sad that we need to have support for
interfaces which swallow any pointer (user or kernel), because otherwise
this really boils down to a single OR or AND operation plus the
corresponding mov to retrieve the mask.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-12 19:31       ` Thomas Gleixner
@ 2022-05-12 23:21         ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 23:21 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dmitry Vyukov
  Cc: Peter Zijlstra, Dave Hansen, Andy Lutomirski, x86,
	Alexander Potapenko, H . J . Lu, Andi Kleen, Rick Edgecombe,
	linux-mm, linux-kernel

On Thu, May 12 2022 at 21:31, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 19:56, Kirill A. Shutemov wrote:
>> On Thu, May 12, 2022 at 05:42:58PM +0200, Thomas Gleixner wrote:
>>> On Wed, May 11 2022 at 08:49, Peter Zijlstra wrote:
>>> > On Wed, May 11, 2022 at 05:27:40AM +0300, Kirill A. Shutemov wrote:
>>> > So aren't we creating a problem with LAM_U48 where programs relying on
>>> > it are of limited sustainability?
>>> >
>>> > Any such program simply *cannot* run on 5 level pagetables. Why do we
>>> > want to do this?
>>> 
>>> More bits are better :)
>>> 
>>> Seriously, I agree that restricting it to LAM57, which gives us 6 bits,
>>> makes a lot of sense _and_ makes the whole thing way simpler.
>>> 
>>> So supporting both needs a truly good justification and a real world use
>>> case.
>>
>> I asked the question before[1]. Basically, more bits more better:
>>
>> 	For HWASAN #bits == detection probability.
>> 	For MarkUS #bits == exponential cost reduction
>
> What is MarkUS? It's not really helpful to provide acronyms which are
> not decodable.
>
>> I would really like to have only LAM_U57, but IIUC 6 bits is not always
>> enough.
>>
>> Dmitry, could you elaborate?
>>
>> [1] https://mobile.twitter.com/dvyukov/status/1342019823400837120
>
> I don't know whether he reacts on posting a link to his twitter
> account. I've CC'ed him now. Maybe that works better.

Duh. I should have looked at 'To:' and not only at 'Cc:'

Maybe someday I get used to this email thing.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-12 22:10         ` H.J. Lu
@ 2022-05-12 23:35           ` Thomas Gleixner
  2022-05-13  0:08             ` H.J. Lu
  0 siblings, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-12 23:35 UTC (permalink / raw)
  To: H.J. Lu, Dave Hansen
  Cc: Peter Zijlstra, Kirill A. Shutemov, Dave Hansen, Andy Lutomirski,
	the arch/x86 maintainers, Alexander Potapenko, Dmitry Vyukov,
	Andi Kleen, Rick Edgecombe, Linux-MM, LKML

On Thu, May 12 2022 at 15:10, H. J. Lu wrote:
> On Thu, May 12, 2022 at 2:51 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 5/12/22 12:39, Thomas Gleixner wrote:
>> >> It's OK for a debugging build that runs on one kind of hardware.  But,
>> >> if we want LAM-using binaries to be portable, we have to do something
>> >> different.
>> >>
>> >> One of the stated reasons for adding LAM hardware is that folks want to
>> >> use sanitizers outside of debugging environments.  To me, that means
>> >> that LAM is something that the same binary might run with or without.
>> > On/off yes, but is there an actual use case where such a mechanism would
>> > at start time dynamically chose the number of bits?
>>
>> I'd love to hear from folks doing the userspace side of this.  Will
>> userspace be saying: "Give me all the bits you can!".  Or, will it
>> really just be looking for 6 bits only, and it doesn't care whether it
>> gets 6 or 15, it will use only 6?
>>
>> Do the sanitizers have more overhead with more bits?  Or *less* overhead
>> because they can store more metadata in the pointers?
>>
>> Will anyone care about the difference about potentially missing 1/64
>> issues with U57 versus 1/32768 with U48?
>
> The only LAM usage I know so far is LAM_U57 in HWASAN.

That's at least a halfway useful answer.

> An application can ask for LAM_U48 or LAM_U57. But the decision should
> be made by application.

It can ask for whatever, but the decision whether it's granted is made
by the kernel for obvious reasons.

> When an application asks for LAM_U57, I expect it will store tags in
> upper 6 bits, even if the kernel enables LAM_U48.

The kernel does not enable LAM_U48 when the application only wants to
have LAM_U57, because that would restrict the address space of the
application to 47 bits on 5-level capable system for no reason.

So what are you trying to tell me?

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-12 23:35           ` Thomas Gleixner
@ 2022-05-13  0:08             ` H.J. Lu
  2022-05-13  0:46               ` Dave Hansen
  2022-05-13  0:46               ` Thomas Gleixner
  0 siblings, 2 replies; 82+ messages in thread
From: H.J. Lu @ 2022-05-13  0:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Peter Zijlstra, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, the arch/x86 maintainers, Alexander Potapenko,
	Dmitry Vyukov, Andi Kleen, Rick Edgecombe, Linux-MM, LKML

On Thu, May 12, 2022 at 4:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, May 12 2022 at 15:10, H. J. Lu wrote:
> > On Thu, May 12, 2022 at 2:51 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >> On 5/12/22 12:39, Thomas Gleixner wrote:
> >> >> It's OK for a debugging build that runs on one kind of hardware.  But,
> >> >> if we want LAM-using binaries to be portable, we have to do something
> >> >> different.
> >> >>
> >> >> One of the stated reasons for adding LAM hardware is that folks want to
> >> >> use sanitizers outside of debugging environments.  To me, that means
> >> >> that LAM is something that the same binary might run with or without.
> >> > On/off yes, but is there an actual use case where such a mechanism would
> >> > at start time dynamically chose the number of bits?
> >>
> >> I'd love to hear from folks doing the userspace side of this.  Will
> >> userspace be saying: "Give me all the bits you can!".  Or, will it
> >> really just be looking for 6 bits only, and it doesn't care whether it
> >> gets 6 or 15, it will use only 6?
> >>
> >> Do the sanitizers have more overhead with more bits?  Or *less* overhead
> >> because they can store more metadata in the pointers?
> >>
> >> Will anyone care about the difference about potentially missing 1/64
> >> issues with U57 versus 1/32768 with U48?
> >
> > The only LAM usage I know so far is LAM_U57 in HWASAN.
>
> That's at least a halfways useful answer.
>
> > An application can ask for LAM_U48 or LAM_U57. But the decision should
> > be made by application.
>
> It can ask for whatever, but the decision whether it's granted is made
> by the kernel for obvious reasons.
>
> > When an application asks for LAM_U57, I expect it will store tags in
> > upper 6 bits, even if the kernel enables LAM_U48.
>
> The kernel does not enable LAM_U48 when the application only wants to
> have LAM_U57, because that would restrict the address space of the
> application to 47 bits on 5-level capable system for no reason.
>
> So what are you trying to tell me?
>

I am expecting applications to ask for LAM_U48 or LAM_U57, not just
LAM.

-- 
H.J.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13  0:08             ` H.J. Lu
@ 2022-05-13  0:46               ` Dave Hansen
  2022-05-13  1:27                 ` Thomas Gleixner
  2022-05-13  0:46               ` Thomas Gleixner
  1 sibling, 1 reply; 82+ messages in thread
From: Dave Hansen @ 2022-05-13  0:46 UTC (permalink / raw)
  To: H.J. Lu, Thomas Gleixner
  Cc: Peter Zijlstra, Kirill A. Shutemov, Dave Hansen, Andy Lutomirski,
	the arch/x86 maintainers, Alexander Potapenko, Dmitry Vyukov,
	Andi Kleen, Rick Edgecombe, Linux-MM, LKML

On 5/12/22 17:08, H.J. Lu wrote:
> I am expecting applications to ask for LAM_U48 or LAM_U57, not just
> LAM.

If AMD comes along with UAI that doesn't match LAM_U48 or LAM_U57, apps
will specifically be coded to ask for one of the three?  That seems like
an awfully rigid ABI.

That also seems like a surefire way to have non-portable users of this
feature.  It basically guarantees that userspace code will look like this:

	if (support_lam_57()) {
		sys_enable_masking(LAM_57);
		mask = LAM_57_MASK;
	} else if (support_lam_48()) {
		sys_enable_masking(LAM_48);
		mask = LAM_48_MASK;
	} else if (...)
		... others

Which is *ENTIRELY* non-portable and needs to get patched if anything
changes in the slightest.  Whereas, if we move that logic into the kernel,
it's something more like:

	mask = sys_enable_masking(...);
	if (bitmap_weight(&mask) < MINIMUM_BITS)
		goto whoops;

That actually works for all underlying implementations and doesn't
hard-code any assumptions about the implementation other than a basic
sanity check.

There are three choices we'd have to make for a more generic ABI that I
can think of:

ABI Question #1:

Should userspace be asking the kernel for a specific type of masking,
like a number of bits to mask or a mask itself?  If not, the enabling
syscall is dirt simple: it's "mask = sys_enable_masking()".  The kernel
picks what it wants to mask unilaterally and just tells userspace.

ABI Question #2:

Assuming that userspace is asking for a specific kind of address
masking: Should that request be made in terms of an actual mask or a
number of bits?  For instance, if userspace asks for 0xf000000000000000,
it would fit UAI or ARM TBI.  If it asks for 0x7e00000000000000, it
would match LAM_U57 behavior.

Or, does userspace ask for "8 bits", or "6 bits" or "15 bits"?

ABI Question #3:

If userspace asks for something that the kernel can't satisfy exactly,
like "8 bits" on a LAM system, is it OK for the kernel to fall back to
the next-largest mask?  For instance, with sys_enable_masking(bits=8), could
the kernel unilaterally return a LAM_U48 mask, because LAM_U48 supports
15 > 8 bits?  Or, could this "fuzzy" behavior be an opt-in?

If I had to take a shot at this today, I think I'd opt for:

	mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);

although I'm not super confident about the "fuzzy" flag.  I also don't
think I'd totally hate the "blind" interface where the kernel just gets
to pick unilaterally and takes zero input from userspace.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13  0:08             ` H.J. Lu
  2022-05-13  0:46               ` Dave Hansen
@ 2022-05-13  0:46               ` Thomas Gleixner
  1 sibling, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-13  0:46 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Dave Hansen, Peter Zijlstra, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, the arch/x86 maintainers, Alexander Potapenko,
	Dmitry Vyukov, Andi Kleen, Rick Edgecombe, Linux-MM, LKML

On Thu, May 12 2022 at 17:08, H. J. Lu wrote:
> On Thu, May 12, 2022 at 4:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > When an application asks for LAM_U57, I expect it will store tags in
>> > upper 6 bits, even if the kernel enables LAM_U48.
>>
>> The kernel does not enable LAM_U48 when the application only wants to
>> have LAM_U57, because that would restrict the address space of the
>> application to 47 bits on 5-level capable system for no reason.
>>
>> So what are you trying to tell me?
>>
> I am expecting applications to ask for LAM_U48 or LAM_U57, not just
> LAM.

That still does not tell me anything.

You might as well tell me that this will depend on the moon phase. That
would at least be more telling.

If we commit to an ABI, which we have to support forever, then we want
factual arguments, not expectations.

The fact that hardware supports 5 variants does not mean that all of
them make sense to support in the OS.

Quite the contrary, 99% of all 'flexible' hardware features are based on
bogus assumptions. The worst of these assumptions is:

      'We can handle this in software'

Sure, most of the trainwrecks hardware people provide can be 'handled'
in software, but you have to consider the price for doing so.

   The price is that we accumulate technical debt.

You are well aware of this as you have contributed your share of
technical debt by cramming unsupportable crap into user space projects
prematurely.

So can you please take a step back and seriously think about the
semantics and long term consequences instead of providing handwaving
expectations which are biased by uninformed wishful thinking, AKA
marketing?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13  0:46               ` Dave Hansen
@ 2022-05-13  1:27                 ` Thomas Gleixner
  2022-05-13  3:05                   ` Dave Hansen
  2022-05-13  9:14                   ` Catalin Marinas
  0 siblings, 2 replies; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-13  1:27 UTC (permalink / raw)
  To: Dave Hansen, H.J. Lu
  Cc: Peter Zijlstra, Kirill A. Shutemov, Dave Hansen, Andy Lutomirski,
	the arch/x86 maintainers, Alexander Potapenko, Dmitry Vyukov,
	Andi Kleen, Rick Edgecombe, Linux-MM, Catalin Marinas, LKML

On Thu, May 12 2022 at 17:46, Dave Hansen wrote:
> On 5/12/22 17:08, H.J. Lu wrote:
> If I had to take a shot at this today, I think I'd opt for:
>
> 	mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
>
> although I'm not super confident about the "fuzzy" flag.  I also don't
> think I'd totally hate the "blind" interface where the kernel just gets
> to pick unilaterally and takes zero input from userspace.

That's the only sane choice and you can make it simple for userspace:

       ret = prctl(GET_XXX_MASK, &mask);

and then let it decide based on @ret and @mask whether to use it or not.
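
A hedged sketch of how an application could consume such an interface
(GET_XXX_MASK is the placeholder name from above, not an existing prctl
option, and the "at least 6 bits" policy is just an example):

	#include <stdio.h>
	#include <sys/prctl.h>

	#define GET_XXX_MASK	0x1000	/* placeholder value, not a real option */

	int main(void)
	{
		unsigned long mask = 0;
		int ret = prctl(GET_XXX_MASK, (unsigned long)&mask, 0, 0, 0);

		if (ret == 0 && __builtin_popcountl(mask) >= 6)
			printf("tag bits available: mask=%#lx\n", mask);
		else
			printf("no usable tag bits, run untagged\n");
		return 0;
	}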

But of course nobody thought about this as a generic feature and so we
have the ARM64 TBI muck as a precedent.

So much for coordination and portability...

I'm so tired of this short sighted 'cram my feature in' approach of
_all_ involved parties.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13  1:27                 ` Thomas Gleixner
@ 2022-05-13  3:05                   ` Dave Hansen
  2022-05-13  8:28                     ` Thomas Gleixner
  2022-05-13 22:48                     ` Kirill A. Shutemov
  2022-05-13  9:14                   ` Catalin Marinas
  1 sibling, 2 replies; 82+ messages in thread
From: Dave Hansen @ 2022-05-13  3:05 UTC (permalink / raw)
  To: Thomas Gleixner, H.J. Lu
  Cc: Peter Zijlstra, Kirill A. Shutemov, Dave Hansen, Andy Lutomirski,
	the arch/x86 maintainers, Alexander Potapenko, Dmitry Vyukov,
	Andi Kleen, Rick Edgecombe, Linux-MM, Catalin Marinas, LKML

On 5/12/22 18:27, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 17:46, Dave Hansen wrote:
>> On 5/12/22 17:08, H.J. Lu wrote:
>> If I had to take a shot at this today, I think I'd opt for:
>>
>> 	mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
>>
>> although I'm not super confident about the "fuzzy" flag.  I also don't
>> think I'd totally hate the "blind" interface where the kernel just gets
>> to pick unilaterally and takes zero input from userspace.
> That's the only sane choice and you can make it simple for userspace:
> 
>        ret = prctl(GET_XXX_MASK, &mask);
> 
> and then let it decide based on @ret and @mask whether to use it or not.
> 
> But of course nobody thought about this as a generic feature and so we
> have the ARM64 TBI muck as a precedence.

Well, not quite *nobody*:

 https://lore.kernel.org/linux-arm-kernel/7a34470c-73f0-26ac-e63d-161191d4b1e4@intel.com/


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13  3:05                   ` Dave Hansen
@ 2022-05-13  8:28                     ` Thomas Gleixner
  2022-05-13 22:48                     ` Kirill A. Shutemov
  1 sibling, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-13  8:28 UTC (permalink / raw)
  To: Dave Hansen, H.J. Lu
  Cc: Peter Zijlstra, Kirill A. Shutemov, Dave Hansen, Andy Lutomirski,
	the arch/x86 maintainers, Alexander Potapenko, Dmitry Vyukov,
	Andi Kleen, Rick Edgecombe, Linux-MM, Catalin Marinas, LKML

On Thu, May 12 2022 at 20:05, Dave Hansen wrote:

> On 5/12/22 18:27, Thomas Gleixner wrote:
>> On Thu, May 12 2022 at 17:46, Dave Hansen wrote:
>>> On 5/12/22 17:08, H.J. Lu wrote:
>>> If I had to take a shot at this today, I think I'd opt for:
>>>
>>> 	mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
>>>
>>> although I'm not super confident about the "fuzzy" flag.  I also don't
>>> think I'd totally hate the "blind" interface where the kernel just gets
>>> to pick unilaterally and takes zero input from userspace.
>> That's the only sane choice and you can make it simple for userspace:
>> 
>>        ret = prctl(GET_XXX_MASK, &mask);
>> 
>> and then let it decide based on @ret and @mask whether to use it or not.
>> 
>> But of course nobody thought about this as a generic feature and so we
>> have the ARM64 TBI muck as a precedence.
>
> Well, not quite *nobody*:
>
>  https://lore.kernel.org/linux-arm-kernel/7a34470c-73f0-26ac-e63d-161191d4b1e4@intel.com/

Sigh....

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13  1:27                 ` Thomas Gleixner
  2022-05-13  3:05                   ` Dave Hansen
@ 2022-05-13  9:14                   ` Catalin Marinas
  2022-05-13  9:26                     ` Thomas Gleixner
  1 sibling, 1 reply; 82+ messages in thread
From: Catalin Marinas @ 2022-05-13  9:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, H.J. Lu, Peter Zijlstra, Kirill A. Shutemov,
	Dave Hansen, Andy Lutomirski, the arch/x86 maintainers,
	Alexander Potapenko, Dmitry Vyukov, Andi Kleen, Rick Edgecombe,
	Linux-MM, LKML

On Fri, May 13, 2022 at 03:27:24AM +0200, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 17:46, Dave Hansen wrote:
> > On 5/12/22 17:08, H.J. Lu wrote:
> > If I had to take a shot at this today, I think I'd opt for:
> >
> > 	mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
> >
> > although I'm not super confident about the "fuzzy" flag.  I also don't
> > think I'd totally hate the "blind" interface where the kernel just gets
> > to pick unilaterally and takes zero input from userspace.
> 
> That's the only sane choice and you can make it simple for userspace:
> 
>        ret = prctl(GET_XXX_MASK, &mask);
> 
> and then let it decide based on @ret and @mask whether to use it or not.

Getting the mask would work for arm64 as well (it's always 0xffUL << 56,
top-byte-ignore). Setting the mask from user space won't be of any use
to us, it's baked in hardware.

> But of course nobody thought about this as a generic feature and so we
> have the ARM64 TBI muck as a precedence.
> 
> So much for coordination and portability...

Well, we have had TBI in the architecture, enabled for user-space, since
the first arm64 kernel port (2012), with no ABI controls needed. It had some
specific uses like JITs to avoid masking out type bits encoded in
pointers.

In 2019 sanitisers appeared and we relaxed the TBI at the syscall level
but, to avoid potentially confusing some programs, we added a control
which only changes the behaviour of access_ok(). More of a safety thing,
we might as well have skipped it. There is no hardware configuration
toggled by this control, nor is the user address space layout affected (max
52-bit on arm64). Since sanitisers require compiler instrumentation (or,
with MTE, arm64-specific libc changes), it's pretty much all within the
arm64-specific user codebase. MTE came along and we added some more bits
on top which, again, are hardware specific and contained within the
arm64 libc startup code (tag checking modes etc).

Dave indeed mentioned passing a mask to allow a more flexible control
but, as already mentioned in the old thread, for arm64 the feature was
already on, so it didn't make much sense and seemed more like
over-engineering. Had we known that Intel was pursuing something similar,
maybe we'd have designed the interface differently (we didn't get the
hint).

Intel's LAM has more flexibility but I don't see the arm64 TBI getting
in the way. Just don't use it as an example because they evolved in
different ways. I'm happy for arm64 to adopt a more flexible interface
while keeping the current one around for backwards compatibility. But
on arm64 we can't control the masking, not even disable it per process
since it has always been on.

> I'm so tired of this short sighted 'cram my feature in' approach of
> _all_ involved parties.

Unfortunately it happens occasionally, especially when developers can't
disclose that their companies work on similar features (resctrl is a
good example where arm64 would have benefited from a more generic
approach but at the time MPAM was not public).

-- 
Catalin

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13  9:14                   ` Catalin Marinas
@ 2022-05-13  9:26                     ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-13  9:26 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Dave Hansen, H.J. Lu, Peter Zijlstra, Kirill A. Shutemov,
	Dave Hansen, Andy Lutomirski, the arch/x86 maintainers,
	Alexander Potapenko, Dmitry Vyukov, Andi Kleen, Rick Edgecombe,
	Linux-MM, LKML

On Fri, May 13 2022 at 10:14, Catalin Marinas wrote:
> On Fri, May 13, 2022 at 03:27:24AM +0200, Thomas Gleixner wrote:
>> On Thu, May 12 2022 at 17:46, Dave Hansen wrote:
>> > On 5/12/22 17:08, H.J. Lu wrote:
>> > If I had to take a shot at this today, I think I'd opt for:
>> >
>> > 	mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
>> >
>> > although I'm not super confident about the "fuzzy" flag.  I also don't
>> > think I'd totally hate the "blind" interface where the kernel just gets
>> > to pick unilaterally and takes zero input from userspace.
>> 
>> That's the only sane choice and you can make it simple for userspace:
>> 
>>        ret = prctl(GET_XXX_MASK, &mask);
>> 
>> and then let it decide based on @ret and @mask whether to use it or not.
>
> Getting the mask would work for arm64 as well (it's always 0xffUL << 56,
> top-byte-ignore). Setting the mask from user space won't be of any use
> to us, it's baked in hardware.

Sure.

> Dave indeed mentioned passing a mask to allow a more flexible control
> but, as already mentioned in the old thread, for arm64 the feature was
> already on, so it didn't make much sense, it seemed more like
> over-engineering. Had we known that Intel is pursing something similar,
> maybe we'd have designed the interface differently (we didn't get the
> hint).

Fair enough

> Intel's LAM has more flexibility but I don't see the arm64 TBI getting
> in the way. Just don't use it as an example because they evolved in
> different ways. I'm happy for arm64 to adopt a more flexible interface
> while keeping the current one around for backwards compatibility). But
> on arm64 we can't control the masking, not even disable it per process
> since it has always been on.

That's fine. The point is that we want uniform interfaces for the same
functionality. It's obviously hardware specific which subset of the
interface is supported.

>> I'm so tired of this short sighted 'cram my feature in' approach of
>> _all_ involved parties.
>
> Unfortunately it happens occasionally, especially when developers can't
> disclose that their companies work on similar features (resctrl is a
> good example where arm64 would have benefited from a more generic
> approach but at the time MPAM was not public).

Yeah. It's a constant pain.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [RFCv2 05/10] x86/mm: Provide untagged_addr() helper
  2022-05-12 23:14         ` Thomas Gleixner
@ 2022-05-13 10:14           ` David Laight
  0 siblings, 0 replies; 82+ messages in thread
From: David Laight @ 2022-05-13 10:14 UTC (permalink / raw)
  To: 'Thomas Gleixner', Peter Zijlstra
  Cc: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, x86,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, H . J . Lu,
	Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

From: Thomas Gleixner
> Sent: 13 May 2022 00:15
...
> But whatever we chose, it's sad, that we need to have support for
> interfaces which swallow any pointer (user or kernel) because otherwise
> this really boils down to a single OR resp. AND operation plus the
> according mov to retrieve the mask.

Are there any of those left?
Most will have gone with the set_fs(KERNEL_DS) removal.
Almost all code has to know whether an address is user
or kernel - the value can't be used because of architectures
that use the same address values in user and kernel.

How often do addresses actually need de-tagging?
Probably only code that is looking for page table
entries for virtual addresses?
How often does that happen for user addresses?

If the hardware is ignoring the bits then you don't
need to remove them before memory accesses.
That would include all userspace accesses.
Clearly access_ok() has to work with tagged addresses,
but that doesn't require the tag be masked off.
It can just check the transfer doesn't cross 1u<<63.
It (probably) just requires the fault handler to treat
non-canonical address faults as page faults.
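
Read literally, that crossing check could look like this (an illustrative
sketch only; it ignores overflow and everything else a real access_ok()
has to do):

	#include <stdbool.h>
	#include <stdint.h>

	/* True if [addr, addr + size) crosses the 1ULL << 63 boundary. */
	static inline bool crosses_bit63(uint64_t addr, uint64_t size)
	{
		const uint64_t bit63 = 1ULL << 63;

		return addr < bit63 && addr + size > bit63;
	}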

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-12 21:51       ` Dave Hansen
  2022-05-12 22:10         ` H.J. Lu
@ 2022-05-13 11:07         ` Alexander Potapenko
  2022-05-13 11:28           ` David Laight
  2022-05-13 23:01           ` Kirill A. Shutemov
  1 sibling, 2 replies; 82+ messages in thread
From: Alexander Potapenko @ 2022-05-13 11:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Peter Zijlstra, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, the arch/x86 maintainers, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe,
	Linux Memory Management List, LKML

On Thu, May 12, 2022 at 11:51 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/12/22 12:39, Thomas Gleixner wrote:
> >> It's OK for a debugging build that runs on one kind of hardware.  But,
> >> if we want LAM-using binaries to be portable, we have to do something
> >> different.
> >>
> >> One of the stated reasons for adding LAM hardware is that folks want to
> >> use sanitizers outside of debugging environments.  To me, that means
> >> that LAM is something that the same binary might run with or without.
> > On/off yes, but is there an actual use case where such a mechanism would
> > at start time dynamically chose the number of bits?
>
> I'd love to hear from folks doing the userspace side of this.  Will
> userspace be saying: "Give me all the bits you can!".  Or, will it
> really just be looking for 6 bits only, and it doesn't care whether it
> gets 6 or 15, it will use only 6?

(speaking more or less on behalf of the userspace folks here)
I think it is safe to assume that in the upcoming year or two HWASan
will be fine having just 6 bits for the tags on x86 machines.
We are interested in running it on kernels with and without
CONFIG_X86_5LEVEL=y, so U48 is not an option in some cases anyway.

> Do the sanitizers have more overhead with more bits?  Or *less* overhead
> because they can store more metadata in the pointers?
Once we have the possibility to store tags in the pointers, we don't
need redzones for heap/stack objects anymore, which saves quite a bit
of memory.
Also, HWASan doesn't use quarantine and has smaller shadow memory size
(see https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
for more details).
Having more bits increases the probability of detecting a UAF or buffer
overflow and reduces the shadow memory size further, but that one is
small enough already.

> Will anyone care about the difference about potentially missing 1/64
> issues with U57 versus 1/32768 with U48?
I don't think anyone will.

Having said that, I agree with Dave that it would be nice to have an
interface that would just request the mask from the system.
That way we could have support for U57 in the kernel now and keep the
possibility to add U48 in the future without breaking existing users.

I also may be missing something obvious, but I can't come up with a
case where different apps in the system may request U48 and U57 at the
same time.
It seems natural to me to let the OS decide which of the modes is
supported and give the app the freedom to use it or lose it.

--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg


^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13 11:07         ` Alexander Potapenko
@ 2022-05-13 11:28           ` David Laight
  2022-05-13 12:26             ` Alexander Potapenko
  2022-05-13 23:01           ` Kirill A. Shutemov
  1 sibling, 1 reply; 82+ messages in thread
From: David Laight @ 2022-05-13 11:28 UTC (permalink / raw)
  To: 'Alexander Potapenko', Dave Hansen
  Cc: Thomas Gleixner, Peter Zijlstra, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, the arch/x86 maintainers, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe,
	Linux Memory Management List, LKML

...
> Once we have the possibility to store tags in the pointers, we don't
> need redzones for heap/stack objects anymore, which saves quite a bit
> of memory.

You still need redzones.
The high bits are ignored for actual memory accesses.

To do otherwise you'd need the high bits to be in the PTE,
copied to the TLB and finally get into the cache tag.

Then you'd have to use the correct tags for each page.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13 11:28           ` David Laight
@ 2022-05-13 12:26             ` Alexander Potapenko
  2022-05-13 14:26               ` David Laight
  0 siblings, 1 reply; 82+ messages in thread
From: Alexander Potapenko @ 2022-05-13 12:26 UTC (permalink / raw)
  To: David Laight
  Cc: Dave Hansen, Thomas Gleixner, Peter Zijlstra, Kirill A. Shutemov,
	Dave Hansen, Andy Lutomirski, the arch/x86 maintainers,
	Dmitry Vyukov, H . J . Lu, Andi Kleen, Rick Edgecombe,
	Linux Memory Management List, LKML

On Fri, May 13, 2022 at 1:28 PM David Laight <David.Laight@aculab.com> wrote:
>
> ...
> > Once we have the possibility to store tags in the pointers, we don't
> > need redzones for heap/stack objects anymore, which saves quite a bit
> > of memory.
>
> You still need redzones.
> The high bits are ignored for actual memory accesses.
>
> To do otherwise you'd need the high bits to be in the PTE,
> copied to the TLB and finally get into the cache tag.
>
> Then you'd have to use the correct tags for each page.

Sorry, I don't understand how this is relevant to HWASan in the userspace.
Like in ASan, we have a custom allocator that assigns tags to heap
objects. The assigned tag is stored in both the shadow memory for the
object and the pointer returned by the allocator.
Instrumentation inserted by the compiler checks the pointer before
every memory access and ensures that its tag matches the tag of the
object in the shadow memory.
A tag mismatch is reported as an out-of-bounds or a use-after-free,
depending on whether the accessed memory is still considered
allocated.
Because objects with different tags follow each other, there is no
need to add extra redzones to the objects to detect buffer overflows.
(We might need to increase the object alignment though, but that's a
different story).
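
A rough, self-contained illustration of that scheme (not the actual
HWASan runtime or allocator; the tag placement, granule size and toy heap
below are assumptions made up for this sketch):

	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define TAG_SHIFT 57			/* LAM_U57-style: tag in bits 62:57 */
	#define TAG_MASK  0x3fULL
	#define GRANULE   16			/* toy shadow granularity */

	static unsigned char heap[1024];			/* toy heap */
	static unsigned char shadow[sizeof(heap) / GRANULE];	/* one tag per granule */

	/* "Allocator": record the tag in shadow and in the pointer's top bits. */
	static void *toy_alloc(size_t off, size_t size, unsigned char tag)
	{
		for (size_t i = off / GRANULE; i < (off + size) / GRANULE; i++)
			shadow[i] = tag;
		return (void *)((uintptr_t)&heap[off] | ((uintptr_t)tag << TAG_SHIFT));
	}

	/* Compiler-inserted check: the pointer tag must match the shadow tag. */
	static void check_access(void *p)
	{
		uintptr_t addr = (uintptr_t)p;
		unsigned char ptag = (addr >> TAG_SHIFT) & TAG_MASK;
		uintptr_t untagged = addr & ~(TAG_MASK << TAG_SHIFT);
		unsigned char stag = shadow[(untagged - (uintptr_t)heap) / GRANULE];

		if (ptag != stag) {
			fprintf(stderr, "tag mismatch: out-of-bounds or use-after-free\n");
			exit(1);
		}
	}

	int main(void)
	{
		void *a = toy_alloc(0, 32, 0x11);
		void *b = toy_alloc(32, 32, 0x22);

		check_access(a);			/* tags match */
		check_access((char *)a + 32);		/* runs into b's granules: mismatch */
		(void)b;
		return 0;
	}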

>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)



-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [tip: x86/cleanups] x86/prctl: Remove pointless task argument
  2022-05-12 12:04     ` [PATCH] x86/prctl: Remove pointless task argument Thomas Gleixner
@ 2022-05-13 12:30       ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-05-13 12:30 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID:     f5c0b4f30416c670408a77be94703d04d22b57df
Gitweb:        https://git.kernel.org/tip/f5c0b4f30416c670408a77be94703d04d22b57df
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 12 May 2022 14:04:08 +02:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Fri, 13 May 2022 12:56:28 +02:00

x86/prctl: Remove pointless task argument

The functions invoked via do_arch_prctl_common() can only operate on
the current task and none of these function uses the task argument.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/87lev7vtxj.ffs@tglx
---
 arch/x86/include/asm/fpu/api.h |  3 +--
 arch/x86/include/asm/proto.h   |  3 +--
 arch/x86/kernel/fpu/xstate.c   |  5 +----
 arch/x86/kernel/process.c      |  9 ++++-----
 arch/x86/kernel/process_32.c   |  2 +-
 arch/x86/kernel/process_64.c   |  4 ++--
 6 files changed, 10 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index c83b302..6b0f31f 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -162,7 +162,6 @@ static inline bool fpstate_is_confidential(struct fpu_guest *gfpu)
 }
 
 /* prctl */
-struct task_struct;
-extern long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2);
+extern long fpu_xstate_prctl(int option, unsigned long arg2);
 
 #endif /* _ASM_X86_FPU_API_H */
diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
index feed36d..80be803 100644
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -39,7 +39,6 @@ void x86_report_nx(void);
 
 extern int reboot_force;
 
-long do_arch_prctl_common(struct task_struct *task, int option,
-			  unsigned long arg2);
+long do_arch_prctl_common(int option, unsigned long arg2);
 
 #endif /* _ASM_X86_PROTO_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 39e1c86..1b016a1 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1687,16 +1687,13 @@ EXPORT_SYMBOL_GPL(xstate_get_guest_group_perm);
  * e.g. for AMX which requires XFEATURE_XTILE_CFG(17) and
  * XFEATURE_XTILE_DATA(18) this would be XFEATURE_XTILE_DATA(18).
  */
-long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2)
+long fpu_xstate_prctl(int option, unsigned long arg2)
 {
 	u64 __user *uptr = (u64 __user *)arg2;
 	u64 permitted, supported;
 	unsigned long idx = arg2;
 	bool guest = false;
 
-	if (tsk != current)
-		return -EPERM;
-
 	switch (option) {
 	case ARCH_GET_XCOMP_SUPP:
 		supported = fpu_user_cfg.max_features |	fpu_user_cfg.legacy_features;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b3d2d41..e86d09a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -334,7 +334,7 @@ static int get_cpuid_mode(void)
 	return !test_thread_flag(TIF_NOCPUID);
 }
 
-static int set_cpuid_mode(struct task_struct *task, unsigned long cpuid_enabled)
+static int set_cpuid_mode(unsigned long cpuid_enabled)
 {
 	if (!boot_cpu_has(X86_FEATURE_CPUID_FAULT))
 		return -ENODEV;
@@ -985,20 +985,19 @@ unsigned long __get_wchan(struct task_struct *p)
 	return addr;
 }
 
-long do_arch_prctl_common(struct task_struct *task, int option,
-			  unsigned long arg2)
+long do_arch_prctl_common(int option, unsigned long arg2)
 {
 	switch (option) {
 	case ARCH_GET_CPUID:
 		return get_cpuid_mode();
 	case ARCH_SET_CPUID:
-		return set_cpuid_mode(task, arg2);
+		return set_cpuid_mode(arg2);
 	case ARCH_GET_XCOMP_SUPP:
 	case ARCH_GET_XCOMP_PERM:
 	case ARCH_REQ_XCOMP_PERM:
 	case ARCH_GET_XCOMP_GUEST_PERM:
 	case ARCH_REQ_XCOMP_GUEST_PERM:
-		return fpu_xstate_prctl(task, option, arg2);
+		return fpu_xstate_prctl(option, arg2);
 	}
 
 	return -EINVAL;
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 26edb1c..0faa5e2 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -222,5 +222,5 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
 {
-	return do_arch_prctl_common(current, option, arg2);
+	return do_arch_prctl_common(option, arg2);
 }
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index e459253..1962008 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -844,7 +844,7 @@ SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
 
 	ret = do_arch_prctl_64(current, option, arg2);
 	if (ret == -EINVAL)
-		ret = do_arch_prctl_common(current, option, arg2);
+		ret = do_arch_prctl_common(option, arg2);
 
 	return ret;
 }
@@ -852,7 +852,7 @@ SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
 #ifdef CONFIG_IA32_EMULATION
 COMPAT_SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
 {
-	return do_arch_prctl_common(current, option, arg2);
+	return do_arch_prctl_common(option, arg2);
 }
 #endif
 

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features
  2022-05-11  2:27 ` [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features Kirill A. Shutemov
  2022-05-12 12:02   ` Thomas Gleixner
@ 2022-05-13 14:09   ` Alexander Potapenko
  2022-05-13 17:34     ` Edgecombe, Rick P
  1 sibling, 1 reply; 82+ messages in thread
From: Alexander Potapenko @ 2022-05-13 14:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Andrey Ryabinin, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe,
	Linux Memory Management List, LKML

> +
> +       /* Handle ARCH_THREAD_FEATURE_ENABLE */
> +
> +       task->thread.features |= features;
> +out:
> +       return task->thread.features;
Isn't arch_prctl() supposed to return 0 on success?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13 12:26             ` Alexander Potapenko
@ 2022-05-13 14:26               ` David Laight
  2022-05-13 15:28                 ` Alexander Potapenko
  0 siblings, 1 reply; 82+ messages in thread
From: David Laight @ 2022-05-13 14:26 UTC (permalink / raw)
  To: 'Alexander Potapenko'
  Cc: Dave Hansen, Thomas Gleixner, Peter Zijlstra, Kirill A. Shutemov,
	Dave Hansen, Andy Lutomirski, the arch/x86 maintainers,
	Dmitry Vyukov, H . J . Lu, Andi Kleen, Rick Edgecombe,
	Linux Memory Management List, LKML

From: Alexander Potapenko
> Sent: 13 May 2022 13:26
> 
> On Fri, May 13, 2022 at 1:28 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > ...
> > > Once we have the possibility to store tags in the pointers, we don't
> > > need redzones for heap/stack objects anymore, which saves quite a bit
> > > of memory.
> >
> > You still need redzones.
> > The high bits are ignored for actual memory accesses.
> >
> > To do otherwise you'd need the high bits to be in the PTE,
> > copied to the TLB and finally get into the cache tag.
> >
> > Then you'd have to use the correct tags for each page.
> 
> Sorry, I don't understand how this is relevant to HWASan in the userspace.
> Like in ASan, we have a custom allocator that assigns tags to heap
> objects. The assigned tag is stored in both the shadow memory for the
> object and the pointer returned by the allocator.
> Instrumentation inserted by the compiler checks the pointer before
> every memory access and ensures that its tag matches the tag of the
> object in the shadow memory.

Doesn't that add so much overhead that the system runs like a sick pig?
I don't see any point in adding overhead to a generic kernel to support
such an operation.

> A tag mismatch is reported as an out-of-bounds or a use-after-free,
> depending on whether the accessed memory is still considered
> allocated.
> Because objects with different tags follow each other, there is no
> need to add extra redzones to the objects to detect buffer overflows.
> (We might need to increase the object alignment though, but that's a
> different story).

How does all that help if a system call (e.g. read()) is given
an invalid length?
If that length is checked then the 'unmasked' address can be
passed to the kernel.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-12 21:24       ` Thomas Gleixner
@ 2022-05-13 14:43         ` Matthew Wilcox
  2022-05-13 22:59         ` Kirill A. Shutemov
  1 sibling, 0 replies; 82+ messages in thread
From: Matthew Wilcox @ 2022-05-13 14:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Peter Zijlstra, Kirill A. Shutemov, Dave Hansen,
	Andy Lutomirski, x86, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

On Thu, May 12, 2022 at 11:24:27PM +0200, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 21:39, Thomas Gleixner wrote:
> > On Thu, May 12 2022 at 10:22, Dave Hansen wrote:
> >> One of the stated reasons for adding LAM hardware is that folks want to
> >> use sanitizers outside of debugging environments.  To me, that means
> >> that LAM is something that the same binary might run with or without.
> >
> > On/off yes, but is there an actual use case where such a mechanism would
> > at start time dynamically chose the number of bits?
> 
> This would need cooperation from the application because it has to tell
> the magic facility whether it intends to utilize the large VA space on a
> 5-level paging system or not.

Taking a step back ...

Consider an application like 'grep'.  It operates on streams of data,
and has a fairly bounded amount of virtual memory that it will use.
Let's say it's 1GB (*handwave*).  So it has 33 bits of address space that
it can use for "security" of one form or another.  ASLR is one thing which
is vying for bits, and now ASAN is another.  Somehow it is supposed to
tell the kernel "I want to use 6 bits for ASAN and 27 bits for ASLR".
Or "I want to use 15 bits for ASAN and 18 bits for ASLR".

So how does grep make that decision?  How does it find out what the
kernel supports?  Deciding which tradeoff is more valuable to grep is
clearly not something the kernel can help with, but I do think that the
kernel needs to have an API to query what's available.

Something like Libreoffice or Firefox is obviously going to be much more
complex; it doesn't really have a bounded amount of virtual memory to
work with.  But if we can't even solve this problem for applications
with bounded address space, we stand no chance of solving it for hard
cases.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13 14:26               ` David Laight
@ 2022-05-13 15:28                 ` Alexander Potapenko
  0 siblings, 0 replies; 82+ messages in thread
From: Alexander Potapenko @ 2022-05-13 15:28 UTC (permalink / raw)
  To: David Laight
  Cc: Dave Hansen, Thomas Gleixner, Peter Zijlstra, Kirill A. Shutemov,
	Dave Hansen, Andy Lutomirski, the arch/x86 maintainers,
	Dmitry Vyukov, H . J . Lu, Andi Kleen, Rick Edgecombe,
	Linux Memory Management List, LKML

On Fri, May 13, 2022 at 4:26 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Alexander Potapenko
> > Sent: 13 May 2022 13:26
> >
> > On Fri, May 13, 2022 at 1:28 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > ...
> > > > Once we have the possibility to store tags in the pointers, we don't
> > > > need redzones for heap/stack objects anymore, which saves quite a bit
> > > > of memory.
> > >
> > > You still need redzones.
> > > The high bits are ignored for actual memory accesses.
> > >
> > > To do otherwise you'd need the high bits to be in the PTE,
> > > copied to the TLB and finally get into the cache tag.
> > >
> > > Then you'd have to use the correct tags for each page.
> >
> > Sorry, I don't understand how this is relevant to HWASan in the userspace.
> > Like in ASan, we have a custom allocator that assigns tags to heap
> > objects. The assigned tag is stored in both the shadow memory for the
> > object and the pointer returned by the allocator.
> > Instrumentation inserted by the compiler checks the pointer before
> > every memory access and ensures that its tag matches the tag of the
> > object in the shadow memory.
>
> Doesn't that add so much overhead that the system runs like a sick pig?
> I don't see any point adding overhead to a generic kernel to support
> such operation.
Let me ensure we are on the same page here. Right now we are talking
about LAM support for userspace addresses.
At this point nobody is going to add instrumentation to a generic
kernel - just a prctl (leaving aside how exactly it works) that makes the
CPU ignore certain address bits in a particular userspace process.
The whole system is not supposed to run slower because of that - even
if one or many processes choose to enable LAM.

Now let's consider ASan (https://clang.llvm.org/docs/AddressSanitizer.html).
It is a powerful detector of memory corruptions in the userspace, but
it comes with a cost:
 - compiler instrumentation bloats the code (by roughly 50%) and slows
down the execution (by up to 2x);
 - redzones around stack and heap objects, memory quarantine and
shadow memory increase the memory consumption (by up to 4x).
In short, for each 8 bytes of app memory ASan stores one byte of
metadata (shadow memory) indicating the addressability of those 8
bytes.
It then uses compiler instrumentation to verify that every memory
access in the program accesses only addressable memory.
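
To make this concrete, a rough sketch of the check inserted before a 1-byte
access, assuming the usual 1:8 shadow mapping (the shadow offset below is
made up for illustration; the real one is platform-specific):

#include <stdint.h>

#define ASAN_SHADOW_OFFSET 0x7fff8000ULL /* illustrative value only */

static inline int asan_byte_is_poisoned(uintptr_t addr)
{
	/* One shadow byte describes one 8-byte granule of app memory. */
	int8_t shadow = *(int8_t *)((addr >> 3) + ASAN_SHADOW_OFFSET);

	if (shadow == 0)
		return 0;	/* the whole granule is addressable */

	/* A positive value means only the first 'shadow' bytes of the
	   granule are addressable; redzones have negative values. */
	return (int8_t)(addr & 7) >= shadow;
}
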
ASan is widely used for testing and to some extent can be used in
production, but for big server-side apps the RAM overhead becomes
critical.

This is where HWASan
(https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html)
comes to the rescue.
Instead of storing addressability info in the shadow memory, it stores
a 1-byte tag for every 16 bytes of app memory (see
https://arxiv.org/pdf/1802.09517.pdf for other options).
As I mentioned before, the custom userspace memory allocator assigns
the tags to memory chunks and returns tagged pointers.
Like ASan, HWASan uses compiler instrumentation to verify that every
memory access is touching valid memory, but in this case it must
ensure that the pointer tag matches the tag stored in the shadow
memory.
Because of instrumentation, HWASan still has comparable code size and
execution overheads, but it uses way less memory (10-35% of the
original app memory consumption).
This lets us test beefy applications, e.g. feeding real-world queries
to production services. Even smaller applications benefit from it,
e.g. because of reduced cache pressure.
HWASan has been available for Android for a while now, and has proved itself useful.
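
To illustrate, a rough sketch of that check, assuming a 1-byte tag per
16-byte granule and the tag kept in the top byte of the pointer (the shadow
base below is made up for illustration):

#include <stdint.h>

#define HWASAN_SHADOW_BASE 0x100000000000ULL /* illustrative value only */
#define HWASAN_GRANULE     16

static inline int hwasan_tag_mismatch(uintptr_t tagged_ptr)
{
	uint8_t ptr_tag = tagged_ptr >> 56;	/* tag lives in the top byte */
	uintptr_t addr = tagged_ptr & ((1ULL << 56) - 1); /* strip the tag */
	uint8_t mem_tag = *(uint8_t *)(HWASAN_SHADOW_BASE + addr / HWASAN_GRANULE);

	/* A mismatch is reported as out-of-bounds or use-after-free. */
	return ptr_tag != mem_tag;
}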

> > A tag mismatch is reported as an out-of-bounds or a use-after-free,
> > depending on whether the accessed memory is still considered
> > allocated.
> > Because objects with different tags follow each other, there is no
> > need to add extra redzones to the objects to detect buffer overflows.
> > (We might need to increase the object alignment though, but that's a
> > different story).
>
> How does all that help if a system call (eg read()) is given
> an invalid length.
Neither ASan nor HWASan care about what happens in the kernel.
Instead, they wrap system calls (along with some important libc
functions) and check their arguments to ensure there are no buffer
overflows.
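
For example, a wrapper for read() boils down to something like this (the
check helper name is made up here, not the actual sanitizer runtime API):

#include <stddef.h>
#include <unistd.h>

/* Hypothetical runtime check provided by the sanitizer library. */
extern void __sanitizer_check_writable(const void *p, size_t n);

ssize_t __interceptor_read(int fd, void *buf, size_t count)
{
	/* Report a bug if any byte of [buf, buf+count) is not addressable. */
	__sanitizer_check_writable(buf, count);

	/* The (possibly tagged) pointer goes to the kernel unmodified. */
	return read(fd, buf, count);
}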

HTH,
Alex
--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Managing directors: Paul Manicle, Liana Sebastian
Registration court and number: Hamburg, HRB 86891
Registered office: Hamburg

This e-mail is confidential. If you received this communication by
mistake, please don't forward it to anyone else, please erase all
copies and attachments, and please let me know that it has gone to the
wrong person.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features
  2022-05-13 14:09   ` [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features Alexander Potapenko
@ 2022-05-13 17:34     ` Edgecombe, Rick P
  2022-05-13 23:09       ` Kirill A. Shutemov
  0 siblings, 1 reply; 82+ messages in thread
From: Edgecombe, Rick P @ 2022-05-13 17:34 UTC (permalink / raw)
  To: kirill.shutemov, glider
  Cc: linux-kernel, peterz, hjl.tools, linux-mm, dave.hansen,
	aryabinin, dvyukov, x86, ak, Lutomirski, Andy

On Fri, 2022-05-13 at 16:09 +0200, Alexander Potapenko wrote:
> > +
> > +       /* Handle ARCH_THREAD_FEATURE_ENABLE */
> > +
> > +       task->thread.features |= features;
> > +out:
> > +       return task->thread.features;
> 
> Isn't arch_prctl() supposed to return 0 on success?

Hmm, good point. Maybe we'll need a struct to pass info in and out.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13  3:05                   ` Dave Hansen
  2022-05-13  8:28                     ` Thomas Gleixner
@ 2022-05-13 22:48                     ` Kirill A. Shutemov
  1 sibling, 0 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-13 22:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, H.J. Lu, Peter Zijlstra, Dave Hansen,
	Andy Lutomirski, the arch/x86 maintainers, Alexander Potapenko,
	Dmitry Vyukov, Andi Kleen, Rick Edgecombe, Linux-MM,
	Catalin Marinas, LKML

On Thu, May 12, 2022 at 08:05:26PM -0700, Dave Hansen wrote:
> On 5/12/22 18:27, Thomas Gleixner wrote:
> > On Thu, May 12 2022 at 17:46, Dave Hansen wrote:
> >> On 5/12/22 17:08, H.J. Lu wrote:
> >> If I had to take a shot at this today, I think I'd opt for:
> >>
> >> 	mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
> >>
> >> although I'm not super confident about the "fuzzy" flag.  I also don't
> >> think I'd totally hate the "blind" interface where the kernel just gets
> >> to pick unilaterally and takes zero input from userspace.
> > That's the only sane choice and you can make it simple for userspace:
> > 
> >        ret = prctl(GET_XXX_MASK, &mask);
> > 
> > and then let it decide based on @ret and @mask whether to use it or not.
> > 
> > But of course nobody thought about this as a generic feature and so we
> have the ARM64 TBI muck as a precedent.
> 
> Well, not quite *nobody*:
> 
>  https://lore.kernel.org/linux-arm-kernel/7a34470c-73f0-26ac-e63d-161191d4b1e4@intel.com/

In the first RFC I tried to make the ARM TBI interface generic. I can resurrect
it if it fits better:

https://lore.kernel.org/all/20210205151631.43511-2-kirill.shutemov@linux.intel.com/


-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-12 21:24       ` Thomas Gleixner
  2022-05-13 14:43         ` Matthew Wilcox
@ 2022-05-13 22:59         ` Kirill A. Shutemov
  1 sibling, 0 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-13 22:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Peter Zijlstra, Dave Hansen, Andy Lutomirski, x86,
	Alexander Potapenko, Dmitry Vyukov, H . J . Lu, Andi Kleen,
	Rick Edgecombe, linux-mm, linux-kernel

On Thu, May 12, 2022 at 11:24:27PM +0200, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 21:39, Thomas Gleixner wrote:
> > On Thu, May 12 2022 at 10:22, Dave Hansen wrote:
> >> One of the stated reasons for adding LAM hardware is that folks want to
> >> use sanitizers outside of debugging environments.  To me, that means
> >> that LAM is something that the same binary might run with or without.
> >
> > On/off yes, but is there an actual use case where such a mechanism would
> > at start time dynamically chose the number of bits?
> 
> This would need cooperation from the application because it has to tell
> the magic facility whether it intends to utilize the large VA space on a
> 5-level paging system or not.
> 
> I have no idea how that is supposed to work, but what do I know about
> magic.
> 
> >> It's totally fine with me if the kernel only initially supports LAM_U57.
> >>  But, I'd ideally like to make sure that the ABI can support LAM_U57,
> >> LAM_U48, AMD's UAI (in whatever form it settles), or other masks.
> >
> > Sure. No argument here.
> 
> Assumed that the acronym of the day, which uses this, has a real benefit
> from the larger number of bits, we can support it.
> 
> But we are not going to make this a per thread selectable thing.
> 
> It's a process-wide decision at startup simply because it does not buy
> thread A anything to select U57 if thread B selects U48 before thread A
> was able to map something into the U48 covered address space. Same issue
> the other way round as then B has to fallback to U57 or NONE. That does
> not make any sense at all.
> 
> I'm all for flexible, but not just because we can. It has to make sense.

Some JVMs and javascript engines are known for using tags in the high bits of
pointers (and clearing them manually on dereferencing as of now).
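
For illustration, the manual variant is roughly this, assuming the tag is kept
in bits 48-63 (the layout is made up); with LAM the mask on every dereference
becomes unnecessary because the hardware ignores those bits:

#include <stdint.h>

#define TAG_SHIFT 48
#define ADDR_MASK ((1ULL << TAG_SHIFT) - 1)

static inline uint64_t make_tagged(void *p, uint16_t tag)
{
	return (uint64_t)(uintptr_t)p | ((uint64_t)tag << TAG_SHIFT);
}

static inline void *untag(uint64_t tagged)
{
	/* Assumes a userspace pointer with bit 47 clear, so masking is
	   enough and no sign extension is needed. */
	return (void *)(uintptr_t)(tagged & ADDR_MASK);
}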

One use-case I had in mind was having a thread that runs a VM/JIT, like
javascript/JVM/LUA/whatever, that serves the rest of the application.
The thread uses LAM while the rest of the application does not. Leaking
a tagged pointer into a thread that does not use LAM would indicate an
issue and a SIGSEGV would be deserved.

No idea if it is practical.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13 11:07         ` Alexander Potapenko
  2022-05-13 11:28           ` David Laight
@ 2022-05-13 23:01           ` Kirill A. Shutemov
  2022-05-14 10:00             ` Thomas Gleixner
  1 sibling, 1 reply; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-13 23:01 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: Dave Hansen, Thomas Gleixner, Peter Zijlstra, Dave Hansen,
	Andy Lutomirski, the arch/x86 maintainers, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe,
	Linux Memory Management List, LKML

On Fri, May 13, 2022 at 01:07:43PM +0200, Alexander Potapenko wrote:
> On Thu, May 12, 2022 at 11:51 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 5/12/22 12:39, Thomas Gleixner wrote:
> > >> It's OK for a debugging build that runs on one kind of hardware.  But,
> > >> if we want LAM-using binaries to be portable, we have to do something
> > >> different.
> > >>
> > >> One of the stated reasons for adding LAM hardware is that folks want to
> > >> use sanitizers outside of debugging environments.  To me, that means
> > >> that LAM is something that the same binary might run with or without.
> > > On/off yes, but is there an actual use case where such a mechanism would
> > > at start time dynamically chose the number of bits?
> >
> > I'd love to hear from folks doing the userspace side of this.  Will
> > userspace be saying: "Give me all the bits you can!".  Or, will it
> > really just be looking for 6 bits only, and it doesn't care whether it
> > gets 6 or 15, it will use only 6?
> 
> (speaking more or less on behalf of the userspace folks here)
> I think it is safe to assume that in the upcoming year or two HWASan
> will be fine having just 6 bits for the tags on x86 machines.
> We are interested in running it on kernels with and without
> CONFIG_X86_5LEVEL=y, so U48 is not an option in some cases anyway.

Just to be clear: LAM_U48 works on a machine with 5-level paging enabled as
long as the process doesn't map anything above 47-bit.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features
  2022-05-13 17:34     ` Edgecombe, Rick P
@ 2022-05-13 23:09       ` Kirill A. Shutemov
  2022-05-13 23:50         ` Edgecombe, Rick P
  0 siblings, 1 reply; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-13 23:09 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: glider, linux-kernel, peterz, hjl.tools, linux-mm, dave.hansen,
	aryabinin, dvyukov, x86, ak, Lutomirski, Andy

On Fri, May 13, 2022 at 05:34:12PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2022-05-13 at 16:09 +0200, Alexander Potapenko wrote:
> > > +
> > > +       /* Handle ARCH_THREAD_FEATURE_ENABLE */
> > > +
> > > +       task->thread.features |= features;
> > > +out:
> > > +       return task->thread.features;
> > 
> > Isn't arch_prctl() supposed to return 0 on success?
> 
> Hmm, good point. Maybe we'll need a struct to pass info in and out.

But values >0 are unused. I don't see why we can't use them.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 08/10] x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive
  2022-05-12 13:36   ` Thomas Gleixner
@ 2022-05-13 23:22     ` Kirill A. Shutemov
  2022-05-14  8:37       ` Thomas Gleixner
  0 siblings, 1 reply; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-13 23:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, H . J . Lu,
	Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

On Thu, May 12, 2022 at 03:36:31PM +0200, Thomas Gleixner wrote:
> On Wed, May 11 2022 at 05:27, Kirill A. Shutemov wrote:
> > LAM_U48 steals bits above 47-bit for tags and makes it impossible for
> > userspace to use full address space on 5-level paging machine.
> 
> > Make these features mutually exclusive: whichever gets enabled first
> > blocks the other one.
> 
> So this patch prevents a mapping above 47bit when LAM48 is enabled, but
> I fail to spot how an already existing mapping above 47bit would prevent
> LAM48 from being enabled.
> 
> Maybe I'm missing something which makes this magically mutually
> exclusive.

It is in 09/10. See lam_u48_allowed()

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features
  2022-05-13 23:09       ` Kirill A. Shutemov
@ 2022-05-13 23:50         ` Edgecombe, Rick P
  2022-05-14  8:37           ` Thomas Gleixner
  0 siblings, 1 reply; 82+ messages in thread
From: Edgecombe, Rick P @ 2022-05-13 23:50 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: linux-kernel, peterz, hjl.tools, linux-mm, dave.hansen,
	aryabinin, dvyukov, x86, ak, Lutomirski, Andy, glider

On Sat, 2022-05-14 at 02:09 +0300, Kirill A. Shutemov wrote:
> On Fri, May 13, 2022 at 05:34:12PM +0000, Edgecombe, Rick P wrote:
> > On Fri, 2022-05-13 at 16:09 +0200, Alexander Potapenko wrote:
> > > > +
> > > > +       /* Handle ARCH_THREAD_FEATURE_ENABLE */
> > > > +
> > > > +       task->thread.features |= features;
> > > > +out:
> > > > +       return task->thread.features;
> > > 
> > > Isn't arch_prctl() supposed to return 0 on success?
> > 
> > Hmm, good point. Maybe we'll need a struct to pass info in and out.
> 
> But values >0 are unused. I don't see why we can't use them.

Hmm, I don't know what it would break since it is a new "code"
argument. But the man page says:
"On success, arch_prctl() returns 0; on error, -1 is returned, and
errno is set to indicate the error."

So just change the man pages?
"On success, arch_prctl() returns positive values; on error, -1 is
returned, and errno is set to indicate the error."



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features
  2022-05-13 23:50         ` Edgecombe, Rick P
@ 2022-05-14  8:37           ` Thomas Gleixner
  2022-05-14 23:06             ` Edgecombe, Rick P
  0 siblings, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-14  8:37 UTC (permalink / raw)
  To: Edgecombe, Rick P, kirill.shutemov
  Cc: linux-kernel, peterz, hjl.tools, linux-mm, dave.hansen,
	aryabinin, dvyukov, x86, ak, Lutomirski, Andy, glider

On Fri, May 13 2022 at 23:50, Edgecombe, Rick P wrote:
> On Sat, 2022-05-14 at 02:09 +0300, Kirill A. Shutemov wrote:
>> On Fri, May 13, 2022 at 05:34:12PM +0000, Edgecombe, Rick P wrote:
>> > On Fri, 2022-05-13 at 16:09 +0200, Alexander Potapenko wrote:
>> > > > +
>> > > > +       /* Handle ARCH_THREAD_FEATURE_ENABLE */
>> > > > +
>> > > > +       task->thread.features |= features;
>> > > > +out:
>> > > > +       return task->thread.features;
>> > > 
>> > > Isn't arch_prctl() supposed to return 0 on success?
>> > 
>> > Hmm, good point. Maybe we'll need a struct to pass info in and out.
>> 
>> But values >0 are unused. I don't see why we can't use them.
>
> Hmm, I don't know what it would break since it is a new "code"
> argument. But the man page says:
> "On success, arch_prctl() returns 0; on error, -1 is returned, and
> errno is set to indicate the error."
>
> So just change the man pages?
> "On success, arch_prctl() returns positive values; on error, -1 is
> returned, and errno is set to indicate the error."

Why?

        prctl(GET, &out)

is the pattern used all over the place.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 08/10] x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive
  2022-05-13 23:22     ` Kirill A. Shutemov
@ 2022-05-14  8:37       ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-14  8:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, H . J . Lu,
	Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

On Sat, May 14 2022 at 02:22, Kirill A. Shutemov wrote:

> On Thu, May 12, 2022 at 03:36:31PM +0200, Thomas Gleixner wrote:
>> On Wed, May 11 2022 at 05:27, Kirill A. Shutemov wrote:
>> > LAM_U48 steals bits above 47-bit for tags and makes it impossible for
>> > userspace to use full address space on 5-level paging machine.
>> 
>> > Make these features mutually exclusive: whichever gets enabled first
>> > blocks the other one.
>> 
>> So this patch prevents a mapping above 47bit when LAM48 is enabled, but
>> I fail to spot how an already existing mapping above 47bit would prevent
>> LAM48 from being enabled.
>> 
>> Maybe I'm missing something which makes this magically mutually
>> exclusive.
>
> It is in 09/10. See lam_u48_allowed()

Sure, but that doesn't make this changelog any more correct.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 00/10] Linear Address Masking enabling
  2022-05-13 23:01           ` Kirill A. Shutemov
@ 2022-05-14 10:00             ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-14 10:00 UTC (permalink / raw)
  To: Kirill A. Shutemov, Alexander Potapenko
  Cc: Dave Hansen, Peter Zijlstra, Dave Hansen, Andy Lutomirski,
	the arch/x86 maintainers, Dmitry Vyukov, H . J . Lu, Andi Kleen,
	Rick Edgecombe, Linux Memory Management List, LKML

On Sat, May 14 2022 at 02:01, Kirill A. Shutemov wrote:
> On Fri, May 13, 2022 at 01:07:43PM +0200, Alexander Potapenko wrote:
>> On Thu, May 12, 2022 at 11:51 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> >
>> > On 5/12/22 12:39, Thomas Gleixner wrote:
>> > >> It's OK for a debugging build that runs on one kind of hardware.  But,
>> > >> if we want LAM-using binaries to be portable, we have to do something
>> > >> different.
>> > >>
>> > >> One of the stated reasons for adding LAM hardware is that folks want to
>> > >> use sanitizers outside of debugging environments.  To me, that means
>> > >> that LAM is something that the same binary might run with or without.
>> > > On/off yes, but is there an actual use case where such a mechanism would
>> > > at start time dynamically chose the number of bits?
>> >
>> > I'd love to hear from folks doing the userspace side of this.  Will
>> > userspace be saying: "Give me all the bits you can!".  Or, will it
>> > really just be looking for 6 bits only, and it doesn't care whether it
>> > gets 6 or 15, it will use only 6?
>> 
>> (speaking more or less on behalf of the userspace folks here)
>> I think it is safe to assume that in the upcoming year or two HWASan
>> will be fine having just 6 bits for the tags on x86 machines.
>> We are interested in running it on kernels with and without
>> CONFIG_X86_5LEVEL=y, so U48 is not an option in some cases anyway.
>
> Just to be clear: LAM_U48 works on a machine with 5-level paging enabled as
> long as the process doesn't map anything above 47-bit.

That's the whole point. If you use LAM_U48 in one thread for some
obscure reason, you prevent the whole process from utilizing the full
virtual address space. The other way round is also weird. If one thread
manages to have a virtual address above bit 47 then you can't get U48
for the other threads anymore.

Aside from that, the whole per-thread approach can only ever work when an
application is crafted carefully to handle it. Think about shared data
structures with pointers inside. This surely can be made to work, but is it
something which is generally safe? No.

Plus it comes with inconsistent behaviour in the kthread_use_mm() case.

What could work is a mechanism which tells the loader that it should apply
U48 and restrict the address space randomization to 47 bits, but
anything else is going to create more problems than it solves.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features
  2022-05-14  8:37           ` Thomas Gleixner
@ 2022-05-14 23:06             ` Edgecombe, Rick P
  2022-05-15  9:02               ` Thomas Gleixner
  0 siblings, 1 reply; 82+ messages in thread
From: Edgecombe, Rick P @ 2022-05-14 23:06 UTC (permalink / raw)
  To: tglx, kirill.shutemov
  Cc: linux-kernel, peterz, hjl.tools, linux-mm, dave.hansen,
	aryabinin, dvyukov, x86, ak, Lutomirski, Andy, glider

On Sat, 2022-05-14 at 10:37 +0200, Thomas Gleixner wrote:
> On Fri, May 13 2022 at 23:50, Edgecombe, Rick P wrote:
> > On Sat, 2022-05-14 at 02:09 +0300, Kirill A. Shutemov wrote:
> > > On Fri, May 13, 2022 at 05:34:12PM +0000, Edgecombe, Rick P
> > > wrote:
> > > > On Fri, 2022-05-13 at 16:09 +0200, Alexander Potapenko wrote:
> > > > > > +
> > > > > > +       /* Handle ARCH_THREAD_FEATURE_ENABLE */
> > > > > > +
> > > > > > +       task->thread.features |= features;
> > > > > > +out:
> > > > > > +       return task->thread.features;
> > > > > 
> > > > > Isn't arch_prctl() supposed to return 0 on success?
> > > > 
> > > > Hmm, good point. Maybe we'll need a struct to pass info in and
> > > > out.
> > > 
> > > But values >0 are unused. I don't see why we can't use them.
> > 
> > Hmm, I don't know what it would break since it is a new "code"
> > argument. But the man page says:
> > "On success, arch_prctl() returns 0; on error, -1 is returned, and
> > errno is set to indicate the error."
> > 
> > So just change the man pages?
> > "On success, arch_prctl() returns positive values; on error, -1 is
> > returned, and errno is set to indicate the error."
> 
> Why?
> 
>         prctl(GET, &out)
> 
> is the pattern used all over the place.

It seems better to me, but we also need to pass something in.

The idea of the "enable" operation is that userspace would pass in all
the features that it wants in one call, and then find out what was
successfully enabled. So unlike the other arch_prctl()s, it wants to
pass something in AND get a result in one arch_prctl() call. It doesn't
need to check what is supported ahead of time. Since these enabling
operations can fail (OOM, etc), userspace has to handle unexpected per-
feature failure anyway. So it just blindly asks for what it wants.
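
A purely hypothetical userspace sketch of that pattern (the command and
feature names follow the RFC, the numeric values are invented so the example
compiles, and X86_THREAD_FOO stands in for some second feature):

#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_THREAD_FEATURE_ENABLE 0x3001	/* placeholder value */
#define X86_THREAD_LAM_U57         (1UL << 0)	/* placeholder value */
#define X86_THREAD_FOO             (1UL << 1)	/* placeholder value */

static int enable_wanted_features(void)
{
	unsigned long wanted = X86_THREAD_LAM_U57 | X86_THREAD_FOO;

	/* With the RFC as posted the call returns the resulting feature
	   bitmask rather than 0, so compare it against what was asked. */
	long got = syscall(SYS_arch_prctl, ARCH_THREAD_FEATURE_ENABLE, wanted);

	if (got < 0)
		return -1;
	return (got & wanted) == wanted ? 0 : -1;
}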

Any objections to having it write back the result in the same
structure?

Otherwise, the option that used to be used here was a "status"
arch_prctl(), which was called separately to find out what actually got
enabled after an "enable" call. That fit with the GET/SET semantics
already in place.

I guess we could also get rid of the batch enabling idea, and just have
one "enable" call per feature too. But then it is syscall heavy.

Any chance you could weigh in on this?


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features
  2022-05-14 23:06             ` Edgecombe, Rick P
@ 2022-05-15  9:02               ` Thomas Gleixner
  2022-05-15 18:24                 ` Edgecombe, Rick P
  0 siblings, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-15  9:02 UTC (permalink / raw)
  To: Edgecombe, Rick P, kirill.shutemov
  Cc: linux-kernel, peterz, hjl.tools, linux-mm, dave.hansen,
	aryabinin, dvyukov, x86, ak, Lutomirski, Andy, glider

On Sat, May 14 2022 at 23:06, Edgecombe, Rick P wrote:
> On Sat, 2022-05-14 at 10:37 +0200, Thomas Gleixner wrote:
>> > "On success, arch_prctl() returns positive values; on error, -1 is
>> > returned, and errno is set to indicate the error."
>> 
>> Why?
>> 
>>         prctl(GET, &out)
>> 
>> is the pattern used all over the place.
>
> It seems better to me, but we also need to pass something in.
>
> The idea of the "enable" operation is that userspace would pass in all
> the features that it wants in one call, and then find out back what was
> successfully enabled. So unlike the other arch_prctl()s, it wants to
> pass something in AND get a result in one arch_prctl() call. It doesn't
> need to check what is supported ahead of time. Since these enabling
> operations can fail (OOM, etc), userspace has to handle unexpected per-
> feature failure anyway. So it just blindly asks for what it wants.

I'm not convinced at all that this wholesale enabling of independent
features makes any sense. That would require:

  - that all features are strictly binary on/off, which is already not the
    case with LAM due to the different mask sizes.

  - that user space knows at some potentially early point of process
    startup which features it needs. Some of them might be requested
    later when libraries are loaded, but that would be too late for
    others.

Aside from that, if you lump all these things together, what's the
semantics of this feature lump in case of failure:

  Try to enable the rest when one fails, skip the rest, roll back?

Also such a feature lump results in a demultiplexing prctl() which is
code-wise always inferior to dedicated controls.

> Any objections to having it write back the result in the same
> structure?

Why not.

> Otherwise, the option that used to be used here was a "status"
> arch_prctl(), which was called separately to find out what actually got
> enabled after an "enable" call. That fit with the GET/SET semantics
> already in place.
>
> I guess we could also get rid of the batch enabling idea, and just have
> one "enable" call per feature too. But then it is syscall heavy.

This is not a runtime hotpath problem. Those prctls() happen once when
the process starts, so having three which are designed for their
individual purpose instead of one ill-defined one is definitely the better
choice.

Premature optimization is never a good idea. Keep it simple is the right
starting point.

If it really turns out to be something which matters, then you can
provide a batch interface later on if it makes sense to do so, but see
above.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features
  2022-05-15  9:02               ` Thomas Gleixner
@ 2022-05-15 18:24                 ` Edgecombe, Rick P
  2022-05-15 19:38                   ` Thomas Gleixner
  0 siblings, 1 reply; 82+ messages in thread
From: Edgecombe, Rick P @ 2022-05-15 18:24 UTC (permalink / raw)
  To: tglx, kirill.shutemov
  Cc: linux-kernel, peterz, hjl.tools, linux-mm, dave.hansen,
	aryabinin, dvyukov, x86, ak, Lutomirski, Andy, glider

On Sun, 2022-05-15 at 11:02 +0200, Thomas Gleixner wrote:
> > Otherwise, the option that used to be used here was a "status"
> > arch_prctl(), which was called separately to find out what actually
> > got
> > enabled after an "enable" call. That fit with the GET/SET semantics
> > already in place.
> > 
> > I guess we could also get rid of the batch enabling idea, and just
> > have
> > one "enable" call per feature too. But then it is syscall heavy.
> 
> This is not a runtime hotpath problem. Those prctls() happen once
> when
> the process starts, so having three which are designed for the
> individual purpose instead of one ill defined is definitely the
> better
> choice.
> 
> Premature optimization is never a good idea. Keep it simple is the
> right
> starting point.
> 
> If it really turns out to be something which matters, then you can
> provide a batch interface later on if it makes sense to do so, but
> see
> above.

Thanks, sounds good to me.

Kirill, so I guess we can just change ARCH_THREAD_FEATURE_ENABLE/
ARCH_THREAD_FEATURE_DISABLE to return EINVAL if more than one bit is
set. It returns 0 on success and whatever error code on failure.
Userspace can do whatever rollback logic it wants. What do you think?
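
A rough kernel-side sketch of those semantics, just to make it concrete (a
fragment, not the patchset code; SUPPORTED_THREAD_FEATURES is a made-up mask):

/* Called from the arch_prctl() handler; hweight_long() is from
   <linux/bitops.h>. */
static long thread_feature_enable(unsigned long features)
{
	if (hweight_long(features) != 1)	/* exactly one feature per call */
		return -EINVAL;

	if (!(features & SUPPORTED_THREAD_FEATURES))
		return -EINVAL;

	current->thread.features |= features;
	return 0;
}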


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features
  2022-05-15 18:24                 ` Edgecombe, Rick P
@ 2022-05-15 19:38                   ` Thomas Gleixner
  2022-05-15 22:01                     ` Edgecombe, Rick P
  0 siblings, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2022-05-15 19:38 UTC (permalink / raw)
  To: Edgecombe, Rick P, kirill.shutemov
  Cc: linux-kernel, peterz, hjl.tools, linux-mm, dave.hansen,
	aryabinin, dvyukov, x86, ak, Lutomirski, Andy, glider

On Sun, May 15 2022 at 18:24, Edgecombe, Rick P wrote:
> On Sun, 2022-05-15 at 11:02 +0200, Thomas Gleixner wrote:
>> If it really turns out to be something which matters, then you can
>> provide a batch interface later on if it makes sense to do so, but
>> see
>> above.
>
> Thanks, sounds good to me.
>
> Kirill, so I guess we can just change ARCH_THREAD_FEATURE_ENABLE/
> ARCH_THREAD_FEATURE_DISABLE to return EINVAL if more than one bit is
> set. It returns 0 on success and whatever error code on failure.
> Userspace can do whatever rollback logic it wants. What do you think?

Why have this feature bit interface in the first place?

It's going to be a demultiplex mechanism with incompatible
arguments. Just look at LAM. What's really architecture specific about
it?

The mechanism per se is architecture independent: pointer tagging.

What's architecture specific is whether it's supported, the address mask
and the enable/disable mechanism.

So having e.g.

   prctl(POINTER_TAGGING_GET_MASK, &mask);

works on all architectures which support this. Ditto

   prctl(POINTER_TAGGING_ENABLE, &mask);

is architecture agnostic. Both need to be backed by an architecture
specific implementation of course.

This makes it future proof because new CPUs could define the mask to be
bits 57-61 and use bit 62 for something else. So from a user space
perspective the mask retrieval is useful because it's obvious and trivial
to use and does not need code changes when the hardware implementation
provides a different mask.
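
Spelled out as a hypothetical userspace sketch - the command names are the
ones from this mail, the numeric values are invented so it compiles, none of
this exists today:

#include <stdint.h>
#include <sys/prctl.h>

#define POINTER_TAGGING_GET_MASK 0x1001	/* invented value */
#define POINTER_TAGGING_ENABLE   0x1002	/* invented value */

static int use_pointer_tagging(uint64_t *mask_out)
{
	uint64_t mask = 0;

	if (prctl(POINTER_TAGGING_GET_MASK, &mask) < 0)
		return -1;		/* not supported here */

	/* e.g. HWASan wants at least 6 tag bits to be useful. */
	if (__builtin_popcountll(mask) < 6)
		return -1;

	if (prctl(POINTER_TAGGING_ENABLE, &mask) < 0)
		return -1;

	*mask_out = mask;
	return 0;
}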

See?

The thread.features bitmap could still be used as internal storage
for enabled features, but having this as the primary programming
interface is cumbersome and inflexible for anything which is not binary
on/off.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features
  2022-05-15 19:38                   ` Thomas Gleixner
@ 2022-05-15 22:01                     ` Edgecombe, Rick P
  0 siblings, 0 replies; 82+ messages in thread
From: Edgecombe, Rick P @ 2022-05-15 22:01 UTC (permalink / raw)
  To: tglx, kirill.shutemov
  Cc: linux-kernel, peterz, hjl.tools, linux-mm, dave.hansen,
	aryabinin, dvyukov, x86, ak, Lutomirski, Andy, glider

On Sun, 2022-05-15 at 21:38 +0200, Thomas Gleixner wrote:
> On Sun, May 15 2022 at 18:24, Edgecombe, Rick P wrote:
> > On Sun, 2022-05-15 at 11:02 +0200, Thomas Gleixner wrote:
> > > If it really turns out to be something which matters, then you
> > > can
> > > provide a batch interface later on if it makes sense to do so,
> > > but
> > > see
> > > above.
> > 
> > Thanks, sounds good to me.
> > 
> > Kirill, so I guess we can just change ARCH_THREAD_FEATURE_ENABLE/
> > ARCH_THREAD_FEATURE_DISABLE to return EINVAL if more than one bit
> > is
> > set. It returns 0 on success and whatever error code on failure.
> > Userspace can do whatever rollback logic it wants. What do you
> > think?
> 
> Why having this feature bit interface in the first place?

The idea was that we should not have duplicate interfaces if we can
avoid it. It of course grew out of the "elf feature bit" stuff, but we
considered splitting them after moving away from that. LAM and CET's
enabling needs seemed close enough to avoid having two interfaces.

> 
> It's going to be a demultiplex mechanism with incompatible
> arguments. Just look at LAM. What's really architecture specific
> about
> it?
> 
> The mechanism per se is architecture independent: pointer tagging.
> 
> What's architecture specific is whether it's supported, the address
> mask
> and the enable/disable mechanism.
> 
> So having e.g.
> 
>    prctl(POINTER_TAGGING_GET_MASK, &mask);
> 
> works on all architectures which support this. Ditto
> 
>    prctl(POINTER_TAGGING_ENABLE, &mask);
> 
> is architecture agnostic. Both need to be backed by an architecture
> specific implementation of course.
> 
> This makes it future proof because new CPUs could define the mask to
> be
> bit 57-61 and use bit 62 for something else. So from a user space
> perspective the mask retrival is useful because it's obvious and
> trivial
> to use and does not need code changes when the hardware
> implementation
> provides a different mask.

The lack of ability to pass extra arguments is a good point.

> 
> See?

Regarding making it arch specific or not, if the LAM interface can be
arch agnostic, then that makes sense to me. I guess some CPU features
(virtual memory, etc) are similar enough that the kernel can hide them
behind common interfaces. Some aren't (cpuid, gs register, etc). If LAM
can be one of the former, then sharing an interface with other
architectures does seem much better.

I'm thinking CET is different enough from other similar features that
leaving it as an arch thing is probably appropriate. BTI is probably
the closest (to IBT). It uses its own BTI-specific ELF header bit, and
requires special PROT on memory, unlike IBT.

> 
> The thread.features bitmap could still be used as an internal storage
> for enabled features, but having this as the primary programming
> interface is cumbersome and unflexible for anything which is not
> binary
> on/off.
> 
> Thanks,
> 
>         tglx
> 
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 08/10] x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive
  2022-05-11  2:27 ` [RFCv2 08/10] x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive Kirill A. Shutemov
  2022-05-12 13:36   ` Thomas Gleixner
@ 2022-05-18  8:43   ` Bharata B Rao
  2022-05-18 17:08     ` Kirill A. Shutemov
  1 sibling, 1 reply; 82+ messages in thread
From: Bharata B Rao @ 2022-05-18  8:43 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen, Andy Lutomirski, Peter Zijlstra
  Cc: x86, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov,
	H . J . Lu, Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

On 5/11/2022 7:57 AM, Kirill A. Shutemov wrote:
> LAM_U48 steals bits above 47-bit for tags and makes it impossible for
> userspace to use full address space on 5-level paging machine.
> 
> Make these features mutually exclusive: whichever gets enabled first
> blocks the other one.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/include/asm/elf.h         |  3 ++-
>  arch/x86/include/asm/mmu_context.h | 13 +++++++++++++
>  arch/x86/kernel/sys_x86_64.c       |  5 +++--
>  arch/x86/mm/hugetlbpage.c          |  6 ++++--
>  arch/x86/mm/mmap.c                 |  9 ++++++++-
>  5 files changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index 29fea180a665..53b96b0c8cc3 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -328,7 +328,8 @@ static inline int mmap_is_ia32(void)
>  extern unsigned long task_size_32bit(void);
>  extern unsigned long task_size_64bit(int full_addr_space);
>  extern unsigned long get_mmap_base(int is_legacy);
> -extern bool mmap_address_hint_valid(unsigned long addr, unsigned long len);
> +extern bool mmap_address_hint_valid(struct mm_struct *mm,
> +				    unsigned long addr, unsigned long len);
>  extern unsigned long get_sigframe_size(void);
>  
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 27516046117a..c8a6d80dfec3 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -218,6 +218,19 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
>  
>  unsigned long __get_current_cr3_fast(void);
>  
> +#ifdef CONFIG_X86_5LEVEL
> +static inline bool full_va_allowed(struct mm_struct *mm)
> +{
> +	/* LAM_U48 steals VA bits abouve 47-bit for tags */
> +	return mm->context.lam != LAM_U48;
> +}
> +#else

This is called from X86 common code but appears to be LAM-specific.
What would mm->context.lam contain if X86_FEATURE_LAM isn't set?

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFCv2 08/10] x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive
  2022-05-18  8:43   ` Bharata B Rao
@ 2022-05-18 17:08     ` Kirill A. Shutemov
  0 siblings, 0 replies; 82+ messages in thread
From: Kirill A. Shutemov @ 2022-05-18 17:08 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86,
	Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov, H . J . Lu,
	Andi Kleen, Rick Edgecombe, linux-mm, linux-kernel

On Wed, May 18, 2022 at 02:13:06PM +0530, Bharata B Rao wrote:
> On 5/11/2022 7:57 AM, Kirill A. Shutemov wrote:
> > LAM_U48 steals bits above 47-bit for tags and makes it impossible for
> > userspace to use full address space on 5-level paging machine.
> > 
> > Make these features mutually exclusive: whichever gets enabled first
> > blocks the other one.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/include/asm/elf.h         |  3 ++-
> >  arch/x86/include/asm/mmu_context.h | 13 +++++++++++++
> >  arch/x86/kernel/sys_x86_64.c       |  5 +++--
> >  arch/x86/mm/hugetlbpage.c          |  6 ++++--
> >  arch/x86/mm/mmap.c                 |  9 ++++++++-
> >  5 files changed, 30 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> > index 29fea180a665..53b96b0c8cc3 100644
> > --- a/arch/x86/include/asm/elf.h
> > +++ b/arch/x86/include/asm/elf.h
> > @@ -328,7 +328,8 @@ static inline int mmap_is_ia32(void)
> >  extern unsigned long task_size_32bit(void);
> >  extern unsigned long task_size_64bit(int full_addr_space);
> >  extern unsigned long get_mmap_base(int is_legacy);
> > -extern bool mmap_address_hint_valid(unsigned long addr, unsigned long len);
> > +extern bool mmap_address_hint_valid(struct mm_struct *mm,
> > +				    unsigned long addr, unsigned long len);
> >  extern unsigned long get_sigframe_size(void);
> >  
> >  #ifdef CONFIG_X86_32
> > diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> > index 27516046117a..c8a6d80dfec3 100644
> > --- a/arch/x86/include/asm/mmu_context.h
> > +++ b/arch/x86/include/asm/mmu_context.h
> > @@ -218,6 +218,19 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
> >  
> >  unsigned long __get_current_cr3_fast(void);
> >  
> > +#ifdef CONFIG_X86_5LEVEL
> > +static inline bool full_va_allowed(struct mm_struct *mm)
> > +{
> > +	/* LAM_U48 steals VA bits abouve 47-bit for tags */
> > +	return mm->context.lam != LAM_U48;
> > +}
> > +#else
> 
> This is called from X86 common code but appears to be LAM-specific.
> What would mm->context.lam contain if X86_FEATURE_LAM isn't set?

0. So full_va_allowed() will always return true.
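
For reference, the !CONFIG_X86_5LEVEL branch that the quote above cuts off at
"#else" is just a trivial stub along these lines (a sketch; see the patch for
the exact hunk):

static inline bool full_va_allowed(struct mm_struct *mm)
{
	/* Without 5-level paging nothing can be mapped above 47-bit anyway. */
	return true;
}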

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2022-05-18 17:28 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-11  2:27 [RFCv2 00/10] Linear Address Masking enabling Kirill A. Shutemov
2022-05-11  2:27 ` [PATCH] x86: Implement Linear Address Masking support Kirill A. Shutemov
2022-05-12 13:01   ` David Laight
2022-05-12 14:07     ` Matthew Wilcox
2022-05-12 15:06       ` Thomas Gleixner
2022-05-12 15:33         ` David Laight
2022-05-12 14:35     ` Peter Zijlstra
2022-05-12 17:00     ` Kirill A. Shutemov
2022-05-11  2:27 ` [RFCv2 01/10] x86/mm: Fix CR3_ADDR_MASK Kirill A. Shutemov
2022-05-11  2:27 ` [RFCv2 02/10] x86: CPUID and CR3/CR4 flags for Linear Address Masking Kirill A. Shutemov
2022-05-11  2:27 ` [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features Kirill A. Shutemov
2022-05-12 12:02   ` Thomas Gleixner
2022-05-12 12:04     ` [PATCH] x86/prctl: Remove pointless task argument Thomas Gleixner
2022-05-13 12:30       ` [tip: x86/cleanups] " tip-bot2 for Thomas Gleixner
2022-05-13 14:09   ` [RFCv2 03/10] x86: Introduce userspace API to handle per-thread features Alexander Potapenko
2022-05-13 17:34     ` Edgecombe, Rick P
2022-05-13 23:09       ` Kirill A. Shutemov
2022-05-13 23:50         ` Edgecombe, Rick P
2022-05-14  8:37           ` Thomas Gleixner
2022-05-14 23:06             ` Edgecombe, Rick P
2022-05-15  9:02               ` Thomas Gleixner
2022-05-15 18:24                 ` Edgecombe, Rick P
2022-05-15 19:38                   ` Thomas Gleixner
2022-05-15 22:01                     ` Edgecombe, Rick P
2022-05-11  2:27 ` [RFCv2 04/10] x86/mm: Introduce X86_THREAD_LAM_U48 and X86_THREAD_LAM_U57 Kirill A. Shutemov
2022-05-11  7:02   ` Peter Zijlstra
2022-05-12 12:24     ` Thomas Gleixner
2022-05-12 14:37       ` Peter Zijlstra
2022-05-11  2:27 ` [RFCv2 05/10] x86/mm: Provide untagged_addr() helper Kirill A. Shutemov
2022-05-11  7:21   ` Peter Zijlstra
2022-05-11  7:45     ` Peter Zijlstra
2022-05-12 13:06   ` Thomas Gleixner
2022-05-12 14:23     ` Peter Zijlstra
2022-05-12 15:16       ` Thomas Gleixner
2022-05-12 23:14         ` Thomas Gleixner
2022-05-13 10:14           ` David Laight
2022-05-11  2:27 ` [RFCv2 06/10] x86/uaccess: Remove tags from the address before checking Kirill A. Shutemov
2022-05-12 13:02   ` David Laight
2022-05-11  2:27 ` [RFCv2 07/10] x86/mm: Handle tagged memory accesses from kernel threads Kirill A. Shutemov
2022-05-11  7:23   ` Peter Zijlstra
2022-05-12 13:30   ` Thomas Gleixner
2022-05-11  2:27 ` [RFCv2 08/10] x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive Kirill A. Shutemov
2022-05-12 13:36   ` Thomas Gleixner
2022-05-13 23:22     ` Kirill A. Shutemov
2022-05-14  8:37       ` Thomas Gleixner
2022-05-18  8:43   ` Bharata B Rao
2022-05-18 17:08     ` Kirill A. Shutemov
2022-05-11  2:27 ` [RFCv2 09/10] x86/mm: Add userspace API to enable Linear Address Masking Kirill A. Shutemov
2022-05-11  7:26   ` Peter Zijlstra
2022-05-12 14:46     ` Thomas Gleixner
2022-05-11 14:15   ` H.J. Lu
2022-05-12 14:21     ` Thomas Gleixner
2022-05-11  2:27 ` [RFCv2 10/10] x86: Expose thread features status in /proc/$PID/arch_status Kirill A. Shutemov
2022-05-11  6:49 ` [RFCv2 00/10] Linear Address Masking enabling Peter Zijlstra
2022-05-12 15:42   ` Thomas Gleixner
2022-05-12 16:56     ` Kirill A. Shutemov
2022-05-12 19:31       ` Thomas Gleixner
2022-05-12 23:21         ` Thomas Gleixner
2022-05-12 17:22   ` Dave Hansen
2022-05-12 19:39     ` Thomas Gleixner
2022-05-12 21:24       ` Thomas Gleixner
2022-05-13 14:43         ` Matthew Wilcox
2022-05-13 22:59         ` Kirill A. Shutemov
2022-05-12 21:51       ` Dave Hansen
2022-05-12 22:10         ` H.J. Lu
2022-05-12 23:35           ` Thomas Gleixner
2022-05-13  0:08             ` H.J. Lu
2022-05-13  0:46               ` Dave Hansen
2022-05-13  1:27                 ` Thomas Gleixner
2022-05-13  3:05                   ` Dave Hansen
2022-05-13  8:28                     ` Thomas Gleixner
2022-05-13 22:48                     ` Kirill A. Shutemov
2022-05-13  9:14                   ` Catalin Marinas
2022-05-13  9:26                     ` Thomas Gleixner
2022-05-13  0:46               ` Thomas Gleixner
2022-05-13 11:07         ` Alexander Potapenko
2022-05-13 11:28           ` David Laight
2022-05-13 12:26             ` Alexander Potapenko
2022-05-13 14:26               ` David Laight
2022-05-13 15:28                 ` Alexander Potapenko
2022-05-13 23:01           ` Kirill A. Shutemov
2022-05-14 10:00             ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).