* [GIT PULL] x86/mm for 6.4
@ 2023-04-27 22:56 Dave Hansen
  2023-04-28 17:23 ` pr-tracker-bot
  2023-04-28 20:07 ` Linus Torvalds
  0 siblings, 2 replies; 28+ messages in thread

From: Dave Hansen @ 2023-04-27 22:56 UTC (permalink / raw)
To: torvalds; +Cc: x86, linux-kernel, kirill.shutemov, Dave Hansen

Hi Linus,

Please pull some x86/mm changes for 6.4.  The only content here is a new
revision of Kirill's Linear Address Masking implementation.  You had some
concerns[1] with the last version and the (lack of) locking while the
feature was being enabled in multithreaded programs.  It has been fixed up
since then to simply only allow LAM enabling while the process is
single-threaded.  It has also accumulated a few random fixes and cleanups
since then.

This conflicts with the shadow stack work (x86/shstk) that I sent earlier
this week.  This is no surprise since LAM and shadow stacks both add
prctl()s and selftests and touch the same headers.  Despite there being a
few sites of conflict, the merge is logically straightforward.  I've
included a suggested resolution.  Both Kirill (LAM) and Rick (shadow
stacks) tested the result and confirmed that nothing broke.

1. https://lore.kernel.org/all/CAHk-=wi=TY3Kte5Z1_nvfcsEh+rcz86pYnzeASw=pbG9QtpJEQ@mail.gmail.com/

--

The following changes since commit eeac8ede17557680855031c6f305ece2378af326:

  Linux 6.3-rc2 (2023-03-12 16:36:44 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.4

for you to fetch changes up to 97740266de26e5dfe6e4fbecacb6995b66c2e378:

  x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside (2023-04-06 13:45:06 -0700)

----------------------------------------------------------------
Add support for the new Linear Address Masking CPU feature.  This is
similar to ARM's Top Byte Ignore and allows userspace to store metadata
in some bits of pointers without masking it out before use.
----------------------------------------------------------------
Kirill A. Shutemov (14):
      x86/mm: Rework address range check in get_user() and put_user()
      x86: Allow atomic MM_CONTEXT flags setting
      x86: CPUID and CR3/CR4 flags for Linear Address Masking
      x86/mm: Handle LAM on context switch
      mm: Introduce untagged_addr_remote()
      x86/uaccess: Provide untagged_addr() and remove tags before address check
      x86/mm: Reduce untagged_addr() overhead for systems without LAM
      x86/mm: Provide arch_prctl() interface for LAM
      mm: Expose untagging mask in /proc/$PID/status
      iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
      x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
      selftests/x86/lam: Add test cases for LAM vs thread creation
      x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA
      x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside

Weihong Zhang (5):
      selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
      selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
      selftests/x86/lam: Add io_uring test cases for linear-address masking
      selftests/x86/lam: Add inherit test cases for linear-address masking
      selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking

 arch/arm64/include/asm/mmu_context.h        |    6 +
 arch/sparc/include/asm/mmu_context_64.h     |    6 +
 arch/sparc/include/asm/uaccess_64.h         |    2 +
 arch/x86/Kconfig                            |   11 +
 arch/x86/entry/vsyscall/vsyscall_64.c       |    2 +-
 arch/x86/include/asm/cpufeatures.h          |    1 +
 arch/x86/include/asm/disabled-features.h    |    8 +-
 arch/x86/include/asm/mmu.h                  |   18 +-
 arch/x86/include/asm/mmu_context.h          |   49 +-
 arch/x86/include/asm/processor-flags.h      |    2 +
 arch/x86/include/asm/tlbflush.h             |   48 +-
 arch/x86/include/asm/uaccess.h              |   58 +-
 arch/x86/include/uapi/asm/prctl.h           |    5 +
 arch/x86/include/uapi/asm/processor-flags.h |    6 +
 arch/x86/kernel/process.c                   |    6 +
 arch/x86/kernel/process_64.c                |   68 +-
 arch/x86/kernel/traps.c                     |    6 +-
 arch/x86/lib/getuser.S                      |   83 +-
 arch/x86/lib/putuser.S                      |   54 +-
 arch/x86/mm/init.c                          |    5 +
 arch/x86/mm/tlb.c                           |   53 +-
 drivers/iommu/iommu-sva.c                   |    8 +-
 drivers/vfio/vfio_iommu_type1.c             |    2 +-
 fs/proc/array.c                             |    7 +
 fs/proc/task_mmu.c                          |    9 +-
 include/linux/ioasid.h                      |    9 -
 include/linux/mm.h                          |   11 -
 include/linux/mmu_context.h                 |   14 +
 include/linux/sched/mm.h                    |    8 +-
 include/linux/uaccess.h                     |   22 +
 mm/gup.c                                    |    4 +-
 mm/madvise.c                                |    5 +-
 mm/migrate.c                                |   11 +-
 tools/testing/selftests/x86/Makefile        |    2 +-
 tools/testing/selftests/x86/lam.c           | 1241 +++++++++++++++++++++++++++
 35 files changed, 1701 insertions(+), 149 deletions(-)
 create mode 100644 tools/testing/selftests/x86/lam.c

diff --cc arch/x86/include/asm/disabled-features.h
index fafe9be7a6f4,652e366b68a0..702d93fdd10e
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@@ -120,8 -126,8 +132,8 @@@
  #define DISABLED_MASK9	(DISABLE_SGX)
  #define DISABLED_MASK10	0
  #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
- 			 DISABLE_CALL_DEPTH_TRACKING)
+ 			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
 -#define DISABLED_MASK12	0
 +#define DISABLED_MASK12	(DISABLE_LAM)
  #define DISABLED_MASK13	0
  #define DISABLED_MASK14	0
  #define DISABLED_MASK15	0

diff --cc arch/x86/include/uapi/asm/prctl.h
index eb290d89cb32,1b85bc876c2d..abe3fe6db6d2
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@@ -20,9 -20,16 +20,21 @@@
  #define ARCH_MAP_VDSO_32	0x2002
  #define ARCH_MAP_VDSO_64	0x2003
  
+ /* Don't use 0x3001-0x3004 because of old glibcs */
+ 
 +#define ARCH_GET_UNTAG_MASK		0x4001
 +#define ARCH_ENABLE_TAGGED_ADDR		0x4002
 +#define ARCH_GET_MAX_TAG_BITS		0x4003
 +#define ARCH_FORCE_TAGGED_SVA		0x4004
 +
+ #define ARCH_SHSTK_ENABLE		0x5001
+ #define ARCH_SHSTK_DISABLE		0x5002
+ #define ARCH_SHSTK_LOCK			0x5003
+ #define ARCH_SHSTK_UNLOCK		0x5004
+ #define ARCH_SHSTK_STATUS		0x5005
+ 
+ /* ARCH_SHSTK_ features bits */
+ #define ARCH_SHSTK_SHSTK		(1ULL << 0)
+ #define ARCH_SHSTK_WRSS			(1ULL << 1)
+ 
  #endif /* _ASM_X86_PRCTL_H */

diff --cc arch/x86/kernel/process.c
index 50d950771371,8bf13cff0141..7157b09d1cbf
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@@ -48,7 -48,7 +48,8 @@@
  #include <asm/frame.h>
  #include <asm/unwind.h>
  #include <asm/tdx.h>
 +#include <asm/mmu_context.h>
+ #include <asm/shstk.h>
  
  #include "process.h"

diff --cc arch/x86/kernel/process_64.c
index b46924c9e46d,31241930b60c..74c7e84a94d8
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@@ -875,22 -831,13 +877,28 @@@ long do_arch_prctl_64(struct task_struc
  # endif
  	case ARCH_MAP_VDSO_64:
  		return prctl_map_vdso(&vdso_image_64, arg2);
 +#endif
 +#ifdef CONFIG_ADDRESS_MASKING
 +	case ARCH_GET_UNTAG_MASK:
 +		return put_user(task->mm->context.untag_mask,
 +				(unsigned long __user *)arg2);
 +	case ARCH_ENABLE_TAGGED_ADDR:
 +		return prctl_enable_tagged_addr(task->mm, arg2);
 +	case ARCH_FORCE_TAGGED_SVA:
 +		set_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &task->mm->context.flags);
 +		return 0;
 +	case ARCH_GET_MAX_TAG_BITS:
 +		if (!cpu_feature_enabled(X86_FEATURE_LAM))
 +			return put_user(0, (unsigned long __user *)arg2);
 +		else
 +			return put_user(LAM_U57_BITS, (unsigned long __user *)arg2);
  #endif
+ 	case ARCH_SHSTK_ENABLE:
+ 	case ARCH_SHSTK_DISABLE:
+ 	case ARCH_SHSTK_LOCK:
+ 	case ARCH_SHSTK_UNLOCK:
+ 	case ARCH_SHSTK_STATUS:
+ 		return shstk_prctl(task, option, arg2);
  	default:
  		ret = -EINVAL;
  		break;

diff --cc fs/proc/array.c
index 6daea628bc76,3e1a33dcd0d0..a880c4e44752
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@@ -424,11 -423,11 +424,16 @@@ static inline void task_thp_status(stru
  	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
  }
  
 +static inline void task_untag_mask(struct seq_file *m, struct mm_struct *mm)
 +{
 +	seq_printf(m, "untag_mask:\t%#lx\n", mm_untag_mask(mm));
 +}
 +
+ __weak void arch_proc_pid_thread_features(struct seq_file *m,
+ 					  struct task_struct *task)
+ {
+ }
+ 
  int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
  			struct pid *pid, struct task_struct *task)
  {

diff --cc tools/testing/selftests/x86/Makefile
index 598135d3162b,cfc8a26ad151..fa2216a8c0d5
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@@ -18,7 -18,7 +18,7 @@@ TARGETS_C_32BIT_ONLY := entry_from_vm8
  			test_FCMOV test_FCOMI test_FISTTP \
  			vdso_restorer
  TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
 -			corrupt_xstate_header amx lam
- 			corrupt_xstate_header amx test_shadow_stack
++			corrupt_xstate_header amx test_shadow_stack lam
  
  # Some selftests require 32bit support enabled also on 64bit systems
  TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4
  2023-04-27 22:56 [GIT PULL] x86/mm for 6.4 Dave Hansen
@ 2023-04-28 17:23 ` pr-tracker-bot
  2023-04-28 20:07 ` Linus Torvalds
  1 sibling, 0 replies; 28+ messages in thread

From: pr-tracker-bot @ 2023-04-28 17:23 UTC (permalink / raw)
To: Dave Hansen; +Cc: torvalds, x86, linux-kernel, kirill.shutemov, Dave Hansen

The pull request you sent on Thu, 27 Apr 2023 15:56:47 -0700:

> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.4

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/22b8cc3e78f5448b4c5df00303817a9137cd663f

Thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
* Re: [GIT PULL] x86/mm for 6.4
  2023-04-27 22:56 [GIT PULL] x86/mm for 6.4 Dave Hansen
  2023-04-28 17:23 ` pr-tracker-bot
@ 2023-04-28 20:07 ` Linus Torvalds
  2023-04-28 20:15   ` Linus Torvalds
  2023-06-16  8:47   ` Alexander Potapenko
  1 sibling, 2 replies; 28+ messages in thread

From: Linus Torvalds @ 2023-04-28 20:07 UTC (permalink / raw)
To: Dave Hansen; +Cc: x86, linux-kernel, kirill.shutemov, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 861 bytes --]

On Thu, Apr 27, 2023 at 3:57 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> Please pull some x86/mm changes for 6.4.  The only content here is
> solely a new revision of Kirill's Linear Address Masking implementation.

So I was waiting for this for my final piece of the x86 user copy
changes, so here goes...

I think we should now make 'access_ok()' do the same thing that
get/put_user() do with the LAM code: only worry about the sign bit.

So here's my suggested change on top of the current tree.  Comments?

PeterZ also added to the cc, because he's the source of that
WARN_ON_IN_IRQ() in the x86 'access_ok()' macro.  That's the only
reason x86 has its own copy of that.

I wonder if that WARN_ON_IN_IRQ() should just be removed, or perhaps
moved into the generic code in <asm-generic/access_ok.h>?

             Linus

[-- Attachment #2: 0001-x86-64-make-access_ok-independent-of-LAM.patch --]
[-- Type: text/x-patch, Size: 8169 bytes --]

From 2c283a649b07d180d5a7bcfaf539365297e82c23 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 28 Apr 2023 12:55:10 -0700
Subject: [PATCH] x86-64: make access_ok() independent of LAM

The linear address masking (LAM) code made access_ok() more complicated,
in that it now needs to untag the address in order to verify the access
range.  See commit 74c228d20a51 ("x86/uaccess: Provide untagged_addr()
and remove tags before address check").
We were able to avoid that overhead in the get_user/put_user code paths
by simply using the sign bit for the address check, and depending on the
GP fault if the address was non-canonical, which made it all independent
of LAM.

And we can do the same thing for access_ok(): simply check that the user
pointer range has the high bit clear.  No need to bother with any
address bit masking.

In fact, we can go a bit further, and just check the starting address
for known small access ranges: any accesses that overflow will still be
in the non-canonical area and will still GP fault.

To still make syzkaller catch any potentially unchecked user addresses,
we'll continue to warn about GP faults that are caused by accesses in
the non-canonical range.  But we'll limit that to purely "high bit set
and past the one-page 'slop' area".

We could probably just do that "check only starting address" for any
arbitrary range size: realistically all kernel accesses to user space
will be done starting at the low address.  But let's leave that kind of
optimization for later.  As it is, this already allows us to generate
simpler code and not worry about any tag bits in the address.

The one thing to look out for is the GUP address check: instead of
actually copying data in the virtual address range (and thus bad
addresses being caught by the GP fault), GUP will look up the page
tables manually.  As a result, the page table limits need to be
checked, and that was previously implicitly done by the access_ok().

With the relaxed access_ok() check, we need to just do an explicit
check for TASK_SIZE_MAX in the GUP code instead.  The GUP code already
needs to do the tag bit unmasking anyway, so this is all very
straightforward, and there are no LAM issues.

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/x86/include/asm/uaccess.h | 39 +++++++++++++++++++++++++++++----
 arch/x86/mm/extable.c          | 40 +++++++++++++++++++++++++++++-----
 mm/gup.c                       |  2 ++
 3 files changed, 71 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 457e814712af..123135d60f72 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -75,6 +75,34 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 #define untagged_addr(addr)	(addr)
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * On x86-64, we may have tag bits in the user pointer.  Rather than
+ * mask them off, just change the rules for __access_ok().
+ *
+ * Make the rule be that 'ptr+size' must not overflow, and must not
+ * have the high bit set.  Compilers generally understand about
+ * unsigned overflow and the CF bit and generate reasonable code for
+ * this.  Although it looks like the combination confuses at least
+ * clang (and instead of just doing an "add" followed by a test of
+ * SF and CF, you'll see that unnecessary comparison).
+ *
+ * For the common case of small sizes that can be checked at compile
+ * time, don't even bother with the addition, and just check that the
+ * base pointer is ok.
+ */
+static inline bool __access_ok(const void __user *ptr, unsigned long size)
+{
+	if (__builtin_constant_p(size <= PAGE_SIZE) && size <= PAGE_SIZE) {
+		return (long)ptr >= 0;
+	} else {
+		unsigned long sum = size + (unsigned long)ptr;
+		return (long) sum >= 0 && sum >= (unsigned long)ptr;
+	}
+}
+#define __access_ok __access_ok
+#endif
+
 /**
  * access_ok - Checks if a user space pointer is valid
  * @addr: User space pointer to start of block to check
@@ -91,11 +119,14 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
  *
  * Return: true (nonzero) if the memory block may be valid, false (zero)
  * if it is definitely invalid.
+ *
+ * This should not be x86-specific.  The only odd thing out here is
+ * the WARN_ON_IN_IRQ(), which doesn't exist in the generic version.
  */
-#define access_ok(addr, size)					\
-({								\
-	WARN_ON_IN_IRQ();					\
-	likely(__access_ok(untagged_addr(addr), size));		\
+#define access_ok(addr, size)					\
+({								\
+	WARN_ON_IN_IRQ();					\
+	likely(__access_ok(addr, size));			\
 })
 
 #include <asm-generic/access_ok.h>
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 60814e110a54..8d38dedadbb1 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -130,10 +130,36 @@ static bool ex_handler_fprestore(const struct exception_table_entry *fixup,
 	return true;
 }
 
+/*
+ * On x86-64, we end up being imprecise with 'access_ok()', and allow
+ * non-canonical user addresses to make the range comparisons simpler,
+ * and to not have to worry about LAM being enabled.
+ *
+ * In fact, we allow up to one page of "slop" at the sign boundary,
+ * which means that we can do access_ok() by just checking the sign
+ * of the pointer for the common case of having a small access size.
+ */
+static bool gp_fault_address_ok(unsigned long fault_address)
+{
+#ifdef CONFIG_X86_64
+	/* Is it in the "user space" part of the non-canonical space? */
+	if ((long) fault_address >= 0)
+		return true;
+
+	/* .. or just above it? */
+	fault_address -= PAGE_SIZE;
+	if ((long) fault_address >= 0)
+		return true;
+#endif
+	return false;
+}
+
 static bool ex_handler_uaccess(const struct exception_table_entry *fixup,
-			       struct pt_regs *regs, int trapnr)
+			       struct pt_regs *regs, int trapnr,
+			       unsigned long fault_address)
 {
-	WARN_ONCE(trapnr == X86_TRAP_GP, "General protection fault in user access. Non-canonical address?");
+	WARN_ONCE(trapnr == X86_TRAP_GP && !gp_fault_address_ok(fault_address),
+		  "General protection fault in user access. Non-canonical address?");
 	return ex_handler_default(fixup, regs);
 }
 
@@ -189,10 +215,12 @@ static bool ex_handler_imm_reg(const struct exception_table_entry *fixup,
 }
 
 static bool ex_handler_ucopy_len(const struct exception_table_entry *fixup,
-				 struct pt_regs *regs, int trapnr, int reg, int imm)
+				 struct pt_regs *regs, int trapnr,
+				 unsigned long fault_address,
+				 int reg, int imm)
 {
 	regs->cx = imm * regs->cx + *pt_regs_nr(regs, reg);
-	return ex_handler_uaccess(fixup, regs, trapnr);
+	return ex_handler_uaccess(fixup, regs, trapnr, fault_address);
 }
 
 int ex_get_fixup_type(unsigned long ip)
@@ -238,7 +266,7 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
 	case EX_TYPE_FAULT_MCE_SAFE:
 		return ex_handler_fault(e, regs, trapnr);
 	case EX_TYPE_UACCESS:
-		return ex_handler_uaccess(e, regs, trapnr);
+		return ex_handler_uaccess(e, regs, trapnr, fault_addr);
 	case EX_TYPE_COPY:
 		return ex_handler_copy(e, regs, trapnr);
 	case EX_TYPE_CLEAR_FS:
@@ -269,7 +297,7 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
 	case EX_TYPE_FAULT_SGX:
 		return ex_handler_sgx(e, regs, trapnr);
 	case EX_TYPE_UCOPY_LEN:
-		return ex_handler_ucopy_len(e, regs, trapnr, reg, imm);
+		return ex_handler_ucopy_len(e, regs, trapnr, fault_addr, reg, imm);
 	case EX_TYPE_ZEROPAD:
 		return ex_handler_zeropad(e, regs, fault_addr);
 	}
diff --git a/mm/gup.c b/mm/gup.c
index ff689c88a357..bbe416236593 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2970,6 +2970,8 @@ static int internal_get_user_pages_fast(unsigned long start,
 	len = nr_pages << PAGE_SHIFT;
 	if (check_add_overflow(start, len, &end))
 		return 0;
+	if (end > TASK_SIZE_MAX)
+		return -EFAULT;
 	if (unlikely(!access_ok((void __user *)start, len)))
 		return -EFAULT;
-- 
2.39.1.437.g88ae5c7f81
* Re: [GIT PULL] x86/mm for 6.4
  2023-04-28 20:07 ` Linus Torvalds
@ 2023-04-28 20:15   ` Linus Torvalds
  2023-04-29  0:38     ` Kirill A. Shutemov
  2023-06-16  8:47   ` Alexander Potapenko
  1 sibling, 1 reply; 28+ messages in thread

From: Linus Torvalds @ 2023-04-28 20:15 UTC (permalink / raw)
To: Dave Hansen; +Cc: x86, linux-kernel, kirill.shutemov, Peter Zijlstra

On Fri, Apr 28, 2023 at 1:07 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So here's my suggested change on top of the current tree.  Comments?

Oh, and I wanted to particularly mention that

    We could probably just do that "check only starting address" for any
    arbitrary range size: realistically all kernel accesses to user space
    will be done starting at the low address.  But let's leave that kind of
    optimization for later.  As it is, this already allows us to generate
    simpler code and not worry about any tag bits in the address.

part of the commit log.

Right now that patch only simplifies the range check for when the
compiler statically knows that the range is small (which does happen,
but not very often, because 'copy_to/from_user()' isn't inlined on
x86-64, so the compiler doesn't actually see the constant size case
that is very common there).

However, that "check the end of the range" is sometimes actually
fairly complicated code, and it would be nice to drop that entirely.

See for example the fs/readdir.c case, where the length of the access
is kind of annoying:

	if (!user_write_access_begin(dirent,
			(unsigned long)(dirent->d_name + namlen + 1) -
			(unsigned long)dirent))
		goto efault;

and there really isn't any actual reason to check the end of the
access on x86: if the beginning address has the high bit clear, it
doesn't really matter what the end is, because we'll either have a
good area, or we'll fault in the non-canonical area even if the sign
changes.

So being careful about the range is kind of annoying, when we don't
really need it.

             Linus
* Re: [GIT PULL] x86/mm for 6.4
  2023-04-28 20:15 ` Linus Torvalds
@ 2023-04-29  0:38   ` Kirill A. Shutemov
  2023-04-29  1:04     ` Linus Torvalds
  2023-05-02 15:42     ` Dave Hansen
  0 siblings, 2 replies; 28+ messages in thread

From: Kirill A. Shutemov @ 2023-04-29 0:38 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra

On Fri, Apr 28, 2023 at 01:15:33PM -0700, Linus Torvalds wrote:
> On Fri, Apr 28, 2023 at 1:07 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So here's my suggested change on top of the current tree.  Comments?
> 
> Oh, and I wanted to particularly mention that
> 
>     We could probably just do that "check only starting address" for any
>     arbitrary range size: realistically all kernel accesses to user space
>     will be done starting at the low address.  But let's leave that kind of
>     optimization for later.  As it is, this already allows us to generate
>     simpler code and not worry about any tag bits in the address.
> 
> part of the commit log.
> 
> Right now that patch only simplifies the range check for when the
> compiler statically knows that the range is small (which does happen,
> but not very often, because 'copy_to/from_user()' isn't inlined on
> x86-64, so the compiler doesn't actually see the constant size case
> that is very common there).

BTW, I think the static check can be relaxed.  Checking size against
PAGE_SIZE is rather conservative: there's an 8 TB (or 4 PB for 5-level
paging) guard hole at the beginning of the kernel address space.

> However, that "check the end of the range" is sometimes actually
> fairly complicated code, and it would be nice to drop that entirely.
> 
> See for example the fs/readdir.c case, where the length of the access
> is kind of annoying:
> 
> 	if (!user_write_access_begin(dirent,
> 			(unsigned long)(dirent->d_name + namlen + 1) -
> 			(unsigned long)dirent))
> 		goto efault;
> 
> and there really isn't any actual reason to check the end of the
> access on x86: if the beginning address has the high bit clear, it
> doesn't really matter what the end is, because we'll either have a
> good area, or we'll fault in the non-canonical area even if the sign
> changes.
> 
> So being careful about the range is kind of annoying, when we don't
> really need it.

Hm.  Is there anybody who accesses high to low after the check (glibc
memcpy() bug flashbacks)?  Or not in any particular order?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
* Re: [GIT PULL] x86/mm for 6.4
  2023-04-29  0:38 ` Kirill A. Shutemov
@ 2023-04-29  1:04   ` Linus Torvalds
  2023-05-02 15:42   ` Dave Hansen
  1 sibling, 0 replies; 28+ messages in thread

From: Linus Torvalds @ 2023-04-29 1:04 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra

On Fri, Apr 28, 2023 at 5:38 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> BTW, I think the static check can be relaxed.  Checking size against
> PAGE_SIZE is rather conservative: there's an 8 TB (or 4 PB for 5-level
> paging) guard hole at the beginning of the kernel address space.

So I don't worry about the size per se - we just don't have any
constant-sized accesses that are bigger than a page.  The
constant-sized case is for things like structures being copied to user
space.

And having a big gap is nice for the syzkaller case, although I don't
think that GP fault has triggered lately (or ever, I don't remember).
Having random system call arguments that trigger "oh, this is in the
non-canonical region" is a good thing.

> > So being careful about the range is kind of annoying, when we don't
> > really need it.
>
> Hm.  Is there anybody who accesses high to low after the check (glibc
> memcpy() bug flashbacks)?  Or not in any particular order?

Yeah, I can't think of a single case, which is why it seems so silly
to even bother.

Almost all real-life cases end up being limited by things like the
page/folio size.  We do have exceptions, like module loading etc that
might copy a bigger area from user space, but no, we don't have any
backwards copies.

So you'd almost have to have some "access_ok()" followed by random
access with a user-controlled offset, and that seems nonsensical and
fundamentally impossible anyway.

But just because I can't think of it, and go "that would be insane"
doesn't mean that some driver ioctl interface might not try it.

Which is why I think having others look at it would be a good idea.

             Linus
* Re: [GIT PULL] x86/mm for 6.4
  2023-04-29  0:38 ` Kirill A. Shutemov
  2023-04-29  1:04   ` Linus Torvalds
@ 2023-05-02 15:42   ` Dave Hansen
  2023-05-02 16:00     ` Linus Torvalds
  1 sibling, 1 reply; 28+ messages in thread

From: Dave Hansen @ 2023-05-02 15:42 UTC (permalink / raw)
To: Kirill A. Shutemov, Linus Torvalds
Cc: Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 978 bytes --]

On 4/28/23 17:38, Kirill A. Shutemov wrote:
> BTW, I think the static check can be relaxed.  Checking size against
> PAGE_SIZE is rather conservative: there's an 8 TB (or 4 PB for 5-level
> paging) guard hole at the beginning of the kernel address space.

Whatever we relax it to, let's make sure we get a note in
Documentation/x86/x86_64/mm.rst.  But that's totally minor and we can
fix it up later.

Has anyone seen any actual code generation difference between:

	return (long)ptr >= 0;

and

	return !((unsigned long)ptr & (1UL<<(BITS_PER_LONG-1)));

?  I'm seeing gcc generate the same code for both the <=PAGE_SIZE side
and the 'sum' side.

It's longer, but I'd rather read the explicit "check bit 63" than the
positive/negative address space thing.  I certainly grok both, but have
to think through the "(long)ptr >= 0" check every time.

I guess it also wouldn't matter as much either if we hid it in a helper
like the attached patch and I didn't have to read it twice. ;)

[-- Attachment #2: ptr_in_user_half.patch --]
[-- Type: text/x-patch, Size: 2540 bytes --]

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 123135d60f72..7bb11d5a7f8f 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -77,15 +77,28 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 
 #ifdef CONFIG_X86_64
 /*
- * On x86-64, we may have tag bits in the user pointer.  Rather than
- * mask them off, just change the rules for __access_ok().
+ * The virtual address space is logically divided into a kernel
+ * half and a user half.  When cast to a signed type, user pointers
+ * are positive and kernel pointers are negative.
+ */
+static inline bool ptr_in_user_half(const void __user *ptr)
+{
+	return (long)ptr >= 0;
+}
+
+/*
+ * User pointers can have tag bits on x86-64.  This scheme tolerates
+ * arbitrary values in those bits rather than masking them off.
+ *
+ * Enforce two rules:
+ *  1. 'ptr' must be in the user half of the address space
+ *  2. 'ptr+size' must not overflow (back into the kernel half)
  *
- * Make the rule be that 'ptr+size' must not overflow, and must not
- * have the high bit set.  Compilers generally understand about
- * unsigned overflow and the CF bit and generate reasonable code for
- * this.  Although it looks like the combination confuses at least
- * clang (and instead of just doing an "add" followed by a test of
- * SF and CF, you'll see that unnecessary comparison).
+ * Compilers generally understand about unsigned overflow and the CF
+ * bit and generate reasonable code for this.  Although it looks like
+ * the combination confuses at least clang (and instead of just doing
+ * an "add" followed by a test of SF and CF, you'll see that
+ * unnecessary comparison).
  *
  * For the common case of small sizes that can be checked at compile
  * time, don't even bother with the addition, and just check that the
@@ -93,11 +106,16 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
  */
 static inline bool __access_ok(const void __user *ptr, unsigned long size)
 {
+	/*
+	 * Check only the pointer (not ptr+size) for small accesses.
+	 * This is OK because the kernel address space begins with a
+	 * >=PAGE_SIZE guard hole.
+	 */
 	if (__builtin_constant_p(size <= PAGE_SIZE) && size <= PAGE_SIZE) {
-		return (long)ptr >= 0;
+		return ptr_in_user_half(ptr);
 	} else {
 		unsigned long sum = size + (unsigned long)ptr;
-		return (long) sum >= 0 && sum >= (unsigned long)ptr;
+		return ptr_in_user_half(ptr) && sum >= (unsigned long)ptr;
 	}
 }
 #define __access_ok __access_ok
* Re: [GIT PULL] x86/mm for 6.4
  2023-05-02 15:42 ` Dave Hansen
@ 2023-05-02 16:00   ` Linus Torvalds
  2023-05-02 20:14     ` Linus Torvalds
  0 siblings, 1 reply; 28+ messages in thread

From: Linus Torvalds @ 2023-05-02 16:00 UTC (permalink / raw)
To: Dave Hansen
Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov, Peter Zijlstra

On Tue, May 2, 2023 at 8:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Has anyone seen any actual code generation difference between:
>
>         return (long)ptr >= 0;
>
> and
>
>         return !((unsigned long)ptr & (1UL<<(BITS_PER_LONG-1)));

No, as far as I know, they both generate the same code.

> It's longer, but I'd rather read the explicit "check bit 63" than the
> positive/negative address space thing.  I certainly grok both, but have
> to think through the "(long)ptr >= 0" check every time.

I'm very much the other way.  I think it's much clearer to say "check
the sign bit".

Doing the explicit bit check means that I have to look at what the bit
number is, and that is a much more complex expression.

In fact, I'd find it easier to read

	return !((unsigned long)ptr & (1UL << 63));

just because then you go "Oh, checking bit 63" without having to parse
the expression.  But even then I find the '!' is easy to miss, so you
really have to parse it.

But:

> I guess it also wouldn't matter as much either if we hid it in a helper
> like the attached patch and I didn't have to read it twice. ;)

Yeah, I think that's a good solution.

             Linus
* Re: [GIT PULL] x86/mm for 6.4
  2023-05-02 16:00 ` Linus Torvalds
@ 2023-05-02 20:14   ` Linus Torvalds
  2023-05-03  0:53     ` Dave Hansen
  2023-05-03  8:01     ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread

From: Linus Torvalds @ 2023-05-02 20:14 UTC (permalink / raw)
To: Dave Hansen
Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov, Peter Zijlstra

On Tue, May 2, 2023 at 9:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> > I guess it also wouldn't matter as much either if we hid it in a helper
> > like the attached patch and I didn't have to read it twice. ;)
>
> Yeah, I think that's a good solution.

Hmm.  And as I was rebasing the patch to fix up my patch, I realized
that the current -git top-of-tree state is actually broken.

That

	#define access_ok(addr, size)				\
	({							\
		WARN_ON_IN_IRQ();				\
		likely(__access_ok(untagged_addr(addr), size));	\
	})

is actually *wrong* in two ways.

Now, in my original patch, I added a comment about how that
"WARN_ON_IN_IRQ()" is bogus and this shouldn't be x86-specific at all.

I ended up going back in time to see why it was added, and I think it
was added because we used to access 'current' in access_ok(), due to
it using that user_addr_max() thing:

	likely(!__range_not_ok(addr, size, user_addr_max()));

but that was all removed by the set_fs() removal by Christoph Hellwig.

So there is actually nothing task-related in "access_ok()" any more,
and any warning about using it in the wrong context is just bogus.
That warning simply shouldn't exist any more (or maybe it should be in
a different place, like the actual user copy functions).

But that's actually not the problem with the above.

No, the problem is that probably *because* "access_ok()" has that
warning, not all users use "access_ok()" at all.  We have places that
use "__access_ok()" instead.  Like copy_from_nmi().

So now copy_from_nmi() doesn't do the untagging, so if you were to use
tagged pointers for the stack, you'd not get stack traces.

End result: I think that

 (a) that WARN_ON_IN_IRQ() is actively detrimental and causes problems

 (b) the current "use untagged_addr() in access_ok()" model is also
     broken and my patch - which was meant to just improve code
     generation - actually fixes this.

             Linus
* Re: [GIT PULL] x86/mm for 6.4 2023-05-02 20:14 ` Linus Torvalds @ 2023-05-03 0:53 ` Dave Hansen 2023-05-03 1:17 ` Linus Torvalds 2023-05-03 8:01 ` Peter Zijlstra 1 sibling, 1 reply; 28+ messages in thread From: Dave Hansen @ 2023-05-03 0:53 UTC (permalink / raw) To: Linus Torvalds Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra On 5/2/23 13:14, Linus Torvalds wrote: > No, the problem is that probably *because* "access_ok()" has that > warning, not all users use "access_ok()" at all. We have places that > use "__access_ok()" instead. Like copy_from_nmi(). > > So now copy_from_nmi() doesn't do the untagging, so if you were to use > tagged pointers for the stack, you'd not get stack traces. > > End result: I think that > > (a) that WARN_ON_IN_IRQ() is actively detrimental and causes problems > > (b) the current "use untagged_addr() in access_ok()" model is also broken Ugh, yes. The fallout seems limited to (probably) perf and tracing poking at user stack frames. But, yes, it definitely looks broken there. While I bet we could shoehorn the existing tlbstate checks into the __access_ok() sites, I also vastly prefer the high bit checks in access_ok() instead. The less state we have to consult, the better. Once the WARN_ON_IN_IRQ() is gone, it seems like it's just a matter of collapsing __access_ok() into access_ok() and converting the (~3) callers. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-05-03 0:53 ` Dave Hansen @ 2023-05-03 1:17 ` Linus Torvalds 2023-05-03 16:38 ` Linus Torvalds 0 siblings, 1 reply; 28+ messages in thread From: Linus Torvalds @ 2023-05-03 1:17 UTC (permalink / raw) To: Dave Hansen Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra On Tue, May 2, 2023 at 5:53 PM Dave Hansen <dave.hansen@intel.com> wrote: > > The fallout seems limited to (probably) perf and tracing poking at user > stack frames. But, yes, it definitely looks broken there. So I ended up doing more cleanups - moving the 64-bit specific code to 'uaccess_64.h' etc. And in the process I found another broken thing: __untagged_addr_remote() is very very buggy. The reason? long sign = addr >> 63; that does *not* do at all what '__untagged_addr()' does, because while 'sign' is a signed long, 'addr' is an *unsigned* long. So the actual shift ends up being done as an unsigned shift, and then just the result is assigned to a signed variable. End result? 'sign' ends up being 0 for user space (intentional) or 1 for kernel space (not intentional). It's not 0 or ~0 as in '__untagged_addr()', which uses a signed shift in the inline asm ('sar'). That said, while this is all horribly buggy, I'm not actually 100% sure that we care too much about untagging kernel addresses. So it's an error case that doesn't get marked as an error. So I guess in *practice* the horribly buggy code is actually just a small harmless bug, and makes an invalid (kernel) address that somebody passes in look like a possibly valid user address after all. HOWEVER. Why does it do that "shift-by-63" game there, instead of making tlbstate_untag_mask just have bit #63 always set? Untagging a kernel address is not a sensible operation, so the only thing you want is to *keep* a kernel address as a bad address. You don't have to actually keep it a *valid* kernel address, it just should not become a (possibly valid) user address after untagging. 
Hmmm? Am I missing something? Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
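The type bug described above can be reproduced in isolation. Here `buggy_sign()` mirrors the `long sign = addr >> 63` line from `__untagged_addr_remote()`, and `fixed_sign()` shows what the arithmetic-shift ('sar') version computes; the function names are made up for the demonstration:

```c
#include <assert.h>

/* Mirrors 'long sign = addr >> 63' with an unsigned 'addr': the shift
 * is a *logical* shift, so the result is 0 or 1 -- never the all-ones
 * mask the untagging code wants. */
static long buggy_sign(unsigned long addr)
{
	return (long)(addr >> 63);
}

/* Arithmetic (signed) shift, like the 'sar' in the inline-asm
 * __untagged_addr(): 0 for user addresses, ~0 for kernel addresses.
 * (Right-shifting a negative value is implementation-defined in ISO C,
 * but arithmetic on gcc/clang for x86-64.) */
static long fixed_sign(unsigned long addr)
{
	return (long)addr >> 63;
}
```

With the buggy version, `addr &= mask | sign` ORs in bit 0 instead of all bits for kernel addresses, which is why the sign games never preserved kernel pointers the way the code intended.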
* Re: [GIT PULL] x86/mm for 6.4 2023-05-03 1:17 ` Linus Torvalds @ 2023-05-03 16:38 ` Linus Torvalds 2023-05-03 16:44 ` Dave Hansen ` (2 more replies) 0 siblings, 3 replies; 28+ messages in thread From: Linus Torvalds @ 2023-05-03 16:38 UTC (permalink / raw) To: Dave Hansen Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra On Tue, May 2, 2023 at 6:17 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And in the process I found another broken > thing:__untagged_addr_remote() is very very buggy. > > The reason? > > long sign = addr >> 63; > > that does *not* do at all what '__untagged_addr()' does, because while > 'sign' is a signed long, 'addr' is an *unsigned* long. > > So the actual shift ends up being done as an unsigned shift, and then > just the result is assigned to a signed variable. > > End result? 'sign' ends up being 0 for user space (intentional) or 1 > for kernel space (not intentional).. Looking around, this same bug used to exists for the normal (non-remote) case too, until it was accidentally fixed when changing that to use inline asm and the alternatives code. At that point the non-remote case got an explicit 'sar' instruction, and the result really was ~0 for kernel mode addresses. > Why does it do that "shift-by-63" game there, instead of making > tlbstate_untag_mask just have bit #63 always set? And it turns out that bit #63 really _is_ always set, so I think the solution to this all is to remove the sign games in untag_addr() entirely. Untagging a kernel address will "corrupt" it, but it will stay a kernel address (well, it will stay a "high bit set" address), which is all we care about anyway. If somebody actually tries to untag a kernel address, that would be a bug anyway, as far as I can tell. So I'm going to just remove the 'sign' games entirely. 
They are completely broken in 'untagged_addr_remote()', they _used_ to be completely broken in 'untagged_addr()', and it looks like it's all unnecessary. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-05-03 16:38 ` Linus Torvalds @ 2023-05-03 16:44 ` Dave Hansen 2023-05-03 16:51 ` Linus Torvalds 2023-05-03 17:54 ` Linus Torvalds 2023-05-04 6:28 ` Kirill A. Shutemov 2 siblings, 1 reply; 28+ messages in thread From: Dave Hansen @ 2023-05-03 16:44 UTC (permalink / raw) To: Linus Torvalds Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra On 5/3/23 09:38, Linus Torvalds wrote: >> Why does it do that "shift-by-63" game there, instead of making >> tlbstate_untag_mask just have bit #63 always set? > And it turns out that bit #63 really _is_ always set, so I think the > solution to this all is to remove the sign games in untag_addr() > entirely. Yes, there are only two possible values right now, both of which have bit 63 set: LAM off: mm->context.untag_mask = -1UL; LAM on: mm->context.untag_mask = ~GENMASK(62, 57); > Untagging a kernel address will "corrupt" it, but it will stay a > kernel address (well, it will stay a "high bit set" address), which is > all we care about anyway. > > If somebody actually tries to untag a kernel address, that would be a > bug anyway, as far as I can tell. Is it a bug? The do_madvise() path, for instance, is passing a value in there that came right from userspace. > So I'm going to just remove the 'sign' games entirely. They are > completely broken in 'untagged_addr_remote()', they _used_ to be > completely broken in 'untagged_addr()', and it looks like it's all > unnecessary. Yes, it looks completely superfluous. ^ permalink raw reply [flat|nested] 28+ messages in thread
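Since both possible mask values have bit 63 set, dropping the sign games leaves untagging as a single AND. A userspace model of the LAM-on case quoted above, with `GENMASK()` open-coded since this is not kernel code:

```c
#include <assert.h>

/* ~GENMASK(62, 57): all bits set except the six LAM_U57 tag bits.
 * Bit 63 is set in the mask, so the AND can never clear it -- a
 * "high bit set" address stays a "high bit set" address. */
#define LAM_U57_UNTAG_MASK (~(((1UL << 6) - 1) << 57))

static unsigned long untag(unsigned long addr)
{
	return addr & LAM_U57_UNTAG_MASK;
}
```

The kernel-address result is "corrupted" (bits 62:57 are cleared) but remains invalid as a user pointer, which is the only property the thread argues actually matters.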
* Re: [GIT PULL] x86/mm for 6.4 2023-05-03 16:44 ` Dave Hansen @ 2023-05-03 16:51 ` Linus Torvalds 0 siblings, 0 replies; 28+ messages in thread From: Linus Torvalds @ 2023-05-03 16:51 UTC (permalink / raw) To: Dave Hansen Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra On Wed, May 3, 2023 at 9:45 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 5/3/23 09:38, Linus Torvalds wrote: > > If somebody actually tries to untag a kernel address, that would be a > > bug anyway, as far as I can tell. > > Is it a bug? The do_madvise() path, for instance, is passing a value in > there that came right from userspace. That's still a "user address" - just not a *valid* one. So we do not want to mask the high bit off - because that is what will catch people later doing things like vma address range comparisons on it and notice "that's not a valid address", but it's also not a "kernel address" that we need to preserve as such. So yeah, it's a bit confusing in that it's _also_ true that "kernel addresses have the high bit set" and "user addresses have the high bit clear", and I'm basically using two different semantics for "kernel address". IOW: what I mean by "it's not valid to do 'untagged_addr()' on a kernel address" is that you can't take a (valid) kernel address, do 'untagged_addr()' on it, and expect it to still work as a kernel address. But at the same time you *are* supposed to be able to use 'untagged_addr()' on a - untrusted and possibly invalid - user pointer, and it's supposed to end up having the tag bits clear and still be usable as a user pointer. And yet still also be caught by any validity checks (ie a high bit set would never be a valid user pointer, not even after doing 'untagged_addr()' on it). Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-05-03 16:38 ` Linus Torvalds 2023-05-03 16:44 ` Dave Hansen @ 2023-05-03 17:54 ` Linus Torvalds 2023-05-03 19:01 ` Peter Zijlstra 2023-05-04 6:28 ` Kirill A. Shutemov 2 siblings, 1 reply; 28+ messages in thread From: Linus Torvalds @ 2023-05-03 17:54 UTC (permalink / raw) To: Dave Hansen Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra On Wed, May 3, 2023 at 9:38 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So I'm going to just remove the 'sign' games entirely. They are > completely broken in 'untagged_addr_remote()', they _used_ to be > completely broken in 'untagged_addr()', and it looks like it's all > unnecessary. Ok, I've pushed out my changes to the 'x86-uaccess-cleanup' branch. I think it's all good, but it would be really nice to hear it's been tested on a setup that actually has LAM (simulator? or maybe there is actual hw inside Intel) And any other commentary is appreciated, Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-05-03 17:54 ` Linus Torvalds @ 2023-05-03 19:01 ` Peter Zijlstra 2023-05-03 19:19 ` Linus Torvalds 2023-05-05 19:56 ` Linus Torvalds 0 siblings, 2 replies; 28+ messages in thread From: Peter Zijlstra @ 2023-05-03 19:01 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Hansen, Kirill A. Shutemov, Dave Hansen, x86, linux-kernel, kirill.shutemov On Wed, May 03, 2023 at 10:54:38AM -0700, Linus Torvalds wrote: > On Wed, May 3, 2023 at 9:38 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > So I'm going to just remove the 'sign' games entirely. They are > > completely broken in 'untagged_addr_remote()', they _used_ to be > > completely broken in 'untagged_addr()', and it looks like it's all > > unnecessary. > > Ok, I've pushed out my changes to the 'x86-uaccess-cleanup' branch. > > I think it's all good, but it would be really nice to hear it's been > tested on a setup that actually has LAM (simulator? or maybe there is > actual hw inside Intel) > > And any other commentary is appreciated, Looks sane from a first reading. But I'll try and have another look tomorrow. Also, per how 2's complement is constructed, 0 has to be in the positive space for it to be 'half'. Also, typically 0 is included in N and the traditional 'positive integers' set is then written as N \ {0}, but that's somewhat modern and a lot of variation exists -- although typically books tend to specify where they stand on that issue. I suppose that's a very long-winded way of saying, that yes, of course 0 is a positive number :-) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-05-03 19:01 ` Peter Zijlstra @ 2023-05-03 19:19 ` Linus Torvalds 2023-05-05 19:56 ` Linus Torvalds 1 sibling, 0 replies; 28+ messages in thread From: Linus Torvalds @ 2023-05-03 19:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Hansen, Kirill A. Shutemov, Dave Hansen, x86, linux-kernel, kirill.shutemov On Wed, May 3, 2023 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote: > > I suppose that's a very long winded way of saying, that yes, ofcourse 0 > is a positive number :-) Well, I do consider it as such, but I guess I took all my math classes when people still considered negative, zero and positive to be three disjoint sets, and 'non-negative' was required for rigor. Some googling around says that a lot of people still think that, and that it might even be language-specific. I think the commit commentary about "ok, strictly non-negative" might still be relevant. At least to some people, and at least for sticklers. Also, I do consider 0 to be part of ℕ, although I wouldn't consider that to be an argument about "positive" at all. The argument for 0 in ℕ would be that without it, you don't have an identity element for addition, which would arguably make natural numbers kind of broken as a set. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-05-03 19:01 ` Peter Zijlstra 2023-05-03 19:19 ` Linus Torvalds @ 2023-05-05 19:56 ` Linus Torvalds 2023-05-05 20:59 ` Kirill A. Shutemov 1 sibling, 1 reply; 28+ messages in thread From: Linus Torvalds @ 2023-05-05 19:56 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Hansen, Kirill A. Shutemov, Dave Hansen, x86, linux-kernel, kirill.shutemov On Wed, May 3, 2023 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote: > > Looks sane from a first reading. But I'll try and have another look > tomorrow. Well, I didn't hear back any more, and I was merging architecture code, so I merged my branch too. It does fix at least two bugs, even if those bugs are not necessarily all that noticeable. I guess we'll see if it introduces new ones... Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-05-05 19:56 ` Linus Torvalds @ 2023-05-05 20:59 ` Kirill A. Shutemov 0 siblings, 0 replies; 28+ messages in thread From: Kirill A. Shutemov @ 2023-05-05 20:59 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Dave Hansen, Dave Hansen, x86, linux-kernel, kirill.shutemov On Fri, May 05, 2023 at 12:56:31PM -0700, Linus Torvalds wrote: > On Wed, May 3, 2023 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > Looks sane from a first reading. But I'll try and have another look > > tomorrow. > > Well, I didn't hear back any more, and I was merging architecture > code, so I merged my branch too. > > It does fix at least two bugs, even if those bugs are not necessarily > all that noticeable. > > I guess we'll see if it introduces new ones... The branch passed LAM selftests fine. -- Kiryl Shutsemau / Kirill A. Shutemov ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-05-03 16:38 ` Linus Torvalds 2023-05-03 16:44 ` Dave Hansen 2023-05-03 17:54 ` Linus Torvalds @ 2023-05-04 6:28 ` Kirill A. Shutemov 2023-05-04 15:25 ` Dave Hansen 2023-05-04 17:10 ` Linus Torvalds 2 siblings, 2 replies; 28+ messages in thread From: Kirill A. Shutemov @ 2023-05-04 6:28 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Hansen, Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra On Wed, May 03, 2023 at 09:38:03AM -0700, Linus Torvalds wrote: > > Why does it do that "shift-by-63" game there, instead of making > > tlbstate_untag_mask just have bit #63 always set? > > And it turns out that bit #63 really _is_ always set, so I think the > solution to this all is to remove the sign games in untag_addr() > entirely. Untagging a kernel pointer with LAM enabled will land it in the canonical hole, which is safe as it leads to #GP on dereference. > Untagging a kernel address will "corrupt" it, but it will stay a > kernel address (well, it will stay a "high bit set" address), which is > all we care about anyway. The interesting case to consider is untagging a kernel pointer when LAM_U48 is enabled (not part of the current LAM enabling). LAM_U48 would make the untagging mask wider -- ~GENMASK(62, 48). With 5-level paging and LAM_SUP enabled (also not supported yet), untagging a kernel pointer may transform it into another valid kernel pointer. So we cannot rely on #GP as a backstop here. The kernel has to exclude kernel pointers by other means. It can be fun to debug. Looks like you are not worried about this kind of corruption. Hm. Maybe it is worth indicating in the helper name that it is intended only for userspace addresses? Rename it to untagged_uaddr() or something? -- Kiryl Shutsemau / Kirill A. Shutemov ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-05-04 6:28 ` Kirill A. Shutemov @ 2023-05-04 15:25 ` Dave Hansen 2023-05-04 17:10 ` Kirill A. Shutemov 2023-05-04 17:10 ` Linus Torvalds 1 sibling, 1 reply; 28+ messages in thread From: Dave Hansen @ 2023-05-04 15:25 UTC (permalink / raw) To: Kirill A. Shutemov, Linus Torvalds Cc: Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra [-- Attachment #1: Type: text/plain, Size: 1937 bytes --] On 5/3/23 23:28, Kirill A. Shutemov wrote: >> Untagging a kernel address will "corrupt" it, but it will stay a >> kernel address (well, it will stay a "high bit set" address), which is >> all we care about anyway. > The interesting case to consider is untagging kernel pointer when LAM_U48 > is enabled (not part of current LAM enabling). LAM_U48 would make the > untagging mask wider -- ~GENMASK(62, 48). With 5-level paging and LAM_SUP > enabled (also not supported yet) untagging kernel may transform it to > other valid kernel pointer. > > So we cannot rely on #GP as backstop here. The kernel has to exclude > kernel pointer by other means. It can be fun to debug. Yeah, I have the feeling that we're really going to need a pair of untagging functions once we get to doing kernel LAM for a _bunch_ of reasons. Just as a practical matter, I think we should OR bits into the mask on the kernel side, effectively: unsigned long untag_kernel_addr(unsigned long addr) { return addr | kernel_untag_mask; } and kernel_untag_mask should have bit 63 *clear*. That way the pointers that have gone through untagging won't look silly. If you untag VMALLOC_END or something, it'll still look like the addresses we have in mm.rst. Also, it'll be impossible to have the mask turn a userspace address into a kernel one. Last, we can add some debugging in there, probably conditional on some mm debugging options like: if (WARN_ON_ONCE(!valid_user_address(addr))) return 0; It's kinda like "void __user *" versus "void *". 
The __user ones can *absolutely* point anywhere, user or kernel. That's why we can't WARN() in the untagged_addr() function that takes user pointers. But "void *" can only point to the kernel. It has totally different rules. We should probably also do something like the attached patch sooner rather than later. 'untag_mask' really is a misleading name for a mask that gets applied only to user addresses. [-- Attachment #2: lam.patch --] [-- Type: text/x-patch, Size: 6944 bytes --] diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h index 0da5c227f490..adee70ee398c 100644 --- a/arch/x86/include/asm/mmu.h +++ b/arch/x86/include/asm/mmu.h @@ -51,7 +51,7 @@ typedef struct { unsigned long lam_cr3_mask; /* Significant bits of the virtual address. Excludes tag bits. */ - u64 untag_mask; + u64 user_untag_mask; #endif struct mutex lock; diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 1d29dc791f5a..b4bb4b1a36e4 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -94,18 +94,18 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm) static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm) { mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask; - mm->context.untag_mask = oldmm->context.untag_mask; + mm->context.user_untag_mask = oldmm->context.user_untag_mask; } #define mm_untag_mask mm_untag_mask static inline unsigned long mm_untag_mask(struct mm_struct *mm) { - return mm->context.untag_mask; + return mm->context.user_untag_mask; } static inline void mm_reset_untag_mask(struct mm_struct *mm) { - mm->context.untag_mask = -1UL; + mm->context.user_untag_mask = -1UL; } #define arch_pgtable_dma_compat arch_pgtable_dma_compat diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 75bfaa421030..9fc6a2ac5318 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -55,11 +55,11 @@ static inline 
void cr4_clear_bits(unsigned long mask) } #ifdef CONFIG_ADDRESS_MASKING -DECLARE_PER_CPU(u64, tlbstate_untag_mask); +DECLARE_PER_CPU(u64, tlbstate_user_untag_mask); -static inline u64 current_untag_mask(void) +static inline u64 current_user_untag_mask(void) { - return this_cpu_read(tlbstate_untag_mask); + return this_cpu_read(tlbstate_user_untag_mask); } #endif @@ -389,7 +389,7 @@ static inline void set_tlbstate_lam_mode(struct mm_struct *mm) { this_cpu_write(cpu_tlbstate.lam, mm->context.lam_cr3_mask >> X86_CR3_LAM_U57_BIT); - this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask); + this_cpu_write(tlbstate_user_untag_mask, mm->context.user_untag_mask); } #else diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 123135d60f72..044d82e98b27 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -36,16 +36,16 @@ static inline unsigned long __untagged_addr(unsigned long addr) long sign; /* - * Refer tlbstate_untag_mask directly to avoid RIP-relative relocation - * in alternative instructions. The relocation gets wrong when gets - * copied to the target place. + * Refer tlbstate_user_untag_mask directly to avoid RIP-relative + * relocation in alternative instructions. The relocation gets wrong when + * gets copied to the target place. */ asm (ALTERNATIVE("", "sar $63, %[sign]\n\t" /* user_ptr ? 
0 : -1UL */ - "or %%gs:tlbstate_untag_mask, %[sign]\n\t" + "or %%gs:tlbstate_user_untag_mask, %[sign]\n\t" "and %[sign], %[addr]\n\t", X86_FEATURE_LAM) : [addr] "+r" (addr), [sign] "=r" (sign) - : "m" (tlbstate_untag_mask), "[sign]" (addr)); + : "m" (tlbstate_user_untag_mask), "[sign]" (addr)); return addr; } @@ -55,20 +55,20 @@ static inline unsigned long __untagged_addr(unsigned long addr) (__force __typeof__(addr))__untagged_addr(__addr); \ }) -static inline unsigned long __untagged_addr_remote(struct mm_struct *mm, - unsigned long addr) +static inline unsigned long __untag_remote_user_addr(struct mm_struct *mm, + unsigned long addr) { long sign = addr >> 63; mmap_assert_locked(mm); - addr &= (mm)->context.untag_mask | sign; + addr &= (mm)->context.user_untag_mask | sign; return addr; } -#define untagged_addr_remote(mm, addr) ({ \ +#define untag_remote_user_addr(mm, addr) ({ \ unsigned long __addr = (__force unsigned long)(addr); \ - (__force __typeof__(addr))__untagged_addr_remote(mm, __addr); \ + (__force __typeof__(addr))__untag_remote_user_addr(mm, __addr); \ }) #else diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 3d4dd9420c30..929152b28b8c 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -580,7 +580,7 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr, goto done; } - vaddr = untagged_addr_remote(mm, vaddr); + vaddr = untag_remote_user_addr(mm, vaddr); retry: vma = vma_lookup(mm, vaddr); diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 420510f6a545..7ccd3f341af4 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1692,7 +1692,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf, ret = mmap_read_lock_killable(mm); if (ret) goto out_free; - start_vaddr = untagged_addr_remote(mm, svpfn << PAGE_SHIFT); + start_vaddr = untag_remote_user_addr(mm, svpfn << PAGE_SHIFT); mmap_read_unlock(mm); } diff --git a/include/linux/uaccess.h 
b/include/linux/uaccess.h index 3064314f4832..8c956172252f 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -25,8 +25,8 @@ #define untagged_addr(addr) (addr) #endif -#ifndef untagged_addr_remote -#define untagged_addr_remote(mm, addr) ({ \ +#ifndef untag_remote_user_addr +#define untag_remote_user_addr(mm, addr) ({ \ mmap_assert_locked(mm); \ untagged_addr(addr); \ }) diff --git a/mm/gup.c b/mm/gup.c index bbe416236593..d22bad9147d7 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1085,7 +1085,7 @@ static long __get_user_pages(struct mm_struct *mm, if (!nr_pages) return 0; - start = untagged_addr_remote(mm, start); + start = untag_remote_user_addr(mm, start); VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN))); @@ -1259,7 +1259,7 @@ int fixup_user_fault(struct mm_struct *mm, struct vm_area_struct *vma; vm_fault_t ret; - address = untagged_addr_remote(mm, address); + address = untag_remote_user_addr(mm, address); if (unlocked) fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; diff --git a/mm/madvise.c b/mm/madvise.c index b5ffbaf616f5..ffb5d76a9481 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1421,7 +1421,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh mmap_read_lock(mm); } - start = untagged_addr_remote(mm, start); + start = untag_remote_user_addr(mm, start); end = start + len; blk_start_plug(&plug); diff --git a/mm/migrate.c b/mm/migrate.c index 01cac26a3127..7647c0a6ff30 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2109,7 +2109,7 @@ static int add_page_for_migration(struct mm_struct *mm, const void __user *p, bool isolated; mmap_read_lock(mm); - addr = (unsigned long)untagged_addr_remote(mm, p); + addr = (unsigned long)untag_remote_user_addr(mm, p); err = -EFAULT; vma = vma_lookup(mm, addr); ^ permalink raw reply related [flat|nested] 28+ messages in thread
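A userspace sketch of the OR-based kernel-side helper proposed in the mail above. Both the name and the exact tag-bit layout are hypothetical here (bits 62:57 stand in for whatever a future kernel LAM mode would use); the essential property is that bit 63 is clear in the mask:

```c
#include <assert.h>

/* Bit 63 deliberately clear, per the proposal: OR-ing this mask in
 * can restore a tagged kernel address to its canonical form, but can
 * never turn a user address (bit 63 clear) into a kernel one. */
static const unsigned long kernel_untag_mask = ((1UL << 6) - 1) << 57;

static unsigned long untag_kernel_addr(unsigned long addr)
{
	return addr | kernel_untag_mask;
}
```

Unlike the AND-based user-side untagging, this keeps untagged kernel pointers looking like the addresses documented in mm.rst instead of "silly" values with holes punched in the high bits.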
* Re: [GIT PULL] x86/mm for 6.4 2023-05-04 15:25 ` Dave Hansen @ 2023-05-04 17:10 ` Kirill A. Shutemov 0 siblings, 0 replies; 28+ messages in thread From: Kirill A. Shutemov @ 2023-05-04 17:10 UTC (permalink / raw) To: Dave Hansen Cc: Linus Torvalds, Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra On Thu, May 04, 2023 at 08:25:58AM -0700, Dave Hansen wrote: > On 5/3/23 23:28, Kirill A. Shutemov wrote: > >> Untagging a kernel address will "corrupt" it, but it will stay a > >> kernel address (well, it will stay a "high bit set" address), which is > >> all we care about anyway. > > The interesting case to consider is untagging kernel pointer when LAM_U48 > > is enabled (not part of current LAM enabling). LAM_U48 would make the > > untagging mask wider -- ~GENMASK(62, 48). With 5-level paging and LAM_SUP > > enabled (also not supported yet) untagging kernel may transform it to > > other valid kernel pointer. > > > > So we cannot rely on #GP as backstop here. The kernel has to exclude > > kernel pointer by other means. It can be fun to debug. > > Yeah, I have the feeling that we're really going to need a pair of > untagging functions once we get to doing kernel LAM for a _bunch_ of > reasons. There's already arch_kasan_reset_tag() used on ARM64 (and two more helpers to set/get tag). Don't see a reason to add new. > Just as a practical matter, I think we should OR bits into the mask on > the kernel side, effectively: > > unsigned long untag_kernel_addr(unsigned long addr) > { > return addr | kernel_untag_mask; > } > > and kernel_untag_mask should have bit 63 *clear*. > > That way the pointers that have gone through untagging won't look silly. > If you untag VMALLOC_END or something, it'll still look like the > addresses we have in mm.rst. > > Also, it'll be impossible to have the mask turn a userspace address into > a kernel one. 
> > Last, we can add some debugging in there, probably conditional on some > mm debugging options like: > > if (WARN_ON_ONCE(!valid_user_address(addr))) > return 0; > > It's kinda like "void __user *" versus "void *". The __user ones can > *absolutely* point anywhere, user or kernel. That's why we can't WARN() > in the untagged_addr() function that takes user pointers. > > But "void *" can only point to the kernel. It has totally different rules. > > We should probably also do something like the attached patch sooner > rather than later. 'untag_mask' really is a misleading name for a mask > that gets applied only to user addresses. A bit too verbose to my liking, but okay. Maybe _uaddr() instead of _user_addr()? -- Kiryl Shutsemau / Kirill A. Shutemov ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-05-04 6:28 ` Kirill A. Shutemov 2023-05-04 15:25 ` Dave Hansen @ 2023-05-04 17:10 ` Linus Torvalds 1 sibling, 0 replies; 28+ messages in thread From: Linus Torvalds @ 2023-05-04 17:10 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Dave Hansen, Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra On Wed, May 3, 2023 at 11:28 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > On Wed, May 03, 2023 at 09:38:03AM -0700, Linus Torvalds wrote: > > > Why does it do that "shift-by-63" game there, instead of making > > > tlbstate_untag_mask just have bit #63 always set? > > > > And it turns out that bit #63 really _is_ always set, so I think the > > solution to this all is to remove the sign games in untag_addr() > > entirely. > > Untagging kernel pointer with LAM enabled will land it in the canonical > hole which is safe as it leads to #GP on dereference. You are entirely missing the point. The GP fault does *NOT MATTER*. Think of 'untagged_addr()' as a companion to - but absolutely *NOT* a replacement for - 'access_ok()'. You have three distinct cases: (a) user-supplied valid user address (b) user-supplied invalid user address (its high bit set) (c) actual kernel address and 'untagged_addr()' and 'access_ok()' work on the same basic input domain: cases (a) and (b). And the important thing for 'untagged_addr()' is that in case (a) it needs to remove the tag bits, and in case (b) it needs to *keep* the address as an invalid user address. Note that it does not need to keep the value the *same*. Nobody cares. And also note that the resulting pointer may or may not GP-fault. Nobody cares. Just to hit this home with a very explicit example, think of the case where the kernel config for address masking isn't even enabled, and 'untagged_addr()' is a 1:1 map with no changes. Doing 'untagged_addr()' on a valid user address results in a valid user address. And doing it on an invalid user address results in an invalid user address. 
GOOD. Note that doing 'untagged_addr()' on an invalid user address does NOT IN ANY WAY make that invalid address somehow "safe" and cause a GP fault. Sure, it _might_ GP-fault. Or it might not. It might point to random kernel data. Verification of the address is simply not the job of 'untagged_addr()'. Never has been, never will be, and fundamentally *cannot* be, since for the foreseeable future 'untagged_addr()' is a no-op for every single user outside of Intel. Verification is separate. The verification is never "let's randomly just access this pointer and see if it gets a GP-fault". No. We have a user pointer, it needs *checking*. It needs to have something like "use lookup_vma() to look up the vma that is associated with that address". But it could also be something like "just do the range check in GUP". And that's why "keep an invalid user address as an invalid address", because that *separate* stage of verifying the address needs to still show that it's invalid. Now, sometimes the "verification" might actually be "access_ok()" itself, but honestly, if the address is used for an actual access, then it shouldn't have gone through the 'untagged_addr()' thing at all. It should just have been used as-is for the access. So normally 'untagged_addr()' does not get used *together* with 'access_ok()', although that should obviously also work. End result: all that really matters on x86-64 is that 'untagged_addr()' must keep the high bit as-is. That's the "this is not a valid user address" bit. That's very much explicit in my current series, of course, but even without my current series it was implicitly the truth with those LAM patches (particularly considering that 'untagged_addr()' didn't actually do the "keep kernel addresses as-is" that it *thought* it did due to the signed type confusion). So what about that (c) case? It doesn't matter. 
It's simply fundamentally wrong for the kernel to pass an actual kernel address to 'untagged_addr()' and expect something useful back. It's a nonsensical thing to do, and it's a huge bug. So for the (c) case, the fact that the result would be useless and *usually* GP-fault on access is a good thing. But it's not a requirement, it's more of a "oh, cool, it's good that the nonsensical operation causes quick failures". So in that case, the GP fault is a "feature", but not a requirement. Again, the normal 'untagged_addr()' case (of not changing the pointer at all), obviously does *not* cause the kernel pointer corruption, but maybe we could even have a debug mode. We could literally make 'untagged_addr()' do something like #ifdef CONFIG_DEBUG_NONLAM // Make sure nobody tries to use untagged_addr() on non-user addresses #define untagged_addr(x) ((x) | (long)(x)>>63) #endif except obviously with the "get the types right and use 'x' only once" thing (so the above #define is buggy, and purely for conceptual documentation purposes). See? Side note: one day, maybe we want to use address tagging *inside* the kernel. However, that will not use 'untagged_addr()'. That would use some *other* model for cleaning up kernel pointers when necessary. Even on x86-64, the tagging rules for kernel and user space are entirely different, in that user-space LAM rules are "U48" or "U57", while kernel LAM rules depend on the paging mode, and the two are *not* tied together. But more importantly from a kernel perspective: regardless of hardware implementations like that, the notion of masking bits off an untrusted user pointer is simply completely *different* from the notion of masking off bits of our own kernel pointers. You can see this in the kernel everywhere: user pointers should be statically always user pointers, and should look something like struct iocb __user *user_iocb which is very very different from a kernel pointer that doesn't have that "__user" thing. 
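[Editor's note: a type-safe version of the conceptual CONFIG_DEBUG_NONLAM macro above, evaluating its argument only once, might look like the sketch below. This is an illustration of the idea, not code from any kernel tree, and the function name is made up.]

```c
#include <stdint.h>

/*
 * Debug-mode untagged_addr() sketch: smear the sign bit (bit 63) across
 * the whole value with an arithmetic right shift.  A user address
 * (bit 63 clear) passes through unchanged; a kernel address (bit 63
 * set) becomes all-ones, i.e. an obviously bogus pointer that fails
 * quickly on any later check.  (Right-shifting a negative signed value
 * is implementation-defined in ISO C, but arithmetic on the compilers
 * the kernel supports.)
 */
static inline uint64_t untagged_addr_debug(uint64_t addr)
{
	return addr | (uint64_t)((int64_t)addr >> 63);
}
```

This keeps the property the mail insists on: the helper never distinguishes valid from invalid user addresses, it only preserves (or here, strengthens) the "not a user address" bit.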
Again, on some architectures, the *EXACT*SAME* bit pattern pointer value may point to different things, because user accesses are simply in a different address space entirely. So if/when KASAN starts using LAM inside the kernel, it will do its own things. It had better not use "untagged_addr()". Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-05-02 20:14 ` Linus Torvalds 2023-05-03 0:53 ` Dave Hansen @ 2023-05-03 8:01 ` Peter Zijlstra 2023-05-03 16:38 ` Linus Torvalds 1 sibling, 1 reply; 28+ messages in thread From: Peter Zijlstra @ 2023-05-03 8:01 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Hansen, Kirill A. Shutemov, Dave Hansen, x86, linux-kernel, kirill.shutemov On Tue, May 02, 2023 at 01:14:33PM -0700, Linus Torvalds wrote: > On Tue, May 2, 2023 at 9:00 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > > I guess it also wouldn't matter as much either if we hid it in a helper > > > like the attached patch and I didn't have to read it twice. ;) > > > > Yeah, I think that's a good solution. > > Hmm. And as I was rebasing the patch to fix up my patch, I realized > that the current -git top-of-tree state is actually broken. > > That > > #define access_ok(addr, size) \ > ({ \ > WARN_ON_IN_IRQ(); \ > likely(__access_ok(untagged_addr(addr), size)); \ > }) > > is actually *wrong* in two ways. > > Now, in my original patch, I added a comment about how that > "WARN_ON_IN_IRQ()" is bogus and this shouldn't be x86-specific at all. > > I ended up going back in time to see why it was added, and I think it > was added because we used to access 'current' in access_ok(), due to > it using that user_addr_max() thing: > > likely(!__range_not_ok(addr, size, user_addr_max())); > > but that was all removed by the set_fs() removal by Christoph Hellwig. So I had a poke around, trying to figure out where it came from, and yes. Commit ae31fe51a3cc ("perf/x86: Restore TASK_SIZE check on frame pointer") is the reason that WARN_ON_IN_IRQ() thing got added. ^ permalink raw reply [flat|nested] 28+ messages in thread
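[Editor's note: the underlying range check that access_ok() wraps might be sketched as below. The helper name echoes the __access_ok() quoted above, but the address-space limit and the exact form are assumptions for illustration, not the actual patch under discussion.]

```c
#include <stdbool.h>
#include <stdint.h>

#define TASK_SIZE_MAX (1ULL << 47)	/* illustrative 4-level-paging limit */

/*
 * Overflow-safe range check: "addr + size must stay within the user
 * address-space limit", written so the addition can never wrap.  Note
 * it never dereferences anything; an address that kept bit 63 set
 * after untagging fails this comparison automatically, which is why
 * untagged_addr() preserving that bit is all that matters.
 */
static inline bool __access_ok(uint64_t addr, uint64_t size)
{
	return size <= TASK_SIZE_MAX && addr <= TASK_SIZE_MAX - size;
}
```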
* Re: [GIT PULL] x86/mm for 6.4 2023-05-03 8:01 ` Peter Zijlstra @ 2023-05-03 16:38 ` Linus Torvalds 0 siblings, 0 replies; 28+ messages in thread From: Linus Torvalds @ 2023-05-03 16:38 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Hansen, Kirill A. Shutemov, Dave Hansen, x86, linux-kernel, kirill.shutemov On Wed, May 3, 2023 at 1:01 AM Peter Zijlstra <peterz@infradead.org> wrote: > > So I had a poke around, trying to figure out where it came from, and > yes. Commit ae31fe51a3cc ("perf/x86: Restore TASK_SIZE check on frame > pointer") is the reason that WARN_ON_IN_IRQ() thing got added. I added a pointer to that commit in the commit message, and took the above to mean 'Ack' from you. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-04-28 20:07 ` Linus Torvalds 2023-04-28 20:15 ` Linus Torvalds @ 2023-06-16 8:47 ` Alexander Potapenko 2023-06-16 16:22 ` Linus Torvalds 1 sibling, 1 reply; 28+ messages in thread From: Alexander Potapenko @ 2023-06-16 8:47 UTC (permalink / raw) To: torvalds Cc: dave.hansen, kirill.shutemov, linux-kernel, peterz, x86, Alexander Potapenko Hi Linus, > static bool ex_handler_uaccess(const struct exception_table_entry *fixup, > - struct pt_regs *regs, int trapnr) > + struct pt_regs *regs, int trapnr, > + unsigned long fault_address) > { > - WARN_ONCE(trapnr == X86_TRAP_GP, "General protection fault in user access. Non-canonical address?"); > + WARN_ONCE(trapnr == X86_TRAP_GP && !gp_fault_address_ok(fault_address), > + "General protection fault in user access. Non-canonical address?"); > return ex_handler_default(fixup, regs); > } Shouldn't ex_handler_copy() be fixed in the same way? Looks like it's still possible for a tagged userspace address to be passed to it and trigger a warning. Alex ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-06-16 8:47 ` Alexander Potapenko @ 2023-06-16 16:22 ` Linus Torvalds 2023-06-19 12:03 ` Alexander Potapenko 0 siblings, 1 reply; 28+ messages in thread From: Linus Torvalds @ 2023-06-16 16:22 UTC (permalink / raw) To: Alexander Potapenko Cc: dave.hansen, kirill.shutemov, linux-kernel, peterz, x86 On Fri, 16 Jun 2023 at 01:47, Alexander Potapenko <glider@google.com> wrote: > > Shouldn't ex_handler_copy() be fixed in the same way? I don't think ex_handler_copy() is actually reachable any more. The only thing ex_handler_copy() did was to set %ax to the fault type. It was used by the strange copy_user_generic_unrolled() that had special machine check case for "don't do the tail if we get X86_TRAP_MC". But that was always bogus. The MC case in question was for the __copy_user_nocache function, and the machine check case was for the *destination*, which wasn't in user space at all. So instead of looking at "what was the trap number", the code should have just noticed "trapped on the destination", and stopped for *that* reason. See commit 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function") and in particular, see the comment there about writes on the destination: * An exception on a write means that we're * done, but we need to update the count * depending on where in the unrolled loop * we were. but yeah, I never removed the actual now unused _ASM_EXTABLE_CPY case. Honestly, I had no way of even testing the code. I doubt anybody does. There are a couple of users: - rdma mis-uses it for regular kernel-to-kernel copies that don't fault (because rdma wants the "nocache" case). So it can't fault at all. 
- a couple of GPU drivers mis-use it similarly to rdma, but at least with a user source in the form of __copy_from_user_inatomic_nocache() - _copy_from_iter_flushcache uses it for pmem and dax, because it wants that machine check handling for non-volatile memory So two of three users don't actually *have* the MC case at all on the destination. And the third case - the actual "copy using uncached accesses in order to get errors on the destination" case - depends on hardware that I'm not convinced really exists any more. Oh, I think there's some odd ntb mis-use too. I'd love for somebody to actually test the machine check case, but the old code was completely insane with odd "we handle _one_ case of 4-byte alignment, but not actually the general case", so I wonder if the MC case ever really worked in reality. And as mentioned, I am not convinced the hardware is available. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [GIT PULL] x86/mm for 6.4 2023-06-16 16:22 ` Linus Torvalds @ 2023-06-19 12:03 ` Alexander Potapenko 0 siblings, 0 replies; 28+ messages in thread From: Alexander Potapenko @ 2023-06-19 12:03 UTC (permalink / raw) To: Linus Torvalds; +Cc: dave.hansen, kirill.shutemov, linux-kernel, peterz, x86 On Fri, Jun 16, 2023 at 6:22 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, 16 Jun 2023 at 01:47, Alexander Potapenko <glider@google.com> wrote: > > > > Shouldn't ex_handler_copy() be fixed in the same way? > > I don't think ex_handler_copy() is actually reachable any more. > > The only thing ex_handler_copy() did was to set %ax to the fault type. > > It was used by the strange copy_user_generic_unrolled() that had > special machine check case for "don't do the tail if we get > X86_TRAP_MC". Oh, I see, thanks! I am using a downstream kernel (way before 034ff37d3407) that still has _ASM_EXTABLE_CPY. Looks like I just need to adjust ex_handler_copy() in my code. ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2023-06-19 12:04 UTC | newest] Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-04-27 22:56 [GIT PULL] x86/mm for 6.4 Dave Hansen 2023-04-28 17:23 ` pr-tracker-bot 2023-04-28 20:07 ` Linus Torvalds 2023-04-28 20:15 ` Linus Torvalds 2023-04-29 0:38 ` Kirill A. Shutemov 2023-04-29 1:04 ` Linus Torvalds 2023-05-02 15:42 ` Dave Hansen 2023-05-02 16:00 ` Linus Torvalds 2023-05-02 20:14 ` Linus Torvalds 2023-05-03 0:53 ` Dave Hansen 2023-05-03 1:17 ` Linus Torvalds 2023-05-03 16:38 ` Linus Torvalds 2023-05-03 16:44 ` Dave Hansen 2023-05-03 16:51 ` Linus Torvalds 2023-05-03 17:54 ` Linus Torvalds 2023-05-03 19:01 ` Peter Zijlstra 2023-05-03 19:19 ` Linus Torvalds 2023-05-05 19:56 ` Linus Torvalds 2023-05-05 20:59 ` Kirill A. Shutemov 2023-05-04 6:28 ` Kirill A. Shutemov 2023-05-04 15:25 ` Dave Hansen 2023-05-04 17:10 ` Kirill A. Shutemov 2023-05-04 17:10 ` Linus Torvalds 2023-05-03 8:01 ` Peter Zijlstra 2023-05-03 16:38 ` Linus Torvalds 2023-06-16 8:47 ` Alexander Potapenko 2023-06-16 16:22 ` Linus Torvalds 2023-06-19 12:03 ` Alexander Potapenko