* [GIT PULL] x86/mm for 6.4
@ 2023-04-27 22:56 Dave Hansen
  2023-04-28 17:23 ` pr-tracker-bot
  2023-04-28 20:07 ` Linus Torvalds
  0 siblings, 2 replies; 28+ messages in thread
From: Dave Hansen @ 2023-04-27 22:56 UTC (permalink / raw)
  To: torvalds; +Cc: x86, linux-kernel, kirill.shutemov, Dave Hansen

Hi Linus,

Please pull some x86/mm changes for 6.4.  The only content here is a
new revision of Kirill's Linear Address Masking implementation.

You had some concerns[1] with the last version about the (lack of)
locking while the feature was being enabled in multithreaded programs.
That has since been fixed up to only allow enabling LAM while the
process is single-threaded.  The series has also accumulated a few
random fixes and cleanups.

This conflicts with the shadow stack work (x86/shstk) that I sent
earlier this week.  This is no surprise since LAM and shadow stacks
both add prctl()s and selftests, and touch the same headers.

Despite there being a few sites of conflict, the merge is logically
straightforward.  I've included a suggested resolution.  Both Kirill
(LAM) and Rick (shadow stacks) tested the result and confirmed that
nothing broke.

1. https://lore.kernel.org/all/CAHk-=wi=TY3Kte5Z1_nvfcsEh+rcz86pYnzeASw=pbG9QtpJEQ@mail.gmail.com/

--

The following changes since commit eeac8ede17557680855031c6f305ece2378af326:

  Linux 6.3-rc2 (2023-03-12 16:36:44 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.4

for you to fetch changes up to 97740266de26e5dfe6e4fbecacb6995b66c2e378:

  x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside (2023-04-06 13:45:06 -0700)

----------------------------------------------------------------
Add support for new Linear Address Masking CPU feature.  This is similar
to ARM's Top Byte Ignore and allows userspace to store metadata in some
bits of pointers without masking it out before use.
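
As a rough sketch of how this looks from userspace (illustrative only;
ARCH_ENABLE_TAGGED_ADDR is the arch_prctl() added below, and error
handling is trimmed):

    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define ARCH_ENABLE_TAGGED_ADDR 0x4002

    int main(void)
    {
            /* Ask for 6 tag bits (LAM_U57: pointer bits 62:57). */
            if (syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6))
                    return 1;

            int *p = malloc(sizeof(*p));
            /* Stash metadata in the tag bits; the CPU ignores them. */
            int *tagged = (int *)((unsigned long)p | (0x2aUL << 57));
            *tagged = 42;   /* dereference works with LAM enabled */
            return 0;
    }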

----------------------------------------------------------------
Kirill A. Shutemov (14):
      x86/mm: Rework address range check in get_user() and put_user()
      x86: Allow atomic MM_CONTEXT flags setting
      x86: CPUID and CR3/CR4 flags for Linear Address Masking
      x86/mm: Handle LAM on context switch
      mm: Introduce untagged_addr_remote()
      x86/uaccess: Provide untagged_addr() and remove tags before address check
      x86/mm: Reduce untagged_addr() overhead for systems without LAM
      x86/mm: Provide arch_prctl() interface for LAM
      mm: Expose untagging mask in /proc/$PID/status
      iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
      x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
      selftests/x86/lam: Add test cases for LAM vs thread creation
      x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA
      x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside

Weihong Zhang (5):
      selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
      selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
      selftests/x86/lam: Add io_uring test cases for linear-address masking
      selftests/x86/lam: Add inherit test cases for linear-address masking
      selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking

 arch/arm64/include/asm/mmu_context.h        |    6 +
 arch/sparc/include/asm/mmu_context_64.h     |    6 +
 arch/sparc/include/asm/uaccess_64.h         |    2 +
 arch/x86/Kconfig                            |   11 +
 arch/x86/entry/vsyscall/vsyscall_64.c       |    2 +-
 arch/x86/include/asm/cpufeatures.h          |    1 +
 arch/x86/include/asm/disabled-features.h    |    8 +-
 arch/x86/include/asm/mmu.h                  |   18 +-
 arch/x86/include/asm/mmu_context.h          |   49 +-
 arch/x86/include/asm/processor-flags.h      |    2 +
 arch/x86/include/asm/tlbflush.h             |   48 +-
 arch/x86/include/asm/uaccess.h              |   58 +-
 arch/x86/include/uapi/asm/prctl.h           |    5 +
 arch/x86/include/uapi/asm/processor-flags.h |    6 +
 arch/x86/kernel/process.c                   |    6 +
 arch/x86/kernel/process_64.c                |   68 +-
 arch/x86/kernel/traps.c                     |    6 +-
 arch/x86/lib/getuser.S                      |   83 +-
 arch/x86/lib/putuser.S                      |   54 +-
 arch/x86/mm/init.c                          |    5 +
 arch/x86/mm/tlb.c                           |   53 +-
 drivers/iommu/iommu-sva.c                   |    8 +-
 drivers/vfio/vfio_iommu_type1.c             |    2 +-
 fs/proc/array.c                             |    7 +
 fs/proc/task_mmu.c                          |    9 +-
 include/linux/ioasid.h                      |    9 -
 include/linux/mm.h                          |   11 -
 include/linux/mmu_context.h                 |   14 +
 include/linux/sched/mm.h                    |    8 +-
 include/linux/uaccess.h                     |   22 +
 mm/gup.c                                    |    4 +-
 mm/madvise.c                                |    5 +-
 mm/migrate.c                                |   11 +-
 tools/testing/selftests/x86/Makefile        |    2 +-
 tools/testing/selftests/x86/lam.c           | 1241 +++++++++++++++++++++++++++
 35 files changed, 1701 insertions(+), 149 deletions(-)
 create mode 100644 tools/testing/selftests/x86/lam.c
diff --cc arch/x86/include/asm/disabled-features.h
index fafe9be7a6f4,652e366b68a0..702d93fdd10e
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@@ -120,8 -126,8 +132,8 @@@
  #define DISABLED_MASK9	(DISABLE_SGX)
  #define DISABLED_MASK10	0
  #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
- 			 DISABLE_CALL_DEPTH_TRACKING)
+ 			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
 -#define DISABLED_MASK12	0
 +#define DISABLED_MASK12	(DISABLE_LAM)
  #define DISABLED_MASK13	0
  #define DISABLED_MASK14	0
  #define DISABLED_MASK15	0
diff --cc arch/x86/include/uapi/asm/prctl.h
index eb290d89cb32,1b85bc876c2d..abe3fe6db6d2
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@@ -20,9 -20,16 +20,21 @@@
  #define ARCH_MAP_VDSO_32		0x2002
  #define ARCH_MAP_VDSO_64		0x2003
  
+ /* Don't use 0x3001-0x3004 because of old glibcs */
+ 
 +#define ARCH_GET_UNTAG_MASK		0x4001
 +#define ARCH_ENABLE_TAGGED_ADDR		0x4002
 +#define ARCH_GET_MAX_TAG_BITS		0x4003
 +#define ARCH_FORCE_TAGGED_SVA		0x4004
 +
+ #define ARCH_SHSTK_ENABLE		0x5001
+ #define ARCH_SHSTK_DISABLE		0x5002
+ #define ARCH_SHSTK_LOCK			0x5003
+ #define ARCH_SHSTK_UNLOCK		0x5004
+ #define ARCH_SHSTK_STATUS		0x5005
+ 
+ /* ARCH_SHSTK_ features bits */
+ #define ARCH_SHSTK_SHSTK		(1ULL <<  0)
+ #define ARCH_SHSTK_WRSS			(1ULL <<  1)
+ 
  #endif /* _ASM_X86_PRCTL_H */
diff --cc arch/x86/kernel/process.c
index 50d950771371,8bf13cff0141..7157b09d1cbf
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@@ -48,7 -48,7 +48,8 @@@
  #include <asm/frame.h>
  #include <asm/unwind.h>
  #include <asm/tdx.h>
 +#include <asm/mmu_context.h>
+ #include <asm/shstk.h>
  
  #include "process.h"
  
diff --cc arch/x86/kernel/process_64.c
index b46924c9e46d,31241930b60c..74c7e84a94d8
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@@ -875,22 -831,13 +877,28 @@@ long do_arch_prctl_64(struct task_struc
  # endif
  	case ARCH_MAP_VDSO_64:
  		return prctl_map_vdso(&vdso_image_64, arg2);
 +#endif
 +#ifdef CONFIG_ADDRESS_MASKING
 +	case ARCH_GET_UNTAG_MASK:
 +		return put_user(task->mm->context.untag_mask,
 +				(unsigned long __user *)arg2);
 +	case ARCH_ENABLE_TAGGED_ADDR:
 +		return prctl_enable_tagged_addr(task->mm, arg2);
 +	case ARCH_FORCE_TAGGED_SVA:
 +		set_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &task->mm->context.flags);
 +		return 0;
 +	case ARCH_GET_MAX_TAG_BITS:
 +		if (!cpu_feature_enabled(X86_FEATURE_LAM))
 +			return put_user(0, (unsigned long __user *)arg2);
 +		else
 +			return put_user(LAM_U57_BITS, (unsigned long __user *)arg2);
  #endif
+ 	case ARCH_SHSTK_ENABLE:
+ 	case ARCH_SHSTK_DISABLE:
+ 	case ARCH_SHSTK_LOCK:
+ 	case ARCH_SHSTK_UNLOCK:
+ 	case ARCH_SHSTK_STATUS:
+ 		return shstk_prctl(task, option, arg2);
  	default:
  		ret = -EINVAL;
  		break;
diff --cc fs/proc/array.c
index 6daea628bc76,3e1a33dcd0d0..a880c4e44752
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@@ -424,11 -423,11 +424,16 @@@ static inline void task_thp_status(stru
  	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
  }
  
 +static inline void task_untag_mask(struct seq_file *m, struct mm_struct *mm)
 +{
 +	seq_printf(m, "untag_mask:\t%#lx\n", mm_untag_mask(mm));
 +}
 +
+ __weak void arch_proc_pid_thread_features(struct seq_file *m,
+ 					  struct task_struct *task)
+ {
+ }
+ 
  int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
  			struct pid *pid, struct task_struct *task)
  {
diff --cc tools/testing/selftests/x86/Makefile
index 598135d3162b,cfc8a26ad151..fa2216a8c0d5
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@@ -18,7 -18,7 +18,7 @@@ TARGETS_C_32BIT_ONLY := entry_from_vm8
  			test_FCMOV test_FCOMI test_FISTTP \
  			vdso_restorer
  TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
- 			corrupt_xstate_header amx lam
 -			corrupt_xstate_header amx test_shadow_stack
++			corrupt_xstate_header amx test_shadow_stack lam
  # Some selftests require 32bit support enabled also on 64bit systems
  TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
  


* Re: [GIT PULL] x86/mm for 6.4
  2023-04-27 22:56 [GIT PULL] x86/mm for 6.4 Dave Hansen
@ 2023-04-28 17:23 ` pr-tracker-bot
  2023-04-28 20:07 ` Linus Torvalds
  1 sibling, 0 replies; 28+ messages in thread
From: pr-tracker-bot @ 2023-04-28 17:23 UTC (permalink / raw)
  To: Dave Hansen; +Cc: torvalds, x86, linux-kernel, kirill.shutemov, Dave Hansen

The pull request you sent on Thu, 27 Apr 2023 15:56:47 -0700:

> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git tags/x86_mm_for_6.4

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/22b8cc3e78f5448b4c5df00303817a9137cd663f

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


* Re: [GIT PULL] x86/mm for 6.4
  2023-04-27 22:56 [GIT PULL] x86/mm for 6.4 Dave Hansen
  2023-04-28 17:23 ` pr-tracker-bot
@ 2023-04-28 20:07 ` Linus Torvalds
  2023-04-28 20:15   ` Linus Torvalds
  2023-06-16  8:47   ` Alexander Potapenko
  1 sibling, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2023-04-28 20:07 UTC (permalink / raw)
  To: Dave Hansen; +Cc: x86, linux-kernel, kirill.shutemov, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 861 bytes --]

On Thu, Apr 27, 2023 at 3:57 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> Please pull some x86/mm changes for 6.4.  The only content here is a
> new revision of Kirill's Linear Address Masking implementation.

So I was waiting for this for my final piece of the x86 user copy
changes, so here goes...

I think we should now make 'access_ok()' do the same thing that
get/put_user() do with the LAM code: only worry about the sign bit.

So here's my suggested change on top of the current tree. Comments?

PeterZ also added to the cc, because he's the source of that
WARN_ON_IN_IRQ() in the x86 'access_ok()' macro. That's the only
reason x86 has its own copy of that.

I wonder if that WARN_ON_IN_IRQ() should just be removed, or perhaps
moved into the generic code in <asm-generic/access_ok.h>?

                  Linus

[-- Attachment #2: 0001-x86-64-make-access_ok-independent-of-LAM.patch --]
[-- Type: text/x-patch, Size: 8169 bytes --]

From 2c283a649b07d180d5a7bcfaf539365297e82c23 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 28 Apr 2023 12:55:10 -0700
Subject: [PATCH] x86-64: make access_ok() independent of LAM

The linear address masking (LAM) code made access_ok() more complicated,
in that it now needs to untag the address in order to verify the access
range.  See commit 74c228d20a51 ("x86/uaccess: Provide untagged_addr()
and remove tags before address check").

We were able to avoid that overhead in the get_user/put_user code paths
by simply using the sign bit for the address check, and depending on the
GP fault if the address was non-canonical, which made it all independent
of LAM.

And we can do the same thing for access_ok(): simply check that the user
pointer range has the high bit clear.  No need to bother with any
address bit masking.

In fact, we can go a bit further, and just check the starting address
for known small accesses ranges: any accesses that overflow will still
be in the non-canonical area and will still GP fault.

To still make syzkaller catch any potentially unchecked user addresses,
we'll continue to warn about GP faults that are caused by accesses in
the non-canonical range.  But we'll limit that to purely "high bit set
and past the one-page 'slop' area".

We could probably just do that "check only starting address" for any
arbitrary range size: realistically all kernel accesses to user space
will be done starting at the low address.  But let's leave that kind of
optimization for later.  As it is, this already allows us to generate
simpler code and not worry about any tag bits in the address.

The one thing to look out for is the GUP address check: instead of
actually copying data in the virtual address range (and thus bad
addresses being caught by the GP fault), GUP will look up the page
tables manually.  As a result, the page table limits need to be checked,
and that was previously implicitly done by the access_ok().

With the relaxed access_ok() check, we need to just do an explicit check
for TASK_SIZE_MAX in the GUP code instead.  The GUP code already needs
to do the tag bit unmasking anyway, so there this is all very
straightforward, and there are no LAM issues.

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/x86/include/asm/uaccess.h | 39 +++++++++++++++++++++++++++++----
 arch/x86/mm/extable.c          | 40 +++++++++++++++++++++++++++++-----
 mm/gup.c                       |  2 ++
 3 files changed, 71 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 457e814712af..123135d60f72 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -75,6 +75,34 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 #define untagged_addr(addr)	(addr)
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * On x86-64, we may have tag bits in the user pointer. Rather than
+ * mask them off, just change the rules for __access_ok().
+ *
+ * Make the rule be that 'ptr+size' must not overflow, and must not
+ * have the high bit set. Compilers generally understand about
+ * unsigned overflow and the CF bit and generate reasonable code for
+ * this. Although it looks like the combination confuses at least
+ * clang (and instead of just doing an "add" followed by a test of
+ * SF and CF, you'll see that unnecessary comparison).
+ *
+ * For the common case of small sizes that can be checked at compile
+ * time, don't even bother with the addition, and just check that the
+ * base pointer is ok.
+ */
+static inline bool __access_ok(const void __user *ptr, unsigned long size)
+{
+	if (__builtin_constant_p(size <= PAGE_SIZE) && size <= PAGE_SIZE) {
+		return (long)ptr >= 0;
+	} else {
+		unsigned long sum = size + (unsigned long)ptr;
+		return (long) sum >= 0 && sum >= (unsigned long)ptr;
+	}
+}
+#define __access_ok __access_ok
+#endif
+
 /**
  * access_ok - Checks if a user space pointer is valid
  * @addr: User space pointer to start of block to check
@@ -91,11 +119,14 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
  *
  * Return: true (nonzero) if the memory block may be valid, false (zero)
  * if it is definitely invalid.
+ *
+ * This should not be x86-specific. The only odd thing out here is
+ * the WARN_ON_IN_IRQ(), which doesn't exist in the generic version.
  */
-#define access_ok(addr, size)						\
-({									\
-	WARN_ON_IN_IRQ();						\
-	likely(__access_ok(untagged_addr(addr), size));			\
+#define access_ok(addr, size)			\
+({						\
+	WARN_ON_IN_IRQ();			\
+	likely(__access_ok(addr, size));	\
 })
 
 #include <asm-generic/access_ok.h>
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 60814e110a54..8d38dedadbb1 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -130,10 +130,36 @@ static bool ex_handler_fprestore(const struct exception_table_entry *fixup,
 	return true;
 }
 
+/*
+ * On x86-64, we end up being imprecise with 'access_ok()', and allow
+ * non-canonical user addresses to make the range comparisons simpler,
+ * and to not have to worry about LAM being enabled.
+ *
+ * In fact, we allow up to one page of "slop" at the sign boundary,
+ * which means that we can do access_ok() by just checking the sign
+ * of the pointer for the common case of having a small access size.
+ */
+static bool gp_fault_address_ok(unsigned long fault_address)
+{
+#ifdef CONFIG_X86_64
+	/* Is it in the "user space" part of the non-canonical space? */
+	if ((long) fault_address >= 0)
+		return true;
+
+	/* .. or just above it? */
+	fault_address -= PAGE_SIZE;
+	if ((long) fault_address >= 0)
+		return true;
+#endif
+	return false;
+}
+
 static bool ex_handler_uaccess(const struct exception_table_entry *fixup,
-			       struct pt_regs *regs, int trapnr)
+			       struct pt_regs *regs, int trapnr,
+			       unsigned long fault_address)
 {
-	WARN_ONCE(trapnr == X86_TRAP_GP, "General protection fault in user access. Non-canonical address?");
+	WARN_ONCE(trapnr == X86_TRAP_GP && !gp_fault_address_ok(fault_address),
+		"General protection fault in user access. Non-canonical address?");
 	return ex_handler_default(fixup, regs);
 }
 
@@ -189,10 +215,12 @@ static bool ex_handler_imm_reg(const struct exception_table_entry *fixup,
 }
 
 static bool ex_handler_ucopy_len(const struct exception_table_entry *fixup,
-				  struct pt_regs *regs, int trapnr, int reg, int imm)
+				  struct pt_regs *regs, int trapnr,
+				  unsigned long fault_address,
+				  int reg, int imm)
 {
 	regs->cx = imm * regs->cx + *pt_regs_nr(regs, reg);
-	return ex_handler_uaccess(fixup, regs, trapnr);
+	return ex_handler_uaccess(fixup, regs, trapnr, fault_address);
 }
 
 int ex_get_fixup_type(unsigned long ip)
@@ -238,7 +266,7 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
 	case EX_TYPE_FAULT_MCE_SAFE:
 		return ex_handler_fault(e, regs, trapnr);
 	case EX_TYPE_UACCESS:
-		return ex_handler_uaccess(e, regs, trapnr);
+		return ex_handler_uaccess(e, regs, trapnr, fault_addr);
 	case EX_TYPE_COPY:
 		return ex_handler_copy(e, regs, trapnr);
 	case EX_TYPE_CLEAR_FS:
@@ -269,7 +297,7 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
 	case EX_TYPE_FAULT_SGX:
 		return ex_handler_sgx(e, regs, trapnr);
 	case EX_TYPE_UCOPY_LEN:
-		return ex_handler_ucopy_len(e, regs, trapnr, reg, imm);
+		return ex_handler_ucopy_len(e, regs, trapnr, fault_addr, reg, imm);
 	case EX_TYPE_ZEROPAD:
 		return ex_handler_zeropad(e, regs, fault_addr);
 	}
diff --git a/mm/gup.c b/mm/gup.c
index ff689c88a357..bbe416236593 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2970,6 +2970,8 @@ static int internal_get_user_pages_fast(unsigned long start,
 	len = nr_pages << PAGE_SHIFT;
 	if (check_add_overflow(start, len, &end))
 		return 0;
+	if (end > TASK_SIZE_MAX)
+		return -EFAULT;
 	if (unlikely(!access_ok((void __user *)start, len)))
 		return -EFAULT;
 
-- 
2.39.1.437.g88ae5c7f81



* Re: [GIT PULL] x86/mm for 6.4
  2023-04-28 20:07 ` Linus Torvalds
@ 2023-04-28 20:15   ` Linus Torvalds
  2023-04-29  0:38     ` Kirill A. Shutemov
  2023-06-16  8:47   ` Alexander Potapenko
  1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2023-04-28 20:15 UTC (permalink / raw)
  To: Dave Hansen; +Cc: x86, linux-kernel, kirill.shutemov, Peter Zijlstra

On Fri, Apr 28, 2023 at 1:07 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So here's my suggested change on top of the current tree. Comments?

Oh, and I wanted to particularly mention that

    We could probably just do that "check only starting address" for any
    arbitrary range size: realistically all kernel accesses to user space
    will be done starting at the low address.  But let's leave that kind of
    optimization for later.  As it is, this already allows us to generate
    simpler code and not worry about any tag bits in the address.

part of the commit log.

Right now that patch only simplifies the range check for when the
compiler statically knows that the range is small (which does happen,
but not very often, because 'copy_to/from_user()' isn't inlined on
x86-64, so the compiler doesn't actually see the constant size case
that is very common there).

However, that "check the end of the range" is sometimes actually
fairly complicated code, and it would be nice to drop that entirely.

See for example the fs/readdir.c case, where the length of the access
is kind of annoying:

        if (!user_write_access_begin(dirent,
                        (unsigned long)(dirent->d_name + namlen + 1) -
                                (unsigned long)dirent))
                goto efault;

and there really isn't any actual reason to check the end of the
access on x86: if the beginning address has the low bit clear, it
doesn't really matter what the end is, because we'll either have a
good area, or we'll fault in the non-canonical area even if the sign
changes.

So being careful about the range is kind of annoying, when we don't
really need it.

                  Linus


* Re: [GIT PULL] x86/mm for 6.4
  2023-04-28 20:15   ` Linus Torvalds
@ 2023-04-29  0:38     ` Kirill A. Shutemov
  2023-04-29  1:04       ` Linus Torvalds
  2023-05-02 15:42       ` Dave Hansen
  0 siblings, 2 replies; 28+ messages in thread
From: Kirill A. Shutemov @ 2023-04-29  0:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra

On Fri, Apr 28, 2023 at 01:15:33PM -0700, Linus Torvalds wrote:
> On Fri, Apr 28, 2023 at 1:07 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So here's my suggested change on top of the current tree. Comments?
> 
> Oh, and I wanted to particularly mention that
> 
>     We could probably just do that "check only starting address" for any
>     arbitrary range size: realistically all kernel accesses to user space
>     will be done starting at the low address.  But let's leave that kind of
>     optimization for later.  As it is, this already allows us to generate
>     simpler code and not worry about any tag bits in the address.
> 
> part of the commit log.
> 
> Right now that patch only simplifies the range check for when the
> compiler statically knows that the range is small (which does happen,
> but not very often, because 'copy_to/from_user()' isn't inlined on
> x86-64, so the compiler doesn't actually see the constant size case
> that is very common there).

BTW, I think the static check can be relaxed. Checking size against
PAGE_SIZE is rather conservative: there's 8 TB (or 4 PB for 5-level
paging) guard hole at the beginning of kernel address space.
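
For illustration, the constant-size arm of such a check could then be
relaxed to something like this (just a sketch; the bound and its name
are invented for the example, not existing kernel code):

    /* Anything this far past the sign boundary is still non-canonical,
     * and then hits the guard hole, long before real kernel mappings. */
    #define STATIC_OK_SIZE (8UL << 40)  /* 8 TB */

    if (__builtin_constant_p(size <= STATIC_OK_SIZE) && size <= STATIC_OK_SIZE)
            return (long)ptr >= 0;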

> However, that "check the end of the range" is sometimes actually
> fairly complicated code, and it would be nice to drop that entirely.
> 
> See for example the fs/readdir.c case, where the length of the access
> is kind of annoying:
> 
>         if (!user_write_access_begin(dirent,
>                         (unsigned long)(dirent->d_name + namlen + 1) -
>                                 (unsigned long)dirent))
>                 goto efault;
> 
> and there really isn't any actual reason to check the end of the
> access on x86: if the beginning address has the low bit clear, it
> doesn't really matter what the end is, because we'll either have a
> good area, or we'll fault in the non-canonical area even if the sign
> changes.
> 
> So being careful about the range is kind of annoying, when we don't
> really need it.

Hm. Is there anybody who accesses high to low after the check (glibc
memcpy() bug flashbacks)? Or not in any particular order?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [GIT PULL] x86/mm for 6.4
  2023-04-29  0:38     ` Kirill A. Shutemov
@ 2023-04-29  1:04       ` Linus Torvalds
  2023-05-02 15:42       ` Dave Hansen
  1 sibling, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2023-04-29  1:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra

On Fri, Apr 28, 2023 at 5:38 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> BTW, I think the static check can be relaxed. Checking size against
> PAGE_SIZE is rather conservative: there's 8 TB (or 4 PB for 5-level
> paging) guard hole at the beginning of kernel address space.

So I don't worry about the size per se - we just don't have any
constant sized accesses that are bigger than a page.

The constant-sized case is for things like structures being copied to
user space.

And having a big gap is nice for the syzkaller case, although I don't
think that GP fault has triggered lately (or ever, I don't remember).
Having random system call arguments that trigger "oh, this is in the
non-canonical region" is a good thing.

> > So being careful about the range is kind of annoying, when we don't
> > really need it.
>
> Hm. Is there anybody who accesses high to low after the check (glibc
> memcpy() bug flashbacks)? Or not in any particular order?

Yeah, I can't think of a single case, which is why it seems so silly
to even bother.

Almost all real life cases end up being limited by things like the
page/folio size.

We do have exceptions, like module loading etc that might copy a
bigger area from user space, but no, we don't have any backwards
copies.

So you'd almost have to have some "access_ok()" followed by random
access with a user-controlled offset, and that seems nonsensical and
fundamentally impossible anyway.
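
(For concreteness, the kind of hypothetical pattern in question --
made-up code, not from any real driver:)

    if (!user_read_access_begin(buf, len))  /* range checked once... */
            return -EFAULT;
    /* ...then accessed at a user-controlled offset rather than from
     * the low address up: */
    unsafe_get_user(val, (u32 __user *)(buf + offset), Efault);
    user_read_access_end();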

But just because I can't think of it, and go "that would be insane"
doesn't mean that some driver ioctl interface might not try it.

Which is why I think having others look at it would be a good idea.

               Linus


* Re: [GIT PULL] x86/mm for 6.4
  2023-04-29  0:38     ` Kirill A. Shutemov
  2023-04-29  1:04       ` Linus Torvalds
@ 2023-05-02 15:42       ` Dave Hansen
  2023-05-02 16:00         ` Linus Torvalds
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2023-05-02 15:42 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds
  Cc: Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 978 bytes --]

On 4/28/23 17:38, Kirill A. Shutemov wrote:
> BTW, I think the static check can be relaxed. Checking size against
> PAGE_SIZE is rather conservative: there's 8 TB (or 4 PB for 5-level
> paging) guard hole at the beginning of kernel address space.

Whatever we relax it to, let's make sure we get a note in
Documentation/x86/x86_64/mm.rst.  But that's totally minor and we can
fix it up later.

Has anyone seen any actual code generation difference between:

	return (long)ptr >= 0;

and

        return !((unsigned long)ptr & (1UL<<(BITS_PER_LONG-1)));

?  I'm seeing gcc generate the same code for both the <=PAGE_SIZE side
and the 'sum' side.

It's longer, but I'd rather read the explicit "check bit 63" than the
positive/negative address space thing.  I certainly grok both, but have
to think through the "(long)ptr >= 0" check every time.

I guess it also wouldn't matter as much either if we hid it in a helper
like the attached patch and I didn't have to read it twice. ;)

[-- Attachment #2: ptr_in_user_half.patch --]
[-- Type: text/x-patch, Size: 2540 bytes --]

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 123135d60f72..7bb11d5a7f8f 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -77,15 +77,28 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 
 #ifdef CONFIG_X86_64
 /*
- * On x86-64, we may have tag bits in the user pointer. Rather than
- * mask them off, just change the rules for __access_ok().
+ * The virtual address space is logically divided into a kernel
+ * half and a user half.  When cast to a signed type, user pointers
+ * are positive and kernel pointers are negative.
+ */
+static inline bool ptr_in_user_half(void *ptr)
+{
+	return (long)ptr >= 0;
+}
+
+/*
+ * User pointers can have tag bits on x86-64.  This scheme tolerates
+ * arbitrary values in those bits rather than masking them off.
+ *
+ * Enforce two rules:
+ * 1. 'ptr' must be in the user half of the address space
+ * 2. 'ptr+size' must not overflow (back into the kernel half)
  *
- * Make the rule be that 'ptr+size' must not overflow, and must not
- * have the high bit set. Compilers generally understand about
- * unsigned overflow and the CF bit and generate reasonable code for
- * this. Although it looks like the combination confuses at least
- * clang (and instead of just doing an "add" followed by a test of
- * SF and CF, you'll see that unnecessary comparison).
+ * Compilers generally understand about unsigned overflow and the CF
+ * bit and generate reasonable code for this. Although it looks like
+ * the combination confuses at least clang (and instead of just doing
+ * an "add" followed by a test of SF and CF, you'll see that
+ * unnecessary comparison).
  *
  * For the common case of small sizes that can be checked at compile
  * time, don't even bother with the addition, and just check that the
@@ -93,11 +106,16 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
  */
 static inline bool __access_ok(const void __user *ptr, unsigned long size)
 {
+	/*
+	 * Check only the pointer (not ptr+size) for small accesses.
+	 * This is OK because the kernel address space begins with a
+	 * >=PAGE_SIZE guard hole.
+	 */
 	if (__builtin_constant_p(size <= PAGE_SIZE) && size <= PAGE_SIZE) {
-		return (long)ptr >= 0;
+		return ptr_in_user_half(ptr);
 	} else {
 		unsigned long sum = size + (unsigned long)ptr;
-		return (long) sum >= 0 && sum >= (unsigned long)ptr;
+		return ptr_in_user_half(ptr) && sum >= (unsigned long)ptr;
 	}
 }
 #define __access_ok __access_ok


* Re: [GIT PULL] x86/mm for 6.4
  2023-05-02 15:42       ` Dave Hansen
@ 2023-05-02 16:00         ` Linus Torvalds
  2023-05-02 20:14           ` Linus Torvalds
  0 siblings, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2023-05-02 16:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov, Peter Zijlstra

On Tue, May 2, 2023 at 8:42 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Has anyone seen any actual code generation difference between:
>
>         return (long)ptr >= 0;
>
> and
>
>         return !((unsigned long)ptr & (1UL<<(BITS_PER_LONG-1)));

No, as far as I know, they both generate the same code.

> It's longer, but I'd rather read the explicit "check bit 63" than the
> positive/negative address space thing.  I certainly grok both, but have
> to think through the "(long)ptr >= 0" check every time.

I'm very much the other way. I think it's much clearer to say "check
the sign bit".

Doing the explicit bit check means that I have to look at what the bit
number is, and that is a much more complex expression.

In fact, I'd find it easier to read

       return !((unsigned long)ptr & (1UL<< 63));

just because then you go "Oh, checking bit 63" without having to parse
the expression.

But even then I find the '!' is easy to miss, so you really have to parse it.

But:

> I guess it also wouldn't matter as much either if we hid it in a helper
> like the attached patch and I didn't have to read it twice. ;)

Yeah, I think that's a good solution.

                 Linus


* Re: [GIT PULL] x86/mm for 6.4
  2023-05-02 16:00         ` Linus Torvalds
@ 2023-05-02 20:14           ` Linus Torvalds
  2023-05-03  0:53             ` Dave Hansen
  2023-05-03  8:01             ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2023-05-02 20:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov, Peter Zijlstra

On Tue, May 2, 2023 at 9:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> > I guess it also wouldn't matter as much either if we hid it in a helper
> > like the attached patch and I didn't have to read it twice. ;)
>
> Yeah, I think that's a good solution.

Hmm. And as I was rebasing the patch to fix up my patch, I realized
that the current -git top-of-tree state is actually broken.

That

  #define access_ok(addr, size)                                           \
  ({                                                                      \
          WARN_ON_IN_IRQ();                                               \
          likely(__access_ok(untagged_addr(addr), size));                 \
  })

is actually *wrong* in two ways.

Now, in my original patch, I added a comment about how that
"WARN_ON_IN_IRQ()" is bogus and this shouldn't be x86-specific at all.

I ended up going back in time to see why it was added, and I think it
was added because we used to access 'current' in access_ok(), due to
it using that user_addr_max() thing:

        likely(!__range_not_ok(addr, size, user_addr_max()));

but that was all removed by the set_fs() removal by Christoph Hellwig.

So there is actually nothing task-related in "access_ok()" any more,
and any warning about using it in the wrong context is just bogus.
That warning simply shouldn't exist any more (or maybe it should be in
a different place, like the actual user copy functions).

But that's actually not the problem with the above.

No, the problem is that probably *because* "access_ok()" has that
warning, not all users use "access_ok()" at all. We have places that
use "__access_ok()" instead. Like copy_from_nmi().

So now copy_from_nmi() doesn't do the untagging, so if you were to use
tagged pointers for the stack, you'd not get stack traces.

End result: I think that

 (a) that WARN_ON_IN_IRQ() is actively detrimental and causes problems

 (b) the current "use untagged_addr() in access_ok()" model is also broken

and my patch - which was meant to just improve code generation -
actually fixes this.

            Linus


* Re: [GIT PULL] x86/mm for 6.4
  2023-05-02 20:14           ` Linus Torvalds
@ 2023-05-03  0:53             ` Dave Hansen
  2023-05-03  1:17               ` Linus Torvalds
  2023-05-03  8:01             ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2023-05-03  0:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov, Peter Zijlstra

On 5/2/23 13:14, Linus Torvalds wrote:
> No, the problem is that probably *because* "access_ok()" has that
> warning, not all users use "access_ok()" at all. We have places that
> use "__access_ok()" instead. Like copy_from_nmi().
> 
> So now copy_from_nmi() doesn't do the untagging, so if you were to use
> tagged pointers for the stack, you'd not get stack traces.
> 
> End result: I think that
> 
>  (a) that WARN_ON_IN_IRQ() is actively detrimental and causes problems
> 
>  (b) the current "use untagged_addr() in access_ok()" model is also broken

Ugh, yes.

The fallout seems limited to (probably) perf and tracing poking at user
stack frames.  But, yes, it definitely looks broken there.

While I bet we could shoehorn the existing tlbstate checks into the
__access_ok() sites, I also vastly prefer the high bit checks in
access_ok() instead.  The less state we have to consult, the better.

Once the WARN_ON_IN_IRQ() is gone, it seems like it's just a matter of
collapsing __access_ok() into access_ok() and converting the (~3) callers.




* Re: [GIT PULL] x86/mm for 6.4
  2023-05-03  0:53             ` Dave Hansen
@ 2023-05-03  1:17               ` Linus Torvalds
  2023-05-03 16:38                 ` Linus Torvalds
  0 siblings, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2023-05-03  1:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov, Peter Zijlstra

On Tue, May 2, 2023 at 5:53 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> The fallout seems limited to (probably) perf and tracing poking at user
> stack frames.  But, yes, it definitely looks broken there.

So I ended up doing more cleanups - moving the 64-bit specific code to
'uaccess_64.h' etc.

And in the process I found another broken thing:
__untagged_addr_remote() is very very buggy.

The reason?

        long sign = addr >> 63;

that does *not* do at all what '__untagged_addr()' does, because while
'sign' is a signed long, 'addr' is an *unsigned* long.

So the actual shift ends up being done as an unsigned shift, and then
just the result is assigned to a signed variable.

End result? 'sign' ends up being 0 for user space (intentional) or 1
for kernel space (not intentional)..

It's not 0 or ~0 as it is in __untagged_addr(), which uses a signed
shift in the inline asm ('sar').
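
The C semantics are easy to demonstrate in a trivial userspace program
(illustration only, obviously not kernel code):

    #include <stdio.h>

    int main(void)
    {
            unsigned long addr = 0xffff888000000000UL; /* kernel address */

            long buggy = addr >> 63;        /* logical shift: 1 */
            long fixed = (long)addr >> 63;  /* arithmetic shift: -1 */

            printf("buggy=%ld fixed=%ld\n", buggy, fixed);
            return 0;
    }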

That said, while this is all horribly buggy, I'm not actually 100%
sure that we care too much about untagging kernel addresses. So it's
an error case that doesn't get marked as an error.

So I guess in *practice* the horribly buggy code is actually just a
small harmless bug, and causes an invalid (kernel) address passed in
to look like a possibly valid user address after all.

HOWEVER.

Why does it do that "shift-by-63" game there, instead of making
tlbstate_untag_mask just have bit #63 always set?

Untagging a kernel address is not a sensible operation, so the only
thing you want is to *keep* a kernel address as a bad address. You
don't have to actually keep it a *valid* kernel address, it just
should not become a (possibly valid) user address after untagging.

Hmmm? Am I missing something?

                  Linus


* Re: [GIT PULL] x86/mm for 6.4
  2023-05-02 20:14           ` Linus Torvalds
  2023-05-03  0:53             ` Dave Hansen
@ 2023-05-03  8:01             ` Peter Zijlstra
  2023-05-03 16:38               ` Linus Torvalds
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2023-05-03  8:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov

On Tue, May 02, 2023 at 01:14:33PM -0700, Linus Torvalds wrote:
> On Tue, May 2, 2023 at 9:00 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > > I guess it also wouldn't matter as much either if we hid it in a helper
> > > like the attached patch and I didn't have to read it twice. ;)
> >
> > Yeah, I think that's a good solution.
> 
> Hmm. And as I was rebasing the patch to fix up my patch, I realized
> that the current -git top-of-tree state is actually broken.
> 
> That
> 
>   #define access_ok(addr, size)                                           \
>   ({                                                                      \
>           WARN_ON_IN_IRQ();                                               \
>           likely(__access_ok(untagged_addr(addr), size));                 \
>   })
> 
> is actually *wrong* in two ways.
> 
> Now, in my original patch, I added a comment about how that
> "WARN_ON_IN_IRQ()" is bogus and this shouldn't be x86-specific at all.
> 
> I ended up going back in time to see why it was added, and I think it
> was added because we used to access 'current' in access_ok(), due to
> it using that user_addr_max() thing:
> 
>         likely(!__range_not_ok(addr, size, user_addr_max()));
> 
> but that was all removed by the set_fs() removal by Christoph Hellwig.

So I had a poke around, trying to figure out where it came from, and
yes. Commit ae31fe51a3cc ("perf/x86: Restore TASK_SIZE check on frame
pointer") is the reason that WARN_ON_IN_IRQ() thing got added.




* Re: [GIT PULL] x86/mm for 6.4
  2023-05-03  1:17               ` Linus Torvalds
@ 2023-05-03 16:38                 ` Linus Torvalds
  2023-05-03 16:44                   ` Dave Hansen
                                     ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Linus Torvalds @ 2023-05-03 16:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov, Peter Zijlstra

On Tue, May 2, 2023 at 6:17 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And in the process I found another broken thing:
> __untagged_addr_remote() is very very buggy.
>
> The reason?
>
>         long sign = addr >> 63;
>
> that does *not* do at all what '__untagged_addr()' does, because while
> 'sign' is a signed long, 'addr' is an *unsigned* long.
>
> So the actual shift ends up being done as an unsigned shift, and then
> just the result is assigned to a signed variable.
>
> End result? 'sign' ends up being 0 for user space (intentional) or 1
> for kernel space (not intentional)..

Looking around, this same bug used to exists for the normal
(non-remote) case too, until it was accidentally fixed when changing
that to use inline asm and the alternatives code.

At that point the non-remote case got an explicit 'sar' instruction,
and the result really was ~0 for kernel mode addresses.

> Why does it do that "shift-by-63" game there, instead of making
> tlbstate_untag_mask just have bit #63 always set?

And it turns out that bit #63 really _is_ always set, so I think the
solution to this all is to remove the sign games in untag_addr()
entirely.

Untagging a kernel address will "corrupt" it, but it will stay a
kernel address (well, it will stay a "high bit set" address), which is
all we care about anyway.
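
(With bit #63 always set in the mask, untagging reduces to a plain
'and' -- a sketch of the idea, not the final patch:)

    static inline unsigned long untag_addr(unsigned long addr)
    {
            /* The mask is -1UL (LAM off) or ~GENMASK(62, 57) (LAM_U57);
             * bit 63 is set either way, so a kernel address keeps its
             * high bit and never turns into a plausible user pointer. */
            return addr & current_untag_mask();
    }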

If somebody actually tries to untag a kernel address, that would be a
bug anyway, as far as I can tell.

So I'm going to just remove the 'sign' games entirely. They are
completely broken in 'untagged_addr_remote()', they _used_ to be
completely broken in 'untagged_addr()', and it looks like it's all
unnecessary.

           Linus


* Re: [GIT PULL] x86/mm for 6.4
  2023-05-03  8:01             ` Peter Zijlstra
@ 2023-05-03 16:38               ` Linus Torvalds
  0 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2023-05-03 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov

On Wed, May 3, 2023 at 1:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I had a poke around, trying to figure out where it came from, and
> yes. Commit ae31fe51a3cc ("perf/x86: Restore TASK_SIZE check on frame
> pointer") is the reason that WARN_ON_IN_IRQ() thing got added.

I added a pointer to that commit in the commit message, and took the
above to mean 'Ack' from you.

               Linus


* Re: [GIT PULL] x86/mm for 6.4
  2023-05-03 16:38                 ` Linus Torvalds
@ 2023-05-03 16:44                   ` Dave Hansen
  2023-05-03 16:51                     ` Linus Torvalds
  2023-05-03 17:54                   ` Linus Torvalds
  2023-05-04  6:28                   ` Kirill A. Shutemov
  2 siblings, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2023-05-03 16:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov, Peter Zijlstra

On 5/3/23 09:38, Linus Torvalds wrote:
>> Why does it do that "shift-by-63" game there, instead of making
>> tlbstate_untag_mask just have bit #63 always set?
> And it turns out that bit #63 really _is_ always set, so I think the
> solution to this all is to remove the sign games in untag_addr()
> entirely.

Yes, there are only two possible values right now, both of which have
bit 63 set:

LAM off: mm->context.untag_mask = -1UL;
LAM on:  mm->context.untag_mask = ~GENMASK(62, 57);

> Untagging a kernel address will "corrupt" it, but it will stay a
> kernel address (well, it will stay a "high bit set" address), which is
> all we care about anyway.
> 
> If somebody actually tries to untag a kernel address, that would be a
> bug anyway, as far as I can tell.

Is it a bug?  The do_madvise() path, for instance, is passing a value in
there that came right from userspace.

> So I'm going to just remove the 'sign' games entirely. They are
> completely broken in 'untagged_addr_remote()', they _used_ to be
> completely broken in 'untagged_addr()', and it looks like it's all
> unnecessary.

Yes, it looks completely superfluous.


* Re: [GIT PULL] x86/mm for 6.4
  2023-05-03 16:44                   ` Dave Hansen
@ 2023-05-03 16:51                     ` Linus Torvalds
  0 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2023-05-03 16:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov, Peter Zijlstra

On Wed, May 3, 2023 at 9:45 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/3/23 09:38, Linus Torvalds wrote:
> > If somebody actually tries to untag a kernel address, that would be a
> > bug anyway, as far as I can tell.
>
> Is it a bug?  The do_madvise() path, for instance, is passing a value in
> there that came right from userspace.

That's still a "user address" - just not a *valid* one.

So we do not want to mask the high bit off - because that is what will
catch people later doing things like vma address range comparisons on
it and notice "that's not a valid address", but it's also not a
"kernel address" that we need to preserve as such.

So yeah, it's a bit confusing in that it's _also_ true that "kernel
addresses have the high bit set" and "user addresses have the high bit
clear", and I'm basically using two different semantics for "kernel
address".

IOW: what I mean by "it's not valid to do 'untagged_addr()' on a
kernel address" is that you can't take a (valid) kernel address, do
'untagged_addr()' on it, and expect it to still work as a kernel
address.

But at the same time you *are* supposed to be able to use
'untagged_addr()' on a - untrusted and possibly invalid - user
pointer, and it's supposed to end up having the tag bits clear and
still be usable as a user pointer. And yet still also be caught by any
validity checks (ie a high bit set would never be a valid user
pointer, not even after doing 'untagged_addr()' on it).

                Linus


* Re: [GIT PULL] x86/mm for 6.4
  2023-05-03 16:38                 ` Linus Torvalds
  2023-05-03 16:44                   ` Dave Hansen
@ 2023-05-03 17:54                   ` Linus Torvalds
  2023-05-03 19:01                     ` Peter Zijlstra
  2023-05-04  6:28                   ` Kirill A. Shutemov
  2 siblings, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2023-05-03 17:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov, Peter Zijlstra

On Wed, May 3, 2023 at 9:38 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I'm going to just remove the 'sign' games entirely. They are
> completely broken in 'untagged_addr_remote()', they _used_ to be
> completely broken in 'untagged_addr()', and it looks like it's all
> unnecessary.

Ok, I've pushed out my changes to the 'x86-uaccess-cleanup' branch.

I think it's all good, but it would be really nice to hear it's been
tested on a setup that actually has LAM (simulator? or maybe there is
actual hw inside Intel)

And any other commentary is appreciated,

              Linus


* Re: [GIT PULL] x86/mm for 6.4
  2023-05-03 17:54                   ` Linus Torvalds
@ 2023-05-03 19:01                     ` Peter Zijlstra
  2023-05-03 19:19                       ` Linus Torvalds
  2023-05-05 19:56                       ` Linus Torvalds
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2023-05-03 19:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov

On Wed, May 03, 2023 at 10:54:38AM -0700, Linus Torvalds wrote:
> On Wed, May 3, 2023 at 9:38 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So I'm going to just remove the 'sign' games entirely. They are
> > completely broken in 'untagged_addr_remote()', they _used_ to be
> > completely broken in 'untagged_addr()', and it looks like it's all
> > unnecessary.
> 
> Ok, I've pushed out my changes to the 'x86-uaccess-cleanup' branch.
> 
> I think it's all good, but it would be really nice to hear it's been
> tested on a setup that actually has LAM (simulator? or maybe there is
> actual hw inside Intel)
> 
> And any other commentary is appreciated,

Looks sane from a first reading. But I'll try and have another look
tomorrow.

Also, per how 2's complement is constructed, 0 has to be in the
positive space for it to be 'half'. And typically 0 is included in N,
with the traditional 'positive integers' set then written as N \ {0}
-- but that's somewhat modern and a lot of variation exists, although
books typically specify where they stand on that issue.

I suppose that's a very long-winded way of saying that yes, of course 0
is a positive number :-)




* Re: [GIT PULL] x86/mm for 6.4
  2023-05-03 19:01                     ` Peter Zijlstra
@ 2023-05-03 19:19                       ` Linus Torvalds
  2023-05-05 19:56                       ` Linus Torvalds
  1 sibling, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2023-05-03 19:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov

On Wed, May 3, 2023 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> I suppose that's a very long-winded way of saying that yes, of course 0
> is a positive number :-)

Well, I do consider it as such, but I guess I took all my math classes
when people still considered negative, zero and positive to be three
disjoint sets, and 'non-negative' was required for rigor.

Some googling around says that a lot of people still think that, and
that it might even be language-specific.

I think the commit commentary about "ok, strictly non-negative" might
still be relevant. At least to some people, and at least for
sticklers.

Also, I do consider 0 to be part of ℕ, although I wouldn't consider
that to be an argument about "positive" at all.

The argument for 0 in ℕ would be that without it, you don't have an
identity element for addition, which would arguably make natural
numbers kind of broken as a set.

            Linus


* Re: [GIT PULL] x86/mm for 6.4
  2023-05-03 16:38                 ` Linus Torvalds
  2023-05-03 16:44                   ` Dave Hansen
  2023-05-03 17:54                   ` Linus Torvalds
@ 2023-05-04  6:28                   ` Kirill A. Shutemov
  2023-05-04 15:25                     ` Dave Hansen
  2023-05-04 17:10                     ` Linus Torvalds
  2 siblings, 2 replies; 28+ messages in thread
From: Kirill A. Shutemov @ 2023-05-04  6:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Dave Hansen, x86, linux-kernel, kirill.shutemov,
	Peter Zijlstra

On Wed, May 03, 2023 at 09:38:03AM -0700, Linus Torvalds wrote:
> > Why does it do that "shift-by-63" game there, instead of making
> > tlbstate_untag_mask just have bit #63 always set?
> 
> And it turns out that bit #63 really _is_ always set, so I think the
> solution to this all is to remove the sign games in untag_addr()
> entirely.

Untagging a kernel pointer with LAM enabled will land it in the
canonical hole, which is safe as it leads to a #GP on dereference.

> Untagging a kernel address will "corrupt" it, but it will stay a
> kernel address (well, it will stay a "high bit set" address), which is
> all we care about anyway.

The interesting case to consider is untagging a kernel pointer when
LAM_U48 is enabled (not part of the current LAM enabling). LAM_U48
would make the untagging mask wider -- ~GENMASK(62, 48). With 5-level
paging and LAM_SUP enabled (also not supported yet), untagging a
kernel pointer may transform it into another valid kernel pointer.

So we cannot rely on #GP as a backstop here. The kernel has to
exclude kernel pointers by other means. It can be fun to debug.

Looks like you are not worried about this kind of corruption. Hm.

Maybe it's worth indicating in the helper name that it is intended
only for userspace addresses? Rename it to untagged_uaddr() or
something?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [GIT PULL] x86/mm for 6.4
  2023-05-04  6:28                   ` Kirill A. Shutemov
@ 2023-05-04 15:25                     ` Dave Hansen
  2023-05-04 17:10                       ` Kirill A. Shutemov
  2023-05-04 17:10                     ` Linus Torvalds
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2023-05-04 15:25 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds
  Cc: Dave Hansen, x86, linux-kernel, kirill.shutemov, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 1937 bytes --]

On 5/3/23 23:28, Kirill A. Shutemov wrote:
>> Untagging a kernel address will "corrupt" it, but it will stay a
>> kernel address (well, it will stay a "high bit set" address), which is
>> all we care about anyway.
> The interesting case to consider is untagging a kernel pointer when
> LAM_U48 is enabled (not part of the current LAM enabling). LAM_U48
> would make the untagging mask wider -- ~GENMASK(62, 48). With 5-level
> paging and LAM_SUP enabled (also not supported yet), untagging a
> kernel pointer may transform it into another valid kernel pointer.
> 
> So we cannot rely on #GP as a backstop here. The kernel has to
> exclude kernel pointers by other means. It can be fun to debug.

Yeah, I have the feeling that we're really going to need a pair of
untagging functions once we get to doing kernel LAM for a _bunch_ of
reasons.

Just as a practical matter, I think we should OR bits into the mask on
the kernel side, effectively:

unsigned long untag_kernel_addr(unsigned long addr)
{
	return addr | kernel_untag_mask;
}

and kernel_untag_mask should have bit 63 *clear*.

That way the pointers that have gone through untagging won't look silly.
 If you untag VMALLOC_END or something, it'll still look like the
addresses we have in mm.rst.

Also, it'll be impossible to have the mask turn a userspace address into
a kernel one.

Last, we can add some debugging in there, probably conditional on some
mm debugging options like:

	if (WARN_ON_ONCE(!valid_user_address(addr)))
		return 0;

It's kinda like "void __user *" versus "void *".  The __user ones can
*absolutely* point anywhere, user or kernel.  That's why we can't WARN()
in the untagged_addr() function that takes user pointers.

But "void *" can only point to the kernel.  It has totally different rules.

We should probably also do something like the attached patch sooner
rather than later.  'untag_mask' really is a misleading name for a mask
that gets applied only to user addresses.

[-- Attachment #2: lam.patch --]
[-- Type: text/x-patch, Size: 6944 bytes --]

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 0da5c227f490..adee70ee398c 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -51,7 +51,7 @@ typedef struct {
 	unsigned long lam_cr3_mask;
 
 	/* Significant bits of the virtual address. Excludes tag bits. */
-	u64 untag_mask;
+	u64 user_untag_mask;
 #endif
 
 	struct mutex lock;
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 1d29dc791f5a..b4bb4b1a36e4 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -94,18 +94,18 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
 static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
-	mm->context.untag_mask = oldmm->context.untag_mask;
+	mm->context.user_untag_mask = oldmm->context.user_untag_mask;
 }
 
 #define mm_untag_mask mm_untag_mask
 static inline unsigned long mm_untag_mask(struct mm_struct *mm)
 {
-	return mm->context.untag_mask;
+	return mm->context.user_untag_mask;
 }
 
 static inline void mm_reset_untag_mask(struct mm_struct *mm)
 {
-	mm->context.untag_mask = -1UL;
+	mm->context.user_untag_mask = -1UL;
 }
 
 #define arch_pgtable_dma_compat arch_pgtable_dma_compat
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 75bfaa421030..9fc6a2ac5318 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -55,11 +55,11 @@ static inline void cr4_clear_bits(unsigned long mask)
 }
 
 #ifdef CONFIG_ADDRESS_MASKING
-DECLARE_PER_CPU(u64, tlbstate_untag_mask);
+DECLARE_PER_CPU(u64, tlbstate_user_untag_mask);
 
-static inline u64 current_untag_mask(void)
+static inline u64 current_user_untag_mask(void)
 {
-	return this_cpu_read(tlbstate_untag_mask);
+	return this_cpu_read(tlbstate_user_untag_mask);
 }
 #endif
 
@@ -389,7 +389,7 @@ static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
 {
 	this_cpu_write(cpu_tlbstate.lam,
 		       mm->context.lam_cr3_mask >> X86_CR3_LAM_U57_BIT);
-	this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
+	this_cpu_write(tlbstate_user_untag_mask, mm->context.user_untag_mask);
 }
 
 #else
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 123135d60f72..044d82e98b27 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -36,16 +36,16 @@ static inline unsigned long __untagged_addr(unsigned long addr)
 	long sign;
 
 	/*
-	 * Refer tlbstate_untag_mask directly to avoid RIP-relative relocation
-	 * in alternative instructions. The relocation gets wrong when gets
-	 * copied to the target place.
+	 * Refer to tlbstate_user_untag_mask directly to avoid a RIP-relative
+	 * relocation in alternative instructions. The relocation goes wrong
+	 * when it gets copied to the target place.
 	 */
 	asm (ALTERNATIVE("",
 			 "sar $63, %[sign]\n\t" /* user_ptr ? 0 : -1UL */
-			 "or %%gs:tlbstate_untag_mask, %[sign]\n\t"
+			 "or %%gs:tlbstate_user_untag_mask, %[sign]\n\t"
 			 "and %[sign], %[addr]\n\t", X86_FEATURE_LAM)
 	     : [addr] "+r" (addr), [sign] "=r" (sign)
-	     : "m" (tlbstate_untag_mask), "[sign]" (addr));
+	     : "m" (tlbstate_user_untag_mask), "[sign]" (addr));
 
 	return addr;
 }
@@ -55,20 +55,20 @@ static inline unsigned long __untagged_addr(unsigned long addr)
 	(__force __typeof__(addr))__untagged_addr(__addr);		\
 })
 
-static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
-						   unsigned long addr)
+static inline unsigned long __untag_remote_user_addr(struct mm_struct *mm,
+						     unsigned long addr)
 {
 	long sign = addr >> 63;
 
 	mmap_assert_locked(mm);
-	addr &= (mm)->context.untag_mask | sign;
+	addr &= (mm)->context.user_untag_mask | sign;
 
 	return addr;
 }
 
-#define untagged_addr_remote(mm, addr)	({				\
+#define untag_remote_user_addr(mm, addr)	({				\
 	unsigned long __addr = (__force unsigned long)(addr);		\
-	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
+	(__force __typeof__(addr))__untag_remote_user_addr(mm, __addr);	\
 })
 
 #else
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3d4dd9420c30..929152b28b8c 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -580,7 +580,7 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 		goto done;
 	}
 
-	vaddr = untagged_addr_remote(mm, vaddr);
+	vaddr = untag_remote_user_addr(mm, vaddr);
 
 retry:
 	vma = vma_lookup(mm, vaddr);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 420510f6a545..7ccd3f341af4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1692,7 +1692,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
 		ret = mmap_read_lock_killable(mm);
 		if (ret)
 			goto out_free;
-		start_vaddr = untagged_addr_remote(mm, svpfn << PAGE_SHIFT);
+		start_vaddr = untag_remote_user_addr(mm, svpfn << PAGE_SHIFT);
 		mmap_read_unlock(mm);
 	}
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 3064314f4832..8c956172252f 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -25,8 +25,8 @@
 #define untagged_addr(addr) (addr)
 #endif
 
-#ifndef untagged_addr_remote
-#define untagged_addr_remote(mm, addr)	({		\
+#ifndef untag_remote_user_addr
+#define untag_remote_user_addr(mm, addr)	({		\
 	mmap_assert_locked(mm);				\
 	untagged_addr(addr);				\
 })
diff --git a/mm/gup.c b/mm/gup.c
index bbe416236593..d22bad9147d7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1085,7 +1085,7 @@ static long __get_user_pages(struct mm_struct *mm,
 	if (!nr_pages)
 		return 0;
 
-	start = untagged_addr_remote(mm, start);
+	start = untag_remote_user_addr(mm, start);
 
 	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
 
@@ -1259,7 +1259,7 @@ int fixup_user_fault(struct mm_struct *mm,
 	struct vm_area_struct *vma;
 	vm_fault_t ret;
 
-	address = untagged_addr_remote(mm, address);
+	address = untag_remote_user_addr(mm, address);
 
 	if (unlocked)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
diff --git a/mm/madvise.c b/mm/madvise.c
index b5ffbaf616f5..ffb5d76a9481 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1421,7 +1421,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 		mmap_read_lock(mm);
 	}
 
-	start = untagged_addr_remote(mm, start);
+	start = untag_remote_user_addr(mm, start);
 	end = start + len;
 
 	blk_start_plug(&plug);
diff --git a/mm/migrate.c b/mm/migrate.c
index 01cac26a3127..7647c0a6ff30 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2109,7 +2109,7 @@ static int add_page_for_migration(struct mm_struct *mm, const void __user *p,
 	bool isolated;
 
 	mmap_read_lock(mm);
-	addr = (unsigned long)untagged_addr_remote(mm, p);
+	addr = (unsigned long)untag_remote_user_addr(mm, p);
 
 	err = -EFAULT;
 	vma = vma_lookup(mm, addr);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [GIT PULL] x86/mm for 6.4
  2023-05-04  6:28                   ` Kirill A. Shutemov
  2023-05-04 15:25                     ` Dave Hansen
@ 2023-05-04 17:10                     ` Linus Torvalds
  1 sibling, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2023-05-04 17:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Dave Hansen, x86, linux-kernel, kirill.shutemov,
	Peter Zijlstra

On Wed, May 3, 2023 at 11:28 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Wed, May 03, 2023 at 09:38:03AM -0700, Linus Torvalds wrote:
> > > Why does it do that "shift-by-63" game there, instead of making
> > > tlbstate_untag_mask just have bit #63 always set?
> >
> > And it turns out that bit #63 really _is_ always set, so I think the
> > solution to this all is to remove the sign games in untag_addr()
> > entirely.
>
> Untagging a kernel pointer with LAM enabled will land it in the canonical
> hole, which is safe as it leads to #GP on dereference.

You are entirely missing the point.

The GP fault does *NOT MATTER*.

Think of 'untagged_addr()' as a companion to - but absolutely *NOT* a
replacement for - 'access_ok()'.

You have three distinct cases:

 (a) user-supplied valid user address

 (b) user-supplied invalid user address (i.e. high bit set)

 (c) actual kernel address

and 'untagged_addr()' and 'access_ok()' work on the same basic input
domain: cases (a) and (b).

And the important thing for 'untagged_addr()' is that in case (a) it
needs to remove the tag bits, and in case (b) it needs to *keep* the
address as an invalid user address.
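
Spelled out in C (my illustration; the real x86 helper does the same
thing in inline asm):

static unsigned long untag_user_addr(unsigned long addr,
				     unsigned long untag_mask)
{
	long sign = (long)addr >> 63;	/* 0 for case (a), -1 for (b) */

	/* (a): tag bits stripped; (b): mask | -1UL leaves addr as-is. */
	return addr & (untag_mask | (unsigned long)sign);
}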

Note that it does not need to keep the value the *same*. Nobody cares.

And also note that the resulting pointer may or may not GP-fault. Nobody cares.

Just to hit this home with a very explicit example, think of the case
where the kernel config for address masking isn't even enabled, and
'untagged_addr()' is a 1:1 map with no changes.

Doing 'untagged_addr()' on a valid user address results in a valid
user address. And doing it on an invalid user address results in an
invalid user address. GOOD.

Note that doing 'untagged_addr()' on an invalid user address does NOT
IN ANY WAY make that invalid address somehow "safe" by causing a GP
fault. Sure, it _might_ GP-fault. Or it might not. It might point to
random kernel data.

Verification of the address is simply not the job of
'untagged_addr()'. Never has been, never will be, and fundamentally
*cannot* be, since for the foreseeable future 'untagged_addr()' is a
no-op for every single user outside of Intel.

Verification is separate. The verification is never "let's randomly
just access this pointer and see if it gets a GP-fault". No. We have a
user pointer, it needs *checking*. It needs to have something like
"use lookup_vma() to look up the vma that is associated with that
address". But it could also be something like "just do the range check
in GUP".

And that's why "keep an invalid user address as an invalid address",
because that *separate* stage of verifying the address needs to still
show that it's invalid.

Now, sometimes the "verification" might actually be "access_ok()"
itself, but honestly, if the address is used for an actual access,
then it shouldn't have gone through the 'untagged_addr()' thing at
all. It should just have been used as-is for the access. So normally
'untagged_addr()' does not get used *together* with 'access_ok()',
although that should obviously also work.

End result: all that really matters on x86-64 is that
'untagged_addr()' must keep the high bit as-is. That's the "this is
not a valid user address" bit. That's very much explicit in my current
series, of course, but even without my current series it was
implicitly the truth with those LAM patches (particularly considering
that 'untagged_addr()' didn't actually do the "keep kernel addresses
as-is" that it *thought* it did due to the signed type confusion).

So what about that (c) case? It doesn't matter. It's simply
fundamentally wrong for the kernel to pass an actual kernel address to
'untagged_addr()' and expect something useful back. It's a nonsensical
thing to do, and it's a huge bug.

So for the (c) case, the fact that the result would be useless and
*usually* GP-fault on access is a good thing. But it's not a
requirement, it's more of a "oh, cool, it's good that the nonsensical
operation causes quick failures".

So in that case, the GP fault is a "feature", but not a requirement.
Again, the normal 'untagged_addr()' case (of not changing the pointer
at all), obviously does *not* cause the kernel pointer corruption, but
maybe we could even have a debug mode. We could literally make
'untagged_addr()' do something like

   #ifdef CONFIG_DEBUG_NONLAM
   // Make sure nobody tries to use untagged_addr() on non-user addresses
   #define untagged_addr(x) ((x) | (long)(x)>>63)
   #endif

except obviously with the "get the types right and use 'x' only once"
thing (so the above #define is buggy, and purely for conceptual
documentation purposes).
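
As a function, the same debug idea without the double evaluation
(CONFIG_DEBUG_NONLAM stays a hypothetical option name):

static inline unsigned long debug_untagged_addr(unsigned long addr)
{
	/* Smear bit 63 down: kernel addresses become ~0UL junk, while
	 * valid user addresses pass through untouched. */
	return addr | (unsigned long)((long)addr >> 63);
}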

See?

Side note: one day, maybe we want to use address tagging *inside* the
kernel. However, that will not use 'untagged_addr()'. That would use
some *other* model for cleaning up kernel pointers when necessary.

Even on x86-64, the tagging rules for kernel and user space are
entirely different, in that user-space LAM rules are "U48" or "U57",
while kernel LAM rules depend on the paging mode, and the two are
*not* tied together.

But more importantly from a kernel perspective: regardless of hardware
implementations like that, the notion of masking bits off an untrusted
user pointer is simply completely *different* from the notion of
masking off bits of our own kernel pointers. You can see this in the
kernel everywhere: user pointers should be statically always user
pointers, and should look something like

   struct iocb __user *user_iocb

which is very very different from a kernel pointer that doesn't have
that "__user" thing. Again, on some architectures, the *EXACT*SAME*
bit pattern pointer value may point to different things, because user
accesses are simply in a different address space entirely.

So if/when KASAN starts using LAM inside the kernel, it will do its
own things. It had better not use "untagged_addr()".

           Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [GIT PULL] x86/mm for 6.4
  2023-05-04 15:25                     ` Dave Hansen
@ 2023-05-04 17:10                       ` Kirill A. Shutemov
  0 siblings, 0 replies; 28+ messages in thread
From: Kirill A. Shutemov @ 2023-05-04 17:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Dave Hansen, x86, linux-kernel, kirill.shutemov,
	Peter Zijlstra

On Thu, May 04, 2023 at 08:25:58AM -0700, Dave Hansen wrote:
> On 5/3/23 23:28, Kirill A. Shutemov wrote:
> >> Untagging a kernel address will "corrupt" it, but it will stay a
> >> kernel address (well, it will stay a "high bit set" address), which is
> >> all we care about anyway.
> > The interesting case to consider is untagging a kernel pointer when
> > LAM_U48 is enabled (not part of the current LAM enabling). LAM_U48
> > would make the untagging mask wider -- ~GENMASK(62, 48). With 5-level
> > paging and LAM_SUP enabled (also not supported yet), untagging a
> > kernel pointer may transform it into another valid kernel pointer.
> > 
> > So we cannot rely on #GP as a backstop here. The kernel has to
> > exclude kernel pointers by other means. That can be fun to debug.
> 
> Yeah, I have the feeling that we're really going to need a pair of
> untagging functions once we get to doing kernel LAM for a _bunch_ of
> reasons.

There's already arch_kasan_reset_tag() used on ARM64 (and two more
helpers to set/get the tag). I don't see a reason to add new ones.
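
Roughly, such a helper forces the tag bits of a kernel pointer back to
all-ones. A sketch, assuming a TBI-style tag in bits 63:56 rather than
the literal arm64 definition:

static inline void *reset_tag(const void *ptr)
{
	/* assumes tag bits 63:56, as with arm64 TBI kernel pointers */
	return (void *)((unsigned long)ptr | (0xffUL << 56));
}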

> Just as a practical matter, I think we should OR the mask bits into
> the address on the kernel side, effectively:
> 
> unsigned long untag_kernel_addr(unsigned long addr)
> {
> 	return addr | kernel_untag_mask;
> }
> 
> and kernel_untag_mask should have bit 63 *clear*.
> 
> That way the pointers that have gone through untagging won't look silly.
>  If you untag VMALLOC_END or something, it'll still look like the
> addresses we have in mm.rst.
> 
> Also, it'll be impossible to have the mask turn a userspace address into
> a kernel one.
> 
> Last, we can add some debugging in there, probably conditional on some
> mm debugging options like:
> 
> 	if (WARN_ON_ONCE(!valid_user_address(addr)))
> 		return 0;
> 
> It's kinda like "void __user *" versus "void *".  The __user ones can
> *absolutely* point anywhere, user or kernel.  That's why we can't WARN()
> in the untagged_addr() function that takes user pointers.
> 
> But "void *" can only point to the kernel.  It has totally different rules.
> 
> We should probably also do something like the attached patch sooner
> rather than later.  'untag_mask' really is a misleading name for a mask
> that gets applied only to user addresses.

A bit too verbose for my liking, but okay.

Maybe _uaddr() instead of _user_addr()?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [GIT PULL] x86/mm for 6.4
  2023-05-03 19:01                     ` Peter Zijlstra
  2023-05-03 19:19                       ` Linus Torvalds
@ 2023-05-05 19:56                       ` Linus Torvalds
  2023-05-05 20:59                         ` Kirill A. Shutemov
  1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2023-05-05 19:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Kirill A. Shutemov, Dave Hansen, x86, linux-kernel,
	kirill.shutemov

On Wed, May 3, 2023 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Looks sane from a first reading. But I'll try and have another look
> tomorrow.

Well, I didn't hear back any more, and I was merging architecture
code, so I merged my branch too.

It does fix at least two bugs, even if those bugs are not necessarily
all that noticeable.

I guess we'll see if it introduces new ones...

                    Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [GIT PULL] x86/mm for 6.4
  2023-05-05 19:56                       ` Linus Torvalds
@ 2023-05-05 20:59                         ` Kirill A. Shutemov
  0 siblings, 0 replies; 28+ messages in thread
From: Kirill A. Shutemov @ 2023-05-05 20:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Dave Hansen, Dave Hansen, x86, linux-kernel,
	kirill.shutemov

On Fri, May 05, 2023 at 12:56:31PM -0700, Linus Torvalds wrote:
> On Wed, May 3, 2023 at 12:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Looks sane from a first reading. But I'll try and have another look
> > tomorrow.
> 
> Well, I didn't hear back any more, and I was merging architecture
> code, so I merged my branch too.
> 
> It does fix at least two bugs, even if those bugs are not necessarily
> all that noticeable.
> 
> I guess we'll see if it introduces new ones...

The branch passed LAM selftests fine.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [GIT PULL] x86/mm for 6.4
  2023-04-28 20:07 ` Linus Torvalds
  2023-04-28 20:15   ` Linus Torvalds
@ 2023-06-16  8:47   ` Alexander Potapenko
  2023-06-16 16:22     ` Linus Torvalds
  1 sibling, 1 reply; 28+ messages in thread
From: Alexander Potapenko @ 2023-06-16  8:47 UTC (permalink / raw)
  To: torvalds
  Cc: dave.hansen, kirill.shutemov, linux-kernel, peterz, x86,
	Alexander Potapenko

Hi Linus,

>  static bool ex_handler_uaccess(const struct exception_table_entry *fixup,
> -			       struct pt_regs *regs, int trapnr)
> +			       struct pt_regs *regs, int trapnr,
> +			       unsigned long fault_address)
>  {
> -	WARN_ONCE(trapnr == X86_TRAP_GP, "General protection fault in user access. Non-canonical address?");
> +	WARN_ONCE(trapnr == X86_TRAP_GP && !gp_fault_address_ok(fault_address),
> +		"General protection fault in user access. Non-canonical address?");
>  	return ex_handler_default(fixup, regs);
>  }

Shouldn't ex_handler_copy() be fixed in the same way?
Looks like it's still possible for a tagged userspace address to be passed to it and trigger a warning.

Alex

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [GIT PULL] x86/mm for 6.4
  2023-06-16  8:47   ` Alexander Potapenko
@ 2023-06-16 16:22     ` Linus Torvalds
  2023-06-19 12:03       ` Alexander Potapenko
  0 siblings, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2023-06-16 16:22 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: dave.hansen, kirill.shutemov, linux-kernel, peterz, x86

On Fri, 16 Jun 2023 at 01:47, Alexander Potapenko <glider@google.com> wrote:
>
> Shouldn't ex_handler_copy() be fixed in the same way?

I don't think ex_handler_copy() is actually reachable any more.

The only thing ex_handler_copy() did was to set %ax to the fault type.

It was used by the strange copy_user_generic_unrolled() that had a
special machine-check case for "don't do the tail if we get
X86_TRAP_MC".

But that was always bogus. The MC case in question was for the
__copy_user_nocache function, and the machine check case was for the
*destination*, which wasn't in user space at all.

So instead of looking at "what was the trap number", the code should
have just noticed "trapped on the destination", and stopped for *that*
reason.

See commit 034ff37d3407 ("x86: rewrite '__copy_user_nocache'
function") and in particular, see the comment there about writes on
the destination:

  * An exception on a write means that we're
  * done, but we need to update the count
  * depending on where in the unrolled loop
  * we were.

but yeah, I never removed the actual now unused _ASM_EXTABLE_CPY case.

Honestly, I had no way of even testing the code. I doubt anybody does.
There are a couple of users:

 - rdma mis-uses it for regular kernel-to-kernel copies that don't
fault (because rdma wants the "nocache" case). So it can't fault at
all.

 - a couple of GPU drivers mis-use it similarly to rdma, but at least
with a user source in the form of __copy_from_user_inatomic_nocache()

 - _copy_from_iter_flushcache uses it for pmem and dax, because it
wants that machine check handling for non-volatile memory

So two of three users don't actually *have* the MC case at all on the
destination.

And the third case - the actual "copy using uncached accesses in order
to get errors on the destination" case - depends on hardware that I'm
not convinced really exists any more.

Oh, I think there's some odd ntb mis-use too.

I'd love for somebody to actually test the machine check case, but the
old code was completely insane with odd "we handle _one_ case of
4-byte alignment, but not actually the general case", so I wonder if
the MC case ever really worked in reality. And as mentioned, I am not
convinced the hardware is available.

                 Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [GIT PULL] x86/mm for 6.4
  2023-06-16 16:22     ` Linus Torvalds
@ 2023-06-19 12:03       ` Alexander Potapenko
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Potapenko @ 2023-06-19 12:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dave.hansen, kirill.shutemov, linux-kernel, peterz, x86

On Fri, Jun 16, 2023 at 6:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 16 Jun 2023 at 01:47, Alexander Potapenko <glider@google.com> wrote:
> >
> > Shouldn't ex_handler_copy() be fixed in the same way?
>
> I don't think ex_handler_copy() is actually reachable any more.
>
> The only thing ex_handler_copy() did was to set %ax to the fault type.
>
> It was used by the strange copy_user_generic_unrolled() that had
> special machine check case for "don't do the tail if we get
> X86_TRAP_MC".

Oh, I see, thanks!
I am using a downstream kernel (way before 034ff37d3407) that still
has _ASM_EXTABLE_CPY.
Looks like I just need to adjust ex_handler_copy() in my code.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-06-19 12:04 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-27 22:56 [GIT PULL] x86/mm for 6.4 Dave Hansen
2023-04-28 17:23 ` pr-tracker-bot
2023-04-28 20:07 ` Linus Torvalds
2023-04-28 20:15   ` Linus Torvalds
2023-04-29  0:38     ` Kirill A. Shutemov
2023-04-29  1:04       ` Linus Torvalds
2023-05-02 15:42       ` Dave Hansen
2023-05-02 16:00         ` Linus Torvalds
2023-05-02 20:14           ` Linus Torvalds
2023-05-03  0:53             ` Dave Hansen
2023-05-03  1:17               ` Linus Torvalds
2023-05-03 16:38                 ` Linus Torvalds
2023-05-03 16:44                   ` Dave Hansen
2023-05-03 16:51                     ` Linus Torvalds
2023-05-03 17:54                   ` Linus Torvalds
2023-05-03 19:01                     ` Peter Zijlstra
2023-05-03 19:19                       ` Linus Torvalds
2023-05-05 19:56                       ` Linus Torvalds
2023-05-05 20:59                         ` Kirill A. Shutemov
2023-05-04  6:28                   ` Kirill A. Shutemov
2023-05-04 15:25                     ` Dave Hansen
2023-05-04 17:10                       ` Kirill A. Shutemov
2023-05-04 17:10                     ` Linus Torvalds
2023-05-03  8:01             ` Peter Zijlstra
2023-05-03 16:38               ` Linus Torvalds
2023-06-16  8:47   ` Alexander Potapenko
2023-06-16 16:22     ` Linus Torvalds
2023-06-19 12:03       ` Alexander Potapenko
