* [RESEND PATCH v4 0/8] arm64: Allow 64-bit tasks to invoke compat syscalls
@ 2021-05-18  9:06 ` Amanieu d'Antras
  0 siblings, 0 replies; 40+ messages in thread
From: Amanieu d'Antras @ 2021-05-18  9:06 UTC
  Cc: Amanieu d'Antras, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, Steven Price, Arnd Bergmann, David Laight,
	Mark Brown, linux-arm-kernel, linux-kernel

This series allows AArch64 tasks to perform 32-bit syscalls by setting
the top bit of x8 and using AArch32 compat syscall numbers:

    syscall(0x80000000 | __ARM_NR_write, 1, "foo\n", 4);

Internally, setting this bit does the following:
- The remainder of x8 is treated as a compat syscall number and is used
  to index the compat syscall table.
- in_compat_syscall will return true for the duration of the syscall.
- VM allocations performed by the syscall will be located in the lower
  4G of the address space. A separate compat_mmap_base is used so that
  these allocations are still properly randomized.
- Interrupted compat syscalls are properly restarted as compat syscalls.
- Seccomp treats the syscall as having AUDIT_ARCH_ARM instead of
  AUDIT_ARCH_AARCH64. This affects the arch value seen by seccomp
  filters and reported by SIGSYS.
- PTRACE_GET_SYSCALL_INFO also treats the syscall as having
  AUDIT_ARCH_ARM. Recent versions of strace will correctly report the
  syscall name and parameters when an AArch64 task mixes 32-bit and
  64-bit syscalls.
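
As a rough illustration of the first and fifth points above, the split of
x8 into a flag bit and a compat syscall number can be modelled in plain
userspace C. This is only a sketch and not code from the series:
COMPAT_SYSCALL_BIT, struct decoded_syscall and decode_x8 are hypothetical
names, the two arch constants repeat the values from <linux/audit.h>, and
the sketch assumes that only the low 32 bits of x8 carry the number, as
the 0x80000000 constant in the example suggests.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the flag bit described in the cover letter. */
#define COMPAT_SYSCALL_BIT 0x80000000u

/* Same values as AUDIT_ARCH_ARM / AUDIT_ARCH_AARCH64 in <linux/audit.h>,
 * repeated here so the sketch builds without kernel headers. */
#define SKETCH_AUDIT_ARCH_ARM     0x40000028u
#define SKETCH_AUDIT_ARCH_AARCH64 0xC00000B7u

struct decoded_syscall {
	bool compat;   /* true: the compat (AArch32) table would be indexed */
	uint32_t nr;   /* syscall number with the flag bit cleared */
	uint32_t arch; /* arch value a seccomp filter or tracer would see */
};

/* Model of how a value placed in x8 would be interpreted. */
static struct decoded_syscall decode_x8(uint64_t x8)
{
	uint32_t scno = (uint32_t)x8; /* assumption: low 32 bits only */
	struct decoded_syscall d;

	d.compat = (scno & COMPAT_SYSCALL_BIT) != 0;
	d.nr = scno & ~COMPAT_SYSCALL_BIT;
	d.arch = d.compat ? SKETCH_AUDIT_ARCH_ARM : SKETCH_AUDIT_ARCH_AARCH64;
	return d;
}
```

With this model, decode_x8(0x80000000 | 4) describes the AArch32 write
syscall (number 4) reported as AUDIT_ARCH_ARM, while decode_x8(64)
describes the native AArch64 write reported as AUDIT_ARCH_AARCH64.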

This feature is intended for use in software compatibility layers which
emulate a 32-bit program on AArch64. This series has been tested on two
such emulators:
- Tango [1], which enables AArch32 binaries to run on AArch64 CPUs which
  do not have hardware support for AArch32. Tango is used to run virtual
  Android devices on AArch64 servers.
- FEX [2], an emulator for running x86 and x86_64 binaries on AArch64.
  FEX can already run many x86_64 programs including 3D games, but
  requires kernel support for running 32-bit x86 binaries.

Both FEX and Tango have previously attempted to translate 32-bit
syscalls purely in user mode like QEMU does for its user mode
emulation. While this works for simple programs, there are many
limitations which cannot be solved without kernel support, for example:
- There are a huge number of ioctls which behave differently in 32-bit
  mode. It is impractical and error-prone to manually emulate them all
  in user mode. Specifically, the kernel already has a well-tested and
  reliable compatibility layer and it makes sense to reuse this. QEMU
  supports emulating some ioctls in userspace but this still does not
  cover devices like GPUs which are needed for accelerated rendering.
- The 64-bit set_robust_list is not compatible with the 32-bit ABI. The
  compat version of set_robust_list must be used. Emulating this in user
  mode is not reliable since SIGKILL cannot be caught.
- io_uring uses iovec structures as part of its API, which have
  different sizes on 32-bit and 64-bit.
- ext4 represents positions in directories as 64-bit hashes, which break
  if they are truncated to 32 bits. There is special support for 32-bit
  off_t in the ext4 driver but this is only used when in_compat_syscall
  is true: https://bugzilla.kernel.org/show_bug.cgi?id=205957
- The io_setup syscall allocates a VM area for the AIO context and
  returns it. But there is no way to control where this context is
  allocated so it will almost always end up above the 4GB limit.
- Some ioctls will also perform VM allocations, with the same issues as
  io_setup. Search for "vm_mmap" in drivers/.
- Some file descriptors have alignment requirements which are not known
  to userspace. For example, a hugetlbfs file can only be mmaped at a
  huge page alignment but there is no way for userspace to know this
  when it needs to manually select an address below 4GB for the mapping.
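
To make the io_uring point in the list above concrete: the two iovec
layouts differ in size, so an array of them cannot be handed to a 64-bit
syscall unchanged. The sketch below uses fixed-width stand-ins
(iovec32_sketch and iovec64_sketch, both hypothetical names) instead of
the real struct iovec so that it behaves the same on any host, and shows
the kind of widening a user-mode translator would have to do by hand.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the 32-bit ABI layout: 4-byte pointer, 4-byte length. */
struct iovec32_sketch {
	uint32_t iov_base;
	uint32_t iov_len;
};

/* Stand-in for the 64-bit ABI layout: 8-byte pointer, 8-byte length. */
struct iovec64_sketch {
	uint64_t iov_base;
	uint64_t iov_len;
};

/* Widen n guest iovecs into the layout a 64-bit syscall expects. */
static void widen_iovecs(const struct iovec32_sketch *src,
			 struct iovec64_sketch *dst, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		dst[i].iov_base = src[i].iov_base; /* zero-extended */
		dst[i].iov_len = src[i].iov_len;
	}
}
```

Every structure that embeds a pointer or a size needs a similar
conversion; invoking the compat syscall lets the kernel's existing compat
layer do this instead.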

All of these issues are solved in FEX and Tango by invoking compat
syscalls directly. In the case of FEX, there remain some differences
between the arm and x86 ABIs due to alignment issues, but these are few
enough to be individually handled in userspace.

There is a precedent for exposing this functionality to userspace:
x86_64 has 2 ways to invoke 32-bit syscalls. The first is to use int
0x80 with a 32-bit syscall number and the second is to use
__X32_SYSCALL_BIT with a 64-bit syscall number. As such, the generic
kernel code is already able to properly handle tasks that invoke both
32-bit and 64-bit syscalls.

[1] https://www.amanieusystems.com/
[2] https://github.com/FEX-Emu/FEX

Changelog since v3:
- Renamed aarch64_compat_syscall to use_compat_syscall and enabled it
  permanently for AArch32 tasks.

Changelog since v2:
- Complete rewrite, based on the patch that was previously posted as:
  [PATCH v2] [RFC] arm64: Exposes support for 32-bit syscalls

Amanieu d'Antras (8):
  mm: Add arch_get_mmap_base_topdown macro
  hugetlbfs: Use arch_get_mmap_* macros
  mm: Support mmap_compat_base with the generic layout
  arm64: Separate in_compat_syscall from is_compat_task
  arm64: mm: Use HAVE_ARCH_COMPAT_MMAP_BASES
  arm64: Add a compat syscall flag to thread_info
  arm64: Forbid calling compat sigreturn from 64-bit tasks
  arm64: Allow 64-bit tasks to invoke compat syscalls

 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/compat.h      | 24 ++++++++++++---
 arch/arm64/include/asm/elf.h         | 21 ++++++++++---
 arch/arm64/include/asm/ftrace.h      |  2 +-
 arch/arm64/include/asm/processor.h   | 32 +++++++++++--------
 arch/arm64/include/asm/syscall.h     |  6 ++--
 arch/arm64/include/asm/thread_info.h |  6 ++++
 arch/arm64/include/uapi/asm/unistd.h |  2 ++
 arch/arm64/kernel/ptrace.c           |  2 +-
 arch/arm64/kernel/signal.c           |  5 +++
 arch/arm64/kernel/signal32.c         |  8 +++++
 arch/arm64/kernel/syscall.c          | 23 ++++++++++++--
 arch/arm64/mm/mmap.c                 | 33 ++++++++++++++++++++
 fs/hugetlbfs/inode.c                 | 22 ++++++++++---
 mm/mmap.c                            | 14 ++++++---
 mm/util.c                            | 46 +++++++++++++++++++++++-----
 16 files changed, 202 insertions(+), 45 deletions(-)

-- 
2.31.1


* [RESEND PATCH v4 1/8] mm: Add arch_get_mmap_base_topdown macro
  2021-05-18  9:06 ` Amanieu d'Antras
@ 2021-05-18  9:06   ` Amanieu d'Antras
  -1 siblings, 0 replies; 40+ messages in thread
From: Amanieu d'Antras @ 2021-05-18  9:06 UTC
  Cc: Amanieu d'Antras, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, Steven Price, Arnd Bergmann, David Laight,
	Mark Brown, linux-arm-kernel, linux-kernel

This allows architectures to customize the mmap base to use depending on
the direction of allocation.

The base argument is also removed from arch_get_mmap_base[_topdown] in
preparation for future changes.

arm64 is currently the only user of the arch_get_mmap_* macros and is
adjusted accordingly. Specifically it only needs to limit the upper
bound of VM allocations and therefore only needs to customize
arch_get_mmap_base_topdown but not arch_get_mmap_base.

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Co-developed-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
Signed-off-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
---
 arch/arm64/include/asm/processor.h |  7 ++++---
 mm/mmap.c                          | 14 +++++++++-----
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index efc10e9041a0..f47528aae321 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -88,9 +88,10 @@
 #define arch_get_mmap_end(addr) ((addr > DEFAULT_MAP_WINDOW) ? TASK_SIZE :\
 				DEFAULT_MAP_WINDOW)
 
-#define arch_get_mmap_base(addr, base) ((addr > DEFAULT_MAP_WINDOW) ? \
-					base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
-					base)
+#define arch_get_mmap_base_topdown(addr) \
+	((addr > DEFAULT_MAP_WINDOW) ? \
+	current->mm->mmap_base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
+	current->mm->mmap_base)
 #endif /* CONFIG_ARM64_FORCE_52BIT */
 
 extern phys_addr_t arm64_dma_phys_limit;
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f287599a7a3..4937b34085cb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2139,11 +2139,15 @@ unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info)
 }
 
 #ifndef arch_get_mmap_end
-#define arch_get_mmap_end(addr)	(TASK_SIZE)
+#define arch_get_mmap_end(addr)			(TASK_SIZE)
 #endif
 
 #ifndef arch_get_mmap_base
-#define arch_get_mmap_base(addr, base) (base)
+#define arch_get_mmap_base(addr)		(current->mm->mmap_base)
+#endif
+
+#ifndef arch_get_mmap_base_topdown
+#define arch_get_mmap_base_topdown(addr)	(current->mm->mmap_base)
 #endif
 
 /* Get an address range which is currently unmapped.
@@ -2184,7 +2188,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 
 	info.flags = 0;
 	info.length = len;
-	info.low_limit = mm->mmap_base;
+	info.low_limit = arch_get_mmap_base(addr);
 	info.high_limit = mmap_end;
 	info.align_mask = 0;
 	info.align_offset = 0;
@@ -2227,7 +2231,7 @@ arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
-	info.high_limit = arch_get_mmap_base(addr, mm->mmap_base);
+	info.high_limit = arch_get_mmap_base_topdown(addr);
 	info.align_mask = 0;
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -2241,7 +2245,7 @@ arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
 	if (offset_in_page(addr)) {
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
-		info.low_limit = TASK_UNMAPPED_BASE;
+		info.low_limit = arch_get_mmap_base(addr);
 		info.high_limit = mmap_end;
 		addr = vm_unmapped_area(&info);
 	}
-- 
2.31.1
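
For readers following the arm64 hunk of this patch, the selection done by
arch_get_mmap_base_topdown can be modelled in userspace as below. This is
a sketch under stated assumptions: the 48-bit and 52-bit limits are
illustrative stand-ins for DEFAULT_MAP_WINDOW and TASK_SIZE (the real
values depend on the kernel configuration), and the mmap base is passed
as a parameter here because there is no current->mm outside the kernel.

```c
#include <stdint.h>

/* Illustrative limits only; not taken from any particular config. */
#define SKETCH_DEFAULT_MAP_WINDOW (UINT64_C(1) << 48)
#define SKETCH_TASK_SIZE          (UINT64_C(1) << 52)

/* Upper bound of the search: only hints above the default window
 * opt in to the larger address space. */
static uint64_t sketch_mmap_end(uint64_t addr)
{
	return addr > SKETCH_DEFAULT_MAP_WINDOW ? SKETCH_TASK_SIZE
						: SKETCH_DEFAULT_MAP_WINDOW;
}

/* Top-down base: shifted up by the size of the extra window when the
 * hint opts in, otherwise the task's ordinary mmap base. */
static uint64_t sketch_mmap_base_topdown(uint64_t addr, uint64_t mmap_base)
{
	if (addr > SKETCH_DEFAULT_MAP_WINDOW)
		return mmap_base + SKETCH_TASK_SIZE - SKETCH_DEFAULT_MAP_WINDOW;
	return mmap_base;
}
```

The bottom-up base needs no such adjustment in this patch, which is why
only the topdown variant is overridden for arm64.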


* [RESEND PATCH v4 2/8] hugetlbfs: Use arch_get_mmap_* macros
  2021-05-18  9:06 ` Amanieu d'Antras
@ 2021-05-18  9:06   ` Amanieu d'Antras
  -1 siblings, 0 replies; 40+ messages in thread
From: Amanieu d'Antras @ 2021-05-18  9:06 UTC
  Cc: Amanieu d'Antras, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, Steven Price, Arnd Bergmann, David Laight,
	Mark Brown, linux-arm-kernel, linux-kernel

hugetlb_get_unmapped_area should obey the same arch-specific constraints
as mmap when selecting an address.

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Co-developed-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
Signed-off-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
---
 fs/hugetlbfs/inode.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 701c82c36138..526ccb524329 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -191,6 +191,18 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
  */
 
 #ifndef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
+#ifndef arch_get_mmap_end
+#define arch_get_mmap_end(addr)			(TASK_SIZE)
+#endif
+
+#ifndef arch_get_mmap_base
+#define arch_get_mmap_base(addr)		(current->mm->mmap_base)
+#endif
+
+#ifndef arch_get_mmap_base_topdown
+#define arch_get_mmap_base_topdown(addr)	(current->mm->mmap_base)
+#endif
+
 static unsigned long
 hugetlb_get_unmapped_area_bottomup(struct file *file, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags)
@@ -200,8 +212,8 @@ hugetlb_get_unmapped_area_bottomup(struct file *file, unsigned long addr,
 
 	info.flags = 0;
 	info.length = len;
-	info.low_limit = current->mm->mmap_base;
-	info.high_limit = TASK_SIZE;
+	info.low_limit = arch_get_mmap_base(addr);
+	info.high_limit = arch_get_mmap_end(addr);
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
@@ -217,7 +229,7 @@ hugetlb_get_unmapped_area_topdown(struct file *file, unsigned long addr,
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
-	info.high_limit = current->mm->mmap_base;
+	info.high_limit = arch_get_mmap_base_topdown(addr);
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -231,8 +243,8 @@ hugetlb_get_unmapped_area_topdown(struct file *file, unsigned long addr,
 	if (unlikely(offset_in_page(addr))) {
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
-		info.low_limit = current->mm->mmap_base;
-		info.high_limit = TASK_SIZE;
+		info.low_limit = arch_get_mmap_base(addr);
+		info.high_limit = arch_get_mmap_end(addr);
 		addr = vm_unmapped_area(&info);
 	}
 
-- 
2.31.1
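
The interaction between the limits changed by this patch and the
align_mask that hugetlbfs already sets can be sketched as follows. It is
a simplified model, not kernel code: it assumes 4 KiB base pages and
2 MiB huge pages, ignores existing mappings, and only shows how the
highest candidate start address below a given high_limit ends up aligned
to a huge page. All names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE  UINT64_C(0x1000)   /* assumed 4 KiB pages */
#define SKETCH_HPAGE_SIZE UINT64_C(0x200000) /* assumed 2 MiB huge pages */

/* Equivalent of PAGE_MASK & ~huge_page_mask(h) for the sizes above:
 * the address bits that must be zero in a huge-page-aligned address. */
static uint64_t sketch_hugetlb_align_mask(void)
{
	uint64_t page_mask = ~(SKETCH_PAGE_SIZE - 1);
	uint64_t huge_mask = ~(SKETCH_HPAGE_SIZE - 1);

	return page_mask & ~huge_mask;
}

/* Highest aligned start for a mapping of `length` bytes that ends at or
 * below high_limit, with an alignment offset of zero. */
static uint64_t sketch_highest_start(uint64_t high_limit, uint64_t length,
				     uint64_t align_mask)
{
	uint64_t start = high_limit - length;

	start -= start & align_mask; /* round down to the alignment */
	return start;
}
```

With a high_limit below 4GB, for example 0xF7345000, a 2 MiB mapping
would start at 0xF7000000 in this model. The required alignment is known
to the kernel but not to a userspace emulator picking addresses by hand.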


* [RESEND PATCH v4 3/8] mm: Support mmap_compat_base with the generic layout
  2021-05-18  9:06 ` Amanieu d'Antras
@ 2021-05-18  9:06   ` Amanieu d'Antras
  -1 siblings, 0 replies; 40+ messages in thread
From: Amanieu d'Antras @ 2021-05-18  9:06 UTC
  Cc: Amanieu d'Antras, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, Steven Price, Arnd Bergmann, David Laight,
	Mark Brown, linux-arm-kernel, linux-kernel

This enables architectures using the generic mmap layout to support
32-bit mmap calls from 64-bit processes and vice versa.

Architectures using this must define separate 32-bit and 64-bit versions
of STACK_TOP, TASK_UNMAPPED_BASE and STACK_RND_MASK.

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Co-developed-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
Signed-off-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
---
 mm/util.c | 46 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 38 insertions(+), 8 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 54870226cea6..37bd764174b5 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -353,12 +353,12 @@ unsigned long arch_randomize_brk(struct mm_struct *mm)
 	return randomize_page(mm->brk, SZ_1G);
 }
 
-unsigned long arch_mmap_rnd(void)
+static unsigned long mmap_rnd(bool compat)
 {
 	unsigned long rnd;
 
 #ifdef CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS
-	if (is_compat_task())
+	if (compat)
 		rnd = get_random_long() & ((1UL << mmap_rnd_compat_bits) - 1);
 	else
 #endif /* CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS */
@@ -367,6 +367,11 @@ unsigned long arch_mmap_rnd(void)
 	return rnd << PAGE_SHIFT;
 }
 
+unsigned long arch_mmap_rnd(void)
+{
+	return mmap_rnd(is_compat_task());
+}
+
 static int mmap_is_legacy(struct rlimit *rlim_stack)
 {
 	if (current->personality & ADDR_COMPAT_LAYOUT)
@@ -383,16 +388,17 @@ static int mmap_is_legacy(struct rlimit *rlim_stack)
  * the face of randomisation.
  */
 #define MIN_GAP		(SZ_128M)
-#define MAX_GAP		(STACK_TOP / 6 * 5)
+#define MAX_GAP		(stack_top / 6 * 5)
 
-static unsigned long mmap_base(unsigned long rnd, struct rlimit *rlim_stack)
+static unsigned long mmap_base(unsigned long rnd, struct rlimit *rlim_stack,
+	unsigned long stack_top, unsigned long stack_rnd_mask)
 {
 	unsigned long gap = rlim_stack->rlim_cur;
 	unsigned long pad = stack_guard_gap;
 
 	/* Account for stack randomization if necessary */
 	if (current->flags & PF_RANDOMIZE)
-		pad += (STACK_RND_MASK << PAGE_SHIFT);
+		pad += (stack_rnd_mask << PAGE_SHIFT);
 
 	/* Values close to RLIM_INFINITY can overflow. */
 	if (gap + pad > gap)
@@ -403,21 +409,45 @@ static unsigned long mmap_base(unsigned long rnd, struct rlimit *rlim_stack)
 	else if (gap > MAX_GAP)
 		gap = MAX_GAP;
 
-	return PAGE_ALIGN(STACK_TOP - gap - rnd);
+	return PAGE_ALIGN(stack_top - gap - rnd);
 }
 
 void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
 {
 	unsigned long random_factor = 0UL;
+#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
+	unsigned long compat_random_factor = 0UL;
+#endif
 
-	if (current->flags & PF_RANDOMIZE)
+	if (current->flags & PF_RANDOMIZE) {
+#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
+		random_factor = mmap_rnd(false);
+		compat_random_factor = mmap_rnd(true);
+#else
 		random_factor = arch_mmap_rnd();
+#endif
+	}
 
 	if (mmap_is_legacy(rlim_stack)) {
+#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
+		mm->mmap_base = TASK_UNMAPPED_BASE_64 + random_factor;
+		mm->mmap_compat_base =
+			TASK_UNMAPPED_BASE_32 + compat_random_factor;
+#else
 		mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
+#endif
 		mm->get_unmapped_area = arch_get_unmapped_area;
 	} else {
-		mm->mmap_base = mmap_base(random_factor, rlim_stack);
+#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
+		mm->mmap_base = mmap_base(random_factor, rlim_stack,
+					  STACK_TOP_64, STACK_RND_MASK_64);
+		mm->mmap_compat_base = mmap_base(compat_random_factor,
+						 rlim_stack, STACK_TOP_32,
+						 STACK_RND_MASK_32);
+#else
+		mm->mmap_base = mmap_base(random_factor, rlim_stack, STACK_TOP,
+					  STACK_RND_MASK);
+#endif
 		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
 	}
 }
-- 
2.31.1


+						 STACK_RND_MASK_32);
+#else
+		mm->mmap_base = mmap_base(random_factor, rlim_stack, STACK_TOP,
+					  STACK_RND_MASK);
+#endif
 		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
 	}
 }
-- 
2.31.1


* [RESEND PATCH v4 4/8] arm64: Separate in_compat_syscall from is_compat_task
  2021-05-18  9:06 ` Amanieu d'Antras
@ 2021-05-18  9:06   ` Amanieu d'Antras
  -1 siblings, 0 replies; 40+ messages in thread
From: Amanieu d'Antras @ 2021-05-18  9:06 UTC (permalink / raw)
  Cc: Amanieu d'Antras, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, Steven Price, Arnd Bergmann, David Laight,
	Mark Brown, linux-arm-kernel, linux-kernel

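Add in_compat_syscall() and thread_in_compat_syscall() to arm64 and
switch the syscall, ptrace and ftrace code over to them in place of
is_compat_task() and is_compat_thread(). For now both pairs return the
same value. The compat helpers also change their return type from int to
bool.

This is preliminary work for allowing 64-bit processes to invoke compat
syscalls.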

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Co-developed-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
Signed-off-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
---
 arch/arm64/include/asm/compat.h  | 24 ++++++++++++++++++++----
 arch/arm64/include/asm/ftrace.h  |  2 +-
 arch/arm64/include/asm/syscall.h |  6 +++---
 arch/arm64/kernel/ptrace.c       |  2 +-
 arch/arm64/kernel/syscall.c      |  2 +-
 5 files changed, 26 insertions(+), 10 deletions(-)
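
To illustrate why the return-value paths need to key off the syscall ABI
rather than the task type, here is a small userspace sketch of the
truncation and sign-extension logic touched below. The model_* names are
hypothetical, and the compat flag is passed in explicitly instead of being
read from thread_info.

```c
#include <stdbool.h>
#include <stdint.h>

/* Keep only the low 32 bits, like the kernel's lower_32_bits(). */
static uint64_t model_lower_32_bits(uint64_t v)
{
	return v & UINT64_C(0xffffffff);
}

/* Sign-extend from bit 31, like sign_extend64(v, 31). */
static int64_t model_sign_extend_31(uint64_t v)
{
	int64_t low = (int64_t)(v & UINT64_C(0xffffffff));

	return (v & UINT64_C(0x80000000)) ? low - (INT64_C(1) << 32) : low;
}

/* Value written to x0 when a syscall returns (cf. invoke_syscall()). */
uint64_t model_store_return(int64_t ret, bool compat_syscall)
{
	uint64_t val = (uint64_t)ret;

	return compat_syscall ? model_lower_32_bits(val) : val;
}

/* Error recovered from x0 by a tracer (cf. syscall_get_error()). */
int64_t model_get_error(uint64_t x0, bool compat_syscall)
{
	int64_t error = compat_syscall ? model_sign_extend_31(x0)
				       : (int64_t)x0;

	/* Like IS_ERR_VALUE(): errors are the values -4095..-1. */
	return (error >= -4095 && error < 0) ? error : 0;
}
```

With the compat flag set, -EINTR (-4) is stored as 0xfffffffc and is only
recognised as an error again if the reader also applies the compat
interpretation. Both sides therefore have to agree on the same predicate.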

diff --git a/arch/arm64/include/asm/compat.h b/arch/arm64/include/asm/compat.h
index 23a9fb73c04f..a2f5001f7793 100644
--- a/arch/arm64/include/asm/compat.h
+++ b/arch/arm64/include/asm/compat.h
@@ -178,21 +178,37 @@ struct compat_shmid64_ds {
 	compat_ulong_t __unused5;
 };
 
-static inline int is_compat_task(void)
+static inline bool is_compat_task(void)
 {
 	return test_thread_flag(TIF_32BIT);
 }
 
-static inline int is_compat_thread(struct thread_info *thread)
+static inline bool is_compat_thread(struct thread_info *thread)
 {
 	return test_ti_thread_flag(thread, TIF_32BIT);
 }
 
+static inline bool in_compat_syscall(void)
+{
+	return is_compat_task();
+}
+#define in_compat_syscall in_compat_syscall	/* override the generic impl */
+
+static inline bool thread_in_compat_syscall(struct thread_info *thread)
+{
+	return is_compat_thread(thread);
+}
+
 #else /* !CONFIG_COMPAT */
 
-static inline int is_compat_thread(struct thread_info *thread)
+static inline bool is_compat_thread(struct thread_info *thread)
+{
+	return false;
+}
+
+static inline bool thread_in_compat_syscall(struct thread_info *thread)
 {
-	return 0;
+	return false;
 }
 
 #endif /* CONFIG_COMPAT */
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index 91fa4baa1a93..f41aad92c67a 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -88,7 +88,7 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec);
 #define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS
 static inline bool arch_trace_is_compat_syscall(struct pt_regs *regs)
 {
-	return is_compat_task();
+	return in_compat_syscall();
 }
 
 #define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index cfc0672013f6..0dfc01ea386c 100644
--- a/arch/arm64/include/asm/syscall.h
+++ b/arch/arm64/include/asm/syscall.h
@@ -35,7 +35,7 @@ static inline long syscall_get_error(struct task_struct *task,
 {
 	unsigned long error = regs->regs[0];
 
-	if (is_compat_thread(task_thread_info(task)))
+	if (thread_in_compat_syscall(task_thread_info(task)))
 		error = sign_extend64(error, 31);
 
 	return IS_ERR_VALUE(error) ? error : 0;
@@ -54,7 +54,7 @@ static inline void syscall_set_return_value(struct task_struct *task,
 	if (error)
 		val = error;
 
-	if (is_compat_thread(task_thread_info(task)))
+	if (thread_in_compat_syscall(task_thread_info(task)))
 		val = lower_32_bits(val);
 
 	regs->regs[0] = val;
@@ -88,7 +88,7 @@ static inline void syscall_set_arguments(struct task_struct *task,
  */
 static inline int syscall_get_arch(struct task_struct *task)
 {
-	if (is_compat_thread(task_thread_info(task)))
+	if (thread_in_compat_syscall(task_thread_info(task)))
 		return AUDIT_ARCH_ARM;
 
 	return AUDIT_ARCH_AARCH64;
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 170f42fd6101..017a82b24f49 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1721,7 +1721,7 @@ const struct user_regset_view *task_user_regset_view(struct task_struct *task)
 	 * 32-bit children use an extended user_aarch32_ptrace_view to allow
 	 * access to the TLS register.
 	 */
-	if (is_compat_task())
+	if (in_compat_syscall())
 		return &user_aarch32_view;
 	else if (is_compat_thread(task_thread_info(task)))
 		return &user_aarch32_ptrace_view;
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index b9cf12b271d7..e0e9d54de0a2 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -51,7 +51,7 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
 		ret = do_ni_syscall(regs, scno);
 	}
 
-	if (is_compat_task())
+	if (in_compat_syscall())
 		ret = lower_32_bits(ret);
 
 	regs->regs[0] = ret;
-- 
2.31.1


* [RESEND PATCH v4 5/8] arm64: mm: Use HAVE_ARCH_COMPAT_MMAP_BASES
  2021-05-18  9:06 ` Amanieu d'Antras
@ 2021-05-18  9:06   ` Amanieu d'Antras
  -1 siblings, 0 replies; 40+ messages in thread
From: Amanieu d'Antras @ 2021-05-18  9:06 UTC (permalink / raw)
  Cc: Amanieu d'Antras, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, Steven Price, Arnd Bergmann, David Laight,
	Mark Brown, linux-arm-kernel, linux-kernel

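This patch switches arm64 to use a separate mmap_base for 32-bit and
64-bit mmaps. This is needed to ensure that compat syscalls invoked by
64-bit tasks perform VM allocations in the low 4G of the address space.

arch_get_mmap_end(), arch_get_mmap_base() and
arch_get_mmap_base_topdown() become out-of-line functions that check
in_compat_syscall().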

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Co-developed-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
Signed-off-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
---
 arch/arm64/Kconfig                 |  1 +
 arch/arm64/include/asm/elf.h       |  8 +++++---
 arch/arm64/include/asm/processor.h | 33 ++++++++++++++++++------------
 arch/arm64/mm/mmap.c               | 33 ++++++++++++++++++++++++++++++
 4 files changed, 59 insertions(+), 16 deletions(-)
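
The selection logic added to arch/arm64/mm/mmap.c below can be modelled in
userspace along these lines. This is a sketch for the
!CONFIG_ARM64_FORCE_52BIT case only; the model_* names and the concrete
sizes (4G compat limit, 48-bit default window, 52-bit task size) are
assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_TASK_SIZE_32          UINT64_C(0x100000000)  /* assumed 4G limit */
#define MODEL_DEFAULT_MAP_WINDOW_64 (UINT64_C(1) << 48)    /* assumed 48-bit window */
#define MODEL_TASK_SIZE_64          (UINT64_C(1) << 52)    /* assumed 52-bit VA */

/* Upper bound for the search, cf. arch_get_mmap_end(). */
uint64_t model_get_mmap_end(uint64_t hint, bool in_compat_syscall)
{
	if (in_compat_syscall)
		return MODEL_TASK_SIZE_32;
	/* A hint above the default window opts in to the larger range. */
	if (hint > MODEL_DEFAULT_MAP_WINDOW_64)
		return MODEL_TASK_SIZE_64;
	return MODEL_DEFAULT_MAP_WINDOW_64;
}

/* Top-down search base, cf. arch_get_mmap_base_topdown(). */
uint64_t model_get_mmap_base_topdown(uint64_t hint, bool in_compat_syscall,
				     uint64_t mmap_base,
				     uint64_t mmap_compat_base)
{
	if (in_compat_syscall)
		return mmap_compat_base;
	if (hint > MODEL_DEFAULT_MAP_WINDOW_64)
		return mmap_base + MODEL_TASK_SIZE_64 -
		       MODEL_DEFAULT_MAP_WINDOW_64;
	return mmap_base;
}
```

The compat check comes first in both helpers, so a compat syscall is
confined to the low range even if its address hint is above the default
window.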

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index dfdc3e0af5e1..d57b7bcbd758 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -133,6 +133,7 @@ config ARM64
 	select HAVE_ALIGNED_STRUCT_PAGE if SLUB
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_ARCH_BITREVERSE
+	select HAVE_ARCH_COMPAT_MMAP_BASES if COMPAT
 	select HAVE_ARCH_COMPILER_H
 	select HAVE_ARCH_HUGE_VMAP
 	select HAVE_ARCH_JUMP_LABEL
diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index 8d1c8dcb87fd..e21964898d06 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -187,12 +187,14 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm,
 				       int uses_interp);
 
 /* 1GB of VA */
+#define STACK_RND_MASK_64		(0x3ffff >> (PAGE_SHIFT - 12))
 #ifdef CONFIG_COMPAT
+#define STACK_RND_MASK_32		(0x7ff >> (PAGE_SHIFT - 12))
 #define STACK_RND_MASK			(test_thread_flag(TIF_32BIT) ? \
-						0x7ff >> (PAGE_SHIFT - 12) : \
-						0x3ffff >> (PAGE_SHIFT - 12))
+						STACK_RND_MASK_32 : \
+						STACK_RND_MASK_64)
 #else
-#define STACK_RND_MASK			(0x3ffff >> (PAGE_SHIFT - 12))
+#define STACK_RND_MASK			STACK_RND_MASK_64
 #endif
 
 #ifdef __AARCH64EB__
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index f47528aae321..f8309f8d0ece 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -70,29 +70,36 @@
 
 #ifdef CONFIG_ARM64_FORCE_52BIT
 #define STACK_TOP_MAX		TASK_SIZE_64
-#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(TASK_SIZE / 4))
+#define TASK_UNMAPPED_BASE_64	(PAGE_ALIGN(TASK_SIZE_64 / 4))
 #else
 #define STACK_TOP_MAX		DEFAULT_MAP_WINDOW_64
-#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(DEFAULT_MAP_WINDOW / 4))
+#define TASK_UNMAPPED_BASE_64	(PAGE_ALIGN(DEFAULT_MAP_WINDOW_64 / 4))
 #endif /* CONFIG_ARM64_FORCE_52BIT */
 
+#ifdef CONFIG_COMPAT
+#define TASK_UNMAPPED_BASE_32	(PAGE_ALIGN(TASK_SIZE_32 / 4))
+#define TASK_UNMAPPED_BASE	(test_thread_flag(TIF_32BIT) ? \
+				TASK_UNMAPPED_BASE_32 : TASK_UNMAPPED_BASE_64)
+#else
+#define TASK_UNMAPPED_BASE	TASK_UNMAPPED_BASE_64
+#endif /* CONFIG_COMPAT */
+
+#define STACK_TOP_64		STACK_TOP_MAX
 #ifdef CONFIG_COMPAT
 #define AARCH32_VECTORS_BASE	0xffff0000
+#define STACK_TOP_32		AARCH32_VECTORS_BASE
 #define STACK_TOP		(test_thread_flag(TIF_32BIT) ? \
-				AARCH32_VECTORS_BASE : STACK_TOP_MAX)
+				STACK_TOP_32 : STACK_TOP_64)
 #else
-#define STACK_TOP		STACK_TOP_MAX
+#define STACK_TOP		STACK_TOP_64
 #endif /* CONFIG_COMPAT */
 
-#ifndef CONFIG_ARM64_FORCE_52BIT
-#define arch_get_mmap_end(addr) ((addr > DEFAULT_MAP_WINDOW) ? TASK_SIZE :\
-				DEFAULT_MAP_WINDOW)
-
-#define arch_get_mmap_base_topdown(addr) \
-	((addr > DEFAULT_MAP_WINDOW) ? \
-	current->mm->mmap_base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
-	current->mm->mmap_base)
-#endif /* CONFIG_ARM64_FORCE_52BIT */
+#define arch_get_mmap_end arch_get_mmap_end
+#define arch_get_mmap_base arch_get_mmap_base
+#define arch_get_mmap_base_topdown arch_get_mmap_base_topdown
+unsigned long arch_get_mmap_end(unsigned long addr);
+unsigned long arch_get_mmap_base(unsigned long addr);
+unsigned long arch_get_mmap_base_topdown(unsigned long addr);
 
 extern phys_addr_t arm64_dma_phys_limit;
 #define ARCH_LOW_ADDRESS_LIMIT	(arm64_dma_phys_limit - 1)
diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
index a38f54cd638c..956cab3ade11 100644
--- a/arch/arm64/mm/mmap.c
+++ b/arch/arm64/mm/mmap.c
@@ -38,3 +38,36 @@ int valid_mmap_phys_addr_range(unsigned long pfn, size_t size)
 {
 	return !(((pfn << PAGE_SHIFT) + size) & ~PHYS_MASK);
 }
+
+unsigned long arch_get_mmap_end(unsigned long addr)
+{
+#ifdef CONFIG_COMPAT
+	if (in_compat_syscall())
+		return TASK_SIZE_32;
+#endif /* CONFIG_COMPAT */
+#ifndef CONFIG_ARM64_FORCE_52BIT
+	if (addr > DEFAULT_MAP_WINDOW_64)
+		return TASK_SIZE_64;
+#endif /* CONFIG_ARM64_FORCE_52BIT */
+	return DEFAULT_MAP_WINDOW_64;
+}
+unsigned long arch_get_mmap_base(unsigned long addr)
+{
+#ifdef CONFIG_COMPAT
+	if (in_compat_syscall())
+		return current->mm->mmap_compat_base;
+#endif /* CONFIG_COMPAT */
+	return current->mm->mmap_base;
+}
+unsigned long arch_get_mmap_base_topdown(unsigned long addr)
+{
+#ifdef CONFIG_COMPAT
+	if (in_compat_syscall())
+		return current->mm->mmap_compat_base;
+#endif /* CONFIG_COMPAT */
+#ifndef CONFIG_ARM64_FORCE_52BIT
+	if (addr > DEFAULT_MAP_WINDOW_64)
+		return current->mm->mmap_base + TASK_SIZE - DEFAULT_MAP_WINDOW_64;
+#endif /* CONFIG_ARM64_FORCE_52BIT */
+	return current->mm->mmap_base;
+}
-- 
2.31.1


* [RESEND PATCH v4 6/8] arm64: Add a compat syscall flag to thread_info
  2021-05-18  9:06 ` Amanieu d'Antras
@ 2021-05-18  9:06   ` Amanieu d'Antras
  -1 siblings, 0 replies; 40+ messages in thread
From: Amanieu d'Antras @ 2021-05-18  9:06 UTC (permalink / raw)
  Cc: Amanieu d'Antras, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, Steven Price, Arnd Bergmann, David Laight,
	Mark Brown, linux-arm-kernel, linux-kernel

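Add a use_compat_syscall flag to thread_info. in_compat_syscall() and
thread_in_compat_syscall() now read this flag instead of TIF_32BIT, so
that they can later report compat syscalls coming from 64-bit tasks.

The flag is set by COMPAT_SET_PERSONALITY and cleared by
SET_PERSONALITY, so for now it tracks TIF_32BIT.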

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Co-developed-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
Signed-off-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
---
 arch/arm64/include/asm/compat.h      |  4 ++--
 arch/arm64/include/asm/elf.h         | 13 ++++++++++++-
 arch/arm64/include/asm/thread_info.h |  6 ++++++
 3 files changed, 20 insertions(+), 3 deletions(-)
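
The intended relationship between the task type and the new flag can be
sketched in userspace as below. The struct and the model_* helpers are
hypothetical stand-ins for thread_info, SET_PERSONALITY and
COMPAT_SET_PERSONALITY, kept only as detailed as needed to show the state
transitions.

```c
#include <stdbool.h>

/* Minimal stand-in for the two pieces of per-thread state involved. */
struct model_thread_info {
	bool tif_32bit;          /* models TIF_32BIT: the task is AArch32 */
	bool use_compat_syscall; /* models the new thread_info field */
};

/* Models SET_PERSONALITY(): exec of a 64-bit binary clears both. */
void model_set_personality(struct model_thread_info *ti)
{
	ti->tif_32bit = false;
	ti->use_compat_syscall = false;
}

/* Models COMPAT_SET_PERSONALITY(): exec of a 32-bit binary sets both. */
void model_compat_set_personality(struct model_thread_info *ti)
{
	ti->tif_32bit = true;
	ti->use_compat_syscall = true;
}

/* Task type: unchanged by this patch. */
bool model_is_compat_task(const struct model_thread_info *ti)
{
	return ti->tif_32bit;
}

/* Syscall ABI: now read from the separate flag. */
bool model_in_compat_syscall(const struct model_thread_info *ti)
{
	return ti->use_compat_syscall;
}
```

Keeping the two as separate state is what would let a later change set
the flag for a 64-bit task for the duration of one syscall without
touching the task type.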

diff --git a/arch/arm64/include/asm/compat.h b/arch/arm64/include/asm/compat.h
index a2f5001f7793..124f4487dfee 100644
--- a/arch/arm64/include/asm/compat.h
+++ b/arch/arm64/include/asm/compat.h
@@ -190,13 +190,13 @@ static inline bool is_compat_thread(struct thread_info *thread)
 
 static inline bool in_compat_syscall(void)
 {
-	return is_compat_task();
+	return current_thread_info()->use_compat_syscall;
 }
 #define in_compat_syscall in_compat_syscall	/* override the generic impl */
 
 static inline bool thread_in_compat_syscall(struct thread_info *thread)
 {
-	return is_compat_thread(thread);
+	return thread->use_compat_syscall;
 }
 
 #else /* !CONFIG_COMPAT */
diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index e21964898d06..49a9a9db612c 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -158,10 +158,20 @@ typedef struct user_fpsimd_state elf_fpregset_t;
  */
 #define ELF_PLAT_INIT(_r, load_addr)	(_r)->regs[0] = 0
 
+#ifdef CONFIG_COMPAT
+#define CLEAR_COMPAT_SYSCALL()						\
+({									\
+	current_thread_info()->use_compat_syscall = false;		\
+})
+#else
+#define CLEAR_COMPAT_SYSCALL()	((void)0)
+#endif
+
 #define SET_PERSONALITY(ex)						\
 ({									\
 	clear_thread_flag(TIF_32BIT);					\
 	current->personality &= ~READ_IMPLIES_EXEC;			\
+	CLEAR_COMPAT_SYSCALL();						\
 })
 
 /* update AT_VECTOR_SIZE_ARCH if the number of NEW_AUX_ENT entries changes */
@@ -228,7 +238,8 @@ typedef compat_elf_greg_t		compat_elf_gregset_t[COMPAT_ELF_NGREG];
 #define COMPAT_SET_PERSONALITY(ex)					\
 ({									\
 	set_thread_flag(TIF_32BIT);					\
- })
+	current_thread_info()->use_compat_syscall = true;		\
+})
 #ifdef CONFIG_COMPAT_VDSO
 #define COMPAT_ARCH_DLINFO						\
 do {									\
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 6623c99f0984..02310b45900d 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -42,6 +42,12 @@ struct thread_info {
 	void			*scs_base;
 	void			*scs_sp;
 #endif
+#ifdef CONFIG_COMPAT
+	/*
+	 * compat task or inside a compat syscall from a 64-bit task
+	 */
+	bool			use_compat_syscall;
+#endif
 };
 
 #define thread_saved_pc(tsk)	\
-- 
2.31.1


* [RESEND PATCH v4 7/8] arm64: Forbid calling compat sigreturn from 64-bit tasks
  2021-05-18  9:06 ` Amanieu d'Antras
@ 2021-05-18  9:06   ` Amanieu d'Antras
  -1 siblings, 0 replies; 40+ messages in thread
From: Amanieu d'Antras @ 2021-05-18  9:06 UTC (permalink / raw)
  Cc: Amanieu d'Antras, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, Steven Price, Arnd Bergmann, David Laight,
	Mark Brown, linux-arm-kernel, linux-kernel

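The compat sigreturn and rt_sigreturn syscalls restore an AArch32 signal
frame, which a 64-bit task never has, so it's impossible for them to do
anything sensible in this context. Reject them via the existing badframe
path.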

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Co-developed-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
Signed-off-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
---
 arch/arm64/kernel/signal32.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 2f507f565c48..e2bdd1eaefd8 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -237,6 +237,10 @@ COMPAT_SYSCALL_DEFINE0(sigreturn)
 	/* Always make any pending restarted system calls return -EINTR */
 	current->restart_block.fn = do_no_restart_syscall;
 
+	/* Reject attempts to call this from a 64-bit process. */
+	if (!is_compat_task())
+		goto badframe;
+
 	/*
 	 * Since we stacked the signal on a 64-bit boundary,
 	 * then 'sp' should be word aligned here.  If it's
@@ -268,6 +272,10 @@ COMPAT_SYSCALL_DEFINE0(rt_sigreturn)
 	/* Always make any pending restarted system calls return -EINTR */
 	current->restart_block.fn = do_no_restart_syscall;
 
+	/* Reject attempts to call this from a 64-bit process. */
+	if (!is_compat_task())
+		goto badframe;
+
 	/*
 	 * Since we stacked the signal on a 64-bit boundary,
 	 * then 'sp' should be word aligned here.  If it's
-- 
2.31.1


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
  2021-05-18  9:06 ` Amanieu d'Antras
@ 2021-05-18  9:06   ` Amanieu d'Antras
  -1 siblings, 0 replies; 40+ messages in thread
From: Amanieu d'Antras @ 2021-05-18  9:06 UTC (permalink / raw)
  Cc: Amanieu d'Antras, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, Steven Price, Arnd Bergmann, David Laight,
	Mark Brown, linux-arm-kernel, linux-kernel

Setting bit 31 in x8 when performing a syscall will do the following:
- The remainder of x8 is treated as a compat syscall number and is used
  to index the compat syscall table.
- in_compat_syscall will return true for the duration of the syscall.
- VM allocations performed by the syscall will be located in the lower
  4G of the address space.
- Interrupted syscalls are properly restarted as compat syscalls.
- Seccomp will treat the syscall as having AUDIT_ARCH_ARM instead of
  AUDIT_ARCH_AARCH64. This affects the arch value seen by seccomp
  filters and reported by SIGSYS.
- PTRACE_GET_SYSCALL_INFO also treats the syscall as having
  AUDIT_ARCH_ARM. Recent versions of strace will correctly report the
  system call name and parameters when an AArch64 task mixes 32-bit and
  64-bit syscalls.

Previously, setting bit 31 of the syscall number would always cause the
syscall to return ENOSYS. This allows user programs to reliably detect
kernel support for compat syscalls by trying a simple syscall such as
getpid.
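
A minimal sketch of such a probe from userspace, assuming the interface
described in this series. The names COMPAT_SYSCALL_BIT, COMPAT_NR_GETPID,
compat_nr() and have_compat_syscalls() are made up for illustration; 20 is
the AArch32 EABI number for getpid. On anything other than AArch64 the
probe reports no support without issuing the syscall, since bit 31 has no
such meaning there.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative names, not part of any existing header. */
#define COMPAT_SYSCALL_BIT 0x80000000UL /* same value as __ARM64_COMPAT_SYSCALL_BIT */
#define COMPAT_NR_GETPID   20UL         /* getpid in the AArch32 EABI table */

/* Build the value to place in x8 for a compat syscall. */
static unsigned long compat_nr(unsigned long nr)
{
	return COMPAT_SYSCALL_BIT | nr;
}

/* Returns true if the running kernel accepts compat syscalls from this task. */
static bool have_compat_syscalls(void)
{
#if defined(__aarch64__)
	/* An unpatched kernel fails this with ENOSYS, so syscall() returns -1. */
	long ret = syscall((long)compat_nr(COMPAT_NR_GETPID));

	return ret == (long)getpid();
#else
	return false;
#endif
}
```

An emulator could call have_compat_syscalls() once at startup and fall
back to translating syscalls in userspace when it returns false.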

The AArch32-private compat syscalls (__ARM_NR_compat_*) are not exposed
through this interface. These syscalls do not make sense in the context
of an AArch64 task.

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Co-developed-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
Signed-off-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
---
 arch/arm64/include/uapi/asm/unistd.h |  2 ++
 arch/arm64/kernel/signal.c           |  5 +++++
 arch/arm64/kernel/syscall.c          | 21 ++++++++++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/uapi/asm/unistd.h b/arch/arm64/include/uapi/asm/unistd.h
index f83a70e07df8..5574bc6ab0a3 100644
--- a/arch/arm64/include/uapi/asm/unistd.h
+++ b/arch/arm64/include/uapi/asm/unistd.h
@@ -15,6 +15,8 @@
  * along with this program.  If not, see <http://www.gnu.org/licenses/>.
  */
 
+#define __ARM64_COMPAT_SYSCALL_BIT 0x80000000
+
 #define __ARCH_WANT_RENAMEAT
 #define __ARCH_WANT_NEW_STAT
 #define __ARCH_WANT_SET_GET_RLIMIT
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 6237486ff6bb..463c8a82050e 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -795,6 +795,11 @@ static void setup_restart_syscall(struct pt_regs *regs)
 {
 	if (is_compat_task())
 		compat_setup_restart_syscall(regs);
+#ifdef CONFIG_COMPAT
+	else if (in_compat_syscall())
+		regs->regs[8] = __ARM64_COMPAT_SYSCALL_BIT |
+				__NR_compat_restart_syscall;
+#endif
 	else
 		regs->regs[8] = __NR_restart_syscall;
 }
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index e0e9d54de0a2..83747cf4b5b7 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -118,6 +118,11 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
 		 * user-issued syscall(-1). However, requesting a skip and not
 		 * setting the return value is unlikely to do anything sensible
 		 * anyway.
+		 *
+		 * This edge case goes away with CONFIG_COMPAT since a
+		 * user-issued syscall(-1) is interpreted as a
+		 * compat_syscall(0x7fffffff) which still ends up returning
+		 * -ENOSYS in x0.
 		 */
 		if (scno == NO_SYSCALL)
 			regs->regs[0] = -ENOSYS;
@@ -165,7 +170,21 @@ static inline void sve_user_discard(void)
 void do_el0_svc(struct pt_regs *regs)
 {
 	sve_user_discard();
-	el0_svc_common(regs, regs->regs[8], __NR_syscalls, sys_call_table);
+
+#ifdef CONFIG_COMPAT
+	/*
+	 * Setting bit 31 of x8 allows a 64-bit process to perform compat
+	 * syscalls.
+	 */
+	if (regs->regs[8] & __ARM64_COMPAT_SYSCALL_BIT) {
+		current_thread_info()->use_compat_syscall = true;
+		el0_svc_common(regs,
+			       regs->regs[8] & ~__ARM64_COMPAT_SYSCALL_BIT,
+			       __NR_compat_syscalls, compat_sys_call_table);
+		current_thread_info()->use_compat_syscall = false;
+	} else
+#endif
+		el0_svc_common(regs, regs->regs[8], __NR_syscalls, sys_call_table);
 }
 
 #ifdef CONFIG_COMPAT
-- 
2.31.1
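
The way do_el0_svc() splits x8 in the patch above can be modelled in
plain userspace C. This is only a sketch of the arithmetic, not kernel
code: struct decoded_scno and decode_x8() are hypothetical names, and
COMPAT_SYSCALL_BIT simply mirrors the value of
__ARM64_COMPAT_SYSCALL_BIT.

```c
#include <stdbool.h>
#include <stdint.h>

#define COMPAT_SYSCALL_BIT 0x80000000u /* value of __ARM64_COMPAT_SYSCALL_BIT */

/* Hypothetical type for illustration: which table, and which index. */
struct decoded_scno {
	bool compat; /* true: the compat syscall table would be indexed */
	int scno;    /* syscall number as el0_svc_common() would receive it */
};

static struct decoded_scno decode_x8(uint64_t x8)
{
	struct decoded_scno d;

	d.compat = (x8 & COMPAT_SYSCALL_BIT) != 0;
	if (d.compat)
		/* The mask is a 32-bit value, so bits 63:32 are dropped too. */
		d.scno = (int)(x8 & ~COMPAT_SYSCALL_BIT);
	else
		/* Native path: truncated to int, as in the existing code. */
		d.scno = (int)x8;
	return d;
}
```

With this model, decode_x8() of an all-ones x8 (a user-issued
syscall(-1)) yields a compat syscall number of 0x7fffffff, which is the
case described in the comment added to el0_svc_common().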


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
  2021-05-18  9:06   ` Amanieu d'Antras
@ 2021-05-18 13:02     ` Arnd Bergmann
  -1 siblings, 0 replies; 40+ messages in thread
From: Arnd Bergmann @ 2021-05-18 13:02 UTC (permalink / raw)
  To: Amanieu d'Antras
  Cc: Ryan Houdek, Catalin Marinas, Will Deacon, Mark Rutland,
	Steven Price, David Laight, Mark Brown, Linux ARM,
	Linux Kernel Mailing List

On Tue, May 18, 2021 at 11:06 AM Amanieu d'Antras <amanieu@gmail.com> wrote:
>
> Setting bit 31 in x8 when performing a syscall will do the following:
> - The remainder of x8 is treated as a compat syscall number and is used
>   to index the compat syscall table.
> - in_compat_syscall will return true for the duration of the syscall.
> - VM allocations performed by the syscall will be located in the lower
>   4G of the address space.
> - Interrupted syscalls are properly restarted as compat syscalls.
> - Seccomp will treat the syscall as having AUDIT_ARCH_ARM instead of
>   AUDIT_ARCH_AARCH64. This affects the arch value seen by seccomp
>   filters and reported by SIGSYS.
> - PTRACE_GET_SYSCALL_INFO also treats the syscall as having
>   AUDIT_ARCH_ARM. Recent versions of strace will correctly report the
>   system call name and parameters when an AArch64 task mixes 32-bit and
>   64-bit syscalls.
>
> Previously, setting bit 31 of the syscall number would always cause the
> syscall to return ENOSYS. This allows user programs to reliably detect
> kernel support for compat syscalls by trying a simple syscall such as
> getpid.
>
> The AArch32-private compat syscalls (__ARM_NR_compat_*) are not exposed
> through this interface. These syscalls do not make sense in the context
> of an AArch64 task.
>
> Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
> Co-developed-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
> Signed-off-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>

I'm still undecided about this approach. It is an easy way to expose the 32-bit
ABIs, it mostly copies what x86-64 already does with 32-bit syscalls and
it doesn't expose a lot of attack surface that isn't already exposed to normal
32-bit tasks running compat mode.

On the other hand, exposing the entire aarch32 syscall set seems both
too broad and not broad enough: half of the system calls behave the
exact same way in native and compat mode, so they wouldn't need to
be exposed like this, and a lot of others are trivially emulated in user
space by calling the native versions. The syscalls that are actually hard to do
such as ioctl() or the signal handling will work for aarch32 emulation, but
they are still insufficient to correctly emulate other 32-bit architectures
that have a slightly different ABI. This means the interface is a fairly good
fit for Tango, but much less so for FEX.

It's also worth pointing out that this approach has a few things in common
with Yury's ilp32 tree at https://github.com/norov/linux/tree/ilp32-5.2
Unlike the x86 x32 mode, however, that port does not allow calling compat
syscalls from normal 64-bit tasks but rather keys the syscall entry point
off the executable format, which wouldn't work here. It also uses the
asm-generic system call numbers instead of the arm32 syscall numbers.

I assume you have already considered or tried the alternative approach of
only adding a minimal set of syscalls that are needed for the emulation.
Having a way to limit the address space for mmap() and similar
system calls sounds like a generally useful addition, and having an
extended variant of ioctl() that lets you pick the target ABI (arm32, x86-32,
...) on supported drivers would probably be better for FEX. Can you
explain the tradeoffs that led you towards duplicating the syscall
entry points instead?
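
For comparison, without kernel support a translator can only
approximate an address-limited mmap() from userspace. The following is
a rough sketch of one such approximation, not something taken from
Tango or FEX: mmap_below_4g(), LIMIT_4G and HINT_STEP are hypothetical
names, and the 256 MiB stride is arbitrary. It relies only on the
standard behaviour that the address argument of mmap() is a hint, so the
result has to be checked.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define LIMIT_4G  0x100000000ULL
#define HINT_STEP 0x10000000ULL /* arbitrary 256 MiB stride between hints */

/* Hypothetical helper: anonymous private mapping that ends below 4 GiB. */
static void *mmap_below_4g(size_t len, int prot)
{
	unsigned long long hint;

	if (len == 0 || len > LIMIT_4G - HINT_STEP) {
		errno = EINVAL;
		return MAP_FAILED;
	}

	for (hint = HINT_STEP; hint + len <= LIMIT_4G; hint += HINT_STEP) {
		void *p = mmap((void *)(uintptr_t)hint, len, prot,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return MAP_FAILED;
		/* The hint is only advisory, so check where the mapping landed. */
		if ((unsigned long long)(uintptr_t)p + len <= LIMIT_4G)
			return p;
		munmap(p, len);
	}

	errno = ENOMEM;
	return MAP_FAILED;
}
```

A probing loop like this is neither randomized nor efficient, which is
part of why a kernel-side limit looks attractive.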

         Arnd

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
  2021-05-18 13:02     ` Arnd Bergmann
@ 2021-05-18 20:26       ` David Laight
  -1 siblings, 0 replies; 40+ messages in thread
From: David Laight @ 2021-05-18 20:26 UTC (permalink / raw)
  To: 'Arnd Bergmann', Amanieu d'Antras
  Cc: Ryan Houdek, Catalin Marinas, Will Deacon, Mark Rutland,
	Steven Price, Mark Brown, Linux ARM, Linux Kernel Mailing List

From: Arnd Bergmann
> Sent: 18 May 2021 14:02
...
> 
> I'm still undecided about this approach. It is an easy way to expose the 32-bit
> ABIs, it mostly copies what x86-64 already does with 32-bit syscalls and
> it doesn't expose a lot of attack surface that isn't already exposed to normal
> 32-bit tasks running compat mode.
> 
> On the other hand, exposing the entire aarch32 syscall set seems both
> too broad and not broad enough: Half of the system calls behave the
> exact same way in native and compat mode, so they wouldn't need to
> be exposed like this, a lot of others are trivially emulated in user space
> by calling the native versions. The syscalls that are actually hard to do
> such as ioctl() or the signal handling will work for aarch32 emulation, but
> they are still insufficient to correctly emulate other 32-bit architectures
> that have a slightly different ABI. This means the interface is a fairly good
> fit for Tango, but much less so for FEX.

To my mind, because the kernel already contains the emulation code,
there isn't much point trying to replicate it in userspace.

OTOH I think they are trying to emulate x86 system calls, not arm ones,
so the structure layouts don't always match.
However, the 32-bit arm layout is probably a lot nearer than the 64-bit arm one.

Whether including some of the 'x32' code in an arm kernel will
help is another matter - it might be a useful source of differences.

Am I also right in thinking that this isn't actually needed as part
of a 'generic' ARM kernel? Just ones for some specific platforms?

	David

(Oh - I'm not involved in the project and will probably never use it.)

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
  2021-05-18 20:26       ` David Laight
@ 2021-05-18 22:41         ` Ryan Houdek
  -1 siblings, 0 replies; 40+ messages in thread
From: Ryan Houdek @ 2021-05-18 22:41 UTC (permalink / raw)
  To: David Laight
  Cc: Arnd Bergmann, Amanieu d'Antras, Catalin Marinas,
	Will Deacon, Mark Rutland, Steven Price, Mark Brown, Linux ARM,
	Linux Kernel Mailing List

On Tue, May 18, 2021 at 1:26 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Arnd Bergmann
> > Sent: 18 May 2021 14:02
> ...
> >
> > I'm still undecided about this approach. It is an easy way to expose the 32-bit
> > ABIs, it mostly copies what x86-64 already does with 32-bit syscalls and
> > it doesn't expose a lot of attack surface that isn't already exposed to normal
> > 32-bit tasks running compat mode.
> >
> > On the other hand, exposing the entire aarch32 syscall set seems both
> > too broad and not broad enough: Half of the system calls behave the
> > exact same way in native and compat mode, so they wouldn't need to
> > be exposed like this, a lot of others are trivially emulated in user space
> > by calling the native versions. The syscalls that are actually hard to do
> > such as ioctl() or the signal handling will work for aarch32 emulation, but
> > they are still insufficient to correctly emulate other 32-bit architectures
> > that have a slightly different ABI. This means the interface is a fairly good
> > fit for Tango, but much less so for FEX.
>
> To my mind because the kernel already contains the emulation code
> there isn't much point trying to replicate it in userspace.
>
> OTOH I think they are trying to emulate x86 system calls not arm ones.
> So the structure layouts don't always match.
> However it is probably a lot nearer than the 64bit arm.

Take care not to conflate the Tango and FEX project needs here.
Tango is doing aarch32->aarch64 translation, so they are translating
aarch32 syscalls.
FEX is doing {x86, x86-64}->aarch64 translation.
The simplicity of the interface helps Tango more than FEX in this regard,
since FEX likely still needs userspace fixups to *some* structures.

>
> Whether including some of the 'x32' code in an arm kernel will
> help is another matter - it might be a useful source of differences.
>
> Am I also right in thinking that this isn't actually needed as part
> of a 'generic' ARM kernel? Just ones for some specific platforms?

This isn't correct from FEX's viewpoint.
FEX isn't a product that will be shipping on any specific platform; the user
is expected to just install FEX on their ARMv8.0+ device of choice.
Once FEX is installed, they are free to install and run *any* x86/x86-64
software.
The primary target is running full-fledged games from the user's Steam library,
but it can be anything the user desires.

For context we have users running FEX on Lenovo c630, Apple M1, HardKernel SBCs,
and recently some randomly picked Android devices.

I can't speak on Tango's behalf here since I don't know their user ecosystem.
>
>         David
>
> (Oh - I'm not involved in the project and will probably never use it.)
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
  2021-05-18 13:02     ` Arnd Bergmann
@ 2021-05-18 23:51       ` Amanieu d'Antras
  -1 siblings, 0 replies; 40+ messages in thread
From: Amanieu d'Antras @ 2021-05-18 23:51 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Ryan Houdek, Catalin Marinas, Will Deacon, Mark Rutland,
	Steven Price, David Laight, Mark Brown, Linux ARM,
	Linux Kernel Mailing List

On Tue, May 18, 2021 at 2:03 PM Arnd Bergmann <arnd@kernel.org> wrote:
> I'm still undecided about this approach. It is an easy way to expose the 32-bit
> ABIs, it mostly copies what x86-64 already does with 32-bit syscalls and
> it doesn't expose a lot of attack surface that isn't already exposed to normal
> 32-bit tasks running compat mode.
>
> On the other hand, exposing the entire aarch32 syscall set seems both
> too broad and not broad enough: Half of the system calls behave the
> exact same way in native and compat mode, so they wouldn't need to
> be exposed like this, a lot of others are trivially emulated in user space
> by calling the native versions. The syscalls that are actually hard to do
> such as ioctl() or the signal handling will work for aarch32 emulation, but
> they are still insufficient to correctly emulate other 32-bit architectures
> that have a slightly different ABI. This means the interface is a fairly good
> fit for Tango, but much less so for FEX.
>
> It's also worth pointing out that this approach has a few things in common
> with Yury's ilp32 tree at https://github.com/norov/linux/tree/ilp32-5.2
> Unlike the x86 x32 mode, however, that port does not allow calling compat
> syscalls from normal 64-bit tasks but rather keys the syscall entry point
> off the executable format, which wouldn't work here. It also uses the
> asm-generic system call numbers instead of the arm32 syscall numbers.
>
> I assume you have already considered or tried the alternative approach of
> only adding a minimal set of syscalls that are needed for the emulation.
> Having a way to limit the address space for mmap() and similar
> system calls sounds like a generally useful addition, and having an
> extended variant of ioctl() that lets you pick the target ABI (arm32, x86-32,
> ...) on supported drivers would probably be better for FEX. Can you
> explain the tradeoffs that led you towards duplicating the syscall
> entry points instead?

Tango needs the entire compat ABI to be exposed to support seccomp for
translated AArch32 processes. Here's how this works:

1. When a translated process installs a seccomp filter, Tango injects
a prefix into the seccomp program which effectively does:
    if (arch == AUDIT_ARCH_AARCH64) {
        // 64-bit syscalls used by Tango for internal operations
        if (syscall_in_tango_whitelist(nr))
            return SECCOMP_RET_ALLOW;
    }
    // continue to user-supplied seccomp program

2. When Tango performs a 32-bit syscall on behalf of the translated
process, the seccomp filter will see a syscall with AUDIT_ARCH_ARM and
the compat syscall number. This allows the user-supplied seccomp
filter to behave exactly as if it was running in a native AArch32
process.
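
For illustration, the prefix in step 1 could be written as a classic BPF
fragment along the following lines. This is only a sketch under stated
assumptions, not Tango's actual code: build_prefix() and its parameters are
hypothetical names, and the allowlist contents are whatever native syscalls
the emulator needs for itself. The fragment ends by falling through, so the
user-supplied program can be appended directly after it.

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: write a classic-BPF prefix into out[] that allows
 * the native AArch64 syscalls listed in allow[] and otherwise falls
 * through to whatever filter is appended after it.
 * Returns the number of instructions written, or 0 if it cannot be built.
 */
static size_t build_prefix(struct sock_filter *out, size_t cap,
                           const unsigned int *allow, size_t n)
{
    size_t i, len = 0;

    /* Jump offsets are 8 bits wide, which bounds the allowlist size. */
    if (n == 0 || n > 253 || cap < n + 4)
        return 0;

    /* A = seccomp_data.arch */
    out[len++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        offsetof(struct seccomp_data, arch));
    /* Not a native AArch64 syscall: skip past the whole prefix. */
    out[len++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                        AUDIT_ARCH_AARCH64, 0, n + 2);
    /* A = seccomp_data.nr */
    out[len++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        offsetof(struct seccomp_data, nr));
    for (i = 0; i < n; i++) {
        int last = (i == n - 1);

        /*
         * On a match, jump forward to the RET below. If the last entry
         * does not match either, skip the RET and fall through.
         */
        out[len++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                        allow[i], last ? 0 : n - 1 - i, last ? 1 : 0);
    }
    out[len++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                        SECCOMP_RET_ALLOW);
    return len;
}
```

Because classic BPF jumps are relative, the user-supplied instructions can
be copied after the returned length without rewriting their offsets.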

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
  2021-05-18 13:02     ` Arnd Bergmann
@ 2021-05-18 23:52       ` Ryan Houdek
  -1 siblings, 0 replies; 40+ messages in thread
From: Ryan Houdek @ 2021-05-18 23:52 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Amanieu d'Antras, Catalin Marinas, Will Deacon, Mark Rutland,
	Steven Price, David Laight, Mark Brown, Linux ARM,
	Linux Kernel Mailing List

On Tue, May 18, 2021 at 6:03 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> On Tue, May 18, 2021 at 11:06 AM Amanieu d'Antras <amanieu@gmail.com> wrote:
> >
> > Setting bit 31 in x8 when performing a syscall will do the following:
> > - The remainder of x8 is treated as a compat syscall number and is used
> >   to index the compat syscall table.
> > - in_compat_syscall will return true for the duration of the syscall.
> > - VM allocations performed by the syscall will be located in the lower
> >   4G of the address space.
> > - Interrupted syscalls are properly restarted as compat syscalls.
> > - Seccomp will treat the syscall as having AUDIT_ARCH_ARM instead of
> >   AUDIT_ARCH_AARCH64. This affects the arch value seen by seccomp
> >   filters and reported by SIGSYS.
> > - PTRACE_GET_SYSCALL_INFO also treats the syscall as having
> >   AUDIT_ARCH_ARM. Recent versions of strace will correctly report the
> >   system call name and parameters when an AArch64 task mixes 32-bit and
> >   64-bit syscalls.
> >
> > Previously, setting bit 31 of the syscall number would always cause the
> > syscall to return ENOSYS. This allows user programs to reliably detect
> > kernel support for compat syscalls by trying a simple syscall such as
> > getpid.
> >
> > The AArch32-private compat syscalls (__ARM_NR_compat_*) are not exposed
> > through this interface. These syscalls do not make sense in the context
> > of an AArch64 task.
> >
> > Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
> > Co-developed-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
> > Signed-off-by: Ryan Houdek <Houdek.Ryan@fex-emu.org>
>
> I'm still undecided about this approach. It is an easy way to expose the 32-bit
> ABIs, it mostly copies what x86-64 already does with 32-bit syscalls and
> it doesn't expose a lot of attack surface that isn't already exposed to normal
> 32-bit tasks running compat mode.
>
> On the other hand, exposing the entire aarch32 syscall set seems both
> too broad and not broad enough: Half of the system calls behave the
> exact same way in native and compat mode, so they wouldn't need to
> be exposed like this, a lot of others are trivially emulated in user space
> by calling the native versions. The syscalls that are actually hard to do
> such as ioctl() or the signal handling will work for aarch32 emulation, but
> they are still insufficient to correctly emulate other 32-bit architectures
> that have a slightly different ABI. This means the interface is a fairly good
> fit for Tango, but much less so for FEX.

You are correct here. This meshes perfectly with Tango's use case, where
the syscalls match exactly along the aarch32->aarch64->compat syscall
path.

For FEX's use case, we still need to deal with any data structure that
doesn't match across the 32-bit x86 to compat syscall boundary. While
x86->compat will require significantly fewer fixups than x86->aarch64, it
is still likely to have some structure differences that need fixing.
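
As a rough illustration of the kind of fixup meant here, consider a record
that mixes 32-bit and 64-bit fields. The i386 ABI aligns 64-bit integers
inside structs to 4 bytes, while the AArch32 ABI aligns them to 8, so the
same declaration has two different layouts. The struct and function names
below are hypothetical and do not come from FEX or from any kernel
interface; this is only a sketch of the pattern.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical record as a 32-bit x86 guest lays it out: no padding
 * between the fields, because 64-bit members only need 4-byte alignment.
 */
struct rec_x86 {
    uint32_t id;
    uint64_t offset;
} __attribute__((packed));

/*
 * The same record as the AArch32 compat ABI expects it: the 64-bit member
 * is 8-byte aligned, so 4 bytes of padding follow `id`.
 */
struct rec_arm {
    uint32_t id;
    uint32_t pad;
    uint64_t offset;
};

_Static_assert(sizeof(struct rec_x86) == 12, "i386 layout");
_Static_assert(sizeof(struct rec_arm) == 16, "AArch32 layout");
_Static_assert(offsetof(struct rec_x86, offset) == 4, "i386 offset");
_Static_assert(offsetof(struct rec_arm, offset) == 8, "AArch32 offset");

/* Copy field by field; a raw memcpy would misplace `offset`. */
static struct rec_arm rec_x86_to_arm(const struct rec_x86 *in)
{
    struct rec_arm out = { .id = in->id, .pad = 0, .offset = in->offset };
    return out;
}
```

A structure with only 32-bit fields would need no such conversion, which is
why the x86->compat path needs fewer fixups than x86->aarch64.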

>
> It's also worth pointing out that this approach has a few things in common
> with Yury's ilp32 tree at https://github.com/norov/linux/tree/ilp32-5.2
> Unlike the x86 x32 mode, that port however does not allow calling compat
> syscalls from normal 64-bit tasks but rather keys the syscall entry point
> off the executable format, which wouldn't work here. It also uses the
> asm-generic system call numbers instead of the arm32 syscall numbers.
>
> I assume you have already considered or tried the alternative approach of
> only adding a minimal set of syscalls that are needed for the emulation.
> Having a way to limit the address space for mmap() and similar
> system calls sounds like a generally useful addition, and having an
> extended variant of ioctl() that lets you pick the target ABI (arm32, x86-32,
> ...) on supported drivers would probably be better for FEX. Can you
> explain the tradeoffs that led you towards duplicating the syscall
> entry points instead?

I'm unlikely to be very concise here. There are many paper cuts whichever
route is taken.
For me, this one is the best route because it is future-proof against any
upcoming syscall additions.

If we wanted to take the path of duplicating a bunch of compat syscalls
to work from the 64-bit side, we would first need to start with around
nine syscalls that are causing immediate problems:
mmap/mmap2, mremap, shmat, ioctl, recvmsg, recvmmsg, getdents,
and getdents64.
So we could carve those out, adding effectively the same memory
handling code that is being added here[1], and do the ML dance to upstream
them. We now have nine-ish syscalls that were added specifically for
userspace compatibility layers.
That's already beginning to have a bad smell.

The next step comes a couple of months down the line: someone adds a super
cool syscall that, say, allocates memory that is secure over InfiniBand and
flushes to persistence on hibernate. Neato. Oops, this allocates memory, and
since FEX tracks upstream kernel syscall support very closely, we now
need to add yet another syscall that handles the compat version in a
64-bit space.
Or maybe it appends to a linked list of secure memory regions that is only
visible as the head of the list (hello, robust futexes).

See what I mean? Exposing the 32-bit compat syscalls removes the burden of
having to think about every syscall in the context of 32-bit, 64-bit, and
32-bit on 64-bit.
It also removes the burden of me coming back to pester the ML every single
time with new patchsets that add syscalls only for compat layers.

And I'm all about removing unnecessary burden.

[1] Side note: personality flags won't be pretty here. FEX lives in a
mixed syscall world and doesn't want only one or the other working.
FEX does a bunch of stuff in the background, and a personality flag
would be hard to work around whenever we need to do some
memory allocations, file system handling, or FEX's own 64-bit ioctl
handling. It's just not very versatile.
As a partial workaround, FEX already allocates all of the 48/52-bit VA
space, which breaks ASLR and stack growth.

>
>          Arnd

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
  2021-05-18 23:51       ` Amanieu d'Antras
@ 2021-05-19 15:30         ` Steven Price
  -1 siblings, 0 replies; 40+ messages in thread
From: Steven Price @ 2021-05-19 15:30 UTC (permalink / raw)
  To: Amanieu d'Antras, Arnd Bergmann
  Cc: Ryan Houdek, Catalin Marinas, Will Deacon, Mark Rutland,
	David Laight, Mark Brown, Linux ARM, Linux Kernel Mailing List

On 19/05/2021 00:51, Amanieu d'Antras wrote:
> On Tue, May 18, 2021 at 2:03 PM Arnd Bergmann <arnd@kernel.org> wrote:
>> I'm still undecided about this approach. It is an easy way to expose the 32-bit
>> ABIs, it mostly copies what x86-64 already does with 32-bit syscalls and
>> it doesn't expose a lot of attack surface that isn't already exposed to normal
>> 32-bit tasks running compat mode.
>>
>> On the other hand, exposing the entire aarch32 syscall set seems both
>> too broad and not broad enough: Half of the system calls behave the
>> exact same way in native and compat mode, so they wouldn't need to
>> be exposed like this, a lot of others are trivially emulated in user space
>> by calling the native versions. The syscalls that are actually hard to do
>> such as ioctl() or the signal handling will work for aarch32 emulation, but
>> they are still insufficient to correctly emulate other 32-bit architectures
>> that have a slightly different ABI. This means the interface is a fairly good
>> fit for Tango, but much less so for FEX.
>>
>> It's also worth pointing out that this approach has a few things in common
>> with Yury's ilp32 tree at https://github.com/norov/linux/tree/ilp32-5.2
>> Unlike the x86 x32 mode, that port however does not allow calling compat
>> syscalls from normal 64-bit tasks but rather keys the syscall entry point
>> off the executable format, which wouldn't work here. It also uses the
>> asm-generic system call numbers instead of the arm32 syscall numbers.
>>
>> I assume you have already considered or tried the alternative approach of
>> only adding a minimal set of syscalls that are needed for the emulation.
>> Having a way to limit the address space for mmap() and similar
>> system calls sounds like a generally useful addition, and having an
>> extended variant of ioctl() that lets you pick the target ABI (arm32, x86-32,
>> ...) on supported drivers would probably be better for FEX. Can you
>> explain the tradeoffs that led you towards duplicating the syscall
>> entry points instead?
> 
> Tango needs the entire compat ABI to be exposed to support seccomp for
> translated AArch32 processes. Here's how this works:
> 
> 1. When a translated process installs a seccomp filter, Tango injects
> a prefix into the seccomp program which effectively does:
>     if (arch == AUDIT_ARCH_AARCH64) {
>         // 64-bit syscalls used by Tango for internal operations
>         if (syscall_in_tango_whitelist(nr))
>             return SECCOMP_RET_ALLOW;
>     }
>     // continue to user-supplied seccomp program
> 
> 2. When Tango performs a 32-bit syscall on behalf of the translated
> process, the seccomp filter will see a syscall with AUDIT_ARCH_ARM and
> the compat syscall number. This allows the user-supplied seccomp
> filter to behave exactly as if it was running in a native AArch32
> process.
> 

Perhaps I'm missing something, but surely some syscalls that would be
native on 32 bit will have to be translated by Tango to 64 bit syscalls
to do the right thing? E.g. from the previous patch compat sigreturn
isn't available.

In those cases, to correctly emulate seccomp, isn't Tango going to
have to implement the seccomp filter in user space?

I guess the question comes down to how big a hole is
syscall_in_tango_whitelist() - if Tango only requires a small set of
syscalls then there is still some security benefit, but otherwise this
doesn't seem like a particularly big benefit considering you're already
going to need the BPF infrastructure in user space.

Or perhaps I'm wrong and there's some magic that makes this work in the
kernel?

Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
  2021-05-19 15:30         ` Steven Price
@ 2021-05-19 16:14           ` Amanieu d'Antras
  -1 siblings, 0 replies; 40+ messages in thread
From: Amanieu d'Antras @ 2021-05-19 16:14 UTC (permalink / raw)
  To: Steven Price
  Cc: Arnd Bergmann, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, David Laight, Mark Brown, Linux ARM,
	Linux Kernel Mailing List

On Wed, May 19, 2021 at 4:30 PM Steven Price <steven.price@arm.com> wrote:
> Perhaps I'm missing something, but surely some syscalls that would be
> native on 32 bit will have to be translated by Tango to 64 bit syscalls
> to do the right thing? E.g. from the previous patch compat sigreturn
> isn't available.

That's correct.

Tango handles syscalls in 3 different ways:
- ~20 syscalls are completely emulated in userspace or through 64-bit
syscalls. E.g. sigaction, sigreturn, clone, exit.
- Another ~50 syscalls have various forms of pre/post-processing, but
are otherwise passed on to the kernel compat syscall handler. E.g.
open, mmap, ptrace.
- The remaining syscalls are passed on to the kernel compat syscall
handler directly.

The first group of ~20 syscalls will effectively bypass the
user-specified seccomp filter: any 64-bit syscalls used to emulate
them will be whitelisted. I consider this an acceptable limitation to
Tango's seccomp support since I see no viable way of supporting
seccomp filtering for these syscalls.
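
The three-way split above can be pictured as a small routing table. The
sketch below is illustrative only: the enum, classify() and compat_x8() are
hypothetical names rather than Tango's code, only a handful of the syscalls
named above are listed, and the numbers are the AArch32 EABI syscall
numbers as I understand them. compat_x8() shows the value that would be
placed in x8 to reach the compat syscall table under this series.

```c
#include <stdint.h>

#define COMPAT_SYSCALL_BIT 0x80000000u  /* bit 31 of x8, per the series */

enum route {
    ROUTE_EMULATE,      /* handled in userspace or via 64-bit syscalls */
    ROUTE_WRAP_COMPAT,  /* pre/post-processing around a compat syscall */
    ROUTE_PASS_COMPAT,  /* forwarded to the compat handler unchanged */
};

/* AArch32 EABI numbers for a few of the syscalls named above. */
enum {
    ARM_NR_exit = 1,
    ARM_NR_open = 5,
    ARM_NR_ptrace = 26,
    ARM_NR_sigaction = 67,
    ARM_NR_sigreturn = 119,
    ARM_NR_clone = 120,
};

/* Decide how a translated AArch32 syscall is handled. */
static enum route classify(uint32_t nr)
{
    switch (nr) {
    case ARM_NR_sigaction:
    case ARM_NR_sigreturn:
    case ARM_NR_clone:
    case ARM_NR_exit:
        return ROUTE_EMULATE;
    case ARM_NR_open:
    case ARM_NR_ptrace:
        return ROUTE_WRAP_COMPAT;
    default:
        return ROUTE_PASS_COMPAT;
    }
}

/* Value to place in x8 so the kernel indexes the compat syscall table. */
static uint32_t compat_x8(uint32_t nr)
{
    return COMPAT_SYSCALL_BIT | nr;
}
```

Only the last two routes ever reach the kernel with AUDIT_ARCH_ARM, which
is why the first group is invisible to the user-specified filter.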

> In those cases, to correctly emulate seccomp, isn't Tango going to
> have to implement the seccomp filter in user space?

I have not implemented user-mode seccomp emulation because it can
trivially be bypassed by spawning a 64-bit child process which runs
outside Tango. Even when spawning another translated process, the
user-mode filter will not be preserved across an execve.

> I guess the question comes down to how big a hole is
> syscall_in_tango_whitelist() - if Tango only requires a small set of
> syscalls then there is still some security benefit, but otherwise this
> doesn't seem like a particularly big benefit considering you're already
> going to need the BPF infrastructure in user space.

Currently Tango only whitelists ~50 syscalls, which is small enough to
provide security benefits and definitely better than not supporting
seccomp at all.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
  2021-05-19 16:14           ` Amanieu d'Antras
@ 2021-05-21  8:51             ` Steven Price
  -1 siblings, 0 replies; 40+ messages in thread
From: Steven Price @ 2021-05-21  8:51 UTC (permalink / raw)
  To: Amanieu d'Antras
  Cc: Arnd Bergmann, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, David Laight, Mark Brown, Linux ARM,
	Linux Kernel Mailing List

On 19/05/2021 17:14, Amanieu d'Antras wrote:
> On Wed, May 19, 2021 at 4:30 PM Steven Price <steven.price@arm.com> wrote:
>> Perhaps I'm missing something, but surely some syscalls that would be
>> native on 32 bit will have to be translated by Tango to 64 bit syscalls
>> to do the right thing? E.g. from the previous patch compat sigreturn
>> isn't available.
> 
> That's correct.
> 
> Tango handles syscalls in 3 different ways:
> - ~20 syscalls are completely emulated in userspace or through 64-bit
> syscalls. E.g. sigaction, sigreturn, clone, exit.
> - Another ~50 syscalls have various forms of pre/post-processing, but
> are otherwise passed on to the kernel compat syscall handler. E.g.
> open, mmap, ptrace.
> - The remaining syscalls are passed on to the kernel compat syscall
> handler directly.
> 
> The first group of ~20 syscalls will effectively bypass the
> user-specified seccomp filter: any 64-bit syscalls used to emulate
> them will be whitelisted. I consider this an acceptable limitation to
> Tango's seccomp support since I see no viable way of supporting
> seccomp filtering for these syscalls.

I agree it's difficult - the only 'solution' I can see is, as I said, to
emulate the BPF code in user space.

>> In those cases to correctly emulate seccomp, isn't Tango going to
>> have to implement the seccomp filter in user space?
> 
> I have not implemented user-mode seccomp emulation because it can
> trivially be bypassed by spawning a 64-bit child process which runs
> outside Tango. Even when spawning another translated process, the
> user-mode filter will not be preserved across an execve.

Clearly if you have user-mode seccomp emulation then you'd hook execve
and either install the real BPF filter (if spawning a 64 bit child
outside Tango) or ensure that the user-mode emulation is passed on to
the child (if running within Tango).

You already have a potential 'issue' here of a 64 bit process setting up
a seccomp filter and then execve()ing a 32 bit (Tango'd) process. The
set of syscalls needed for the system which supports AArch32 natively is
going to be different from the syscalls needed for Tango. (Fundamentally
this is a major limitation with the whole seccomp syscall filtering
approach).

>> I guess the question comes down to how big a hole is
>> syscall_in_tango_whitelist() - if Tango only requires a small set of
>> syscalls then there is still some security benefit, but otherwise this
>> doesn't seem like a particularly big benefit considering you're already
>> going to need the BPF infrastructure in user space.
> 
> Currently Tango only whitelists ~50 syscalls, which is small enough to
> provide security benefits and definitely better than not supporting
> seccomp at all.

Agreed, and I don't want to imply that this approach is necessarily
wrong. But given that the approach of getting the kernel to do the
compat syscall filtering is not perfect, I'm not sure in itself it's a
great justification for needing the kernel to support all the compat
syscalls.

One other thought: I suspect in practise there aren't actually many
variations in the BPF programs used with seccomp. It may well be quite
possible to convert the 32-bit syscall filtering programs to filter the
equivalent 64-bit syscalls that Tango would use. Sadly this would be
fragile if a program used a BPF program which didn't follow the "normal"
pattern.

Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
  2021-05-21  8:51             ` Steven Price
@ 2021-05-21 19:18               ` Amanieu d'Antras
  -1 siblings, 0 replies; 40+ messages in thread
From: Amanieu d'Antras @ 2021-05-21 19:18 UTC (permalink / raw)
  To: Steven Price
  Cc: Arnd Bergmann, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, David Laight, Mark Brown, Linux ARM,
	Linux Kernel Mailing List

On Fri, May 21, 2021 at 9:51 AM Steven Price <steven.price@arm.com> wrote:
> >> In those cases to correctly emulate seccomp, isn't Tango going to
> >> have to implement the seccomp filter in user space?
> >
> > I have not implemented user-mode seccomp emulation because it can
> > trivially be bypassed by spawning a 64-bit child process which runs
> > outside Tango. Even when spawning another translated process, the
> > user-mode filter will not be preserved across an execve.
>
> Clearly if you have user-mode seccomp emulation then you'd hook execve
> and either install the real BPF filter (if spawning a 64 bit child
> outside Tango) or ensure that the user-mode emulation is passed on to
> the child (if running within Tango).

Spawning another process is just an example. Fundamentally, Tango is
not intended or designed to be a sandbox around the 32-bit code. For
example, many of the newer ioctls use u64 instead of a pointer type to
avoid the need for a compat_ioctl handler. This means that such ioctls
could be abused to read/write any address in the process address
space, including the code that is performing the usermode seccomp
emulation.
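
To make the u64 point concrete, here is a minimal sketch in C. The
struct and function names (demo_args_u64, demo_pack) are hypothetical
and are not taken from any real driver; the sketch only illustrates why
a fixed-width address field needs no compat translation, and why the
field by itself places no limit on the address it carries.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical ioctl argument block, not taken from any real driver.
 * The user address travels as a fixed-width integer instead of a
 * pointer, so the layout is the same for 32-bit and 64-bit callers. */
struct demo_args_u64 {
	uint64_t buf;	/* user address */
	uint32_t len;
	uint32_t pad;	/* explicit padding keeps the size ABI-independent */
};

/* These hold on every ABI, which is why no compat handler is needed. */
static_assert(sizeof(struct demo_args_u64) == 16, "size is fixed");
static_assert(offsetof(struct demo_args_u64, len) == 8, "offsets are fixed");

/* Whoever fills in the block can store any 64-bit value in buf,
 * regardless of whether the code doing so is 32-bit or 64-bit. */
static struct demo_args_u64 demo_pack(uint64_t addr, uint32_t len)
{
	struct demo_args_u64 a = { .buf = addr, .len = len, .pad = 0 };

	return a;
}
```

Calling demo_pack(0x100000000, 4) produces a block whose address lies
above 4GB, and nothing in the layout rejects it.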

> You already have a potential 'issue' here of a 64 bit process setting up
> a seccomp filter and then execve()ing a 32 bit (Tango'd) process. The
> set of syscalls needed for the system which supports AArch32 natively is
> going to be different from the syscalls needed for Tango. (Fundamentally
> this is a major limitation with the whole seccomp syscall filtering
> approach).

The specific example I had in mind here is Android which installs a
global seccomp filter on the zygote process from which app processes
are forked. This filter is designed for mixed arm32/arm64 systems
and therefore has syscall whitelists for both AArch32 and AArch64.
This filter allows 32-bit processes to spawn 64-bit processes and
vice-versa: for example, many 32-bit apps will invoke another 32-bit
executable via system() which uses a 64-bit /system/bin/sh.

> >> I guess the question comes down to how big a hole is
> >> syscall_in_tango_whitelist() - if Tango only requires a small set of
> >> syscalls then there is still some security benefit, but otherwise this
> >> doesn't seem like a particularly big benefit considering you're already
> >> going to need the BPF infrastructure in user space.
> >
> > Currently Tango only whitelists ~50 syscalls, which is small enough to
> > provide security benefits and definitely better than not supporting
> > seccomp at all.
>
> Agreed, and I don't want to imply that this approach is necessarily
> wrong. But given that the approach of getting the kernel to do the
> compat syscall filtering is not perfect, I'm not sure in itself it's a
> great justification for needing the kernel to support all the compat
> syscalls.

I feel that exposing all compat syscalls to 64-bit processes is better
than the alternative of only exposing a subset of them. Off the top of
my head I can think of quite a few compat syscalls that cannot be
fully emulated in userspace and would need to be exposed in the
kernel:
- mmap/mremap/shmat/io_setup: anything that allocates VM space needs
to return a pointer in the low 4GB.
- ioctl: too many variants to reasonably maintain a separate compat
layer in userspace.
- getdents/lseek: ext4 uses 32-bit directory offsets for 32-bit processes.
- get_robust_list/set_robust_list: different in-memory ABI for
32/64-bit processes.
- open: don't force O_LARGEFILE for 32-bit processes.
- io_uring_setup: different in-memory ABI for 32/64-bit processes.
- (and possibly many others)

Also consider the churn involved when adding a new syscall which
behaves differently in compat processes: rather than just using
in_compat_syscall() or wiring up a COMPAT_SYSCALL_DEFINE, a compat
variant of this syscall would also need to be added to the 64-bit
syscall table to support translation layers like Tango and FEX.

> One other thought: I suspect in practise there aren't actually many
> variations in the BPF programs used with seccomp. It may well be quite
> possible to convert the 32-bit syscall filtering programs to filter the
> equivalent 64-bit syscalls that Tango would use. Sadly this would be
> fragile if a program used a BPF program which didn't follow the "normal"
> pattern.

This might work for simple filters that only look at the syscall
number, but becomes much harder when the filter also inspects the
syscall arguments.
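
As a rough illustration of the difference, the sketch below shows two
classic-BPF seccomp filters for AUDIT_ARCH_ARM. The first looks only at
the syscall number (4 is write in the AArch32 EABI; the AArch64 number
is 64), so converting it is mostly a matter of remapping constants. The
second also inspects args[0], which takes two 32-bit loads per
argument; converting it presumably also requires knowing how each
inspected argument maps onto the 64-bit call that is actually issued.
The names demo_nr_only, demo_write_stdout and demo_bpf_run are
hypothetical, and demo_bpf_run is a toy evaluator for just the three
opcodes used here, not the kernel's implementation. The argument loads
assume a little-endian host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static_assert(offsetof(struct seccomp_data, args) == 16, "args follow ip");

#define DEMO_LEN(a) (sizeof(a) / sizeof((a)[0]))
#define DEMO_ARG0 offsetof(struct seccomp_data, args)

/* Filter 1: allow AArch32 write (nr 4), fail everything else. */
static const struct sock_filter demo_nr_only[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_ARM, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 4, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | 1),
};

/* Filter 2: allow write only when args[0] (the fd) is exactly 1. */
static const struct sock_filter demo_write_stdout[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_ARM, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 4, 0, 4),
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, DEMO_ARG0),      /* low half */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 0, 2),
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, DEMO_ARG0 + 4),  /* high half */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Toy evaluator for the three opcodes used above. */
static uint32_t demo_bpf_run(const struct sock_filter *prog, size_t len,
			     const struct seccomp_data *sd)
{
	uint32_t acc = 0;

	for (size_t pc = 0; pc < len; pc++) {
		const struct sock_filter *insn = &prog[pc];

		switch (insn->code) {
		case BPF_LD | BPF_W | BPF_ABS:
			if (insn->k > sizeof(*sd) - sizeof(acc))
				return SECCOMP_RET_KILL;
			memcpy(&acc, (const char *)sd + insn->k, sizeof(acc));
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
			/* offsets are relative to the next instruction */
			pc += (acc == insn->k) ? insn->jt : insn->jf;
			break;
		case BPF_RET | BPF_K:
			return insn->k;
		default:
			return SECCOMP_RET_KILL;
		}
	}
	return SECCOMP_RET_KILL;
}
```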

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
  2021-05-21 19:18               ` Amanieu d'Antras
@ 2021-05-24 11:20                 ` Steven Price
  -1 siblings, 0 replies; 40+ messages in thread
From: Steven Price @ 2021-05-24 11:20 UTC (permalink / raw)
  To: Amanieu d'Antras
  Cc: Arnd Bergmann, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, David Laight, Mark Brown, Linux ARM,
	Linux Kernel Mailing List

On 21/05/2021 20:18, Amanieu d'Antras wrote:
> On Fri, May 21, 2021 at 9:51 AM Steven Price <steven.price@arm.com> wrote:
>>>> In those cases to correctly emulate seccomp, isn't Tango going to
>>>> have to implement the seccomp filter in user space?
>>>
>>> I have not implemented user-mode seccomp emulation because it can
>>> trivially be bypassed by spawning a 64-bit child process which runs
>>> outside Tango. Even when spawning another translated process, the
>>> user-mode filter will not be preserved across an execve.
>>
>> Clearly if you have user-mode seccomp emulation then you'd hook execve
>> and either install the real BPF filter (if spawning a 64 bit child
>> outside Tango) or ensure that the user-mode emulation is passed on to
>> the child (if running within Tango).
> 
> Spawning another process is just an example. Fundamentally, Tango is
> not intended or designed to be a sandbox around the 32-bit code. For
> example, many of the newer ioctls use u64 instead of a pointer type to
> avoid the need for a compat_ioctl handler. This means that such ioctls
> could be abused to read/write any address in the process address
> space, including the code that is performing the usermode seccomp
> emulation.

Indeed, I think it's almost impossible to fully preserve all the
security aspects of the seccomp filter - whether that's because the 32
bit code can "attack" Tango itself, or whether it's because Tango
requires white-listing ioctls that would normally be excluded by the
seccomp filter.

Even with the compat syscalls exposed, ioctls that take a u64 and don't
do any special handling for a compat call would happily accept addresses
above 4GB even though they wouldn't if it were a real compat process.

For the user space emulation of the compat syscalls, Tango does have the
option of attempting to fix up such ioctls to restrict the address range
- but it's obviously a very large set of ioctls to audit if you want to
be complete.
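
A minimal sketch of what such a fix-up check could look like for a
single u64 address field follows. The helper name demo_range_below_4g
is hypothetical and this is not Tango's code; the per-ioctl knowledge
of which fields hold addresses is exactly the part that would need
auditing and is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper a translation layer might apply to each u64
 * address field before forwarding an ioctl. Returns true only if the
 * whole range [addr, addr + len) lies below 4GB. */
static bool demo_range_below_4g(uint64_t addr, uint64_t len)
{
	const uint64_t limit = UINT64_C(1) << 32;

	if (addr >= limit)
		return false;
	/* limit - addr cannot underflow here, and comparing this way
	 * avoids overflow in addr + len */
	return len <= limit - addr;
}

/* Usage example covering the boundary cases. */
static void demo_range_usage(void)
{
	assert(demo_range_below_4g(0x1000, 16));
	assert(demo_range_below_4g(0xFFFFFFFFu, 1));   /* last byte below 4GB */
	assert(!demo_range_below_4g(0xFFFFFFFFu, 2));  /* crosses the limit */
	assert(!demo_range_below_4g(UINT64_C(1) << 32, 1));
}
```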

>> You already have a potential 'issue' here of a 64 bit process setting up
>> a seccomp filter and then execve()ing a 32 bit (Tango'd) process. The
>> set of syscalls needed for the system which supports AArch32 natively is
>> going to be different from the syscalls needed for Tango. (Fundamentally
>> this is a major limitation with the whole seccomp syscall filtering
>> approach).
> 
> The specific example I had in mind here is Android which installs a
> global seccomp filter on the zygote process from which app processes
> are forked. This filter is designed for mixed arm32/arm64 systems
> and therefore has syscall whitelists for both AArch32 and AArch64.
> This filter allows 32-bit processes to spawn 64-bit processes and
> vice-versa: for example, many 32-bit apps will invoke another 32-bit
> executable via system() which uses a 64-bit /system/bin/sh.

Given that Tango is going to be constrained by the 64 bit version of the
seccomp filter, I don't really understand why the emulated 32-bit
process needs to be separately constrained by the list of 32-bit calls
in this case.

The problem really comes if the controlling 64 bit process is written
"knowing" that it is executing a 32 bit process and tailors the seccomp
filter for that. I don't see any solution actually solving that (Tango
requires extra 64 bit syscalls which won't be naturally included).

>>>> I guess the question comes down to how big a hole is
>>>> syscall_in_tango_whitelist() - if Tango only requires a small set of
>>>> syscalls then there is still some security benefit, but otherwise this
>>>> doesn't seem like a particularly big benefit considering you're already
>>>> going to need the BPF infrastructure in user space.
>>>
>>> Currently Tango only whitelists ~50 syscalls, which is small enough to
>>> provide security benefits and definitely better than not supporting
>>> seccomp at all.
>>
>> Agreed, and I don't want to imply that this approach is necessarily
>> wrong. But given that the approach of getting the kernel to do the
>> compat syscall filtering is not perfect, I'm not sure in itself it's a
>> great justification for needing the kernel to support all the compat
>> syscalls.
> 
> I feel that exposing all compat syscalls to 64-bit processes is better
> than the alternative of only exposing a subset of them. Off the top of
> my head I can think of quite a few compat syscalls that cannot be
> fully emulated in userspace and would need to be exposed in the
> kernel:
> - mmap/mremap/shmat/io_setup: anything that allocates VM space needs
> to return a pointer in the low 4GB.

So a "generic" way of requesting that the kernel limit the address space for
allocations would be potentially useful for other purposes. Adding a new
syscall for this purpose would be sensible. We already have (at least)
two "hacks" in mmap for controlling the address range that can be used:

 * MAP_32BIT - x86 only, and really "31 bit"

 * Providing a mmap() hint with the top bits set to opt-in to 52-bit VAs.

A well defined mechanism for controlling the valid VA range for
allocations would be much better than adding more hacks - and bonus
points if it works for all the different types of allocation unlike the
above.

> - ioctl: too many variants to reasonably maintain a separate compat
> layer in userspace.

I do agree that it probably makes sense to expose a compat_ioctl()
syscall, but this isn't a complete fix and user space is likely to need
to do some ioctl translations. And x86->aarch64 is obviously going to
require translation in many cases.

> - getdents/lseek: ext4 uses 32-bit directory offsets for 32-bit processes.

My guess is that a getdents32() call makes sense, but I haven't looked
into this.

> - get_robust_list/set_robust_list: different in-memory ABI for
> 32/64-bit processes.

Here some careful thought needs to go into how both 32 bit and 64 bit
robust lists would coexist. I haven't looked into it, but I doubt it's
as simple as just exposing the compat calls.

> - open: don't force O_LARGEFILE for 32-bit processes.

At first glance it seems reasonable to expose the compat version.

> - io_uring_setup: different in-memory ABI for 32/64-bit processes.

I'm not familiar enough with io_uring to sensibly comment on this.

> - (and possibly many others)
> 
> Also consider the churn involved when adding a new syscall which
> behaves differently in compat processes: rather than just using
> in_compat_syscall() or wiring up a COMPAT_SYSCALL_DEFINE, a compat
> variant of this syscall would also need to be added to the 64-bit
> syscall table to support translation layers like Tango and FEX.

Just exposing a new compat call to translated processes would also be
potentially dangerous - Tango needs to actually consider whether the
syscall is safe before exposing it. Yes there's some potential cost, but
in general there's much more of an attempt to make the compat syscalls
the same as native ones wherever possible. So I would expect that the
cases needing special handling are reasonably rare.

>> One other thought: I suspect in practise there aren't actually many
>> variations in the BPF programs used with seccomp. It may well be quite
>> possible to convert the 32-bit syscall filtering programs to filter the
>> equivalent 64-bit syscalls that Tango would use. Sadly this would be
>> fragile if a program used a BPF program which didn't follow the "normal"
>> pattern.
> 
> This might work for simple filters that only look at the syscall
> number, but becomes much harder when the filter also inspects the
> syscall arguments.

Yes it heavily relies on being able to pattern match what specific
programs/libraries do. My hunch is that most seccomp filters are set by
a very small number of programs/libraries. But I've no data to back that up.

Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
@ 2021-05-24 11:20                 ` Steven Price
  0 siblings, 0 replies; 40+ messages in thread
From: Steven Price @ 2021-05-24 11:20 UTC (permalink / raw)
  To: Amanieu d'Antras
  Cc: Arnd Bergmann, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, David Laight, Mark Brown, Linux ARM,
	Linux Kernel Mailing List

On 21/05/2021 20:18, Amanieu d'Antras wrote:
> On Fri, May 21, 2021 at 9:51 AM Steven Price <steven.price@arm.com> wrote:
>>>> In those cases to correctly emulate seccomp, isn't Tango is going to
>>>> have to implement the seccomp filter in user space?
>>>
>>> I have not implemented user-mode seccomp emulation because it can
>>> trivially be bypassed by spawning a 64-bit child process which runs
>>> outside Tango. Even when spawning another translated process, the
>>> user-mode filter will not be preserved across an execve.
>>
>> Clearly if you have user-mode seccomp emulation then you'd hook execve
>> and either install the real BPF filter (if spawning a 64 bit child
>> outside Tango) or ensure that the user-mode emulation is passed on to
>> the child (if running within Tango).
> 
> Spawning another process is just an example. Fundamentally, Tango is
> not intended or designed to be a sandbox around the 32-bit code. For
> example, many of the newer ioctls use u64 instead of a pointer type to
> avoid the need for a compat_ioctl handler. This means that such ioctls
> could be abused to read/write any address in the process address
> space, including the code that is performing the usermode seccomp
> emulation.

Indeed, I think it's almost impossible to fully preserve all the
security aspects of the seccomp filter - whether that's because the 32
bit code can "attack" Tango itself, or whether it's because Tango
requires white-listing ioctls that would normally be excluded by the
seccomp filter.

Even with the compat syscalls exposed, ioctls that take a u64 and don't
do any special handling for a compat call would happily accept addresses
>4GB even though they wouldn't if it was a real compat process.

For the user space emulation of the compat syscalls, Tango does have the
option of attempting to fix up such ioctls to restrict the address range
- but it's obviously a very large set of ioctls to audit if you want to
be complete.

>> You already have a potential 'issue' here of a 64 bit process setting up
>> a seccomp filter and then execve()ing a 32 bit (Tango'd) process. The
>> set of syscalls needed for the system which supports AArch32 natively is
>> going to be different from the syscalls needed for Tango. (Fundamentally
>> this is a major limitation with the whole seccomp syscall filtering
>> approach).
> 
> The specific example I had in mind here is Android which installs a
> global seccomp filter on the zygote process from which app processes
> are forked from. This filter is designed for mixed arm32/arm64 systems
> and therefore has syscall whitelists for both AArch32 and AArch64.
> This filter allows 32-bit processes to spawn 64-bit processes and
> vice-versa: for example, many 32-bit apps will invoke another 32-bit
> executable via system() which uses a 64-bit /system/bin/sh.

Given that Tango is going to be constrained by the 64 bit version of the
seccomp filter, I don't really understand why the emulated 32-bit
process needs to be separately constrained by the list of 32-bit calls
in this case.

The problem really comes if the controlling 64 bit process is written
"knowing" that it is executing a 32 bit process and tailors the seccomp
filter for that. I don't see any solution actually solving that (Tango
requires extra 64 bit syscalls which won't be naturally included).

>>>> I guess the question comes down to how big a hole is
>>>> syscall_in_tango_whitelist() - if Tango only requires a small set of
>>>> syscalls then there is still some security benefit, but otherwise this
>>>> doesn't seem like a particularly big benefit considering you're already
>>>> going to need the BPF infrastructure in user space.
>>>
>>> Currently Tango only whitelists ~50 syscalls, which is small enough to
>>> provide security benefits and definitely better than not supporting
>>> seccomp at all.
>>
>> Agreed, and I don't want to imply that this approach is necessarily
>> wrong. But given that the approach of getting the kernel to do the
>> compat syscall filtering is not perfect, I'm not sure in itself it's a
>> great justification for needing the kernel to support all the compat
>> syscalls.
> 
> I feel that exposing all compat syscalls to 64-bit processes is better
> than the alternative of only exposing a subset of them. Of the top of
> my head I can think of quite a few compat syscalls that cannot be
> fully emulated in userspace and would need to be exposed in the
> kernel:
> - mmap/mremap/shmat/io_setup: anything that allocates VM space needs
> to return a pointer in the low 4GB.

So a "generic" way of requested the kernel limit the address space for
allocations would be potentially useful for other purposes. Adding a new
syscall for this purpose would be sensible. We already have (at least)
two "hacks" in mmap for controlling the address range that can be used:

 * MAP_32BIT - x86 only, and really "31 bit"

 * Providing a mmap() hint with the top bits set to opt-in to 52-bit VAs.

A well defined mechanism for controlling the valid VA range for
allocations would be much better than adding more hacks - and bonus
points if it works for all the different types of allocation unlike the
above.

> - ioctl: too many variants to reasonably maintain a separate compat
> layer in userspace.

I do agree that it probably makes sense to expose a compat_ioctl()
syscall, but this isn't a complete fix and user space is likely to need
to do some ioctl translations. And x86->aarch64 is obviously going to
require translation in many cases.

> - getdents/lseek: ext4 uses 32-bit directory offsets for 32-bit processes.

My guess is that a getdents32() call makes sense, but I haven't looked
into this.

> - get_robust_list/set_robust_list: different in-memory ABI for
> 32/64-bit processes.

Here some careful thought needs to go into how both 32 bit and 64 bit
robust lists would coexist. I haven't looked into it, but I doubt it's
as simple as just exposing the compat calls.
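
To make the ABI difference concrete, the sketch below mirrors the two list
head layouts with fixed-width types. The struct names are invented for this
example; they are meant to correspond to the kernel's struct robust_list_head
and struct compat_robust_list_head, where pointers and long shrink from 8 to
4 bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as seen by a 64-bit task: pointers and long are 8 bytes. */
struct robust_list_head_64 {
	uint64_t next;            /* pointer to the first list entry */
	int64_t  futex_offset;    /* offset from an entry to its futex word */
	uint64_t list_op_pending; /* entry being locked or unlocked */
};

/* Layout as seen by a 32-bit task: the same fields at half the width. */
struct robust_list_head_32 {
	uint32_t next;
	int32_t  futex_offset;
	uint32_t list_op_pending;
};

static_assert(sizeof(struct robust_list_head_64) == 24, "64-bit head is 24 bytes");
static_assert(sizeof(struct robust_list_head_32) == 12, "32-bit head is 12 bytes");
```

The kernel walks this list at thread exit, so it has to know which of the
two layouts the registered head uses.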

> - open: don't force O_LARGEFILE for 32-bit processes.

At first glance it seems reasonable to expose the compat version.

> - io_uring_create: different in-memory ABI for 32/64-bit processes.

I'm not familiar enough with io_uring to sensibly comment on this.

> - (and possibly many others)
> 
> Also consider the churn involved when adding a new syscall which
> behaves differently in compat processes: rather than just using
> in_compat_syscall() or wiring up a COMPAT_SYSCALL_DEFINE, a compat
> variant of this syscall would also need to be added to the 64-bit
> syscall table to support translation layers like Tango and FEX.

Just exposing a new compat call to translated processes would also be
potentially dangerous - Tango needs to actually consider whether the
syscall is safe before exposing it. Yes there's some potential cost, but
in general there's much more of an attempt to make the compat syscalls
the same as native ones wherever possible. So I would expect that the
cases needing special handling are reasonably rare.

>> One other thought: I suspect in practice there aren't actually many
>> variations in the BPF programs used with seccomp. It may well be quite
>> possible to convert the 32-bit syscall filtering programs to filter the
>> equivalent 64-bit syscalls that Tango would use. Sadly this would be
>> fragile if a program used a BPF program which didn't follow the "normal"
>> pattern.
> 
> This might work for simple filters that only look at the syscall
> number, but becomes much harder when the filter also inspects the
> syscall arguments.

Yes, it relies heavily on being able to pattern-match what specific
programs/libraries do. My hunch is that most seccomp filters are set by
a very small number of programs/libraries. But I've no data to back that up.
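
For illustration only, here is a sketch of what such pattern matching might
look like for the simplest kind of filter. Everything in it is an assumption
made for this example: the filter arm_deny_write, the helper
rewrite_simple_filter, and the single hard-coded mapping of write (4 in the
32-bit ARM table, 64 in the arm64 table). It tracks the accumulator linearly
and ignores control flow, so it is nowhere near a general translator:

```c
#include <assert.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define ARM_NR_WRITE      4 /* write in the 32-bit ARM EABI table */
#define AARCH64_NR_WRITE 64 /* write in the arm64 table */

/* The "normal" pattern: check arch, then the syscall number, then return. */
static const struct sock_filter arm_deny_write[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_ARM, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ARM_NR_WRITE, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | 1 /* EPERM */),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/*
 * Copy a filter, swapping the arch and syscall number constants.
 * Returns 0 on success, -1 if the filter falls outside the simple pattern.
 */
static int rewrite_simple_filter(const struct sock_filter *in,
				 struct sock_filter *out, size_t n)
{
	int loaded_nr = 0; /* nonzero while the accumulator holds seccomp_data.nr */

	assert(in && out);
	for (size_t i = 0; i < n; i++) {
		out[i] = in[i];
		if (in[i].code == (BPF_LD | BPF_W | BPF_ABS)) {
			/* Loads past nr/arch (instruction pointer, args[]) are not handled. */
			if (in[i].k >= offsetof(struct seccomp_data, instruction_pointer))
				return -1;
			loaded_nr = (in[i].k == offsetof(struct seccomp_data, nr));
		} else if (in[i].code == (BPF_JMP | BPF_JEQ | BPF_K)) {
			if (loaded_nr) {
				if (in[i].k != ARM_NR_WRITE)
					return -1; /* no mapping known in this sketch */
				out[i].k = AARCH64_NR_WRITE;
			} else if (in[i].k == AUDIT_ARCH_ARM) {
				out[i].k = AUDIT_ARCH_AARCH64;
			}
		} else if (BPF_CLASS(in[i].code) != BPF_RET) {
			return -1; /* anything else is not the "normal" pattern */
		}
	}
	return 0;
}
```

A filter that loads from args[] is rejected outright, which is the case
Amanieu points out as being much harder.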

Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls
  2021-05-24 11:20                 ` Steven Price
@ 2021-05-24 12:38                   ` David Laight
  -1 siblings, 0 replies; 40+ messages in thread
From: David Laight @ 2021-05-24 12:38 UTC (permalink / raw)
  To: 'Steven Price', Amanieu d'Antras
  Cc: Arnd Bergmann, Ryan Houdek, Catalin Marinas, Will Deacon,
	Mark Rutland, Mark Brown, Linux ARM, Linux Kernel Mailing List

From: Steven Price
> Sent: 24 May 2021 12:21
...
> So a "generic" way of requesting that the kernel limit the address space
> for allocations would be potentially useful for other purposes. Adding a new
> syscall for this purpose would be sensible. We already have (at least)
> two "hacks" in mmap for controlling the address range that can be used:
> 
>  * MAP_32BIT - x86 only, and really "31 bit"
> 
>  * Providing a mmap() hint with the top bits set to opt-in to 52-bit VAs.
> 
> A well defined mechanism for controlling the valid VA range for
> allocations would be much better than adding more hacks - and bonus
> points if it works for all the different types of allocation unlike the
> above.

I'd have thought a 'MAP_BELOW' flag (cf MAP_FIXED) would suffice.
I'm sort of surprised MAP_32BIT wasn't implemented that way.
'man mmap' says MAP_32BIT was added for x64 thread stacks - I thought
the requirement came from 32-bit Wine (Windows has a 2G user/kernel boundary).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2021-05-24 23:22 UTC | newest]

Thread overview: 40+ messages
2021-05-18  9:06 [RESEND PATCH v4 0/8] arm64: Allow 64-bit tasks to invoke compat syscalls Amanieu d'Antras
2021-05-18  9:06 ` Amanieu d'Antras
2021-05-18  9:06 ` [RESEND PATCH v4 1/8] mm: Add arch_get_mmap_base_topdown macro Amanieu d'Antras
2021-05-18  9:06   ` Amanieu d'Antras
2021-05-18  9:06 ` [RESEND PATCH v4 2/8] hugetlbfs: Use arch_get_mmap_* macros Amanieu d'Antras
2021-05-18  9:06   ` Amanieu d'Antras
2021-05-18  9:06 ` [RESEND PATCH v4 3/8] mm: Support mmap_compat_base with the generic layout Amanieu d'Antras
2021-05-18  9:06   ` Amanieu d'Antras
2021-05-18  9:06 ` [RESEND PATCH v4 4/8] arm64: Separate in_compat_syscall from is_compat_task Amanieu d'Antras
2021-05-18  9:06   ` Amanieu d'Antras
2021-05-18  9:06 ` [RESEND PATCH v4 5/8] arm64: mm: Use HAVE_ARCH_COMPAT_MMAP_BASES Amanieu d'Antras
2021-05-18  9:06   ` Amanieu d'Antras
2021-05-18  9:06 ` [RESEND PATCH v4 6/8] arm64: Add a compat syscall flag to thread_info Amanieu d'Antras
2021-05-18  9:06   ` Amanieu d'Antras
2021-05-18  9:06 ` [RESEND PATCH v4 7/8] arm64: Forbid calling compat sigreturn from 64-bit tasks Amanieu d'Antras
2021-05-18  9:06   ` Amanieu d'Antras
2021-05-18  9:06 ` [RESEND PATCH v4 8/8] arm64: Allow 64-bit tasks to invoke compat syscalls Amanieu d'Antras
2021-05-18  9:06   ` Amanieu d'Antras
2021-05-18 13:02   ` Arnd Bergmann
2021-05-18 13:02     ` Arnd Bergmann
2021-05-18 20:26     ` David Laight
2021-05-18 20:26       ` David Laight
2021-05-18 22:41       ` Ryan Houdek
2021-05-18 22:41         ` Ryan Houdek
2021-05-18 23:51     ` Amanieu d'Antras
2021-05-18 23:51       ` Amanieu d'Antras
2021-05-19 15:30       ` Steven Price
2021-05-19 15:30         ` Steven Price
2021-05-19 16:14         ` Amanieu d'Antras
2021-05-19 16:14           ` Amanieu d'Antras
2021-05-21  8:51           ` Steven Price
2021-05-21  8:51             ` Steven Price
2021-05-21 19:18             ` Amanieu d'Antras
2021-05-21 19:18               ` Amanieu d'Antras
2021-05-24 11:20               ` Steven Price
2021-05-24 11:20                 ` Steven Price
2021-05-24 12:38                 ` David Laight
2021-05-24 12:38                   ` David Laight
2021-05-18 23:52     ` Ryan Houdek
2021-05-18 23:52       ` Ryan Houdek
