linux-kernel.vger.kernel.org archive mirror
* [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls
@ 2021-02-11 20:21 sonicadvance1
  2021-02-12 11:30 ` Steven Price
  2021-02-12 14:13 ` Mark Rutland
  0 siblings, 2 replies; 12+ messages in thread
From: sonicadvance1 @ 2021-02-11 20:21 UTC (permalink / raw)
  Cc: amanieu, Ryan Houdek, Catalin Marinas, Will Deacon, Mark Rutland,
	Oleg Nesterov, Al Viro, Dave Martin, Amit Daniel Kachhap,
	Mark Brown, Marc Zyngier, David Brazdil, Jean-Philippe Brucker,
	Andrew Morton, Anshuman Khandual, Gavin Shan, Mike Rapoport,
	Steven Price, Vincenzo Frascino, Kristina Martsenko, Kees Cook,
	Sami Tolvanen, Frederic Weisbecker, Kevin Hao, Jason Yan,
	Andrey Ignatov, Peter Collingbourne, Julien Grall, Tian Tao,
	Qais Yousef, Jens Axboe, linux-arm-kernel, linux-kernel

From: Ryan Houdek <Sonicadvance1@gmail.com>

Sorry about the noise. I obviously don't work in this ecosystem.
Didn't get any comments previously, so I'm resending.

The problem:
We need to support 32-bit processes running under a userspace
compatibility layer. The compatibility layer is an AArch64 process.
This means exposing the 32-bit compatibility syscalls to userspace.

Why do we need compatibility layers?
There are ARMv8 CPUs that only support AArch64 but still need to run
AArch32 applications.
Cortex-A34/R82 and other cores are prime examples of this.
Additionally, if a user needs to run legacy 32-bit x86 software, the
same kind of compatibility layer is needed.

Who does this matter to?
Any user that has a specific need to run legacy 32-bit software under a
compatibility layer.
Not all software is open source or easy to convert to 64-bit; it's
something we need to live with.
Professional software and the gaming ecosystem are rife with this.

What applications have tried to work around this problem?
FEX emulator (1) - Userspace x86 to AArch64 compatibility layer
Tango binary translator (2) - AArch32 to AArch64 compatibility layer
QEMU (3) - Not really, but it does some userspace ioctl emulation

What problems did they hit?
FEX and Tango hit problems with emulating memory-related syscalls.
- Emulating 32-bit mmap, mremap, and shmat in userspace changes
behaviour (a sketch of the problem follows below)
All three hit issues with ioctl emulation.
- ioctls are free to do what they want, including allocating memory and
returning opaque structures containing pointers.
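
To make the mmap point concrete, here is a minimal userspace sketch
(illustration only, not part of this patch; the helper name is made up)
of what a translator is forced to do today:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Allocate anonymous memory on behalf of a 32-bit guest.  Without kernel
 * help, the 64-bit host process can only detect after the fact that the
 * kernel picked an address the guest cannot represent. */
static void *guest_mmap(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p != MAP_FAILED &&
	    (unsigned long long)(uintptr_t)p + len > (1ULL << 32)) {
		/* Outside the guest's 32-bit address space: hand it back
		 * and fall back to carving from a pre-reserved low arena. */
		munmap(p, len);
		return MAP_FAILED;
	}
	return p;
}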

With this patch we expose the compatibility syscall table through the
regular SVC syscall API. There is prior art here: on x86-64 the
compatibility tables are also exposed to userspace.
The justification for this is that we need to maintain support for
32-bit application compatibility into the foreseeable future.
Userspace does almost all of the heavy lifting here, especially when the
hardware no longer supports the 32-bit use case.

A couple of caveats to this approach:
Userspace must know that this doesn't solve structure alignment problems
for the x86->AArch64 (1) case.
The ABI changes from passing the syscall number in x8 to passing it in
x7, matching AArch32 syscall semantics, with the SVC immediate selecting
the compat table.
This is now exposing the compat syscalls to userspace, but for the sake
of userspace compatibility it is a necessary evil.
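
For clarity, a rough sketch of how a translator would invoke a compat
syscall under the proposed ABI (illustration only; it assumes the svc #1
form used by this patch and AArch32 syscall numbers, e.g. write is 4 on
arm):

/* Syscall number in x7, up to six arguments in x0-x5, "svc #1" selects
 * the compat dispatch table, and the result comes back in x0. */
static long compat_syscall3(long nr, long a, long b, long c)
{
	register long x7 asm("x7") = nr;
	register long x0 asm("x0") = a;
	register long x1 asm("x1") = b;
	register long x2 asm("x2") = c;

	asm volatile("svc #1"
		     : "+r" (x0)
		     : "r" (x7), "r" (x1), "r" (x2)
		     : "memory");
	return x0;
}

/* e.g. compat_syscall3(4, 1, buf, len) is a 32-bit write(), provided
 * buf points below 4GiB. */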

Why does the upstream kernel care?
I believe every user wants to have their software ecosystem continue
working in a mixed AArch32/AArch64 world, even when they are running
AArch64-only hardware. The kernel should facilitate a good user
experience.

External Resources
(1) https://github.com/FEX-Emu/FEX
(2) https://www.amanieusystems.com/
(3) https://www.qemu.org/

Further reading
- https://github.com/FEX-Emu/FEX/wiki/32Bit-x86-Woes
- Original patch: https://github.com/Amanieu/linux/commit/b4783002afb0

Changes in v2:
- Removed a tangential code path to make this more concise
  - Now doesn't cover Tango's full use case
  - This is purely for conciseness' sake; easy enough to add back
- Cleaned up commit message
Signed-off-by: Ryan Houdek <Sonicadvance1@gmail.com>
---
 arch/arm64/Kconfig                   |   9 +
 arch/arm64/include/asm/compat.h      |  20 +++
 arch/arm64/include/asm/exception.h   |   2 +-
 arch/arm64/include/asm/mmu.h         |   7 +
 arch/arm64/include/asm/pgtable.h     |  10 ++
 arch/arm64/include/asm/processor.h   |   6 +-
 arch/arm64/include/asm/thread_info.h |   7 +
 arch/arm64/kernel/asm-offsets.c      |   3 +
 arch/arm64/kernel/entry-common.c     |   9 +-
 arch/arm64/kernel/fpsimd.c           |   2 +-
 arch/arm64/kernel/hw_breakpoint.c    |   2 +-
 arch/arm64/kernel/perf_regs.c        |   2 +-
 arch/arm64/kernel/process.c          |  13 +-
 arch/arm64/kernel/ptrace.c           |   6 +-
 arch/arm64/kernel/signal.c           |   2 +-
 arch/arm64/kernel/syscall.c          |  41 ++++-
 arch/arm64/mm/mmap.c                 | 249 +++++++++++++++++++++++++++
 17 files changed, 369 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1515f6f153a0..9832f05daaee 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1147,6 +1147,15 @@ config XEN
 	help
 	  Say Y if you want to run Linux in a Virtual Machine on Xen on ARM64.
 
+config ARM_COMPAT_DISPATCH
+	bool "32bit syscall dispatch table"
+	depends on COMPAT && ARM64
+	default y
+	help
+	  Kernel support for exposing the 32-bit syscall dispatch table to
+	  userspace.
+	  For dynamically translating 32-bit applications to a 64-bit process.
+
 config FORCE_MAX_ZONEORDER
 	int
 	default "14" if (ARM64_64K_PAGES && TRANSPARENT_HUGEPAGE)
diff --git a/arch/arm64/include/asm/compat.h b/arch/arm64/include/asm/compat.h
index 23a9fb73c04f..d00c6f427999 100644
--- a/arch/arm64/include/asm/compat.h
+++ b/arch/arm64/include/asm/compat.h
@@ -180,10 +180,30 @@ struct compat_shmid64_ds {
 
 static inline int is_compat_task(void)
 {
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+	/* It is compatible if Tango, 32bit compat, or 32bit thread */
+	return current_thread_info()->compat_syscall_flags != 0 || test_thread_flag(TIF_32BIT);
+#else
 	return test_thread_flag(TIF_32BIT);
+#endif
 }
 
 static inline int is_compat_thread(struct thread_info *thread)
+{
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+	/* It is compatible if Tango, 32bit compat, or 32bit thread */
+	return thread->compat_syscall_flags != 0 || test_ti_thread_flag(thread, TIF_32BIT);
+#else
+	return test_ti_thread_flag(thread, TIF_32BIT);
+#endif
+}
+
+static inline int is_aarch32_compat_task(void)
+{
+	return test_thread_flag(TIF_32BIT);
+}
+
+static inline int is_aarch32_compat_thread(struct thread_info *thread)
 {
 	return test_ti_thread_flag(thread, TIF_32BIT);
 }
diff --git a/arch/arm64/include/asm/exception.h b/arch/arm64/include/asm/exception.h
index 99b9383cd036..f2c94b44b51c 100644
--- a/arch/arm64/include/asm/exception.h
+++ b/arch/arm64/include/asm/exception.h
@@ -45,7 +45,7 @@ void do_sysinstr(unsigned int esr, struct pt_regs *regs);
 void do_sp_pc_abort(unsigned long addr, unsigned int esr, struct pt_regs *regs);
 void bad_el0_sync(struct pt_regs *regs, int reason, unsigned int esr);
 void do_cp15instr(unsigned int esr, struct pt_regs *regs);
-void do_el0_svc(struct pt_regs *regs);
+void do_el0_svc(struct pt_regs *regs, unsigned int iss);
 void do_el0_svc_compat(struct pt_regs *regs);
 void do_ptrauth_fault(struct pt_regs *regs, unsigned int esr);
 #endif	/* __ASM_EXCEPTION_H */
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index b2e91c187e2a..0744db65c0a9 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -27,6 +27,9 @@ typedef struct {
 	refcount_t	pinned;
 	void		*vdso;
 	unsigned long	flags;
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+	unsigned long	compat_mmap_base;
+#endif
 } mm_context_t;
 
 /*
@@ -79,6 +82,10 @@ extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
 extern void mark_linear_text_alias_ro(void);
 extern bool kaslr_requires_kpti(void);
 
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+extern void process_init_compat_mmap(void);
+#endif
+
 #define INIT_MM_CONTEXT(name)	\
 	.pgd = init_pg_dir,
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 4ff12a7adcfd..5e7662c2675c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -974,6 +974,16 @@ static inline bool arch_faults_on_old_pte(void)
 }
 #define arch_faults_on_old_pte arch_faults_on_old_pte
 
+/*
+ * We provide our own arch_get_unmapped_area to handle 32-bit mmap calls from
+ * tango.
+ */
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+#define HAVE_ARCH_UNMAPPED_AREA
+#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
+#define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index fce8cbecd6bc..03c05cd19f87 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -175,7 +175,7 @@ static inline void arch_thread_struct_whitelist(unsigned long *offset,
 #define task_user_tls(t)						\
 ({									\
 	unsigned long *__tls;						\
-	if (is_compat_thread(task_thread_info(t)))			\
+	if (is_aarch32_compat_thread(task_thread_info(t)))			\
 		__tls = &(t)->thread.uw.tp2_value;			\
 	else								\
 		__tls = &(t)->thread.uw.tp_value;			\
@@ -256,8 +256,8 @@ extern struct task_struct *cpu_switch_to(struct task_struct *prev,
 #define task_pt_regs(p) \
 	((struct pt_regs *)(THREAD_SIZE + task_stack_page(p)) - 1)
 
-#define KSTK_EIP(tsk)	((unsigned long)task_pt_regs(tsk)->pc)
-#define KSTK_ESP(tsk)	user_stack_pointer(task_pt_regs(tsk))
+#define KSTK_EIP(tsk)  ((unsigned long)task_pt_regs(tsk)->pc)
+#define KSTK_ESP(tsk)  user_stack_pointer(task_pt_regs(tsk))
 
 /*
  * Prefetching support
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 1fbab854a51b..cb04c7c4df38 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -41,6 +41,9 @@ struct thread_info {
 #endif
 		} preempt;
 	};
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+	int			compat_syscall_flags;	/* 32-bit compat syscall */
+#endif
 #ifdef CONFIG_SHADOW_CALL_STACK
 	void			*scs_base;
 	void			*scs_sp;
@@ -107,6 +110,10 @@ void arch_release_task_struct(struct task_struct *tsk);
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
 				 _TIF_SYSCALL_EMU)
 
+#define TIF_COMPAT_32BITSYSCALL 0 /* Trivial 32bit compatible syscall */
+
+#define _TIF_COMPAT_32BITSYSCALL (1 << TIF_COMPAT_32BITSYSCALL)
+
 #ifdef CONFIG_SHADOW_CALL_STACK
 #define INIT_SCS							\
 	.scs_base	= init_shadow_call_stack,			\
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 7d32fc959b1a..742203cff128 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -34,6 +34,9 @@ int main(void)
 #ifdef CONFIG_ARM64_SW_TTBR0_PAN
   DEFINE(TSK_TI_TTBR0,		offsetof(struct task_struct, thread_info.ttbr0));
 #endif
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+  DEFINE(TI_COMPAT_SYSCALL,	offsetof(struct task_struct, thread_info.compat_syscall_flags));
+#endif
 #ifdef CONFIG_SHADOW_CALL_STACK
   DEFINE(TSK_TI_SCS_BASE,	offsetof(struct task_struct, thread_info.scs_base));
   DEFINE(TSK_TI_SCS_SP,		offsetof(struct task_struct, thread_info.scs_sp));
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 43d4c329775f..6d98a9c6fafd 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -228,12 +228,12 @@ static void notrace el0_dbg(struct pt_regs *regs, unsigned long esr)
 }
 NOKPROBE_SYMBOL(el0_dbg);
 
-static void notrace el0_svc(struct pt_regs *regs)
+static void notrace el0_svc(struct pt_regs *regs, unsigned int iss)
 {
 	if (system_uses_irq_prio_masking())
 		gic_write_pmr(GIC_PRIO_IRQON | GIC_PRIO_PSR_I_SET);
 
-	do_el0_svc(regs);
+	do_el0_svc(regs, iss);
 }
 NOKPROBE_SYMBOL(el0_svc);
 
@@ -251,7 +251,10 @@ asmlinkage void notrace el0_sync_handler(struct pt_regs *regs)
 
 	switch (ESR_ELx_EC(esr)) {
 	case ESR_ELx_EC_SVC64:
-		el0_svc(regs);
+		/* Redundant masking here to show we are getting ISS mask
+		 * Then we are pulling the imm16 out of it for SVC64
+		 */
+		el0_svc(regs, (esr & ESR_ELx_ISS_MASK) & 0xffff);
 		break;
 	case ESR_ELx_EC_DABT_LOW:
 		el0_da(regs, esr);
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 062b21f30f94..a35ab449a466 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -937,7 +937,7 @@ void fpsimd_release_task(struct task_struct *dead_task)
 void do_sve_acc(unsigned int esr, struct pt_regs *regs)
 {
 	/* Even if we chose not to use SVE, the hardware could still trap: */
-	if (unlikely(!system_supports_sve()) || WARN_ON(is_compat_task())) {
+	if (unlikely(!system_supports_sve()) || WARN_ON(is_aarch32_compat_task())) {
 		force_signal_inject(SIGILL, ILL_ILLOPC, regs->pc, 0);
 		return;
 	}
diff --git a/arch/arm64/kernel/hw_breakpoint.c b/arch/arm64/kernel/hw_breakpoint.c
index 712e97c03e54..37c9349c4999 100644
--- a/arch/arm64/kernel/hw_breakpoint.c
+++ b/arch/arm64/kernel/hw_breakpoint.c
@@ -168,7 +168,7 @@ static int is_compat_bp(struct perf_event *bp)
 	 * deprecated behaviour if we use unaligned watchpoints in
 	 * AArch64 state.
 	 */
-	return tsk && is_compat_thread(task_thread_info(tsk));
+	return tsk && is_aarch32_compat_thread(task_thread_info(tsk));
 }
 
 /**
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index f6f58e6265df..c4b061f0d182 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -66,7 +66,7 @@ int perf_reg_validate(u64 mask)
 
 u64 perf_reg_abi(struct task_struct *task)
 {
-	if (is_compat_thread(task_thread_info(task)))
+	if (is_aarch32_compat_thread(task_thread_info(task)))
 		return PERF_SAMPLE_REGS_ABI_32;
 	else
 		return PERF_SAMPLE_REGS_ABI_64;
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index a47a40ec6ad9..9c0775babbd0 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -314,7 +314,7 @@ static void tls_thread_flush(void)
 {
 	write_sysreg(0, tpidr_el0);
 
-	if (is_compat_task()) {
+	if (is_aarch32_compat_task()) {
 		current->thread.uw.tp_value = 0;
 
 		/*
@@ -409,7 +409,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 		*task_user_tls(p) = read_sysreg(tpidr_el0);
 
 		if (stack_start) {
-			if (is_compat_thread(task_thread_info(p)))
+			if (is_aarch32_compat_thread(task_thread_info(p)))
 				childregs->compat_sp = stack_start;
 			else
 				childregs->sp = stack_start;
@@ -453,7 +453,7 @@ static void tls_thread_switch(struct task_struct *next)
 {
 	tls_preserve_current_state();
 
-	if (is_compat_thread(task_thread_info(next)))
+	if (is_aarch32_compat_thread(task_thread_info(next)))
 		write_sysreg(next->thread.uw.tp_value, tpidrro_el0);
 	else if (!arm64_kernel_unmapped_at_el0())
 		write_sysreg(0, tpidrro_el0);
@@ -619,7 +619,12 @@ unsigned long arch_align_stack(unsigned long sp)
  */
 void arch_setup_new_exec(void)
 {
-	current->mm->context.flags = is_compat_task() ? MMCF_AARCH32 : 0;
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+	process_init_compat_mmap();
+	current_thread_info()->compat_syscall_flags = 0;
+#endif
+
+	current->mm->context.flags = is_aarch32_compat_task() ? MMCF_AARCH32 : 0;
 
 	ptrauth_thread_init_user(current);
 
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index f49b349e16a3..2e3c242941d1 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -175,7 +175,7 @@ static void ptrace_hbptriggered(struct perf_event *bp,
 	const char *desc = "Hardware breakpoint trap (ptrace)";
 
 #ifdef CONFIG_COMPAT
-	if (is_compat_task()) {
+	if (is_aarch32_compat_task()) {
 		int si_errno = 0;
 		int i;
 
@@ -1725,7 +1725,7 @@ const struct user_regset_view *task_user_regset_view(struct task_struct *task)
 	 */
 	if (is_compat_task())
 		return &user_aarch32_view;
-	else if (is_compat_thread(task_thread_info(task)))
+	else if (is_aarch32_compat_thread(task_thread_info(task)))
 		return &user_aarch32_ptrace_view;
 #endif
 	return &user_aarch64_view;
@@ -1906,7 +1906,7 @@ int valid_user_regs(struct user_pt_regs *regs, struct task_struct *task)
 	/* https://lore.kernel.org/lkml/20191118131525.GA4180@willie-the-truck */
 	user_regs_reset_single_step(regs, task);
 
-	if (is_compat_thread(task_thread_info(task)))
+	if (is_aarch32_compat_thread(task_thread_info(task)))
 		return valid_compat_regs(regs);
 	else
 		return valid_native_regs(regs);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index a8184cad8890..e6462b32effa 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -813,7 +813,7 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 	/*
 	 * Set up the stack frame
 	 */
-	if (is_compat_task()) {
+	if (is_aarch32_compat_task()) {
 		if (ksig->ka.sa.sa_flags & SA_SIGINFO)
 			ret = compat_setup_rt_frame(usig, ksig, oldset, regs);
 		else
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index e4c0dadf0d92..6857dad5df8e 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -21,7 +21,7 @@ static long do_ni_syscall(struct pt_regs *regs, int scno)
 {
 #ifdef CONFIG_COMPAT
 	long ret;
-	if (is_compat_task()) {
+	if (is_aarch32_compat_task()) {
 		ret = compat_arm_syscall(regs, scno);
 		if (ret != -ENOSYS)
 			return ret;
@@ -167,6 +167,9 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
 		local_daif_mask();
 		flags = current_thread_info()->flags;
 		if (!has_syscall_work(flags) && !(flags & _TIF_SINGLESTEP)) {
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+			current_thread_info()->compat_syscall_flags = 0;
+#endif
 			/*
 			 * We're off to userspace, where interrupts are
 			 * always enabled after we restore the flags from
@@ -180,6 +183,9 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
 
 trace_exit:
 	syscall_trace_exit(regs);
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+	current_thread_info()->compat_syscall_flags = 0;
+#endif
 }
 
 static inline void sve_user_discard(void)
@@ -199,10 +205,39 @@ static inline void sve_user_discard(void)
 	sve_user_disable();
 }
 
-void do_el0_svc(struct pt_regs *regs)
+void do_el0_svc(struct pt_regs *regs, unsigned int iss)
 {
 	sve_user_discard();
-	el0_svc_common(regs, regs->regs[8], __NR_syscalls, sys_call_table);
+	/* XXX: Which style is more ideal to take here? */
+#if 0
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+	/* Hardcode syscall 0x8000'0000 to be a 32bit support syscall */
+	if (regs->regs[8] == 0x80000000) {
+		current_thread_info()->compat_syscall_flags = _TIF_COMPAT_32BITSYSCALL;
+		el0_svc_common(regs, regs->regs[7], __NR_compat_syscalls,
+			       compat_sys_call_table);
+
+	} else
+#endif
+		el0_svc_common(regs, regs->regs[8], __NR_syscalls, sys_call_table);
+#else
+	switch (iss) {
+	/* SVC #1 is now a 32bit support syscall
+	 * Any other SVC ISS falls down the regular syscall code path
+	 */
+	case 1:
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+		current_thread_info()->compat_syscall_flags = _TIF_COMPAT_32BITSYSCALL;
+		el0_svc_common(regs, regs->regs[7], __NR_compat_syscalls,
+			       compat_sys_call_table);
+#else
+		return -ENOSYS;
+#endif
+		break;
+	default:
+		el0_svc_common(regs, regs->regs[8], __NR_syscalls, sys_call_table);
+	}
+#endif
 }
 
 #ifdef CONFIG_COMPAT
diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
index 3028bacbc4e9..857aa03a3ac2 100644
--- a/arch/arm64/mm/mmap.c
+++ b/arch/arm64/mm/mmap.c
@@ -17,6 +17,8 @@
 #include <linux/io.h>
 #include <linux/personality.h>
 #include <linux/random.h>
+#include <linux/security.h>
+#include <linux/hugetlb.h>
 
 #include <asm/cputype.h>
 
@@ -68,3 +70,250 @@ int devmem_is_allowed(unsigned long pfn)
 }
 
 #endif
+
+#ifdef CONFIG_ARM_COMPAT_DISPATCH
+
+/* Definitions for compat syscall guest mmap area */
+#define COMPAT_MIN_GAP			(SZ_128M)
+#define COMPAT_STACK_TOP		0xffff0000
+#define COMPAT_MAX_GAP			(COMPAT_STACK_TOP/6*5)
+#define COMPAT_TASK_UNMAPPED_BASE	PAGE_ALIGN(TASK_SIZE_32 / 4)
+#define COMPAT_STACK_RND_MASK		(0x7ff >> (PAGE_SHIFT - 12))
+
+#ifndef arch_get_mmap_end
+#define arch_get_mmap_end(addr)	(TASK_SIZE)
+#endif
+
+#ifndef arch_get_mmap_base
+#define arch_get_mmap_base(addr, base) (base)
+#endif
+
+static int mmap_is_legacy(unsigned long rlim_stack)
+{
+	if (current->personality & ADDR_COMPAT_LAYOUT)
+		return 1;
+
+	if (rlim_stack == RLIM_INFINITY)
+		return 1;
+
+	return sysctl_legacy_va_layout;
+}
+
+static unsigned long compat_mmap_base(unsigned long rnd, unsigned long gap)
+{
+	unsigned long pad = stack_guard_gap;
+
+	/* Account for stack randomization if necessary */
+	if (current->flags & PF_RANDOMIZE)
+		pad += (COMPAT_STACK_RND_MASK << PAGE_SHIFT);
+
+	/* Values close to RLIM_INFINITY can overflow. */
+	if (gap + pad > gap)
+		gap += pad;
+
+	if (gap < COMPAT_MIN_GAP)
+		gap = COMPAT_MIN_GAP;
+	else if (gap > COMPAT_MAX_GAP)
+		gap = COMPAT_MAX_GAP;
+
+	return PAGE_ALIGN(COMPAT_STACK_TOP - gap - rnd);
+}
+
+void process_init_compat_mmap(void)
+{
+	unsigned long random_factor = 0UL;
+	unsigned long rlim_stack = rlimit(RLIMIT_STACK);
+
+	if (current->flags & PF_RANDOMIZE) {
+		random_factor = (get_random_long() &
+			((1UL << mmap_rnd_compat_bits) - 1)) << PAGE_SHIFT;
+	}
+
+	if (mmap_is_legacy(rlim_stack)) {
+		current->mm->context.compat_mmap_base =
+			COMPAT_TASK_UNMAPPED_BASE + random_factor;
+	} else {
+		current->mm->context.compat_mmap_base =
+			compat_mmap_base(random_factor, rlim_stack);
+	}
+}
+
+/* Get an address range which is currently unmapped.
+ * For shmat() with addr=0.
+ *
+ * Ugly calling convention alert:
+ * Return value with the low bits set means error value,
+ * ie
+ *	if (ret & ~PAGE_MASK)
+ *		error = ret;
+ *
+ * This function "knows" that -ENOMEM has the bits set.
+ */
+unsigned long
+arch_get_unmapped_area(struct file *filp, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma, *prev;
+	struct vm_unmapped_area_info info;
+	const unsigned long mmap_end = arch_get_mmap_end(addr);
+	bool bad_addr = false;
+
+	if (len > mmap_end - mmap_min_addr)
+		return -ENOMEM;
+
+	/*
+	 * Ensure that translated processes do not allocate the last
+	 * page of the 32-bit address space, or anything above it.
+	 */
+	if (is_compat_task())
+		bad_addr = addr + len > TASK_SIZE_32;
+
+	if (flags & MAP_FIXED)
+		return bad_addr ? -ENOMEM : addr;
+
+	if (addr && !bad_addr) {
+		addr = PAGE_ALIGN(addr);
+		vma = find_vma_prev(mm, addr, &prev);
+		if (mmap_end - len >= addr && addr >= mmap_min_addr &&
+		    (!vma || addr + len <= vm_start_gap(vma)) &&
+		    (!prev || addr >= vm_end_gap(prev)))
+			return addr;
+	}
+
+	info.flags = 0;
+	info.length = len;
+	if (is_compat_task()) {
+		info.low_limit = mm->context.compat_mmap_base;
+		info.high_limit = TASK_SIZE_32;
+	} else {
+		info.low_limit = mm->mmap_base;
+		info.high_limit = mmap_end;
+	}
+	info.align_mask = 0;
+	return vm_unmapped_area(&info);
+}
+
+/*
+ * This mmap-allocator allocates new areas top-down from below the
+ * stack's low limit (the base):
+ */
+unsigned long
+arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
+			  unsigned long len, unsigned long pgoff,
+			  unsigned long flags)
+{
+
+	struct vm_area_struct *vma, *prev;
+	struct mm_struct *mm = current->mm;
+	struct vm_unmapped_area_info info;
+	const unsigned long mmap_end = arch_get_mmap_end(addr);
+	bool bad_addr = false;
+
+	/* requested length too big for entire address space */
+	if (len > mmap_end - mmap_min_addr)
+		return -ENOMEM;
+
+	/*
+	 * Ensure that translated processes do not allocate the last
+	 * page of the 32-bit address space, or anything above it.
+	 */
+	if (is_compat_task())
+		bad_addr = addr + len > TASK_SIZE_32;
+
+	if (flags & MAP_FIXED)
+		return bad_addr ? -ENOMEM : addr;
+
+	/* requesting a specific address */
+	if (addr && !bad_addr) {
+		addr = PAGE_ALIGN(addr);
+		vma = find_vma_prev(mm, addr, &prev);
+		if (mmap_end - len >= addr && addr >= mmap_min_addr &&
+				(!vma || addr + len <= vm_start_gap(vma)) &&
+				(!prev || addr >= vm_end_gap(prev)))
+			return addr;
+	}
+
+	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+	info.length = len;
+	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
+	if (is_compat_task())
+		info.high_limit = mm->context.compat_mmap_base;
+	else
+		info.high_limit = arch_get_mmap_base(addr, mm->mmap_base);
+	info.align_mask = 0;
+	addr = vm_unmapped_area(&info);
+
+	/*
+	 * A failed mmap() very likely causes application failure,
+	 * so fall back to the bottom-up function here. This scenario
+	 * can happen with large stack limits and large mmap()
+	 * allocations.
+	 */
+	if (offset_in_page(addr)) {
+		VM_BUG_ON(addr != -ENOMEM);
+		info.flags = 0;
+		if (is_compat_task()) {
+			info.low_limit = COMPAT_TASK_UNMAPPED_BASE;
+			info.high_limit = TASK_SIZE_32;
+		} else {
+			info.low_limit = TASK_UNMAPPED_BASE;
+			info.high_limit = mmap_end;
+		}
+		addr = vm_unmapped_area(&info);
+	}
+
+	return addr;
+}
+
+unsigned long
+hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	struct hstate *h = hstate_file(file);
+	struct vm_unmapped_area_info info;
+	bool bad_addr = false;
+
+	if (len & ~huge_page_mask(h))
+		return -EINVAL;
+	if (len > TASK_SIZE)
+		return -ENOMEM;
+
+	/*
+	 * Ensure that translated processes do not allocate the last
+	 * page of the 32-bit address space, or anything above it.
+	 */
+	if (is_compat_task())
+		bad_addr = addr + len > TASK_SIZE_32;
+
+	if (flags & MAP_FIXED) {
+		if (prepare_hugepage_range(file, addr, len))
+			return -EINVAL;
+		return bad_addr ? -ENOMEM : addr;
+	}
+
+	if (addr && !bad_addr) {
+		addr = ALIGN(addr, huge_page_size(h));
+		vma = find_vma(mm, addr);
+		if (TASK_SIZE - len >= addr &&
+		    (!vma || addr + len <= vm_start_gap(vma)))
+			return addr;
+	}
+
+	info.flags = 0;
+	info.length = len;
+	if (is_compat_task()) {
+		info.low_limit = COMPAT_TASK_UNMAPPED_BASE;
+		info.high_limit = TASK_SIZE_32;
+	} else {
+		info.low_limit = TASK_UNMAPPED_BASE;
+		info.high_limit = TASK_SIZE;
+	}
+	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
+	info.align_offset = 0;
+	return vm_unmapped_area(&info);
+}
+
+#endif
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls
  2021-02-11 20:21 [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls sonicadvance1
@ 2021-02-12 11:30 ` Steven Price
  2021-02-12 12:35   ` Mark Brown
  2021-02-12 13:59   ` Arnd Bergmann
  2021-02-12 14:13 ` Mark Rutland
  1 sibling, 2 replies; 12+ messages in thread
From: Steven Price @ 2021-02-12 11:30 UTC (permalink / raw)
  To: sonicadvance1
  Cc: amanieu, Catalin Marinas, Will Deacon, Mark Rutland,
	Oleg Nesterov, Al Viro, Dave Martin, Amit Daniel Kachhap,
	Mark Brown, Marc Zyngier, David Brazdil, Jean-Philippe Brucker,
	Andrew Morton, Anshuman Khandual, Gavin Shan, Mike Rapoport,
	Vincenzo Frascino, Kristina Martsenko, Kees Cook, Sami Tolvanen,
	Frederic Weisbecker, Kevin Hao, Jason Yan, Andrey Ignatov,
	Peter Collingbourne, Julien Grall, Tian Tao, Qais Yousef,
	Jens Axboe, linux-arm-kernel, linux-kernel

On 11/02/2021 20:21, sonicadvance1@gmail.com wrote:
> From: Ryan Houdek <Sonicadvance1@gmail.com>
> 
> Sorry about the noise. I obviously don't work in this ecosystem.
> Didn't get any comments previously so I'm resending

We're just coming up to a merge window, so I expect people are fairly 
busy at the moment. Also, from a reviewability perspective, I think you 
need to split this up into several patches with logical changes; as it 
stands, the actual code changes are hard to review.

> The problem:
> We need to support 32-bit processes running under a userspace
> compatibility layer. The compatibility layer is an AArch64 process.
> This means exposing the 32-bit compatibility syscalls to userspace.

I'm not sure how you come to this conclusion. Running 32-bit processes 
under a compatibility layer is a fine goal, but it's not clear why the 
entire 32-bit compat syscall layer is needed for this.

As a case in point, QEMU's user mode emulation already achieves this in 
many cases without any changes to the kernel.

> Why do we need compatibility layers?
> There are ARMv8 CPUs that only support AArch64 but still need to run
> AArch32 applications.
> Cortex-A34/R82 and other cores are prime examples of this.
> Additionally, if a user needs to run legacy 32-bit x86 software, the
> same kind of compatibility layer is needed.

Unless I'm much mistaken QEMU's user mode already does this - admittedly 
I don't tend to run "legacy 32-bit x86 software".

> Who does this matter to?
> Any user that has a specific need to run legacy 32-bit software under a
> compatibility layer.
> Not all software is open source or easy to convert to 64-bit; it's
> something we need to live with.
> Professional software and the gaming ecosystem are rife with this.
> 
> What applications have tried to work around this problem?
> FEX emulator (1) - Userspace x86 to AArch64 compatibility layer
> Tango binary translator (2) - AArch32 to AArch64 compatibility layer
> QEMU (3) - Not really, but it does some userspace ioctl emulation

Can you expand on "not really"? Clearly there are limitations, but in 
general I can happily "chroot" into a distro filesystem of an otherwise 
incompatible architecture using a qemu-xxx-static binary.

> What problems did they hit?
> FEX and Tango hit problems with emulating memory related syscalls.
> - Emulating 32-bit mmap, mremap, shmat in userspace changes behaviour
> All three hit issues with ioctl emulation
> - ioctls are free to do what they want including allocating memory and
> returning opaque structures with pointers.

Now I think we're getting to what the actual problems are:

  * mmap and friends have no (easy) way of forcing a mapping into a 
32-bit region.
  * ioctls are a mess

The first seems like a reasonable goal - I've seen examples of MAP_32BIT 
being (ab)used to do this, but it actually restricts to 31 bits and it's 
not even available on arm64. Here I think you'd be better off focusing 
on coming up with a new (generic) way of restricting the addresses that 
the kernel will pick.
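
For illustration, the MAP_32BIT workaround amounts to something like the 
sketch below (the flag is x86-64 specific, and as noted it really limits 
you to the low 2GiB rather than the full 32 bits):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void *low32_alloc(size_t len)
{
#ifdef MAP_32BIT
	/* x86-64 only: ask the kernel to place the mapping in the low 2GiB. */
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
#else
	/* Nothing equivalent on arm64, hence the need for something generic. */
	return MAP_FAILED;
#endif
}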

ioctls are going to be a problem whatever you do, and I don't think 
there is much option other than having a list of known ioctls and 
translating them in user space - see below.

> With this patch we will be exposing the compatibility syscall table
> through the regular syscall svc API. There is prior art here where on
> x86-64 they also expose the compatibility tables.
> The justification for this is that we need to maintain support for 32-bit
> application compatibility going into the foreseeable future.
> Userspace does almost all of the heavy lifting here, especially when the
> hardware no longer supports the 32bit use case.
> 
> A couple of caveats to this approach.
> Userspace must know that this doesn't solve structure alignment problems
> for the x86->AArch64 (1) case.
> The API for this changes from syscall number in x8 to x7 to match
> AArch32 syscall semantics

This is where the argument for exposing compat falls down - for one of 
the main use cases (x86->aarch64) you still need to do a load of fixups 
in user space due to the differing alignment/semantics of the 
architectures. It's not clear to me why you can't just convert the 
arguments to the full 64-bit native ioctls at the same time. You are 
already going to have to have an allow-list of ioctls that are handled 
because any unknown ioctl is likely to blow up in strange ways due to 
the likes of structure alignment differences.
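
As a sketch of the kind of fixup I mean (illustrative struct, not taken 
from the patch): a 64-bit member is only 4-byte aligned on i386 but 
8-byte aligned on arm64, so the same ioctl argument struct has a 
different size and layout and needs repacking whichever syscall table 
ends up servicing it.

#include <stdint.h>

/* sizeof == 12 with 'offset' at byte 4 on i386, versus sizeof == 16 with
 * 'offset' at byte 8 on arm64 (and on AArch32 EABI). */
struct example_ioctl_arg {
	uint32_t flags;
	uint64_t offset;
};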

> This is now exposing the compat syscalls to userspace, but for the sake
> of userspace compatibility it is a necessary evil.

You've yet to convince me that it's "necessary" - I agree on the "evil" 
part ;)

> Why does the upstream kernel care?
> I believe every user wants to have their software ecosystem continue
> working if they are in a mixed AArch32/AArch64 world even when they are
> running AArch64 only hardware. The kernel should facilitate a good user
> experience.

I fully agree on the goal - but I think you need more justification for 
the approach you are taking.

Steve


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls
  2021-02-12 11:30 ` Steven Price
@ 2021-02-12 12:35   ` Mark Brown
  2021-02-12 13:28     ` Catalin Marinas
  2021-02-12 13:59   ` Arnd Bergmann
  1 sibling, 1 reply; 12+ messages in thread
From: Mark Brown @ 2021-02-12 12:35 UTC (permalink / raw)
  To: Steven Price
  Cc: sonicadvance1, amanieu, Catalin Marinas, Will Deacon,
	Mark Rutland, Oleg Nesterov, Al Viro, Dave Martin,
	Amit Daniel Kachhap, Marc Zyngier, David Brazdil,
	Jean-Philippe Brucker, Andrew Morton, Anshuman Khandual,
	Gavin Shan, Mike Rapoport, Vincenzo Frascino, Kristina Martsenko,
	Kees Cook, Sami Tolvanen, Frederic Weisbecker, Kevin Hao,
	Jason Yan, Andrey Ignatov, Peter Collingbourne, Julien Grall,
	Tian Tao, Qais Yousef, Jens Axboe, linux-arm-kernel,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 783 bytes --]

On Fri, Feb 12, 2021 at 11:30:41AM +0000, Steven Price wrote:
> On 11/02/2021 20:21, sonicadvance1@gmail.com wrote:

> > Why do we need compatibility layers?
> > There are ARMv8 CPUs that only support AArch64 but still need to run
> > AArch32 applications.
> > Cortex-A34/R82 and other cores are prime examples of this.
> > Additionally if a user is needing to run legacy 32-bit x86 software, it
> > needs the same compatibility layer.

> Unless I'm much mistaken QEMU's user mode already does this - admittedly I
> don't tend to run "legacy 32-bit x86 software".

Yes, this has been deployed on Debian for a long time - you can install
any combination of Debian architectures on a single system and it will
use qemu to run binaries that can't be supported natively by the
hardware.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls
  2021-02-12 12:35   ` Mark Brown
@ 2021-02-12 13:28     ` Catalin Marinas
  2021-02-12 14:12       ` David Laight
  2021-02-12 16:24       ` Amanieu d'Antras
  0 siblings, 2 replies; 12+ messages in thread
From: Catalin Marinas @ 2021-02-12 13:28 UTC (permalink / raw)
  To: Mark Brown
  Cc: Steven Price, sonicadvance1, amanieu, Will Deacon, Mark Rutland,
	Oleg Nesterov, Al Viro, Dave Martin, Amit Daniel Kachhap,
	Marc Zyngier, David Brazdil, Jean-Philippe Brucker,
	Andrew Morton, Anshuman Khandual, Gavin Shan, Mike Rapoport,
	Vincenzo Frascino, Kristina Martsenko, Kees Cook, Sami Tolvanen,
	Frederic Weisbecker, Kevin Hao, Jason Yan, Andrey Ignatov,
	Peter Collingbourne, Julien Grall, Tian Tao, Qais Yousef,
	Jens Axboe, linux-arm-kernel, linux-kernel

On Fri, Feb 12, 2021 at 12:35:15PM +0000, Mark Brown wrote:
> On Fri, Feb 12, 2021 at 11:30:41AM +0000, Steven Price wrote:
> > On 11/02/2021 20:21, sonicadvance1@gmail.com wrote:
> > > Why do we need compatibility layers?
> > > There are ARMv8 CPUs that only support AArch64 but still need to run
> > > AArch32 applications.
> > > Cortex-A34/R82 and other cores are prime examples of this.
> > > Additionally if a user is needing to run legacy 32-bit x86 software, it
> > > needs the same compatibility layer.
> 
> > Unless I'm much mistaken QEMU's user mode already does this - admittedly I
> > don't tend to run "legacy 32-bit x86 software".
> 
> Yes, this has been deployed on Debian for a long time - you can install
> any combination of Debian architectures on a single system and it will
> use qemu to run binaries that can't be supported natively by the
> hardware.

The only downside I think is that for some syscalls it's not that
efficient. Those using struct iovec come to mind, qemu probably
duplicates the user structures, having to copy them in both directions
(well, the kernel compat layer does something similar).

Anyway, I'm not in favour of this patch. Those binary translation tools
need to explore the user-only options first and come up with some perf
numbers to justify the proposal.

-- 
Catalin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls
  2021-02-12 11:30 ` Steven Price
  2021-02-12 12:35   ` Mark Brown
@ 2021-02-12 13:59   ` Arnd Bergmann
  1 sibling, 0 replies; 12+ messages in thread
From: Arnd Bergmann @ 2021-02-12 13:59 UTC (permalink / raw)
  To: Steven Price
  Cc: Ryan Houdek, Amanieu d'Antras, Catalin Marinas, Will Deacon,
	Mark Rutland, Oleg Nesterov, Al Viro, Dave Martin,
	Amit Daniel Kachhap, Mark Brown, Marc Zyngier, David Brazdil,
	Jean-Philippe Brucker, Andrew Morton, Anshuman Khandual,
	Gavin Shan, Mike Rapoport, Vincenzo Frascino, Kristina Martsenko,
	Kees Cook, Sami Tolvanen, Frederic Weisbecker, Kevin Hao,
	Jason Yan, Andrey Ignatov, Peter Collingbourne, Julien Grall,
	Tian Tao, Qais Yousef, Jens Axboe, Linux ARM, linux-kernel

On Fri, Feb 12, 2021 at 12:33 PM Steven Price <steven.price@arm.com> wrote:
> On 11/02/2021 20:21, sonicadvance1@gmail.com wrote:

> > The problem:
> > We need to support 32-bit processes running under a userspace
> > compatibility layer. The compatibility layer is a AArch64 process.
> > This means exposing the 32bit compatibility syscalls to userspace.
>
> I'm not sure how you come to this conclusion. Running 32-bit processes
> under a compatibility layer is a fine goal, but it's not clear why the
> entire 32-bit compat syscall layer is needed for this.
>
> As a case in point QEMU's user mode emulation already achieves this in
> many cases without any changes to the kernel.

I think it's a quantitative difference, not a qualitative one:

qemu does a nice job at translating the interfaces for many combinations
of host and target architectures at a decent speed, and is improving
at both the compatibility and the performance over time.

What both Tango and FEX promise is to be much faster by focusing
on one target architecture each, and to have better compatibility than
what qemu can do.

> > Who does this matter to?
> > Any user that has a specific need to run legacy 32-bit software under a
> > compatibility layer.
> > Not all software is open source or easy to convert to 64bit, it's
> > something we need to live with.
> > Professional software and the gaming ecosystem is rife with this.
> >
> > What applications have tried to work around this problem?
> > FEX emulator (1) - Userspace x86 to AArch64 compatibility layer
> > Tango binary translator (2) - AArch32 to AArch64 compatibility layer
> > QEmu (3) - Not really but they do some userspace ioctl emulation
>
> Can you expand on "not really"? Clearly there are limitations, but in
> general I can happily "chroot" into a distro filesystem using an
> otherwise incompatible architecture using a qemu-xxx-static binary.

The ioctl emulation in qemu is limited in multiple ways:
 - it needs to duplicate the kernel's compat emulation for
   every single command it wants to handle, and will always
   lag behind what gets merged into the kernel and what
   drivers a particular distro ships.
 - some ioctl commands cannot be emulated in user space
   because the compat code relies on tracking device state
   in the kernel.
 - In some cases, emulation can be expensive, both for
   runtime overhead and for code complexity

> > What problems did they hit?
> > FEX and Tango hit problems with emulating memory related syscalls.
> > - Emulating 32-bit mmap, mremap, shmat in userspace changes behaviour
> > All three hit issues with ioctl emulation
> > - ioctls are free to do what they want including allocating memory and
> > returning opaque structures with pointers.
>
> Now I think we're getting to what the actual problems are:
>
>   * mmap and friends have no (easy) way of forcing a mapping into a 32
> bit region.
>   * ioctls are a mess
>
> The first seems like a reasonable goal - I've seen examples of MAP_32BIT
> being (ab)used to do this, but it actually restricts to 31 bits and it's
> not even available on arm64. Here I think you'd be better off focusing
> on coming up with a new (generic) way of restricting the addresses that
> the kernel will pick.

I think that would be useful for other projects as well.

> ioctls are going to be a problem whatever you do, and I don't think
> there is much option other than having a list of known ioctls and
> translating them in user space - see below.

In particular for the arm32-on-arm64 ioctl case, we have a known-working
implementation in the kernel; I don't see why we wouldn't want to use it.

The x86-32-on-anything case for FEX is trickier because it does
require handling the ia32 alignment case, and the proposed patch
does not handle that correctly for all commands. I think this would
be fixable in the kernel, but it requires a little more work.

> > This is now exposing the compat syscalls to userspace, but for the sake
> > of userspace compatibility it is a necessary evil.
>
> You've yet to convince me that it's "necessary" - I agree on the "evil"
> part ;)

I think it's much easier to argue in favor of exposing the kernel's
ioctl() emulation and a get_unmapped_area() limit to a process-specific
address than for doing the entire syscall emulation.

The emulation for any of the other syscalls should be manageable
once ioctl can be called directly, though there are a couple that
could fall into the same category (setsockopt, sendmsg/recvmsg,
fcntl).

       Arnd

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls
  2021-02-12 13:28     ` Catalin Marinas
@ 2021-02-12 14:12       ` David Laight
  2021-02-12 14:44         ` Catalin Marinas
  2021-02-12 16:24       ` Amanieu d'Antras
  1 sibling, 1 reply; 12+ messages in thread
From: David Laight @ 2021-02-12 14:12 UTC (permalink / raw)
  To: 'Catalin Marinas', Mark Brown
  Cc: Steven Price, sonicadvance1, amanieu, Will Deacon, Mark Rutland,
	Oleg Nesterov, Al Viro, Dave Martin, Amit Daniel Kachhap,
	Marc Zyngier, David Brazdil, Jean-Philippe Brucker,
	Andrew Morton, Anshuman Khandual, Gavin Shan, Mike Rapoport,
	Vincenzo Frascino, Kristina Martsenko, Kees Cook, Sami Tolvanen,
	Frederic Weisbecker, Kevin Hao, Jason Yan, Andrey Ignatov,
	Peter Collingbourne, Julien Grall, Tian Tao, Qais Yousef,
	Jens Axboe, linux-arm-kernel, linux-kernel

From: Catalin Marinas
> Sent: 12 February 2021 13:28
> 
> On Fri, Feb 12, 2021 at 12:35:15PM +0000, Mark Brown wrote:
> > On Fri, Feb 12, 2021 at 11:30:41AM +0000, Steven Price wrote:
> > > On 11/02/2021 20:21, sonicadvance1@gmail.com wrote:
> > > > Why do we need compatibility layers?
> > > > There are ARMv8 CPUs that only support AArch64 but still need to run
> > > > AArch32 applications.
> > > > Cortex-A34/R82 and other cores are prime examples of this.
> > > > Additionally if a user is needing to run legacy 32-bit x86 software, it
> > > > needs the same compatibility layer.
> >
> > > Unless I'm much mistaken QEMU's user mode already does this - admittedly I
> > > don't tend to run "legacy 32-bit x86 software".
> >
> > Yes, this has been deployed on Debian for a long time - you can install
> > any combination of Debian architectures on a single system and it will
> > use qemu to run binaries that can't be supported natively by the
> > hardware.
> 
> The only downside I think is that for some syscalls it's not that
> efficient. Those using struct iovec come to mind, qemu probably
> duplicates the user structures, having to copy them in both directions
> (well, the kernel compat layer does something similar).
> 
> Anyway, I'm not in favour of this patch. Those binary translation tools
> need to explore the user-only options first and come up with some perf
> numbers to justify the proposal.

I don't think the problem is only the performance.
The difficulty is knowing when structures need changing.
A typical example is driver ioctl requests.
Any user space adaption layer would have to know which actual
driver has been opened and what internal structures it has.
Getting that right is hard and error-prone.
The recent changes to move (IIRC) sockopt compatibility down
into the protocol code found quite a few places where it was
previously broken.
It is much easier to get it right in the code that knows about
the actual structures.

For mmap() you certainly want to be able to limit the returned
address to 32 bits (or maybe 31.5 bits).
A MAP_BELOW flag could do that, but the 32bit syscall has to do it anyway.
(I can't remember what is done for wine - which needs 31bit addresses).

Of course, that only helps for 32bit arm binaries - which the kernel
compat code is written for. Trying to run x86 binaries adds extra
complexity.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls
  2021-02-11 20:21 [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls sonicadvance1
  2021-02-12 11:30 ` Steven Price
@ 2021-02-12 14:13 ` Mark Rutland
  1 sibling, 0 replies; 12+ messages in thread
From: Mark Rutland @ 2021-02-12 14:13 UTC (permalink / raw)
  To: sonicadvance1
  Cc: amanieu, Catalin Marinas, Will Deacon, Oleg Nesterov, Al Viro,
	Dave Martin, Amit Daniel Kachhap, Mark Brown, Marc Zyngier,
	David Brazdil, Jean-Philippe Brucker, Andrew Morton,
	Anshuman Khandual, Gavin Shan, Mike Rapoport, Steven Price,
	Vincenzo Frascino, Kristina Martsenko, Kees Cook, Sami Tolvanen,
	Frederic Weisbecker, Kevin Hao, Jason Yan, Andrey Ignatov,
	Peter Collingbourne, Julien Grall, Tian Tao, Qais Yousef,
	Jens Axboe, linux-arm-kernel, linux-kernel

On Thu, Feb 11, 2021 at 12:21:54PM -0800, sonicadvance1@gmail.com wrote:
> From: Ryan Houdek <Sonicadvance1@gmail.com>
> 
> Sorry about the noise. I obviously don't work in this ecosystem.
> Didn't get any comments previously so I'm resending
> 
> The problem:
> We need to support 32-bit processes running under a userspace
> compatibility layer. The compatibility layer is a AArch64 process.
> This means exposing the 32bit compatibility syscalls to userspace.

I think that the requirement is that the compatibility layer can
/somehow/ emulate the 32-bit syscalls (e.g. being able to limit the
range of mmap() and friends), and doesn't strictly require that the
kernel expose the 32-bit compat syscalls directly to userspace.

> Why do we need compatibility layers?
> There are ARMv8 CPUs that only support AArch64 but still need to run
> AArch32 applications.
> Cortex-A34/R82 and other cores are prime examples of this.
> Additionally if a user is needing to run legacy 32-bit x86 software, it
> needs the same compatibility layer.

I think that for this series x86 emulation is a red herring. ABIs differ
across 32-bit arm and 32-bit x86 (and they have distinct arch-specific
syscalls), and so those need distinct compatibility layers.

If you're wanting to accelerate x86 emulation, I don't think this is the
right approach.

> Who does this matter to?
> Any user that has a specific need to run legacy 32-bit software under a
> compatibility layer.
> Not all software is open source or easy to convert to 64bit, it's
> something we need to live with.
> Professional software and the gaming ecosystem is rife with this.
> 
> What applications have tried to work around this problem?
> FEX emulator (1) - Userspace x86 to AArch64 compatibility layer
> Tango binary translator (2) - AArch32 to AArch64 compatibility layer
> QEmu (3) - Not really but they do some userspace ioctl emulation
> 
> What problems did they hit?
> FEX and Tango hit problems with emulating memory related syscalls.
> - Emulating 32-bit mmap, mremap, shmat in userspace changes behaviour
> All three hit issues with ioctl emulation

I suspect these can be solved by extending these syscalls (or adding new
variants) with new functionality to support emulation cases. For
example, having variants with an explicit address mask would also
benefit JITs which want to turn VA bits into additional tag bits.
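
To illustrate the kind of interface I mean -- purely a sketch, nothing
like this exists today, and the name and signature are made up:

/*
 * Hypothetical mmap variant, for illustration only: the kernel only
 * returns addresses whose bits fall under an explicit mask. A 32-bit
 * emulator would pass 0xffffffff; a JIT reserving upper VA bits for
 * tags would pass a narrower mask.
 */
void *mmap_masked(void *addr, size_t length, int prot, int flags,
		  int fd, off_t offset, unsigned long addr_mask);

/* e.g. an emulator keeping guest mappings in the low 4GB: */
p = mmap_masked(NULL, len, PROT_READ | PROT_WRITE,
		MAP_PRIVATE | MAP_ANONYMOUS, -1, 0, 0xffffffffUL);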

> - ioctls are free to do what they want including allocating memory and
> returning opaque structures with pointers.

They're also free to mess with state that can differ across
compat/native tasks, so I don't think this is generally safe to expose
without auditing the specific ioctl.

I'd also note that non-ioctl syscalls can affect distinct state across
compat/native tasks, e.g. TPIDR_EL0 vs TPIDRRO_EL0, and while I see some
accounting for that below I don't believe that will work consistently or
reliably without further invasive changes (e.g. considering the KPTI
trampoline). Further, this introduces significant scope for error, and
the potential to introduce security problems.

So to reiterate, I am strongly opposed to exposing compat syscalls in
this way. However, I do think that we can make emulation easier by
extending some syscalls (e.g. mmap()), and that this would be worthwhile
regardless of emulation.

I've made some further technical comments below, but those have no
bearing on my fundamental objection here.

> With this patch we will be exposing the compatibility syscall table
> through the regular syscall svc API. There is prior art here where on
> x86-64 they also expose the compatibility tables.
> The justification for this is that we need to maintain support for 32bit
> application compatibility going in to the foreseeable future.
> Userspace does almost all of the heavy lifting here, especially when the
> hardware no longer supports the 32bit use case.
> 
> A couple of caveats to this approach.
> Userspace must know that this doesn't solve structure alignment problems
> for the x86->AArch64 (1) case.
> The API for this changes from syscall number in x8 to x7 to match
> AArch32 syscall semantics
> This is now exposing the compat syscalls to userspace, but for the sake
> of userspace compatibility it is a necessary evil.
> 
> Why does the upstream kernel care?
> I believe every user wants to have their software ecosystem continue
> working if they are in a mixed AArch32/AArch64 world even when they are
> running AArch64 only hardware. The kernel should facilitate a good user
> experience.

I appreciate that people have 32-bit applications, and want to run
those, but I'm not convinced that this requires these specific changes.
Especially considering that the cross-architecture cases you mention are
not addressed by this, and need syscall emulation in userspace; that
implies that in general userspace needs to handle conversion of
semantics, and what the kernel needs to do is provide the primitives
onto which userspace can map things in order to get the desired
semantics (which is not the same as blindly exposing compat syscalls).

[...]

> +/* Definitions for compat syscall guest mmap area */
> +#define COMPAT_MIN_GAP			(SZ_128M)
> +#define COMPAT_STACK_TOP		0xffff0000
> +#define COMPAT_MAX_GAP			(COMPAT_STACK_TOP/6*5)
> +#define COMPAT_TASK_UNMAPPED_BASE	PAGE_ALIGN(TASK_SIZE_32 / 4)
> +#define COMPAT_STACK_RND_MASK		(0x7ff >> (PAGE_SHIFT - 12))

What's the "guest mmap area" ?

The commit message introduced the abstract problem, but none of the new
technical concepts like this.

> +#ifndef arch_get_mmap_end
> +#define arch_get_mmap_end(addr)	(TASK_SIZE)
> +#endif
> +
> +#ifndef arch_get_mmap_base
> +#define arch_get_mmap_base(addr, base) (base)
> +#endif
> +
> +static int mmap_is_legacy(unsigned long rlim_stack)
> +{
> +	if (current->personality & ADDR_COMPAT_LAYOUT)
> +		return 1;
> +
> +	if (rlim_stack == RLIM_INFINITY)
> +		return 1;
> +
> +	return sysctl_legacy_va_layout;
> +}
> +
> +static unsigned long compat_mmap_base(unsigned long rnd, unsigned long gap)
> +{
> +	unsigned long pad = stack_guard_gap;
> +
> +	/* Account for stack randomization if necessary */
> +	if (current->flags & PF_RANDOMIZE)
> +		pad += (COMPAT_STACK_RND_MASK << PAGE_SHIFT);
> +
> +	/* Values close to RLIM_INFINITY can overflow. */
> +	if (gap + pad > gap)
> +		gap += pad;
> +
> +	if (gap < COMPAT_MIN_GAP)
> +		gap = COMPAT_MIN_GAP;
> +	else if (gap > COMPAT_MAX_GAP)
> +		gap = COMPAT_MAX_GAP;
> +
> +	return PAGE_ALIGN(COMPAT_STACK_TOP - gap - rnd);
> +}

Again, it's not clear what's going on here, and I fear this is specific
to the design of your userspace emulation layer.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls
  2021-02-12 14:12       ` David Laight
@ 2021-02-12 14:44         ` Catalin Marinas
  2021-02-12 15:06           ` David Laight
  0 siblings, 1 reply; 12+ messages in thread
From: Catalin Marinas @ 2021-02-12 14:44 UTC (permalink / raw)
  To: David Laight
  Cc: Mark Brown, Steven Price, sonicadvance1, amanieu, Will Deacon,
	Mark Rutland, Oleg Nesterov, Al Viro, Dave Martin,
	Amit Daniel Kachhap, Marc Zyngier, David Brazdil,
	Jean-Philippe Brucker, Andrew Morton, Anshuman Khandual,
	Gavin Shan, Mike Rapoport, Vincenzo Frascino, Kristina Martsenko,
	Kees Cook, Sami Tolvanen, Frederic Weisbecker, Kevin Hao,
	Jason Yan, Andrey Ignatov, Peter Collingbourne, Julien Grall,
	Tian Tao, Qais Yousef, Jens Axboe, linux-arm-kernel,
	linux-kernel

On Fri, Feb 12, 2021 at 02:12:02PM +0000, David Laight wrote:
> From: Catalin Marinas
> > Sent: 12 February 2021 13:28
> > On Fri, Feb 12, 2021 at 12:35:15PM +0000, Mark Brown wrote:
> > > On Fri, Feb 12, 2021 at 11:30:41AM +0000, Steven Price wrote:
> > > > On 11/02/2021 20:21, sonicadvance1@gmail.com wrote:
> > > > > Why do we need compatibility layers?
> > > > > There are ARMv8 CPUs that only support AArch64 but still need to run
> > > > > AArch32 applications.
> > > > > Cortex-A34/R82 and other cores are prime examples of this.
> > > > > Additionally if a user is needing to run legacy 32-bit x86 software, it
> > > > > needs the same compatibility layer.
> > >
> > > > Unless I'm much mistaken QEMU's user mode already does this - admittedly I
> > > > don't tend to run "legacy 32-bit x86 software".
> > >
> > > Yes, this has been deployed on Debian for a long time - you can install
> > > any combination of Debian architectures on a single system and it will
> > > use qemu to run binaries that can't be supported natively by the
> > > hardware.
> > 
> > The only downside I think is that for some syscalls it's not that
> > efficient. Those using struct iovec come to mind, qemu probably
> > duplicates the user structures, having to copy them in both directions
> > (well, the kernel compat layer does something similar).
> > 
> > Anyway, I'm not in favour of this patch. Those binary translation tools
> > need to explore the user-only options first and come up with some perf
> > numbers to justify the proposal.
> 
> I don't think the problem is only the performance.
> The difficulty is knowing when structures need changing.
> A typical example is driver ioctl requests.

The ioctl is indeed the difficult case not necessarily because of
changing structures but rather their sheer number. For the generic
syscalls, the existing ABI is very rarely changed. It is, however,
evolving (new syscalls) and such a binary translation tool would need to
keep up or at least intercept syscalls like uname and report older
kernel versions.

> Any user space adaption layer would have to know which actual
> driver has been opened and what internal structures it has.
> Getting that right is hard and error-prone.
> The recent changes to move (IIRC) sockopt compatibility down
> into the protocol code found quite a few places where it was
> previously broken.
> It is much easier to get it right in the code that knows about
> the actual structures.

As Arnd I think was suggesting, we could have an ioctl32() syscall that
allows compat arguments but not opening up the whole set of compat
syscalls to native processes.
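
Something along these lines, very roughly (untested sketch; the helper
it dispatches to is assumed here rather than an existing function):

/*
 * Hypothetical sketch only - no such syscall exists today. A native
 * 64-bit task asks the kernel to run the compat ioctl path for a
 * single fd/cmd pair, reusing the in-kernel conversion without
 * exposing the rest of the compat syscall table.
 */
SYSCALL_DEFINE3(ioctl32, unsigned int, fd, unsigned int, cmd,
		compat_ulong_t, arg)
{
	/* assumed helper wrapping the existing compat ioctl dispatch */
	return do_compat_ioctl(fd, cmd, arg);
}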

> For mmap() you certainly want to be able to limit the returned
> address to 32 bits (or maybe 31.5 bits).
> A MAP_BELOW flag could do that, but the 32bit syscall has to do it anyway.
> (I can't remember what is done for wine - which needs 31bit addresses).

For mmap(), we can easily add support for PER_LINUX_32BIT or
PER_LINUX32_3GB, so we don't need to change the mmap() parameters. I
don't recall having had a (strong) request for this.
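
From the translator's side that would look something like this (sketch
only - the personality values exist, but the arm64 unmapped-area code
would still have to be taught to honour them when picking addresses):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/personality.h>

static void *guest_alloc(size_t len)
{
	/* normally done once at translator startup */
	personality(PER_LINUX32_3GB);

	/* placement would then stay below the limit without MAP_FIXED */
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}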

> Of course, that only helps for 32bit arm binaries - which the kernel
> compat code is written for. Trying to run x86 binaries adds extra
> complexity.

Indeed.

-- 
Catalin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls
  2021-02-12 14:44         ` Catalin Marinas
@ 2021-02-12 15:06           ` David Laight
  0 siblings, 0 replies; 12+ messages in thread
From: David Laight @ 2021-02-12 15:06 UTC (permalink / raw)
  To: 'Catalin Marinas'
  Cc: Mark Brown, Steven Price, sonicadvance1, amanieu, Will Deacon,
	Mark Rutland, Oleg Nesterov, Al Viro, Dave Martin,
	Amit Daniel Kachhap, Marc Zyngier, David Brazdil,
	Jean-Philippe Brucker, Andrew Morton, Anshuman Khandual,
	Gavin Shan, Mike Rapoport, Vincenzo Frascino, Kristina Martsenko,
	Kees Cook, Sami Tolvanen, Frederic Weisbecker, Kevin Hao,
	Jason Yan, Andrey Ignatov, Peter Collingbourne, Julien Grall,
	Tian Tao, Qais Yousef, Jens Axboe, linux-arm-kernel,
	linux-kernel

> > Any user space adaption layer would have to know which actual
> > driver has been opened and what internal structures it has.
> > Getting that right is hard and error-prone.
> > The recent changes to move (IIRC) sockopt compatibility down
> > into the protocol code found quite a few places where it was
> > previously broken.
> > It is much easier to get it right in the code that knows about
> > the actual structures.
> 
> As Arnd I think was suggesting, we could have an ioctl32() syscall that
> allows compat arguments but not opening up the whole set of compat
> syscalls to native processes.

Why is that a problem?
The kernel has to allow absolute garbage in syscall parameters.
So it really shouldn't matter.
It may give processes extra ways to 'shoot themselves in the foot'
but surely that is their problem.

Certainly, on x86, a 64bit process can make all three different
types of system call.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls
  2021-02-12 13:28     ` Catalin Marinas
  2021-02-12 14:12       ` David Laight
@ 2021-02-12 16:24       ` Amanieu d'Antras
  2021-02-12 18:04         ` Mark Rutland
  1 sibling, 1 reply; 12+ messages in thread
From: Amanieu d'Antras @ 2021-02-12 16:24 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Mark Brown, Steven Price, sonicadvance1, Will Deacon,
	Mark Rutland, Oleg Nesterov, Al Viro, Dave Martin,
	Amit Daniel Kachhap, Marc Zyngier, David Brazdil,
	Jean-Philippe Brucker, Andrew Morton, Anshuman Khandual,
	Gavin Shan, Mike Rapoport, Vincenzo Frascino, Kristina Martsenko,
	Kees Cook, Sami Tolvanen, Frederic Weisbecker, Kevin Hao,
	Jason Yan, Andrey Ignatov, Peter Collingbourne, Julien Grall,
	Tian Tao, Qais Yousef, Jens Axboe, Linux ARM, linux-kernel

On Fri, Feb 12, 2021 at 1:28 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> The only downside I think is that for some syscalls it's not that
> efficient. Those using struct iovec come to mind, qemu probably
> duplicates the user structures, having to copy them in both directions
> (well, the kernel compat layer does something similar).
>
> Anyway, I'm not in favour of this patch. Those binary translation tools
> need to explore the user-only options first and come up with some perf
> numbers to justify the proposal.

I'd like to elaborate Tango's point of view on this problem.

Quick recap: Tango allows AArch32 programs to run on AArch64 CPUs that
don't support 32-bit mode. The primary use case is supporting 32-bit
Android apps, which means that Tango needs to be able to support the
full set of syscalls used on Android, including interfacing with many
drivers that are not in the mainline kernel. The patch proposed by
Ryan is based on the kernel patch used by Tango which can be found
here: https://github.com/Amanieu/linux/tree/tango-v5.4

Efficiency is not the concern here: copying/rearranging some bytes is
tiny compared to the cost of a syscall. The main concern is
correctness: there are many cases where userspace does not have the
information or the capabilities needed to ensure that the 32-bit
syscall ABI is correctly emulated.

There are two distinct parts to this: compat syscall emulation and
mmap address selection. I will address each separately.

Part 1: Compat syscall emulation

Even with this patch, Tango doesn't just pass 32-bit syscall through
to the kernel directly. We have ~5000 lines of code dealing with
various details such as memory management, signal handling, /proc
emulation, ptrace emulation, etc. However once this is done, Tango
will pass the syscall through to the kernel as a 32-bit compat syscall
instead of as a 64-bit syscall.

Here are several issues, off the top of my head, which are impossible
or impractical to support in user-mode:
- As mentioned before, there are a huge number of ioctls which behave
differently in 32-bit mode. It is impractical and error prone to
manually emulate them all in user mode. Specifically, the kernel
already has a well-tested and reliable compatibility layer and it
makes sense to reuse this. QEMU supports emulating some ioctls in
userspace but this still does not cover devices like GPUs which are
needed for accelerated rendering.
- The 64-bit set_robust_list is not compatible with the 32-bit ABI.
The compat version of set_robust_list must be used. Emulating this in
user mode is not reliable since SIGKILL cannot be caught.
- io_uring uses iovec structures as part of its API, which have
different sizes on 32-bit and 64-bit. This makes io_uring unusable.
- ext4 represents positions in directories as 64-bit hashes, which
break if they are truncated to 32 bits. There is special support for
32-bit off_t in the ext4 driver but this is only used when
in_compat_syscall is true. QEMU also suffers from this problem:
https://bugzilla.kernel.org/show_bug.cgi?id=205957

Additionally, for Tango we want 32-bit programs to be able to use
seccomp filters, which is required by the Android CTS. Tango
intercepts seccomp filter installation and inserts a prefix which
whitelists 64-bit syscalls used internally by Tango and passes the
rest through to the user seccomp filter. For this to work, the kernel
must report 32-bit syscalls from 64-bit processes as AUDIT_ARCH_ARM
with the compat syscall number.
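
The prefix is shaped roughly like this (simplified - the real prefix
whitelists only the specific 64-bit syscalls Tango uses internally
rather than all of AArch64):

#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Let native AArch64 syscalls (the translator's own) through; everything
 * else falls into the user's filter, which then sees AUDIT_ARCH_ARM and
 * compat syscall numbers.
 */
static const struct sock_filter tango_prefix[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_AARCH64, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	/* ... user-supplied filter instructions appended here ... */
};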

These issues are all solved by exposing compat syscalls to 64-bit
processes and ensuring is_compat_task/in_compat_syscall is true for
the duration of that syscall. There is a precedent for this: on x86,
syscalls made with int 0x80 are treated as 32-bit syscalls even if
they come from a 64-bit process.
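
For instance, just to illustrate the precedent (x86-64 only, and it
relies on CONFIG_IA32_EMULATION being enabled):

#include <stdio.h>

int main(void)
{
	long ret;

	/*
	 * From a 64-bit x86 process, "int $0x80" enters the 32-bit syscall
	 * table: 20 is the i386 __NR_getpid, not the x86-64 one (39).
	 */
	__asm__ volatile ("int $0x80" : "=a"(ret) : "a"(20L) : "memory");
	printf("pid via the 32-bit table: %ld\n", ret);
	return 0;
}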

Aside from seccomp support, this also solves FEX's concerns for
x86-to-AArch64 translation. There are of course some structures with
architecture-specific differences (e.g. epoll, stat, statfs) which
have to be translated manually in userspace, but the vast majority of
the ABI differences are simply due to 32/64-bit differences which
apply to all architectures.

Part 2: mmap address range

A binary translator such as FEX or Tango generally splits the address
space into two parts: the lower 4GB are reserved for the use of the
32-bit process and the rest of the address space is for the
translator's internal use (e.g. JIT cache). It is important that any
VM regions allocated through syscalls by the translated application be
located in the lower 4GB.

QEMU reserves 4G of address space with PROT_NONE and maps chunks out
of it for the application with MAP_FIXED as needed (a rough sketch of
this scheme follows the list below). However this doesn't work for all
cases:
- The io_setup syscall allocates a VM area for the AIO context and
returns it. But there is no way to control where this context is
allocated so it will almost always end up above the 4GB limit.
- Some ioctls will also perform VM allocations, with the same issues
as io_setup. Search for "vm_mmap" in drivers/.
- Some file descriptors have alignment requirements which are not
known to userspace. For example, a hugetlbfs file can only be mmaped
at a huge page alignment but there is no way for userspace to know
this when selecting an address.
- The Mali kbase out-of-tree driver outright forbids MAP_FIXED when
mapping GPU memory and insists on selecting a properly aligned address
itself.
- shmat and shmdt are particularly difficult to emulate since the
length of the mapping is not passed in as a parameter. They also
suffer from race conditions since shmdt leaves a gap in the 4GB
reserved space which could be filled in by a concurrent mmap
operation.
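
For reference, the reservation scheme mentioned above looks roughly
like this (simplified sketch, error handling omitted):

#include <stddef.h>
#include <sys/mman.h>

/* reserve the guest's 4GB up front: no backing, just keeps the host's
 * own mappings out of the range */
static void *reserve_guest_as(void)
{
	return mmap(NULL, 4UL << 30, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/* carve a chunk out of the reservation at a guest-chosen offset */
static void *guest_mmap(void *base, unsigned long guest_addr, size_t len,
			int prot)
{
	return mmap((char *)base + guest_addr, len, prot,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}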

The solution proposed in this patch is to use a separate mmap_base
when a compat syscall is being executed by a 64-bit process. This
mmap_base is separately randomized on process startup so that
translated processes benefit from the additional security. All VM
allocations performed by 32-bit-under-64-bit syscalls will be done in
the low 4GB using this new mmap_base, while 64-bit syscalls used by
the translator continue to use the original mmap_base.

A possible alternative approach would be to use a prctl to restrict
the mmap range of the process and allow the translator to manually
specify its mmap_base. Any allocations that the translator needs to
perform above 4GB would then need to be done with MAP_FIXED, which is
workable albeit slightly inconvenient. The main advantage of this
alternative is that it is not tied to compat syscalls.

An extension to mmap which allows a custom address range to be
specified does *not* solve all of the issues listed above, which
primarily come from VM allocations performed by syscalls other than
mmap.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls
  2021-02-12 16:24       ` Amanieu d'Antras
@ 2021-02-12 18:04         ` Mark Rutland
  2021-02-12 19:06           ` Amanieu d'Antras
  0 siblings, 1 reply; 12+ messages in thread
From: Mark Rutland @ 2021-02-12 18:04 UTC (permalink / raw)
  To: Amanieu d'Antras
  Cc: Catalin Marinas, Mark Brown, Steven Price, sonicadvance1,
	Will Deacon, Oleg Nesterov, Al Viro, Dave Martin,
	Amit Daniel Kachhap, Marc Zyngier, David Brazdil,
	Jean-Philippe Brucker, Andrew Morton, Anshuman Khandual,
	Gavin Shan, Mike Rapoport, Vincenzo Frascino, Kristina Martsenko,
	Kees Cook, Sami Tolvanen, Frederic Weisbecker, Kevin Hao,
	Jason Yan, Andrey Ignatov, Peter Collingbourne, Julien Grall,
	Tian Tao, Qais Yousef, Jens Axboe, Linux ARM, linux-kernel

On Fri, Feb 12, 2021 at 04:24:35PM +0000, Amanieu d'Antras wrote:
> On Fri, Feb 12, 2021 at 1:28 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > The only downside I think is that for some syscalls it's not that
> > efficient. Those using struct iovec come to mind, qemu probably
> > duplicates the user structures, having to copy them in both directions
> > (well, the kernel compat layer does something similar).
> >
> > Anyway, I'm not in favour of this patch. Those binary translation tools
> > need to explore the user-only options first and come up with some perf
> > numbers to justify the proposal.
> 
> I'd like to elaborate Tango's point of view on this problem.
> 
> Quick recap: Tango allows AArch32 programs to run on AArch64 CPUs that
> don't support 32-bit mode. The primary use case is supporting 32-bit
> Android apps, which means that Tango needs to be able to support the
> full set of syscalls used on Android, including interfacing with many
> drivers that are not in the mainline kernel.

Please bear in mind that for an upstream patch submission, the state of
out-of-tree drivers is not a justification. I appreciate that's a
problem for you if you need to support out-of-tree code, but it's not a
burden that upstream needs to care about.

> The patch proposed by
> Ryan is based on the kernel patch used by Tango which can be found
> here: https://github.com/Amanieu/linux/tree/tango-v5.4
> 
> Efficiency is not the concern here: copying/rearranging some bytes is
> tiny compared to the cost of a syscall. The main concern is
> correctness: there are many cases where userspace does not have the
> information or the capabilities needed to ensure that the 32-bit
> syscall ABI is correctly emulated.

I do appreciate that today there are cases where the emulator cannot do
the right thing due to lack of a mechanism, but where the emulator does
not have knowledge, I don't think that it can safely invoke the syscall.
Consider if userspace invokes compat_rt_sigreturn() or similar, which
will trash the emulator's state.

Note that there are cases (e.g. compat_rt_sigreturn()) where the kernel
cannot provide the correct behaviour either. In your tree above, I spot
at least the following:

* For rt_sigreturn() the kernel will attempt to validate/restore a
  compat sigframe assuming the AArch32 register mappings, then
  valid_user_regs() will blat the PSTATE/SPSR_ELx value to /some/ valid
  AArch64 SPSR_ELx value that happens to mostly alias.

  I hope that your emulator doesn't allow emulated apps to call this,
  because it would blat the emulator's state. Regardless, this syscall
  simply cannot do any correct thing in the context of a fake compat
  task, and doesn't make sense to expose.

* For ptrace, operations on the user_aarch32_view (which
  task_user_regset_view() will return for fake compat tasks) will
  erroneously try to convert between SPSR_ELx <-> AARCH32 SPSR layouts,
  assuming the pt_regs::pstate field is using the encoding for AArch32
  EL0, whereas it's actually the AArch64 EL0 encoding, whose layout is
  subtly different. Subsequently valid_user_regs() will sanitize that to
  an AArch64 encoding, with the exact same problems as for rt_sigreturn().

  Note that attempting to set the TLS will clobber TPIDRRO_EL0, which
  the kernel will clobber for AArch64 tasks (including your fake compat
  tasks) in the KPTI trampoline. I'm not sure what your emulator expects
  here, and I suspect this also gets clobbered by the case
  tls_thread_flush() tries to cater for.

  So this really doesn't make sense to expose either, the kernel cannot
  possibly do something that is correct in this case.

I fully expect that there are more cases where this sort of mismatch
exists and there isn't some final sanity check that prevents said
mismatch from breaking the kernel.

Maybe your emulator avoids these, but that's no justification for the
kernel to expose such broken behaviour.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls
  2021-02-12 18:04         ` Mark Rutland
@ 2021-02-12 19:06           ` Amanieu d'Antras
  0 siblings, 0 replies; 12+ messages in thread
From: Amanieu d'Antras @ 2021-02-12 19:06 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Mark Brown, Steven Price, sonicadvance1,
	Will Deacon, Oleg Nesterov, Al Viro, Dave Martin,
	Amit Daniel Kachhap, Marc Zyngier, David Brazdil,
	Jean-Philippe Brucker, Andrew Morton, Anshuman Khandual,
	Gavin Shan, Mike Rapoport, Vincenzo Frascino, Kristina Martsenko,
	Kees Cook, Sami Tolvanen, Frederic Weisbecker, Kevin Hao,
	Jason Yan, Andrey Ignatov, Peter Collingbourne, Julien Grall,
	Tian Tao, Qais Yousef, Jens Axboe, Linux ARM, linux-kernel

On Fri, Feb 12, 2021 at 6:04 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > The patch proposed by
> > Ryan is based on the kernel patch used by Tango which can be found
> > here: https://github.com/Amanieu/linux/tree/tango-v5.4
> >
> > Efficiency is not the concern here: copying/rearranging some bytes is
> > tiny compared to the cost of a syscall. The main concern is
> > correctness: there are many cases where userspace does not have the
> > information or the capabilities needed to ensure that the 32-bit
> > syscall ABI is correctly emulated.
>
> I do appreciate that today there are cases where the emulator cannot do
> the right thing due to lack of a mechanism, but where the emulator does
> not have knowledge, I don't think that it can safely invoke the syscall.
> Consider if userspace invokes compat_rt_sigreturn() or similar, which
> will trash the emulator's state.
>
> Note that there are cases (e.g. compat_rt_sigreturn()) where the kernel
> cannot provide the correct behaviour either. In your tree above, I spot
> at least the following:
>
> * For rt_sigreturn() the kernel will attempt to validate/restore a
>   compat sigframe assuming the AArch32 register mappings, then
>   valid_user_regs() will blat the PSTATE/SPSR_ELx value to /some/ valid
>   AArch64 SPSR_ELx value that happens to mostly alias.
>
>   I hope that your emulator doesn't allow emulated apps to call this,
>   because it would blat the emulator's state. Regardless, this syscall
>   simply cannot do any correct thing in the context of a fake compat
>   task, and doesn't make sense to expose.

sigreturn and sigaction are completely emulated by Tango and never
reach the kernel. It simply doesn't make sense to do otherwise since
the kernel has no knowledge of how Tango manages the emulated AArch32
state.

I agree that invoking compat_sys_rt_sigreturn from a 64-bit process
doesn't make sense. My reading of the code is that it will trigger a
SIGSEGV due to valid_user_regs failing. We could add an explicit check
against is_aarch32_compat_task in the compat syscall but the end
result will be the same.

> * For ptrace, operations on the user_aarch32_view (which
>   task_user_regset_view() will return for fake compat tasks) will
>   erroneously try to convert between SPSR_ELx <-> AARCH32 SPSR layouts,
>   assuming the pt_regs::pstate field is using the encoding for AArch32
>   EL0, whereas it's actually the AArch64 EL0 encoding, whose layout is
>   subtly different. Subsequently valid_user_regs() will sanitize that to
>   an AArch64 encoding, with the exact same problems as for rt_sigreturn().

Note that user_aarch32_view is only returned when the tracer is
performing a compat syscall. This is no different from a normal 32-bit
process tracing a 64-bit process, which already "works" (the register
state is garbage but everything else works).

Tango doesn't use the regsets and instead retrieves the AArch32
register state directly from the traced process (which is also running
under Tango) with process_vm_readv and returns that when emulating the
various PTRACE_* operations.

>   Note that attempting to set the TLS will clobber TPIDRRO_EL0, which
>   the kernel will clobber for AArch64 tasks (including your fake compat
>   tasks) in the KPTI trampoline. I'm not sure what your emulator expects
>   here, and I suspect this also gets clobbered by the case
>   tls_thread_flush() tries to cater for.

All the code paths that modify TPIDRRO_EL0 are guarded by
is_aarch32_compat_task which returns false for fake compat tasks. The
call to compat_arm_syscall which handles __ARM_NR_compat_set_tls is
also guarded by is_aarch32_compat_task.

Again, __ARM_NR_compat_set_tls is emulated internally by Tango and the
AArch32 TLS registers visible to translated code are also emulated in
software.

>   So this really doesn't make sense to expose either, the kernel cannot
>   possibly do something that is correct in this case.
>
> I fully expect that there are more cases where this sort of mismatch
> exists and there isn't some final sanity check that prevents said
> mismatch from breaking the kernel.
>
> Maybe your emulator avoids these, but that's no justification for the
> kernel to expose such broken behaviour.

I disagree that any broken behavior is exposed here.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-02-12 19:09 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-11 20:21 [RESEND RFC PATCH v2] arm64: Exposes support for 32-bit syscalls sonicadvance1
2021-02-12 11:30 ` Steven Price
2021-02-12 12:35   ` Mark Brown
2021-02-12 13:28     ` Catalin Marinas
2021-02-12 14:12       ` David Laight
2021-02-12 14:44         ` Catalin Marinas
2021-02-12 15:06           ` David Laight
2021-02-12 16:24       ` Amanieu d'Antras
2021-02-12 18:04         ` Mark Rutland
2021-02-12 19:06           ` Amanieu d'Antras
2021-02-12 13:59   ` Arnd Bergmann
2021-02-12 14:13 ` Mark Rutland

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).