All of lore.kernel.org
* [PATCH 0/2] x86: add arch_prctl to switch between native/compat modes
@ 2016-04-06 16:29 Dmitry Safonov
  2016-04-06 16:29 ` [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode Dmitry Safonov
  2016-04-06 16:29 ` [PATCH 2/2] x86/tools/testing: add test for ARCH_SET_COMPAT Dmitry Safonov
  0 siblings, 2 replies; 26+ messages in thread
From: Dmitry Safonov @ 2016-04-06 16:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: luto, tglx, mingo, hpa, x86, shuahkh, bp, akpm, linux-kselftest,
	gorcunov, xemul, khorenko, 0x7f454c46, Dmitry Safonov

With these patches it becomes possible to tell the kernel which mode the
current task is in.
I need this for C/R of compat processes: the restorer is a native x86_64
process that maps vmas, restores task parameters, clones to add threads
and so on. To restore a 32-bit application that runs on x86_64 (in
compatibility mode), I need to set the CS selector to USER32_CS and tell
the kernel that the process is now in compat mode.
Switching the selector isn't hard (other selftests already do it with a
long jump/lret); this series adds the arch_prctl() calls for telling the
kernel about it.
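
For reference, the selector switch itself looks roughly like this (a
minimal sketch of the far-return trick the selftests use; 0x23 is
USER32_CS in the x86_64 GDT):

	asm volatile(
		"	pushq $0x23\n"	/* USER32_CS */
		"	pushq $1f\n"	/* address of the 32-bit code below */
		"	lretq\n"	/* far return reloads CS -> compat mode */
		"1:\n"
		".code32\n"
		/* ... 32-bit code here, syscalls via int $0x80 ... */
		".code64\n"
	);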

I also implemented vdso/vvar blob remapping on the compat <-> native
switch. CRIU doesn't strictly need this part, since at the restore stage
we already have the dumped vdso/vvar vma images; it is for other
processes that may want to switch their mode.
(I will drop this part if no one else needs it.)

I added a selftest, and there is a CRIU branch that uses this to C/R
32-bit processes:
https://github.com/0x7f454c46/criu/tree/compat-2
It contains dozens of patches which I will prepare for the CRIU master
branch once this switching patch is mainlined.

Dmitry Safonov (2):
  x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  x86/tools/testing: add test for ARCH_SET_COMPAT

 arch/x86/entry/vdso/vma.c                          |  76 ++++--
 arch/x86/include/asm/vdso.h                        |   5 +
 arch/x86/include/uapi/asm/prctl.h                  |   6 +
 arch/x86/kernel/process_64.c                       |  87 ++++++
 tools/testing/selftests/x86/Makefile               |   1 +
 .../testing/selftests/x86/arch_prctl_set_compat.c  | 295 +++++++++++++++++++++
 6 files changed, 453 insertions(+), 17 deletions(-)
 create mode 100644 tools/testing/selftests/x86/arch_prctl_set_compat.c

-- 
2.7.4

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-06 16:29 [PATCH 0/2] x86: add arch_prctl to switch between native/compat modes Dmitry Safonov
@ 2016-04-06 16:29 ` Dmitry Safonov
  2016-04-06 18:04   ` Andy Lutomirski
  2016-04-06 16:29 ` [PATCH 2/2] x86/tools/testing: add test for ARCH_SET_COMPAT Dmitry Safonov
  1 sibling, 1 reply; 26+ messages in thread
From: Dmitry Safonov @ 2016-04-06 16:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: luto, tglx, mingo, hpa, x86, shuahkh, bp, akpm, linux-kselftest,
	gorcunov, xemul, khorenko, 0x7f454c46, Dmitry Safonov

Currently, each process that runs natively on x86_64 may execute 32-bit
code by properly setting its CS selector: either to one from the LDT or
by reusing Linux's USER32_CS. The reverse is also valid: a compat task
can run 64-bit code by choosing USER_CS.
So we may switch between 32- and 64-bit code execution in any process,
and Linux will choose the right syscall numbers in the syscall entries
for those processes. But it will still consider them native/compat
according to the personality that the ELF loader set at launch. This
affects, e.g., the ptrace syscall on those tasks: PTRACE_GETREGSET
returns the 64- or 32-bit regset according to the process's recorded
mode (that is how strace detects a task's personality since version 4.8).
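
For illustration, a tracer can detect that recorded personality like
this (a sketch, not part of this patch; the kernel shrinks iov_len to
the size of the 32-bit regset for tasks it considers compat):

	struct user_regs_struct regs;	/* the bigger, 64-bit regset */
	struct iovec iov = {
		.iov_base = &regs,
		.iov_len  = sizeof(regs),
	};

	if (!ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov) &&
	    iov.iov_len < sizeof(regs))
		printf("tracee is considered a compat task\n");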

This patch adds arch_prctl calls for x86 that make it possible to tell
the Linux kernel which mode the application is currently running in.
Mainly, this is needed for CRIU: restoring both compat and native
applications from a 64-bit restorer. For that reason I wrapped all the
code in CONFIG_CHECKPOINT_RESTORE.
This patch also solves a problem with running 64-bit code in a 32-bit
ELF (and the reverse): you only have the 32-bit ELF vdso for fast
syscalls. When switching between native <-> compat mode via arch_prctl,
the kernel will remap the vdso binary blob needed for the target mode.
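
Hypothetical usage of the new interface from a 64-bit task (the constant
matches the uapi header added below; addr picks where the target-mode
vdso is mapped, 0 lets the kernel choose):

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#define ARCH_SET_COMPAT	0x2001	/* from asm/prctl.h below */

	/* switch the current task to compat mode; the kernel maps
	 * the 32-bit vdso at an address of its choosing (addr == 0) */
	if (syscall(SYS_arch_prctl, ARCH_SET_COMPAT, 0UL))
		perror("arch_prctl(ARCH_SET_COMPAT)");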

Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
CC: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 arch/x86/entry/vdso/vma.c         | 76 ++++++++++++++++++++++++++--------
 arch/x86/include/asm/vdso.h       |  5 +++
 arch/x86/include/uapi/asm/prctl.h |  6 +++
 arch/x86/kernel/process_64.c      | 87 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 157 insertions(+), 17 deletions(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 10f704584922..9a1561da0bad 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -156,22 +156,21 @@ static int vvar_fault(const struct vm_special_mapping *sm,
 	return VM_FAULT_SIGBUS;
 }
 
-static int map_vdso(const struct vdso_image *image, bool calculate_addr)
+static int do_map_vdso(const struct vdso_image *image, bool calculate_addr,
+		unsigned long addr)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
-	unsigned long addr, text_start;
+	unsigned long text_start;
 	int ret = 0;
 	static const struct vm_special_mapping vvar_mapping = {
 		.name = "[vvar]",
 		.fault = vvar_fault,
 	};
 
-	if (calculate_addr) {
+	if (calculate_addr && !addr) {
 		addr = vdso_addr(current->mm->start_stack,
 				 image->size - image->sym_vvar_start);
-	} else {
-		addr = 0;
 	}
 
 	down_write(&mm->mmap_sem);
@@ -209,11 +208,11 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 				       VM_PFNMAP,
 				       &vvar_mapping);
 
-	if (IS_ERR(vma)) {
+	if (IS_ERR(vma))
 		ret = PTR_ERR(vma);
-		goto up_fail;
-	}
 
+	if (ret)
+		do_munmap(mm, addr, image->size - image->sym_vvar_start);
 up_fail:
 	if (ret)
 		current->mm->context.vdso = NULL;
@@ -223,24 +222,28 @@ up_fail:
 }
 
 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
-static int load_vdso32(void)
+static int load_vdso32(unsigned long addr)
 {
 	if (vdso32_enabled != 1)  /* Other values all mean "disabled" */
 		return 0;
 
-	return map_vdso(&vdso_image_32, false);
+	return do_map_vdso(&vdso_image_32, false, addr);
 }
 #endif
 
 #ifdef CONFIG_X86_64
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+static int load_vdso64(unsigned long addr)
 {
 	if (!vdso64_enabled)
 		return 0;
 
-	return map_vdso(&vdso_image_64, true);
+	return do_map_vdso(&vdso_image_64, true, addr);
 }
 
+int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+{
+	return load_vdso64(0);
+}
 #ifdef CONFIG_COMPAT
 int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
 				       int uses_interp)
@@ -250,20 +253,59 @@ int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
 		if (!vdso64_enabled)
 			return 0;
 
-		return map_vdso(&vdso_image_x32, true);
+		return do_map_vdso(&vdso_image_x32, true, 0);
 	}
 #endif
 #ifdef CONFIG_IA32_EMULATION
-	return load_vdso32();
+	return load_vdso32(0);
 #else
 	return 0;
 #endif
 }
-#endif
-#else
+#endif /* CONFIG_COMPAT */
+
+#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
+unsigned long unmap_vdso(void)
+{
+	struct vm_area_struct *vma;
+	unsigned long addr = (unsigned long)current->mm->context.vdso;
+
+	if (!addr)
+		return 0;
+
+	/* vvar pages */
+	vma = find_vma(current->mm, addr - 1);
+	if (vma)
+		vm_munmap(vma->vm_start, vma->vm_end - vma->vm_start);
+
+	/* vdso pages */
+	vma = find_vma(current->mm, addr);
+	if (vma)
+		vm_munmap(vma->vm_start, vma->vm_end - vma->vm_start);
+
+	current->mm->context.vdso = NULL;
+
+	return addr;
+}
+/*
+ * Maps needed vdso type: vdso_image_32/vdso_image_64
+ * @compatible - true for compatible, false for native vdso image
+ * @addr - specify addr for vdso mapping (0 for random/searching)
+ * NOTE: be sure to set/clear thread-specific flags before
+ * calling this function.
+ */
+int map_vdso(bool compatible, unsigned long addr)
+{
+	if (compatible)
+		return load_vdso32(addr);
+	else
+		return load_vdso64(addr);
+}
+#endif /* CONFIG_IA32_EMULATION && CONFIG_CHECKPOINT_RESTORE */
+#else /* !CONFIG_X86_64 */
 int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 {
-	return load_vdso32();
+	return load_vdso32(0);
 }
 #endif
 
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 43dc55be524e..3ead7cc48a68 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -39,6 +39,11 @@ extern const struct vdso_image vdso_image_x32;
 extern const struct vdso_image vdso_image_32;
 #endif
 
+#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
+extern int map_vdso(bool to_compat, unsigned long addr);
+extern unsigned long unmap_vdso(void);
+#endif
+
 extern void __init init_vdso_image(const struct vdso_image *image);
 
 #endif /* __ASSEMBLER__ */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 3ac5032fae09..455844f06485 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -6,4 +6,10 @@
 #define ARCH_GET_FS 0x1003
 #define ARCH_GET_GS 0x1004
 
+#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
+#define ARCH_SET_COMPAT		0x2001
+#define ARCH_SET_NATIVE		0x2002
+#define ARCH_GET_PERSONALITY	0x2003
+#endif
+
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 6cbab31ac23a..e50660d59530 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -49,6 +49,7 @@
 #include <asm/debugreg.h>
 #include <asm/switch_to.h>
 #include <asm/xen/hypervisor.h>
+#include <asm/vdso.h>
 
 asmlinkage extern void ret_from_fork(void);
 
@@ -505,6 +506,83 @@ void set_personality_ia32(bool x32)
 }
 EXPORT_SYMBOL_GPL(set_personality_ia32);
 
+#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
+/*
+ * Check if there are still some vmas (except vdso) for current,
+ * which placed above compatible TASK_SIZE.
+ * Check also code, data, stack, args and env placements.
+ * Returns true if all mappings are compatible.
+ */
+static bool task_mappings_compatible(void)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long top_addr = IA32_PAGE_OFFSET;
+	struct vm_area_struct *vma = find_vma(mm, top_addr);
+
+	if (mm->end_code	> top_addr ||
+	    mm->end_data	> top_addr ||
+	    mm->start_stack	> top_addr ||
+	    mm->brk		> top_addr ||
+	    mm->arg_end		> top_addr ||
+	    mm->env_end		> top_addr)
+		return false;
+
+	while (vma) {
+		if ((vma->vm_start != (unsigned long)mm->context.vdso) &&
+		    (vma->vm_end != (unsigned long)mm->context.vdso))
+			return false;
+
+		top_addr = vma->vm_end;
+		vma = find_vma(mm, top_addr);
+	}
+
+	return true;
+}
+
+static int do_set_personality(bool compat, unsigned long addr)
+{
+	int ret;
+	unsigned long old_vdso_base;
+	unsigned long old_mmap_base = current->mm->mmap_base;
+
+	if (test_thread_flag(TIF_IA32) == compat) /* nothing to do */
+		return 0;
+
+	if (compat && !task_mappings_compatible())
+		return -EFAULT;
+
+	/*
+	 * We can't just remap vdso to needed location:
+	 * vdso compatible and native images differs
+	 */
+	old_vdso_base = unmap_vdso();
+
+	if (compat)
+		set_personality_ia32(false);
+	else
+		set_personality_64bit();
+
+	/*
+	 * Update mmap_base & get_unmapped_area helper, side effect:
+	 * one may change get_unmapped_area or mmap_base with personality()
+	 * or switching to and fro compatible mode
+	 */
+	arch_pick_mmap_layout(current->mm);
+
+	ret = map_vdso(compat, addr);
+	if (ret) {
+		current->mm->mmap_base = old_mmap_base;
+		if (compat)
+			set_personality_64bit();
+		else
+			set_personality_ia32(false);
+		WARN_ON(map_vdso(!compat, old_vdso_base));
+	}
+
+	return ret;
+}
+#endif
+
 long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
 {
 	int ret = 0;
@@ -592,6 +670,15 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
 		break;
 	}
 
+#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
+	case ARCH_SET_COMPAT:
+		return do_set_personality(true, addr);
+	case ARCH_SET_NATIVE:
+		return do_set_personality(false, addr);
+	case ARCH_GET_PERSONALITY:
+		return test_thread_flag(TIF_IA32);
+#endif
+
 	default:
 		ret = -EINVAL;
 		break;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/2] x86/tools/testing: add test for ARCH_SET_COMPAT
  2016-04-06 16:29 [PATCH 0/2] x86: add arch_prctl to switch between native/compat modes Dmitry Safonov
  2016-04-06 16:29 ` [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode Dmitry Safonov
@ 2016-04-06 16:29 ` Dmitry Safonov
  1 sibling, 0 replies; 26+ messages in thread
From: Dmitry Safonov @ 2016-04-06 16:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: luto, tglx, mingo, hpa, x86, shuahkh, bp, akpm, linux-kselftest,
	gorcunov, xemul, khorenko, 0x7f454c46, Dmitry Safonov

This is a simple test to determine whether arch_prctl(ARCH_SET_COMPAT)
works, by ptracing the switched application with PTRACE_GETREGSET -
it should return the 32-bit register set.

Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
CC: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 tools/testing/selftests/x86/Makefile               |   1 +
 .../testing/selftests/x86/arch_prctl_set_compat.c  | 295 +++++++++++++++++++++
 2 files changed, 296 insertions(+)
 create mode 100644 tools/testing/selftests/x86/arch_prctl_set_compat.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index b47ebd170690..bdadd12698f4 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -9,6 +9,7 @@ TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt ptrace_sysc
 TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
+TARGETS_C_64BIT_ONLY := arch_prctl_set_compat
 
 TARGETS_C_32BIT_ALL := $(TARGETS_C_BOTHBITS) $(TARGETS_C_32BIT_ONLY)
 TARGETS_C_64BIT_ALL := $(TARGETS_C_BOTHBITS) $(TARGETS_C_64BIT_ONLY)
diff --git a/tools/testing/selftests/x86/arch_prctl_set_compat.c b/tools/testing/selftests/x86/arch_prctl_set_compat.c
new file mode 100644
index 000000000000..326a46d4b056
--- /dev/null
+++ b/tools/testing/selftests/x86/arch_prctl_set_compat.c
@@ -0,0 +1,295 @@
+/*
+ * arch_prctl_set_compat.c - tests switching to compatible mode from 64-bit
+ * Copyright (c) 2016 Dmitry Safonov
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * This switches to compatible mode with the help of arch_prctl.
+ * The switch itself is a simple syscall, but one needs to unmap every
+ * vma above the 32-bit TASK_SIZE and to make raw 32/64-bit syscalls,
+ * so this is also a good worked example. By the end the tracee is a
+ * compat task that makes 32-bit syscalls to stop itself.
+ * To return to some 32-bit code it may be handy to use sigreturn
+ * with a prepared frame.
+ *
+ * Switching from a 32-bit compat application to native mode is just one
+ * arch_prctl syscall, so this covers the harder task: switching from
+ * native to compat mode.
+ */
+#define _GNU_SOURCE
+
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <signal.h>
+#include <stdint.h>
+
+#include <sys/syscall.h>
+#include <asm/prctl.h>
+#include <sys/prctl.h>
+#include <sys/personality.h>
+#include <sys/mman.h>
+
+#include <sys/uio.h>
+#include <sys/ptrace.h>
+#include <linux/elf.h>
+#include <sys/wait.h>
+
+#include <sys/stat.h>
+#include <fcntl.h>
+
+#define __stringify_1(x...)	#x
+#define __stringify(x...)	__stringify_1(x)
+
+#ifndef ARCH_SET_COMPAT
+#define ARCH_SET_COMPAT		0x2001
+#define ARCH_SET_NATIVE		0x2002
+#define ARCH_GET_PERSONALITY	0x2003
+#endif
+
+#define PAGE_SIZE		4096
+#define TASK_SIZE_MAX		((1UL << 47) - PAGE_SIZE)
+#define IA32_PAGE_OFFSET	0xFFFFe000
+
+/* Just a typical stack placement for an x86_64 compat task */
+#define STACK_START		0xffdb8000
+#define STACK_END		0xffdd9000
+
+/* Some arbitrary empty areas inside the compat address space */
+#define ARG_START 0xf77c8000
+#define ARG_END 0xf77c8000
+#define ENV_START 0xf77c8000
+#define ENV_END 0xf77c8000
+
+/*
+ * When removing all mappings above the compat TASK_SIZE, we also
+ * remove the libc mapping. That's the reason for the raw syscalls below.
+ */
+#define __NR_munmap		11
+#define __NR_arch_prctl		158
+
+#define __NR32_getpid		20
+#define __NR32_kill		37
+
+/* unmaps everything above IA32_PAGE_OFFSET */
+static inline void unmap_uncompat_mappings(void)
+{
+	unsigned long addr = IA32_PAGE_OFFSET;
+	unsigned long len = TASK_SIZE_MAX - IA32_PAGE_OFFSET;
+
+	asm volatile(
+		"	movq $"__stringify(__NR_munmap)", %%rax\n"
+		"	syscall\n"
+		:
+		: "D" (addr), "S" (len)
+	);
+}
+
+static inline void sys_arch_prctl(int code, unsigned long addr)
+{
+	asm volatile(
+		"	movq $"__stringify(__NR_arch_prctl)", %%rax\n"
+		"	syscall\n"
+		:
+		: "D" (code), "S" (addr)
+	);
+}
+
+static inline void
+prctl_print(int opcode, unsigned long arg1, unsigned long arg2)
+{
+	long ret = syscall(SYS_prctl, opcode, arg1, arg2, 0, 0);
+
+	if (ret)
+		fprintf(stderr, "[ERR]\tprctl failed with %ld : %m\n", ret);
+}
+
+/*
+ * Run in a different task just for test purposes:
+ * the tracer, with the help of PTRACE_GETREGSET, will fetch its register
+ * set size and determine whether it is a compat task.
+ * Then the tracer will kill the tracee, sorry about that.
+ */
+void tracee_func(void)
+{
+	personality(PER_LINUX32);
+
+	/* empty arg & env, moving them into the compat address space */
+	prctl_print(PR_SET_MM, PR_SET_MM_ARG_START, ARG_START);
+	prctl_print(PR_SET_MM, PR_SET_MM_ARG_END, ARG_END);
+	prctl_print(PR_SET_MM, PR_SET_MM_ENV_START, ENV_START);
+	prctl_print(PR_SET_MM, PR_SET_MM_ENV_END, ENV_END);
+
+	/* stack: get a new one */
+	if (mmap((void *)STACK_START, STACK_END - STACK_START,
+				PROT_READ | PROT_WRITE,
+				MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0)
+			== MAP_FAILED) {
+		fprintf(stderr, "[ERR]\tfailed to mmap new stack : %m\n");
+	} else {
+		prctl_print(PR_SET_MM, PR_SET_MM_START_STACK, STACK_START);
+		asm volatile (
+				"	mov %%rax,%%rsp\n"
+				:
+				: "a" (STACK_END));
+	}
+	/* we are cool guys: we have our own stack */
+
+	unmap_uncompat_mappings();
+	/*
+	 * we are poor boys: we unmapped everything, glibc included,
+	 * so do not use it from now on - we are on our own!
+	 */
+
+	sys_arch_prctl(ARCH_SET_COMPAT, 0);
+
+	/* Now switch to compatibility mode */
+	asm volatile (
+		"	pushq $0x23\n"	/* USER32_CS */
+		"	pushq $1f\n"
+		"	lretq\n"
+		/* here we are: ready to execute some 32-bit code */
+		"1:\n"
+		".code32\n"
+		/* getpid() */
+		"	movl $"__stringify(__NR32_getpid)", %eax\n"
+		"	int $0x80\n"
+		"	movl %eax, %ebx\n" /* pid */
+		/* raise SIGSTOP */
+		"	movl $"__stringify(__NR32_kill)", %eax\n"
+		"	movl $19, %ecx\n"
+		"	int $0x80\n"
+		".code64\n"
+	);
+
+}
+
+typedef struct {
+	uint64_t	r15, r14, r13, r12, bp, bx, r11, r10, r9, r8;
+	uint64_t	ax, cx, dx, si, di, orig_ax, ip, cs, flags;
+	uint64_t	sp, ss, fs_base, gs_base, ds, es, fs, gs;
+} user_regs_64;
+
+typedef struct {
+	uint32_t	bx, cx, dx, si, di, bp, ax, ds, es, fs, gs;
+	uint32_t	orig_ax, ip, cs, flags, sp, ss;
+} user_regs_32;
+
+typedef union {
+	user_regs_64 native;
+	user_regs_32 compat;
+} user_regs_struct_t;
+
+int ptrace_task_compatible(pid_t pid)
+{
+	struct iovec iov;
+	user_regs_struct_t r;
+	size_t reg_size = sizeof(user_regs_64);
+
+	iov.iov_base = &r;
+	iov.iov_len = reg_size;
+
+	errno = 0;
+	if (ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov)) {
+		fprintf(stderr, "[NOTE]\tCan't get register set: PTRACE_GETREGSET failed for pid %d : %m\n",
+			pid);
+		return 0;
+	}
+
+	return iov.iov_len == sizeof(user_regs_32);
+}
+
+void dump_proc_maps(pid_t pid)
+{
+#define BUF_SIZE	1024
+	char buf[BUF_SIZE];
+	int fd;
+	ssize_t nread;
+
+	snprintf(buf, BUF_SIZE, "/proc/%d/maps", pid);
+	fd = open(buf, O_RDONLY);
+	if (fd < 0) {
+		fprintf(stderr, "[NOTE]\tCan't open %s to dump : %m\n", buf);
+		return;
+	}
+
+	while ((nread = read(fd, buf, sizeof(buf))) > 0)
+		fwrite(buf, 1, nread, stdout);
+
+	close(fd);
+}
+
+int main(int argc, char **argv)
+{
+	pid_t pid;
+	int ret = 0;
+	int status;
+	int in_compat = syscall(SYS_arch_prctl, ARCH_GET_PERSONALITY, 0);
+	int dump_maps = 0;
+
+	if (in_compat < 0) {
+		fprintf(stderr,
+			"[ERR]\tSYS_arch_prctl returned %d : %m\n", in_compat);
+	}
+
+	if (in_compat == 1) {
+		fprintf(stderr, "[SKIP]\tRun this test from 64-bit x86 userspace\n");
+		return 0;
+	}
+
+	if (argc > 1)
+		dump_maps = !(strcmp(argv[1], "--dump-proc"));
+
+	if (dump_maps)
+		dump_proc_maps(getpid());
+
+	fflush(NULL);
+	pid = fork();
+	if (pid < 0) {
+		fprintf(stderr, "[SKIP]\tCan't fork : %m\n");
+		return 1;
+	}
+
+	if (pid == 0) { /* child */
+		ptrace(PTRACE_TRACEME, 0, 0, 0);
+		tracee_func();
+	}
+
+	/* parent, the tracer */
+	waitpid(pid, &status, 0);
+	if (WIFEXITED(status)) {
+		fprintf(stderr, "[FAIL]\tTracee exited unexpectedly\n");
+		return 2;
+	}
+
+	if (WIFSIGNALED(status)) {
+		fprintf(stderr, "[FAIL]\tTest killed with signal %d\n",
+				WTERMSIG(status));
+		return 3;
+	}
+
+	if (!WIFSTOPPED(status))
+		fprintf(stderr, "[NOTE]\twaitpid() returned, but tracee wasn't stopped\n");
+
+	if (!ptrace_task_compatible(pid)) {
+		fprintf(stderr, "[FAIL]\tTask didn't become compatible\n");
+		ret = 4;
+	}
+
+	if (dump_maps)
+		dump_proc_maps(pid);
+
+	kill(pid, SIGKILL);
+	fprintf(stderr, "[OK]\tSuccessfully changed mode to compatible\n");
+
+	return ret;
+}
+
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-06 16:29 ` [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode Dmitry Safonov
@ 2016-04-06 18:04   ` Andy Lutomirski
  2016-04-06 18:49     ` Andy Lutomirski
  2016-04-07 12:11     ` Dmitry Safonov
  0 siblings, 2 replies; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-06 18:04 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Shuah Khan, H. Peter Anvin, 0x7f454c46, khorenko, linux-kernel,
	Thomas Gleixner, Cyrill Gorcunov, Borislav Petkov, xemul,
	linux-kselftest, Andrew Morton, X86 ML, Ingo Molnar, Dave Hansen

[cc Dave Hansen for MPX]

On Apr 6, 2016 9:30 AM, "Dmitry Safonov" <dsafonov@virtuozzo.com> wrote:
>
> Now each process that runs natively on x86_64 may execute 32-bit code
> by proper setting it's CS selector: either from LDT or reuse Linux's
> USER32_CS. The vice-versa is also valid: running 64-bit code in
> compatible task is also possible by choosing USER_CS.
> So we may switch between 32 and 64 bit code execution in any process.
> Linux will choose the right syscall numbers in entries for those
> processes. But it still will consider them native/compat by the
> personality, that elf loader set on launch. This affects i.e., ptrace
> syscall on those tasks: PTRACE_GETREGSET will return 64/32-bit regset
> according to process's mode (that's how strace detect task's
> personality from 4.8 version).
>
> This patch adds arch_prctl calls for x86 that make possible to tell
> Linux kernel in which mode the application is running currently.
> Mainly, this is needed for CRIU: restoring compatible & native
> applications both from 64-bit restorer. By that reason I wrapped all
> the code in CONFIG_CHECKPOINT_RESTORE.
> This patch solves also a problem for running 64-bit code in 32-bit elf
> (and reverse), that you have only 32-bit elf vdso for fast syscalls.
> When switching between native <-> compat mode by arch_prctl, it will
> remap needed vdso binary blob for target mode.

General comments first:

You forgot about x32.

I think that you should separate vdso remapping from "personality".
vdso remapping should be available even on native 32-bit builds, which
means that either you can't use arch_prctl for it or you'll have to
wire up arch_prctl as a 32-bit syscall.

For "personality", someone needs to enumerate all of the various thigs
that try to track bitness and see how many of them even make sense.
On brief inspection:

 - TIF_IA32: affects signal format and does something to ptrace.  I
suspect that whatever it does to ptrace is nonsensical, and I don't
know whether we're stuck with it.

 - TIF_ADDR32 affects TASK_SIZE and mmap behavior (and the latter
isn't even done in a sensible way).

 - is_64bit_mm affects MPX and uprobes.

On even more brief inspection:

 - uprobes using is_64bit_mm is buggy.

 - I doubt that having TASK_SIZE vary serves any purpose.  Does anyone
know why TASK_SIZE is different for different tasks?  It would save
code size and speed things up if TASK_SIZE were always TASK_SIZE_MAX.

 - Using TIF_IA32 for signal processing is IMO suboptimal.  Instead,
we should record which syscall installed the signal handler and use
the corresponding frame format.

 - Using TIF_IA32 of the *target* for ptrace is nonsense.  Having
strace figure out syscall type using that is actively buggy, and I ran
into that bug a few days ago and cursed at it.  strace should inspect
TS_COMPAT (I don't know how, but that's what should happen).  We may
be stuck with this for ABI reasons.

 - For MPX, could we track which syscall called mpx_enable_management?
 I.e. can we stash in_compat_syscall's return from
mpx_enable_management and use that instead of trying to determine the
MPX data structure format by the mm's initial type?

If we make all of these changes, we're left with *only* the ptrace
thing, and you could certainly add a way for CRIU to switch that
around.

>
> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> Cc: Pavel Emelyanov <xemul@virtuozzo.com>
> Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
> CC: Dmitry Safonov <0x7f454c46@gmail.com>
> Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
> ---
>  arch/x86/entry/vdso/vma.c         | 76 ++++++++++++++++++++++++++--------
>  arch/x86/include/asm/vdso.h       |  5 +++
>  arch/x86/include/uapi/asm/prctl.h |  6 +++
>  arch/x86/kernel/process_64.c      | 87 +++++++++++++++++++++++++++++++++++++++
>  4 files changed, 157 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index 10f704584922..9a1561da0bad 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -156,22 +156,21 @@ static int vvar_fault(const struct vm_special_mapping *sm,
>         return VM_FAULT_SIGBUS;
>  }
>
> -static int map_vdso(const struct vdso_image *image, bool calculate_addr)
> +static int do_map_vdso(const struct vdso_image *image, bool calculate_addr,
> +               unsigned long addr)
>  {
>         struct mm_struct *mm = current->mm;
>         struct vm_area_struct *vma;
> -       unsigned long addr, text_start;
> +       unsigned long text_start;
>         int ret = 0;
>         static const struct vm_special_mapping vvar_mapping = {
>                 .name = "[vvar]",
>                 .fault = vvar_fault,
>         };
>
> -       if (calculate_addr) {
> +       if (calculate_addr && !addr) {
>                 addr = vdso_addr(current->mm->start_stack,
>                                  image->size - image->sym_vvar_start);
> -       } else {
> -               addr = 0;
>         }
>

This is overcomplicated.  Just pass in the address and use it as supplied.

>         down_write(&mm->mmap_sem);
> @@ -209,11 +208,11 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
>                                        VM_PFNMAP,
>                                        &vvar_mapping);
>
> -       if (IS_ERR(vma)) {
> +       if (IS_ERR(vma))
>                 ret = PTR_ERR(vma);
> -               goto up_fail;
> -       }
>
> +       if (ret)
> +               do_munmap(mm, addr, image->size - image->sym_vvar_start);
>  up_fail:
>         if (ret)
>                 current->mm->context.vdso = NULL;
> @@ -223,24 +222,28 @@ up_fail:
>  }
>
>  #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
> -static int load_vdso32(void)
> +static int load_vdso32(unsigned long addr)
>  {
>         if (vdso32_enabled != 1)  /* Other values all mean "disabled" */
>                 return 0;
>
> -       return map_vdso(&vdso_image_32, false);
> +       return do_map_vdso(&vdso_image_32, false, addr);
>  }
>  #endif

I'd just make it one function do_map_vdso(type, addr).

>
>  #ifdef CONFIG_X86_64
> -int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> +static int load_vdso64(unsigned long addr)
>  {
>         if (!vdso64_enabled)
>                 return 0;
>
> -       return map_vdso(&vdso_image_64, true);
> +       return do_map_vdso(&vdso_image_64, true, addr);
>  }
>
> +int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> +{
> +       return load_vdso64(0);
> +}
>  #ifdef CONFIG_COMPAT
>  int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
>                                        int uses_interp)
> @@ -250,20 +253,59 @@ int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
>                 if (!vdso64_enabled)
>                         return 0;
>
> -               return map_vdso(&vdso_image_x32, true);
> +               return do_map_vdso(&vdso_image_x32, true, 0);
>         }
>  #endif
>  #ifdef CONFIG_IA32_EMULATION
> -       return load_vdso32();
> +       return load_vdso32(0);

No special 0 please.

>  #else
>         return 0;
>  #endif
>  }
> -#endif
> -#else
> +#endif /* CONFIG_COMPAT */
> +
> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
> +unsigned long unmap_vdso(void)
> +{
> +       struct vm_area_struct *vma;
> +       unsigned long addr = (unsigned long)current->mm->context.vdso;
> +
> +       if (!addr)
> +               return 0;
> +
> +       /* vvar pages */
> +       vma = find_vma(current->mm, addr - 1);
> +       if (vma)
> +               vm_munmap(vma->vm_start, vma->vm_end - vma->vm_start);
> +
> +       /* vdso pages */
> +       vma = find_vma(current->mm, addr);
> +       if (vma)
> +               vm_munmap(vma->vm_start, vma->vm_end - vma->vm_start);
> +
> +       current->mm->context.vdso = NULL;
> +
> +       return addr;
> +}
> +/*
> + * Maps needed vdso type: vdso_image_32/vdso_image_64
> + * @compatible - true for compatible, false for native vdso image
> + * @addr - specify addr for vdso mapping (0 for random/searching)
> + * NOTE: be sure to set/clear thread-specific flags before
> + * calling this function.
> + */
> +int map_vdso(bool compatible, unsigned long addr)
> +{
> +       if (compatible)
> +               return load_vdso32(addr);
> +       else
> +               return load_vdso64(addr);
> +}

This makes sense.  But it can't be bool -- you forgot x32.

> +#endif /* CONFIG_IA32_EMULATION && CONFIG_CHECKPOINT_RESTORE */
> +#else /* !CONFIG_X86_64 */
>  int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>  {
> -       return load_vdso32();
> +       return load_vdso32(0);
>  }
>  #endif
>
> diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
> index 43dc55be524e..3ead7cc48a68 100644
> --- a/arch/x86/include/asm/vdso.h
> +++ b/arch/x86/include/asm/vdso.h
> @@ -39,6 +39,11 @@ extern const struct vdso_image vdso_image_x32;
>  extern const struct vdso_image vdso_image_32;
>  #endif
>
> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
> +extern int map_vdso(bool to_compat, unsigned long addr);
> +extern unsigned long unmap_vdso(void);
> +#endif
> +
>  extern void __init init_vdso_image(const struct vdso_image *image);
>
>  #endif /* __ASSEMBLER__ */
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 3ac5032fae09..455844f06485 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -6,4 +6,10 @@
>  #define ARCH_GET_FS 0x1003
>  #define ARCH_GET_GS 0x1004
>
> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
> +#define ARCH_SET_COMPAT                0x2001
> +#define ARCH_SET_NATIVE                0x2002
> +#define ARCH_GET_PERSONALITY   0x2003
> +#endif
> +
>  #endif /* _ASM_X86_PRCTL_H */
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 6cbab31ac23a..e50660d59530 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -49,6 +49,7 @@
>  #include <asm/debugreg.h>
>  #include <asm/switch_to.h>
>  #include <asm/xen/hypervisor.h>
> +#include <asm/vdso.h>
>
>  asmlinkage extern void ret_from_fork(void);
>
> @@ -505,6 +506,83 @@ void set_personality_ia32(bool x32)
>  }
>  EXPORT_SYMBOL_GPL(set_personality_ia32);
>
> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
> +/*
> + * Check if there are still some vmas (except vdso) for current,
> + * which placed above compatible TASK_SIZE.
> + * Check also code, data, stack, args and env placements.
> + * Returns true if all mappings are compatible.
> + */
> +static bool task_mappings_compatible(void)
> +{
> +       struct mm_struct *mm = current->mm;
> +       unsigned long top_addr = IA32_PAGE_OFFSET;
> +       struct vm_area_struct *vma = find_vma(mm, top_addr);
> +
> +       if (mm->end_code        > top_addr ||
> +           mm->end_data        > top_addr ||
> +           mm->start_stack     > top_addr ||
> +           mm->brk             > top_addr ||
> +           mm->arg_end         > top_addr ||
> +           mm->env_end         > top_addr)
> +               return false;
> +
> +       while (vma) {
> +               if ((vma->vm_start != (unsigned long)mm->context.vdso) &&
> +                   (vma->vm_end != (unsigned long)mm->context.vdso))
> +                       return false;
> +
> +               top_addr = vma->vm_end;
> +               vma = find_vma(mm, top_addr);
> +       }
> +
> +       return true;
> +}

What goes wrong if there are leftover high mappings?

> +
> +static int do_set_personality(bool compat, unsigned long addr)
> +{
> +       int ret;
> +       unsigned long old_vdso_base;
> +       unsigned long old_mmap_base = current->mm->mmap_base;
> +
> +       if (test_thread_flag(TIF_IA32) == compat) /* nothing to do */
> +               return 0;

Please don't.  Instead, remove TIF_IA32 entirely.

Also, please separate out ARCH_REMAP_VDSO from any personality change API.

> +
> +       if (compat && !task_mappings_compatible())
> +               return -EFAULT;
> +
> +       /*
> +        * We can't just remap vdso to needed location:
> +        * vdso compatible and native images differs
> +        */
> +       old_vdso_base = unmap_vdso();
> +
> +       if (compat)
> +               set_personality_ia32(false);
> +       else
> +               set_personality_64bit();
> +
> +       /*
> +        * Update mmap_base & get_unmapped_area helper, side effect:
> +        * one may change get_unmapped_area or mmap_base with personality()
> +        * or switching to and fro compatible mode
> +        */
> +       arch_pick_mmap_layout(current->mm);
> +
> +       ret = map_vdso(compat, addr);
> +       if (ret) {
> +               current->mm->mmap_base = old_mmap_base;
> +               if (compat)
> +                       set_personality_64bit();
> +               else
> +                       set_personality_ia32(false);
> +               WARN_ON(map_vdso(!compat, old_vdso_base));
> +       }
> +
> +       return ret;
> +}
> +#endif
> +
>  long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
>  {
>         int ret = 0;
> @@ -592,6 +670,15 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
>                 break;
>         }
>
> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
> +       case ARCH_SET_COMPAT:
> +               return do_set_personality(true, addr);
> +       case ARCH_SET_NATIVE:
> +               return do_set_personality(false, addr);
> +       case ARCH_GET_PERSONALITY:
> +               return test_thread_flag(TIF_IA32);
> +#endif
> +
>         default:
>                 ret = -EINVAL;
>                 break;
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-06 18:04   ` Andy Lutomirski
@ 2016-04-06 18:49     ` Andy Lutomirski
  2016-04-07 12:11     ` Dmitry Safonov
  1 sibling, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-06 18:49 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Shuah Khan, H. Peter Anvin, 0x7f454c46, khorenko, linux-kernel,
	Thomas Gleixner, Cyrill Gorcunov, Borislav Petkov, xemul,
	linux-kselftest, Andrew Morton, X86 ML, Ingo Molnar, Dave Hansen

On Wed, Apr 6, 2016 at 11:04 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> [cc Dave Hansen for MPX]
>

>  - For MPX, could we track which syscall called mpx_enable_management?
>  I.e. can we stash in_compat_syscall's return from
> mpx_enable_management and use that instead of trying to determine the
> MPX data structure format by the mm's initial type?
>

Even this may be more complicated than necessary.  Could the MPX
helpers just use user_64bit_mode?  After all, if your bounds
structures don't match your bitness and you take an MPX fault, you are
screwed no matter what the kernel does, so why not just have the
kernel helpers look at the user state?

Hmm.  There's mpx_unmap_tables, too, which complicates things.

FWIW, caching the bounds directory address may not be so important.
On my Skylake laptop, reading BNDCSR using XSAVE (RFBM set to just
BNDCSR) takes about 100 cycles and reading it using XSAVEC takes about
75 cycles.
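
(Rough sketch of that measurement, for reference - this assumes a
64-byte-aligned XSAVE area and BNDCSR being xstate component 4, i.e.
RFBM = 1 << 4; real kernel code would go through the xstate helpers:)

	static inline void xsave_bndcsr(struct xregs_state *buf)
	{
		uint32_t lo = 1U << 4, hi = 0;	/* RFBM: BNDCSR only */

		asm volatile("xsave %0"
			     : "+m" (*buf)
			     : "a" (lo), "d" (hi));
	}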

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-06 18:04   ` Andy Lutomirski
  2016-04-06 18:49     ` Andy Lutomirski
@ 2016-04-07 12:11     ` Dmitry Safonov
  2016-04-07 12:21       ` Cyrill Gorcunov
  2016-04-07 14:39       ` Andy Lutomirski
  1 sibling, 2 replies; 26+ messages in thread
From: Dmitry Safonov @ 2016-04-07 12:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Shuah Khan, H. Peter Anvin, 0x7f454c46, khorenko, linux-kernel,
	Thomas Gleixner, Cyrill Gorcunov, Borislav Petkov, xemul,
	linux-kselftest, Andrew Morton, X86 ML, Ingo Molnar, Dave Hansen

On 04/06/2016 09:04 PM, Andy Lutomirski wrote:
> [cc Dave Hansen for MPX]
>
> On Apr 6, 2016 9:30 AM, "Dmitry Safonov" <dsafonov@virtuozzo.com> wrote:
>> Now each process that runs natively on x86_64 may execute 32-bit code
>> by proper setting it's CS selector: either from LDT or reuse Linux's
>> USER32_CS. The vice-versa is also valid: running 64-bit code in
>> compatible task is also possible by choosing USER_CS.
>> So we may switch between 32 and 64 bit code execution in any process.
>> Linux will choose the right syscall numbers in entries for those
>> processes. But it still will consider them native/compat by the
>> personality, that elf loader set on launch. This affects i.e., ptrace
>> syscall on those tasks: PTRACE_GETREGSET will return 64/32-bit regset
>> according to process's mode (that's how strace detect task's
>> personality from 4.8 version).
>>
>> This patch adds arch_prctl calls for x86 that make possible to tell
>> Linux kernel in which mode the application is running currently.
>> Mainly, this is needed for CRIU: restoring compatible & native
>> applications both from 64-bit restorer. By that reason I wrapped all
>> the code in CONFIG_CHECKPOINT_RESTORE.
>> This patch solves also a problem for running 64-bit code in 32-bit elf
>> (and reverse), that you have only 32-bit elf vdso for fast syscalls.
>> When switching between native <-> compat mode by arch_prctl, it will
>> remap needed vdso binary blob for target mode.
> General comments first:
Thanks for your comments.
> You forgot about x32.
Will add x32 support for v2.
> I think that you should separate vdso remapping from "personality".
> vdso remapping should be available even on native 32-bit builds, which
> means that either you can't use arch_prctl for it or you'll have to
> wire up arch_prctl as a 32-bit syscall.
I can't say I got your point. By vdso remapping, do you mean
mremap for the vdso/vvar pages? I think that should work now.
I did the vdso remapping because the blob for a native x86_64 task
differs from the one for a compat task. So it's just changing blobs;
the address value is there for convenience - I may omit it and simply
map the different vdso blob at the same place where the previous one was.
I'm not sure why we would need the ability to map the 64-bit vdso blob
on native 32-bit builds?
> For "personality", someone needs to enumerate all of the various thigs
> that try to track bitness and see how many of them even make sense.
> On brief inspection:
>
>   - TIF_IA32: affects signal format and does something to ptrace.  I
> suspect that whatever it does to ptrace is nonsensical, and I don't
> know whether we're stuck with it.
>
>   - TIF_ADDR32 affects TASK_SIZE and mmap behavior (and the latter
> isn't even done in a sensible way).
>
>   - is_64bit_mm affects MPX and uprobes.
>
> On even more brief inspection:
>
>   - uprobes using is_64bit_mm is buggy.
>
>   - I doubt that having TASK_SIZE vary serves any purpose.  Does anyone
> know why TASK_SIZE is different for different tasks?  It would save
> code size and speed things up if TASK_SIZE were always TASK_SIZE_MAX.
>   - Using TIF_IA32 for signal processing is IMO suboptimal.  Instead,
> we should record which syscall installed the signal handler and use
> the corresponding frame format.
Oh, I like it, will do.
>   - Using TIF_IA32 of the *target* for ptrace is nonsense.  Having
> strace figure out syscall type using that is actively buggy, and I ran
> into that bug a few days ago and cursed at it.  strace should inspect
> TS_COMPAT (I don't know how, but that's what should happen).  We may
> be stuck with this for ABI reasons.
ptrace could check seg_32bit of the code selector - what do you think?
>   - For MPX, could we track which syscall called mpx_enable_management?
>   I.e. can we stash in_compat_syscall's return from
> mpx_enable_management and use that instead of trying to determine the
> MPX data structure format by the mm's initial type?
>
> If we make all of these changes, we're left with *only* the ptrace
> thing, and you could certainly add a way for CRIU to switch that
> around.
That sounds really good!
>
>> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
>> Cc: Pavel Emelyanov <xemul@virtuozzo.com>
>> Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
>> CC: Dmitry Safonov <0x7f454c46@gmail.com>
>> Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
>> ---
>>   arch/x86/entry/vdso/vma.c         | 76 ++++++++++++++++++++++++++--------
>>   arch/x86/include/asm/vdso.h       |  5 +++
>>   arch/x86/include/uapi/asm/prctl.h |  6 +++
>>   arch/x86/kernel/process_64.c      | 87 +++++++++++++++++++++++++++++++++++++++
>>   4 files changed, 157 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
>> index 10f704584922..9a1561da0bad 100644
>> --- a/arch/x86/entry/vdso/vma.c
>> +++ b/arch/x86/entry/vdso/vma.c
>> @@ -156,22 +156,21 @@ static int vvar_fault(const struct vm_special_mapping *sm,
>>          return VM_FAULT_SIGBUS;
>>   }
>>
>> -static int map_vdso(const struct vdso_image *image, bool calculate_addr)
>> +static int do_map_vdso(const struct vdso_image *image, bool calculate_addr,
>> +               unsigned long addr)
>>   {
>>          struct mm_struct *mm = current->mm;
>>          struct vm_area_struct *vma;
>> -       unsigned long addr, text_start;
>> +       unsigned long text_start;
>>          int ret = 0;
>>          static const struct vm_special_mapping vvar_mapping = {
>>                  .name = "[vvar]",
>>                  .fault = vvar_fault,
>>          };
>>
>> -       if (calculate_addr) {
>> +       if (calculate_addr && !addr) {
>>                  addr = vdso_addr(current->mm->start_stack,
>>                                   image->size - image->sym_vvar_start);
>> -       } else {
>> -               addr = 0;
>>          }
>>
> This is overcomplicated.  Just pass in the address and use it as supplied.
Will do.
>
>>          down_write(&mm->mmap_sem);
>> @@ -209,11 +208,11 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
>>                                         VM_PFNMAP,
>>                                         &vvar_mapping);
>>
>> -       if (IS_ERR(vma)) {
>> +       if (IS_ERR(vma))
>>                  ret = PTR_ERR(vma);
>> -               goto up_fail;
>> -       }
>>
>> +       if (ret)
>> +               do_munmap(mm, addr, image->size - image->sym_vvar_start);
>>   up_fail:
>>          if (ret)
>>                  current->mm->context.vdso = NULL;
>> @@ -223,24 +222,28 @@ up_fail:
>>   }
>>
>>   #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
>> -static int load_vdso32(void)
>> +static int load_vdso32(unsigned long addr)
>>   {
>>          if (vdso32_enabled != 1)  /* Other values all mean "disabled" */
>>                  return 0;
>>
>> -       return map_vdso(&vdso_image_32, false);
>> +       return do_map_vdso(&vdso_image_32, false, addr);
>>   }
>>   #endif
> I'd just make it one function do_map_vdso(type, addr).
Sure, will do
>
>>   #ifdef CONFIG_X86_64
>> -int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>> +static int load_vdso64(unsigned long addr)
>>   {
>>          if (!vdso64_enabled)
>>                  return 0;
>>
>> -       return map_vdso(&vdso_image_64, true);
>> +       return do_map_vdso(&vdso_image_64, true, addr);
>>   }
>>
>> +int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>> +{
>> +       return load_vdso64(0);
>> +}
>>   #ifdef CONFIG_COMPAT
>>   int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
>>                                         int uses_interp)
>> @@ -250,20 +253,59 @@ int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
>>                  if (!vdso64_enabled)
>>                          return 0;
>>
>> -               return map_vdso(&vdso_image_x32, true);
>> +               return do_map_vdso(&vdso_image_x32, true, 0);
>>          }
>>   #endif
>>   #ifdef CONFIG_IA32_EMULATION
>> -       return load_vdso32();
>> +       return load_vdso32(0);
> No special 0 please.
Yes, I will call vdso_addr(..) there instead of the special 0.
>
>>   #else
>>          return 0;
>>   #endif
>>   }
>> -#endif
>> -#else
>> +#endif /* CONFIG_COMPAT */
>> +
>> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
>> +unsigned long unmap_vdso(void)
>> +{
>> +       struct vm_area_struct *vma;
>> +       unsigned long addr = (unsigned long)current->mm->context.vdso;
>> +
>> +       if (!addr)
>> +               return 0;
>> +
>> +       /* vvar pages */
>> +       vma = find_vma(current->mm, addr - 1);
>> +       if (vma)
>> +               vm_munmap(vma->vm_start, vma->vm_end - vma->vm_start);
>> +
>> +       /* vdso pages */
>> +       vma = find_vma(current->mm, addr);
>> +       if (vma)
>> +               vm_munmap(vma->vm_start, vma->vm_end - vma->vm_start);
>> +
>> +       current->mm->context.vdso = NULL;
>> +
>> +       return addr;
>> +}
>> +/*
>> + * Maps needed vdso type: vdso_image_32/vdso_image_64
>> + * @compatible - true for compatible, false for native vdso image
>> + * @addr - specify addr for vdso mapping (0 for random/searching)
>> + * NOTE: be sure to set/clear thread-specific flags before
>> + * calling this function.
>> + */
>> +int map_vdso(bool compatible, unsigned long addr)
>> +{
>> +       if (compatible)
>> +               return load_vdso32(addr);
>> +       else
>> +               return load_vdso64(addr);
>> +}
> This makes sense.  But it can't be bool -- you forgot x32.
Sure, will rework for x32 support.
>
>> +#endif /* CONFIG_IA32_EMULATION && CONFIG_CHECKPOINT_RESTORE */
>> +#else /* !CONFIG_X86_64 */
>>   int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>>   {
>> -       return load_vdso32();
>> +       return load_vdso32(0);
>>   }
>>   #endif
>>
>> diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
>> index 43dc55be524e..3ead7cc48a68 100644
>> --- a/arch/x86/include/asm/vdso.h
>> +++ b/arch/x86/include/asm/vdso.h
>> @@ -39,6 +39,11 @@ extern const struct vdso_image vdso_image_x32;
>>   extern const struct vdso_image vdso_image_32;
>>   #endif
>>
>> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
>> +extern int map_vdso(bool to_compat, unsigned long addr);
>> +extern unsigned long unmap_vdso(void);
>> +#endif
>> +
>>   extern void __init init_vdso_image(const struct vdso_image *image);
>>
>>   #endif /* __ASSEMBLER__ */
>> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
>> index 3ac5032fae09..455844f06485 100644
>> --- a/arch/x86/include/uapi/asm/prctl.h
>> +++ b/arch/x86/include/uapi/asm/prctl.h
>> @@ -6,4 +6,10 @@
>>   #define ARCH_GET_FS 0x1003
>>   #define ARCH_GET_GS 0x1004
>>
>> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
>> +#define ARCH_SET_COMPAT                0x2001
>> +#define ARCH_SET_NATIVE                0x2002
>> +#define ARCH_GET_PERSONALITY   0x2003
>> +#endif
>> +
>>   #endif /* _ASM_X86_PRCTL_H */
>> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
>> index 6cbab31ac23a..e50660d59530 100644
>> --- a/arch/x86/kernel/process_64.c
>> +++ b/arch/x86/kernel/process_64.c
>> @@ -49,6 +49,7 @@
>>   #include <asm/debugreg.h>
>>   #include <asm/switch_to.h>
>>   #include <asm/xen/hypervisor.h>
>> +#include <asm/vdso.h>
>>
>>   asmlinkage extern void ret_from_fork(void);
>>
>> @@ -505,6 +506,83 @@ void set_personality_ia32(bool x32)
>>   }
>>   EXPORT_SYMBOL_GPL(set_personality_ia32);
>>
>> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
>> +/*
>> + * Check if there are still some vmas (except vdso) for current,
>> + * which placed above compatible TASK_SIZE.
>> + * Check also code, data, stack, args and env placements.
>> + * Returns true if all mappings are compatible.
>> + */
>> +static bool task_mappings_compatible(void)
>> +{
>> +       struct mm_struct *mm = current->mm;
>> +       unsigned long top_addr = IA32_PAGE_OFFSET;
>> +       struct vm_area_struct *vma = find_vma(mm, top_addr);
>> +
>> +       if (mm->end_code        > top_addr ||
>> +           mm->end_data        > top_addr ||
>> +           mm->start_stack     > top_addr ||
>> +           mm->brk             > top_addr ||
>> +           mm->arg_end         > top_addr ||
>> +           mm->env_end         > top_addr)
>> +               return false;
>> +
>> +       while (vma) {
>> +               if ((vma->vm_start != (unsigned long)mm->context.vdso) &&
>> +                   (vma->vm_end != (unsigned long)mm->context.vdso))
>> +                       return false;
>> +
>> +               top_addr = vma->vm_end;
>> +               vma = find_vma(mm, top_addr);
>> +       }
>> +
>> +       return true;
>> +}
> What goes wrong if there are leftover high mappings?
Nothing should. I just didn't expect that anyone
would want to "hide" mappings above the 32-bit
address space - that's why I added the check.
I'll drop this part.
>
>> +
>> +static int do_set_personality(bool compat, unsigned long addr)
>> +{
>> +       int ret;
>> +       unsigned long old_vdso_base;
>> +       unsigned long old_mmap_base = current->mm->mmap_base;
>> +
>> +       if (test_thread_flag(TIF_IA32) == compat) /* nothing to do */
>> +               return 0;
> Please don't.  Instead, remove TIF_IA32 entirely.
Thanks, I will remove TIF_IA32.
> Also, please separate out ARCH_REMAP_VDSO from any personality change API.
Ok.
>
>> +
>> +       if (compat && !task_mappings_compatible())
>> +               return -EFAULT;
>> +
>> +       /*
>> +        * We can't just remap vdso to needed location:
>> +        * vdso compatible and native images differs
>> +        */
>> +       old_vdso_base = unmap_vdso();
>> +
>> +       if (compat)
>> +               set_personality_ia32(false);
>> +       else
>> +               set_personality_64bit();
>> +
>> +       /*
>> +        * Update mmap_base & get_unmapped_area helper, side effect:
>> +        * one may change get_unmapped_area or mmap_base with personality()
>> +        * or switching to and fro compatible mode
>> +        */
>> +       arch_pick_mmap_layout(current->mm);
>> +
>> +       ret = map_vdso(compat, addr);
>> +       if (ret) {
>> +               current->mm->mmap_base = old_mmap_base;
>> +               if (compat)
>> +                       set_personality_64bit();
>> +               else
>> +                       set_personality_ia32(false);
>> +               WARN_ON(map_vdso(!compat, old_vdso_base));
>> +       }
>> +
>> +       return ret;
>> +}
>> +#endif
>> +
>>   long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
>>   {
>>          int ret = 0;
>> @@ -592,6 +670,15 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
>>                  break;
>>          }
>>
>> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
>> +       case ARCH_SET_COMPAT:
>> +               return do_set_personality(true, addr);
>> +       case ARCH_SET_NATIVE:
>> +               return do_set_personality(false, addr);
>> +       case ARCH_GET_PERSONALITY:
>> +               return test_thread_flag(TIF_IA32);
>> +#endif
>> +
>>          default:
>>                  ret = -EINVAL;
>>                  break;
>> --
>> 2.7.4
>>


-- 
Regards,
Dmitry Safonov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-07 12:11     ` Dmitry Safonov
@ 2016-04-07 12:21       ` Cyrill Gorcunov
  2016-04-07 12:35         ` Dmitry Safonov
  2016-04-07 14:39       ` Andy Lutomirski
  1 sibling, 1 reply; 26+ messages in thread
From: Cyrill Gorcunov @ 2016-04-07 12:21 UTC (permalink / raw)
  To: Dmitry Safonov, Andy Lutomirski
  Cc: Shuah Khan, H. Peter Anvin, 0x7f454c46, khorenko, linux-kernel,
	Thomas Gleixner, Borislav Petkov, xemul, linux-kselftest,
	Andrew Morton, X86 ML, Ingo Molnar, Dave Hansen

On Thu, Apr 07, 2016 at 03:11:24PM +0300, Dmitry Safonov wrote:
> On 04/06/2016 09:04 PM, Andy Lutomirski wrote:
> >[cc Dave Hansen for MPX]
> >
> Will add x32 support for v2.
> >I think that you should separate vdso remapping from "personality".
> >vdso remapping should be available even on native 32-bit builds, which
> >means that either you can't use arch_prctl for it or you'll have to
> >wire up arch_prctl as a 32-bit syscall.
> I cant say, I got your point. Do you mean by vdso remapping
> mremap for vdso/vvar pages? I think, it should work now.
> I did remapping for vdso as blob for native x86_64 task differs
> to compatible task. So it's just changing blobs, address value
> is there for convenience - I may omit it and just remap
> different vdso blob at the same place where was previous vdso.
> I'm not sure, why do we need possibility to map 64-bit vdso blob
> on native 32-bit builds?

First of all, thanks a lot for all your comments, Andy!
Dima, I don't see it in the patch - don't you have to take
mm::mmap_sem when walking over the VMAs?

	Cyrill

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-07 12:21       ` Cyrill Gorcunov
@ 2016-04-07 12:35         ` Dmitry Safonov
  0 siblings, 0 replies; 26+ messages in thread
From: Dmitry Safonov @ 2016-04-07 12:35 UTC (permalink / raw)
  To: Cyrill Gorcunov, Andy Lutomirski
  Cc: Shuah Khan, H. Peter Anvin, 0x7f454c46, khorenko, linux-kernel,
	Thomas Gleixner, Borislav Petkov, xemul, linux-kselftest,
	Andrew Morton, X86 ML, Ingo Molnar, Dave Hansen

On 04/07/2016 03:21 PM, Cyrill Gorcunov wrote:
> On Thu, Apr 07, 2016 at 03:11:24PM +0300, Dmitry Safonov wrote:
>> On 04/06/2016 09:04 PM, Andy Lutomirski wrote:
>>> [cc Dave Hansen for MPX]
>>>
>> Will add x32 support for v2.
>>> I think that you should separate vdso remapping from "personality".
>>> vdso remapping should be available even on native 32-bit builds, which
>>> means that either you can't use arch_prctl for it or you'll have to
>>> wire up arch_prctl as a 32-bit syscall.
>> I can't say I got your point. By vdso remapping, do you mean
>> mremap for the vdso/vvar pages? I think it should work now.
>> I did the remapping because the vdso blob for a native x86_64 task
>> differs from a compatible task's. So it's just changing blobs; the
>> address value is there for convenience - I may omit it and just remap
>> a different vdso blob at the same place where the previous vdso was.
>> I'm not sure why we need the possibility to map a 64-bit vdso blob
>> on native 32-bit builds.
> First of all, a huge thanks for all your comments, Andy!
> Dima, I'm not seeing it in the patch - don't you have to take
> mm::mmap_sem when walking over the VMAs?
I was sure I did, but it seems I missed it.
Thanks!
>
> 	Cyrill


-- 
Regards,
Dmitry Safonov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-07 12:11     ` Dmitry Safonov
  2016-04-07 12:21       ` Cyrill Gorcunov
@ 2016-04-07 14:39       ` Andy Lutomirski
  2016-04-07 15:18         ` Dmitry Safonov
  2016-04-08 13:50         ` Dmitry Safonov
  1 sibling, 2 replies; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-07 14:39 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Thomas Gleixner, Dmitry Safonov, Dave Hansen, Ingo Molnar,
	Shuah Khan, Borislav Petkov, X86 ML, khorenko, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, linux-kernel,
	H. Peter Anvin

On Apr 7, 2016 5:12 AM, "Dmitry Safonov" <dsafonov@virtuozzo.com> wrote:
>
> On 04/06/2016 09:04 PM, Andy Lutomirski wrote:
>>
>> [cc Dave Hansen for MPX]
>>
>> On Apr 6, 2016 9:30 AM, "Dmitry Safonov" <dsafonov@virtuozzo.com> wrote:
>>>
>>> Now each process that runs natively on x86_64 may execute 32-bit code
>>> by proper setting it's CS selector: either from LDT or reuse Linux's
>>> USER32_CS. The vice-versa is also valid: running 64-bit code in
>>> compatible task is also possible by choosing USER_CS.
>>> So we may switch between 32 and 64 bit code execution in any process.
>>> Linux will choose the right syscall numbers in entries for those
>>> processes. But it still will consider them native/compat by the
>>> personality that the elf loader set on launch. This affects, e.g., the
>>> ptrace syscall on those tasks: PTRACE_GETREGSET will return the 64/32-bit
>>> regset according to the process's mode (that's how strace detects a
>>> task's personality since version 4.8).
>>>
>>> This patch adds arch_prctl calls for x86 that make it possible to tell
>>> the Linux kernel in which mode the application is currently running.
>>> Mainly, this is needed for CRIU: restoring both compatible & native
>>> applications from a 64-bit restorer. For that reason I wrapped all
>>> the code in CONFIG_CHECKPOINT_RESTORE.
>>> This patch also solves a problem with running 64-bit code in a 32-bit
>>> elf (and the reverse): that you have only the 32-bit elf vdso for fast
>>> syscalls. When switching between native <-> compat mode by arch_prctl,
>>> it will remap the needed vdso binary blob for the target mode.
>>
>> General comments first:
>
> Thanks for your comments.
>>
>> You forgot about x32.
>
> Will add x32 support for v2.
>
>> I think that you should separate vdso remapping from "personality".
>> vdso remapping should be available even on native 32-bit builds, which
>> means that either you can't use arch_prctl for it or you'll have to
>> wire up arch_prctl as a 32-bit syscall.
>
> I can't say I got your point. By vdso remapping, do you mean
> mremap for the vdso/vvar pages? I think it should work now.

For 32-bit, the vdso *must* exist in memory at the address that the
kernel thinks it's at.  Even if you had a pure 32-bit restore stub,
you would still need vdso remap, because there's a chance the vdso
could land at an unusable address, say one page off from where you
want it.  You couldn't map a wrapper because there wouldn't be any
space for it without moving the real vdso out of the way.

Remember, you *cannot* mremap() the 32-bit vdso because you will
crash.  It works by luck for 64-bit, but it's plausible that we'd want
to change that some day.  (I have awful patches that speed a bunch of
things up at the cost of a vdso trampoline for 64-bit code and a bunch
of other hacks.  Those patches will never go in for real, but
something else might want the ability to use 64-bit vdso trampolines.)
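To make that concrete: the 32-bit fast-syscall exit conceptually does
(a simplified sketch of do_fast_syscall_32)

    /* the user return address is derived from mm->context.vdso, so it
     * goes stale if the vdso is moved behind the kernel's back */
    unsigned long landing_pad = (unsigned long)current->mm->context.vdso +
                                vdso_image_32.sym_int80_landing_pad;
    regs->ip = landing_pad;

so after an mremap() the next syscall returns into whatever happens to
live at the old address.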

> I did the remapping because the vdso blob for a native x86_64 task
> differs from a compatible task's. So it's just changing blobs; the
> address value is there for convenience - I may omit it and just remap
> a different vdso blob at the same place where the previous vdso was.
> I'm not sure why we need the possibility to map a 64-bit vdso blob
> on native 32-bit builds.

That would fail, but I think the API should exist.  But a native
32-bit program should be able to remap the 32-bit vdso.

IOW, I think you should be able to do, roughly:

map_new_vdso(VDSO_32BIT, addr);

on any kernel.

Am I making sense?
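To sketch it (purely hypothetical - neither the enum nor the helpers
below exist, they just illustrate the shape of the API):

    enum vdso_type { VDSO_32BIT, VDSO_64BIT, VDSO_X32 };

    long map_new_vdso(enum vdso_type type, unsigned long addr)
    {
            /* vdso_image_for() is made up; a native 32-bit kernel
             * would only have an image for VDSO_32BIT */
            const struct vdso_image *image = vdso_image_for(type);

            if (!image)
                    return -EINVAL;
            return __map_vdso(image, addr); /* also made up */
    }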

>
>> For "personality", someone needs to enumerate all of the various thigs
>> that try to track bitness and see how many of them even make sense.
>> On brief inspection:
>>
>>   - TIF_IA32: affects signal format and does something to ptrace.  I
>> suspect that whatever it does to ptrace is nonsensical, and I don't
>> know whether we're stuck with it.
>>
>>   - TIF_ADDR32 affects TASK_SIZE and mmap behavior (and the latter
>> isn't even done in a sensible way).
>>
>>   - is_64bit_mm affects MPX and uprobes.
>>
>> On even more brief inspection:
>>
>>   - uprobes using is_64bit_mm is buggy.
>>
>>   - I doubt that having TASK_SIZE vary serves any purpose.  Does anyone
>> know why TASK_SIZE is different for different tasks?  It would save
>> code size and speed things up if TASK_SIZE were always TASK_SIZE_MAX.
>>   - Using TIF_IA32 for signal processing is IMO suboptimal.  Instead,
>> we should record which syscall installed the signal handler and use
>> the corresponding frame format.
>
> Oh, I like it, will do.
>
>>   - Using TIF_IA32 of the *target* for ptrace is nonsense.  Having
>> strace figure out syscall type using that is actively buggy, and I ran
>> into that bug a few days ago and cursed at it.  strace should inspect
>> TS_COMPAT (I don't know how, but that's what should happen).  We may
>> be stuck with this for ABI reasons.
>
> ptrace may check seg_32bit of the code selector - what do you think?

Not sure.  I have never fully wrapped my head around ptrace.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-07 14:39       ` Andy Lutomirski
@ 2016-04-07 15:18         ` Dmitry Safonov
  2016-04-08 13:50         ` Dmitry Safonov
  1 sibling, 0 replies; 26+ messages in thread
From: Dmitry Safonov @ 2016-04-07 15:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Dmitry Safonov, Dave Hansen, Ingo Molnar,
	Shuah Khan, Borislav Petkov, X86 ML, khorenko, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, linux-kernel,
	H. Peter Anvin

On 04/07/2016 05:39 PM, Andy Lutomirski wrote:
> On Apr 7, 2016 5:12 AM, "Dmitry Safonov" <dsafonov@virtuozzo.com> wrote:
>> On 04/06/2016 09:04 PM, Andy Lutomirski wrote:
>>> [cc Dave Hansen for MPX]
>>>
>>> On Apr 6, 2016 9:30 AM, "Dmitry Safonov" <dsafonov@virtuozzo.com> wrote:
>>>> Now each process that runs natively on x86_64 may execute 32-bit code
>>>> by proper setting it's CS selector: either from LDT or reuse Linux's
>>>> USER32_CS. The vice-versa is also valid: running 64-bit code in
>>>> compatible task is also possible by choosing USER_CS.
>>>> So we may switch between 32 and 64 bit code execution in any process.
>>>> Linux will choose the right syscall numbers in entries for those
>>>> processes. But it still will consider them native/compat by the
>>>> personality that the elf loader set on launch. This affects, e.g., the
>>>> ptrace syscall on those tasks: PTRACE_GETREGSET will return the 64/32-bit
>>>> regset according to the process's mode (that's how strace detects a
>>>> task's personality since version 4.8).
>>>>
>>>> This patch adds arch_prctl calls for x86 that make it possible to tell
>>>> the Linux kernel in which mode the application is currently running.
>>>> Mainly, this is needed for CRIU: restoring both compatible & native
>>>> applications from a 64-bit restorer. For that reason I wrapped all
>>>> the code in CONFIG_CHECKPOINT_RESTORE.
>>>> This patch also solves a problem with running 64-bit code in a 32-bit
>>>> elf (and the reverse): that you have only the 32-bit elf vdso for fast
>>>> syscalls. When switching between native <-> compat mode by arch_prctl,
>>>> it will remap the needed vdso binary blob for the target mode.
>>> General comments first:
>> Thanks for your comments.
>>> You forgot about x32.
>> Will add x32 support for v2.
>>
>>> I think that you should separate vdso remapping from "personality".
>>> vdso remapping should be available even on native 32-bit builds, which
>>> means that either you can't use arch_prctl for it or you'll have to
>>> wire up arch_prctl as a 32-bit syscall.
>> I can't say I got your point. By vdso remapping, do you mean
>> mremap for the vdso/vvar pages? I think it should work now.
> For 32-bit, the vdso *must* exist in memory at the address that the
> kernel thinks it's at.  Even if you had a pure 32-bit restore stub,
> you would still need vdso remap, because there's a chance the vdso
> could land at an unusable address, say one page off from where you
> want it.  You couldn't map a wrapper because there wouldn't be any
> space for it without moving the real vdso out of the way.
>
> Remember, you *cannot* mremap() the 32-bit vdso because you will
> crash.  It works by luck for 64-bit, but it's plausible that we'd want
> to change that some day.  (I have awful patches that speed a bunch of
> things up at the cost of a vdso trampoline for 64-bit code and a bunch
> of other hacks.  Those patches will never go in for real, but
> something else might want the ability to use 64-bit vdso trampolines.)
Thanks for the elaboration, now I see. Signals and fast syscalls
expect mm->context.vdso to be correct.
>
>> I did the remapping because the vdso blob for a native x86_64 task
>> differs from a compatible task's. So it's just changing blobs; the
>> address value is there for convenience - I may omit it and just remap
>> a different vdso blob at the same place where the previous vdso was.
>> I'm not sure why we need the possibility to map a 64-bit vdso blob
>> on native 32-bit builds.
> That would fail, but I think the API should exist.  But a native
> 32-bit program should be able to remap the 32-bit vdso.
>
> IOW, I think you should be able to do, roughly:
>
> map_new_vdso(VDSO_32BIT, addr);
>
> on any kernel.
>
> Am I making sense?
Yes. I will rework it into such an API.
>
>>> For "personality", someone needs to enumerate all of the various thigs
>>> that try to track bitness and see how many of them even make sense.
>>> On brief inspection:
>>>
>>>    - TIF_IA32: affects signal format and does something to ptrace.  I
>>> suspect that whatever it does to ptrace is nonsensical, and I don't
>>> know whether we're stuck with it.
>>>
>>>    - TIF_ADDR32 affects TASK_SIZE and mmap behavior (and the latter
>>> isn't even done in a sensible way).
>>>
>>>    - is_64bit_mm affects MPX and uprobes.
>>>
>>> On even more brief inspection:
>>>
>>>    - uprobes using is_64bit_mm is buggy.
>>>
>>>    - I doubt that having TASK_SIZE vary serves any purpose.  Does anyone
>>> know why TASK_SIZE is different for different tasks?  It would save
>>> code size and speed things up if TASK_SIZE were always TASK_SIZE_MAX.
>>>    - Using TIF_IA32 for signal processing is IMO suboptimal.  Instead,
>>> we should record which syscall installed the signal handler and use
>>> the corresponding frame format.
>> Oh, I like it, will do.
>>
>>>    - Using TIF_IA32 of the *target* for ptrace is nonsense.  Having
>>> strace figure out syscall type using that is actively buggy, and I ran
>>> into that bug a few days ago and cursed at it.  strace should inspect
>>> TS_COMPAT (I don't know how, but that's what should happen).  We may
>>> be stuck with this for ABI reasons.
>> ptrace may check seg_32bit of the code selector - what do you think?
> Not sure.  I have never fully wrapped my head around ptrace.
Hm, after some thinking, I guess it's better to check TS_COMPAT:
it's set up on compat syscall entry, so there is no need to
check seg_32bit anyway.
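Roughly the check I mean - a sketch modeled on syscall_get_arch(), the
helper name is invented:

    /* "is this task currently inside a 32-bit syscall?" */
    static inline bool task_in_compat_syscall(struct task_struct *tsk)
    {
            return task_thread_info(tsk)->status & TS_COMPAT;
    }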

Huge thanks, will work on v2 according to your comments.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-07 14:39       ` Andy Lutomirski
  2016-04-07 15:18         ` Dmitry Safonov
@ 2016-04-08 13:50         ` Dmitry Safonov
  2016-04-08 15:56           ` Andy Lutomirski
  1 sibling, 1 reply; 26+ messages in thread
From: Dmitry Safonov @ 2016-04-08 13:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Dmitry Safonov, Dave Hansen, Ingo Molnar,
	Shuah Khan, Borislav Petkov, X86 ML, khorenko, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, linux-kernel,
	H. Peter Anvin

[-- Attachment #1: Type: text/plain, Size: 1790 bytes --]

On 04/07/2016 05:39 PM, Andy Lutomirski wrote:
> For 32-bit, the vdso *must* exist in memory at the address that the
> kernel thinks it's at.  Even if you had a pure 32-bit restore stub,
> you would still need vdso remap, because there's a chance the vdso
> could land at an unusable address, say one page off from where you
> want it.  You couldn't map a wrapper because there wouldn't be any
> space for it without moving the real vdso out of the way.
>
> Remember, you *cannot* mremap() the 32-bit vdso because you will
> crash.  It works by luck for 64-bit, but it's plausible that we'd want
> to change that some day.  (I have awful patches that speed a bunch of
> things up at the cost of a vdso trampoline for 64-bit code and a bunch
> of other hacks.  Those patches will never go in for real, but
> something else might want the ability to use 64-bit vdso trampolines.)
Hello again,
what do you think about the attached patch?
I think it should fix the landing problem for i386 vdso mremap.
It does not touch the fast syscall path, so there should be no
speed regression.
>> I did the remapping because the vdso blob for a native x86_64 task
>> differs from a compatible task's. So it's just changing blobs; the
>> address value is there for convenience - I may omit it and just remap
>> a different vdso blob at the same place where the previous vdso was.
>> I'm not sure why we need the possibility to map a 64-bit vdso blob
>> on native 32-bit builds.
> That would fail, but I think the API should exist.  But a native
> 32-bit program should be able to remap the 32-bit vdso.
>
> IOW, I think you should be able to do, roughly:
>
> map_new_vdso(VDSO_32BIT, addr);
>
> on any kernel.
>
> Am I making sense?
I will still work on this interface - I just wanted
the usual mremap to work on vdso mappings.

Thanks,
Dmitry.

[-- Attachment #2: 0001-x86-vdso-add-mremap-hook-to-vm_special_mapping.patch --]
[-- Type: text/x-patch, Size: 4202 bytes --]

From ffba027156d7194a07ccccd80ece9be19107e4e6 Mon Sep 17 00:00:00 2001
From: Dmitry Safonov <dsafonov@virtuozzo.com>
Date: Fri, 8 Apr 2016 16:27:18 +0300
Subject: [PATCH] x86/vdso: add mremap hook to vm_special_mapping

This patch adds the possibility for 32-bit userspace applications
to move the vdso mapping. Previously, when a userspace app called
mremap for the vdso, on the return path it would land on the previous
address of the vdso page, resulting in a segmentation violation.
Now it lands fine and returns to userspace with the remapped vdso.

There is still a problem with remapping the vdso in glibc applications:
the linker relocates addresses for syscalls on the vdso page, so
you need to relink with the new addresses. Otherwise the next syscall
through glibc may fail:
  Program received signal SIGSEGV, Segmentation fault.
  #0  0xf7fd9b80 in __kernel_vsyscall ()
  #1  0xf7ec8238 in _exit () from /usr/lib32/libc.so.6

Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 arch/x86/entry/vdso/vma.c | 29 ++++++++++++++++++++++++-----
 include/linux/mm_types.h  |  3 +++
 mm/mmap.c                 | 10 ++++++++++
 3 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 10f704584922..ed6c9fa64531 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -12,6 +12,7 @@
 #include <linux/random.h>
 #include <linux/elf.h>
 #include <linux/cpu.h>
+#include <linux/ptrace.h>
 #include <asm/pvclock.h>
 #include <asm/vgtod.h>
 #include <asm/proto.h>
@@ -98,10 +99,22 @@ static int vdso_fault(const struct vm_special_mapping *sm,
 	return 0;
 }
 
-static const struct vm_special_mapping text_mapping = {
-	.name = "[vdso]",
-	.fault = vdso_fault,
-};
+static int vdso_mremap(const struct vm_special_mapping *sm,
+		      struct vm_area_struct *new_vma)
+{
+	struct pt_regs *regs = current_pt_regs();
+
+	new_vma->vm_mm->context.vdso = (void __user *)new_vma->vm_start;
+
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+	/* Fixing userspace landing - look at do_fast_syscall_32 */
+	if (current_thread_info()->status & TS_COMPAT)
+		regs->ip = (unsigned long)current->mm->context.vdso +
+			vdso_image_32.sym_int80_landing_pad;
+#endif
+
+	return 0;
+}
 
 static int vvar_fault(const struct vm_special_mapping *sm,
 		      struct vm_area_struct *vma, struct vm_fault *vmf)
@@ -162,6 +175,12 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 	struct vm_area_struct *vma;
 	unsigned long addr, text_start;
 	int ret = 0;
+
+	static const struct vm_special_mapping vdso_mapping = {
+		.name = "[vdso]",
+		.fault = vdso_fault,
+		.mremap = vdso_mremap,
+	};
 	static const struct vm_special_mapping vvar_mapping = {
 		.name = "[vvar]",
 		.fault = vvar_fault,
@@ -195,7 +214,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 				       image->size,
 				       VM_READ|VM_EXEC|
 				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-				       &text_mapping);
+				       &vdso_mapping);
 
 	if (IS_ERR(vma)) {
 		ret = PTR_ERR(vma);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c2d75b4fa86c..4d16ab9287af 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -586,6 +586,9 @@ struct vm_special_mapping {
 	int (*fault)(const struct vm_special_mapping *sm,
 		     struct vm_area_struct *vma,
 		     struct vm_fault *vmf);
+
+	int (*mremap)(const struct vm_special_mapping *sm,
+		     struct vm_area_struct *new_vma);
 };
 
 enum tlb_flush_reason {
diff --git a/mm/mmap.c b/mm/mmap.c
index bd2e1a533bc1..ba71658dd1a1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2930,9 +2930,19 @@ static const char *special_mapping_name(struct vm_area_struct *vma)
 	return ((struct vm_special_mapping *)vma->vm_private_data)->name;
 }
 
+static int special_mapping_mremap(struct vm_area_struct *new_vma)
+{
+	struct vm_special_mapping *sm = new_vma->vm_private_data;
+
+	if (sm->mremap)
+		return sm->mremap(sm, new_vma);
+	return 0;
+}
+
 static const struct vm_operations_struct special_mapping_vmops = {
 	.close = special_mapping_close,
 	.fault = special_mapping_fault,
+	.mremap = special_mapping_mremap,
 	.name = special_mapping_name,
 };
 
-- 
2.8.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-08 13:50         ` Dmitry Safonov
@ 2016-04-08 15:56           ` Andy Lutomirski
  2016-04-08 16:18             ` Dmitry Safonov
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-08 15:56 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Thomas Gleixner, Dmitry Safonov, Dave Hansen, Ingo Molnar,
	Shuah Khan, Borislav Petkov, X86 ML, khorenko, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, linux-kernel,
	H. Peter Anvin

On Fri, Apr 8, 2016 at 6:50 AM, Dmitry Safonov <dsafonov@virtuozzo.com> wrote:
> On 04/07/2016 05:39 PM, Andy Lutomirski wrote:
>>
>> For 32-bit, the vdso *must* exist in memory at the address that the
>> kernel thinks it's at.  Even if you had a pure 32-bit restore stub,
>> you would still need vdso remap, because there's a chance the vdso
>> could land at an unusable address, say one page off from where you
>> want it.  You couldn't map a wrapper because there wouldn't be any
>> space for it without moving the real vdso out of the way.
>>
>> Remember, you *cannot* mremap() the 32-bit vdso because you will
>> crash.  It works by luck for 64-bit, but it's plausible that we'd want
>> to change that some day.  (I have awful patches that speed a bunch of
>> things up at the cost of a vdso trampoline for 64-bit code and a bunch
>> of other hacks.  Those patches will never go in for real, but
>> something else might want the ability to use 64-bit vdso trampolines.)
>
> Hello again,
> what do you think about the attached patch?
> I think it should fix the landing problem for i386 vdso mremap.
> It does not touch the fast syscall path, so there should be no
> speed regression.
>>>
>>> I did the remapping because the vdso blob for a native x86_64 task
>>> differs from a compatible task's. So it's just changing blobs; the
>>> address value is there for convenience - I may omit it and just remap
>>> a different vdso blob at the same place where the previous vdso was.
>>> I'm not sure why we need the possibility to map a 64-bit vdso blob
>>> on native 32-bit builds.
>>
>> That would fail, but I think the API should exist.  But a native
>> 32-bit program should be able to remap the 32-bit vdso.
>>
>> IOW, I think you should be able to do, roughly:
>>
>> map_new_vdso(VDSO_32BIT, addr);
>>
>> on any kernel.
>>
>> Am I making sense?
>
> I will still work on this interface - I just wanted
> the usual mremap to work on vdso mappings.

For this thing:

+    /* Fixing userspace landing - look at do_fast_syscall_32 */
+    if (current_thread_info()->status & TS_COMPAT)
+        regs->ip = (unsigned long)current->mm->context.vdso +
+            vdso_image_32.sym_int80_landing_pad;

Either check that ip was where you expected it or simply remove this
code -- user programs that are mremapping the vdso are already playing
with fire and can just use int $0x80 to do it.
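The check-first variant would be roughly this (a sketch only, with
old_addr standing for the pre-move vdso base):

    /* only fix up a task actually parked on the old landing pad */
    if (regs->ip == old_addr + vdso_image_32.sym_int80_landing_pad)
            regs->ip = new_vma->vm_start +
                       vdso_image_32.sym_int80_landing_pad;

Otherwise leave regs->ip alone.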

Other than that, it looks generally sane.  The .mremap hook didn't
exist last time I looked at this :)

The main downside of your approach is that it doesn't allow switching
between the 32-bit, 64-bit, and x32 images.  Also, it requires
awareness of how vvar and vdso line up, whereas a dedicated API could
do the whole thing.

>
> Thanks,
> Dmitry.



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-08 15:56           ` Andy Lutomirski
@ 2016-04-08 16:18             ` Dmitry Safonov
  2016-04-08 20:44               ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Dmitry Safonov @ 2016-04-08 16:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Dmitry Safonov, Dave Hansen, Ingo Molnar,
	Shuah Khan, Borislav Petkov, X86 ML, khorenko, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, linux-kernel,
	H. Peter Anvin

On 04/08/2016 06:56 PM, Andy Lutomirski wrote:
> On Fri, Apr 8, 2016 at 6:50 AM, Dmitry Safonov <dsafonov@virtuozzo.com> wrote:
>> Hello again,
>> what do you think about the attached patch?
>> I think it should fix the landing problem for i386 vdso mremap.
>> It does not touch the fast syscall path, so there should be no
>> speed regression.
> For this thing:
>
> +    /* Fixing userspace landing - look at do_fast_syscall_32 */
> +    if (current_thread_info()->status & TS_COMPAT)
> +        regs->ip = (unsigned long)current->mm->context.vdso +
> +            vdso_image_32.sym_int80_landing_pad;
>
> Either check that ip was where you expected it
And if it's not there - return an error?
>   or simply remove this
> code -- user programs that are mremapping the vdso are already playing
> with fire and can just use int $0x80 to do it.
>
> Other than that, it looks generally sane.  The .mremap hook didn't
> exist last time I looked at this :)
>
> The main downside of your approach is that it doesn't allow switching
> between the 32-bit, 64-bit, and x32 images.  Also, it requires
> awareness of how vvar and vdso line up, whereas a dedicated API could
> do the whole thing.
Yes, I'm working on it. This patch will only allow moving the vdso
image with a general mremap - so I could use arch_prctl for
that API, since on native i386 one may move the vdso with mremap
but cannot map any other vdso blobs.
Does that sound fine?

So, I have some difficulties with removing the TIF_IA32 flag:
it's checked by perf for interpreting stack frames/instructions
and may be checked outside of syscall execution (when tracing
page fault events, for example). I wonder: would it be sane to remove
TS_COMPAT instead, leaving TIF_IA32, since in some cases
we need to know whether a task is compatible outside of the syscall path?
And the comment in asm/syscall.h says:
 >  * TIF_IA32 tasks should always have TS_COMPAT set at
 >  * system call time.
which means that TS_COMPAT is always set on TIF_IA32 tasks, and so
is redundant.
What do you think?

Thanks,
Dmitry.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-08 16:18             ` Dmitry Safonov
@ 2016-04-08 20:44               ` Andy Lutomirski
  2016-04-09  8:06                 ` Dmitry Safonov
  2016-04-13 16:55                 ` Dmitry Safonov
  0 siblings, 2 replies; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-08 20:44 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Thomas Gleixner, Shuah Khan, Ingo Molnar, Dave Hansen,
	Borislav Petkov, khorenko, X86 ML, Andrew Morton, xemul,
	linux-kselftest, Cyrill Gorcunov, Dmitry Safonov, linux-kernel,
	H. Peter Anvin

On Apr 8, 2016 9:20 AM, "Dmitry Safonov" <dsafonov@virtuozzo.com> wrote:
>
> On 04/08/2016 06:56 PM, Andy Lutomirski wrote:
>>
>> On Fri, Apr 8, 2016 at 6:50 AM, Dmitry Safonov <dsafonov@virtuozzo.com> wrote:
>>>
>>> Hello again,
>>> what do you think about the attached patch?
>>> I think it should fix the landing problem for i386 vdso mremap.
>>> It does not touch the fast syscall path, so there should be no
>>> speed regression.
>>
>> For this thing:
>>
>>
>> +    /* Fixing userspace landing - look at do_fast_syscall_32 */
>> +    if (current_thread_info()->status & TS_COMPAT)
>> +        regs->ip = (unsigned long)current->mm->context.vdso +
>> +            vdso_image_32.sym_int80_landing_pad;
>>
>> Either check that ip was where you expected it
>
> And if it's not there - return an error?

No, just leave IP unchanged.

>
>>   or simply remove this
>> code -- user programs that are mremapping the vdso are already playing
>> with fire and can just use int $0x80 to do it.
>>
>> Other than that, it looks generally sane.  The .mremap hook didn't
>> exist last time I looked at this :)
>>
>> The main downside of your approach is that it doesn't allow switching
>> between the 32-bit, 64-bit, and x32 images.  Also, it requires
>> awareness of how vvar and vdso line up, whereas a dedicated API could
>> do the whole thing.
>
> Yes, I'm working on it. This patch will only allow moving the vdso
> image with a general mremap - so I could use arch_prctl for
> that API, since on native i386 one may move the vdso with mremap
> but cannot map any other vdso blobs.
> Does that sound fine?
>
> So, I have some difficulties with removing the TIF_IA32 flag:
> it's checked by perf for interpreting stack frames/instructions
> and may be checked outside of syscall execution (when tracing
> page fault events, for example).

Feel free to ask for help on some of these details.  user_64bit_mode
will be helpful too.

> I wonder: would it be sane to remove
> TS_COMPAT instead, leaving TIF_IA32, since in some cases
> we need to know whether a task is compatible outside of the syscall path?

No.  TS_COMPAT is important, and it's also better behaved than
TIF_IA32 -- it has a very specific meaning: "am I currently executing
a 32-bit syscall".

> And the comment in asm/syscall.h says:
> >  * TIF_IA32 tasks should always have TS_COMPAT set at
> >  * system call time.
> which means that TS_COMPAT is always set on TIF_IA32 tasks, and so
> is redundant.
> What do you think?

The comment is wrong :). TS_COMPAT is true on int80 or 32-bit vdso
syscall entries and is false otherwise.  64-bit tasks can use int80
and, with your patches, will be able to use the 32-bit vdso entry as
well.
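
Roughly its lifecycle, simplified (ti = current_thread_info()):

    /* cf. do_syscall_32_irqs_on() and the syscall exit path */
    ti->status |= TS_COMPAT;        /* int $0x80 / 32-bit vdso entry */
    /* ... dispatch the 32-bit syscall ... */
    ti->status &= ~TS_COMPAT;       /* cleared on the way back out */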

>
> Thanks,
> Dmitry.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-08 20:44               ` Andy Lutomirski
@ 2016-04-09  8:06                 ` Dmitry Safonov
  2016-04-13 16:55                 ` Dmitry Safonov
  1 sibling, 0 replies; 26+ messages in thread
From: Dmitry Safonov @ 2016-04-09  8:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Safonov, Thomas Gleixner, Shuah Khan, Ingo Molnar,
	Dave Hansen, Borislav Petkov, khorenko, X86 ML, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, linux-kernel,
	H. Peter Anvin

2016-04-08 23:44 GMT+03:00 Andy Lutomirski <luto@amacapital.net>:
> On Apr 8, 2016 9:20 AM, "Dmitry Safonov" <dsafonov@virtuozzo.com> wrote:
>>
>>
>> And if it's not there - return an error?
>
> No, just leave IP unchanged.

Ok, will resend with this fixup.

>
> Feel free to ask for help on some of these details.  user_64bit_mode
> will be helpful too.

Thanks.

>> I doubt, is it sane to remove
>> TS_COMPAT instead, leaving TIF_IA32, as for some cases
>> we need to know if task is compatible outside of syscall's path?
>
> No.  TS_COMPAT is important, and it's also better behaved than
> TIF_IA32 -- it has a very specific meaning: "am I currently executing
> a 32-bit syscall".
>
>
> The comment is wrong :). TS_COMPAT is true on int80 or 32-bit vdso
> syscall entries and is false otherwise.  64-bit tasks can use int80
> and, with your patches, will be able to use the 32-bit vdso entry as
> well.
>

Oh, yes, I see what you're pointing at - thanks, will work on it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-08 20:44               ` Andy Lutomirski
  2016-04-09  8:06                 ` Dmitry Safonov
@ 2016-04-13 16:55                 ` Dmitry Safonov
  2016-04-14 18:27                   ` Andy Lutomirski
  1 sibling, 1 reply; 26+ messages in thread
From: Dmitry Safonov @ 2016-04-13 16:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Shuah Khan, Ingo Molnar, Dave Hansen,
	Borislav Petkov, khorenko, X86 ML, Andrew Morton, xemul,
	linux-kselftest, Cyrill Gorcunov, Dmitry Safonov, linux-kernel,
	H. Peter Anvin

On 04/08/2016 11:44 PM, Andy Lutomirski wrote:
> Feel free to ask for help on some of these details.  user_64bit_mode
> will be helpful too.
Hello again,

here are some questions on TIF_IA32 removal:
- in the function intel_pmu_pebs_fixup_ip: we need to
know whether the process was in native/compat mode for the instruction
interpreter doing the IP + one instruction fixup. There are
registers, but they are from PEBS, which does not contain
segment descriptors (even for PEBSv3). Other values
are from the interrupt regs (look at setup_pebs_sample_data).
So, I guess, we may use user_64bit_mode on the interrupt
register set, which will be racy with the task changing mode,
but quite OK?
- the same with LBR branches: I may get the cs value for
user_64bit_mode, or the whole register set, from intel_pmu_handle_irq
and pass it through intel_pmu_lbr_read => intel_pmu_lbr_filter
to branch_type for the instruction decoder, which may
misinterpret an opcode for the same racy mode-switching app.
Is that also fine?
- for coredumping/ptracing, I will replace test_thread_flag(TIF_IA32)
with user_64bit_mode(task_pt_regs()) - that looks/should be simple.
It's also valid, as at the moment of a coredump or of
PTRACE_GETREGSET the task isn't running. (See the sketch below.)
- I do not know what to do with uprobes - as you noted,
the way it checks ia32_compat is buggy even now: a task that
switches CS to __USER32_CS or back to __USER_CS will have
a bogus uprobe inserted in its mm.
So, how do we know at insert time with which descriptor
the program will run the uprobed code?
- for MPX, I guess tracking which syscall called
mpx_enable_management will work; at least it may be
documented that before switching one needs to disable MPX.
- perf_reg_abi is everywhere used with current, so it's
also a simple switch to user_64bit_mode(task_pt_regs(current)).

In conclusion:
I will send those patches, but I do not know what to do with
uprobes tracing. Could you give advice on what to do with
that?
It seems that if I do those things, I will only need a way to
change the vdso blob, without swapping any compat flags,
as 64-bit tasks will differ from 32-bit ones only by the way they
execute syscalls.
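
The sketch for the coredump/ptrace item above (assuming task_pt_regs()
of a stopped task is the right thing to inspect):

    /* replaces test_thread_flag(TIF_IA32) for a stopped/dumping task */
    static bool task_is_compat(struct task_struct *task)
    {
            return !user_64bit_mode(task_pt_regs(task));
    }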

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-13 16:55                 ` Dmitry Safonov
@ 2016-04-14 18:27                   ` Andy Lutomirski
  2016-04-20 11:04                     ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-14 18:27 UTC (permalink / raw)
  To: Dmitry Safonov, Peter Zijlstra
  Cc: Thomas Gleixner, Shuah Khan, Ingo Molnar, Dave Hansen,
	Borislav Petkov, khorenko, X86 ML, Andrew Morton, xemul,
	linux-kselftest, Cyrill Gorcunov, Dmitry Safonov, linux-kernel,
	H. Peter Anvin

On Wed, Apr 13, 2016 at 9:55 AM, Dmitry Safonov <dsafonov@virtuozzo.com> wrote:
> On 04/08/2016 11:44 PM, Andy Lutomirski wrote:
>>
>> Feel free to ask for help on some of these details.  user_64bit_mode
>> will be helpful too.
>
> Hello again,
>
> here are some questions on TIF_IA32 removal:
> - in the function intel_pmu_pebs_fixup_ip: we need to
> know whether the process was in native/compat mode for the instruction
> interpreter doing the IP + one instruction fixup. There are
> registers, but they are from PEBS, which does not contain
> segment descriptors (even for PEBSv3). Other values
> are from the interrupt regs (look at setup_pebs_sample_data).
> So, I guess, we may use user_64bit_mode on the interrupt
> register set, which will be racy with the task changing mode,
> but quite OK?

Here's my understanding:

We don't actually know the mode, and there's no way we could get it
exactly.  User code could have changed the mode between when the PEBS
event was written and when we got the interrupt, and there's no way
for us to tell.

The regs passed to the interrupt aren't particularly helpful -- if we
get the overflow event from kernel mode, the regs will be kernel regs,
not user regs.

What we can do is to use the regs returned by perf_get_regs_user,
which I imagine perf is already doing.  Peter, is this the case?

If necessary, starting in 4.6, I could make the regs->cs part of
perf_get_regs_user be correct no matter what -- the only funny cases
left are NMI-in-system-call-prologue (there can't be intervening
interrupts any more other than MCE, and I don't think we really care
if we report correct PEBS results if we take a machine check in the
middle).

> - the same with LBR branches: I may get the cs value for
> user_64bit_mode, or the whole register set, from intel_pmu_handle_irq
> and pass it through intel_pmu_lbr_read => intel_pmu_lbr_filter
> to branch_type for the instruction decoder, which may
> misinterpret an opcode for the same racy mode-switching app.
> Is that also fine?

Same thing, I think.

> - for coredumping/ptracing, I will replace test_thread_flag(TIF_IA32)
> with user_64bit_mode(task_pt_regs()) - that looks/should be simple.
> It's also valid, as at the moment of a coredump or of
> PTRACE_GETREGSET the task isn't running.

Please cc Oleg Nesterov on that one.

> - I do not know what to do with uprobes - as you noted,
> the way it checks ia32_compat is buggy even now: a task that
> switches CS to __USER32_CS or back to __USER_CS will have
> a bogus uprobe inserted in its mm.

I have no idea, but I'll look at your patch and maybe have an idea.
Oleg Nesterov might be a good person to ask about that, too.

> So, how do we know at insert time with which descriptor
> the program will run the uprobed code?
> - for MPX, I guess tracking which syscall called
> mpx_enable_management will work; at least it may be
> documented that before switching one needs to disable MPX.

You already have to disable MPX before switching because of hardware
issues, so I wouldn't worry about it.

> - perf_reg_abi is everywhere used with current, so it's
> also a simple switch to user_64bit_mode(task_pt_regs(current)).

perf_reg_abi should be better -- see the fancy code in
perf_get_regs_user, which is where it comes from these days, I think.

>
> In conclusion:
> I will send those patches, but I do not know what to do with
> uprobes tracing. Could you give advice on what to do with
> that?
> It seems that if I do those things, I will only need a way to
> change the vdso blob, without swapping any compat flags,
> as 64-bit tasks will differ from 32-bit ones only by the way they
> execute syscalls.

Fantastic!

--Andy


-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-14 18:27                   ` Andy Lutomirski
@ 2016-04-20 11:04                     ` Peter Zijlstra
  2016-04-20 15:40                       ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2016-04-20 11:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Safonov, Thomas Gleixner, Shuah Khan, Ingo Molnar,
	Dave Hansen, Borislav Petkov, khorenko, X86 ML, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, Dmitry Safonov,
	linux-kernel, H. Peter Anvin

On Thu, Apr 14, 2016 at 11:27:35AM -0700, Andy Lutomirski wrote:
> On Wed, Apr 13, 2016 at 9:55 AM, Dmitry Safonov <dsafonov@virtuozzo.com> wrote:
> > On 04/08/2016 11:44 PM, Andy Lutomirski wrote:
> >>
> >> Feel free to ask for help on some of these details.  user_64bit_mode
> >> will be helpful too.
> >
> > Hello again,
> >
> > here are some questions on TIF_IA32 removal:
> > - in the function intel_pmu_pebs_fixup_ip: we need to
> > know whether the process was in native/compat mode for the instruction
> > interpreter doing the IP + one instruction fixup. There are
> > registers, but they are from PEBS, which does not contain
> > segment descriptors (even for PEBSv3). Other values
> > are from the interrupt regs (look at setup_pebs_sample_data).
> > So, I guess, we may use user_64bit_mode on the interrupt
> > register set, which will be racy with the task changing mode,
> > but quite OK?
> 
> Here's my understanding:
> 
> We don't actually know the mode, and there's no way we could get it
> exactly.  User code could have changed the mode between when the PEBS
> event was written and when we got the interrupt, and there's no way
> for us to tell.
> 
> The regs passed to the interrupt aren't particularly helpful -- if we
> get the overflow event from kernel mode, the regs will be kernel regs,
> not user regs.
> 
> What we can do is to use the regs returned by perf_get_regs_user,
> which I imagine perf is already doing.  Peter, is this the case?

*confused*, how is perf_get_regs_user() connected to the PEBS fixup?

Ah, you want to use perf_get_regs_user() instead of task_pt_regs()
because of how an NMI during interrupt entry would mess up the
task_pt_regs() contents.

At that point you can use regs_user->abi, right?
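
Something along these lines (a sketch; regs is whatever the overflow
handler got):

    struct perf_regs regs_user;
    struct pt_regs regs_user_copy;

    perf_get_regs_user(&regs_user, regs, &regs_user_copy);
    if (regs_user.abi == PERF_SAMPLE_REGS_ABI_32)
            use_32bit_decoder();    /* placeholder for the decoder */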

> If necessary, starting in 4.6, I could make the regs->cs part of
> perf_get_regs_user be correct no matter what -- the only funny cases
> left are NMI-in-system-call-prologue (there can't be intervening
> interrupts any more other than MCE, and I don't think we really care
> if we report correct PEBS results if we take a machine check in the
> middle).
> 
> > - the same with LBR branches: I may get the cs value for
> > user_64bit_mode, or the whole register set, from intel_pmu_handle_irq
> > and pass it through intel_pmu_lbr_read => intel_pmu_lbr_filter
> > to branch_type for the instruction decoder, which may
> > misinterpret an opcode for the same racy mode-switching app.
> > Is that also fine?
> 
> Same thing, I think.

Yep, whatever works for PEBS should also work for the LBR case. Both can
handle an occasional failed decode. Esp. if userspace is doing daft
things like changing the mode, you get to keep whatever pieces result
from that.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-20 11:04                     ` Peter Zijlstra
@ 2016-04-20 15:40                       ` Andy Lutomirski
  2016-04-20 19:05                         ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-20 15:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dmitry Safonov, Thomas Gleixner, Shuah Khan, Ingo Molnar,
	Dave Hansen, Borislav Petkov, khorenko, X86 ML, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, Dmitry Safonov,
	linux-kernel, H. Peter Anvin

On Wed, Apr 20, 2016 at 4:04 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Apr 14, 2016 at 11:27:35AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 13, 2016 at 9:55 AM, Dmitry Safonov <dsafonov@virtuozzo.com> wrote:
>> > On 04/08/2016 11:44 PM, Andy Lutomirski wrote:
>> >>
>> >> Feel free to ask for help on some of these details.  user_64bit_mode
>> >> will be helpful too.
>> >
>> > Hello again,
>> >
>> > here are some questions on TIF_IA32 removal:
>> > - in the function intel_pmu_pebs_fixup_ip: we need to
>> > know whether the process was in native/compat mode for the instruction
>> > interpreter doing the IP + one instruction fixup. There are
>> > registers, but they are from PEBS, which does not contain
>> > segment descriptors (even for PEBSv3). Other values
>> > are from the interrupt regs (look at setup_pebs_sample_data).
>> > So, I guess, we may use user_64bit_mode on the interrupt
>> > register set, which will be racy with the task changing mode,
>> > but quite OK?
>>
>> Here's my understanding:
>>
>> We don't actually know the mode, and there's no way we could get it
>> exactly.  User code could have changed the mode between when the PEBS
>> event was written and when we got the interrupt, and there's no way
>> for us to tell.
>>
>> The regs passed to the interrupt aren't particularly helpful -- if we
>> get the overflow event from kernel mode, the regs will be kernel regs,
>> not user regs.
>>
>> What we can do is to use the regs returned by perf_get_regs_user,
>> which I imagine perf is already doing.  Peter, is this the case?
>
> *confused*, how is perf_get_regs_user() connected to the PEBS fixup?
>
> Ah, you want to use perf_get_regs_user() instead of task_pt_regs()
> because of how an NMI during interrupt entry would mess up the
> task_pt_regs() contents.
>
> At that point you can use regs_user->abi, right?

Yes, exactly.

Do LBR, PEBS, and similar report user regs or do they merely want to
know the instruction format?  If the latter, I could whip up a tiny
function to do just that (like perf_get_regs_user but just for ABI --
it would be simpler).
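
Hypothetically, something like:

    /* report only the ABI of the (stopped) task's user context; the
     * funny entry-prologue cases would still need special handling */
    u64 perf_user_abi(struct task_struct *task)
    {
            if (user_64bit_mode(task_pt_regs(task)))
                    return PERF_SAMPLE_REGS_ABI_64;
            return PERF_SAMPLE_REGS_ABI_32;
    }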

[merging some emails]

>> Peter, I got lost in the code that calls this.  Are regs coming from
>> the overflow interrupt's regs, current_pt_regs(), or
>> perf_get_regs_user?
>
> So get_perf_callchain() will get regs from:
>
>  - interrupt/NMI regs
>  - perf_arch_fetch_caller_regs()
>
> And when user && !user_mode(), we'll use:
>
>  - task_pt_regs() (which arguably should maybe be perf_get_regs_user())

Could you point me to this bit of the code?

>
> to call perf_callchain_user(), which then ends up calling
> perf_callchain_user32() which is expected to NO-OP for 64bit userspace.
>
>> If it's the perf_get_regs_user, then this should be okay, but passing
>> in the ABI field directly would be even nicer.  If they're coming from
>> the overflow interrupt's regs or current_pt_regs(), could we change
>> that?
>>
>> It might also be nice to make sure that we call perf_get_regs_user
>> exactly once per overflow interrupt -- i.e. we could push it into the
>> main code rather than the regs sampling code.
>
> The risk there is that we might not need the user regs at all to handle
> the overflow thingy, so doing it unconditionally would be unwanted.

One call to perf_get_regs_user per interrupt shouldn't be too bad --
certainly much better than one per PEBS record.  One call to get the user
ABI per overflow would be even less bad, but at that point, folding it
into the PEBS code wouldn't be so bad either.

If I'm understanding this right (a big, big if), if we get a PEBS
overflow while running in user mode, we'll dump out the user regs (and
call perf_get_regs_user) and all the PEBS entries (subject to
exclude_kernel and with all the decoding magic).  So, in that case, we
call perf_get_regs_user.

If we get a PEBS overflow while running in kernel mode, we'll report
the kernel regs (if !exclude_kernel) and report the PEBS data as well.
If any of those records are in user mode, then, ideally, we'd invoke
perf_get_regs_user or similar *once* to get the ABI.  Although, if we
can get the user ABI efficiently enough, then maybe we don't care if
we call it once per PEBS record.

On x86, the only weird cases are NMIs or MCEs that land in the
syscall, syscall32, and sysenter prologues (easy to handle fully
correctly if we care because the IP that we interrupted tells us the
ABI) and the bullshit SYSENTER+TF thing.  Even the latter isn't so
hard to get right.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-20 15:40                       ` Andy Lutomirski
@ 2016-04-20 19:05                         ` Peter Zijlstra
  2016-04-21 19:39                           ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2016-04-20 19:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Safonov, Thomas Gleixner, Shuah Khan, Ingo Molnar,
	Dave Hansen, Borislav Petkov, khorenko, X86 ML, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, Dmitry Safonov,
	linux-kernel, H. Peter Anvin

On Wed, Apr 20, 2016 at 08:40:23AM -0700, Andy Lutomirski wrote:
> Do LBR, PEBS, and similar report user regs or do they merely want to
> know the instruction format?  If the latter, I could whip up a tiny
> function to do just that (like perf_get_regs_user but just for ABI --
> it would be simpler).

Just the instruction format, nothing else.

> >> Peter, I got lost in the code that calls this.  Are regs coming from
> >> the overflow interrupt's regs, current_pt_regs(), or
> >> perf_get_regs_user?
> >
> > So get_perf_callchain() will get regs from:
> >
> >  - interrupt/NMI regs
> >  - perf_arch_fetch_caller_regs()
> >
> > And when user && !user_mode(), we'll use:
> >
> >  - task_pt_regs() (which arguably should maybe be perf_get_regs_user())
> 
> Could you point me to this bit of the code?

kernel/events/callchain.c:198

> > to call perf_callchain_user(), which then ends up calling
> > perf_callchain_user32() which is expected to NO-OP for 64bit userspace.
> >
> >> If it's the perf_get_regs_user, then this should be okay, but passing
> >> in the ABI field directly would be even nicer.  If they're coming from
> >> the overflow interrupt's regs or current_pt_regs(), could we change
> >> that?
> >>
> >> It might also be nice to make sure that we call perf_get_regs_user
> >> exactly once per overflow interrupt -- i.e. we could push it into the
> >> main code rather than the regs sampling code.
> >
> > The risk there is that we might not need the user regs at all to handle
> > the overflow thingy, so doing it unconditionally would be unwanted.
> 
> One call to perf_get_regs_user per interrupt shouldn't be too bad --
> certainly much better than one per PEBS record.  One call to get the user
> ABI per overflow would be even less bad, but at that point, folding it
> into the PEBS code wouldn't be so bad either.

Right; although note that the whole fixup_ip() thing requires a single
record per interrupt (for we need the LBR state for each record in order
to rewind).

Also, HSW+ PEBS doesn't do the fixup anymore.

> If I'm understanding this right (a big, big if), if we get a PEBS
> overflow while running in user mode, we'll dump out the user regs (and
> call perf_get_regs_user) and all the PEBS entries (subject to
> exclude_kernel and with all the decoding magic).  So, in that case, we
> call perf_get_regs_user.

We only dump user regs if PERF_SAMPLE_REGS_USER, and in case we hit
userspace with the interrupt we use the interrupt regs; see
perf_sample_regs_user().

> If we get a PEBS overflow while running in kernel mode, we'll report
> the kernel regs (if !exclude_kernel) and report the PEBS data as well.
> If any of those records are in user mode, then, ideally, we'd invoke
> perf_get_regs_user or similar *once* to get the ABI.  Although, if we
> can get the user ABI efficiently enough, then maybe we don't care if
> we call it once per PEBS record.

Right, if we interrupt kernel mode, we'll call perf_get_regs_user() if
PERF_SAMPLE_REGS_USER (| PERF_SAMPLE_STACK_USER).

The problem here is that the overflow stuff is designed for a single
'event' per interrupt, so passing it data for multiple events is
somewhat icky.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-20 19:05                         ` Peter Zijlstra
@ 2016-04-21 19:39                           ` Andy Lutomirski
  2016-04-21 20:12                             ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-21 19:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dmitry Safonov, Thomas Gleixner, Shuah Khan, Ingo Molnar,
	Dave Hansen, Borislav Petkov, khorenko, X86 ML, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, Dmitry Safonov,
	linux-kernel, H. Peter Anvin

On Wed, Apr 20, 2016 at 12:05 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Apr 20, 2016 at 08:40:23AM -0700, Andy Lutomirski wrote:
>> Do LBR, PEBS, and similar report user regs or do they merely want to
>> know the instruction format?  If the latter, I could whip up a tiny
>> function to do just that (like perf_get_regs_user but just for ABI --
>> it would be simpler).
>
> Just the instruction format, nothing else.
>
>> >> Peter, I got lost in the code that calls this.  Are regs coming from
>> >> the overflow interrupt's regs, current_pt_regs(), or
>> >> perf_get_regs_user?
>> >
>> > So get_perf_callchain() will get regs from:
>> >
>> >  - interrupt/NMI regs
>> >  - perf_arch_fetch_caller_regs()
>> >
>> > And when user && !user_mode(), we'll use:
>> >
>> >  - task_pt_regs() (which arguably should maybe be perf_get_regs_user())
>>
>> Could you point me to this bit of the code?
>
> kernel/events/callchain.c:198

But that only applies to the callchain code, right?  AFAICS the PEBS
code is invoked through the x86_pmu NMI handler and always gets the
IRQ regs.  Except for this case:

static inline void intel_pmu_drain_pebs_buffer(void)
{
    struct pt_regs regs;

    x86_pmu.drain_pebs(&regs);
}

which seems a bit confused.

I don't suppose we could arrange to pass something consistent into the
PEBS handlers...

Or is the PEBS code being called from the callchain code somehow?

I haven't dug into the LBR code much.

>>
>> One call to perf_get_regs_user per interrupt shouldn't be too bad --
>> certainly much better than one per PEBS record.  One call to get the user
>> ABI per overflow would be even less bad, but at that point, folding it
>> into the PEBS code wouldn't be so bad either.
>
> Right; although note that the whole fixup_ip() thing requires a single
> record per interrupt (for we need the LBR state for each record in order
> to rewind).

So do earlier PEBS events not get rewound?  Or do we just program the
thing to only ever give us one event at a time?

>
> Also, HSW+ PEBS doesn't do the fixup anymore.
>
>> If I'm understanding this right (a big, big if), if we get a PEBS
>> overflow while running in user mode, we'll dump out the user regs (and
>> call perf_get_regs_user) and all the PEBS entries (subject to
>> exclude_kernel and with all the decoding magic).  So, in that case, we
>> call perf_get_regs_user.
>
> We only dump user regs if PERF_SAMPLE_REGS_USER, and in case we hit
> userspace with the interrupt we use the interrupt regs; see
> perf_sample_regs_user().
>
>> If we get a PEBS overflow while running in kernel mode, we'll report
>> the kernel regs (if !exclude_kernel) and report the PEBS data as well.
>> If any of those records are in user mode, then, ideally, we'd invoke
>> perf_get_regs_user or similar *once* to get the ABI.  Although, if we
>> can get the user ABI efficiently enough, then maybe we don't care if
>> we call it once per PEBS record.
>
> Right, if we interrupt kernel mode, we'll call perf_get_regs_user() if
> PERF_SAMPLE_REGS_USER (| PERF_SAMPLE_STACK_USER).

But not in get_perf_callchain.  So we'll show the correct user *regs*
but not the current user callchain under some conditions, AFAICS.

>
> The problem here is that the overflow stuff is designed for a single
> 'event' per interrupt, so passing it data for multiple events is
> somewhat icky.

It also seems that there's a certain amount of confusion as to exactly
what "regs" means in various contexts.  Or at least I'm confused by
it.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-21 19:39                           ` Andy Lutomirski
@ 2016-04-21 20:12                             ` Peter Zijlstra
  2016-04-21 23:27                               ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2016-04-21 20:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Safonov, Thomas Gleixner, Shuah Khan, Ingo Molnar,
	Dave Hansen, Borislav Petkov, khorenko, X86 ML, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, Dmitry Safonov,
	linux-kernel, H. Peter Anvin

On Thu, Apr 21, 2016 at 12:39:42PM -0700, Andy Lutomirski wrote:
> On Wed, Apr 20, 2016 at 12:05 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Apr 20, 2016 at 08:40:23AM -0700, Andy Lutomirski wrote:

> >> >> Peter, I got lost in the code that calls this.  Are regs coming from
> >> >> the overflow interrupt's regs, current_pt_regs(), or
> >> >> perf_get_regs_user?
> >> >
> >> > So get_perf_callchain() will get regs from:
> >> >
> >> >  - interrupt/NMI regs
> >> >  - perf_arch_fetch_caller_regs()
> >> >
> >> > And when user && !user_mode(), we'll use:
> >> >
> >> >  - task_pt_regs() (which arguably should maybe be perf_get_regs_user())
> >>
> >> Could you point me to this bit of the code?
> >
> > kernel/events/callchain.c:198
> 
> But that only applies to the callchain code, right? 

Yes, which is what I thought you were after...

> AFAICS the PEBS
> code is invoked through the x86_pmu NMI handler and always gets the
> IRQ regs.  Except for this case:
> 
> static inline void intel_pmu_drain_pebs_buffer(void)
> {
>     struct pt_regs regs;
> 
>     x86_pmu.drain_pebs(&regs);
> }
> 
> which seems a bit confused.

Yes, so that only gets used with 'large' pebs, which requires no other
flags than PEBS_FREERUNNING_FLAGS, which precludes the regs set from
being used.

Could definitely use a comment.
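
Something like:

    static inline void intel_pmu_drain_pebs_buffer(void)
    {
            /* dummy regs: large PEBS is only allowed when no sample
             * flags that would read this register set are in use */
            struct pt_regs regs;

            x86_pmu.drain_pebs(&regs);
    }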

> I don't suppose we could arrange to pass something consistent into the
> PEBS handlers...
> 
> Or is the PEBS code being called from the callchain code somehow?

No. I think we were/are slightly talking past one another.

> >> One call to perf_get_user_regs per interrupt shouldn't be too bad --
> >> certainly much better then one per PEBS record.  One call to get user
> >> ABI per overflow would be even less bad, but at that point, folding it
> >> in to the PEBS code wouldn't be so bad either.
> >
> > Right; although note that the whole fixup_ip() thing requires a single
> > record per interrupt (for we need the LBR state for each record in order
> > to rewind).
> 
> So do earlier PEBS events not get rewound?  Or do we just program the
> thing to only ever give us one event at a time?

The latter; we program PEBS such that it can hold but a single record
and thereby assure we get an interrupt for each record.
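
In code that amounts to setting the DS interrupt threshold one record
past the buffer base (a sketch of the idea, not the exact ds.c logic):

    /* single-record PEBS: interrupt as soon as one record lands */
    threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
    ds->pebs_interrupt_threshold = threshold;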

> > The problem here is that the overflow stuff is designed for a single
> > 'event' per interrupt, so passing it data for multiple events is
> > somewhat icky.
> 
> It also seems that there's a certain amount of confusion as to exactly
> what "regs" means in various contexts.  Or at least I'm confused by
> it.

Yes, there's too much regs.

Typically 'regs' is the 'interrupt'/'event' regs, that is the register
set at eventing time. For sampling hardware PMUs this is NMI/IRQ like
things, for software events this ends up being
perf_arch_fetch_caller_regs().
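
(On x86 that macro is only a partial snapshot -- roughly:

    #define perf_arch_fetch_caller_regs(regs, __ip) {   \
        (regs)->ip = (__ip);                            \
        (regs)->bp = caller_frame_pointer();            \
        (regs)->cs = __KERNEL_CS;                       \
        (regs)->flags = 0;                              \
        /* plus sp, grabbed via inline asm */           \
    }

so only ip/bp/cs/flags/sp are meaningful for software events.)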

Then there's PERF_SAMPLE_REGS_USER|PERF_SAMPLE_STACK_USER, which, for
each event with it set, uses perf_get_regs_user() to dump the thing into
our ringbuffer as part of the event record.

And then there's the callchain code, which first unwinds kernel space if
the 'interrupt'/'event' reg set points into the kernel, and then uses
task_pt_regs() (which I think we agree should be perf_get_regs_user())
to obtain the user regs to continue with the user stack unwind.

Finally there's PERF_SAMPLE_REGS_INTR, which dumps whatever
'interrupt/event' regs we get into the ringbuffer sample record.


Did that help? Or did I confuse you moar?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-21 20:12                             ` Peter Zijlstra
@ 2016-04-21 23:27                               ` Andy Lutomirski
  2016-04-21 23:46                                 ` Andy Lutomirski
  2016-04-25 15:16                                 ` Peter Zijlstra
  0 siblings, 2 replies; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-21 23:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dmitry Safonov, Thomas Gleixner, Shuah Khan, Ingo Molnar,
	Dave Hansen, Borislav Petkov, khorenko, X86 ML, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, Dmitry Safonov,
	linux-kernel, H. Peter Anvin

On Thu, Apr 21, 2016 at 1:12 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Apr 21, 2016 at 12:39:42PM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 20, 2016 at 12:05 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Wed, Apr 20, 2016 at 08:40:23AM -0700, Andy Lutomirski wrote:
>
>> >> >> Peter, I got lost in the code that calls this.  Are regs coming from
>> >> >> the overflow interrupt's regs, current_pt_regs(), or
>> >> >> perf_get_regs_user?
>> >> >
>> >> > So get_perf_callchain() will get regs from:
>> >> >
>> >> >  - interrupt/NMI regs
>> >> >  - perf_arch_fetch_caller_regs()
>> >> >
>> >> > And when user && !user_mode(), we'll use:
>> >> >
>> >> >  - task_pt_regs() (which arguably should maybe be perf_get_regs_user())
>> >>
>> >> Could you point me to this bit of the code?
>> >
>> > kernel/events/callchain.c:198
>>
>> But that only applies to the callchain code, right?
>
> Yes, which is what I thought you were after..
>
>> AFAICS the PEBS
>> code is invoked through the x86_pmu NMI handler and always gets the
>> IRQ regs.  Except for this case:
>>
>> static inline void intel_pmu_drain_pebs_buffer(void)
>> {
>>     struct pt_regs regs;
>>
>>     x86_pmu.drain_pebs(&regs);
>> }
>>
>> which seems a bit confused.
>
> Yes, so that only gets used with 'large' pebs, which requires no other
>> flags than PEBS_FREERUNNING_FLAGS, which precludes the regs set from
> being used.
>
> Could definitely use a comment.
>
>> I don't suppose we could arrange to pass something consistent into the
>> PEBS handlers...
>>
>> Or is the PEBS code being called from the callchain code somehow?
>
> No. I think we were/are slightly talking past one another.
>
>> >> One call to perf_get_regs_user per interrupt shouldn't be too bad --
>> >> certainly much better than one per PEBS record.  One call to get user
>> >> ABI per overflow would be even less bad, but at that point, folding it
>> >> in to the PEBS code wouldn't be so bad either.
>> >
>> > Right; although note that the whole fixup_ip() thing requires a single
>> > record per interrupt (for we need the LBR state for each record in order
>> > to rewind).
>>
>> So do earlier PEBS events not get rewound?  Or do we just program the
>> thing to only ever give us one event at a time?
>
> The latter; we program PEBS such that it can hold but a single record
> and thereby assure we get an interrupt for each record.
>
>> > The problem here is that the overflow stuff is designed for a single
>> > 'event' per interrupt, so passing it data for multiple events is
>> > somewhat icky.
>>
>> It also seems that there's a certain amount of confusion as to exactly
>> what "regs" means in various contexts.  Or at least I'm confused by
>> it.
>
> Yes, there's too much regs.
>
> Typically 'regs' is the 'interrupt'/'event' regs, that is the register
> set at eventing time. For sampling hardware PMUs this is NMI/IRQ like
> things, for software events this ends up being
> perf_arch_fetch_caller_regs().
>
> Then there's PERF_SAMPLE_REGS_USER|PERF_SAMPLE_STACK_USER, which, for
> each event with it set, uses perf_get_regs_user() to dump the thing into
> our ringbuffer as part of the event record.
>
> And then there's the callchain code, which first unwinds kernel space if
> the 'interrupt'/'event' reg set points into the kernel, and then uses
> task_pt_regs() (which I think we agree should be perf_get_regs_user())
> to obtain the user regs to continue with the user stack unwind.
>
> Finally there's PERF_SAMPLE_REGS_INTR, which dumps whatever
> 'interrupt/event' regs we get into the ringbuffer sample record.
>
>
> Did that help? Or did I confuse you moar?
>

I think I'm starting to get it.  What if we rearrange slightly, like this:

perf_sample_data already has a struct perf_regs in it.  We could add a
flags field to the first chunk of perf_sample_data:

u64 sample_flags;

perf_sample_data_init sets sample_flags to zero.

Now we rename perf_sample_regs_user to __perf_sample_regs_user and
make it non-static.  We also teach it to set data->sample_flags |=
PERF_SAMPLE_FLAGS_HAS_REGS_USER.  We add:

static void perf_fetch_regs_user(struct perf_sample_data *data, struct
pt_regs *interrupt_regs)
{
  if (data->sample_flags & PERF_SAMPLE_FLAGS_HAS_REGS_USER)
    return;

  __perf_sample_regs_user(&data->regs_user, interrupt_regs,
&data->regs_user_copy);
}

(Hmm.  This only really works well if we can guarantee that
interrupt_regs remains valid for the life of the perf_sample_data
object.  Could we perhaps move the interrupt_regs pointer *into*
perf_sample_data and stop passing it all over the place?)

We change all the callers of perf_sample_regs_user to use
perf_fetch_regs_user instead.
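
Concretely, the plumbing would be something like (names made up, of
course):

    #define PERF_SAMPLE_FLAGS_HAS_REGS_USER    (1ULL << 0)

    struct perf_sample_data {
        u64 addr;
        ...
        u64 sample_flags;    /* new: tracks what's already filled in */
        ...
    };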

Now we teach the PEBS fixup code to call perf_fetch_regs_user if it
sees a user IP.  Then it can use regs_user->abi instead of TIF_IA32,
and my original goal of nuking TIF_IA32 can proceed apace.  (Keep in
mind that, if the interrupt refers to user mode, this is very fast.
If the interrupt refers to kernel mode, it's slower, but that's
comparatively rare and it's the case where we actually care.)
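
The TIF_IA32 use in question is the instruction decode in the fixup,
which today does roughly

    is_64bit = kernel_ip(to) || !test_thread_flag(TIF_IA32);

and which, assuming regs_user has been fetched first, could become:

    is_64bit = kernel_ip(to) ||
               data->regs_user.abi != PERF_SAMPLE_REGS_ABI_32;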

There might be one or two other tweaks needed, but I think this should
mostly do the trick.

What do you think?  If you like it, I can probably find some time to
give it a shot, but I don't guarantee that I won't miss some subtlety
in its interaction with the rest of the event output code.

On a vaguely related note, why is the big PEBS-to-pt_regs copy
conditional on (sample_type & PERF_SAMPLE_REGS_INTR)?  I bet it would
be faster to make it unconditional, because you could avoid copying
over the entire pt_regs struct if PERF_SAMPLE_REGS_INTR isn't set.

--Andy


-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-21 23:27                               ` Andy Lutomirski
@ 2016-04-21 23:46                                 ` Andy Lutomirski
  2016-04-25 15:16                                 ` Peter Zijlstra
  1 sibling, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-21 23:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dmitry Safonov, Thomas Gleixner, Shuah Khan, Ingo Molnar,
	Dave Hansen, Borislav Petkov, khorenko, X86 ML, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, Dmitry Safonov,
	linux-kernel, H. Peter Anvin

On Thu, Apr 21, 2016 at 4:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Apr 21, 2016 at 1:12 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, Apr 21, 2016 at 12:39:42PM -0700, Andy Lutomirski wrote:
>>> On Wed, Apr 20, 2016 at 12:05 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> > On Wed, Apr 20, 2016 at 08:40:23AM -0700, Andy Lutomirski wrote:
>>
>>> >> >> Peter, I got lost in the code that calls this.  Are regs coming from
>>> >> >> the overflow interrupt's regs, current_pt_regs(), or
>>> >> >> perf_get_regs_user?
>>> >> >
>>> >> > So get_perf_callchain() will get regs from:
>>> >> >
>>> >> >  - interrupt/NMI regs
>>> >> >  - perf_arch_fetch_caller_regs()
>>> >> >
>>> >> > And when user && !user_mode(), we'll use:
>>> >> >
>>> >> >  - task_pt_regs() (which arguably should maybe be perf_get_regs_user())
>>> >>
>>> >> Could you point me to this bit of the code?
>>> >
>>> > kernel/events/callchain.c:198
>>>
>>> But that only applies to the callchain code, right?
>>
>> Yes, which is what I thought you were after..
>>
>>> AFAICS the PEBS
>>> code is invoked through the x86_pmu NMI handler and always gets the
>>> IRQ regs.  Except for this case:
>>>
>>> static inline void intel_pmu_drain_pebs_buffer(void)
>>> {
>>>     struct pt_regs regs;
>>>
>>>     x86_pmu.drain_pebs(&regs);
>>> }
>>>
>>> which seems a bit confused.
>>
>> Yes, so that only gets used with 'large' pebs, which requires no other
>> flags than PEBS_FREERUNNING_FLAGS, which precludes the regs set from
>> being used.
>>
>> Could definitely use a comment.
>>
>>> I don't suppose we could arrange to pass something consistent into the
>>> PEBS handlers...
>>>
>>> Or is the PEBS code being called from the callchain code somehow?
>>
>> No. I think we were/are slightly talking past one another.
>>
>>> >> One call to perf_get_regs_user per interrupt shouldn't be too bad --
>>> >> certainly much better than one per PEBS record.  One call to get user
>>> >> ABI per overflow would be even less bad, but at that point, folding it
>>> >> in to the PEBS code wouldn't be so bad either.
>>> >
>>> > Right; although note that the whole fixup_ip() thing requires a single
>>> > record per interrupt (for we need the LBR state for each record in order
>>> > to rewind).
>>>
>>> So do earlier PEBS events not get rewound?  Or do we just program the
>>> thing to only ever give us one event at a time?
>>
>> The latter; we program PEBS such that it can hold but a single record
>> and thereby assure we get an interrupt for each record.
>>
>>> > The problem here is that the overflow stuff is designed for a single
>>> > 'event' per interrupt, so passing it data for multiple events is
>>> > somewhat icky.
>>>
>>> It also seems that there's a certain amount of confusion as to exactly
>>> what "regs" means in various contexts.  Or at least I'm confused by
>>> it.
>>
>> Yes, there's too much regs.
>>
>> Typically 'regs' is the 'interrupt'/'event' regs, that is the register
>> set at eventing time. For sampling hardware PMUs this is NMI/IRQ like
>> things, for software events this ends up being
>> perf_arch_fetch_caller_regs().
>>
>> Then there's PERF_SAMPLE_REGS_USER|PERF_SAMPLE_STACK_USER, which, for
>> each event with it set, uses perf_get_regs_user() to dump the thing into
>> our ringbuffer as part of the event record.
>>
>> And then there's the callchain code, which first unwinds kernel space if
>> the 'interrupt'/'event' reg set points into the kernel, and then uses
>> task_pt_regs() (which I think we agree should be perf_get_regs_user())
>> to obtain the user regs to continue with the user stack unwind.
>>
>> Finally there's PERF_SAMPLE_REGS_INTR, which dumps whatever
>> 'interrupt/event' regs we get into the ringbuffer sample record.
>>
>>
>> Did that help? Or did I confuse you moar?
>>
>
> I think I'm starting to get it.  What if we rearrange slightly, like this:
>

I started fiddling to see what's involved, then I got to this:

    if (sample_type & PERF_SAMPLE_REGS_INTR) {
        u64 abi = data->regs_intr.abi;
        /*
         * If there are no regs to dump, notice it through
         * first u64 being zero (PERF_SAMPLE_REGS_ABI_NONE).
         */
        perf_output_put(handle, abi);

        if (abi) {
            u64 mask = event->attr.sample_regs_intr;

            perf_output_sample_regs(handle,
                        data->regs_intr.regs,
                        mask);
        }
    }

regs_intr.abi comes from perf_reg_abi(current), which, on x86_64 or
arm64, may indicate 32-bit regs, but the actual regs are always
64-bit.  Am I just confused or is this a bug?
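
For reference, x86's perf_reg_abi() is roughly:

    u64 perf_reg_abi(struct task_struct *task)
    {
        if (test_tsk_thread_flag(task, TIF_IA32))
            return PERF_SAMPLE_REGS_ABI_32;
        return PERF_SAMPLE_REGS_ABI_64;
    }

so a 32-bit task gets ABI_32 stamped on what is actually a full 64-bit
pt_regs layout.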

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-21 23:27                               ` Andy Lutomirski
  2016-04-21 23:46                                 ` Andy Lutomirski
@ 2016-04-25 15:16                                 ` Peter Zijlstra
  2016-04-25 16:50                                   ` Andy Lutomirski
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2016-04-25 15:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dmitry Safonov, Thomas Gleixner, Shuah Khan, Ingo Molnar,
	Dave Hansen, Borislav Petkov, khorenko, X86 ML, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, Dmitry Safonov,
	linux-kernel, H. Peter Anvin

On Thu, Apr 21, 2016 at 04:27:19PM -0700, Andy Lutomirski wrote:
> > Did that help? Or did I confuse you moar?
> >
> 
> I think I'm starting to get it.  What if we rearrange slightly, like this:
> 
> perf_sample_data already has a struct perf_regs in it.  We could add a
> flags field to the first chunk of perf_sample_data:
> 
> u64 sample_flags;

I actually considered that for another problem. Didn't like it then, but
seeing how I still haven't figured out a better way and you're now
proposing this too, maybe...

Part of the problem is that this will completely exhaust that first
cacheline :/

> perf_sample_data_init sets sample_flags to zero.

And while we're on struct perf_sample_data, that thing has gotten
insanely large. We carry it on-stack!

It should be fairly easy to take regs_user_copy out and use a per-cpu
array of them things for this I think, see below.

> Now we rename perf_sample_regs_user to __perf_sample_regs_user and
> make it non-static.  We also teach it to set data->sample_flags |=
> PERF_SAMPLE_FLAGS_HAS_REGS_USER.  We add:
> 
> static void perf_fetch_regs_user(struct perf_sample_data *data, struct
> pt_regs *interrupt_regs)
> {
>   if (data->sample_flags & PERF_SAMPLE_FLAGS_HAS_REGS_USER)
>     return;
> 
>   __perf_sample_regs_user(&data->regs_user, interrupt_regs,
> &data->regs_user_copy);
> }

I meant to change perf_prepare_sample() to do:

	u64 sample_type = event->attr.sample_type & ~data.sample_type;

or something similar, such that we can override/avoid some of the work
there.
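
That is, the producing side could pre-fill a field and mark it done,
something like (hypothetical, for the PEBS timestamp):

    /* in setup_pebs_sample_data(), pebs_format >= 3 */
    data->time = native_sched_clock_from_tsc(pebs->tsc);
    data->sample_type |= PERF_SAMPLE_TIME;

so that perf_prepare_sample() skips its PERF_SAMPLE_TIME branch.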

> (Hmm.  This only really works well if we can guarantee that
> interrupt_regs remains valid for the life of the perf_sample_data
> object.  Could we perhaps move the interrupt_regs pointer *into*
> perf_sample_data and stop passing it all over the place?)

So the problem with that is that we'll now overflow the one cacheline,
and the last time I really looked at this that made samples that much
slower.

It might be time to re-evaluate this stuff, since pretty much everything
will eventually write into perf_sample_data::ip etc.. which is the
second line anyway.

Also, looking at it, we actually have a pointer in there for this,
perf_sample_data::regs_intr::regs, but its at the very tail of this
monster, 4 cachelines off the normal path.

> We change all the callers of perf_sample_regs_user to use
> perf_fetch_regs_user instead.

There's only the one site currently, but yeah.

> What do you think?  If you like it, I can probably find some time to
> give it a shot, but I don't guarantee that I won't miss some subtlety
> in its interaction with the rest of the event output code.

Sure give it a go, I'll stomp on it to fix the pebs-time issue (we need
to skip perf_prepare_sample's PERF_SAMPLE_TIME branch for that).

> On a vaguely related note, why is the big PEBS-to-pt_regs copy
> conditional on (sample_type & PERF_SAMPLE_REGS_INTR)?  I bet it would
> be faster to make it unconditional, because you could avoid copying
> over the entire pt_regs struct if PERF_SAMPLE_REGS_INTR isn't set.

Hmm, yes.. that code did move about a bit, not sure what it looked like
originally.

In any case, that full copy is overkill in the simple case as well; I
think that could get away with only copying cs,flags.


Compile tested only..

---
Subject: perf: Replace perf_sample_data::regs_user_copy with per-cpu storage

struct perf_sample_data is immense, and we carry it on stack, shrink it
some.

struct perf_sample_data {
        /* size: 384, cachelines: 6, members: 19 */
}

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/perf_event.h |  2 --
 kernel/events/core.c       | 23 +++++++++++++++++------
 2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 85749ae8cb5f..dd2cab6c5bbb 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -795,8 +795,6 @@ struct perf_sample_data {
 	 * on arch details.
 	 */
 	struct perf_regs		regs_user;
-	struct pt_regs			regs_user_copy;
-
 	struct perf_regs		regs_intr;
 	u64				stack_user_size;
 } ____cacheline_aligned;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index eabeb2aec00f..72754607d2cd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5146,15 +5146,27 @@ perf_output_sample_regs(struct perf_output_handle *handle,
 	}
 }
 
-static void perf_sample_regs_user(struct perf_regs *regs_user,
-				  struct pt_regs *regs,
-				  struct pt_regs *regs_user_copy)
+static DEFINE_PER_CPU(struct pt_regs, __regs_user[4]);
+
+static struct pt_regs *regs_user_ptr(void)
+{
+	if (in_nmi())
+		return this_cpu_ptr(&__regs_user[0]);
+	if (in_interrupt())
+		return this_cpu_ptr(&__regs_user[1]);
+	if (in_serving_softirq())
+		return this_cpu_ptr(&__regs_user[2]);
+	return this_cpu_ptr(&__regs_user[3]);
+}
+
+static void
+perf_sample_regs_user(struct perf_regs *regs_user, struct pt_regs *regs)
 {
 	if (user_mode(regs)) {
 		regs_user->abi = perf_reg_abi(current);
 		regs_user->regs = regs;
 	} else if (current->mm) {
-		perf_get_regs_user(regs_user, regs, regs_user_copy);
+		perf_get_regs_user(regs_user, regs, regs_user_ptr());
 	} else {
 		regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
 		regs_user->regs = NULL;
@@ -5638,8 +5650,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 	}
 
 	if (sample_type & (PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER))
-		perf_sample_regs_user(&data->regs_user, regs,
-				      &data->regs_user_copy);
+		perf_sample_regs_user(&data->regs_user, regs);
 
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
 		/* regs dump ABI info */

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
  2016-04-25 15:16                                 ` Peter Zijlstra
@ 2016-04-25 16:50                                   ` Andy Lutomirski
  0 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2016-04-25 16:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dmitry Safonov, Thomas Gleixner, Shuah Khan, Ingo Molnar,
	Dave Hansen, Borislav Petkov, khorenko, X86 ML, Andrew Morton,
	xemul, linux-kselftest, Cyrill Gorcunov, Dmitry Safonov,
	linux-kernel, H. Peter Anvin

On Mon, Apr 25, 2016 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Apr 21, 2016 at 04:27:19PM -0700, Andy Lutomirski wrote:
>> > Did that help? Or did I confuse you moar?
>> >
>>
>> I think I'm starting to get it.  What if we rearrange slightly, like this:
>>
>> perf_sample_data already has a struct perf_regs in it.  We could add a
>> flags field to the first chunk of perf_sample_data:
>>
>> u64 sample_flags;
>
> I actually considered that for another problem. Didn't like it then, but
> seeing how I still haven't figured out a better way and you're now
> proposing this too, maybe...
>
> Part of the problem is that this will completely exhaust that first
> cacheline :/

What do you mean?  You have a whole 63 bits left :)

Another option would be to initialize regs_user.regs to
PERF_REGS_NOT_YET_FILLED (#defined to ~0 or whatever).  That would
involve a *write* to an otherwise possibly unused cacheline, which is
less than ideal but is probably considerably less bad than reading the
cacheline.
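
Something like (hypothetical names):

    #define PERF_REGS_NOT_YET_FILLED    ((struct pt_regs *)-1L)

    /* in perf_sample_data_init(): */
    data->regs_user.regs = PERF_REGS_NOT_YET_FILLED;

    /* in perf_fetch_regs_user(): */
    if (data->regs_user.regs != PERF_REGS_NOT_YET_FILLED)
        return;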

>
>> perf_sample_data_init sets sample_flags to zero.
>
> And while we're on struct perf_sample_data, that thing has gotten
> insanely large. We carry it on-stack!
>
> It should be fairly easy to take regs_user_copy out and use a per-cpu
> array of them things for this I think, see below.
>
>> Now we rename perf_sample_regs_user to __perf_sample_regs_user and
>> make it non-static.  We also teach it to set data->sample_flags |=
>> PERF_SAMPLE_FLAGS_HAS_REGS_USER.  We add:
>>
>> static void perf_fetch_regs_user(struct perf_sample_data *data, struct
>> pt_regs *interrupt_regs)
>> {
>>   if (data->sample_flags & PERF_SAMPLE_FLAGS_HAS_REGS_USER)
>>     return;
>>
>>   __perf_sample_regs_user(&data->regs_user, interrupt_regs,
>> &data->regs_user_copy);
>> }
>
> I meant to change perf_prepare_sample() to do:
>
>         u64 sample_type = event->attr.sample_type & ~data.sample_type;
>
> or something similar, such that we can override/avoid some of the work
> there.

I'm not sure I follow, but that's okay.

>
>> (Hmm.  This only really works well if we can guarantee that
>> interrupt_regs remains valid for the life of the perf_sample_data
>> object.  Could we perhaps move the interrupt_regs pointer *into*
>> perf_sample_data and stop passing it all over the place?)
>
> So the problem with that is that we'll now overflow the one cacheline,
> and the last time I really looked at this that made samples that much
> slower.
>
> It might be time to re-evaluate this stuff, since pretty much everything
> will eventually write into perf_sample_data::ip etc.. which is the
> second line anyway.
>
> Also, looking at it, we actually have a pointer in there for this,
> perf_sample_data::regs_intr::regs, but its at the very tail of this
> monster, 4 cachelines off the normal path.
>
>> We change all the callers of perf_sample_regs_user to use
>> perf_fetch_regs_user instead.
>
> There's only the one site currently, but yeah.
>
>> What do you think?  If you like it, I can probably find some time to
>> give it a shot, but I don't guarantee that I won't miss some subtlety
>> in its interaction with the rest of the event output code.
>
> Sure give it a go, I'll stomp on it to fix the pebs-time issue (we need
> to skip perf_prepare_sample's PERF_SAMPLE_TIME branch for that).

Will do.  No promises about the time frame -- my queue overfloweth
right now.  But I do have a draft patch or two that I should be able
to dust off a bit over the next few days.

>
>> On a vaguely related note, why is the big PEBS-to-pt_regs copy
>> conditional on (sample_type & PERF_SAMPLE_REGS_INTR)?  I bet it would
>> be faster to make it unconditional, because you could avoid copying
>> over the entire pt_regs struct if PERF_SAMPLE_REGS_INTR isn't set.
>
> Hmm, yes.. that code did move about a bit, not sure what it looked like
> originally.
>
> In any case, that full copy is overkill in the simple case as well; I
> think that could get away with only copying cs,flags.
>

I'd be more comfortable with it if we always either populated all or
none of it or otherwise made sure that unpopulated regs never leaked
out into a sample.

>
> Compile tested only..
>
> ---
> Subject: perf: Replace perf_sample_data::regs_user_copy with per-cpu storage
>
> struct perf_sample_data is immense, and we carry it on stack, shrink it
> some.
>
> struct perf_sample_data {
>         /* size: 384, cachelines: 6, members: 19 */
> }
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/perf_event.h |  2 --
>  kernel/events/core.c       | 23 +++++++++++++++++------
>  2 files changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 85749ae8cb5f..dd2cab6c5bbb 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -795,8 +795,6 @@ struct perf_sample_data {
>          * on arch details.
>          */
>         struct perf_regs                regs_user;
> -       struct pt_regs                  regs_user_copy;
> -
>         struct perf_regs                regs_intr;
>         u64                             stack_user_size;
>  } ____cacheline_aligned;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index eabeb2aec00f..72754607d2cd 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5146,15 +5146,27 @@ perf_output_sample_regs(struct perf_output_handle *handle,
>         }
>  }
>
> -static void perf_sample_regs_user(struct perf_regs *regs_user,
> -                                 struct pt_regs *regs,
> -                                 struct pt_regs *regs_user_copy)
> +static DEFINE_PER_CPU(struct pt_regs, __regs_user[4]);
> +
> +static struct pt_regs *regs_user_ptr(void)
> +{
> +       if (in_nmi())
> +               return this_cpu_ptr(&__regs_user[0]);
> +       if (in_interrupt())
> +               return this_cpu_ptr(&__regs_user[1]);
> +       if (in_serving_softirq())
> +               return this_cpu_ptr(&__regs_user[2]);
> +       return this_cpu_ptr(&__regs_user[3]);
> +}
> +

There's already something very similar in
kernel/trace/trace_event_perf.c and core.c
(perf_swevent_get_recursion_context()) that explicitly counts
recursion.  Could they maybe be merged?  I.e. there could just be a
per-cpu pile of pt_regs structs and a simple allocator for them?  E.g.
perf_sample_data_init could increment some counter and
perf_sample_data_free (which doesn't currently exist) could decrement
the counter.
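
Roughly (hypothetical, error handling omitted):

    static DEFINE_PER_CPU(struct pt_regs, perf_regs_pool[4]);
    static DEFINE_PER_CPU(int, perf_regs_depth);

    static struct pt_regs *perf_regs_alloc(void)
    {
        /* this_cpu_inc_return() is NMI-safe on x86 */
        int idx = this_cpu_inc_return(perf_regs_depth) - 1;

        return this_cpu_ptr(&perf_regs_pool[idx & 3]);
    }

    static void perf_regs_free(void)
    {
        this_cpu_dec(perf_regs_depth);
    }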

I don't personally mind keeping one of these on the stack -- it's not
*that* big.

But maybe there's a much better solution.  There is only ever one set
of user regs at a time.  If perf events nest, then the user regs are
exactly the same.  I wonder if this means that there could be a single
percpu copy of this mess.  It might not be quite that simple, because
an NMI could hit in the middle of populating the thing, though.
Grumble.

> +static void
> +perf_sample_regs_user(struct perf_regs *regs_user, struct pt_regs *regs)
>  {
>         if (user_mode(regs)) {
>                 regs_user->abi = perf_reg_abi(current);
>                 regs_user->regs = regs;
>         } else if (current->mm) {
> -               perf_get_regs_user(regs_user, regs, regs_user_copy);
> +               perf_get_regs_user(regs_user, regs, regs_user_ptr());
>         } else {
>                 regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
>                 regs_user->regs = NULL;
> @@ -5638,8 +5650,7 @@ void perf_prepare_sample(struct perf_event_header *header,
>         }
>
>         if (sample_type & (PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER))
> -               perf_sample_regs_user(&data->regs_user, regs,
> -                                     &data->regs_user_copy);
> +               perf_sample_regs_user(&data->regs_user, regs);

The rest looks reasonable.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-04-25 16:50 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-06 16:29 [PATCH 0/2] x86: add arch_prctl to switch between native/compat modes Dmitry Safonov
2016-04-06 16:29 ` [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode Dmitry Safonov
2016-04-06 18:04   ` Andy Lutomirski
2016-04-06 18:49     ` Andy Lutomirski
2016-04-07 12:11     ` Dmitry Safonov
2016-04-07 12:21       ` Cyrill Gorcunov
2016-04-07 12:35         ` Dmitry Safonov
2016-04-07 14:39       ` Andy Lutomirski
2016-04-07 15:18         ` Dmitry Safonov
2016-04-08 13:50         ` Dmitry Safonov
2016-04-08 15:56           ` Andy Lutomirski
2016-04-08 16:18             ` Dmitry Safonov
2016-04-08 20:44               ` Andy Lutomirski
2016-04-09  8:06                 ` Dmitry Safonov
2016-04-13 16:55                 ` Dmitry Safonov
2016-04-14 18:27                   ` Andy Lutomirski
2016-04-20 11:04                     ` Peter Zijlstra
2016-04-20 15:40                       ` Andy Lutomirski
2016-04-20 19:05                         ` Peter Zijlstra
2016-04-21 19:39                           ` Andy Lutomirski
2016-04-21 20:12                             ` Peter Zijlstra
2016-04-21 23:27                               ` Andy Lutomirski
2016-04-21 23:46                                 ` Andy Lutomirski
2016-04-25 15:16                                 ` Peter Zijlstra
2016-04-25 16:50                                   ` Andy Lutomirski
2016-04-06 16:29 ` [PATCH 2/2] x86/tools/testing: add test for ARCH_SET_COMPAT Dmitry Safonov
