* [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes
@ 2014-07-22  1:49 ` Andy Lutomirski
  0 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-22  1:49 UTC (permalink / raw)
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa, Andy Lutomirski

[applies on jmorris's security-next tree]

This is both a cleanup and a speedup.  It reduces overhead due to
installing a trivial seccomp filter by 87%.  The speedup comes from
avoiding the full syscall tracing mechanism for filters that don't
return SECCOMP_RET_TRACE.

This series works by splitting the seccomp hooks into two phases.
The first phase evaluates the filter; it can skip syscalls, allow
them, kill the calling task, or pass a u32 to the second phase.  The
second phase requires a full tracing context, and it sends ptrace
events if necessary.

Once that was done, I implemented a similar split for the x86 syscall
entry work.  The C callback is invoked in two phases: the first has
only a partial frame, and it can request phase 2 processing with a
full frame.
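
For the arch-facing side, the intended calling sequence works out to
roughly the sketch below (illustrative only: seccomp_phase1,
seccomp_phase2, and the SECCOMP_PHASE1_* codes come from patch 2, the
wrapper name is invented, and the real x86 wiring is in the later
patches):

/* Illustrative sketch, not part of the series itself. */
static long example_arch_seccomp_hook(struct seccomp_data *sd)
{
	u32 phase1;

	if (!test_thread_flag(TIF_SECCOMP))
		return 0;			/* no seccomp work at all */

	phase1 = seccomp_phase1(sd);		/* cheap; no full frame required */
	if (phase1 == SECCOMP_PHASE1_OK)
		return 0;			/* run the syscall normally */
	if (phase1 == SECCOMP_PHASE1_SKIP)
		return -1;			/* return value already set; skip it */

	/* Everything else needs a ptrace-safe context, i.e. a full frame. */
	return seccomp_phase2(phase1);		/* 0 = run syscall, -1 = skip */
}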

Finally, I switched the 64-bit system_call code to use the new split
entry work.  This is a net deletion of assembly code: it replaces
all of the audit entry muck.

In the process, I fixed some bugs.

If this is acceptable, someone can do the same tweak for the
ia32entry and entry_32 code.

This passes all seccomp tests that I know of.  Now that it's properly
rebased, even the previously expected failures are gone.

Kees, if you like this version, can you create a branch with patches
1-4?  I think that the rest should go into tip/x86 once everyone's happy
with it.

Changes from v2:
 - Fixed 32-bit x86 build (and the tests pass).
 - Put the doc patch where it belongs.

Changes from v1:
 - Rebased on top of Kees' shiny new seccomp tree (no effect on the x86
   part).
 - Improved patch 6 vs patch 7 split (thanks Alexei!)
 - Fixed bogus -ENOSYS in patch 5 (thanks Kees!)
 - Improved changelog message in patch 6.

Changes from RFC version:
 - The first three patches are more or less the same
 - The rest is more or less a rewrite

Andy Lutomirski (8):
  seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing
  seccomp: Refactor the filter callback and the API
  seccomp: Allow arch code to provide seccomp_data
  seccomp: Document two-phase seccomp and arch-provided seccomp_data
  x86,x32,audit: Fix x32's AUDIT_ARCH wrt audit
  x86: Split syscall_trace_enter into two phases
  x86_64,entry: Treat regs->ax the same in fastpath and slowpath
    syscalls
  x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls

 arch/Kconfig                   |  11 ++
 arch/arm/kernel/ptrace.c       |   7 +-
 arch/mips/kernel/ptrace.c      |   2 +-
 arch/s390/kernel/ptrace.c      |   2 +-
 arch/x86/include/asm/calling.h |   6 +-
 arch/x86/include/asm/ptrace.h  |   5 +
 arch/x86/kernel/entry_64.S     |  51 ++++-----
 arch/x86/kernel/ptrace.c       | 150 +++++++++++++++++++-----
 arch/x86/kernel/vsyscall_64.c  |   2 +-
 include/linux/seccomp.h        |  25 ++--
 kernel/seccomp.c               | 252 ++++++++++++++++++++++++++++-------------
 11 files changed, 360 insertions(+), 153 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v3 1/8] seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing
  2014-07-22  1:49 ` Andy Lutomirski
@ 2014-07-22  1:49   ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-22  1:49 UTC (permalink / raw)
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa, Andy Lutomirski,
	Russell King, Ralf Baechle, Martin Schwidefsky, Heiko Carstens,
	linux-s390

The secure_computing function took a syscall number parameter, but
it only paid any attention to that parameter if seccomp mode 1 was
enabled.  Rather than coming up with a kludge to get the parameter
to work in mode 2, just remove the parameter.

To avoid churn in arches that don't have seccomp filters (and may
not even support syscall_get_nr right now), this leaves the
parameter in secure_computing_strict, which is now a real function.

For ARM, this is a bit ugly because ARM conditionally
supports seccomp filters.  Fixing that would probably only be a
couple of lines of code, but it should be coordinated with the audit
maintainers.

This will be a slight slowdown on some arches.  The right fix is to
pass in all of seccomp_data instead of trying to make just the
syscall nr part be fast.

This is a prerequisite for making two-phase seccomp work cleanly.
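
The net effect at an arch call site looks like this (an illustrative
condensation of the ARM hunk below; the wrapper name is made up):

/* Illustrative only: the call-site convention after this patch. */
static long example_trace_enter(int scno)
{
#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
	/* Filter-capable arches: no nr argument; it comes from syscall_get_nr(). */
	if (secure_computing() == -1)
		return -1;
#else
	/* Mode-1-only arches keep passing nr; violations kill the task. */
	secure_computing_strict(scno);
#endif
	return 0;
}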

Cc: Russell King <linux@arm.linux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/arm/kernel/ptrace.c      |  7 ++++-
 arch/mips/kernel/ptrace.c     |  2 +-
 arch/s390/kernel/ptrace.c     |  2 +-
 arch/x86/kernel/ptrace.c      |  2 +-
 arch/x86/kernel/vsyscall_64.c |  2 +-
 include/linux/seccomp.h       | 21 +++++++-------
 kernel/seccomp.c              | 64 ++++++++++++++++++++++++++++++-------------
 7 files changed, 66 insertions(+), 34 deletions(-)

diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index 0c27ed6..5e772a2 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -933,8 +933,13 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs, int scno)
 	current_thread_info()->syscall = scno;
 
 	/* Do the secure computing check first; failures should be fast. */
-	if (secure_computing(scno) == -1)
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
+	if (secure_computing() == -1)
 		return -1;
+#else
+	/* XXX: remove this once OABI gets fixed */
+	secure_computing_strict(scno);
+#endif
 
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
index f639ccd..808bafc 100644
--- a/arch/mips/kernel/ptrace.c
+++ b/arch/mips/kernel/ptrace.c
@@ -639,7 +639,7 @@ asmlinkage long syscall_trace_enter(struct pt_regs *regs, long syscall)
 	long ret = 0;
 	user_exit();
 
-	if (secure_computing(syscall) == -1)
+	if (secure_computing() == -1)
 		return -1;
 
 	if (test_thread_flag(TIF_SYSCALL_TRACE) &&
diff --git a/arch/s390/kernel/ptrace.c b/arch/s390/kernel/ptrace.c
index 2d716734..7ab8b91 100644
--- a/arch/s390/kernel/ptrace.c
+++ b/arch/s390/kernel/ptrace.c
@@ -795,7 +795,7 @@ asmlinkage long do_syscall_trace_enter(struct pt_regs *regs)
 	long ret = 0;
 
 	/* Do the secure computing check first. */
-	if (secure_computing(regs->gprs[2])) {
+	if (secure_computing()) {
 		/* seccomp failures shouldn't expose any additional code. */
 		ret = -1;
 		goto out;
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 678c0ad..93c182a 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1471,7 +1471,7 @@ long syscall_trace_enter(struct pt_regs *regs)
 		regs->flags |= X86_EFLAGS_TF;
 
 	/* do the secure computing check first */
-	if (secure_computing(regs->orig_ax)) {
+	if (secure_computing()) {
 		/* seccomp failures shouldn't expose any additional code. */
 		ret = -1L;
 		goto out;
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index ea5b570..23c0c23 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -216,7 +216,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
 	 */
 	regs->orig_ax = syscall_nr;
 	regs->ax = -ENOSYS;
-	tmp = secure_computing(syscall_nr);
+	tmp = secure_computing();
 	if ((!tmp && regs->orig_ax != syscall_nr) || regs->ip != address) {
 		warn_bad_vsyscall(KERN_DEBUG, regs,
 				  "seccomp tried to change syscall nr or ip");
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 5d586a4..aa3c040 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -27,19 +27,17 @@ struct seccomp {
 	struct seccomp_filter *filter;
 };
 
-extern int __secure_computing(int);
-static inline int secure_computing(int this_syscall)
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
+extern int __secure_computing(void);
+static inline int secure_computing(void)
 {
 	if (unlikely(test_thread_flag(TIF_SECCOMP)))
-		return  __secure_computing(this_syscall);
+		return  __secure_computing();
 	return 0;
 }
-
-/* A wrapper for architectures supporting only SECCOMP_MODE_STRICT. */
-static inline void secure_computing_strict(int this_syscall)
-{
-	BUG_ON(secure_computing(this_syscall) != 0);
-}
+#else
+extern void secure_computing_strict(int this_syscall);
+#endif
 
 extern long prctl_get_seccomp(void);
 extern long prctl_set_seccomp(unsigned long, char __user *);
@@ -56,8 +54,11 @@ static inline int seccomp_mode(struct seccomp *s)
 struct seccomp { };
 struct seccomp_filter { };
 
-static inline int secure_computing(int this_syscall) { return 0; }
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
+static inline int secure_computing(void) { return 0; }
+#else
 static inline void secure_computing_strict(int this_syscall) { return; }
+#endif
 
 static inline long prctl_get_seccomp(void)
 {
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 74f4601..861d7ee 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -23,8 +23,11 @@
 
 /* #define SECCOMP_DEBUG 1 */
 
-#ifdef CONFIG_SECCOMP_FILTER
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 #include <asm/syscall.h>
+#endif
+
+#ifdef CONFIG_SECCOMP_FILTER
 #include <linux/filter.h>
 #include <linux/pid.h>
 #include <linux/ptrace.h>
@@ -172,7 +175,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
  *
  * Returns valid seccomp BPF response codes.
  */
-static u32 seccomp_run_filters(int syscall)
+static u32 seccomp_run_filters(void)
 {
 	struct seccomp_filter *f = ACCESS_ONCE(current->seccomp.filter);
 	struct seccomp_data sd;
@@ -564,10 +567,43 @@ static int mode1_syscalls_32[] = {
 };
 #endif
 
-int __secure_computing(int this_syscall)
+static void __secure_computing_strict(int this_syscall)
+{
+	int *syscall_whitelist = mode1_syscalls;
+#ifdef CONFIG_COMPAT
+	if (is_compat_task())
+		syscall_whitelist = mode1_syscalls_32;
+#endif
+	do {
+		if (*syscall_whitelist == this_syscall)
+			return;
+	} while (*++syscall_whitelist);
+
+#ifdef SECCOMP_DEBUG
+	dump_stack();
+#endif
+	audit_seccomp(this_syscall, SIGKILL, SECCOMP_RET_KILL);
+	do_exit(SIGKILL);
+}
+
+#ifndef CONFIG_HAVE_ARCH_SECCOMP_FILTER
+void secure_computing_strict(int this_syscall)
+{
+	int mode = current->seccomp.mode;
+
+	if (mode == 0)
+		return;
+	else if (mode == SECCOMP_MODE_STRICT)
+		__secure_computing_strict(this_syscall);
+	else
+		BUG();
+}
+#else
+int __secure_computing(void)
 {
+	struct pt_regs *regs = task_pt_regs(current);
+	int this_syscall = syscall_get_nr(current, regs);
 	int exit_sig = 0;
-	int *syscall;
 	u32 ret;
 
 	/*
@@ -578,23 +614,12 @@ int __secure_computing(int this_syscall)
 
 	switch (current->seccomp.mode) {
 	case SECCOMP_MODE_STRICT:
-		syscall = mode1_syscalls;
-#ifdef CONFIG_COMPAT
-		if (is_compat_task())
-			syscall = mode1_syscalls_32;
-#endif
-		do {
-			if (*syscall == this_syscall)
-				return 0;
-		} while (*++syscall);
-		exit_sig = SIGKILL;
-		ret = SECCOMP_RET_KILL;
-		break;
+		__secure_computing_strict(this_syscall);
+		return 0;
 #ifdef CONFIG_SECCOMP_FILTER
 	case SECCOMP_MODE_FILTER: {
 		int data;
-		struct pt_regs *regs = task_pt_regs(current);
-		ret = seccomp_run_filters(this_syscall);
+		ret = seccomp_run_filters();
 		data = ret & SECCOMP_RET_DATA;
 		ret &= SECCOMP_RET_ACTION;
 		switch (ret) {
@@ -652,9 +677,10 @@ int __secure_computing(int this_syscall)
 #ifdef CONFIG_SECCOMP_FILTER
 skip:
 	audit_seccomp(this_syscall, exit_sig, ret);
-#endif
 	return -1;
+#endif
 }
+#endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */
 
 long prctl_get_seccomp(void)
 {
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 2/8] seccomp: Refactor the filter callback and the API
  2014-07-22  1:49 ` Andy Lutomirski
@ 2014-07-22  1:49   ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-22  1:49 UTC (permalink / raw)
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa, Andy Lutomirski

The reason I did this is to add a seccomp API that will be usable
for an x86 fast path.  The x86 entry code needs to use a rather
expensive slow path for a syscall that might be visible to things
like ptrace.  By splitting seccomp into two phases, we can check
whether we need the slow path and then use the fast path if the
filter allows the syscall or just returns some errno.
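
For a concrete picture of what stays on the fast path, here is a
minimal userspace filter (illustrative, not part of this patch) whose
only outcomes are ALLOW and ERRNO, so it never needs phase 2:

/* Userspace sketch: fail getpid() with EPERM, allow everything else. */
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

static int install_example_filter(void)
{
	struct sock_filter insns[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}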

As a side effect, I think the new code is much easier to understand
than the old code.

This has one user-visible effect: the audit record written for
SECCOMP_RET_TRACE is now a simple indication that SECCOMP_RET_TRACE
happened.  It used to depend in a complicated way on what the tracer
did.  I couldn't make much sense of it.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/seccomp.h |   6 ++
 kernel/seccomp.c        | 190 +++++++++++++++++++++++++++++++-----------------
 2 files changed, 130 insertions(+), 66 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index aa3c040..3885108 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -35,6 +35,12 @@ static inline int secure_computing(void)
 		return  __secure_computing();
 	return 0;
 }
+
+#define SECCOMP_PHASE1_OK	0
+#define SECCOMP_PHASE1_SKIP	1
+
+extern u32 seccomp_phase1(void);
+int seccomp_phase2(u32 phase1_result);
 #else
 extern void secure_computing_strict(int this_syscall);
 #endif
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 861d7ee..0088d29 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -21,8 +21,6 @@
 #include <linux/slab.h>
 #include <linux/syscalls.h>
 
-/* #define SECCOMP_DEBUG 1 */
-
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 #include <asm/syscall.h>
 #endif
@@ -601,10 +599,21 @@ void secure_computing_strict(int this_syscall)
 #else
 int __secure_computing(void)
 {
-	struct pt_regs *regs = task_pt_regs(current);
-	int this_syscall = syscall_get_nr(current, regs);
-	int exit_sig = 0;
-	u32 ret;
+	u32 phase1_result = seccomp_phase1();
+
+	if (likely(phase1_result == SECCOMP_PHASE1_OK))
+		return 0;
+	else if (likely(phase1_result == SECCOMP_PHASE1_SKIP))
+		return -1;
+	else
+		return seccomp_phase2(phase1_result);
+}
+
+#ifdef CONFIG_SECCOMP_FILTER
+static u32 __seccomp_phase1_filter(int this_syscall, struct pt_regs *regs)
+{
+	u32 filter_ret, action;
+	int data;
 
 	/*
 	 * Make sure that any changes to mode from another thread have
@@ -612,73 +621,122 @@ int __secure_computing(void)
 	 */
 	rmb();
 
-	switch (current->seccomp.mode) {
+	filter_ret = seccomp_run_filters();
+	data = filter_ret & SECCOMP_RET_DATA;
+	action = filter_ret & SECCOMP_RET_ACTION;
+
+	switch (action) {
+	case SECCOMP_RET_ERRNO:
+		/* Set the low-order 16-bits as a errno. */
+		syscall_set_return_value(current, regs,
+					 -data, 0);
+		goto skip;
+
+	case SECCOMP_RET_TRAP:
+		/* Show the handler the original registers. */
+		syscall_rollback(current, regs);
+		/* Let the filter pass back 16 bits of data. */
+		seccomp_send_sigsys(this_syscall, data);
+		goto skip;
+
+	case SECCOMP_RET_TRACE:
+		return filter_ret;  /* Save the rest for phase 2. */
+
+	case SECCOMP_RET_ALLOW:
+		return SECCOMP_PHASE1_OK;
+
+	case SECCOMP_RET_KILL:
+	default:
+		audit_seccomp(this_syscall, SIGSYS, action);
+		do_exit(SIGSYS);
+	}
+
+	unreachable();
+
+skip:
+	audit_seccomp(this_syscall, 0, action);
+	return SECCOMP_PHASE1_SKIP;
+}
+#endif
+
+/**
+ * seccomp_phase1() - run fast path seccomp checks on the current syscall
+ *
+ * This only reads pt_regs via the syscall_xyz helpers.  The only change
+ * it will make to pt_regs is via syscall_set_return_value, and it will
+ * only do that if it returns SECCOMP_PHASE1_SKIP.
+ *
+ * It may also call do_exit or force a signal; these actions must be
+ * safe.
+ *
+ * If it returns SECCOMP_PHASE1_OK, the syscall passes checks and should
+ * be processed normally.
+ *
+ * If it returns SECCOMP_PHASE1_SKIP, then the syscall should not be
+ * invoked.  In this case, seccomp_phase1 will have set the return value
+ * using syscall_set_return_value.
+ *
+ * If it returns anything else, then the return value should be passed
+ * to seccomp_phase2 from a context in which ptrace hooks are safe.
+ */
+u32 seccomp_phase1(void)
+{
+	int mode = current->seccomp.mode;
+	struct pt_regs *regs = task_pt_regs(current);
+	int this_syscall = syscall_get_nr(current, regs);
+
+	switch (mode) {
 	case SECCOMP_MODE_STRICT:
-		__secure_computing_strict(this_syscall);
-		return 0;
+		__secure_computing_strict(this_syscall);  /* may call do_exit */
+		return SECCOMP_PHASE1_OK;
 #ifdef CONFIG_SECCOMP_FILTER
-	case SECCOMP_MODE_FILTER: {
-		int data;
-		ret = seccomp_run_filters();
-		data = ret & SECCOMP_RET_DATA;
-		ret &= SECCOMP_RET_ACTION;
-		switch (ret) {
-		case SECCOMP_RET_ERRNO:
-			/* Set the low-order 16-bits as a errno. */
-			syscall_set_return_value(current, regs,
-						 -data, 0);
-			goto skip;
-		case SECCOMP_RET_TRAP:
-			/* Show the handler the original registers. */
-			syscall_rollback(current, regs);
-			/* Let the filter pass back 16 bits of data. */
-			seccomp_send_sigsys(this_syscall, data);
-			goto skip;
-		case SECCOMP_RET_TRACE:
-			/* Skip these calls if there is no tracer. */
-			if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
-				syscall_set_return_value(current, regs,
-							 -ENOSYS, 0);
-				goto skip;
-			}
-			/* Allow the BPF to provide the event message */
-			ptrace_event(PTRACE_EVENT_SECCOMP, data);
-			/*
-			 * The delivery of a fatal signal during event
-			 * notification may silently skip tracer notification.
-			 * Terminating the task now avoids executing a system
-			 * call that may not be intended.
-			 */
-			if (fatal_signal_pending(current))
-				break;
-			if (syscall_get_nr(current, regs) < 0)
-				goto skip;  /* Explicit request to skip. */
-
-			return 0;
-		case SECCOMP_RET_ALLOW:
-			return 0;
-		case SECCOMP_RET_KILL:
-		default:
-			break;
-		}
-		exit_sig = SIGSYS;
-		break;
-	}
+	case SECCOMP_MODE_FILTER:
+		return __seccomp_phase1_filter(this_syscall, regs);
 #endif
 	default:
 		BUG();
 	}
+}
 
-#ifdef SECCOMP_DEBUG
-	dump_stack();
-#endif
-	audit_seccomp(this_syscall, exit_sig, ret);
-	do_exit(exit_sig);
-#ifdef CONFIG_SECCOMP_FILTER
-skip:
-	audit_seccomp(this_syscall, exit_sig, ret);
-	return -1;
-#endif
+/**
+ * seccomp_phase2() - finish slow path seccomp work for the current syscall
+ * @phase1_result: The return value from seccomp_phase1()
+ *
+ * This must be called from a context in which ptrace hooks can be used.
+ *
+ * Returns 0 if the syscall should be processed or -1 to skip the syscall.
+ */
+int seccomp_phase2(u32 phase1_result)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	u32 action = phase1_result & SECCOMP_RET_ACTION;
+	int data = phase1_result & SECCOMP_RET_DATA;
+
+	BUG_ON(action != SECCOMP_RET_TRACE);
+
+	audit_seccomp(syscall_get_nr(current, regs), 0, action);
+
+	/* Skip these calls if there is no tracer. */
+	if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
+		syscall_set_return_value(current, regs,
+					 -ENOSYS, 0);
+		return -1;
+	}
+
+	/* Allow the BPF to provide the event message */
+	ptrace_event(PTRACE_EVENT_SECCOMP, data);
+	/*
+	 * The delivery of a fatal signal during event
+	 * notification may silently skip tracer notification.
+	 * Terminating the task now avoids executing a system
+	 * call that may not be intended.
+	 */
+	if (fatal_signal_pending(current))
+		do_exit(SIGSYS);
+	if (syscall_get_nr(current, regs) < 0)
+		return -1;  /* Explicit request to skip. */
+
+	return 0;
 }
 #endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 3/8] seccomp: Allow arch code to provide seccomp_data
  2014-07-22  1:49 ` Andy Lutomirski
@ 2014-07-22  1:49   ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-22  1:49 UTC (permalink / raw)
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa, Andy Lutomirski

populate_seccomp_data is expensive: it works by inspecting
task_pt_regs and various other bits to piece together all the
information, and it does so in multiple partially redundant steps.

Arch-specific code in the syscall entry path can do much better.

Admittedly this adds a bit of additional room for error, but the
speedup should be worth it.
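
As a rough illustration of what the arch code can do instead (a sketch
only; the real x86 version comes later in the series, and the register
mapping here simply follows the x86_64 syscall ABI and the seccomp_data
layout in include/uapi/linux/seccomp.h):

/* Sketch only, not taken from this series. */
static u32 seccomp_phase1_from_regs(struct pt_regs *regs)
{
	struct seccomp_data sd;

	sd.nr = regs->orig_ax;
	sd.arch = AUDIT_ARCH_X86_64;		/* assumes a native 64-bit syscall */
	sd.instruction_pointer = regs->ip;
	sd.args[0] = regs->di;
	sd.args[1] = regs->si;
	sd.args[2] = regs->dx;
	sd.args[3] = regs->r10;
	sd.args[4] = regs->r8;
	sd.args[5] = regs->r9;

	return seccomp_phase1(&sd);
}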

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/seccomp.h |  2 +-
 kernel/seccomp.c        | 32 +++++++++++++++++++-------------
 2 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 3885108..a19ddac 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -39,7 +39,7 @@ static inline int secure_computing(void)
 #define SECCOMP_PHASE1_OK	0
 #define SECCOMP_PHASE1_SKIP	1
 
-extern u32 seccomp_phase1(void);
+extern u32 seccomp_phase1(struct seccomp_data *sd);
 int seccomp_phase2(u32 phase1_result);
 #else
 extern void secure_computing_strict(int this_syscall);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 0088d29..80115b0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -173,10 +173,10 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
  *
  * Returns valid seccomp BPF response codes.
  */
-static u32 seccomp_run_filters(void)
+static u32 seccomp_run_filters(struct seccomp_data *sd)
 {
 	struct seccomp_filter *f = ACCESS_ONCE(current->seccomp.filter);
-	struct seccomp_data sd;
+	struct seccomp_data sd_local;
 	u32 ret = SECCOMP_RET_ALLOW;
 
 	/* Ensure unexpected behavior doesn't result in failing open. */
@@ -186,14 +186,17 @@ static u32 seccomp_run_filters(void)
 	/* Make sure cross-thread synced filter points somewhere sane. */
 	smp_read_barrier_depends();
 
-	populate_seccomp_data(&sd);
+	if (!sd) {
+		populate_seccomp_data(&sd_local);
+		sd = &sd_local;
+	}
 
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
 	 */
 	for (; f; f = f->prev) {
-		u32 cur_ret = SK_RUN_FILTER(f->prog, (void *)&sd);
+		u32 cur_ret = SK_RUN_FILTER(f->prog, (void *)sd);
 
 		if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
 			ret = cur_ret;
@@ -599,7 +602,7 @@ void secure_computing_strict(int this_syscall)
 #else
 int __secure_computing(void)
 {
-	u32 phase1_result = seccomp_phase1();
+	u32 phase1_result = seccomp_phase1(NULL);
 
 	if (likely(phase1_result == SECCOMP_PHASE1_OK))
 		return 0;
@@ -610,7 +613,7 @@ int __secure_computing(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
-static u32 __seccomp_phase1_filter(int this_syscall, struct pt_regs *regs)
+static u32 __seccomp_phase1_filter(int this_syscall, struct seccomp_data *sd)
 {
 	u32 filter_ret, action;
 	int data;
@@ -621,20 +624,20 @@ static u32 __seccomp_phase1_filter(int this_syscall, struct pt_regs *regs)
 	 */
 	rmb();
 
-	filter_ret = seccomp_run_filters();
+	filter_ret = seccomp_run_filters(sd);
 	data = filter_ret & SECCOMP_RET_DATA;
 	action = filter_ret & SECCOMP_RET_ACTION;
 
 	switch (action) {
 	case SECCOMP_RET_ERRNO:
 		/* Set the low-order 16-bits as a errno. */
-		syscall_set_return_value(current, regs,
+		syscall_set_return_value(current, task_pt_regs(current),
 					 -data, 0);
 		goto skip;
 
 	case SECCOMP_RET_TRAP:
 		/* Show the handler the original registers. */
-		syscall_rollback(current, regs);
+		syscall_rollback(current, task_pt_regs(current));
 		/* Let the filter pass back 16 bits of data. */
 		seccomp_send_sigsys(this_syscall, data);
 		goto skip;
@@ -661,11 +664,14 @@ skip:
 
 /**
  * seccomp_phase1() - run fast path seccomp checks on the current syscall
+ * @arg sd: The seccomp_data or NULL
  *
  * This only reads pt_regs via the syscall_xyz helpers.  The only change
  * it will make to pt_regs is via syscall_set_return_value, and it will
  * only do that if it returns SECCOMP_PHASE1_SKIP.
  *
+ * If sd is provided, it will not read pt_regs at all.
+ *
  * It may also call do_exit or force a signal; these actions must be
  * safe.
  *
@@ -679,11 +685,11 @@ skip:
  * If it returns anything else, then the return value should be passed
  * to seccomp_phase2 from a context in which ptrace hooks are safe.
  */
-u32 seccomp_phase1(void)
+u32 seccomp_phase1(struct seccomp_data *sd)
 {
 	int mode = current->seccomp.mode;
-	struct pt_regs *regs = task_pt_regs(current);
-	int this_syscall = syscall_get_nr(current, regs);
+	int this_syscall = sd ? sd->nr :
+		syscall_get_nr(current, task_pt_regs(current));
 
 	switch (mode) {
 	case SECCOMP_MODE_STRICT:
@@ -691,7 +697,7 @@ u32 seccomp_phase1(void)
 		return SECCOMP_PHASE1_OK;
 #ifdef CONFIG_SECCOMP_FILTER
 	case SECCOMP_MODE_FILTER:
-		return __seccomp_phase1_filter(this_syscall, regs);
+		return __seccomp_phase1_filter(this_syscall, sd);
 #endif
 	default:
 		BUG();
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 4/8] seccomp: Document two-phase seccomp and arch-provided seccomp_data
  2014-07-22  1:49 ` Andy Lutomirski
@ 2014-07-22  1:49   ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-22  1:49 UTC (permalink / raw)
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa, Andy Lutomirski

The description of how archs should implement seccomp filters was
still strictly correct, but it failed to describe the newly
available optimizations.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/Kconfig | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 0eae9df..05d7a8a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -323,6 +323,17 @@ config HAVE_ARCH_SECCOMP_FILTER
 	    results in the system call being skipped immediately.
 	  - seccomp syscall wired up
 
+	  For best performance, an arch should use seccomp_phase1 and
+	  seccomp_phase2 directly.  It should call seccomp_phase1 for all
+	  syscalls if TIF_SECCOMP is set, but seccomp_phase1 does not
+	  need to be called from a ptrace-safe context.  It must then
+	  call seccomp_phase2 if seccomp_phase1 returns anything other
+	  than SECCOMP_PHASE1_OK or SECCOMP_PHASE1_SKIP.
+
+	  As an additional optimization, an arch may provide seccomp_data
+	  directly to seccomp_phase1; this avoids multiple calls
+	  to the syscall_xyz helpers for every syscall.
+
 config SECCOMP_FILTER
 	def_bool y
 	depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 5/8] x86,x32,audit: Fix x32's AUDIT_ARCH wrt audit
  2014-07-22  1:49 ` Andy Lutomirski
@ 2014-07-22  1:53   ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-22  1:53 UTC (permalink / raw)
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa, Andy Lutomirski

is_compat_task() is the wrong check for audit arch; the check should
be is_ia32_task(): x32 syscalls should be AUDIT_ARCH_X86_64, not
AUDIT_ARCH_I386.

CONFIG_AUDITSYSCALL is currently incompatible with x32, so this has
no visible effect.
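
The distinction, spelled out (illustrative only, not part of the
patch):

/* is_ia32_task() is true only for genuine ia32 syscalls, so x32 ends
 * up reported as AUDIT_ARCH_X86_64, which is what audit expects. */
static u32 example_syscall_audit_arch(void)
{
	return is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
}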

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/ptrace.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 93c182a..39296d2 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1441,15 +1441,6 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 	force_sig_info(SIGTRAP, &info, tsk);
 }
 
-
-#ifdef CONFIG_X86_32
-# define IS_IA32	1
-#elif defined CONFIG_IA32_EMULATION
-# define IS_IA32	is_compat_task()
-#else
-# define IS_IA32	0
-#endif
-
 /*
  * We must return the syscall number to actually look up in the table.
  * This can be -1L to skip running any syscall at all.
@@ -1487,7 +1478,7 @@ long syscall_trace_enter(struct pt_regs *regs)
 	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
 		trace_sys_enter(regs, regs->orig_ax);
 
-	if (IS_IA32)
+	if (is_ia32_task())
 		audit_syscall_entry(AUDIT_ARCH_I386,
 				    regs->orig_ax,
 				    regs->bx, regs->cx,
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases
  2014-07-22  1:53   ` Andy Lutomirski
@ 2014-07-22  1:53   ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-22  1:53 UTC (permalink / raw)
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa, Andy Lutomirski

This splits syscall_trace_enter into syscall_trace_enter_phase1 and
syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
phase 2 is permitted to modify any of pt_regs except for orig_ax.

The intent is that phase 1 can be called from the syscall fast path.
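
For clarity, here is a sketch of the intended calling pattern (the
compatibility wrapper added by this patch implements it, and patch 8
later open-codes it in assembly); this is not literal kernel code:

	unsigned long phase1 = syscall_trace_enter_phase1(regs, arch);
	long nr;

	if (phase1 == 0)
		nr = regs->orig_ax;	/* nothing more to do; stay on the fast path */
	else
		nr = syscall_trace_enter_phase2(regs, arch, phase1);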

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/ptrace.h |   5 ++
 arch/x86/kernel/ptrace.c      | 145 ++++++++++++++++++++++++++++++++++++------
 2 files changed, 131 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 6205f0c..86fc2bb 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
 extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 			 int error_code, int si_code);
 
+
+extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
+extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
+				       unsigned long phase1_result);
+
 extern long syscall_trace_enter(struct pt_regs *);
 extern void syscall_trace_leave(struct pt_regs *);
 
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 39296d2..9ec6972 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1441,13 +1441,117 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 	force_sig_info(SIGTRAP, &info, tsk);
 }
 
+static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
+{
+#ifdef CONFIG_X86_64
+	if (arch == AUDIT_ARCH_X86_64) {
+		audit_syscall_entry(arch, regs->orig_ax, regs->di,
+				    regs->si, regs->dx, regs->r10);
+	} else
+#endif
+	{
+		audit_syscall_entry(arch, regs->orig_ax, regs->bx,
+				    regs->cx, regs->dx, regs->si);
+	}
+}
+
 /*
- * We must return the syscall number to actually look up in the table.
- * This can be -1L to skip running any syscall at all.
+ * We can return 0 to resume the syscall or anything else to go to phase
+ * 2.  If we resume the syscall, we need to put something appropriate in
+ * regs->orig_ax.
+ *
+ * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
+ * are fully functional.
+ *
+ * For phase 2's benefit, our return value is:
+ * 0: resume the syscall
+ * 1: go to phase 2; no seccomp phase 2 needed
+ * 2: go to phase 2; pass return value to seccomp
  */
-long syscall_trace_enter(struct pt_regs *regs)
+unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
+{
+	unsigned long ret = 0;
+	u32 work;
+
+	BUG_ON(regs != task_pt_regs(current));
+
+	work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
+
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Do seccomp first -- it should minimize exposure of other
+	 * code, and keeping seccomp fast is probably more valuable
+	 * than the rest of this.
+	 */
+	if (work & _TIF_SECCOMP) {
+		struct seccomp_data sd;
+
+		sd.arch = arch;
+		sd.nr = regs->orig_ax;
+		sd.instruction_pointer = regs->ip;
+#ifdef CONFIG_X86_64
+		if (arch == AUDIT_ARCH_X86_64) {
+			sd.args[0] = regs->di;
+			sd.args[1] = regs->si;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->r10;
+			sd.args[4] = regs->r8;
+			sd.args[5] = regs->r9;
+		} else
+#endif
+		{
+			sd.args[0] = regs->bx;
+			sd.args[1] = regs->cx;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->si;
+			sd.args[4] = regs->di;
+			sd.args[5] = regs->bp;
+		}
+
+		BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
+		BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
+
+		ret = seccomp_phase1(&sd);
+		if (ret == SECCOMP_PHASE1_SKIP) {
+			regs->orig_ax = -1;
+			ret = 0;
+		} else if (ret != SECCOMP_PHASE1_OK) {
+			return ret;  /* Go directly to phase 2 */
+		}
+
+		work &= ~_TIF_SECCOMP;
+	}
+#endif
+
+	/* Do our best to finish without phase 2. */
+	if (work == 0)
+		return ret;  /* seccomp only (ret == 0 here) */
+
+#ifdef CONFIG_AUDITSYSCALL
+	if (work == _TIF_SYSCALL_AUDIT) {
+		/*
+		 * If there is no more work to be done except auditing,
+		 * then audit in phase 1.  Phase 2 always audits, so, if
+		 * we audit here, then we can't go on to phase 2.
+		 */
+		do_audit_syscall_entry(regs, arch);
+		return 0;
+	}
+#endif
+
+	return 1;  /* Something is enabled that we can't handle in phase 1 */
+}
+
+/* Returns the syscall nr to run (which should match regs->orig_ax). */
+long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
+				unsigned long phase1_result)
 {
 	long ret = 0;
+	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
+
+	BUG_ON(regs != task_pt_regs(current));
 
 	user_exit();
 
@@ -1458,17 +1562,20 @@ long syscall_trace_enter(struct pt_regs *regs)
 	 * do_debug() and we need to set it again to restore the user
 	 * state.  If we entered on the slow path, TF was already set.
 	 */
-	if (test_thread_flag(TIF_SINGLESTEP))
+	if (work & _TIF_SINGLESTEP)
 		regs->flags |= X86_EFLAGS_TF;
 
-	/* do the secure computing check first */
-	if (secure_computing()) {
+	/*
+	 * Call seccomp_phase2 before running the other hooks so that
+	 * they can see any changes made by a seccomp tracer.
+	 */
+	if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
 		/* seccomp failures shouldn't expose any additional code. */
 		ret = -1L;
 		goto out;
 	}
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_EMU)))
+	if (unlikely(work & _TIF_SYSCALL_EMU))
 		ret = -1L;
 
 	if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
@@ -1478,23 +1585,23 @@ long syscall_trace_enter(struct pt_regs *regs)
 	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
 		trace_sys_enter(regs, regs->orig_ax);
 
-	if (is_ia32_task())
-		audit_syscall_entry(AUDIT_ARCH_I386,
-				    regs->orig_ax,
-				    regs->bx, regs->cx,
-				    regs->dx, regs->si);
-#ifdef CONFIG_X86_64
-	else
-		audit_syscall_entry(AUDIT_ARCH_X86_64,
-				    regs->orig_ax,
-				    regs->di, regs->si,
-				    regs->dx, regs->r10);
-#endif
+	do_audit_syscall_entry(regs, arch);
 
 out:
 	return ret ?: regs->orig_ax;
 }
 
+long syscall_trace_enter(struct pt_regs *regs)
+{
+	u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
+	unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
+
+	if (phase1_result == 0)
+		return regs->orig_ax;
+	else
+		return syscall_trace_enter_phase2(regs, arch, phase1_result);
+}
+
 void syscall_trace_leave(struct pt_regs *regs)
 {
 	bool step;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 7/8] x86_64,entry: Treat regs->ax the same in fastpath and slowpath syscalls
  2014-07-22  1:53   ` Andy Lutomirski
@ 2014-07-22  1:53   ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-22  1:53 UTC (permalink / raw)
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa, Andy Lutomirski

For slowpath syscalls, we initialize regs->ax to -ENOSYS and stick
the syscall number into regs->orig_ax prior to any possible tracing
and syscall execution.  This is user-visible ABI used by ptrace
syscall emulation and seccomp.

For fastpath syscalls, there's no good reason not to do the same
thing.  It's even slightly simpler than what we're currently doing.
It probably has no measurable performance impact.  It should have
no user-visible effect.

The purpose of this patch is to prepare for two-phase syscall
tracing, in which the first phase might modify the saved RAX without
leaving the fast path.  This change is just subtle enough that I'm
keeping it separate.
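
As a rough userspace illustration of that ABI (a sketch, not part of the
patch; it assumes x86_64 and glibc, and omits error handling), a minimal
ptrace-based tracer can observe orig_rax and rax at a syscall-entry stop:

#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t child = fork();

	if (child == 0) {
		ptrace(PTRACE_TRACEME, 0, NULL, NULL);
		raise(SIGSTOP);			/* hand control to the tracer */
		syscall(SYS_getpid);		/* the syscall we want to observe */
		_exit(0);
	}

	waitpid(child, NULL, 0);		/* initial SIGSTOP */
	ptrace(PTRACE_SYSCALL, child, NULL, NULL);
	waitpid(child, NULL, 0);		/* should be the getpid() entry stop */

	struct user_regs_struct regs;
	ptrace(PTRACE_GETREGS, child, NULL, &regs);

	/* Expected at syscall entry: orig_rax == __NR_getpid, rax == -ENOSYS (-38). */
	printf("orig_rax=%llu rax=%lld\n",
	       (unsigned long long)regs.orig_rax, (long long)regs.rax);

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}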

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/calling.h |  6 +++++-
 arch/x86/kernel/entry_64.S     | 13 ++++---------
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index cb4c73b..76659b6 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -85,7 +85,7 @@ For 32-bit we have the following conventions - kernel is built with
 #define ARGOFFSET	R11
 #define SWFRAME		ORIG_RAX
 
-	.macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1
+	.macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1, rax_enosys=0
 	subq  $9*8+\addskip, %rsp
 	CFI_ADJUST_CFA_OFFSET	9*8+\addskip
 	movq_cfi rdi, 8*8
@@ -96,7 +96,11 @@ For 32-bit we have the following conventions - kernel is built with
 	movq_cfi rcx, 5*8
 	.endif
 
+	.if \rax_enosys
+	movq $-ENOSYS, 4*8(%rsp)
+	.else
 	movq_cfi rax, 4*8
+	.endif
 
 	.if \save_r891011
 	movq_cfi r8,  3*8
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index b25ca96..1eb3094 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -405,8 +405,8 @@ GLOBAL(system_call_after_swapgs)
 	 * and short:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	SAVE_ARGS 8,0
-	movq  %rax,ORIG_RAX-ARGOFFSET(%rsp)
+	SAVE_ARGS 8, 0, rax_enosys=1
+	movq_cfi rax,(ORIG_RAX-ARGOFFSET)
 	movq  %rcx,RIP-ARGOFFSET(%rsp)
 	CFI_REL_OFFSET rip,RIP-ARGOFFSET
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
@@ -418,7 +418,7 @@ system_call_fastpath:
 	andl $__SYSCALL_MASK,%eax
 	cmpl $__NR_syscall_max,%eax
 #endif
-	ja badsys
+	ja ret_from_sys_call  /* and return regs->ax */
 	movq %r10,%rcx
 	call *sys_call_table(,%rax,8)  # XXX:	 rip relative
 	movq %rax,RAX-ARGOFFSET(%rsp)
@@ -477,10 +477,6 @@ sysret_signal:
 	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 	jmp int_check_syscall_exit_work
 
-badsys:
-	movq $-ENOSYS,RAX-ARGOFFSET(%rsp)
-	jmp ret_from_sys_call
-
 #ifdef CONFIG_AUDITSYSCALL
 	/*
 	 * Fast path for syscall audit without full syscall trace.
@@ -520,7 +516,6 @@ tracesys:
 	jz auditsys
 #endif
 	SAVE_REST
-	movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
 	FIXUP_TOP_OF_STACK %rdi
 	movq %rsp,%rdi
 	call syscall_trace_enter
@@ -537,7 +532,7 @@ tracesys:
 	andl $__SYSCALL_MASK,%eax
 	cmpl $__NR_syscall_max,%eax
 #endif
-	ja   int_ret_from_sys_call	/* RAX(%rsp) set to -ENOSYS above */
+	ja   int_ret_from_sys_call	/* RAX(%rsp) is already set */
 	movq %r10,%rcx	/* fixup for C */
 	call *sys_call_table(,%rax,8)
 	movq %rax,RAX-ARGOFFSET(%rsp)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 8/8] x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls
  2014-07-22  1:53   ` Andy Lutomirski
@ 2014-07-22  1:53   ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-22  1:53 UTC (permalink / raw)
  To: linux-kernel, Kees Cook, Will Drewry
  Cc: Oleg Nesterov, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa, Andy Lutomirski

On KVM on my box, this reduces the overhead from an always-accept
seccomp filter from ~130ns to ~17ns.  Most of that comes from
avoiding IRET on every syscall when seccomp is enabled.

In extremely approximate hacked-up benchmarking, just bypassing IRET
saves about 80ns, so there's another 43ns of savings here from
simplifying the seccomp path.

The diffstat is also rather nice :)
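
For anyone who wants to reproduce this kind of measurement, a crude
userspace sketch along the following lines should be enough.  It is not
the benchmark used above; it just installs a one-instruction always-accept
filter and times a cheap syscall before and after:

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static double ns_per_getpid(int iters)
{
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (int i = 0; i < iters; i++)
		syscall(SYS_getpid);
	clock_gettime(CLOCK_MONOTONIC, &b);

	return ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / iters;
}

int main(void)
{
	struct sock_filter allow = BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
	struct sock_fprog prog = { .len = 1, .filter = &allow };
	const int iters = 1000000;

	printf("no filter:     %.1f ns/syscall\n", ns_per_getpid(iters));

	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
		perror("PR_SET_SECCOMP");
		return 1;
	}

	printf("always-accept: %.1f ns/syscall\n", ns_per_getpid(iters));
	return 0;
}

The difference between the two printed numbers approximates the per-syscall
cost of the filter.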

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 38 +++++++++++++++-----------------------
 1 file changed, 15 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1eb3094..13e0c1d 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -479,22 +479,6 @@ sysret_signal:
 
 #ifdef CONFIG_AUDITSYSCALL
 	/*
-	 * Fast path for syscall audit without full syscall trace.
-	 * We just call __audit_syscall_entry() directly, and then
-	 * jump back to the normal fast path.
-	 */
-auditsys:
-	movq %r10,%r9			/* 6th arg: 4th syscall arg */
-	movq %rdx,%r8			/* 5th arg: 3rd syscall arg */
-	movq %rsi,%rcx			/* 4th arg: 2nd syscall arg */
-	movq %rdi,%rdx			/* 3rd arg: 1st syscall arg */
-	movq %rax,%rsi			/* 2nd arg: syscall number */
-	movl $AUDIT_ARCH_X86_64,%edi	/* 1st arg: audit arch */
-	call __audit_syscall_entry
-	LOAD_ARGS 0		/* reload call-clobbered registers */
-	jmp system_call_fastpath
-
-	/*
 	 * Return fast path for syscall audit.  Call __audit_syscall_exit()
 	 * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
 	 * masked off.
@@ -511,17 +495,25 @@ sysret_audit:
 
 	/* Do syscall tracing */
 tracesys:
-#ifdef CONFIG_AUDITSYSCALL
-	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-	jz auditsys
-#endif
+	leaq -REST_SKIP(%rsp), %rdi
+	movq $AUDIT_ARCH_X86_64, %rsi
+	call syscall_trace_enter_phase1
+	test %rax, %rax
+	jnz tracesys_phase2		/* if needed, run the slow path */
+	LOAD_ARGS 0			/* else restore clobbered regs */
+	jmp system_call_fastpath	/*      and return to the fast path */
+
+tracesys_phase2:
 	SAVE_REST
 	FIXUP_TOP_OF_STACK %rdi
-	movq %rsp,%rdi
-	call syscall_trace_enter
+	movq %rsp, %rdi
+	movq $AUDIT_ARCH_X86_64, %rsi
+	movq %rax,%rdx
+	call syscall_trace_enter_phase2
+
 	/*
 	 * Reload arg registers from stack in case ptrace changed them.
-	 * We don't reload %rax because syscall_trace_enter() returned
+	 * We don't reload %rax because syscall_trace_enter_phase2() returned
 	 * the value it wants us to use in the table lookup.
 	 */
 	LOAD_ARGS ARGOFFSET, 1
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes
  2014-07-22  1:49 ` Andy Lutomirski
  (?)
@ 2014-07-22 19:37   ` Kees Cook
  -1 siblings, 0 replies; 100+ messages in thread
From: Kees Cook @ 2014-07-22 19:37 UTC (permalink / raw)
  To: Andy Lutomirski, H. Peter Anvin
  Cc: LKML, Will Drewry, Oleg Nesterov, x86, linux-arm-kernel,
	Linux MIPS Mailing List, linux-arch, linux-security-module,
	Alexei Starovoitov

On Mon, Jul 21, 2014 at 6:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> [applies on jmorris's security-next tree]
>
> This is both a cleanup and a speedup.  It reduces overhead due to
> installing a trivial seccomp filter by 87%.  The speedup comes from
> avoiding the full syscall tracing mechanism for filters that don't
> return SECCOMP_RET_TRACE.
>
> This series works by splitting the seccomp hooks into two phases.
> The first phase evaluates the filter; it can skip syscalls, allow
> them, kill the calling task, or pass a u32 to the second phase.  The
> second phase requires a full tracing context, and it sends ptrace
> events if necessary.
>
> Once this is done, I implemented a similar split for the x86 syscall
> entry work.  The C callback is invoked in two phases: the first has
> only a partial frame, and it can request phase 2 processing with a
> full frame.
>
> Finally, I switch the 64-bit system_call code to use the new split
> entry work.  This is a net deletion of assembly code: it replaces
> all of the audit entry muck.
>
> In the process, I fixed some bugs.
>
> If this is acceptable, someone can do the same tweak for the
> ia32entry and entry_32 code.
>
> This passes all seccomp tests that I know of.  Now that it's properly
> rebased, even the previously expected failures are gone.
>
> Kees, if you like this version, can you create a branch with patches
> 1-4?  I think that the rest should go into tip/x86 once everyone's happy
> with it.
>
> Changes from v2:
>  - Fixed 32-bit x86 build (and the tests pass).
>  - Put the doc patch where it belongs.

Thanks! This looks good to me. I'll add it to my tree.

Peter, how do you feel about this series? Do the x86 changes look good to you?

-Kees

>
> Changes from v1:
>  - Rebased on top of Kees' shiny new seccomp tree (no effect on the x86
>    part).
>  - Improved patch 6 vs patch 7 split (thanks Alexei!)
>  - Fixed bogus -ENOSYS in patch 5 (thanks Kees!)
>  - Improved changelog message in patch 6.
>
> Changes from RFC version:
>  - The first three patches are more or less the same
>  - The rest is more or less a rewrite
>
> Andy Lutomirski (8):
>   seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing
>   seccomp: Refactor the filter callback and the API
>   seccomp: Allow arch code to provide seccomp_data
>   seccomp: Document two-phase seccomp and arch-provided seccomp_data
>   x86,x32,audit: Fix x32's AUDIT_ARCH wrt audit
>   x86: Split syscall_trace_enter into two phases
>   x86_64,entry: Treat regs->ax the same in fastpath and slowpath
>     syscalls
>   x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls
>
>  arch/Kconfig                   |  11 ++
>  arch/arm/kernel/ptrace.c       |   7 +-
>  arch/mips/kernel/ptrace.c      |   2 +-
>  arch/s390/kernel/ptrace.c      |   2 +-
>  arch/x86/include/asm/calling.h |   6 +-
>  arch/x86/include/asm/ptrace.h  |   5 +
>  arch/x86/kernel/entry_64.S     |  51 ++++-----
>  arch/x86/kernel/ptrace.c       | 150 +++++++++++++++++++-----
>  arch/x86/kernel/vsyscall_64.c  |   2 +-
>  include/linux/seccomp.h        |  25 ++--
>  kernel/seccomp.c               | 252 ++++++++++++++++++++++++++++-------------
>  11 files changed, 360 insertions(+), 153 deletions(-)
>
> --
> 1.9.3
>



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes
  2014-07-22 19:37   ` Kees Cook
  (?)
@ 2014-07-23 19:20     ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-23 19:20 UTC (permalink / raw)
  To: Kees Cook
  Cc: H. Peter Anvin, LKML, Will Drewry, Oleg Nesterov, x86,
	linux-arm-kernel, Linux MIPS Mailing List, linux-arch,
	linux-security-module, Alexei Starovoitov

On Tue, Jul 22, 2014 at 12:37 PM, Kees Cook <keescook@chromium.org> wrote:
> On Mon, Jul 21, 2014 at 6:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> [applies on jmorris's security-next tree]
>>
>> This is both a cleanup and a speedup.  It reduces overhead due to
>> installing a trivial seccomp filter by 87%.  The speedup comes from
>> avoiding the full syscall tracing mechanism for filters that don't
>> return SECCOMP_RET_TRACE.
>>
>> This series works by splitting the seccomp hooks into two phases.
>> The first phase evaluates the filter; it can skip syscalls, allow
>> them, kill the calling task, or pass a u32 to the second phase.  The
>> second phase requires a full tracing context, and it sends ptrace
>> events if necessary.
>>
>> Once this is done, I implemented a similar split for the x86 syscall
>> entry work.  The C callback is invoked in two phases: the first has
>> only a partial frame, and it can request phase 2 processing with a
>> full frame.
>>
>> Finally, I switch the 64-bit system_call code to use the new split
>> entry work.  This is a net deletion of assembly code: it replaces
>> all of the audit entry muck.
>>
>> In the process, I fixed some bugs.
>>
>> If this is acceptable, someone can do the same tweak for the
>> ia32entry and entry_32 code.
>>
>> This passes all seccomp tests that I know of.  Now that it's properly
>> rebased, even the previously expected failures are gone.
>>
>> Kees, if you like this version, can you create a branch with patches
>> 1-4?  I think that the rest should go into tip/x86 once everyone's happy
>> with it.
>>
>> Changes from v2:
>>  - Fixed 32-bit x86 build (and the tests pass).
>>  - Put the doc patch where it belongs.
>
> Thanks! This looks good to me. I'll add it to my tree.
>
> Peter, how do you feel about this series? Do the x86 changes look good to you?
>

It looks like patches 1-4 have landed here:

https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath

hpa, what's the route forward for the x86 part?

--Andy

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases
  2014-07-22  1:53   ` Andy Lutomirski
@ 2014-07-28 17:37     ` Oleg Nesterov
  -1 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2014-07-28 17:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Kees Cook, Will Drewry, x86, linux-arm-kernel,
	linux-mips, linux-arch, linux-security-module,
	Alexei Starovoitov, hpa

Hi Andy,

I am really sorry for the delay.

This is on top of the recent change from Kees, right? Could you remind me
where I can find the tree this series is based on? So that I can actually
apply these changes...

On 07/21, Andy Lutomirski wrote:
>
> +long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
> +				unsigned long phase1_result)
>  {
>  	long ret = 0;
> +	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
> +		_TIF_WORK_SYSCALL_ENTRY;
> +
> +	BUG_ON(regs != task_pt_regs(current));
>  
>  	user_exit();
>  
> @@ -1458,17 +1562,20 @@ long syscall_trace_enter(struct pt_regs *regs)
>  	 * do_debug() and we need to set it again to restore the user
>  	 * state.  If we entered on the slow path, TF was already set.
>  	 */
> -	if (test_thread_flag(TIF_SINGLESTEP))
> +	if (work & _TIF_SINGLESTEP)
>  		regs->flags |= X86_EFLAGS_TF;

This looks suspicious, but perhaps I misread this change.

If I understand correctly, syscall_trace_enter() can avoid _phase2() above.
But we should always call user_exit() unconditionally?

And we should always set X86_EFLAGS_TF if TIF_SINGLESTEP? IIRC, TF can
actually be cleared on a 32-bit kernel if we step over the sysenter insn?

Oleg.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes
  2014-07-23 19:20     ` Andy Lutomirski
  (?)
@ 2014-07-28 17:59       ` H. Peter Anvin
  -1 siblings, 0 replies; 100+ messages in thread
From: H. Peter Anvin @ 2014-07-28 17:59 UTC (permalink / raw)
  To: Andy Lutomirski, Kees Cook
  Cc: LKML, Will Drewry, Oleg Nesterov, x86, linux-arm-kernel,
	Linux MIPS Mailing List, linux-arch, linux-security-module,
	Alexei Starovoitov

On 07/23/2014 12:20 PM, Andy Lutomirski wrote:
> 
> It looks like patches 1-4 have landed here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath
> 
> hpa, what's the route forward for the x86 part?
> 

I guess I should discuss this with Kees to figure out what makes most
sense.  In the meantime, could you address Oleg's question?

	-hpa



^ permalink raw reply	[flat|nested] 100+ messages in thread

* TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-28 17:37     ` Oleg Nesterov
@ 2014-07-28 18:58       ` Oleg Nesterov
  -1 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2014-07-28 18:58 UTC (permalink / raw)
  To: Andy Lutomirski, Frederic Weisbecker, Paul E. McKenney
  Cc: linux-kernel, Kees Cook, Will Drewry, x86, linux-arm-kernel,
	linux-mips, linux-arch, linux-security-module,
	Alexei Starovoitov, hpa

Off-topic, but...

On 07/28, Oleg Nesterov wrote:
>
> But we should always call user_exit() unconditionally?

Frederic, don't we need the patch below? In fact clear_() can be moved
under "if ()" too, and probably copy_process() should clear this flag...

Or. __context_tracking_task_switch() can simply do

	 if (context_tracking_cpu_is_enabled())
	 	set_tsk_thread_flag(next, TIF_NOHZ);
	 else
	 	clear_tsk_thread_flag(next, TIF_NOHZ);

and then we can forget about copy_process(). Or am I totally confused?


I am also wondering if we can extend user_return_notifier to handle
enter/exit and kill TIF_NOHZ.

Oleg.

--- x/kernel/context_tracking.c
+++ x/kernel/context_tracking.c
@@ -202,7 +202,8 @@ void __context_tracking_task_switch(stru
 				    struct task_struct *next)
 {
 	clear_tsk_thread_flag(prev, TIF_NOHZ);
-	set_tsk_thread_flag(next, TIF_NOHZ);
+	if (context_tracking_cpu_is_enabled())
+		set_tsk_thread_flag(next, TIF_NOHZ);
 }
 
 #ifdef CONFIG_CONTEXT_TRACKING_FORCE


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-28 18:58       ` Oleg Nesterov
@ 2014-07-28 19:22         ` Frederic Weisbecker
  -1 siblings, 0 replies; 100+ messages in thread
From: Frederic Weisbecker @ 2014-07-28 19:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andy Lutomirski, Paul E. McKenney, linux-kernel, Kees Cook,
	Will Drewry, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa

On Mon, Jul 28, 2014 at 08:58:03PM +0200, Oleg Nesterov wrote:
> Off-topic, but...
> 
> On 07/28, Oleg Nesterov wrote:
> >
> > But we should always call user_exit() unconditionally?
> 
> Frederic, don't we need the patch below? In fact clear_() can be moved
> under "if ()" too. and probably copy_process() should clear this flag...
> 
> Or. __context_tracking_task_switch() can simply do
> 
> 	 if (context_tracking_cpu_is_enabled())
> 	 	set_tsk_thread_flag(next, TIF_NOHZ);
> 	 else
> 	 	clear_tsk_thread_flag(next, TIF_NOHZ);
> 
> and then we can forget about copy_process(). Or I am totally confused?
> 
> 
> I am also wondering if we can extend user_return_notifier to handle
> enter/exit and kill TIF_NOHZ.
> 
> Oleg.
> 
> --- x/kernel/context_tracking.c
> +++ x/kernel/context_tracking.c
> @@ -202,7 +202,8 @@ void __context_tracking_task_switch(stru
>  				    struct task_struct *next)
>  {
>  	clear_tsk_thread_flag(prev, TIF_NOHZ);
> -	set_tsk_thread_flag(next, TIF_NOHZ);
> +	if (context_tracking_cpu_is_enabled())
> +		set_tsk_thread_flag(next, TIF_NOHZ);
>  }
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_FORCE

Unfortunately, as long as tasks can migrate in and out of a context tracked CPU, we
need to track all CPUs.

This is because there is always a small shift between hard and soft kernelspace
boundaries.

Hard boundaries are the real strict boundaries: the "int", "iret" or faulting
instructions themselves, for example.

Soft boundaries are the places where we put our context tracking probes. They
are just function calls, and some distance between them and the hard
boundaries is inevitable.

So here is a scenario where this is a problem: a task runs on CPU 0, passes the context
tracking call before returning from a syscall to userspace, and gets an interrupt. The
interrupt preempts the task and it moves to CPU 1. So it returns from preempt_schedule_irq()
after which it is going to resume to userspace.

In this scenario, if context tracking is only enabled on CPU 1, we have no way to know that
the task is resuming to userspace, because we passed through the context tracking probe
already and it was ignored on CPU 0.
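
To picture it, the ordering is roughly this (a simplified sketch, not the
real exit code; the last call is just a placeholder for the sysret/iret
path):

	/* syscall exit, as seen from context tracking */
	user_enter();			/* soft boundary: the probe runs here, on CPU 0 */

	/*
	 * An irq can fire in this window.  preempt_schedule_irq() may move
	 * the task to CPU 1, and the resume to userspace then happens there,
	 * after the probe was already taken (and ignored) on CPU 0.
	 */

	return_to_user();		/* hard boundary: the actual sysret/iret */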

This might be hackable by ensuring that irqs are disabled between the context
tracking calls and the actual returns to userspace. It's a nightmare to audit
on all archs though, it makes the context tracking callers less flexible, and
it only solves the issue for irqs. Exceptions have a similar problem and we
can't mask them.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases
  2014-07-28 17:37     ` Oleg Nesterov
  (?)
@ 2014-07-28 20:23       ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-28 20:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Kees Cook, Will Drewry, X86 ML, linux-arm-kernel,
	Linux MIPS Mailing List, linux-arch, LSM List,
	Alexei Starovoitov, H. Peter Anvin

On Mon, Jul 28, 2014 at 10:37 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> Hi Andy,
>
> I am really sorry for the delay.
>
> This is on top of the recent change from Kees, right? Could you remind me
> where I can find the tree this series is based on? So that I could actually
> apply these changes...

https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath

The first four patches are already applied there.

>
> On 07/21, Andy Lutomirski wrote:
>>
>> +long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
>> +                             unsigned long phase1_result)
>>  {
>>       long ret = 0;
>> +     u32 work = ACCESS_ONCE(current_thread_info()->flags) &
>> +             _TIF_WORK_SYSCALL_ENTRY;
>> +
>> +     BUG_ON(regs != task_pt_regs(current));
>>
>>       user_exit();
>>
>> @@ -1458,17 +1562,20 @@ long syscall_trace_enter(struct pt_regs *regs)
>>        * do_debug() and we need to set it again to restore the user
>>        * state.  If we entered on the slow path, TF was already set.
>>        */
>> -     if (test_thread_flag(TIF_SINGLESTEP))
>> +     if (work & _TIF_SINGLESTEP)
>>               regs->flags |= X86_EFLAGS_TF;
>
> This looks suspicious, but perhaps I misread this change.
>
> If I understand correctly, syscall_trace_enter() can avoid _phase2() above.
> But we should always call user_exit() unconditionally?

Damnit.  I read that every function called by user_exit, and none of
them give any indication of why they're needed for traced syscalls but
not for untraced syscalls.  On a second look, it seems that TIF_NOHZ
controls it.  I'll update the code to call user_exit iff TIF_NOHZ is
set.  If that's still wrong, then I don't see how the current code is
correct either.
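
Roughly this, at the top of syscall_trace_enter_phase1() (untested sketch of
what I have in mind for v4):

	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
		   _TIF_WORK_SYSCALL_ENTRY;

	if (work & _TIF_NOHZ) {
		/* context tracking wants its probe; don't let this bit
		 * alone force us into phase 2 */
		user_exit();
		work &= ~_TIF_NOHZ;
	}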

>
> And we should always set X86_EFLAGS_TF if TIF_SINGLESTEP? IIRC, TF can be
> actually cleared on a 32bit kernel if we step over sysenter insn?

I don't follow.  If TIF_SINGLESTEP, then phase1 will return a nonzero
value, and phase2 will set TF.
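
To spell out the split, the slow-path wrapper ends up looking roughly like
this (simplified; the exact return-value plumbing may differ):

	unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);

	if (phase1_result == 0)
		return regs->orig_ax;	/* fast path: nothing else to look at */

	/* slow path: seccomp asked for ptrace, or tracing/audit/singlestep work */
	return syscall_trace_enter_phase2(regs, arch, phase1_result);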

I admit I don't really understand all the TF machinations.

--Andy

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes
  2014-07-28 17:59       ` H. Peter Anvin
  (?)
@ 2014-07-28 23:29         ` Kees Cook
  -1 siblings, 0 replies; 100+ messages in thread
From: Kees Cook @ 2014-07-28 23:29 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, linux-arch, Linux MIPS Mailing List,
	Will Drewry, x86, LKML, Oleg Nesterov, linux-security-module,
	linux-arm-kernel, Alexei Starovoitov

On Mon, Jul 28, 2014 at 10:59 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 07/23/2014 12:20 PM, Andy Lutomirski wrote:
>>
>> It looks like patches 1-4 have landed here:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath
>>
>> hpa, what's the route forward for the x86 part?
>>
>
> I guess I should discuss this with Kees to figure out what makes most
> sense.  In the meantime, could you address Oleg's question?

Since the x86 parts depend on the seccomp parts, I'm happy if you
carry them instead of having them land from my tree. Otherwise I'm
open to how to coordinate timing.

-Kees

>
>         -hpa
>

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes
  2014-07-28 23:29         ` Kees Cook
  (?)
@ 2014-07-28 23:34           ` H. Peter Anvin
  -1 siblings, 0 replies; 100+ messages in thread
From: H. Peter Anvin @ 2014-07-28 23:34 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, linux-arch, Linux MIPS Mailing List,
	Will Drewry, x86, LKML, Oleg Nesterov, linux-security-module,
	linux-arm-kernel, Alexei Starovoitov

On 07/28/2014 04:29 PM, Kees Cook wrote:
> On Mon, Jul 28, 2014 at 10:59 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 07/23/2014 12:20 PM, Andy Lutomirski wrote:
>>>
>>> It looks like patches 1-4 have landed here:
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath
>>>
>>> hpa, what's the route forward for the x86 part?
>>>
>>
>> I guess I should discuss this with Kees to figure out what makes most
>> sense.  In the meantime, could you address Oleg's question?
> 
> Since the x86 parts depend on the seccomp parts, I'm happy if you
> carry them instead of having them land from my tree. Otherwise I'm
> open to how to coordinate timing.
> 

You mean for me to carry the seccomp part as well?

	-hpa



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes
  2014-07-28 23:34           ` H. Peter Anvin
  (?)
@ 2014-07-28 23:42             ` Kees Cook
  -1 siblings, 0 replies; 100+ messages in thread
From: Kees Cook @ 2014-07-28 23:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-arch, Linux MIPS Mailing List, Will Drewry, x86, LKML,
	Andy Lutomirski, linux-security-module, Oleg Nesterov,
	linux-arm-kernel, Alexei Starovoitov

On Mon, Jul 28, 2014 at 4:34 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 07/28/2014 04:29 PM, Kees Cook wrote:
>> On Mon, Jul 28, 2014 at 10:59 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> On 07/23/2014 12:20 PM, Andy Lutomirski wrote:
>>>>
>>>> It looks like patches 1-4 have landed here:
>>>>
>>>> https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath
>>>>
>>>> hpa, what's the route forward for the x86 part?
>>>>
>>>
>>> I guess I should discuss this with Kees to figure out what makes most
>>> sense.  In the meantime, could you address Oleg's question?
>>
>> Since the x86 parts depend on the seccomp parts, I'm happy if you
>> carry them instead of having them land from my tree. Otherwise I'm
>> open to how to coordinate timing.
>>
>
> You mean for me to carry the seccomp part as well?

If that makes sense as far as the coordination, that's fine with me.
Otherwise I'm not sure how x86 can build without having the seccomp
changes in your tree.

-Kees


-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes
  2014-07-28 23:42             ` Kees Cook
  (?)
@ 2014-07-28 23:45               ` H. Peter Anvin
  -1 siblings, 0 replies; 100+ messages in thread
From: H. Peter Anvin @ 2014-07-28 23:45 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-arch, Linux MIPS Mailing List, Will Drewry, x86, LKML,
	Andy Lutomirski, linux-security-module, Oleg Nesterov,
	linux-arm-kernel, Alexei Starovoitov

On 07/28/2014 04:42 PM, Kees Cook wrote:
> On Mon, Jul 28, 2014 at 4:34 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 07/28/2014 04:29 PM, Kees Cook wrote:
>>> On Mon, Jul 28, 2014 at 10:59 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>>> On 07/23/2014 12:20 PM, Andy Lutomirski wrote:
>>>>>
>>>>> It looks like patches 1-4 have landed here:
>>>>>
>>>>> https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath
>>>>>
>>>>> hpa, what's the route forward for the x86 part?
>>>>>
>>>>
>>>> I guess I should discuss this with Kees to figure out what makes most
>>>> sense.  In the meantime, could you address Oleg's question?
>>>
>>> Since the x86 parts depend on the seccomp parts, I'm happy if you
>>> carry them instead of having them land from my tree. Otherwise I'm
>>> open to how to coordinate timing.
>>>
>>
>> You mean for me to carry the seccomp part as well?
> 
> If that makes sense as far as the coordination, that's fine with me.
> Otherwise I'm not sure how x86 can build without having the seccomp
> changes in your tree.
> 

Exactly.  What I guess I'll do is set up a separate tip branch for this,
pull your branch into it, and then put the x86 patches on top.  Does
that make sense for everyone?

	-hpa



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes
  2014-07-28 23:45               ` H. Peter Anvin
  (?)
@ 2014-07-28 23:54                 ` Kees Cook
  -1 siblings, 0 replies; 100+ messages in thread
From: Kees Cook @ 2014-07-28 23:54 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-arch, Linux MIPS Mailing List, Will Drewry, x86, LKML,
	Andy Lutomirski, linux-security-module, Oleg Nesterov,
	linux-arm-kernel, Alexei Starovoitov

On Mon, Jul 28, 2014 at 4:45 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 07/28/2014 04:42 PM, Kees Cook wrote:
>> On Mon, Jul 28, 2014 at 4:34 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> On 07/28/2014 04:29 PM, Kees Cook wrote:
>>>> On Mon, Jul 28, 2014 at 10:59 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>>>> On 07/23/2014 12:20 PM, Andy Lutomirski wrote:
>>>>>>
>>>>>> It looks like patches 1-4 have landed here:
>>>>>>
>>>>>> https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath
>>>>>>
>>>>>> hpa, what's the route forward for the x86 part?
>>>>>>
>>>>>
>>>>> I guess I should discuss this with Kees to figure out what makes most
>>>>> sense.  In the meantime, could you address Oleg's question?
>>>>
>>>> Since the x86 parts depend on the seccomp parts, I'm happy if you
>>>> carry them instead of having them land from my tree. Otherwise I'm
>>>> open to how to coordinate timing.
>>>>
>>>
>>> You mean for me to carry the seccomp part as well?
>>
>> If that makes sense as far as the coordination, that's fine with me.
>> Otherwise I'm not sure how x86 can build without having the seccomp
>> changes in your tree.
>>
>
> Exactly.  What I guess I'll do is set up a separate tip branch for this,
> pull your branch into it, and then put the x86 patches on top.  Does
> that make sense for everyone?

Sounds good to me. Once Oleg and Andy are happy, we'll be set.

-Kees


-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases
  2014-07-28 20:23       ` Andy Lutomirski
  (?)
@ 2014-07-29 16:54         ` Oleg Nesterov
  -1 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2014-07-29 16:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Kees Cook, Will Drewry, X86 ML, linux-arm-kernel,
	Linux MIPS Mailing List, linux-arch, LSM List,
	Alexei Starovoitov, H. Peter Anvin

On 07/28, Andy Lutomirski wrote:
>
> On Mon, Jul 28, 2014 at 10:37 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > Hi Andy,
> >
> > I am really sorry for the delay.
> >
> > This is on top of the recent change from Kees, right? Could you remind me
> > where I can find the tree this series is based on? So that I could actually
> > apply these changes...
>
> https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath
>
> The first four patches are already applied there.

Thanks!

> > If I understand correctly, syscall_trace_enter() can avoid _phase2() above.
> > But we should always call user_exit() unconditionally?
>
> Damnit.  I read that every function called by user_exit, and none of
> them give any indication of why they're needed for traced syscalls but
> not for untraced syscalls.  On a second look, it seems that TIF_NOHZ
> controls it.

Yes, just to trigger the slow path, I guess.

> I'll update the code to call user_exit iff TIF_NOHZ is
> set.

Or perhaps it would be better to not add another user of this (strange) flag
and just call user_exit() unconditionally. But, yes, you would then need to use
"work = flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ)".

> > And we should always set X86_EFLAGS_TF if TIF_SINGLESTEP? IIRC, TF can be
> > actually cleared on a 32bit kernel if we step over sysenter insn?
>
> I don't follow.  If TIF_SINGLESTEP, then phase1 will return a nonzero
> value,

Ah yes, thanks, I missed this.

Oleg.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases
  2014-07-29 16:54         ` Oleg Nesterov
  (?)
@ 2014-07-29 17:01           ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-29 17:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Kees Cook, Will Drewry, X86 ML, linux-arm-kernel,
	Linux MIPS Mailing List, linux-arch, LSM List,
	Alexei Starovoitov, H. Peter Anvin

On Tue, Jul 29, 2014 at 9:54 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 07/28, Andy Lutomirski wrote:
>>
>> On Mon, Jul 28, 2014 at 10:37 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> > Hi Andy,
>> >
>> > I am really sorry for the delay.
>> >
>> > This is on top of the recent change from Kees, right? Could you remind me
>> > where I can find the tree this series is based on? So that I could actually
>> > apply these changes...
>>
>> https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath
>>
>> The first four patches are already applied there.
>
> Thanks!
>
>> > If I understand correctly, syscall_trace_enter() can avoid _phase2() above.
>> > But we should always call user_exit() unconditionally?
>>
>> Damnit.  I read that every function called by user_exit, and none of
>> them give any indication of why they're needed for traced syscalls but
>> not for untraced syscalls.  On a second look, it seems that TIF_NOHZ
>> controls it.
>
> Yes, just to trigger the slow path, I guess.
>
>> I'll update the code to call user_exit iff TIF_NOHZ is
>> set.
>
> Or perhaps it would be better to not add another user of this (strange) flag
> and just call user_exit() unconditionally. But, yes, you would then need to use
> "work = flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ)".

user_exit looks slow enough to me that a branch to try to avoid it may
be worthwhile.  I bet that explicitly checking the flag is
actually both faster and clearer.  That's what I did for v4.

--Andy

>
>> > And we should always set X86_EFLAGS_TF if TIF_SINGLESTEP? IIRC, TF can be
>> > actually cleared on a 32bit kernel if we step over sysenter insn?
>>
>> I don't follow.  If TIF_SINGLESTEP, then phase1 will return a nonzero
>> value,
>
> Ah yes, thanks, I missed this.
>
> Oleg.
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases
  2014-07-29 17:01           ` Andy Lutomirski
  (?)
@ 2014-07-29 17:31             ` Oleg Nesterov
  -1 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2014-07-29 17:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Kees Cook, Will Drewry, X86 ML, linux-arm-kernel,
	Linux MIPS Mailing List, linux-arch, LSM List,
	Alexei Starovoitov, H. Peter Anvin

On 07/29, Andy Lutomirski wrote:
>
> On Tue, Jul 29, 2014 at 9:54 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Yes, just to trigger the slow path, I guess.
> >
> >> I'll update the code to call user_exit iff TIF_NOHZ is
> >> set.
> >
> > Or perhaps it would be better to not add another user of this (strange) flag
> > and just call user_exit() unconditionally. But, yes, you would then need to use
> > "work = flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ)".
>
> user_exit looks slow enough to me that a branch to try to avoid it may
> be worthwhile.  I bet that explicitly checking the flag is
> actually both faster and clearer.

I don't think so (unless I am confused again); note that user_exit() uses a
jump label. But this doesn't matter. I meant that we should avoid TIF_NOHZ
if possible because I think it should die somehow (currently I do not know
how ;). And because it is ugly to check the same condition twice:

	if (work & TIF_NOHZ) {
		// user_exit()
		if (context_tracking_is_enabled())
			context_tracking_user_exit();
	}
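
For the record, user_exit() itself is just the static-key check; simplified
from include/linux/context_tracking.h, so modulo details:

	static inline void user_exit(void)
	{
		if (context_tracking_is_enabled())	/* static_key_false(), a nop when off */
			context_tracking_user_exit();
	}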

TIF_NOHZ is set if and only if context_tracking_is_enabled() is true.
So I think that

	work = current_thread_info()->flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ);

	user_exit();

looks a bit better. But I won't argue.

> That's what I did for v4.

I am going to read it today. Not that I think I can help or find something
wrong.

Oleg.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-28 19:22         ` Frederic Weisbecker
@ 2014-07-29 17:54           ` Oleg Nesterov
  -1 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2014-07-29 17:54 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Paul E. McKenney, linux-kernel, Kees Cook,
	Will Drewry, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa

On 07/28, Frederic Weisbecker wrote:
>
> On Mon, Jul 28, 2014 at 08:58:03PM +0200, Oleg Nesterov wrote:
> >
> > Frederic, don't we need the patch below? In fact clear_() can be moved
> > under "if ()" too. and probably copy_process() should clear this flag...
> >
> > Or. __context_tracking_task_switch() can simply do
> >
> > 	 if (context_tracking_cpu_is_enabled())
> > 	 	set_tsk_thread_flag(next, TIF_NOHZ);
> > 	 else
> > 	 	clear_tsk_thread_flag(next, TIF_NOHZ);
> >
> > and then we can forget about copy_process(). Or I am totally confused?
> >
> >
> > I am also wondering if we can extend user_return_notifier to handle
> > enter/exit and kill TIF_NOHZ.
> >
> > Oleg.
> >
> > --- x/kernel/context_tracking.c
> > +++ x/kernel/context_tracking.c
> > @@ -202,7 +202,8 @@ void __context_tracking_task_switch(stru
> >  				    struct task_struct *next)
> >  {
> >  	clear_tsk_thread_flag(prev, TIF_NOHZ);
> > -	set_tsk_thread_flag(next, TIF_NOHZ);
> > +	if (context_tracking_cpu_is_enabled())
> > +		set_tsk_thread_flag(next, TIF_NOHZ);
> >  }
> >
> >  #ifdef CONFIG_CONTEXT_TRACKING_FORCE
>
> Unfortunately, as long as tasks can migrate in and out of a context tracked CPU, we
> need to track all CPUs.

Thanks Frederic for your explanations. Yes, I was confused. But cough, now I am
even more confused.

I didn't even try to read this code, perhaps I'll try later, but let me ask
another question while you are here ;)

The comment above __context_tracking_task_switch() says:

	 * The context tracking uses the syscall slow path to implement its user-kernel
	 * boundaries probes on syscalls. This way it doesn't impact the syscall fast
	 * path on CPUs that don't do context tracking.
	        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

How? Every running task should have TIF_NOHZ set if context_tracking_is_enabled() ?

	 * But we need to clear the flag on the previous task because it may later
	 * migrate to some CPU that doesn't do the context tracking. As such the TIF
	 * flag may not be desired there.

For what? How can this help? This flag will be set again when we switch to this
task again?

Looks like we can kill context_tracking_task_switch() and simply change the
"__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ)?
Then this flag will be propagated by copy_process().
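
IOW, something like the sketch below (untested, just to illustrate the idea;
where exactly this sits in the __init path is from memory):

	/* in the __init code that enables context tracking for a CPU */
	context_tracking_cpu_set(cpu);

	/*
	 * Set the flag once on the init/boot task; copy_process() copies the
	 * thread_info flags, so every child inherits TIF_NOHZ from then on
	 * and __context_tracking_task_switch() has nothing left to do.
	 */
	set_thread_flag(TIF_NOHZ);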

Or I am totally confused? (quite possible).

> So here is a scenario where this is a problem: a task runs on CPU 0, passes the context
> tracking call before returning from a syscall to userspace, and gets an interrupt. The
> interrupt preempts the task and it moves to CPU 1. So it returns from preempt_schedule_irq()
> after which it is going to resume to userspace.
>
> In this scenario, if context tracking is only enabled on CPU 1, we have no way to know that
> the task is resuming to userspace, because we passed through the context tracking probe
> already and it was ignored on CPU 0.

Thanks. But I still can't understand... So if we only track CPU 1, then in this
case context_tracking.state == IN_USER on CPU 0, but it can be IN_USER or IN_KERNEL
on CPU 1.

Oleg.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases
  2014-07-29 17:31             ` Oleg Nesterov
  (?)
@ 2014-07-29 17:55               ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-29 17:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Kees Cook, Will Drewry, X86 ML, linux-arm-kernel,
	Linux MIPS Mailing List, linux-arch, LSM List,
	Alexei Starovoitov, H. Peter Anvin

On Tue, Jul 29, 2014 at 10:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 07/29, Andy Lutomirski wrote:
>>
>> On Tue, Jul 29, 2014 at 9:54 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >
>> > Yes, just to trigger the slow path, I guess.
>> >
>> >> I'll update the code to call user_exit iff TIF_NOHZ is
>> >> set.
>> >
>> > Or perhaps it would be better to not add another user of this (strange) flag
>> > and just call user_exit() unconditionally(). But, yes, you need to use
>> > from "work = flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ)" then.\
>>
>> user_exit looks slow enough to me that a branch to try to avoid it may
>> be worthwhile.  I bet that explicitly checking the flag is
>> actually both faster and clearer.
>
> I don't think so (unless I am confused again), note that user_exit() uses
> jump label. But this doesn't matter. I meant that we should avoid TIF_NOHZ
> if possible because I think it should die somehow (currently I do not know
> how ;). And because it is ugly to check the same condition twice:
>
>         if (work & TIF_NOHZ) {
>                 // user_exit()
>                 if (context_tracking_is_enabled())
>                         context_tracking_user_exit();
>         }
>
> TIF_NOHZ is set if and only if context_tracking_is_enabled() is true.
> So I think that
>
>         work = current_thread_info()->flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ);
>
>         user_exit();
>
> looks a bit better. But I won't argue.

I don't get it.  context_tracking_is_enabled is global, and TIF_NOHZ
is per-task.  Isn't this stuff determined per-task or per-cpu or
something?

IOW, if one CPU is running something that's very heavily
userspace-oriented and another CPU is doing something syscall- or
sleep-heavy, then shouldn't only the first CPU end up paying the price
of context tracking?

>
>> That's what I did for v4.
>
> I am going to read it today. Not that I think I can help or find something
> wrong.
>
> Oleg.
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases
  2014-07-29 17:55               ` Andy Lutomirski
  (?)
@ 2014-07-29 18:16                 ` Oleg Nesterov
  -1 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2014-07-29 18:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Kees Cook, Will Drewry, X86 ML, linux-arm-kernel,
	Linux MIPS Mailing List, linux-arch, LSM List,
	Alexei Starovoitov, H. Peter Anvin

On 07/29, Andy Lutomirski wrote:
>
> On Tue, Jul 29, 2014 at 10:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I don't think so (unless I am confused again), note that user_exit() uses
> > jump label. But this doesn't matter. I meant that we should avoid TIF_NOHZ
> > if possible because I think it should die somehow (currently I do not know
> > how ;). And because it is ugly to check the same condition twice:
> >
> >         if (work & TIF_NOHZ) {
> >                 // user_exit()
> >                 if (context_tracking_is_enabled())
> >                         context_tracking_user_exit();
> >         }
> >
> > TIF_NOHZ is set if and only if context_tracking_is_enabled() is true.
> > So I think that
> >
> >         work = current_thread_info()->flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ);
> >
> >         user_exit();
> >
> > looks a bit better. But I won't argue.
>
> I don't get it.

Don't worry, you are not alone.

> context_tracking_is_enabled is global, and TIF_NOHZ
> is per-task.  Isn't this stuff determined per-task or per-cpu or
> something?
>
> IOW, if one CPU is running something that's very heavily
> userspace-oriented and another CPU is doing something syscall- or
> sleep-heavy, then shouldn't only the first CPU end up paying the price
> of context tracking?

Please see another email I sent to Frederic.

Oleg.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases
  2014-07-29 18:16                 ` Oleg Nesterov
  (?)
@ 2014-07-29 18:22                   ` Andy Lutomirski
  -1 siblings, 0 replies; 100+ messages in thread
From: Andy Lutomirski @ 2014-07-29 18:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Kees Cook, Will Drewry, X86 ML, linux-arm-kernel,
	Linux MIPS Mailing List, linux-arch, LSM List,
	Alexei Starovoitov, H. Peter Anvin

On Tue, Jul 29, 2014 at 11:16 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 07/29, Andy Lutomirski wrote:
>>
>> On Tue, Jul 29, 2014 at 10:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >
>> > I don't think so (unless I am confused again), note that user_exit() uses
>> > jump label. But this doesn't matter. I meant that we should avoid TIF_NOHZ
>> > if possible because I think it should die somehow (currently I do not know
>> > how ;). And because it is ugly to check the same condition twice:
>> >
>> >         if (work & TIF_NOHZ) {
>> >                 // user_exit()
>> >                 if (context_tracking_is_enabled())
>> >                         context_tracking_user_exit();
>> >         }
>> >
>> > TIF_NOHZ is set if and only if context_tracking_is_enabled() is true.
>> > So I think that
>> >
>> >         work = current_thread_info()->flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ);
>> >
>> >         user_exit();
>> >
>> > looks a bit better. But I won't argue.
>>
>> I don't get it.
>
> Don't worry, you are not alone.
>
>> context_tracking_is_enabled is global, and TIF_NOHZ
>> is per-task.  Isn't this stuff determined per-task or per-cpu or
>> something?
>>
>> IOW, if one CPU is running something that's very heavily
>> userspace-oriented and another CPU is doing something syscall- or
>> sleep-heavy, then shouldn't only the first CPU end up paying the price
>> of context tracking?
>
> Please see another email I sent to Frederic.
>

I'll add at least this argument in favor of my approach: if context
tracking works at all, then it had better not demand that syscall
entry call user_exit if TIF_NOHZ is *not* set.  So adding the
condition ought to be safe, barring dumb bugs in my code.
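
Concretely, phase 1 would do roughly this (paraphrasing my v4 rather than
quoting the exact hunk):

	unsigned long work;

	work = ACCESS_ONCE(current_thread_info()->flags) &
		_TIF_WORK_SYSCALL_ENTRY;

	/*
	 * Call user_exit() only when context tracking asked for it via
	 * TIF_NOHZ, and drop the bit so it doesn't force the rest of the
	 * slow path by itself.
	 */
	if (work & _TIF_NOHZ) {
		user_exit();
		work &= ~_TIF_NOHZ;
	}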

--Andy

> Oleg.
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases
  2014-07-29 18:22                   ` Andy Lutomirski
  (?)
@ 2014-07-29 18:44                     ` Oleg Nesterov
  -1 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2014-07-29 18:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Kees Cook, Will Drewry, X86 ML, linux-arm-kernel,
	Linux MIPS Mailing List, linux-arch, LSM List,
	Alexei Starovoitov, H. Peter Anvin

On 07/29, Andy Lutomirski wrote:
>
> On Tue, Jul 29, 2014 at 11:16 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > On 07/29, Andy Lutomirski wrote:
> >>
> >> On Tue, Jul 29, 2014 at 10:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >> >
> >> > TIF_NOHZ is set if and only if context_tracking_is_enabled() is true.
> >> > So I think that
> >> >
> >> >         work = current_thread_info()->flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ);
> >> >
> >> >         user_exit();
> >> >
> >> > looks a bit better. But I won't argue.
> >>
> >> I don't get it.
> >
> > Don't worry, you are not alone.
> >
> >> context_tracking_is_enabled is global, and TIF_NOHZ
> >> is per-task.  Isn't this stuff determined per-task or per-cpu or
> >> something?
> >>
> >> IOW, if one CPU is running something that's very heavily
> >> userspace-oriented and another CPU is doing something syscall- or
> >> sleep-heavy, then shouldn't only the first CPU end up paying the price
> >> of context tracking?
> >
> > Please see another email I sent to Frederic.
> >
> I'll add at least this argument in favor of my approach: if context
> tracking works at all, then it had better not demand that syscall
> entry call user_exit if TIF_NOHZ is *not* set.

I disagree. At least I disagree that you should enforce this in the
syscall_trace_enter() paths, and in any case it has nothing to do with
these changes.

But again, I won't insist, so please forget.

> So adding the
> condition ought to be safe, barring dumb bugs in my code.

Yes, I think it is technically correct.

Oleg.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-29 17:54           ` Oleg Nesterov
@ 2014-07-30 16:35             ` Frederic Weisbecker
  -1 siblings, 0 replies; 100+ messages in thread
From: Frederic Weisbecker @ 2014-07-30 16:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andy Lutomirski, Paul E. McKenney, linux-kernel, Kees Cook,
	Will Drewry, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa

On Tue, Jul 29, 2014 at 07:54:14PM +0200, Oleg Nesterov wrote:
> Thanks Frederic for your explanations. Yes, I was confused. But cough, now I am
> even more confused.
> 
> I didn't even try to read this code, perhaps I'll try later, but let me ask
> another question while you are here ;)
> 
> The comment above __context_tracking_task_switch() says:
> 
> 	 * The context tracking uses the syscall slow path to implement its user-kernel
> 	 * boundaries probes on syscalls. This way it doesn't impact the syscall fast
> 	 * path on CPUs that don't do context tracking.
> 	        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Indeed, in fact the comment is confusing in the way it explains things. It
suggests that some CPUs may be doing context tracking while others can choose
not to context track.

It's rather all or nothing. Actually TIF_NOHZ optimizes systems that have
CONFIG_CONTEXT_TRACKING=y but don't need context tracking. In that case
TIF_NOHZ is cleared and thus the syscall fastpath has no overhead.

So I should rephrase it that way:

        * The context tracking uses the syscall slow path to implement its user-kernel
        * boundaries probes on syscalls. This way it doesn't impact the syscall fast
        * path when context tracking is globally disabled.

> 
> How? Every running task should have TIF_NOHZ set if context_tracking_is_enabled() ?
> 
> 	 * But we need to clear the flag on the previous task because it may later
> 	 * migrate to some CPU that doesn't do the context tracking. As such the TIF
> 	 * flag may not be desired there.
> 
> For what? How this can help? This flag will be set again when we switch to this
> task again?

That is indeed a stale comment from an aborted early design.

> 
> Looks like, we can kill context_tracking_task_switch() and simply change the
> "__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
> Then this flag will be propagated by copy_process().

Right, that would be much better. Good catch! Context tracking is enabled from
tick_nohz_init(). That is the init 0 task, so the flag should be propagated from there.

I still think we need a for_each_process_thread() set as well though because some
kernel threads may well have been created at this stage already.
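
Something in this direction (rough sketch, not compile tested; the
tasklist_lock locking is just my guess at what's needed):

	struct task_struct *p, *t;

	/* catch the kernel threads that were forked before this point */
	read_lock(&tasklist_lock);
	for_each_process_thread(p, t)
		set_tsk_thread_flag(t, TIF_NOHZ);
	read_unlock(&tasklist_lock);

in addition to the set_thread_flag(TIF_NOHZ) done from tick_nohz_init()
itself.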

> 
> Or I am totally confused? (quite possible).
> 
> > So here is a scenario where this is a problem: a task runs on CPU 0, passes the context
> > tracking call before returning from a syscall to userspace, and gets an interrupt. The
> > interrupt preempts the task and it moves to CPU 1. So it returns from preempt_schedule_irq()
> > after which it is going to resume to userspace.
> >
> > In this scenario, if context tracking is only enabled on CPU 1, we have no way to know that
> > the task is resuming to userspace, because we passed through the context tracking probe
> > already and it was ignored on CPU 0.
> 
> Thanks. But I still can't understand... So if we only track CPU 1, then in this
> case context_tracking.state == IN_USER on CPU 0, but it can be IN_USER or IN_KERNEL
> on CPU 1.

I'm not sure I understand your question. Context tracking is either enabled everywhere or
nowhere.
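
The global switch is just a static key. From memory, the relevant
definitions look roughly like this:

	extern struct static_key context_tracking_enabled;

	static inline bool context_tracking_is_enabled(void)
	{
		return static_key_false(&context_tracking_enabled);
	}

	static inline void user_exit(void)
	{
		if (context_tracking_is_enabled())
			context_tracking_user_exit();
	}

so there is no per-CPU decision at that level.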

I need to say though that there is a per-CPU context tracking state named
context_tracking.active. It's confusing because it suggests that context
tracking is active per CPU. Actually it's tracked everywhere when globally
enabled, but .active determines whether we call the RCU and vtime callbacks
or not.

So only nohz_full CPUs have context_tracking.active set, because only these
need to call the RCU and vtime callbacks. Other CPUs still do the context
tracking, but they won't call the rcu and vtime functions.
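
To illustrate, context_tracking_user_enter() does roughly this (paraphrased
from memory, so the details may be off):

	unsigned long flags;

	local_irq_save(flags);
	if (__this_cpu_read(context_tracking.state) != IN_USER) {
		if (__this_cpu_read(context_tracking.active)) {
			/* only the nohz_full CPUs pay for these */
			vtime_user_enter(current);
			rcu_user_enter();
		}
		/* but the state itself is updated on every CPU */
		__this_cpu_write(context_tracking.state, IN_USER);
	}
	local_irq_restore(flags);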

> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-30 16:35             ` Frederic Weisbecker
@ 2014-07-30 17:46               ` Oleg Nesterov
  -1 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2014-07-30 17:46 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Paul E. McKenney, linux-kernel, Kees Cook,
	Will Drewry, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa

On 07/30, Frederic Weisbecker wrote:
>
> On Tue, Jul 29, 2014 at 07:54:14PM +0200, Oleg Nesterov wrote:
>
> >
> > Looks like, we can kill context_tracking_task_switch() and simply change the
> > "__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
> > Then this flag will be propagated by copy_process().
>
> Right, that would be much better. Good catch! context tracking is enabled from
> tick_nohz_init(). This is the init 0 task so the flag should be propagated from there.

actually init 1 task, but this doesn't matter.

> I still think we need a for_each_process_thread() set as well though because some
> kernel threads may well have been created at this stage already.

Yes... Or we can add set_thread_flag(TIF_NOHZ) into ____call_usermodehelper().

> > Or I am totally confused? (quite possible).
> >
> > > So here is a scenario where this is a problem: a task runs on CPU 0, passes the context
> > > tracking call before returning from a syscall to userspace, and gets an interrupt. The
> > > interrupt preempts the task and it moves to CPU 1. So it returns from preempt_schedule_irq()
> > > after which it is going to resume to userspace.
> > >
> > > In this scenario, if context tracking is only enabled on CPU 1, we have no way to know that
> > > the task is resuming to userspace, because we passed through the context tracking probe
> > > already and it was ignored on CPU 0.
> >
> > Thanks. But I still can't understand... So if we only track CPU 1, then in this
> > case context_tracking.state == IN_USER on CPU 0, but it can be IN_USER or IN_KERNEL
> > on CPU 1.
>
> I'm not sure I understand your question.

Probably because it was stupid. Seriously, I still have no idea what this code
actually does.

> Context tracking is either enabled everywhere or
> nowhere.
>
> I need to say though that there is a per CPU context tracking state named context_tracking.active.
> It's confusing because it suggests that context tracking is active per CPU. Actually it's tracked
> everywhere when globally enabled, but active determines if we call the RCU and vtime callbacks or
> not.
>
> So only nohz full CPUs have context_tracking.active set because only these need to call the RCU
> and vtime callbacks. Other CPUs still do the context tracking but they won't call rcu and vtime
> functions.

I meant that in the scenario you described above the "global" TIF_NOHZ doesn't
really make a difference, afaics.

Let's assume that context tracking is only enabled on CPU 1. To simplify,
assume that we have a single usermode task T which sleeps in kernel mode.

So context_tracking[0].state == context_tracking[1].state == IN_KERNEL.

T wakes up on CPU_0, returns to user space, and calls user_enter(). This sets
context_tracking[0].state = IN_USER but otherwise does nothing else, since
this CPU is not tracked and .active is false.

Right after local_irq_restore() this task can migrate to CPU_1 and finish
its ret-to-usermode path. But since it had already passed user_enter() we
do not change context_tracking[1].state and do not play with rcu/vtime.
(unless this task hits SCHEDULE_USER in asm).

The same for user_exit() of course.

Oleg.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-30 17:46               ` Oleg Nesterov
@ 2014-07-31  0:30                 ` Frederic Weisbecker
  -1 siblings, 0 replies; 100+ messages in thread
From: Frederic Weisbecker @ 2014-07-31  0:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andy Lutomirski, Paul E. McKenney, linux-kernel, Kees Cook,
	Will Drewry, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa

On Wed, Jul 30, 2014 at 07:46:30PM +0200, Oleg Nesterov wrote:
> On 07/30, Frederic Weisbecker wrote:
> >
> > On Tue, Jul 29, 2014 at 07:54:14PM +0200, Oleg Nesterov wrote:
> >
> > >
> > > Looks like, we can kill context_tracking_task_switch() and simply change the
> > > "__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
> > > Then this flag will be propagated by copy_process().
> >
> > Right, that would be much better. Good catch! context tracking is enabled from
> > tick_nohz_init(). This is the init 0 task so the flag should be propagated from there.
> 
> actually init 1 task, but this doesn't matter.

Are you sure? It does matter, because that would invalidate everything I understood
about init/main.c :) I was convinced that the very first kernel init task is PID 0,
which then forks in rest_init() to launch the userspace init with PID 1. Then init/0
becomes the idle task of the boot CPU.

> 
> > I still think we need a for_each_process_thread() set as well though because some
> > kernel threads may well have been created at this stage already.
> 
> Yes... Or we can add set_thread_flag(TIF_NOHZ) into ____call_usermodehelper().

Couldn't there be some tasks other than usermodehelper stuff at this stage? Like
workqueues or random kernel threads?

> 
> > > Or I am totally confused? (quite possible).
> > >
> > > > So here is a scenario where this is a problem: a task runs on CPU 0, passes the context
> > > > tracking call before returning from a syscall to userspace, and gets an interrupt. The
> > > > interrupt preempts the task and it moves to CPU 1. So it returns from preempt_schedule_irq()
> > > > after which it is going to resume to userspace.
> > > >
> > > > In this scenario, if context tracking is only enabled on CPU 1, we have no way to know that
> > > > the task is resuming to userspace, because we passed through the context tracking probe
> > > > already and it was ignored on CPU 0.
> > >
> > > Thanks. But I still can't understand... So if we only track CPU 1, then in this
> > > case context_tracking.state == IN_USER on CPU 0, but it can be IN_USER or IN_KERNEL
> > > on CPU 1.
> >
> > I'm not sure I understand your question.
> 
> Probably because it was stupid. Seriously, I still have no idea what this code
> actually does.
> 
> > Context tracking is either enabled everywhere or
> > nowhere.
> >
> > I need to say though that there is a per CPU context tracking state named context_tracking.active.
> > It's confusing because it suggests that context tracking is active per CPU. Actually it's tracked
> > everywhere when globally enabled, but active determines if we call the RCU and vtime callbacks or
> > not.
> >
> > So only nohz full CPUs have context_tracking.active set because only these need to call the RCU
> > and vtime callbacks. Other CPUs still do the context tracking but they won't call rcu and vtime
> > functions.
> 
> I meant that in the scenario you described above the "global" TIF_NOHZ doesn't
> really make a difference, afaics.
> 
> Lets assume that context tracking is only enabled on CPU 1. To simplify,
> assume that we have a single usermode task T which sleeps in kernel mode.
> 
> So context_tracking[0].state == context_tracking[1].state == IN_KERNEL.
> 
> T wakes up on CPU_0, returns to user space, calls user_enter(). This sets
> context_tracking[0].state = IN_USER but otherwise does nothing else, this
> CPU is not tracked and .active is false.
> 
> Right after local_irq_restore() this task can migrate to CPU_1 and finish
> its ret-to-usermode path. But since it had already passed user_enter() we
> do not change context_tracking[1].state and do not play with rcu/vtime.
> (unless this task hits SCHEDULE_USER in asm).
> 
> The same for user_exit() of course.

So indeed, if context tracking is enabled on CPU 1 and not on CPU 0, we risk
such a situation where CPU 1 has the wrong context tracking state.

But a global TIF_NOHZ should enforce context tracking everywhere. And it also
means less context switch overhead.

> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-31  0:30                 ` Frederic Weisbecker
@ 2014-07-31 16:03                   ` Oleg Nesterov
  -1 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2014-07-31 16:03 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Paul E. McKenney, linux-kernel, Kees Cook,
	Will Drewry, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa

On 07/31, Frederic Weisbecker wrote:
>
> On Wed, Jul 30, 2014 at 07:46:30PM +0200, Oleg Nesterov wrote:
> > On 07/30, Frederic Weisbecker wrote:
> > >
> > > On Tue, Jul 29, 2014 at 07:54:14PM +0200, Oleg Nesterov wrote:
> > >
> > > >
> > > > Looks like, we can kill context_tracking_task_switch() and simply change the
> > > > "__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
> > > > Then this flag will be propagated by copy_process().
> > >
> > > Right, that would be much better. Good catch! context tracking is enabled from
> > > tick_nohz_init(). This is the init 0 task so the flag should be propagated from there.
> >
> > actually init 1 task, but this doesn't matter.
>
> Are you sure? It does matter because that would invalidate everything I understood
> about init/main.c :)

Sorry for the confusion ;)

> I was convinced that the very first kernel init task is PID 0 then
> it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> idle task of the boot CPU.

Yes, sure. But context_tracking_cpu_set() is called by the init task with PID 1,
not by "swapper". And we do not care about idle threads at all.

> > > I still think we need a for_each_process_thread() set as well though because some
> > > kernel threads may well have been created at this stage already.
> >
> > Yes... Or we can add set_thread_flag(TIF_NOHZ) into ____call_usermodehelper().
>
> Couldn't there be some other tasks than usermodehelper stuffs at this stage? Like workqueues
> or random kernel threads?

Sure, but we do not care. A kernel thread can never return to user space, so it
must never call user_enter/exit().

> > I meant that in the scenario you described above the "global" TIF_NOHZ doesn't
> > really make a difference, afaics.
> >
> > Lets assume that context tracking is only enabled on CPU 1. To simplify,
> > assume that we have a single usermode task T which sleeps in kernel mode.
> >
> > So context_tracking[0].state == context_tracking[1].state == IN_KERNEL.
> >
> > T wakes up on CPU_0, returns to user space, calls user_enter(). This sets
> > context_tracking[0].state = IN_USER but otherwise does nothing else, this
> > CPU is not tracked and .active is false.
> >
> > Right after local_irq_restore() this task can migrate to CPU_1 and finish
> > its ret-to-usermode path. But since it had already passed user_enter() we
> > do not change context_tracking[1].state and do not play with rcu/vtime.
> > (unless this task hits SCHEDULE_USER in asm).
> >
> > The same for user_exit() of course.
>
> So indeed if context tracking is enabled on CPU 1 and not in CPU 0, we risk
> such situation where CPU 1 has wrong context tracking.

OK. To simplify, let's discuss user_enter() only. So it is actually a nop on
CPU_0, and CPU_1 can miss it anyway.

> But global TIF_NOHZ should enforce context tracking everywhere.

And this is what I can't understand. Let's return to my initial question: why
can't we change __context_tracking_task_switch() to

	void __context_tracking_task_switch(struct task_struct *prev,
					    struct task_struct *next)
	{
		if (context_tracking_cpu_is_enabled())
			set_tsk_thread_flag(next, TIF_NOHZ);
		else
			clear_tsk_thread_flag(next, TIF_NOHZ);
	}

? How can the global TIF_NOHZ help?

OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
comes to user_enter(). But this case is very unlikely, and it certainly can't
explain why we penalize the untracked CPUs?

> And also it's
> less context switch overhead.

Why???

I think I have a blind spot here. Help!



And of course I can't understand exception_enter/exit(). Not to mention that
(afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
if we forget that the caller can migrate in between. Just because, once again,
a tracked CPU can miss user_exit().

So, why not

	static inline void exception_enter(void)
	{
		user_exit();
	}

	static inline void exception_exit(struct pt_regs *regs)
	{
		if (user_mode(regs))
			user_enter();
	}

?

Oleg.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-31 16:03                   ` Oleg Nesterov
@ 2014-07-31 17:13                     ` Frederic Weisbecker
  -1 siblings, 0 replies; 100+ messages in thread
From: Frederic Weisbecker @ 2014-07-31 17:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andy Lutomirski, Paul E. McKenney, linux-kernel, Kees Cook,
	Will Drewry, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa

On Thu, Jul 31, 2014 at 06:03:53PM +0200, Oleg Nesterov wrote:
> On 07/31, Frederic Weisbecker wrote:
> >
> > On Wed, Jul 30, 2014 at 07:46:30PM +0200, Oleg Nesterov wrote:
> > > On 07/30, Frederic Weisbecker wrote:
> > > >
> > > > On Tue, Jul 29, 2014 at 07:54:14PM +0200, Oleg Nesterov wrote:
> > > >
> > > > >
> > > > > Looks like, we can kill context_tracking_task_switch() and simply change the
> > > > > "__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
> > > > > Then this flag will be propagated by copy_process().
> > > >
> > > > Right, that would be much better. Good catch! context tracking is enabled from
> > > > tick_nohz_init(). This is the init 0 task so the flag should be propagated from there.
> > >
> > > actually init 1 task, but this doesn't matter.
> >
> > Are you sure? It does matter because that would invalidate everything I understood
> > about init/main.c :)
> 
> Sorry for confusion ;)
> 
> > I was convinced that the very first kernel init task is PID 0 then
> > it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> > idle task of the boot CPU.
> 
> Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> by "swapper". 

Are you sure? It's called from start_kernel() which is init/0.

But that's not the point. You're right that kernel threads don't care about it.
In fact we only need init/1 to get the flag and also call_usermodehelper(), good point!

Maybe we should just enable it everywhere. Trying to spare the flag on specific tasks will
bring headaches for not much performance win. Kernel threads don't do syscalls and seldom
do exceptions, so they shouldn't hit context tracking that much.

> 
> > > > I still think we need a for_each_process_thread() set as well though because some
> > > > kernel threads may well have been created at this stage already.
> > >
> > > Yes... Or we can add set_thread_flag(TIF_NOHZ) into ____call_usermodehelper().
> >
> > Couldn't there be some other tasks than usermodehelper stuffs at this stage? Like workqueues
> > or random kernel threads?
> 
> Sure, but we do not care. A kernel thread can never return to user space, it
> must never call user_enter/exit().

Yep, thanks for noticing that!

> 
> > > I meant that in the scenario you described above the "global" TIF_NOHZ doesn't
> > > really make a difference, afaics.
> > >
> > > Lets assume that context tracking is only enabled on CPU 1. To simplify,
> > > assume that we have a single usermode task T which sleeps in kernel mode.
> > >
> > > So context_tracking[0].state == context_tracking[1].state == IN_KERNEL.
> > >
> > > T wakes up on CPU_0, returns to user space, calls user_enter(). This sets
> > > context_tracking[0].state = IN_USER but otherwise does nothing else, this
> > > CPU is not tracked and .active is false.
> > >
> > > Right after local_irq_restore() this task can migrate to CPU_1 and finish
> > > its ret-to-usermode path. But since it had already passed user_enter() we
> > > do not change context_tracking[1].state and do not play with rcu/vtime.
> > > (unless this task hits SCHEDULE_USER in asm).
> > >
> > > The same for user_exit() of course.
> >
> > So indeed if context tracking is enabled on CPU 1 and not in CPU 0, we risk
> > such situation where CPU 1 has wrong context tracking.
> 
> OK. To simplify, lets discuss user_enter() only. So, it is actually a nop on
> CPU_0, and CPU_1 can miss it anyway.
> 
> > But global TIF_NOHZ should enforce context tracking everywhere.
> 
> And this is what I can't understand. Lets return to my initial question, why
> we can't change __context_tracking_task_switch()
> 
> 	void __context_tracking_task_switch(struct task_struct *prev,
> 					    struct task_struct *next)
> 	{
> 		if (context_tracking_cpu_is_enabled())
> 			set_tsk_thread_flag(next, TIF_NOHZ);
> 		else
> 			clear_tsk_thread_flag(next, TIF_NOHZ);
> 	}
> 
> ?

Well we can change it to global TIF_NOHZ

> How can the global TIF_NOHZ help?

It avoids that flag swap on task_switch.

> 
> OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
> slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
> comes to user_enter(). But this case is very unlikely, certainly this can't
> explain why do we penalize the untracked CPU's ?

It's rather that CPU 0 calls user_enter() and then migrate to CPU 1 and resume
to userspace.

It's unlikely but possible. I actually observed that very easily on early testing.

And it's a big problem because then the CPU runs in userspace, possibly for a long while
in HPC case, and context tracking thinks it is in kernelspace. As a result, RCU waits
for that CPU to complete grace periods and cputime is accounted to kernelspace instead of
userspace.

It looks like a harmless race but it can have big consequences.

> 
> > And also it's
> > less context switch overhead.
> 
> Why???

Because calling context_tracking_task_switch() on every context switch is costly.

> 
> I think I have a blind spot here. Help!
> 
> 
> 
> And of course I can't understand exception_enter/exit(). Not to mention that
> (afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
> if we forget that the caller can migrate in between. Just because, once again,
> a tracked CPU can miss user_exit().

You lost me on this. How can a tracked CPU miss user_exit()?

> 
> So, why not
> 
> 	static inline void exception_enter(void)
> 	{
> 		user_exit();
> 	}
> 
> 	static inline void exception_exit(struct pt_regs *regs)
> 	{
> 		if (user_mode(regs))
> 			user_enter();
> 	}

That's how I implemented it first. But then I changed it the way it is now:
6c1e0256fad84a843d915414e4b5973b7443d48d
("context_tracking: Restore correct previous context state on exception exit")

This is again due to the shift between hard and soft userspace boundaries.
user_mode(regs) checks hard boundaries only.
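
So the pair now saves and restores the soft state instead; roughly (from-memory
sketch, not the exact code in the tree):

	static inline enum ctx_state exception_enter(void)
	{
		enum ctx_state prev_ctx;

		prev_ctx = this_cpu_read(context_tracking.state);
		user_exit();

		return prev_ctx;
	}

	static inline void exception_exit(enum ctx_state prev_ctx)
	{
		/* restore whatever the soft state was when the exception hit */
		if (prev_ctx == IN_USER)
			user_enter();
	}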

Lets get back to our beloved example:

          CPU 0                                  CPU 1
          ---------------------------------------------

          returning from syscall {
               user_enter();
               exception {
                    exception_enter()
                    PREEMPT!
                    ----------------------->
                                                 //resume exception
                                                   exception_exit();
                                                   return to userspace

Here if we use user_mode(regs) from exception_exit(), we are screwed because
the task is in the dead zone between user_enter() and the actual hard return to
userspace.

user_mode() thinks we are in the kernel, but from the context tracking POV we
are in userspace. So we again risk to run in userspace for an undetermined time
and RCU will think we are in the kernel and disturb CPU 1 with IPIs to report
quiescent states. Also all the time spent in userspace will be accounted as kernelspace.

Yeah the context tracking code gave me a lot of headaches :)

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-31 17:13                     ` Frederic Weisbecker
@ 2014-07-31 18:12                       ` Oleg Nesterov
  -1 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2014-07-31 18:12 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Paul E. McKenney, linux-kernel, Kees Cook,
	Will Drewry, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa

On 07/31, Frederic Weisbecker wrote:
>
> On Thu, Jul 31, 2014 at 06:03:53PM +0200, Oleg Nesterov wrote:
> > On 07/31, Frederic Weisbecker wrote:
> > >
> > > I was convinced that the very first kernel init task is PID 0 then
> > > it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> > > idle task of the boot CPU.
> >
> > Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> > by "swapper".
>
> Are you sure? It's called from start_kernel() which is init/0.

But do_initcalls() is called by kernel_init(), this is the init process which is
going to exec /sbin/init later.

But this doesn't really matter,

> Maybe we should just enable it everywhere.

Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
of start_kernel(). The question is, I still can't understand why do we want to
have the global TIF_NOHZ.

> > OK. To simplify, lets discuss user_enter() only. So, it is actually a nop on
> > CPU_0, and CPU_1 can miss it anyway.
> >
> > > But global TIF_NOHZ should enforce context tracking everywhere.
> >
> > And this is what I can't understand. Lets return to my initial question, why
> > we can't change __context_tracking_task_switch()
> >
> > 	void __context_tracking_task_switch(struct task_struct *prev,
> > 					    struct task_struct *next)
> > 	{
> > 		if (context_tracking_cpu_is_enabled())
> > 			set_tsk_thread_flag(next, TIF_NOHZ);
> > 		else
> > 			clear_tsk_thread_flag(next, TIF_NOHZ);
> > 	}
> >
> > ?
>
> Well we can change it to global TIF_NOHZ
>
> > How can the global TIF_NOHZ help?
>
> It avoids that flag swap on task_switch.

Ah, you probably meant that we can kill context_tracking_task_switch() as
we discussed.

But I meant another thing, TIF_NOHZ is already global because it is always
set after context_tracking_cpu_set().

Performance-wise, this set/clear code above can be better because it avoids
the slow paths on the untracked CPU's. But let's ignore this, the question is
why the change above is not correct? Or why it can make the things worse?


> > OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
> > slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
> > comes to user_enter(). But this case is very unlikely, certainly this can't
> > explain why do we penalize the untracked CPU's ?
>
> It's rather that CPU 0 calls user_enter() and then migrate to CPU 1 and resume
> to userspace.

And in this case a) user_enter() is pointless on CPU_0, and b) CPU_1 misses
the necessary user_enter().

> It's unlikely but possible. I actually observed that very easily on early testing.

Sure. And this can happen anyway? Why the change in __context_tracking_task_switch()
is wrong?

> And it's a big problem because then the CPU runs in userspace, possibly for a long while
> in HPC case, and context tracking thinks it is in kernelspace. As a result, RCU waits
> for that CPU to complete grace periods and cputime is accounted to kernelspace instead of
> userspace.
>
> It looks like a harmless race but it can have big consequences.

I see. Again, does the global TIF_NOHZ really help?

> > > And also it's
> > > less context switch overhead.
> >
> > Why???
>
> Because calling context_tracking_task_switch() on every context switch is costly.

See above. This depends, but forcing the slow paths on all CPU's can be more
costly.

> > And of course I can't understand exception_enter/exit(). Not to mention that
> > (afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
> > if we forget that the caller can migrate in between. Just because, once again,
> > a tracked CPU can miss user_exit().
>
> You lost me on this. How can a tracked CPU miss user_exit()?

I am lost too ;) Didn't we already discuss this? A task calls user_exit() on
CPU_0 for no reason, migrates to the tracked CPU_1 and finally returns to user
space leaving this cpu in IN_KERNEL state?

> > So, why not
> >
> > 	static inline void exception_enter(void)
> > 	{
> > 		user_exit();
> > 	}
> >
> > 	static inline void exception_exit(struct pt_regs *regs)
> > 	{
> > 		if (user_mode(regs))
> > 			user_enter();
> > 	}
>
> That's how I implemented it first. But then I changed it the way it is now:
> 6c1e0256fad84a843d915414e4b5973b7443d48d
> ("context_tracking: Restore correct previous context state on exception exit")
>
> This is again due to the shift between hard and soft userspace boundaries.
> user_mode(regs) checks hard boundaries only.
>
> Lets get back to our beloved example:
>
>           CPU 0                                  CPU 1
>           ---------------------------------------------
>
>           returning from syscall {
>                user_enter();
>                exception {
>                     exception_enter()
>                     PREEMPT!
>                     ----------------------->
>                                                  //resume exception
>                                                    exception_exit();
>                                                    return to userspace

OK, thanks, so in this case we miss user_enter().

But again, we can miss it anyway if preemption comes before "exception" ?
if CPU 1 was in IN_KERNEL state.

Oleg.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-31 18:12                       ` Oleg Nesterov
@ 2014-07-31 18:47                         ` Frederic Weisbecker
  -1 siblings, 0 replies; 100+ messages in thread
From: Frederic Weisbecker @ 2014-07-31 18:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andy Lutomirski, Paul E. McKenney, linux-kernel, Kees Cook,
	Will Drewry, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa

On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
> On 07/31, Frederic Weisbecker wrote:
> >
> > On Thu, Jul 31, 2014 at 06:03:53PM +0200, Oleg Nesterov wrote:
> > > On 07/31, Frederic Weisbecker wrote:
> > > >
> > > > I was convinced that the very first kernel init task is PID 0 then
> > > > it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> > > > idle task of the boot CPU.
> > >
> > > Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> > > by "swapper".
> >
> > Are you sure? It's called from start_kernel() which is init/0.
> 
> But do_initcalls() is called by kernel_init(), this is the init process which is
> going to exec /sbin/init later.
> 
> But this doesn't really matter,

Yeah but tick_nohz_init() is not an initcall, it's a function called from start_kernel(),
before initcalls.

> 
> > Maybe we should just enable it everywhere.
> 
> Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
> of start_kernel(). The question is, I still can't understand why do we want to
> have the global TIF_NOHZ.

Because then the flag is inherited in forks. It's better than inheriting it on
context switch due to context switch being called much more often than fork.

> 
> > > OK. To simplify, lets discuss user_enter() only. So, it is actually a nop on
> > > CPU_0, and CPU_1 can miss it anyway.
> > >
> > > > But global TIF_NOHZ should enforce context tracking everywhere.
> > >
> > > And this is what I can't understand. Lets return to my initial question, why
> > > we can't change __context_tracking_task_switch()
> > >
> > > 	void __context_tracking_task_switch(struct task_struct *prev,
> > > 					    struct task_struct *next)
> > > 	{
> > > 		if (context_tracking_cpu_is_enabled())
> > > 			set_tsk_thread_flag(next, TIF_NOHZ);
> > > 		else
> > > 			clear_tsk_thread_flag(next, TIF_NOHZ);
> > > 	}
> > >
> > > ?
> >
> > Well we can change it to global TIF_NOHZ
> >
> > > How can the global TIF_NOHZ help?
> >
> > It avoids that flag swap on task_switch.
> 
> Ah, you probably meant that we can kill context_tracking_task_switch() as
> we discussed.

Yeah.

> 
> But I meant another thing, TIF_NOHZ is already global because it is always
> set after context_tracking_cpu_set().
> 
> Performance-wise, this set/clear code above can be better because it avoids
> the slow paths on the untracked CPU's.

But all CPUs are tracked when context tracking is enabled. So there is no such
per CPU granularity.

> But let's ignore this, the question is
> why the change above is not correct? Or why it can make the things worse?

Which change above?

> 
> 
> > > OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
> > > slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
> > > comes to user_enter(). But this case is very unlikely, certainly this can't
> > > explain why do we penalize the untracked CPU's ?
> >
> > It's rather that CPU 0 calls user_enter() and then migrate to CPU 1 and resume
> > to userspace.
> 
> And in this case a) user_enter() is pointless on CPU_0, and b) CPU_1 misses
> the necessary user_enter().

No, user_enter() is necessary on CPU 0 so that CPU 1 sees that it is in userspace
from context tracking POV.

> 
> > It's unlikely but possible. I actually observed that very easily on early testing.
> 
> Sure. And this can happen anyway? Why the change in __context_tracking_task_switch()
> is wrong?

Which change?

> 
> > And it's a big problem because then the CPU runs in userspace, possibly for a long while
> > in HPC case, and context tracking thinks it is in kernelspace. As a result, RCU waits
> > for that CPU to complete grace periods and cputime is accounted to kernelspace instead of
> > userspace.
> >
> > It looks like a harmless race but it can have big consequences.
> 
> I see. Again, does the global TIF_NOHZ really help?

Yes, to remove the context switch overhead. But it doesn't change anything on those races.

> > Because calling context_tracking_task_switch() on every context switch is costly.
> 
> See above. This depends, but forcing the slow paths on all CPU's can be more
> costly.

Yeah but I don't have a safe solution that avoids that.

> 
> > > And of course I can't understand exception_enter/exit(). Not to mention that
> > > (afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
> > > if we forget that the caller can migrate in between. Just because, once again,
> > > a tracked CPU can miss user_exit().
> >
> > You lost me on this. How can a tracked CPU miss user_exit()?
> 
> I am lost too ;) Didn't we already discuss this? A task calls user_exit() on
> CPU_0 for no reason, migrates to the tracked CPU_1 and finally returns to user
> space leaving this cpu in IN_KERNEL state?

Yeah, so? :)

I'm pretty sure there is a small but important element here that makes us misunderstand
what each other says. It's like we aren't talking about the same thing :)

> > That's how I implemented it first. But then I changed it the way it is now:
> > 6c1e0256fad84a843d915414e4b5973b7443d48d
> > ("context_tracking: Restore correct previous context state on exception exit")
> >
> > This is again due to the shift between hard and soft userspace boundaries.
> > user_mode(regs) checks hard boundaries only.
> >
> > Lets get back to our beloved example:
> >
> >           CPU 0                                  CPU 1
> >           ---------------------------------------------
> >
> >           returning from syscall {
> >                user_enter();
> >                exception {
> >                     exception_enter()
> >                     PREEMPT!
> >                     ----------------------->
> >                                                  //resume exception
> >                                                    exception_exit();
> >                                                    return to userspace
> 
> OK, thanks, so in this case we miss user_enter().
> 
> But again, we can miss it anyway if preemption comes before "exception" ?
> if CPU 1 was in IN_KERNEL state.

No, because preempt_schedule_irq() does the ctx_state save and restore with
exception_enter/exception_exit.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-31 18:47                         ` Frederic Weisbecker
@ 2014-07-31 18:50                           ` Frederic Weisbecker
  -1 siblings, 0 replies; 100+ messages in thread
From: Frederic Weisbecker @ 2014-07-31 18:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andy Lutomirski, Paul E. McKenney, LKML, Kees Cook, Will Drewry,
	the arch/x86 maintainers,
	<linux-arm-kernel@lists.infradead.org>,
	Linux MIPS Mailing List, linux-arch, LSM List,
	Alexei Starovoitov, H. Peter Anvin

2014-07-31 20:47 GMT+02:00 Frederic Weisbecker <fweisbec@gmail.com>:
> On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
>> On 07/31, Frederic Weisbecker wrote:
> No, because preempt_schedule_irq() does the ctx_state save and restore with
> exception_enter/exception_exit.

Similar thing happens with schedule_user().

preempt_schedule_irq() handles kernel preemption and schedule_user()
the user preemption. In both cases we save and restore the context
tracking state.
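
Roughly, both wrappers look like this (sketch from memory, preempt count
juggling omitted):

	asmlinkage __visible void __sched schedule_user(void)
	{
		enum ctx_state prev_state = exception_enter();

		schedule();

		exception_exit(prev_state);
	}

	asmlinkage __visible void __sched preempt_schedule_irq(void)
	{
		enum ctx_state prev_state = exception_enter();

		do {
			local_irq_enable();
			__schedule();
			local_irq_disable();
		} while (need_resched());

		exception_exit(prev_state);
	}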

This might be the missing piece you were missing :)

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-31 18:50                           ` Frederic Weisbecker
@ 2014-07-31 19:05                             ` Oleg Nesterov
  -1 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2014-07-31 19:05 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Paul E. McKenney, LKML, Kees Cook, Will Drewry,
	the arch/x86 maintainers,
	<linux-arm-kernel@lists.infradead.org>,
	Linux MIPS Mailing List, linux-arch, LSM List,
	Alexei Starovoitov, H. Peter Anvin

On 07/31, Frederic Weisbecker wrote:
>
> 2014-07-31 20:47 GMT+02:00 Frederic Weisbecker <fweisbec@gmail.com>:
> > On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
> >> On 07/31, Frederic Weisbecker wrote:
> > No, because preempt_schedule_irq() does the ctx_state save and restore with
> > exception_enter/exception_exit.
>
> Similar thing happens with schedule_user().
>
> preempt_schedule_irq() handles kernel preemption and schedule_user()
> the user preemption. In both cases we save and restore the context
> tracking state.
>
> This might be the missing piece you were missing :)

YYYYYEEEEESSSS, thanks!!

And in fact I was going to suggest to add this logic into preempt schedule
paths to improve the situation if we can't make TIF_NOHZ per-cpu.

But Frederic, perhaps I'll return here tomorrow with another question, it
is too late for me now ;)

Thanks!

Oleg.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-07-31 18:47                         ` Frederic Weisbecker
@ 2014-08-02 17:30                           ` Oleg Nesterov
  -1 siblings, 0 replies; 100+ messages in thread
From: Oleg Nesterov @ 2014-08-02 17:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Paul E. McKenney, linux-kernel, Kees Cook,
	Will Drewry, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa

On 07/31, Frederic Weisbecker wrote:
>
> On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
> > > >
> > > > Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> > > > by "swapper".
> > >
> > > Are you sure? It's called from start_kernel() which is init/0.
> >
> > But do_initcalls() is called by kernel_init(), this is the init process which is
> > going to exec /sbin/init later.
> >
> > But this doesn't really matter,
>
> Yeah but tick_nohz_init() is not an initcall, it's a function called from start_kernel(),
> before initcalls.

Ah, indeed, and context_tracking_init() too. Even better, so we only need

	--- x/kernel/context_tracking.c
	+++ x/kernel/context_tracking.c
	@@ -30,8 +30,10 @@ EXPORT_SYMBOL_GPL(context_tracking_enabl
	 DEFINE_PER_CPU(struct context_tracking, context_tracking);
	 EXPORT_SYMBOL_GPL(context_tracking);
	 
	-void context_tracking_cpu_set(int cpu)
	+void __init context_tracking_cpu_set(int cpu)
	 {
	+	/* Called by "swapper" thread, all threads will inherit this flag */
	+	set_thread_flag(TIF_NOHZ);
		if (!per_cpu(context_tracking.active, cpu)) {
			per_cpu(context_tracking.active, cpu) = true;
			static_key_slow_inc(&context_tracking_enabled);

and now we can kill context_tracking_task_switch() ?
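
(The propagation itself is just the thread_info copy that dup_task_struct()
does on fork; from memory, roughly:

	static inline void setup_thread_stack(struct task_struct *p, struct task_struct *org)
	{
		*task_thread_info(p) = *task_thread_info(org);	/* TIF_NOHZ comes along */
		task_thread_info(p)->task = p;
	}

so setting the flag once before the first fork covers every later task.)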

> > Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
> > of start_kernel(). The question is, I still can't understand why do we want to
> > have the global TIF_NOHZ.
>
> Because then the flag is inherited in forks. It's better than inheriting it on
> context switch due to context switch being called much more often than fork.

This is clear, that is why I suggested this. Just we didn't understand each other,
when I said "global TIF_NOHZ" I meant the current situtation when every (running)
task has this bit set anyway. Sorry for confusion.

> No, because preempt_schedule_irq() does the ctx_state save and restore with
> exception_enter/exception_exit.

Thanks again. Can't understand how I managed to miss that exception_enter/exit
in preempt_schedule_*.

Damn. And after I spent more time, I don't have any idea how to make this
tracking cheaper.

Oleg.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
  2014-08-02 17:30                           ` Oleg Nesterov
@ 2014-08-04 12:02                             ` Paul E. McKenney
  -1 siblings, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2014-08-04 12:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Andy Lutomirski, linux-kernel, Kees Cook,
	Will Drewry, x86, linux-arm-kernel, linux-mips, linux-arch,
	linux-security-module, Alexei Starovoitov, hpa

On Sat, Aug 02, 2014 at 07:30:24PM +0200, Oleg Nesterov wrote:
> On 07/31, Frederic Weisbecker wrote:
> >
> > On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
> > > > >
> > > > > Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> > > > > by "swapper".
> > > >
> > > > Are you sure? It's called from start_kernel() which is init/0.
> > >
> > > But do_initcalls() is called by kernel_init(), which is the init process that is
> > > going to exec /sbin/init later.
> > >
> > > But this doesn't really matter,
> >
> > Yeah but tick_nohz_init() is not an initcall, it's a function called from start_kernel(),
> > before initcalls.
> 
> Ah, indeed, and context_tracking_init() too. Even better, so we only need
> 
> 	--- x/kernel/context_tracking.c
> 	+++ x/kernel/context_tracking.c
> 	@@ -30,8 +30,10 @@ EXPORT_SYMBOL_GPL(context_tracking_enabl
> 	 DEFINE_PER_CPU(struct context_tracking, context_tracking);
> 	 EXPORT_SYMBOL_GPL(context_tracking);
> 	 
> 	-void context_tracking_cpu_set(int cpu)
> 	+void __init context_tracking_cpu_set(int cpu)
> 	 {
> 	+	/* Called by "swapper" thread, all threads will inherit this flag */
> 	+	set_thread_flag(TIF_NOHZ);
> 		if (!per_cpu(context_tracking.active, cpu)) {
> 			per_cpu(context_tracking.active, cpu) = true;
> 			static_key_slow_inc(&context_tracking_enabled);
> 
> and now we can kill context_tracking_task_switch() ?
> 
> > > Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
> > > of start_kernel(). The question is, I still can't understand why we want to
> > > have the global TIF_NOHZ.
> >
> > Because then the flag is inherited across forks. It's better than inheriting it on
> > context switch, since context switches happen much more often than forks.
> 
> This is clear; that is why I suggested it. We just didn't understand each other:
> when I said "global TIF_NOHZ" I meant the current situation, where every (running)
> task has this bit set anyway. Sorry for the confusion.
> 
> > No, because preempt_schedule_irq() does the ctx_state save and restore with
> > exception_enter/exception_exit.
> 
> Thanks again. Can't understand how I managed to miss that exception_enter/exit
> in preempt_schedule_*.
> 
> Damn. And after I spent more time, I don't have any idea how to make this
> tracking cheaper.

Mike Galbraith's profiles showed that timekeeping was one of the most
expensive operations.  Would it make sense to have the option of statistical
jiffy-based accounting?  The idea would be to sample the jiffies counter
at each context switch, and charge the time to whoever happens to be running
when the jiffies counter increments.
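
A minimal sketch of that idea, with hypothetical names (stat_account_switch() and
the acct_jiffies field do not exist; this only illustrates the sampling scheme,
not an existing kernel API):

	/*
	 * Hypothetical illustration only: at each context switch, charge the
	 * whole jiffies that elapsed since the previous switch on this CPU
	 * to the task that was running (prev).  No sub-tick precision is kept.
	 */
	static DEFINE_PER_CPU(unsigned long, last_acct_jiffies);

	static void stat_account_switch(struct task_struct *prev)
	{
		unsigned long now = jiffies;
		unsigned long *last = this_cpu_ptr(&last_acct_jiffies);

		prev->acct_jiffies += now - *last;	/* hypothetical field */
		*last = now;
	}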

							Thanx, Paul


^ permalink raw reply	[flat|nested] 100+ messages in thread

end of thread, other threads:[~2014-08-04 12:02 UTC | newest]

Thread overview: 100+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-22  1:49 [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes Andy Lutomirski
2014-07-22  1:49 ` Andy Lutomirski
2014-07-22  1:49 ` [PATCH v3 1/8] seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing Andy Lutomirski
2014-07-22  1:49   ` [PATCH v3 1/8] seccomp, x86, arm, mips, s390: " Andy Lutomirski
2014-07-22  1:49 ` [PATCH v3 2/8] seccomp: Refactor the filter callback and the API Andy Lutomirski
2014-07-22  1:49   ` Andy Lutomirski
2014-07-22  1:49 ` [PATCH v3 3/8] seccomp: Allow arch code to provide seccomp_data Andy Lutomirski
2014-07-22  1:49   ` Andy Lutomirski
2014-07-22  1:49 ` [PATCH v3 4/8] seccomp: Document two-phase seccomp and arch-provided seccomp_data Andy Lutomirski
2014-07-22  1:49   ` Andy Lutomirski
2014-07-22  1:49   ` Andy Lutomirski
2014-07-22  1:53 ` [PATCH v3 5/8] x86,x32,audit: Fix x32's AUDIT_ARCH wrt audit Andy Lutomirski
2014-07-22  1:53   ` Andy Lutomirski
2014-07-22  1:53 ` [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases Andy Lutomirski
2014-07-22  1:53   ` Andy Lutomirski
2014-07-28 17:37   ` Oleg Nesterov
2014-07-28 17:37     ` Oleg Nesterov
2014-07-28 18:58     ` TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases) Oleg Nesterov
2014-07-28 18:58       ` Oleg Nesterov
2014-07-28 19:22       ` Frederic Weisbecker
2014-07-28 19:22         ` Frederic Weisbecker
2014-07-29 17:54         ` Oleg Nesterov
2014-07-29 17:54           ` Oleg Nesterov
2014-07-30 16:35           ` Frederic Weisbecker
2014-07-30 16:35             ` Frederic Weisbecker
2014-07-30 17:46             ` Oleg Nesterov
2014-07-30 17:46               ` Oleg Nesterov
2014-07-31  0:30               ` Frederic Weisbecker
2014-07-31  0:30                 ` Frederic Weisbecker
2014-07-31 16:03                 ` Oleg Nesterov
2014-07-31 16:03                   ` Oleg Nesterov
2014-07-31 17:13                   ` Frederic Weisbecker
2014-07-31 17:13                     ` Frederic Weisbecker
2014-07-31 18:12                     ` Oleg Nesterov
2014-07-31 18:12                       ` Oleg Nesterov
2014-07-31 18:47                       ` Frederic Weisbecker
2014-07-31 18:47                         ` Frederic Weisbecker
2014-07-31 18:50                         ` Frederic Weisbecker
2014-07-31 18:50                           ` Frederic Weisbecker
2014-07-31 18:50                           ` Frederic Weisbecker
2014-07-31 19:05                           ` Oleg Nesterov
2014-07-31 19:05                             ` Oleg Nesterov
2014-07-31 19:05                             ` Oleg Nesterov
2014-08-02 17:30                         ` Oleg Nesterov
2014-08-02 17:30                           ` Oleg Nesterov
2014-08-04 12:02                           ` Paul E. McKenney
2014-08-04 12:02                             ` Paul E. McKenney
2014-08-04 12:02                             ` Paul E. McKenney
2014-07-28 20:23     ` [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases Andy Lutomirski
2014-07-28 20:23       ` Andy Lutomirski
2014-07-28 20:23       ` Andy Lutomirski
2014-07-29 16:54       ` Oleg Nesterov
2014-07-29 16:54         ` Oleg Nesterov
2014-07-29 16:54         ` Oleg Nesterov
2014-07-29 17:01         ` Andy Lutomirski
2014-07-29 17:01           ` Andy Lutomirski
2014-07-29 17:01           ` Andy Lutomirski
2014-07-29 17:31           ` Oleg Nesterov
2014-07-29 17:31             ` Oleg Nesterov
2014-07-29 17:31             ` Oleg Nesterov
2014-07-29 17:55             ` Andy Lutomirski
2014-07-29 17:55               ` Andy Lutomirski
2014-07-29 17:55               ` Andy Lutomirski
2014-07-29 18:16               ` Oleg Nesterov
2014-07-29 18:16                 ` Oleg Nesterov
2014-07-29 18:16                 ` Oleg Nesterov
2014-07-29 18:22                 ` Andy Lutomirski
2014-07-29 18:22                   ` Andy Lutomirski
2014-07-29 18:22                   ` Andy Lutomirski
2014-07-29 18:44                   ` Oleg Nesterov
2014-07-29 18:44                     ` Oleg Nesterov
2014-07-29 18:44                     ` Oleg Nesterov
2014-07-22  1:53 ` [PATCH v3 7/8] x86_64,entry: Treat regs->ax the same in fastpath and slowpath syscalls Andy Lutomirski
2014-07-22  1:53   ` [PATCH v3 7/8] x86_64, entry: " Andy Lutomirski
2014-07-22  1:53 ` [PATCH v3 8/8] x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls Andy Lutomirski
2014-07-22  1:53   ` [PATCH v3 8/8] x86_64, entry: " Andy Lutomirski
2014-07-22 19:37 ` [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes Kees Cook
2014-07-22 19:37   ` Kees Cook
2014-07-22 19:37   ` Kees Cook
2014-07-23 19:20   ` Andy Lutomirski
2014-07-23 19:20     ` Andy Lutomirski
2014-07-23 19:20     ` Andy Lutomirski
2014-07-28 17:59     ` H. Peter Anvin
2014-07-28 17:59       ` H. Peter Anvin
2014-07-28 17:59       ` H. Peter Anvin
2014-07-28 23:29       ` Kees Cook
2014-07-28 23:29         ` Kees Cook
2014-07-28 23:29         ` Kees Cook
2014-07-28 23:34         ` H. Peter Anvin
2014-07-28 23:34           ` H. Peter Anvin
2014-07-28 23:34           ` H. Peter Anvin
2014-07-28 23:42           ` Kees Cook
2014-07-28 23:42             ` Kees Cook
2014-07-28 23:42             ` Kees Cook
2014-07-28 23:45             ` H. Peter Anvin
2014-07-28 23:45               ` H. Peter Anvin
2014-07-28 23:45               ` H. Peter Anvin
2014-07-28 23:54               ` Kees Cook
2014-07-28 23:54                 ` Kees Cook
2014-07-28 23:54                 ` Kees Cook
