linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
@ 2014-10-03 17:18 Andy Lutomirski
  2014-10-03 17:27 ` Andy Lutomirski
       [not found] ` <20141003174141.GR2342@redhat.com>
  0 siblings, 2 replies; 18+ messages in thread
From: Andy Lutomirski @ 2014-10-03 17:18 UTC (permalink / raw)
  To: linux-kernel, Peter Zijlstra, Ingo Molnar, Kees Cook,
	Andrea Arcangeli, Erik Bosman, hpa
  Cc: Paul Mackerras, Arnaldo Carvalho de Melo, x86, Andy Lutomirski

PR_SET_TSC / PR_TSC_SIGSEGV is a security feature to prevent heavily
sandboxed programs from learning the time, presumably to avoid
disclosing the wall clock and to make timing attacks much harder to
exploit.

Unfortunately, this feature is very insecure, for multiple reasons,
and has probably been insecure since before it was written.

Weakness 1: Before Linux 3.16, the vvar page and the HPET (!) were
part of the kernel's fixmap, so any user process could read them.
The vvar page contains low-resolution timing information (with real
wall clock and frequency data), and the HPET can be used for high
precision timing.  Even in Linux 3.16, there clean way to disable
access to these pages.

Weakness 2: On most configurations, most or all userspace processes
have unrestricted access to RDPMC, which is even better than RDTSC
for exploiting timing attacks.

I would like to fix both of these issues.  I want to deny access to
RDPMC to processes that haven't asked for access via
perf_event_open.  I also want to implement real TSC blocking, which
will require some vdso enhancements

The problem is that both of these fixes will be per-mm, not per
task.  So PR_SET_TSC will be barely supportable.

Therefore, I'm proposing the radical solution of ripping out the old
ABI to make room for the new.

Enabling strict seccomp mode no longer disables the TSC.

PR_GET_TSC still works and returns PR_TSC_ENABLED.  PR_SET_TSC now
return -EINVAL if you try to set PR_TSC_SIGSEGV.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 .../prctl/disable-tsc-ctxt-sw-stress-test.c        | 96 ----------------------
 .../prctl/disable-tsc-on-off-stress-test.c         | 95 ---------------------
 Documentation/prctl/disable-tsc-test.c             | 94 ---------------------
 arch/x86/include/asm/processor.h                   |  7 --
 arch/x86/include/asm/thread_info.h                 |  4 +-
 arch/x86/include/asm/tsc.h                         |  2 -
 arch/x86/kernel/process.c                          | 67 ---------------
 include/uapi/linux/prctl.h                         |  7 +-
 kernel/seccomp.c                                   |  3 -
 kernel/sys.c                                       | 12 +--
 10 files changed, 14 insertions(+), 373 deletions(-)
 delete mode 100644 Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
 delete mode 100644 Documentation/prctl/disable-tsc-on-off-stress-test.c
 delete mode 100644 Documentation/prctl/disable-tsc-test.c

diff --git a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c b/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
deleted file mode 100644
index f8e8e95e81fd..000000000000
--- a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
+++ /dev/null
@@ -1,96 +0,0 @@
-/*
- * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
- *
- * Tests if the control register is updated correctly
- * at context switches
- *
- * Warning: this test will cause a very high load for a few seconds
- *
- */
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-#include <signal.h>
-#include <inttypes.h>
-#include <wait.h>
-
-
-#include <sys/prctl.h>
-#include <linux/prctl.h>
-
-/* Get/set the process' ability to use the timestamp counter instruction */
-#ifndef PR_GET_TSC
-#define PR_GET_TSC 25
-#define PR_SET_TSC 26
-# define PR_TSC_ENABLE		1   /* allow the use of the timestamp counter */
-# define PR_TSC_SIGSEGV		2   /* throw a SIGSEGV instead of reading the TSC */
-#endif
-
-uint64_t rdtsc() {
-uint32_t lo, hi;
-/* We cannot use "=A", since this would use %rax on x86_64 */
-__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
-return (uint64_t)hi << 32 | lo;
-}
-
-void sigsegv_expect(int sig)
-{
-	/* */
-}
-
-void segvtask(void)
-{
-	if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0)
-	{
-		perror("prctl");
-		exit(0);
-	}
-	signal(SIGSEGV, sigsegv_expect);
-	alarm(10);
-	rdtsc();
-	fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n");
-	exit(0);
-}
-
-
-void sigsegv_fail(int sig)
-{
-	fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n");
-	exit(0);
-}
-
-void rdtsctask(void)
-{
-	if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0)
-	{
-		perror("prctl");
-		exit(0);
-	}
-	signal(SIGSEGV, sigsegv_fail);
-	alarm(10);
-	for(;;) rdtsc();
-}
-
-
-int main(int argc, char **argv)
-{
-	int n_tasks = 100, i;
-
-	fprintf(stderr, "[No further output means we're allright]\n");
-
-	for (i=0; i<n_tasks; i++)
-		if (fork() == 0)
-		{
-			if (i & 1)
-				segvtask();
-			else
-				rdtsctask();
-		}
-
-	for (i=0; i<n_tasks; i++)
-		wait(NULL);
-
-	exit(0);
-}
-
diff --git a/Documentation/prctl/disable-tsc-on-off-stress-test.c b/Documentation/prctl/disable-tsc-on-off-stress-test.c
deleted file mode 100644
index 1fcd91445375..000000000000
--- a/Documentation/prctl/disable-tsc-on-off-stress-test.c
+++ /dev/null
@@ -1,95 +0,0 @@
-/*
- * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
- *
- * Tests if the control register is updated correctly
- * when set with prctl()
- *
- * Warning: this test will cause a very high load for a few seconds
- *
- */
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-#include <signal.h>
-#include <inttypes.h>
-#include <wait.h>
-
-
-#include <sys/prctl.h>
-#include <linux/prctl.h>
-
-/* Get/set the process' ability to use the timestamp counter instruction */
-#ifndef PR_GET_TSC
-#define PR_GET_TSC 25
-#define PR_SET_TSC 26
-# define PR_TSC_ENABLE		1   /* allow the use of the timestamp counter */
-# define PR_TSC_SIGSEGV		2   /* throw a SIGSEGV instead of reading the TSC */
-#endif
-
-/* snippet from wikipedia :-) */
-
-uint64_t rdtsc() {
-uint32_t lo, hi;
-/* We cannot use "=A", since this would use %rax on x86_64 */
-__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
-return (uint64_t)hi << 32 | lo;
-}
-
-int should_segv = 0;
-
-void sigsegv_cb(int sig)
-{
-	if (!should_segv)
-	{
-		fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n");
-		exit(0);
-	}
-	if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0)
-	{
-		perror("prctl");
-		exit(0);
-	}
-	should_segv = 0;
-
-	rdtsc();
-}
-
-void task(void)
-{
-	signal(SIGSEGV, sigsegv_cb);
-	alarm(10);
-	for(;;)
-	{
-		rdtsc();
-		if (should_segv)
-		{
-			fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n");
-			exit(0);
-		}
-		if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0)
-		{
-			perror("prctl");
-			exit(0);
-		}
-		should_segv = 1;
-	}
-}
-
-
-int main(int argc, char **argv)
-{
-	int n_tasks = 100, i;
-
-	fprintf(stderr, "[No further output means we're allright]\n");
-
-	for (i=0; i<n_tasks; i++)
-		if (fork() == 0)
-			task();
-
-	for (i=0; i<n_tasks; i++)
-		wait(NULL);
-
-	exit(0);
-}
-
diff --git a/Documentation/prctl/disable-tsc-test.c b/Documentation/prctl/disable-tsc-test.c
deleted file mode 100644
index 843c81eac235..000000000000
--- a/Documentation/prctl/disable-tsc-test.c
+++ /dev/null
@@ -1,94 +0,0 @@
-/*
- * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
- *
- * Basic test to test behaviour of PR_GET_TSC and PR_SET_TSC
- */
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-#include <signal.h>
-#include <inttypes.h>
-
-
-#include <sys/prctl.h>
-#include <linux/prctl.h>
-
-/* Get/set the process' ability to use the timestamp counter instruction */
-#ifndef PR_GET_TSC
-#define PR_GET_TSC 25
-#define PR_SET_TSC 26
-# define PR_TSC_ENABLE		1   /* allow the use of the timestamp counter */
-# define PR_TSC_SIGSEGV		2   /* throw a SIGSEGV instead of reading the TSC */
-#endif
-
-const char *tsc_names[] =
-{
-	[0] = "[not set]",
-	[PR_TSC_ENABLE] = "PR_TSC_ENABLE",
-	[PR_TSC_SIGSEGV] = "PR_TSC_SIGSEGV",
-};
-
-uint64_t rdtsc() {
-uint32_t lo, hi;
-/* We cannot use "=A", since this would use %rax on x86_64 */
-__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
-return (uint64_t)hi << 32 | lo;
-}
-
-void sigsegv_cb(int sig)
-{
-	int tsc_val = 0;
-
-	printf("[ SIG_SEGV ]\n");
-	printf("prctl(PR_GET_TSC, &tsc_val); ");
-	fflush(stdout);
-
-	if ( prctl(PR_GET_TSC, &tsc_val) == -1)
-		perror("prctl");
-
-	printf("tsc_val == %s\n", tsc_names[tsc_val]);
-	printf("prctl(PR_SET_TSC, PR_TSC_ENABLE)\n");
-	fflush(stdout);
-	if ( prctl(PR_SET_TSC, PR_TSC_ENABLE) == -1)
-		perror("prctl");
-
-	printf("rdtsc() == ");
-}
-
-int main(int argc, char **argv)
-{
-	int tsc_val = 0;
-
-	signal(SIGSEGV, sigsegv_cb);
-
-	printf("rdtsc() == %llu\n", (unsigned long long)rdtsc());
-	printf("prctl(PR_GET_TSC, &tsc_val); ");
-	fflush(stdout);
-
-	if ( prctl(PR_GET_TSC, &tsc_val) == -1)
-		perror("prctl");
-
-	printf("tsc_val == %s\n", tsc_names[tsc_val]);
-	printf("rdtsc() == %llu\n", (unsigned long long)rdtsc());
-	printf("prctl(PR_SET_TSC, PR_TSC_ENABLE)\n");
-	fflush(stdout);
-
-	if ( prctl(PR_SET_TSC, PR_TSC_ENABLE) == -1)
-		perror("prctl");
-
-	printf("rdtsc() == %llu\n", (unsigned long long)rdtsc());
-	printf("prctl(PR_SET_TSC, PR_TSC_SIGSEGV)\n");
-	fflush(stdout);
-
-	if ( prctl(PR_SET_TSC, PR_TSC_SIGSEGV) == -1)
-		perror("prctl");
-
-	printf("rdtsc() == ");
-	fflush(stdout);
-	printf("%llu\n", (unsigned long long)rdtsc());
-	fflush(stdout);
-
-	exit(EXIT_SUCCESS);
-}
-
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eb71ec794732..d9ed8489bc04 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -946,13 +946,6 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
-/* Get/set a process' ability to use the timestamp counter instruction */
-#define GET_TSC_CTL(adr)	get_tsc_mode((adr))
-#define SET_TSC_CTL(val)	set_tsc_mode((val))
-
-extern int get_tsc_mode(unsigned long adr);
-extern int set_tsc_mode(unsigned int val);
-
 extern u16 amd_get_nb_id(int cpu);
 
 static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 854053889d4d..ac9ed8b13aa8 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -78,7 +78,6 @@ struct thread_info {
 #define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
-#define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_FORK		18	/* ret_from_fork */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
@@ -103,7 +102,6 @@ struct thread_info {
 #define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
-#define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
@@ -145,7 +143,7 @@ struct thread_info {
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW							\
-	(_TIF_IO_BITMAP|_TIF_NOTSC|_TIF_BLOCKSTEP)
+	(_TIF_IO_BITMAP|_TIF_BLOCKSTEP)
 
 #define _TIF_WORK_CTXSW_PREV (_TIF_WORK_CTXSW|_TIF_USER_RETURN_NOTIFY)
 #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW)
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94605c0e9cee..dc8fefa8b672 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -17,8 +17,6 @@ typedef unsigned long long cycles_t;
 extern unsigned int cpu_khz;
 extern unsigned int tsc_khz;
 
-extern void disable_TSC(void);
-
 static inline cycles_t get_cycles(void)
 {
 	unsigned long long ret = 0;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f804dc935d2a..73e6a57e24d9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -137,64 +137,6 @@ void flush_thread(void)
 		free_thread_xstate(tsk);
 }
 
-static void hard_disable_TSC(void)
-{
-	write_cr4(read_cr4() | X86_CR4_TSD);
-}
-
-void disable_TSC(void)
-{
-	preempt_disable();
-	if (!test_and_set_thread_flag(TIF_NOTSC))
-		/*
-		 * Must flip the CPU state synchronously with
-		 * TIF_NOTSC in the current running context.
-		 */
-		hard_disable_TSC();
-	preempt_enable();
-}
-
-static void hard_enable_TSC(void)
-{
-	write_cr4(read_cr4() & ~X86_CR4_TSD);
-}
-
-static void enable_TSC(void)
-{
-	preempt_disable();
-	if (test_and_clear_thread_flag(TIF_NOTSC))
-		/*
-		 * Must flip the CPU state synchronously with
-		 * TIF_NOTSC in the current running context.
-		 */
-		hard_enable_TSC();
-	preempt_enable();
-}
-
-int get_tsc_mode(unsigned long adr)
-{
-	unsigned int val;
-
-	if (test_thread_flag(TIF_NOTSC))
-		val = PR_TSC_SIGSEGV;
-	else
-		val = PR_TSC_ENABLE;
-
-	return put_user(val, (unsigned int __user *)adr);
-}
-
-int set_tsc_mode(unsigned int val)
-{
-	if (val == PR_TSC_SIGSEGV)
-		disable_TSC();
-	else if (val == PR_TSC_ENABLE)
-		enable_TSC();
-	else
-		return -EINVAL;
-
-	return 0;
-}
-
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss)
 {
@@ -214,15 +156,6 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		update_debugctlmsr(debugctl);
 	}
 
-	if (test_tsk_thread_flag(prev_p, TIF_NOTSC) ^
-	    test_tsk_thread_flag(next_p, TIF_NOTSC)) {
-		/* prev and next are different */
-		if (test_tsk_thread_flag(next_p, TIF_NOTSC))
-			hard_disable_TSC();
-		else
-			hard_enable_TSC();
-	}
-
 	if (test_tsk_thread_flag(next_p, TIF_IO_BITMAP)) {
 		/*
 		 * Copy the relevant range of the IO bitmap.
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 58afc04c107e..9a2fea65c965 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -68,7 +68,12 @@
 #define PR_CAPBSET_READ 23
 #define PR_CAPBSET_DROP 24
 
-/* Get/set the process' ability to use the timestamp counter instruction */
+/*
+ * Get/set the process' ability to use the timestamp counter instruction
+ * Deprecated: PR_GET_TSC always reports PR_TSC_ENABLE, and PR_SET_TSC can
+ * only be used to set PR_TSC_ENABLE.  On non-x86 systems, neither of these
+ * prctls exists at all.
+ */
 #define PR_GET_TSC 25
 #define PR_SET_TSC 26
 # define PR_TSC_ENABLE		1	/* allow the use of the timestamp counter */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 44eb005c6695..bdc60dc68e56 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -678,9 +678,6 @@ static long seccomp_set_mode_strict(void)
 	if (!seccomp_may_assign_mode(seccomp_mode))
 		goto out;
 
-#ifdef TIF_NOTSC
-	disable_TSC();
-#endif
 	seccomp_assign_mode(current, seccomp_mode);
 	ret = 0;
 
diff --git a/kernel/sys.c b/kernel/sys.c
index ce8129192a26..5dd1dedb7bf9 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -88,9 +88,6 @@
 #ifndef GET_TSC_CTL
 # define GET_TSC_CTL(a)		(-EINVAL)
 #endif
-#ifndef SET_TSC_CTL
-# define SET_TSC_CTL(a)		(-EINVAL)
-#endif
 
 /*
  * this is where the system-wide overflow UID and GID are defined, for
@@ -1917,12 +1914,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_SET_SECCOMP:
 		error = prctl_set_seccomp(arg2, (char __user *)arg3);
 		break;
+#ifdef CONFIG_X86
 	case PR_GET_TSC:
-		error = GET_TSC_CTL(arg2);
+		error = put_user(PR_TSC_ENABLE, (int __user *)arg2);
 		break;
 	case PR_SET_TSC:
-		error = SET_TSC_CTL(arg2);
+		if (arg2 == PR_TSC_ENABLE)
+			error = 0;
+		else
+			error = -EINVAL;
 		break;
+#endif
 	case PR_TASK_PERF_EVENTS_DISABLE:
 		error = perf_event_task_disable();
 		break;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 17:18 [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering Andy Lutomirski
@ 2014-10-03 17:27 ` Andy Lutomirski
  2014-10-03 20:14   ` Peter Zijlstra
       [not found] ` <20141003174141.GR2342@redhat.com>
  1 sibling, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2014-10-03 17:27 UTC (permalink / raw)
  To: linux-kernel, Peter Zijlstra, Ingo Molnar, Kees Cook,
	Andrea Arcangeli, Erik Bosman, H. Peter Anvin, Linux API,
	Michael Kerrisk-manpages
  Cc: Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML, Andy Lutomirski

[adding linux-api.  whoops.]

On Fri, Oct 3, 2014 at 10:18 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> PR_SET_TSC / PR_TSC_SIGSEGV is a security feature to prevent heavily
> sandboxed programs from learning the time, presumably to avoid
> disclosing the wall clock and to make timing attacks much harder to
> exploit.
>
> Unfortunately, this feature is very insecure, for multiple reasons,
> and has probably been insecure since before it was written.
>
> Weakness 1: Before Linux 3.16, the vvar page and the HPET (!) were
> part of the kernel's fixmap, so any user process could read them.
> The vvar page contains low-resolution timing information (with real
> wall clock and frequency data), and the HPET can be used for high
> precision timing.  Even in Linux 3.16, there clean way to disable
> access to these pages.
>
> Weakness 2: On most configurations, most or all userspace processes
> have unrestricted access to RDPMC, which is even better than RDTSC
> for exploiting timing attacks.
>
> I would like to fix both of these issues.  I want to deny access to
> RDPMC to processes that haven't asked for access via
> perf_event_open.  I also want to implement real TSC blocking, which
> will require some vdso enhancements
>
> The problem is that both of these fixes will be per-mm, not per
> task.  So PR_SET_TSC will be barely supportable.
>
> Therefore, I'm proposing the radical solution of ripping out the old
> ABI to make room for the new.
>
> Enabling strict seccomp mode no longer disables the TSC.
>
> PR_GET_TSC still works and returns PR_TSC_ENABLED.  PR_SET_TSC now
> return -EINVAL if you try to set PR_TSC_SIGSEGV.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  .../prctl/disable-tsc-ctxt-sw-stress-test.c        | 96 ----------------------
>  .../prctl/disable-tsc-on-off-stress-test.c         | 95 ---------------------
>  Documentation/prctl/disable-tsc-test.c             | 94 ---------------------
>  arch/x86/include/asm/processor.h                   |  7 --
>  arch/x86/include/asm/thread_info.h                 |  4 +-
>  arch/x86/include/asm/tsc.h                         |  2 -
>  arch/x86/kernel/process.c                          | 67 ---------------
>  include/uapi/linux/prctl.h                         |  7 +-
>  kernel/seccomp.c                                   |  3 -
>  kernel/sys.c                                       | 12 +--
>  10 files changed, 14 insertions(+), 373 deletions(-)
>  delete mode 100644 Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
>  delete mode 100644 Documentation/prctl/disable-tsc-on-off-stress-test.c
>  delete mode 100644 Documentation/prctl/disable-tsc-test.c
>
> diff --git a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c b/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
> deleted file mode 100644
> index f8e8e95e81fd..000000000000
> --- a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
> +++ /dev/null
> @@ -1,96 +0,0 @@
> -/*
> - * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
> - *
> - * Tests if the control register is updated correctly
> - * at context switches
> - *
> - * Warning: this test will cause a very high load for a few seconds
> - *
> - */
> -
> -#include <stdio.h>
> -#include <stdlib.h>
> -#include <unistd.h>
> -#include <signal.h>
> -#include <inttypes.h>
> -#include <wait.h>
> -
> -
> -#include <sys/prctl.h>
> -#include <linux/prctl.h>
> -
> -/* Get/set the process' ability to use the timestamp counter instruction */
> -#ifndef PR_GET_TSC
> -#define PR_GET_TSC 25
> -#define PR_SET_TSC 26
> -# define PR_TSC_ENABLE         1   /* allow the use of the timestamp counter */
> -# define PR_TSC_SIGSEGV                2   /* throw a SIGSEGV instead of reading the TSC */
> -#endif
> -
> -uint64_t rdtsc() {
> -uint32_t lo, hi;
> -/* We cannot use "=A", since this would use %rax on x86_64 */
> -__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
> -return (uint64_t)hi << 32 | lo;
> -}
> -
> -void sigsegv_expect(int sig)
> -{
> -       /* */
> -}
> -
> -void segvtask(void)
> -{
> -       if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0)
> -       {
> -               perror("prctl");
> -               exit(0);
> -       }
> -       signal(SIGSEGV, sigsegv_expect);
> -       alarm(10);
> -       rdtsc();
> -       fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n");
> -       exit(0);
> -}
> -
> -
> -void sigsegv_fail(int sig)
> -{
> -       fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n");
> -       exit(0);
> -}
> -
> -void rdtsctask(void)
> -{
> -       if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0)
> -       {
> -               perror("prctl");
> -               exit(0);
> -       }
> -       signal(SIGSEGV, sigsegv_fail);
> -       alarm(10);
> -       for(;;) rdtsc();
> -}
> -
> -
> -int main(int argc, char **argv)
> -{
> -       int n_tasks = 100, i;
> -
> -       fprintf(stderr, "[No further output means we're allright]\n");
> -
> -       for (i=0; i<n_tasks; i++)
> -               if (fork() == 0)
> -               {
> -                       if (i & 1)
> -                               segvtask();
> -                       else
> -                               rdtsctask();
> -               }
> -
> -       for (i=0; i<n_tasks; i++)
> -               wait(NULL);
> -
> -       exit(0);
> -}
> -
> diff --git a/Documentation/prctl/disable-tsc-on-off-stress-test.c b/Documentation/prctl/disable-tsc-on-off-stress-test.c
> deleted file mode 100644
> index 1fcd91445375..000000000000
> --- a/Documentation/prctl/disable-tsc-on-off-stress-test.c
> +++ /dev/null
> @@ -1,95 +0,0 @@
> -/*
> - * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
> - *
> - * Tests if the control register is updated correctly
> - * when set with prctl()
> - *
> - * Warning: this test will cause a very high load for a few seconds
> - *
> - */
> -
> -#include <stdio.h>
> -#include <stdlib.h>
> -#include <unistd.h>
> -#include <signal.h>
> -#include <inttypes.h>
> -#include <wait.h>
> -
> -
> -#include <sys/prctl.h>
> -#include <linux/prctl.h>
> -
> -/* Get/set the process' ability to use the timestamp counter instruction */
> -#ifndef PR_GET_TSC
> -#define PR_GET_TSC 25
> -#define PR_SET_TSC 26
> -# define PR_TSC_ENABLE         1   /* allow the use of the timestamp counter */
> -# define PR_TSC_SIGSEGV                2   /* throw a SIGSEGV instead of reading the TSC */
> -#endif
> -
> -/* snippet from wikipedia :-) */
> -
> -uint64_t rdtsc() {
> -uint32_t lo, hi;
> -/* We cannot use "=A", since this would use %rax on x86_64 */
> -__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
> -return (uint64_t)hi << 32 | lo;
> -}
> -
> -int should_segv = 0;
> -
> -void sigsegv_cb(int sig)
> -{
> -       if (!should_segv)
> -       {
> -               fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n");
> -               exit(0);
> -       }
> -       if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0)
> -       {
> -               perror("prctl");
> -               exit(0);
> -       }
> -       should_segv = 0;
> -
> -       rdtsc();
> -}
> -
> -void task(void)
> -{
> -       signal(SIGSEGV, sigsegv_cb);
> -       alarm(10);
> -       for(;;)
> -       {
> -               rdtsc();
> -               if (should_segv)
> -               {
> -                       fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n");
> -                       exit(0);
> -               }
> -               if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0)
> -               {
> -                       perror("prctl");
> -                       exit(0);
> -               }
> -               should_segv = 1;
> -       }
> -}
> -
> -
> -int main(int argc, char **argv)
> -{
> -       int n_tasks = 100, i;
> -
> -       fprintf(stderr, "[No further output means we're allright]\n");
> -
> -       for (i=0; i<n_tasks; i++)
> -               if (fork() == 0)
> -                       task();
> -
> -       for (i=0; i<n_tasks; i++)
> -               wait(NULL);
> -
> -       exit(0);
> -}
> -
> diff --git a/Documentation/prctl/disable-tsc-test.c b/Documentation/prctl/disable-tsc-test.c
> deleted file mode 100644
> index 843c81eac235..000000000000
> --- a/Documentation/prctl/disable-tsc-test.c
> +++ /dev/null
> @@ -1,94 +0,0 @@
> -/*
> - * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
> - *
> - * Basic test to test behaviour of PR_GET_TSC and PR_SET_TSC
> - */
> -
> -#include <stdio.h>
> -#include <stdlib.h>
> -#include <unistd.h>
> -#include <signal.h>
> -#include <inttypes.h>
> -
> -
> -#include <sys/prctl.h>
> -#include <linux/prctl.h>
> -
> -/* Get/set the process' ability to use the timestamp counter instruction */
> -#ifndef PR_GET_TSC
> -#define PR_GET_TSC 25
> -#define PR_SET_TSC 26
> -# define PR_TSC_ENABLE         1   /* allow the use of the timestamp counter */
> -# define PR_TSC_SIGSEGV                2   /* throw a SIGSEGV instead of reading the TSC */
> -#endif
> -
> -const char *tsc_names[] =
> -{
> -       [0] = "[not set]",
> -       [PR_TSC_ENABLE] = "PR_TSC_ENABLE",
> -       [PR_TSC_SIGSEGV] = "PR_TSC_SIGSEGV",
> -};
> -
> -uint64_t rdtsc() {
> -uint32_t lo, hi;
> -/* We cannot use "=A", since this would use %rax on x86_64 */
> -__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
> -return (uint64_t)hi << 32 | lo;
> -}
> -
> -void sigsegv_cb(int sig)
> -{
> -       int tsc_val = 0;
> -
> -       printf("[ SIG_SEGV ]\n");
> -       printf("prctl(PR_GET_TSC, &tsc_val); ");
> -       fflush(stdout);
> -
> -       if ( prctl(PR_GET_TSC, &tsc_val) == -1)
> -               perror("prctl");
> -
> -       printf("tsc_val == %s\n", tsc_names[tsc_val]);
> -       printf("prctl(PR_SET_TSC, PR_TSC_ENABLE)\n");
> -       fflush(stdout);
> -       if ( prctl(PR_SET_TSC, PR_TSC_ENABLE) == -1)
> -               perror("prctl");
> -
> -       printf("rdtsc() == ");
> -}
> -
> -int main(int argc, char **argv)
> -{
> -       int tsc_val = 0;
> -
> -       signal(SIGSEGV, sigsegv_cb);
> -
> -       printf("rdtsc() == %llu\n", (unsigned long long)rdtsc());
> -       printf("prctl(PR_GET_TSC, &tsc_val); ");
> -       fflush(stdout);
> -
> -       if ( prctl(PR_GET_TSC, &tsc_val) == -1)
> -               perror("prctl");
> -
> -       printf("tsc_val == %s\n", tsc_names[tsc_val]);
> -       printf("rdtsc() == %llu\n", (unsigned long long)rdtsc());
> -       printf("prctl(PR_SET_TSC, PR_TSC_ENABLE)\n");
> -       fflush(stdout);
> -
> -       if ( prctl(PR_SET_TSC, PR_TSC_ENABLE) == -1)
> -               perror("prctl");
> -
> -       printf("rdtsc() == %llu\n", (unsigned long long)rdtsc());
> -       printf("prctl(PR_SET_TSC, PR_TSC_SIGSEGV)\n");
> -       fflush(stdout);
> -
> -       if ( prctl(PR_SET_TSC, PR_TSC_SIGSEGV) == -1)
> -               perror("prctl");
> -
> -       printf("rdtsc() == ");
> -       fflush(stdout);
> -       printf("%llu\n", (unsigned long long)rdtsc());
> -       fflush(stdout);
> -
> -       exit(EXIT_SUCCESS);
> -}
> -
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index eb71ec794732..d9ed8489bc04 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -946,13 +946,6 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>
>  #define KSTK_EIP(task)         (task_pt_regs(task)->ip)
>
> -/* Get/set a process' ability to use the timestamp counter instruction */
> -#define GET_TSC_CTL(adr)       get_tsc_mode((adr))
> -#define SET_TSC_CTL(val)       set_tsc_mode((val))
> -
> -extern int get_tsc_mode(unsigned long adr);
> -extern int set_tsc_mode(unsigned int val);
> -
>  extern u16 amd_get_nb_id(int cpu);
>
>  static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 854053889d4d..ac9ed8b13aa8 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -78,7 +78,6 @@ struct thread_info {
>  #define TIF_MCE_NOTIFY         10      /* notify userspace of an MCE */
>  #define TIF_USER_RETURN_NOTIFY 11      /* notify kernel of userspace return */
>  #define TIF_UPROBE             12      /* breakpointed or singlestepping */
> -#define TIF_NOTSC              16      /* TSC is not accessible in userland */
>  #define TIF_IA32               17      /* IA32 compatibility process */
>  #define TIF_FORK               18      /* ret_from_fork */
>  #define TIF_NOHZ               19      /* in adaptive nohz mode */
> @@ -103,7 +102,6 @@ struct thread_info {
>  #define _TIF_MCE_NOTIFY                (1 << TIF_MCE_NOTIFY)
>  #define _TIF_USER_RETURN_NOTIFY        (1 << TIF_USER_RETURN_NOTIFY)
>  #define _TIF_UPROBE            (1 << TIF_UPROBE)
> -#define _TIF_NOTSC             (1 << TIF_NOTSC)
>  #define _TIF_IA32              (1 << TIF_IA32)
>  #define _TIF_FORK              (1 << TIF_FORK)
>  #define _TIF_NOHZ              (1 << TIF_NOHZ)
> @@ -145,7 +143,7 @@ struct thread_info {
>
>  /* flags to check in __switch_to() */
>  #define _TIF_WORK_CTXSW                                                        \
> -       (_TIF_IO_BITMAP|_TIF_NOTSC|_TIF_BLOCKSTEP)
> +       (_TIF_IO_BITMAP|_TIF_BLOCKSTEP)
>
>  #define _TIF_WORK_CTXSW_PREV (_TIF_WORK_CTXSW|_TIF_USER_RETURN_NOTIFY)
>  #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW)
> diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
> index 94605c0e9cee..dc8fefa8b672 100644
> --- a/arch/x86/include/asm/tsc.h
> +++ b/arch/x86/include/asm/tsc.h
> @@ -17,8 +17,6 @@ typedef unsigned long long cycles_t;
>  extern unsigned int cpu_khz;
>  extern unsigned int tsc_khz;
>
> -extern void disable_TSC(void);
> -
>  static inline cycles_t get_cycles(void)
>  {
>         unsigned long long ret = 0;
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index f804dc935d2a..73e6a57e24d9 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -137,64 +137,6 @@ void flush_thread(void)
>                 free_thread_xstate(tsk);
>  }
>
> -static void hard_disable_TSC(void)
> -{
> -       write_cr4(read_cr4() | X86_CR4_TSD);
> -}
> -
> -void disable_TSC(void)
> -{
> -       preempt_disable();
> -       if (!test_and_set_thread_flag(TIF_NOTSC))
> -               /*
> -                * Must flip the CPU state synchronously with
> -                * TIF_NOTSC in the current running context.
> -                */
> -               hard_disable_TSC();
> -       preempt_enable();
> -}
> -
> -static void hard_enable_TSC(void)
> -{
> -       write_cr4(read_cr4() & ~X86_CR4_TSD);
> -}
> -
> -static void enable_TSC(void)
> -{
> -       preempt_disable();
> -       if (test_and_clear_thread_flag(TIF_NOTSC))
> -               /*
> -                * Must flip the CPU state synchronously with
> -                * TIF_NOTSC in the current running context.
> -                */
> -               hard_enable_TSC();
> -       preempt_enable();
> -}
> -
> -int get_tsc_mode(unsigned long adr)
> -{
> -       unsigned int val;
> -
> -       if (test_thread_flag(TIF_NOTSC))
> -               val = PR_TSC_SIGSEGV;
> -       else
> -               val = PR_TSC_ENABLE;
> -
> -       return put_user(val, (unsigned int __user *)adr);
> -}
> -
> -int set_tsc_mode(unsigned int val)
> -{
> -       if (val == PR_TSC_SIGSEGV)
> -               disable_TSC();
> -       else if (val == PR_TSC_ENABLE)
> -               enable_TSC();
> -       else
> -               return -EINVAL;
> -
> -       return 0;
> -}
> -
>  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>                       struct tss_struct *tss)
>  {
> @@ -214,15 +156,6 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>                 update_debugctlmsr(debugctl);
>         }
>
> -       if (test_tsk_thread_flag(prev_p, TIF_NOTSC) ^
> -           test_tsk_thread_flag(next_p, TIF_NOTSC)) {
> -               /* prev and next are different */
> -               if (test_tsk_thread_flag(next_p, TIF_NOTSC))
> -                       hard_disable_TSC();
> -               else
> -                       hard_enable_TSC();
> -       }
> -
>         if (test_tsk_thread_flag(next_p, TIF_IO_BITMAP)) {
>                 /*
>                  * Copy the relevant range of the IO bitmap.
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 58afc04c107e..9a2fea65c965 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -68,7 +68,12 @@
>  #define PR_CAPBSET_READ 23
>  #define PR_CAPBSET_DROP 24
>
> -/* Get/set the process' ability to use the timestamp counter instruction */
> +/*
> + * Get/set the process' ability to use the timestamp counter instruction
> + * Deprecated: PR_GET_TSC always reports PR_TSC_ENABLE, and PR_SET_TSC can
> + * only be used to set PR_TSC_ENABLE.  On non-x86 systems, neither of these
> + * prctls exists at all.
> + */
>  #define PR_GET_TSC 25
>  #define PR_SET_TSC 26
>  # define PR_TSC_ENABLE         1       /* allow the use of the timestamp counter */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 44eb005c6695..bdc60dc68e56 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -678,9 +678,6 @@ static long seccomp_set_mode_strict(void)
>         if (!seccomp_may_assign_mode(seccomp_mode))
>                 goto out;
>
> -#ifdef TIF_NOTSC
> -       disable_TSC();
> -#endif
>         seccomp_assign_mode(current, seccomp_mode);
>         ret = 0;
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index ce8129192a26..5dd1dedb7bf9 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -88,9 +88,6 @@
>  #ifndef GET_TSC_CTL
>  # define GET_TSC_CTL(a)                (-EINVAL)
>  #endif
> -#ifndef SET_TSC_CTL
> -# define SET_TSC_CTL(a)                (-EINVAL)
> -#endif
>
>  /*
>   * this is where the system-wide overflow UID and GID are defined, for
> @@ -1917,12 +1914,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>         case PR_SET_SECCOMP:
>                 error = prctl_set_seccomp(arg2, (char __user *)arg3);
>                 break;
> +#ifdef CONFIG_X86
>         case PR_GET_TSC:
> -               error = GET_TSC_CTL(arg2);
> +               error = put_user(PR_TSC_ENABLE, (int __user *)arg2);
>                 break;
>         case PR_SET_TSC:
> -               error = SET_TSC_CTL(arg2);
> +               if (arg2 == PR_TSC_ENABLE)
> +                       error = 0;
> +               else
> +                       error = -EINVAL;
>                 break;
> +#endif
>         case PR_TASK_PERF_EVENTS_DISABLE:
>                 error = perf_event_task_disable();
>                 break;
> --
> 1.9.3
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
       [not found] ` <20141003174141.GR2342@redhat.com>
@ 2014-10-03 17:59   ` Andy Lutomirski
  2014-10-03 20:15     ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2014-10-03 17:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML, linux-kernel,
	Peter Zijlstra, Ingo Molnar, Kees Cook, Erik Bosman,
	H. Peter Anvin, Linux API, Michael Kerrisk-manpages

[cc's re-added]

On Fri, Oct 3, 2014 at 10:41 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi,
>
> On Fri, Oct 03, 2014 at 10:18:14AM -0700, Andy Lutomirski wrote:
>> Weakness 2: On most configurations, most or all userspace processes
>> have unrestricted access to RDPMC, which is even better than RDTSC
>> for exploiting timing attacks.
>
> "User access of the RDPMC instruction is not guaranteed. Like RDTSC,
> user access is controlled by a bit in CR4. CR4.PCE (bit-8) controls
> whether or not a user program can execute the RDPMC instruction
> without faulting"
>
> I don't think there's was a seccomp leak of RDPMC because of this, the
> rdtsc and rdpmc seems to be linked to the same cr4 tweak.

RDPMC is controlled by CR4.PCE, whereas RDTSC is controlled by
CR4.TSD, so, unless there's magic that I'm unaware of, seccomp never
blocked RDPMC access.  I don't know whether circa-2008 kernels set PCE
(or perhaps left it set if BIOS set it) at boot, but certainly almost
everyone on any kernel for the last few years has RDPMC enabled in
ring 3.

>
> The vsyscall data was leaked right, but you can't compare the
> two. Sure it's better to block that too but it's not comparable to
> give tsc access to apps running under seccomp.
>
> The time of the day isn't secret either (ok it could be an issue if
> you intend to run the system on some secret time in the past or future
> but this sounds not a practical issue).
>
> What's not public info and should never be leaked to seccomp
> sandboxes, is the tsc at that kind of cycle count granularity (and the
> various gettimeofday variants with nanosecond granularity). I thought
> RDPMC was blocked too with the same CR4 tweak... if that wasn't the
> case and you could get tsc granular information into a seccomp
> sandbox, that's not ok because it allows for covert channel attacks.

The HPET very fine granularity.  It's slow to access, but that isn't
necessarily a problem for attackers.

I agree that this is problematic, and I want to fix it.  The trouble
is that I'm not sure it's fixable in a sane manner with the current
semantics.  CR4.PCE, in particular, will need to be a function of both
per-thread state (PR_TSC_SIGSEGV) and per-mm state (whether perf_event
self-monitoring is on).  Getting the context switching logic correct
without hurting the common case too much will be quite complicated.

On top of this, supporting something like PR_TSC_SIGSEGV per-thread
using seccomp mode 2 should really be done by redirecting vdso-based
timing to use syscalls, and that's fundamentally per mm.

Hence my proposal of removing the current insecure model so that
adding a secure variant will be straightforward, rather than trying to
shoehorn a fix on top of the current ABI.

The eventual fix could still disable the TSC in a strict seccomp task
by default if the task is single threaded.  Yes, that won't quite
cover old Chromium-style sandboxes, but those are rapidly being
replaced by new designs using seccomp mode 2 anyway.

Also, keep in mind that multithreaded attackers can exploit timing
attacks without hardware help at all: one thread can just run a timing
loop.

--Andy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 17:27 ` Andy Lutomirski
@ 2014-10-03 20:14   ` Peter Zijlstra
  2014-10-03 20:22     ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2014-10-03 20:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 03, 2014 at 10:27:47AM -0700, Andy Lutomirski wrote:
> [adding linux-api.  whoops.]
> 
> On Fri, Oct 3, 2014 at 10:18 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> > PR_SET_TSC / PR_TSC_SIGSEGV is a security feature to prevent heavily
> > sandboxed programs from learning the time, presumably to avoid
> > disclosing the wall clock and to make timing attacks much harder to
> > exploit.
> >
> > Unfortunately, this feature is very insecure, for multiple reasons,
> > and has probably been insecure since before it was written.
> >
> > Weakness 1: Before Linux 3.16, the vvar page and the HPET (!) were
> > part of the kernel's fixmap, so any user process could read them.
> > The vvar page contains low-resolution timing information (with real
> > wall clock and frequency data), and the HPET can be used for high
> > precision timing.  Even in Linux 3.16, there clean way to disable
> > access to these pages.
> >
> > Weakness 2: On most configurations, most or all userspace processes
> > have unrestricted access to RDPMC, which is even better than RDTSC
> > for exploiting timing attacks.
> >
> > I would like to fix both of these issues.  I want to deny access to
> > RDPMC to processes that haven't asked for access via
> > perf_event_open.  I also want to implement real TSC blocking, which
> > will require some vdso enhancements

So the problem with the default deny is that its:
 1) pointless -- the attacker can do sys_perf_event_open() just fine;
 2) and expensive -- the people trying to measure performance get the
    penalty of the CR4 write.

So I would suggest a default on, but allow a disable for the seccomp
users, which might have also disabled the syscall. Note that is is
possible to disable RDPMC while still allowing the syscall.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 17:59   ` Andy Lutomirski
@ 2014-10-03 20:15     ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-10-03 20:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrea Arcangeli, Paul Mackerras, Arnaldo Carvalho de Melo,
	X86 ML, linux-kernel, Ingo Molnar, Kees Cook, Erik Bosman,
	H. Peter Anvin, Linux API, Michael Kerrisk-manpages

On Fri, Oct 03, 2014 at 10:59:15AM -0700, Andy Lutomirski wrote:
> RDPMC is controlled by CR4.PCE, whereas RDTSC is controlled by
> CR4.TSD,

This is my understanding too, they're separate things.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 20:14   ` Peter Zijlstra
@ 2014-10-03 20:22     ` Andy Lutomirski
  2014-10-03 20:27       ` Andy Lutomirski
  2014-10-03 20:42       ` Peter Zijlstra
  0 siblings, 2 replies; 18+ messages in thread
From: Andy Lutomirski @ 2014-10-03 20:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 3, 2014 at 1:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 03, 2014 at 10:27:47AM -0700, Andy Lutomirski wrote:
>> [adding linux-api.  whoops.]
>>
>> On Fri, Oct 3, 2014 at 10:18 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > PR_SET_TSC / PR_TSC_SIGSEGV is a security feature to prevent heavily
>> > sandboxed programs from learning the time, presumably to avoid
>> > disclosing the wall clock and to make timing attacks much harder to
>> > exploit.
>> >
>> > Unfortunately, this feature is very insecure, for multiple reasons,
>> > and has probably been insecure since before it was written.
>> >
>> > Weakness 1: Before Linux 3.16, the vvar page and the HPET (!) were
>> > part of the kernel's fixmap, so any user process could read them.
>> > The vvar page contains low-resolution timing information (with real
>> > wall clock and frequency data), and the HPET can be used for high
>> > precision timing.  Even in Linux 3.16, there clean way to disable
>> > access to these pages.
>> >
>> > Weakness 2: On most configurations, most or all userspace processes
>> > have unrestricted access to RDPMC, which is even better than RDTSC
>> > for exploiting timing attacks.
>> >
>> > I would like to fix both of these issues.  I want to deny access to
>> > RDPMC to processes that haven't asked for access via
>> > perf_event_open.  I also want to implement real TSC blocking, which
>> > will require some vdso enhancements
>
> So the problem with the default deny is that its:
>  1) pointless -- the attacker can do sys_perf_event_open() just fine;

Not if the attacker is in a seccomp sandbox.

>  2) and expensive -- the people trying to measure performance get the
>     penalty of the CR4 write.

Does this matter for performance measuring?  I'm not 100% clear on how all
the perf_event stuff gets used in practice, but, by my very vague
understanding, there are two main workflows:

a) perf record, etc: one process creates a ringbuffer and wakes up
rarely to record the contents.  The process being recorded doesn't
have a perf_event mapped, so the cr4 switch will only happen when
waking up the perf process.

perf record prints stuff like "[ perf record: Woken up 1 times to
write data ]", which seems to confirm my understanding.

b) self-monitoring.  A task mmaps a perf_event, does rdpmc, does
something, and does rdpmc again.  In that case, there's no context
switch.

>
> So I would suggest a default on, but allow a disable for the seccomp
> users, which might have also disabled the syscall. Note that is is
> possible to disable RDPMC while still allowing the syscall.

Disabling RDPMC per-process while still allowing the syscall will need
a bunch of work, right?  What happens if the same perf_event is mapped
by two different users?

We could make the rule be that RDPMC is enabled if a perf event is
mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
there's an actual performance issue first.  Ideally we can get this
all working with no API or ABI change at all.

P.S. Hey, Intel, let us context switch RDPMC accessibility of the
individual counters, please :)

--Andy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 20:22     ` Andy Lutomirski
@ 2014-10-03 20:27       ` Andy Lutomirski
  2014-10-03 20:44         ` Peter Zijlstra
  2014-10-03 20:42       ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2014-10-03 20:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 3, 2014 at 1:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> We could make the rule be that RDPMC is enabled if a perf event is
> mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
> there's an actual performance issue first.  Ideally we can get this
> all working with no API or ABI change at all.

No, we can't use that rule.  But we could say that RDPMC is enabled if
a perf event is mmapped and no thread in the mm uses seccomp.  I'll
grumble a little bit about adding yet another piece of seccomp state.

--Andy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 20:22     ` Andy Lutomirski
  2014-10-03 20:27       ` Andy Lutomirski
@ 2014-10-03 20:42       ` Peter Zijlstra
  2014-10-03 20:53         ` Andy Lutomirski
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2014-10-03 20:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 03, 2014 at 01:22:22PM -0700, Andy Lutomirski wrote:
> > So the problem with the default deny is that its:
> >  1) pointless -- the attacker can do sys_perf_event_open() just fine;
> 
> Not if the attacker is in a seccomp sandbox.

Clearly :-)

> >  2) and expensive -- the people trying to measure performance get the
> >     penalty of the CR4 write.
> 
> Does this matter for performance measuring?  I'm not 100% clear on how all
> the perf_event stuff gets used in practice, but, by my very vague
> understanding, there are two main workflows:
> 
> a) perf record, etc: one process creates a ringbuffer and wakes up
> rarely to record the contents.  The process being recorded doesn't
> have a perf_event mapped, so the cr4 switch will only happen when
> waking up the perf process.
> 
> perf record prints stuff like "[ perf record: Woken up 1 times to
> write data ]", which seems to confirm my understanding.
> 
> b) self-monitoring.  A task mmaps a perf_event, does rdpmc, does
> something, and does rdpmc again.  In that case, there's no context
> switch.

Agreed, but with the default disable, the self-monitoring task will get
CR4 writes to its context switches and will be slower.

Whereas if we default on and only disable with seccomp then only the
seccomp tasks will suffer the burden of a CR4 write at context switch
time.

And I don't see a security implication in the default.

> > So I would suggest a default on, but allow a disable for the seccomp
> > users, which might have also disabled the syscall. Note that is is
> > possible to disable RDPMC while still allowing the syscall.
> 
> Disabling RDPMC per-process while still allowing the syscall will need
> a bunch of work, right? 

Not much, all you need to do is make
perf_event_mmmap_page::cap_user_rdpmc be 0 and userspace should never
use rdpmc. Currently we set that bit based on the global sysfs rdpmc
value, but we could easy bitwise AND it with a per process value.

> What happens if the same perf_event is mapped
> by two different users?

Not entirely clear on what you mean here.

> We could make the rule be that RDPMC is enabled if a perf event is
> mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
> there's an actual performance issue first.  Ideally we can get this
> all working with no API or ABI change at all.

Well I just want to not have extra CR4 writes by default (and by default
I don't care about seccomp).

> P.S. Hey, Intel, let us context switch RDPMC accessibility of the
> individual counters, please :)

Ha sure, make it more complicated why don't you ;-) But yes, that has
come up before.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 20:27       ` Andy Lutomirski
@ 2014-10-03 20:44         ` Peter Zijlstra
  2014-10-03 20:46           ` Andy Lutomirski
  2014-10-03 21:02           ` Peter Zijlstra
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-10-03 20:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 03, 2014 at 01:27:52PM -0700, Andy Lutomirski wrote:
> On Fri, Oct 3, 2014 at 1:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > We could make the rule be that RDPMC is enabled if a perf event is
> > mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
> > there's an actual performance issue first.  Ideally we can get this
> > all working with no API or ABI change at all.
> 
> No, we can't use that rule.  But we could say that RDPMC is enabled if
> a perf event is mmapped and no thread in the mm uses seccomp.  I'll
> grumble a little bit about adding yet another piece of seccomp state.

Well, we could simply disable the RDPMC for everything TIF_SECCOMP.
Should be fairly straight fwd.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 20:44         ` Peter Zijlstra
@ 2014-10-03 20:46           ` Andy Lutomirski
  2014-10-03 21:02           ` Peter Zijlstra
  1 sibling, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2014-10-03 20:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 3, 2014 at 1:44 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 03, 2014 at 01:27:52PM -0700, Andy Lutomirski wrote:
>> On Fri, Oct 3, 2014 at 1:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >
>> > We could make the rule be that RDPMC is enabled if a perf event is
>> > mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
>> > there's an actual performance issue first.  Ideally we can get this
>> > all working with no API or ABI change at all.
>>
>> No, we can't use that rule.  But we could say that RDPMC is enabled if
>> a perf event is mmapped and no thread in the mm uses seccomp.  I'll
>> grumble a little bit about adding yet another piece of seccomp state.
>
> Well, we could simply disable the RDPMC for everything TIF_SECCOMP.
> Should be fairly straight fwd.

That won't work.  I bet there are plenty of existing users of fairly
wide-open seccomp sandboxes that allow perf_event in.

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 20:42       ` Peter Zijlstra
@ 2014-10-03 20:53         ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2014-10-03 20:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 3, 2014 at 1:42 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 03, 2014 at 01:22:22PM -0700, Andy Lutomirski wrote:
>> > So the problem with the default deny is that its:
>> >  1) pointless -- the attacker can do sys_perf_event_open() just fine;
>>
>> Not if the attacker is in a seccomp sandbox.
>
> Clearly :-)
>
>> >  2) and expensive -- the people trying to measure performance get the
>> >     penalty of the CR4 write.
>>
>> Does this matter for performance measuring?  I'm not 100% clear on how all
>> the perf_event stuff gets used in practice, but, by my very vague
>> understanding, there are two main workflows:
>>
>> a) perf record, etc: one process creates a ringbuffer and wakes up
>> rarely to record the contents.  The process being recorded doesn't
>> have a perf_event mapped, so the cr4 switch will only happen when
>> waking up the perf process.
>>
>> perf record prints stuff like "[ perf record: Woken up 1 times to
>> write data ]", which seems to confirm my understanding.
>>
>> b) self-monitoring.  A task mmaps a perf_event, does rdpmc, does
>> something, and does rdpmc again.  In that case, there's no context
>> switch.
>
> Agreed, but with the default disable, the self-monitoring task will get
> CR4 writes to its context switches and will be slower.
>
> Whereas if we default on and only disable with seccomp then only the
> seccomp tasks will suffer the burden of a CR4 write at context switch
> time.
>
> And I don't see a security implication in the default.

I don't see a security implication, but we can fight over whether
seccomp performance or perf self monitoring performance is more
important.  I spent a bunch of time making seccomp *much* faster for
3.18, and things like Chromium (the browser) could have lots of
context switched into and out of seccomp. :)

I suspect that the overhead only really matters when running as a VM guest.

Hmm.  Can we switch lazily?  I don't really like the idea, but
non-seccomp, non-perf-event tasks could, in principle, just inherit
the PCE value from whatever had the CPU last.

>
>> > So I would suggest a default on, but allow a disable for the seccomp
>> > users, which might have also disabled the syscall. Note that is is
>> > possible to disable RDPMC while still allowing the syscall.
>>
>> Disabling RDPMC per-process while still allowing the syscall will need
>> a bunch of work, right?
>
> Not much, all you need to do is make
> perf_event_mmmap_page::cap_user_rdpmc be 0 and userspace should never
> use rdpmc. Currently we set that bit based on the global sysfs rdpmc
> value, but we could easy bitwise AND it with a per process value.
>
>> What happens if the same perf_event is mapped
>> by two different users?
>
> Not entirely clear on what you mean here.

Can you open a perf_event fd, fork, and mmap it in the child?  What if
you map it in the parent and the child?  Then cap_user_rdpmc has to
match for both mappings.  Or is this impossible?

>
>> We could make the rule be that RDPMC is enabled if a perf event is
>> mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
>> there's an actual performance issue first.  Ideally we can get this
>> all working with no API or ABI change at all.
>
> Well I just want to not have extra CR4 writes by default (and by default
> I don't care about seccomp).
>
>> P.S. Hey, Intel, let us context switch RDPMC accessibility of the
>> individual counters, please :)
>
> Ha sure, make it more complicated why don't you ;-) But yes, that has
> come up before.

--Andy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 20:44         ` Peter Zijlstra
  2014-10-03 20:46           ` Andy Lutomirski
@ 2014-10-03 21:02           ` Peter Zijlstra
  2014-10-03 21:04             ` Peter Zijlstra
  2014-10-03 21:04             ` Andy Lutomirski
  1 sibling, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-10-03 21:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 03, 2014 at 10:44:43PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 03, 2014 at 01:27:52PM -0700, Andy Lutomirski wrote:
> > On Fri, Oct 3, 2014 at 1:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > >
> > > We could make the rule be that RDPMC is enabled if a perf event is
> > > mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
> > > there's an actual performance issue first.  Ideally we can get this
> > > all working with no API or ABI change at all.
> > 
> > No, we can't use that rule.  But we could say that RDPMC is enabled if
> > a perf event is mmapped and no thread in the mm uses seccomp.  I'll
> > grumble a little bit about adding yet another piece of seccomp state.
> 
> Well, we could simply disable the RDPMC for everything TIF_SECCOMP.
> Should be fairly straight fwd.


Something like so.. slightly less ugly and possibly with more
complicated conditions setting the cr4 if you want to fix tsc vs seccomp
as well.

---
 arch/x86/kernel/cpu/perf_event.c | 13 ++++++++++++-
 arch/x86/kernel/process.c        | 24 +++++++++++++++++-------
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 16c73022306e..cfc42ff5d901 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1869,6 +1869,17 @@ static ssize_t set_attr_rdpmc(struct device *cdev,
 	return count;
 }
 
+void perf_change_rdpmc(bool on, unsigned long *cr4)
+{
+	if (x86_pmu.attr_rdpmc_broken)
+		return;
+
+	if (on)
+		*cr4 |= X86_CR4_PCE;
+	else
+		*cr4 &= ~X86_CR4_PCE;
+}
+
 static DEVICE_ATTR(rdpmc, S_IRUSR | S_IWUSR, get_attr_rdpmc, set_attr_rdpmc);
 
 static struct attribute *x86_pmu_attrs[] = {
@@ -1928,7 +1939,7 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
 
 	userpg->cap_user_time = 0;
 	userpg->cap_user_time_zero = 0;
-	userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
+	userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc && test_thread_flag(TIF_SECCOMP);
 	userpg->pmc_width = x86_pmu.cntval_bits;
 
 	if (!sched_clock_stable())
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e127ddaa2d5a..b74c0400851e 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -201,12 +201,15 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss)
 {
 	struct thread_struct *prev, *next;
+	struct thread_info *pi, *ni;
 
 	prev = &prev_p->thread;
 	next = &next_p->thread;
 
-	if (test_tsk_thread_flag(prev_p, TIF_BLOCKSTEP) ^
-	    test_tsk_thread_flag(next_p, TIF_BLOCKSTEP)) {
+	pi = task_thread_info(prev_p);
+	ni = task_thread_info(next_p);
+
+	if ((pi->flags & _TIF_BLOCKSTEP) ^ (ni->flags & _TIF_BLOCKSTEP)) {
 		unsigned long debugctl = get_debugctlmsr();
 
 		debugctl &= ~DEBUGCTLMSR_BTF;
@@ -216,13 +219,20 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		update_debugctlmsr(debugctl);
 	}
 
-	if (test_tsk_thread_flag(prev_p, TIF_NOTSC) ^
-	    test_tsk_thread_flag(next_p, TIF_NOTSC)) {
+	if ((pi->flags & (_TIF_NOTSC | _TIF_SECCOMP)) ^
+	    (ni->flags & (_TIF_NOTSC | _TIF_SECCOMP))) {
+		extern void perf_change_rdpmc(bool, unsigned long *);
+		unsigned long cr4 = read_cr4();
+
 		/* prev and next are different */
-		if (test_tsk_thread_flag(next_p, TIF_NOTSC))
-			hard_disable_TSC();
+		if (ni->flags & _TIF_NOTSC)
+			cr4 |= X86_CR4_TSD;
 		else
-			hard_enable_TSC();
+			cr4 &= ~X86_CR4_TSD;
+
+		perf_change_rdpmc(!(ni->flags & _TIF_SECCOMP), &cr4);
+
+		write_cr4(cr4);
 	}
 
 	if (test_tsk_thread_flag(next_p, TIF_IO_BITMAP)) {

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 21:02           ` Peter Zijlstra
@ 2014-10-03 21:04             ` Peter Zijlstra
  2014-10-03 21:04             ` Andy Lutomirski
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-10-03 21:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 03, 2014 at 11:02:13PM +0200, Peter Zijlstra wrote:
> @@ -1928,7 +1939,7 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
>  
>  	userpg->cap_user_time = 0;
>  	userpg->cap_user_time_zero = 0;
> -	userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
> +	userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc && test_thread_flag(TIF_SECCOMP);

&& !test_thread_flag(TIF_SECCOMP) would probably work better

>  	userpg->pmc_width = x86_pmu.cntval_bits;
>  
>  	if (!sched_clock_stable())

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 21:02           ` Peter Zijlstra
  2014-10-03 21:04             ` Peter Zijlstra
@ 2014-10-03 21:04             ` Andy Lutomirski
  2014-10-03 21:12               ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2014-10-03 21:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 3, 2014 at 2:02 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 03, 2014 at 10:44:43PM +0200, Peter Zijlstra wrote:
>> On Fri, Oct 03, 2014 at 01:27:52PM -0700, Andy Lutomirski wrote:
>> > On Fri, Oct 3, 2014 at 1:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > >
>> > > We could make the rule be that RDPMC is enabled if a perf event is
>> > > mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
>> > > there's an actual performance issue first.  Ideally we can get this
>> > > all working with no API or ABI change at all.
>> >
>> > No, we can't use that rule.  But we could say that RDPMC is enabled if
>> > a perf event is mmapped and no thread in the mm uses seccomp.  I'll
>> > grumble a little bit about adding yet another piece of seccomp state.
>>
>> Well, we could simply disable the RDPMC for everything TIF_SECCOMP.
>> Should be fairly straight fwd.
>
>
> Something like so.. slightly less ugly and possibly with more
> complicated conditions setting the cr4 if you want to fix tsc vs seccomp
> as well.

This will crash anything that tries rdpmc in an allow-everything
seccomp sandbox.  It's also not very compatible with my grand scheme
of allowing rdtsc to be turned off without breaking clock_gettime. :)

I'll send out a real set of patches in the next few days.  I'll even
try to benchmark them :)

>
> ---
>  arch/x86/kernel/cpu/perf_event.c | 13 ++++++++++++-
>  arch/x86/kernel/process.c        | 24 +++++++++++++++++-------
>  2 files changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 16c73022306e..cfc42ff5d901 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1869,6 +1869,17 @@ static ssize_t set_attr_rdpmc(struct device *cdev,
>         return count;
>  }
>
> +void perf_change_rdpmc(bool on, unsigned long *cr4)
> +{
> +       if (x86_pmu.attr_rdpmc_broken)
> +               return;
> +
> +       if (on)
> +               *cr4 |= X86_CR4_PCE;
> +       else
> +               *cr4 &= ~X86_CR4_PCE;
> +}
> +
>  static DEVICE_ATTR(rdpmc, S_IRUSR | S_IWUSR, get_attr_rdpmc, set_attr_rdpmc);
>
>  static struct attribute *x86_pmu_attrs[] = {
> @@ -1928,7 +1939,7 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
>
>         userpg->cap_user_time = 0;
>         userpg->cap_user_time_zero = 0;
> -       userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
> +       userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc && test_thread_flag(TIF_SECCOMP);
>         userpg->pmc_width = x86_pmu.cntval_bits;
>
>         if (!sched_clock_stable())
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index e127ddaa2d5a..b74c0400851e 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -201,12 +201,15 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>                       struct tss_struct *tss)
>  {
>         struct thread_struct *prev, *next;
> +       struct thread_info *pi, *ni;
>
>         prev = &prev_p->thread;
>         next = &next_p->thread;
>
> -       if (test_tsk_thread_flag(prev_p, TIF_BLOCKSTEP) ^
> -           test_tsk_thread_flag(next_p, TIF_BLOCKSTEP)) {
> +       pi = task_thread_info(prev_p);
> +       ni = task_thread_info(next_p);
> +
> +       if ((pi->flags & _TIF_BLOCKSTEP) ^ (ni->flags & _TIF_BLOCKSTEP)) {
>                 unsigned long debugctl = get_debugctlmsr();
>
>                 debugctl &= ~DEBUGCTLMSR_BTF;
> @@ -216,13 +219,20 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
>                 update_debugctlmsr(debugctl);
>         }
>
> -       if (test_tsk_thread_flag(prev_p, TIF_NOTSC) ^
> -           test_tsk_thread_flag(next_p, TIF_NOTSC)) {
> +       if ((pi->flags & (_TIF_NOTSC | _TIF_SECCOMP)) ^
> +           (ni->flags & (_TIF_NOTSC | _TIF_SECCOMP))) {
> +               extern void perf_change_rdpmc(bool, unsigned long *);
> +               unsigned long cr4 = read_cr4();
> +
>                 /* prev and next are different */
> -               if (test_tsk_thread_flag(next_p, TIF_NOTSC))
> -                       hard_disable_TSC();
> +               if (ni->flags & _TIF_NOTSC)
> +                       cr4 |= X86_CR4_TSD;
>                 else
> -                       hard_enable_TSC();
> +                       cr4 &= ~X86_CR4_TSD;
> +
> +               perf_change_rdpmc(!(ni->flags & _TIF_SECCOMP), &cr4);
> +
> +               write_cr4(cr4);
>         }
>
>         if (test_tsk_thread_flag(next_p, TIF_IO_BITMAP)) {



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 21:04             ` Andy Lutomirski
@ 2014-10-03 21:12               ` Peter Zijlstra
  2014-10-03 21:15                 ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2014-10-03 21:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 03, 2014 at 02:04:53PM -0700, Andy Lutomirski wrote:
> On Fri, Oct 3, 2014 at 2:02 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> > Something like so.. slightly less ugly and possibly with more
> > complicated conditions setting the cr4 if you want to fix tsc vs seccomp
> > as well.
> 
> This will crash anything that tries rdpmc in an allow-everything
> seccomp sandbox.  It's also not very compatible with my grand scheme
> of allowing rdtsc to be turned off without breaking clock_gettime. :)

Well, we clear cap_user_rdpmc, so everybody who still tries it gets what
he deserves, no problem there.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 21:12               ` Peter Zijlstra
@ 2014-10-03 21:15                 ` Andy Lutomirski
  2014-10-04  8:13                   ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2014-10-03 21:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 3, 2014 at 2:12 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 03, 2014 at 02:04:53PM -0700, Andy Lutomirski wrote:
>> On Fri, Oct 3, 2014 at 2:02 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> > Something like so.. slightly less ugly and possibly with more
>> > complicated conditions setting the cr4 if you want to fix tsc vs seccomp
>> > as well.
>>
>> This will crash anything that tries rdpmc in an allow-everything
>> seccomp sandbox.  It's also not very compatible with my grand scheme
>> of allowing rdtsc to be turned off without breaking clock_gettime. :)
>
> Well, we clear cap_user_rdpmc, so everybody who still tries it gets what
> he deserves, no problem there.

Oh, interesting.

To continue playing devil's advocate, what if you do perf_event_open,
then mmap it, then start the seccomp sandbox?

My draft patches are currently tracking the number of perf_event mmaps
per mm.  I'm not thrilled with it, but it's straightforward.  And I
still need to benchmark cr4 writes, which is tedious, because I can't
do it from user code.

--Andy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-03 21:15                 ` Andy Lutomirski
@ 2014-10-04  8:13                   ` Peter Zijlstra
  2014-10-06 16:44                     ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2014-10-04  8:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Fri, Oct 03, 2014 at 02:15:24PM -0700, Andy Lutomirski wrote:
> On Fri, Oct 3, 2014 at 2:12 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Oct 03, 2014 at 02:04:53PM -0700, Andy Lutomirski wrote:
> >> On Fri, Oct 3, 2014 at 2:02 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> > Something like so.. slightly less ugly and possibly with more
> >> > complicated conditions setting the cr4 if you want to fix tsc vs seccomp
> >> > as well.
> >>
> >> This will crash anything that tries rdpmc in an allow-everything
> >> seccomp sandbox.  It's also not very compatible with my grand scheme
> >> of allowing rdtsc to be turned off without breaking clock_gettime. :)
> >
> > Well, we clear cap_user_rdpmc, so everybody who still tries it gets what
> > he deserves, no problem there.
> 
> Oh, interesting.
> 
> To continue playing devil's advocate, what if you do perf_event_open,
> then mmap it, then start the seccomp sandbox?

We update that cap bit on every update to the self-monitor state, and in
a perfect world people would also check the cap bit every time they try
and read it, and fall back to the syscall. So we could just clear it..
but I can imagine reality ruining things here.

> My draft patches are currently tracking the number of perf_event mmaps
> per mm.  I'm not thrilled with it, but it's straightforward.  And I
> still need to benchmark cr4 writes, which is tedious, because I can't
> do it from user code.

Should be fairly straight fwd from kernel space, get a tsc stamp,
read+write cr4 1000 times, get another tsc read, and maybe do that
several times. No?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
  2014-10-04  8:13                   ` Peter Zijlstra
@ 2014-10-06 16:44                     ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2014-10-06 16:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Kees Cook, Andrea Arcangeli,
	Erik Bosman, H. Peter Anvin, Linux API, Michael Kerrisk-manpages,
	Paul Mackerras, Arnaldo Carvalho de Melo, X86 ML

On Sat, Oct 4, 2014 at 1:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 03, 2014 at 02:15:24PM -0700, Andy Lutomirski wrote:
>> On Fri, Oct 3, 2014 at 2:12 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, Oct 03, 2014 at 02:04:53PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Oct 3, 2014 at 2:02 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> >> > Something like so.. slightly less ugly and possibly with more
>> >> > complicated conditions setting the cr4 if you want to fix tsc vs seccomp
>> >> > as well.
>> >>
>> >> This will crash anything that tries rdpmc in an allow-everything
>> >> seccomp sandbox.  It's also not very compatible with my grand scheme
>> >> of allowing rdtsc to be turned off without breaking clock_gettime. :)
>> >
>> > Well, we clear cap_user_rdpmc, so everybody who still tries it gets what
>> > he deserves, no problem there.
>>
>> Oh, interesting.
>>
>> To continue playing devil's advocate, what if you do perf_event_open,
>> then mmap it, then start the seccomp sandbox?
>
> We update that cap bit on every update to the self-monitor state, and in
> a perfect world people would also check the cap bit every time they try
> and read it, and fall back to the syscall. So we could just clear it..
> but I can imagine reality ruining things here.

If nothing else, the fact that rdpmc fails with SIGSEGV instead of
with some nonsense value means that this will always be racy.

>
>> My draft patches are currently tracking the number of perf_event mmaps
>> per mm.  I'm not thrilled with it, but it's straightforward.  And I
>> still need to benchmark cr4 writes, which is tedious, because I can't
>> do it from user code.
>
> Should be fairly straight fwd from kernel space, get a tsc stamp,
> read+write cr4 1000 times, get another tsc read, and maybe do that
> several times. No?

I tried it.  Rough numbers on my 2.7 GHz Sandy Bridge laptop

Writing to cr4 in VMX non-root (changing PCE) takes ~48ns.  RMW cr4
takes rougly 51ns.  IMO neither of these is enough to be worth
worrying *that* much about when switching into or out of a perf-using
task.  But you might disagree with me.

Changing TSD takes 700ns, because KVM has the VMCS programmed wrong.
I'll send a patch.

I suspect that the same experiment on bare metal would run faster.


--Andy

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-10-06 16:44 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-03 17:18 [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering Andy Lutomirski
2014-10-03 17:27 ` Andy Lutomirski
2014-10-03 20:14   ` Peter Zijlstra
2014-10-03 20:22     ` Andy Lutomirski
2014-10-03 20:27       ` Andy Lutomirski
2014-10-03 20:44         ` Peter Zijlstra
2014-10-03 20:46           ` Andy Lutomirski
2014-10-03 21:02           ` Peter Zijlstra
2014-10-03 21:04             ` Peter Zijlstra
2014-10-03 21:04             ` Andy Lutomirski
2014-10-03 21:12               ` Peter Zijlstra
2014-10-03 21:15                 ` Andy Lutomirski
2014-10-04  8:13                   ` Peter Zijlstra
2014-10-06 16:44                     ` Andy Lutomirski
2014-10-03 20:42       ` Peter Zijlstra
2014-10-03 20:53         ` Andy Lutomirski
     [not found] ` <20141003174141.GR2342@redhat.com>
2014-10-03 17:59   ` Andy Lutomirski
2014-10-03 20:15     ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).