* [PATCH v1 0/6] seccomp: Implement constant action bitmaps
From: Kees Cook @ 2020-09-23 23:29 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	Jann Horn, linux-api, containers, bpf, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	linux-kernel

rfc: https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/
alternative: https://lore.kernel.org/containers/cover.1600661418.git.yifeifz2@illinois.edu/
v1:
- rebase to for-next/seccomp
- finish X86_X32 support for both pinning and bitmaps
- replace TLB magic with Jann's emulator
- add JSET insn

TODO:
- add ALU|AND insn
- significantly more testing

Hi,

This is a refresh of my earlier constant action bitmap series. It looks
like the RFC was missed on the container list, so I've CCed it now. :)
I'd like to work from this series, as it handles the multi-architecture
stuff.

Repeating the commit log from patch 3:

    seccomp: Implement constant action bitmaps
    
    One of the most common pain points with seccomp filters has been dealing
    with the overhead of processing the filters, especially for "always allow"
    or "always reject" cases. While BPF is extremely fast[1], it will always
    have overhead associated with it. Additionally, due to seccomp's design,
    filters are layered, which means processing time goes up as the number
    of filters attached goes up.
    
    In the past, efforts have been focused on making filter execution complete
    in a shorter amount of time. For example, filters were rewritten from
    using linear if/then/else syscall search to using balanced binary trees,
    or moving tests for syscalls common to the process's workload to the
    front of the filter. However, there are limits to this, especially when
    some processes are dealing with tens of filters[2], or when some
    architectures have a less efficient BPF engine[3].
    
    The most common use of seccomp, constructing syscall block/allow-lists
    where syscalls are always allowed or always rejected (without regard
    to their arguments), also tends to produce the most pathological runtime
    behavior: a large number of syscall checks in the filter must be
    performed before a determination is reached.
    
    In order to optimize these cases from O(n) to O(1), seccomp can
    use bitmaps to immediately determine the desired action. A critical
    observation in the prior paragraph bears repeating: the common case for
    syscall tests is that they do not check arguments. For any given filter,
    there is a constant mapping from the combination of architecture and
    syscall to the seccomp action result. (For kernels/architectures without
    CONFIG_COMPAT, there is a single architecture.) As such, it is possible
    to construct
    a mapping of arch/syscall to action, which can be updated as new filters
    are attached to a process.
    
    In order to build this mapping at filter attach time, each filter is
    executed for every syscall (under each possible architecture), and
    checked for any accesses of struct seccomp_data other than the "arch"
    and "nr" (syscall) members. If only "arch" and "nr" are examined, then
    there is a constant mapping for that syscall, and bitmaps can be updated
    accordingly. If any accesses happen outside of those struct members,
    seccomp must not bypass filter execution for that syscall, since program
    state will be used to determine filter action result. (This logic comes
    in the next patch.)
    
    [1] https://lore.kernel.org/bpf/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/
    [2] https://lore.kernel.org/bpf/20200601101137.GA121847@gardel-login/
    [3] https://lore.kernel.org/bpf/717a06e7f35740ccb4c70470ec70fb2f@huawei.com/


Thanks!

-Kees


Kees Cook (6):
  seccomp: Introduce SECCOMP_PIN_ARCHITECTURE
  x86: Enable seccomp architecture tracking
  seccomp: Implement constant action bitmaps
  seccomp: Emulate basic filters for constant action results
  selftests/seccomp: Compare bitmap vs filter overhead
  [DEBUG] seccomp: Report bitmap coverage ranges

 arch/x86/include/asm/seccomp.h                |  14 +
 include/linux/seccomp.h                       |  27 +
 include/uapi/linux/seccomp.h                  |   1 +
 kernel/seccomp.c                              | 473 +++++++++++++++++-
 net/core/filter.c                             |   3 +-
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c |  33 ++
 tools/testing/selftests/seccomp/settings      |   2 +-
 8 files changed, 674 insertions(+), 30 deletions(-)

-- 
2.25.1

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers


* [PATCH 1/6] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE
From: Kees Cook @ 2020-09-23 23:29 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	Jann Horn, linux-api, containers, bpf, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	linux-kernel

For systems that provide multiple syscall maps based on audit
architectures (e.g. AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 via
CONFIG_COMPAT) or via syscall masks (e.g. x86_x32), allow a fast way
to pin the process to a specific syscall table, instead of needing
to generate all filters with an architecture check as the first filter
action.

This creates the internal representation that seccomp itself can use
(which is separate from the filters, which need to stay runtime
agnostic). It additionally paves the way for constant-action bitmaps.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 include/linux/seccomp.h                       |  9 +++
 include/uapi/linux/seccomp.h                  |  1 +
 kernel/seccomp.c                              | 79 ++++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 33 ++++++++
 4 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..0be20bc81ea9 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -20,12 +20,18 @@
 #include <linux/atomic.h>
 #include <asm/seccomp.h>
 
+#define SECCOMP_ARCH_IS_NATIVE		1
+#define SECCOMP_ARCH_IS_COMPAT		2
+#define SECCOMP_ARCH_IS_MULTIPLEX	3
+#define SECCOMP_ARCH_IS_UNKNOWN		0xff
+
 struct seccomp_filter;
 /**
  * struct seccomp - the state of a seccomp'ed process
  *
  * @mode:  indicates one of the valid values above for controlled
  *         system calls available to a process.
+ * @arch: seccomp's internal architecture identifier (not seccomp_data->arch)
  * @filter: must always point to a valid seccomp-filter or NULL as it is
  *          accessed without locking during system call entry.
  *
@@ -34,6 +40,9 @@ struct seccomp_filter;
  */
 struct seccomp {
 	int mode;
+#ifdef SECCOMP_ARCH
+	u8 arch;
+#endif
 	atomic_t filter_count;
 	struct seccomp_filter *filter;
 };
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 6ba18b82a02e..f4d134ebfa7e 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -16,6 +16,7 @@
 #define SECCOMP_SET_MODE_FILTER		1
 #define SECCOMP_GET_ACTION_AVAIL	2
 #define SECCOMP_GET_NOTIF_SIZES		3
+#define SECCOMP_PIN_ARCHITECTURE	4
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ae6b40cc39f4..0a3ff8eb8aea 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -298,6 +298,47 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 	return 0;
 }
 
+#ifdef SECCOMP_ARCH
+static inline u8 seccomp_get_arch(u32 syscall_arch, u32 syscall_nr)
+{
+	u8 seccomp_arch;
+
+	switch (syscall_arch) {
+	case SECCOMP_ARCH:
+		seccomp_arch = SECCOMP_ARCH_IS_NATIVE;
+		break;
+#ifdef CONFIG_COMPAT
+	case SECCOMP_ARCH_COMPAT:
+		seccomp_arch = SECCOMP_ARCH_IS_COMPAT;
+		break;
+#endif
+	default:
+		seccomp_arch = SECCOMP_ARCH_IS_UNKNOWN;
+	}
+
+#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
+	if (syscall_arch == SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH) {
+		seccomp_arch |= (syscall_nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
+				SECCOMP_MULTIPLEXED_SYSCALL_TABLE_SHIFT;
+	}
+#endif
+
+	return seccomp_arch;
+}
+#endif
+
+static inline bool seccomp_arch_mismatch(struct seccomp *seccomp,
+					 const struct seccomp_data *sd)
+{
+#ifdef SECCOMP_ARCH
+	/* Block mismatched architectures. */
+	if (seccomp->arch && seccomp->arch != seccomp_get_arch(sd->arch, sd->nr))
+		return true;
+#endif
+
+	return false;
+}
+
 /**
  * seccomp_run_filters - evaluates all seccomp filters against @sd
  * @sd: optional seccomp data to be passed to filters
@@ -312,9 +353,14 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 			       struct seccomp_filter **match)
 {
 	u32 ret = SECCOMP_RET_ALLOW;
+	struct seccomp_filter *f;
+	struct seccomp *seccomp = &current->seccomp;
+
+	if (seccomp_arch_mismatch(seccomp, sd))
+		return SECCOMP_RET_KILL_PROCESS;
+
 	/* Make sure cross-thread synced filter points somewhere sane. */
-	struct seccomp_filter *f =
-			READ_ONCE(current->seccomp.filter);
+	f = READ_ONCE(seccomp->filter);
 
 	/* Ensure unexpected behavior doesn't result in failing open. */
 	if (WARN_ON(f == NULL))
@@ -522,6 +568,11 @@ static inline void seccomp_sync_threads(unsigned long flags)
 		if (task_no_new_privs(caller))
 			task_set_no_new_privs(thread);
 
+#ifdef SECCOMP_ARCH
+		/* Copy any pinned architecture. */
+		thread->seccomp.arch = caller->seccomp.arch;
+#endif
+
 		/*
 		 * Opt the other thread into seccomp if needed.
 		 * As threads are considered to be trust-realm
@@ -1652,6 +1703,23 @@ static long seccomp_get_notif_sizes(void __user *usizes)
 	return 0;
 }
 
+static long seccomp_pin_architecture(void)
+{
+#ifdef SECCOMP_ARCH
+	struct task_struct *task = current;
+
+	u8 arch = seccomp_get_arch(syscall_get_arch(task),
+				   syscall_get_nr(task, task_pt_regs(task)));
+
+	/* How did you even get here? */
+	if (task->seccomp.arch && task->seccomp.arch != arch)
+		return -EBUSY;
+
+	task->seccomp.arch = arch;
+#endif
+	return 0;
+}
+
 /* Common entry point for both prctl and syscall. */
 static long do_seccomp(unsigned int op, unsigned int flags,
 		       void __user *uargs)
@@ -1673,6 +1741,13 @@ static long do_seccomp(unsigned int op, unsigned int flags,
 			return -EINVAL;
 
 		return seccomp_get_notif_sizes(uargs);
+	case SECCOMP_PIN_ARCHITECTURE:
+		if (flags != 0)
+			return -EINVAL;
+		if (uargs != NULL)
+			return -EINVAL;
+
+		return seccomp_pin_architecture();
 	default:
 		return -EINVAL;
 	}
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 9c398768553b..d90551e0385e 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -157,6 +157,10 @@ struct seccomp_data {
 #define SECCOMP_GET_NOTIF_SIZES 3
 #endif
 
+#ifndef SECCOMP_PIN_ARCHITECTURE
+#define SECCOMP_PIN_ARCHITECTURE 4
+#endif
+
 #ifndef SECCOMP_FILTER_FLAG_TSYNC
 #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
 #endif
@@ -2221,6 +2225,35 @@ TEST_F_SIGNAL(TRACE_syscall, kill_after, SIGSYS)
 	EXPECT_NE(self->mypid, syscall(__NR_getpid));
 }
 
+TEST(seccomp_architecture_pin)
+{
+	long ret;
+
+	ret = seccomp(SECCOMP_PIN_ARCHITECTURE, 0, NULL);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support SECCOMP_PIN_ARCHITECTURE!");
+	}
+
+	/* Make sure unexpected arguments are rejected. */
+	ret = seccomp(SECCOMP_PIN_ARCHITECTURE, 1, NULL);
+	ASSERT_EQ(-1, ret);
+	EXPECT_EQ(EINVAL, errno) {
+		TH_LOG("Did not reject SECCOMP_PIN_ARCHITECTURE with flags!");
+	}
+
+	ret = seccomp(SECCOMP_PIN_ARCHITECTURE, 0, &ret);
+	ASSERT_EQ(-1, ret);
+	EXPECT_EQ(EINVAL, errno) {
+		TH_LOG("Did not reject SECCOMP_PIN_ARCHITECTURE with address!");
+	}
+
+	ret = seccomp(SECCOMP_PIN_ARCHITECTURE, 1, &ret);
+	ASSERT_EQ(-1, ret);
+	EXPECT_EQ(EINVAL, errno) {
+		TH_LOG("Did not reject SECCOMP_PIN_ARCHITECTURE with flags and address!");
+	}
+}
+
 TEST(seccomp_syscall)
 {
 	struct sock_filter filter[] = {
-- 
2.25.1


* [PATCH 2/6] x86: Enable seccomp architecture tracking
From: Kees Cook @ 2020-09-23 23:29 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	Jann Horn, linux-api, containers, bpf, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	linux-kernel

Provide seccomp internals with the details to calculate which syscall
table the running kernel is expecting to deal with. This allows for
efficient architecture pinning and paves the way for constant-action
bitmaps.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/seccomp.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 2bd1338de236..38181e20e1d3 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -16,6 +16,20 @@
 #define __NR_seccomp_sigreturn_32	__NR_ia32_sigreturn
 #endif
 
+#ifdef CONFIG_X86_64
+# define SECCOMP_ARCH					AUDIT_ARCH_X86_64
+# ifdef CONFIG_COMPAT
+#  define SECCOMP_ARCH_COMPAT				AUDIT_ARCH_I386
+# endif
+# ifdef CONFIG_X86_X32_ABI
+#  define SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH	AUDIT_ARCH_X86_64
+#  define SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK	__X32_SYSCALL_BIT
+#  define SECCOMP_MULTIPLEXED_SYSCALL_TABLE_SHIFT	29
+# endif
+#else /* !CONFIG_X86_64 */
+# define SECCOMP_ARCH					AUDIT_ARCH_I386
+#endif
+
 #include <asm-generic/seccomp.h>
 
 #endif /* _ASM_X86_SECCOMP_H */
-- 
2.25.1


* [PATCH 3/6] seccomp: Implement constant action bitmaps
From: Kees Cook @ 2020-09-23 23:29 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	Jann Horn, linux-api, containers, bpf, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	linux-kernel

One of the most common pain points with seccomp filters has been dealing
with the overhead of processing the filters, especially for "always allow"
or "always reject" cases. While BPF is extremely fast[1], it will always
have overhead associated with it. Additionally, due to seccomp's design,
filters are layered, which means processing time goes up as the number
of filters attached goes up.

In the past, efforts have been focused on making filter execution complete
in a shorter amount of time. For example, filters were rewritten from
using linear if/then/else syscall search to using balanced binary trees,
or moving tests for syscalls common to the process's workload to the
front of the filter. However, there are limits to this, especially when
some processes are dealing with tens of filters[2], or when some
architectures have a less efficient BPF engine[3].

The most common use of seccomp, constructing syscall block/allow-lists
where syscalls are always allowed or always rejected (without regard
to their arguments), also tends to produce the most pathological runtime
behavior: a large number of syscall checks in the filter must be
performed before a determination is reached.

In order to optimize these cases from O(n) to O(1), seccomp can
use bitmaps to immediately determine the desired action. A critical
observation in the prior paragraph bears repeating: the common case for
syscall tests is that they do not check arguments. For any given filter,
there is a constant mapping from the combination of architecture and
syscall to the seccomp action result. (For kernels/architectures without
CONFIG_COMPAT, there is a single architecture.) As such, it is possible to construct
a mapping of arch/syscall to action, which can be updated as new filters
are attached to a process.

In order to build this mapping at filter attach time, each filter is
executed for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, then
there is a constant mapping for that syscall, and bitmaps can be updated
accordingly. If any accesses happen outside of those struct members,
seccomp must not bypass filter execution for that syscall, since program
state will be used to determine filter action result. (This logic comes
in the next patch.)

[1] https://lore.kernel.org/bpf/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/
[2] https://lore.kernel.org/bpf/20200601101137.GA121847@gardel-login/
[3] https://lore.kernel.org/bpf/717a06e7f35740ccb4c70470ec70fb2f@huawei.com/

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 include/linux/seccomp.h |  18 ++++
 kernel/seccomp.c        | 207 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 221 insertions(+), 4 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 0be20bc81ea9..96df2f899e3d 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -25,6 +25,17 @@
 #define SECCOMP_ARCH_IS_MULTIPLEX	3
 #define SECCOMP_ARCH_IS_UNKNOWN		0xff
 
+/* When no bits are set for a syscall, filters are run. */
+struct seccomp_bitmaps {
+#ifdef SECCOMP_ARCH
+	/* "allow" are initialized to set and only ever get cleared. */
+	DECLARE_BITMAP(allow, NR_syscalls);
+	/* These are initialized to clear and only ever get set. */
+	DECLARE_BITMAP(kill_thread, NR_syscalls);
+	DECLARE_BITMAP(kill_process, NR_syscalls);
+#endif
+};
+
 struct seccomp_filter;
 /**
  * struct seccomp - the state of a seccomp'ed process
@@ -45,6 +56,13 @@ struct seccomp {
 #endif
 	atomic_t filter_count;
 	struct seccomp_filter *filter;
+	struct seccomp_bitmaps native;
+#ifdef CONFIG_COMPAT
+	struct seccomp_bitmaps compat;
+#endif
+#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
+	struct seccomp_bitmaps multiplex;
+#endif
 };
 
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 0a3ff8eb8aea..111a238bc532 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -318,7 +318,7 @@ static inline u8 seccomp_get_arch(u32 syscall_arch, u32 syscall_nr)
 
 #ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
 	if (syscall_arch == SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH) {
-		seccomp_arch |= (sd->nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
+		seccomp_arch |= (syscall_nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
 				SECCOMP_MULTIPLEXED_SYSCALL_TABLE_SHIFT;
 	}
 #endif
@@ -559,6 +559,21 @@ static inline void seccomp_sync_threads(unsigned long flags)
 		atomic_set(&thread->seccomp.filter_count,
 			   atomic_read(&thread->seccomp.filter_count));
 
+		/* Copy syscall filter bitmaps. */
+		memcpy(&thread->seccomp.native,
+		       &caller->seccomp.native,
+		       sizeof(caller->seccomp.native));
+#ifdef CONFIG_COMPAT
+		memcpy(&thread->seccomp.compat,
+		       &caller->seccomp.compat,
+		       sizeof(caller->seccomp.compat));
+#endif
+#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
+		memcpy(&thread->seccomp.multiplex,
+		       &caller->seccomp.multiplex,
+		       sizeof(caller->seccomp.multiplex));
+#endif
+
 		/*
 		 * Don't let an unprivileged task work around
 		 * the no_new_privs restriction by creating
@@ -661,6 +676,114 @@ seccomp_prepare_user_filter(const char __user *user_filter)
 	return filter;
 }
 
+static inline bool sd_touched(pte_t *ptep)
+{
+	return !!pte_young(*(READ_ONCE(ptep)));
+}
+
+#ifdef SECCOMP_ARCH
+/*
+ * We can build constant-action bitmaps only when an arch/nr combination reads
+ * nothing more than sd->nr and sd->arch, since those have a constant mapping
+ * to the syscall.
+ *
+ * This approach could also be used to test for access to sd->arch,
+ * if we wanted to warn about compat-unsafe filters.
+ */
+static inline bool seccomp_filter_action_is_constant(struct bpf_prog *prog,
+						     struct seccomp_data *sd,
+						     u32 *action)
+{
+	/* No evaluation implementation yet. */
+	return false;
+}
+
+/*
+ * Walk every syscall for this arch/mask combination and update
+ * the bitmaps with any results.
+ */
+static void seccomp_update_bitmap(struct seccomp_filter *filter,
+				  void *pagepair, u32 arch, u32 mask,
+				  struct seccomp_bitmaps *bitmaps)
+{
+	struct seccomp_data sd = { };
+	bool constant;
+	u32 nr, ret;
+
+	/* Initialize bitmaps for first filter. */
+	if (!filter->prev)
+		bitmap_fill(bitmaps->allow, NR_syscalls);
+
+	/*
+	 * For every syscall, if we don't already know we need to run
+	 * the full filter, simulate the filter with our static values.
+	 */
+	for (nr = 0; nr < NR_syscalls; nr++) {
+		/* Are we already at the maximal rejection state? */
+		if (test_bit(nr, bitmaps->kill_process))
+			continue;
+
+		sd.nr = nr | mask;
+		sd.arch = arch;
+
+		/* Evaluate filter for this arch/syscall. */
+		constant = seccomp_filter_action_is_constant(filter->prog, &sd, &ret);
+
+		/*
+		 * If this run through the filter didn't access
+		 * beyond "arch", we know the result is a constant
+		 * mapping for arch/nr -> ret.
+		 */
+		if (constant) {
+			/* Constant evaluation. Mark appropriate bitmaps. */
+			switch (ret) {
+			case SECCOMP_RET_KILL_PROCESS:
+				set_bit(nr, bitmaps->kill_process);
+				break;
+			case SECCOMP_RET_KILL_THREAD:
+				set_bit(nr, bitmaps->kill_thread);
+				break;
+			default:
+				break;
+			case SECCOMP_RET_ALLOW:
+				/*
+				 * If we always map to allow, there are
+				 * no changes needed to the bitmaps.
+				 */
+				continue;
+			}
+		}
+
+		/*
+		 * Dynamic evaluation of the syscall, or a constant mapping
+		 * to something other than SECCOMP_RET_ALLOW: we must not
+		 * short-circuit-allow it any more.
+		 */
+		clear_bit(nr, bitmaps->allow);
+	}
+}
+
+static void seccomp_update_bitmaps(struct seccomp_filter *filter,
+				   void *pagepair)
+{
+	seccomp_update_bitmap(filter, pagepair, SECCOMP_ARCH, 0,
+			      &current->seccomp.native);
+#ifdef CONFIG_COMPAT
+	seccomp_update_bitmap(filter, pagepair, SECCOMP_ARCH_COMPAT, 0,
+			      &current->seccomp.compat);
+#endif
+#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
+	seccomp_update_bitmap(filter, pagepair, SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH,
+			      SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK,
+			      &current->seccomp.multiplex);
+#endif
+}
+#else
+static void seccomp_update_bitmaps(struct seccomp_filter *filter,
+				   void *pagepair)
+{ }
+#endif
+
 /**
  * seccomp_attach_filter: validate and attach filter
  * @flags:  flags to change filter behavior
@@ -674,7 +797,8 @@ seccomp_prepare_user_filter(const char __user *user_filter)
  *   - in NEW_LISTENER mode: the fd of the new listener
  */
 static long seccomp_attach_filter(unsigned int flags,
-				  struct seccomp_filter *filter)
+				  struct seccomp_filter *filter,
+				  void *pagepair)
 {
 	unsigned long total_insns;
 	struct seccomp_filter *walker;
@@ -713,6 +837,9 @@ static long seccomp_attach_filter(unsigned int flags,
 	current->seccomp.filter = filter;
 	atomic_inc(&current->seccomp.filter_count);
 
+	/* Evaluate filter for new known-outcome syscalls */
+	seccomp_update_bitmaps(filter, pagepair);
+
 	/* Now that the new filter is in place, synchronize to all threads. */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		seccomp_sync_threads(flags);
@@ -970,6 +1097,65 @@ static int seccomp_do_user_notification(int this_syscall,
 	return -1;
 }
 
+#ifdef SECCOMP_ARCH
+static inline bool __bypass_filter(struct seccomp_bitmaps *bitmaps,
+				   u32 nr, u32 *filter_ret)
+{
+	if (nr < NR_syscalls) {
+		if (test_bit(nr, bitmaps->allow)) {
+			*filter_ret = SECCOMP_RET_ALLOW;
+			return true;
+		}
+		if (test_bit(nr, bitmaps->kill_process)) {
+			*filter_ret = SECCOMP_RET_KILL_PROCESS;
+			return true;
+		}
+		if (test_bit(nr, bitmaps->kill_thread)) {
+			*filter_ret = SECCOMP_RET_KILL_THREAD;
+			return true;
+		}
+	}
+	return false;
+}
+
+static inline u32 check_syscall(const struct seccomp_data *sd,
+				struct seccomp_filter **match)
+{
+	u32 filter_ret = SECCOMP_RET_KILL_PROCESS;
+	u8 arch = seccomp_get_arch(sd->arch, sd->nr);
+
+	switch (arch) {
+	case SECCOMP_ARCH_IS_NATIVE:
+		if (__bypass_filter(&current->seccomp.native, sd->nr, &filter_ret))
+			return filter_ret;
+		break;
+#ifdef CONFIG_COMPAT
+	case SECCOMP_ARCH_IS_COMPAT:
+		if (__bypass_filter(&current->seccomp.compat, sd->nr, &filter_ret))
+			return filter_ret;
+		break;
+#endif
+#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
+	case SECCOMP_ARCH_IS_MULTIPLEX:
+		if (__bypass_filter(&current->seccomp.multiplex, sd->nr, &filter_ret))
+			return filter_ret;
+		break;
+#endif
+	default:
+		WARN_ON_ONCE(1);
+		return filter_ret;
+	}
+
+	return seccomp_run_filters(sd, match);
+}
+#else
+static inline u32 check_syscall(const struct seccomp_data *sd,
+				struct seccomp_filter **match)
+{
+	return seccomp_run_filters(sd, match);
+}
+#endif
+
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
 {
@@ -989,7 +1175,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 		sd = &sd_local;
 	}
 
-	filter_ret = seccomp_run_filters(sd, &match);
+	filter_ret = check_syscall(sd, &match);
 	data = filter_ret & SECCOMP_RET_DATA;
 	action = filter_ret & SECCOMP_RET_ACTION_FULL;
 
@@ -1580,6 +1766,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	long ret = -EINVAL;
 	int listener = -1;
 	struct file *listener_f = NULL;
+	void *pagepair;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -1625,12 +1812,24 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
 		goto out_put_fd;
 
+	/*
+	 * This memory will be needed for bitmap testing, but we'll
+	 * be holding a spinlock at that point. Do the allocation
+	 * (and free) outside of the lock.
+	 *
+	 * Alternative: we could do the bitmap update before attach
+	 * to avoid spending too much time under lock.
+	 */
+	pagepair = vzalloc(PAGE_SIZE * 2);
+	if (!pagepair)
+		goto out_put_fd;
+
 	spin_lock_irq(&current->sighand->siglock);
 
 	if (!seccomp_may_assign_mode(seccomp_mode))
 		goto out;
 
-	ret = seccomp_attach_filter(flags, prepared);
+	ret = seccomp_attach_filter(flags, prepared, pagepair);
 	if (ret)
 		goto out;
 	/* Do not free the successfully attached filter. */
-- 
2.25.1

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 4/6] seccomp: Emulate basic filters for constant action results
  2020-09-23 23:29 ` Kees Cook
@ 2020-09-23 23:29   ` Kees Cook
  -1 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-23 23:29 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	Jann Horn, linux-api, containers, bpf, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	linux-kernel

This emulates absolutely the most basic seccomp filters to figure out
if they will always give the same results for a given arch/nr combo.

Nearly all seccomp filters are built from the following ops:

BPF_LD  | BPF_W    | BPF_ABS
BPF_JMP | BPF_JEQ  | BPF_K
BPF_JMP | BPF_JGE  | BPF_K
BPF_JMP | BPF_JGT  | BPF_K
BPF_JMP | BPF_JSET | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K

These are now emulated to check for accesses beyond seccomp_data::arch
or unknown instructions.

Not yet implemented is:

BPF_ALU | BPF_AND (generated by libseccomp and Chrome)

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 kernel/seccomp.c  | 82 ++++++++++++++++++++++++++++++++++++++++++++---
 net/core/filter.c |  3 +-
 2 files changed, 79 insertions(+), 6 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 111a238bc532..9921f6f39d12 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -610,7 +610,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 {
 	struct seccomp_filter *sfilter;
 	int ret;
-	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
+	const bool save_orig =
+#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH)
+		true;
+#else
+		false;
+#endif
 
 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
 		return ERR_PTR(-EINVAL);
@@ -690,11 +695,78 @@ static inline bool sd_touched(pte_t *ptep)
  * This approach could also be used to test for access to sd->arch too,
  * if we wanted to warn about compat-unsafe filters.
  */
-static inline bool seccomp_filter_action_is_constant(struct bpf_prog *prog,
-						     struct seccomp_data *sd,
-						     u32 *action)
+static bool seccomp_filter_action_is_constant(struct bpf_prog *prog,
+					      struct seccomp_data *sd,
+					      u32 *action)
 {
-	/* No evaluation implementation yet. */
+	struct sock_fprog_kern *fprog = prog->orig_prog;
+	unsigned int insns;
+	unsigned int reg_value = 0;
+	unsigned int pc;
+	bool op_res;
+
+	if (WARN_ON_ONCE(!fprog))
+		return false;
+
+	insns = fprog->len;	/* bpf_classic_proglen() returns bytes, not insns */
+	for (pc = 0; pc < insns; pc++) {
+		struct sock_filter *insn = &fprog->filter[pc];
+		u16 code = insn->code;
+		u32 k = insn->k;
+
+		switch (code) {
+		case BPF_LD | BPF_W | BPF_ABS:
+			switch (k) {
+			case offsetof(struct seccomp_data, nr):
+				reg_value = sd->nr;
+				break;
+			case offsetof(struct seccomp_data, arch):
+				reg_value = sd->arch;
+				break;
+			default:
+				/* can't optimize (non-constant value load) */
+				return false;
+			}
+			break;
+		case BPF_RET | BPF_K:
+			*action = insn->k;
+			/* success: reached return with constant values only */
+			return true;
+		case BPF_JMP | BPF_JA:
+			pc += insn->k;
+			break;
+		case BPF_JMP | BPF_JEQ | BPF_K:
+		case BPF_JMP | BPF_JGE | BPF_K:
+		case BPF_JMP | BPF_JGT | BPF_K:
+		case BPF_JMP | BPF_JSET | BPF_K:
+			switch (BPF_OP(code)) {
+			case BPF_JEQ:
+				op_res = reg_value == k;
+				break;
+			case BPF_JGE:
+				op_res = reg_value >= k;
+				break;
+			case BPF_JGT:
+				op_res = reg_value > k;
+				break;
+			case BPF_JSET:
+				op_res = !!(reg_value & k);
+				break;
+			default:
+				/* can't optimize (unknown jump) */
+				return false;
+			}
+
+			pc += op_res ? insn->jt : insn->jf;
+			break;
+		default:
+			/* can't optimize (unknown insn) */
+			return false;
+		}
+	}
+
+	/* ran off the end of the filter?! */
+	WARN_ON(1);
 	return false;
 }
 
diff --git a/net/core/filter.c b/net/core/filter.c
index b2df52086445..cb1bdb0bfe87 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1145,7 +1145,7 @@ static int bpf_prog_store_orig_filter(struct bpf_prog *fp,
 	return 0;
 }
 
-static void bpf_release_orig_filter(struct bpf_prog *fp)
+void bpf_release_orig_filter(struct bpf_prog *fp)
 {
 	struct sock_fprog_kern *fprog = fp->orig_prog;
 
@@ -1154,6 +1154,7 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
 		kfree(fprog);
 	}
 }
+EXPORT_SYMBOL_GPL(bpf_release_orig_filter);
 
 static void __bpf_prog_release(struct bpf_prog *prog)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 81+ messages in thread


* [PATCH 5/6] selftests/seccomp: Compare bitmap vs filter overhead
  2020-09-23 23:29 ` Kees Cook
@ 2020-09-23 23:29   ` Kees Cook
  -1 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-23 23:29 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	Jann Horn, linux-api, containers, bpf, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	linux-kernel

As part of the seccomp benchmarking, include expectations for the
timing behavior of the constant action bitmaps, and report
inconsistencies more clearly.

Example output with constant action bitmaps on x86:

$ sudo ./seccomp_benchmark 30344920
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 30344920 syscalls...
22.113430452 - 0.005691205 = 22107739247 (22.1s)
getpid native: 728 ns
44.867669556 - 22.113755935 = 22753913621 (22.8s)
getpid RET_ALLOW 1 filter (bitmap): 749 ns
67.649040358 - 44.868003056 = 22781037302 (22.8s)
getpid RET_ALLOW 2 filters (bitmap): 750 ns
92.555661414 - 67.650328959 = 24905332455 (24.9s)
getpid RET_ALLOW 3 filters (full): 820 ns
118.170831065 - 92.556057543 = 25614773522 (25.6s)
getpid RET_ALLOW 4 filters (full): 844 ns
Estimated total seccomp overhead for 1 bitmapped filter: 21 ns
Estimated total seccomp overhead for 2 bitmapped filters: 22 ns
Estimated total seccomp overhead for 3 full filters: 92 ns
Estimated total seccomp overhead for 4 full filters: 116 ns
Estimated seccomp entry overhead: 20 ns
Estimated seccomp per-filter overhead (last 2 diff): 24 ns
Estimated seccomp per-filter overhead (filters / 4): 24 ns
Expectations:
        native ≤ 1 bitmap (728 ≤ 749): ✔️
        native ≤ 1 filter (728 ≤ 820): ✔️
        per-filter (last 2 diff) ≈ per-filter (filters / 4) (24 ≈ 24): ✔️
        1 bitmapped ≈ 2 bitmapped (21 ≈ 22): ✔️
        entry ≈ 1 bitmapped (20 ≈ 21): ✔️
        entry ≈ 2 bitmapped (20 ≈ 22): ✔️
        native + entry + (per filter * 4) ≈ 4 filters total (844 ≈ 844): ✔️

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++++++++++---
 tools/testing/selftests/seccomp/settings      |   2 +-
 2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 91f5a89cadac..fcc806585266 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,12 +4,16 @@
  */
 #define _GNU_SOURCE
 #include <assert.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <unistd.h>
 #include <linux/filter.h>
 #include <linux/seccomp.h>
+#include <sys/param.h>
 #include <sys/prctl.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
@@ -70,18 +74,74 @@ unsigned long long calibrate(void)
 	return samples * seconds;
 }
 
+bool approx(int i_one, int i_two)
+{
+	double one = i_one, one_bump = one * 0.01;
+	double two = i_two, two_bump = two * 0.01;
+
+	one_bump = one + MAX(one_bump, 2.0);
+	two_bump = two + MAX(two_bump, 2.0);
+
+	    /* Equal, or within 1% or an absolute margin of 2 */
+	if (one == two ||
+	    (one > two && one <= two_bump) ||
+	    (two > one && two <= one_bump))
+		return true;
+	return false;
+}
+
+bool le(int i_one, int i_two)
+{
+	if (i_one <= i_two)
+		return true;
+	return false;
+}
+
+long compare(const char *name_one, const char *name_eval, const char *name_two,
+	     unsigned long long one, bool (*eval)(int, int), unsigned long long two)
+{
+	bool good;
+
+	printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two,
+	       (long long)one, name_eval, (long long)two);
+	if (one > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)one);
+		return 1;
+	}
+	if (two > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)two);
+		return 1;
+	}
+
+	good = eval(one, two);
+	printf("%s\n", good ? "✔️" : "❌");
+
+	return good ? 0 : 1;
+}
+
 int main(int argc, char *argv[])
 {
+	struct sock_filter bitmap_filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog bitmap_prog = {
+		.len = (unsigned short)ARRAY_SIZE(bitmap_filter),
+		.filter = bitmap_filter,
+	};
 	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])),
 		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
 	};
 	struct sock_fprog prog = {
 		.len = (unsigned short)ARRAY_SIZE(filter),
 		.filter = filter,
 	};
-	long ret;
-	unsigned long long samples;
-	unsigned long long native, filter1, filter2;
+
+	long ret, bits;
+	unsigned long long samples, calc;
+	unsigned long long native, filter1, filter2, bitmap1, bitmap2;
+	unsigned long long entry, per_filter1, per_filter2;
 
 	printf("Current BPF sysctl settings:\n");
 	system("sysctl net.core.bpf_jit_enable");
@@ -101,35 +161,82 @@ int main(int argc, char *argv[])
 	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
 	assert(ret == 0);
 
-	/* One filter */
-	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
+	/* One filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
 	assert(ret == 0);
 
-	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1);
+	bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1);
+
+	/* Second filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	if (filter1 == native)
-		printf("No overhead measured!? Try running again with more samples.\n");
+	bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2);
 
-	/* Two filters */
+	/* Third filter, can no longer be converted to bitmap */
 	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
 	assert(ret == 0);
 
-	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2);
-
-	/* Calculations */
-	printf("Estimated total seccomp overhead for 1 filter: %llu ns\n",
-		filter1 - native);
+	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1);
 
-	printf("Estimated total seccomp overhead for 2 filters: %llu ns\n",
-		filter2 - native);
+	/* Fourth filter, cannot be converted to bitmap because of filter 3 */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	printf("Estimated seccomp per-filter overhead: %llu ns\n",
-		filter2 - filter1);
+	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2);
+
+	/* Estimations */
+#define ESTIMATE(fmt, var, what)	do {			\
+		var = (what);					\
+		printf("Estimated " fmt ": %llu ns\n", var);	\
+		if (var > INT_MAX)				\
+			goto more_samples;			\
+	} while (0)
+
+	ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc,
+		 bitmap1 - native);
+	ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc,
+		 bitmap2 - native);
+	ESTIMATE("total seccomp overhead for 3 full filters", calc,
+		 filter1 - native);
+	ESTIMATE("total seccomp overhead for 4 full filters", calc,
+		 filter2 - native);
+	ESTIMATE("seccomp entry overhead", entry,
+		 bitmap1 - native - (bitmap2 - bitmap1));
+	ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1,
+		 filter2 - filter1);
+	ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2,
+		 (filter2 - native - entry) / 4);
+
+	printf("Expectations:\n");
+	ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1);
+	bits = compare("native", "≤", "1 filter", native, le, filter1);
+	if (bits)
+		goto more_samples;
+
+	ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)",
+			per_filter1, approx, per_filter2);
+
+	bits = compare("1 bitmapped", "≈", "2 bitmapped",
+			bitmap1 - native, approx, bitmap2 - native);
+	if (bits) {
+		printf("Skipping constant action bitmap expectations: they appear unsupported.\n");
+		goto out;
+	}
 
-	printf("Estimated seccomp entry overhead: %llu ns\n",
-		filter1 - native - (filter2 - filter1));
+	ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native);
+	ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native);
+	ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total",
+			entry + (per_filter1 * 4) + native, approx, filter2);
+	if (ret == 0)
+		goto out;
 
+more_samples:
+	printf("Saw unexpected benchmark result. Try running again with more samples?\n");
+out:
 	return 0;
 }
diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings
index ba4d85f74cd6..6091b45d226b 100644
--- a/tools/testing/selftests/seccomp/settings
+++ b/tools/testing/selftests/seccomp/settings
@@ -1 +1 @@
-timeout=90
+timeout=120
-- 
2.25.1

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 6/6] [DEBUG] seccomp: Report bitmap coverage ranges
  2020-09-23 23:29 ` Kees Cook
@ 2020-09-23 23:29   ` Kees Cook
  -1 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-23 23:29 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	Jann Horn, linux-api, containers, bpf, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	linux-kernel

This is what I've been using to explore actual bitmap results for
real-world filters...

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 kernel/seccomp.c | 115 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 9921f6f39d12..1a0595d7f8ef 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -835,6 +835,85 @@ static void seccomp_update_bitmap(struct seccomp_filter *filter,
 	}
 }
 
+static void __report_bitmap(const char *arch, u32 ret, int start, int finish)
+{
+	int gap;
+	char *name;
+
+	if (finish == -1)
+		return;
+
+	switch (ret) {
+	case UINT_MAX:
+		name = "filter";
+		break;
+	case SECCOMP_RET_ALLOW:
+		name = "SECCOMP_RET_ALLOW";
+		break;
+	case SECCOMP_RET_KILL_PROCESS:
+		name = "SECCOMP_RET_KILL_PROCESS";
+		break;
+	case SECCOMP_RET_KILL_THREAD:
+		name = "SECCOMP_RET_KILL_THREAD";
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		name = "unknown";
+		break;
+	}
+
+	gap = 0;
+	if (start < 100)
+		gap++;
+	if (start < 10)
+		gap++;
+	if (finish < 100)
+		gap++;
+	if (finish < 10)
+		gap++;
+
+	if (start == finish)
+		pr_info("%s     %3d: %s\n", arch, start, name);
+	else if (start + 1 == finish)
+		pr_info("%s %*s%d,%d: %s\n", arch, gap, "", start, finish, name);
+	else
+		pr_info("%s %*s%d-%d: %s\n", arch, gap, "", start, finish, name);
+}
+
+static void report_bitmap(struct seccomp_bitmaps *bitmaps, const char *arch)
+{
+	u32 nr;
+	int start = 0, finish = -1;
+	u32 ret = UINT_MAX;
+	struct report_states {
+		unsigned long *bitmap;
+		u32 ret;
+	} states[] = {
+		{ .bitmap = bitmaps->allow,	   .ret = SECCOMP_RET_ALLOW, },
+		{ .bitmap = bitmaps->kill_process, .ret = SECCOMP_RET_KILL_PROCESS, },
+		{ .bitmap = bitmaps->kill_thread,  .ret = SECCOMP_RET_KILL_THREAD, },
+		{ .bitmap = NULL,		   .ret = UINT_MAX, },
+	};
+
+	for (nr = 0; nr < NR_syscalls; nr++) {
+		int i;
+
+		for (i = 0; i < ARRAY_SIZE(states); i++) {
+			if (!states[i].bitmap || test_bit(nr, states[i].bitmap)) {
+				if (ret != states[i].ret) {
+					__report_bitmap(arch, ret, start, finish);
+					ret = states[i].ret;
+					start = nr;
+				}
+				finish = nr;
+				break;
+			}
+		}
+	}
+	if (start != nr)
+		__report_bitmap(arch, ret, start, finish);
+}
+
 static void seccomp_update_bitmaps(struct seccomp_filter *filter,
 				   void *pagepair)
 {
@@ -849,6 +928,23 @@ static void seccomp_update_bitmaps(struct seccomp_filter *filter,
 			      SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK,
 			      &current->seccomp.multiplex);
 #endif
+	if (strncmp(current->comm, "test-", 5) == 0 ||
+	    strcmp(current->comm, "seccomp_bpf") == 0 ||
+	    /*
+	     * Why are systemd's process names head-truncated to 8 bytes
+	     * and wrapped in parens!?
+	     */
+	    (current->comm[0] == '(' && strrchr(current->comm, ')') != NULL)) {
+		pr_info("reporting syscall bitmap usage for %d (%s):\n",
+			task_pid_nr(current), current->comm);
+		report_bitmap(&current->seccomp.native, "native");
+#ifdef CONFIG_COMPAT
+		report_bitmap(&current->seccomp.compat, "compat");
+#endif
+#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
+		report_bitmap(&current->seccomp.multiplex, "multiplex");
+#endif
+	}
 }
 #else
 static void seccomp_update_bitmaps(struct seccomp_filter *filter,
@@ -908,6 +1004,10 @@ static long seccomp_attach_filter(unsigned int flags,
 	filter->prev = current->seccomp.filter;
 	current->seccomp.filter = filter;
 	atomic_inc(&current->seccomp.filter_count);
+	if (atomic_read(&current->seccomp.filter_count) > 10)
+		pr_info("%d filters: %d (%s)\n",
+			atomic_read(&current->seccomp.filter_count),
+			task_pid_nr(current), current->comm);
 
 	/* Evaluate filter for new known-outcome syscalls */
 	seccomp_update_bitmaps(filter, pagepair);
@@ -2419,6 +2519,21 @@ static int __init seccomp_sysctl_init(void)
 		pr_warn("sysctl registration failed\n");
 	else
 		kmemleak_not_leak(hdr);
+#ifndef SECCOMP_ARCH
+	pr_info("arch lacks support for constant action bitmaps\n");
+#else
+	pr_info("NR_syscalls: %d\n", NR_syscalls);
+	pr_info("arch: 0x%x\n", SECCOMP_ARCH);
+#ifdef CONFIG_COMPAT
+	pr_info("compat arch: 0x%x\n", SECCOMP_ARCH_COMPAT);
+#endif
+#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
+	pr_info("multiplex arch: 0x%x (mask: 0x%x)\n",
+		SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH,
+		SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK);
+#endif
+#endif
+	pr_info("sizeof(struct seccomp_bitmaps): %zu\n", sizeof(struct seccomp_bitmaps));
 
 	return 0;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 4/6] seccomp: Emulate basic filters for constant action results
  2020-09-23 23:29   ` Kees Cook
@ 2020-09-23 23:47     ` Jann Horn
  -1 siblings, 0 replies; 81+ messages in thread
From: Jann Horn via Containers @ 2020-09-23 23:47 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, YiFei Zhu,
	Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> This emulates absolutely the most basic seccomp filters to figure out
> if they will always give the same results for a given arch/nr combo.
>
> Nearly all seccomp filters are built from the following ops:
>
> BPF_LD  | BPF_W    | BPF_ABS
> BPF_JMP | BPF_JEQ  | BPF_K
> BPF_JMP | BPF_JGE  | BPF_K
> BPF_JMP | BPF_JGT  | BPF_K
> BPF_JMP | BPF_JSET | BPF_K
> BPF_JMP | BPF_JA
> BPF_RET | BPF_K
>
> These are now emulated to check for accesses beyond seccomp_data::arch
> or unknown instructions.
>
> Not yet implemented are:
>
> BPF_ALU | BPF_AND (generated by libseccomp and Chrome)

BPF_AND is normally only used on syscall arguments, not on the syscall
number or the architecture, right? And when a syscall argument is
loaded, we abort execution anyway. So I think there is no need to
implement those?

> Suggested-by: Jann Horn <jannh@google.com>
> Link: https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  kernel/seccomp.c  | 82 ++++++++++++++++++++++++++++++++++++++++++++---
>  net/core/filter.c |  3 +-
>  2 files changed, 79 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 111a238bc532..9921f6f39d12 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -610,7 +610,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>  {
>         struct seccomp_filter *sfilter;
>         int ret;
> -       const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
> +       const bool save_orig =
> +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH)
> +               true;
> +#else
> +               false;
> +#endif

You could probably write this as something like:

const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
                       __is_defined(SECCOMP_ARCH);
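(Aside, for readers unfamiliar with the macro: __is_defined() is the preprocessor placeholder trick from the kernel's <linux/kconfig.h>. A self-contained userspace reproduction — with the caveat, noted in the comment, that the trick only reports 1 for macros defined to exactly "1":)

```c
#include <assert.h>

/*
 * Userspace reproduction of __is_defined() from <linux/kconfig.h>.
 * Caveat: it only yields 1 when the macro expands to exactly "1";
 * a macro defined to some other token (as SECCOMP_ARCH would be,
 * since it carries an AUDIT_ARCH_* value) does not match, so this
 * suggestion assumes SECCOMP_ARCH-style flags are defined as 1.
 */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define __is_defined(x) ___is_defined(x)

#define SKETCH_FLAG_SET 1	/* stands in for a CONFIG-style flag */
/* SKETCH_FLAG_UNSET deliberately left undefined */
```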

[...]
> diff --git a/net/core/filter.c b/net/core/filter.c
[...]
> -static void bpf_release_orig_filter(struct bpf_prog *fp)
> +void bpf_release_orig_filter(struct bpf_prog *fp)
>  {
>         struct sock_fprog_kern *fprog = fp->orig_prog;
>
> @@ -1154,6 +1154,7 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
>                 kfree(fprog);
>         }
>  }
> +EXPORT_SYMBOL_GPL(bpf_release_orig_filter);

If this change really belongs into this patch (which I don't think it
does), please describe why in the commit message.
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
  2020-09-23 23:29   ` Kees Cook
@ 2020-09-24  0:25     ` Jann Horn
  -1 siblings, 0 replies; 81+ messages in thread
From: Jann Horn via Containers @ 2020-09-24  0:25 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, YiFei Zhu,
	Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> One of the most common pain points with seccomp filters has been dealing
> with the overhead of processing the filters, especially for "always allow"
> or "always reject" cases.

The "always reject" cases don't need to be fast, in particular not the
kill_thread/kill_process ones. Nobody's going to have "process kills
itself by executing a forbidden syscall" on a critical hot codepath.

> While BPF is extremely fast[1], it will always
> have overhead associated with it. Additionally, due to seccomp's design,
> filters are layered, which means processing time goes up as the number
> of filters attached goes up.
[...]
> In order to build this mapping at filter attach time, each filter is
> executed for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> there is a constant mapping for that syscall, and bitmaps can be updated
> accordingly. If any accesses happen outside of those struct members,
> seccomp must not bypass filter execution for that syscall, since program
> state will be used to determine filter action result. (This logic comes
> in the next patch.)
>
> [1] https://lore.kernel.org/bpf/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/
> [2] https://lore.kernel.org/bpf/20200601101137.GA121847@gardel-login/
> [3] https://lore.kernel.org/bpf/717a06e7f35740ccb4c70470ec70fb2f@huawei.com/
>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  include/linux/seccomp.h |  18 ++++
>  kernel/seccomp.c        | 207 +++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 221 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 0be20bc81ea9..96df2f899e3d 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -25,6 +25,17 @@
>  #define SECCOMP_ARCH_IS_MULTIPLEX      3
>  #define SECCOMP_ARCH_IS_UNKNOWN                0xff
>
> +/* When no bits are set for a syscall, filters are run. */
> +struct seccomp_bitmaps {
> +#ifdef SECCOMP_ARCH
> +       /* "allow" are initialized to set and only ever get cleared. */
> +       DECLARE_BITMAP(allow, NR_syscalls);

This bitmap makes sense.

The "NR_syscalls" part assumes that the compat syscall tables will not
be bigger than the native syscall table, right? I guess that's usually
mostly true nowadays, thanks to the syscall table unification...
(might be worth a comment though)

> +       /* These are initialized to clear and only ever get set. */
> +       DECLARE_BITMAP(kill_thread, NR_syscalls);
> +       DECLARE_BITMAP(kill_process, NR_syscalls);

I don't think these bitmaps make sense, this is not part of any fastpath.

(However, a "which syscalls have a fixed result" bitmap might make
sense if we want to export the list of permitted syscalls as a text
file in procfs, as I mentioned over at
<https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/>.)

> +#endif
> +};
> +
>  struct seccomp_filter;
>  /**
>   * struct seccomp - the state of a seccomp'ed process
> @@ -45,6 +56,13 @@ struct seccomp {
>  #endif
>         atomic_t filter_count;
>         struct seccomp_filter *filter;
> +       struct seccomp_bitmaps native;
> +#ifdef CONFIG_COMPAT
> +       struct seccomp_bitmaps compat;
> +#endif
> +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> +       struct seccomp_bitmaps multiplex;
> +#endif

Why do we have one bitmap per thread (in struct seccomp) instead of
putting the bitmap for a given filter and all its ancestors into the
seccomp_filter?

>  };
>
>  #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 0a3ff8eb8aea..111a238bc532 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -318,7 +318,7 @@ static inline u8 seccomp_get_arch(u32 syscall_arch, u32 syscall_nr)
>
>  #ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
>         if (syscall_arch == SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH) {
> -               seccomp_arch |= (sd->nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
> +               seccomp_arch |= (syscall_nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
>                                 SECCOMP_MULTIPLEXED_SYSCALL_TABLE_SHIFT;

This belongs over into patch 1.

>         }
>  #endif
> @@ -559,6 +559,21 @@ static inline void seccomp_sync_threads(unsigned long flags)
>                 atomic_set(&thread->seccomp.filter_count,
>                            atomic_read(&thread->seccomp.filter_count));
>
> +               /* Copy syscall filter bitmaps. */
> +               memcpy(&thread->seccomp.native,
> +                      &caller->seccomp.native,
> +                      sizeof(caller->seccomp.native));
> +#ifdef CONFIG_COMPAT
> +               memcpy(&thread->seccomp.compat,
> +                      &caller->seccomp.compat,
> +                      sizeof(caller->seccomp.compat));
> +#endif
> +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> +               memcpy(&thread->seccomp.multiplex,
> +                      &caller->seccomp.multiplex,
> +                      sizeof(caller->seccomp.multiplex));
> +#endif

This part wouldn't be necessary if the bitmasks were part of the
seccomp_filter...

>                 /*
>                  * Don't let an unprivileged task work around
>                  * the no_new_privs restriction by creating
> @@ -661,6 +676,114 @@ seccomp_prepare_user_filter(const char __user *user_filter)
>         return filter;
>  }
>
> +static inline bool sd_touched(pte_t *ptep)
> +{
> +       return !!pte_young(*(READ_ONCE(ptep)));
> +}

I think this is left over from the previous version and should've been removed?

[...]
> +/*
> + * Walk everyone syscall combination for this arch/mask combo and update

nit: "Walk every possible", or something like that

> + * the bitmaps with any results.
> + */
> +static void seccomp_update_bitmap(struct seccomp_filter *filter,
> +                                 void *pagepair, u32 arch, u32 mask,
> +                                 struct seccomp_bitmaps *bitmaps)
[...]
> @@ -970,6 +1097,65 @@ static int seccomp_do_user_notification(int this_syscall,
>         return -1;
>  }
>
> +#ifdef SECCOMP_ARCH
> +static inline bool __bypass_filter(struct seccomp_bitmaps *bitmaps,
> +                                  u32 nr, u32 *filter_ret)
> +{
> +       if (nr < NR_syscalls) {
> +               if (test_bit(nr, bitmaps->allow)) {
> +                       *filter_ret = SECCOMP_RET_ALLOW;
> +                       return true;
> +               }
> +               if (test_bit(nr, bitmaps->kill_process)) {
> +                       *filter_ret = SECCOMP_RET_KILL_PROCESS;
> +                       return true;
> +               }
> +               if (test_bit(nr, bitmaps->kill_thread)) {
> +                       *filter_ret = SECCOMP_RET_KILL_THREAD;
> +                       return true;
> +               }

The checks against ->kill_process and ->kill_thread won't make
anything faster, but since they will run in the fastpath, they'll
probably actually contribute to making things *slower*.

> +       }
> +       return false;
> +}
> +
> +static inline u32 check_syscall(const struct seccomp_data *sd,
> +                               struct seccomp_filter **match)
> +{
> +       u32 filter_ret = SECCOMP_RET_KILL_PROCESS;
> +       u8 arch = seccomp_get_arch(sd->arch, sd->nr);
> +
> +       switch (arch) {
> +       case SECCOMP_ARCH_IS_NATIVE:
> +               if (__bypass_filter(&current->seccomp.native, sd->nr, &filter_ret))
> +                       return filter_ret;
> +               break;
> +#ifdef CONFIG_COMPAT
> +       case SECCOMP_ARCH_IS_COMPAT:
> +               if (__bypass_filter(&current->seccomp.compat, sd->nr, &filter_ret))
> +                       return filter_ret;
> +               break;
> +#endif
> +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> +       case SECCOMP_ARCH_IS_MULTIPLEX:
> +               if (__bypass_filter(&current->seccomp.multiplex, sd->nr, &filter_ret))
> +                       return filter_ret;
> +               break;
> +#endif
> +       default:
> +               WARN_ON_ONCE(1);
> +               return filter_ret;
> +       };
> +
> +       return seccomp_run_filters(sd, match);
> +}

You could write this in a less repetitive way, and especially if we
get rid of the kill_* masks, also more compact:

static inline u32 check_syscall(const struct seccomp_data *sd,
        struct seccomp_filter **match)
{
  struct seccomp_bitmaps *bitmaps;
  u8 arch = seccomp_get_arch(sd->arch, sd->nr);

  switch (arch) {
  case SECCOMP_ARCH_IS_NATIVE:
    bitmaps = &current->seccomp.native;
    break;
#ifdef CONFIG_COMPAT
  case SECCOMP_ARCH_IS_COMPAT:
    bitmaps = &current->seccomp.compat;
    break;
#endif
#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
  case SECCOMP_ARCH_IS_MULTIPLEX:
    bitmaps = &current->seccomp.multiplex;
    break;
#endif
  default:
    WARN_ON_ONCE(1);
    return SECCOMP_RET_KILL_PROCESS;
  }

  if ((unsigned)sd->nr < __NR_syscalls && test_bit(sd->nr, bitmaps->allow))
    return SECCOMP_RET_ALLOW;

  return seccomp_run_filters(sd, match);
}
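(To make the cost of that fast path concrete: the allow-bitmap gate is one bounds check, one word load, and one bit test. A self-contained userspace sketch of the same shape — helper names hypothetical, not the kernel's test_bit():)

```c
#include <assert.h>
#include <limits.h>

#define NR_SYSCALLS_SKETCH 448
#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS(n) \
	(((n) + BITS_PER_LONG_SKETCH - 1) / BITS_PER_LONG_SKETCH)

/* Hypothetical stand-ins for the kernel's set_bit()/test_bit(). */
static void bitmap_set_bit(unsigned long *map, unsigned int nr)
{
	map[nr / BITS_PER_LONG_SKETCH] |= 1UL << (nr % BITS_PER_LONG_SKETCH);
}

static int bitmap_test_bit(const unsigned long *map, unsigned int nr)
{
	return (map[nr / BITS_PER_LONG_SKETCH] >>
		(nr % BITS_PER_LONG_SKETCH)) & 1;
}

/*
 * The O(1) fast path: the unsigned cast also rejects negative syscall
 * numbers; anything out of range or unset falls back to the filters.
 */
static int syscall_bypasses_filters(const unsigned long *allow, int nr)
{
	return (unsigned int)nr < NR_SYSCALLS_SKETCH &&
	       bitmap_test_bit(allow, (unsigned int)nr);
}
```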

[...]
> @@ -1625,12 +1812,24 @@ static long seccomp_set_mode_filter(unsigned int flags,
>             mutex_lock_killable(&current->signal->cred_guard_mutex))
>                 goto out_put_fd;
>
> +       /*
> +        * This memory will be needed for bitmap testing, but we'll
> +        * be holding a spinlock at that point. Do the allocation
> +        * (and free) outside of the lock.
> +        *
> +        * Alternative: we could do the bitmap update before attach
> +        * to avoid spending too much time under lock.
> +        */
> +       pagepair = vzalloc(PAGE_SIZE * 2);
> +       if (!pagepair)
> +               goto out_put_fd;
> +
[...]
> -       ret = seccomp_attach_filter(flags, prepared);
> +       ret = seccomp_attach_filter(flags, prepared, pagepair);

You probably intended to rip this stuff back out? AFAIU the vzalloc()
stuff is a remnant from the old version that relied on MMU trickery.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
       [not found]   ` <DM6PR11MB271492D0565E91475D949F5DEF390@DM6PR11MB2714.namprd11.prod.outlook.com>
@ 2020-09-24  0:36       ` YiFei Zhu
  0 siblings, 0 replies; 81+ messages in thread
From: YiFei Zhu @ 2020-09-24  0:36 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, Jann Horn,
	YiFei Zhu, linux-api, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

On Wed, Sep 23, 2020 at 6:29 PM Kees Cook <keescook@chromium.org> wrote:
> In order to optimize these cases from O(n) to O(1), seccomp can
> use bitmaps to immediately determine the desired action. A critical
> observation in the prior paragraph bears repeating: the common case for
> syscall tests do not check arguments. For any given filter, there is a
> constant mapping from the combination of architecture and syscall to the
> seccomp action result. (For kernels/architectures without CONFIG_COMPAT,
> there is a single architecture.). As such, it is possible to construct
> a mapping of arch/syscall to action, which can be updated as new filters
> are attached to a process.

Would you mind educating me on how this patch plans to handle MIPS? A
single kernel build there seems to have up to three arch numbers,
AUDIT_ARCH_MIPS{,64,64N32}. ARCH_TRACE_IGNORE_COMPAT_SYSCALLS does not
seem to be defined for MIPS, so I'm assuming the syscall numbers are
the same, but I think it is possible some client uses that arch number
to pose different constraints on different processes, so it would be
better not to accelerate them than to break them.


YiFei Zhu

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/6] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE
  2020-09-23 23:29   ` Kees Cook
@ 2020-09-24  0:41     ` Jann Horn
  -1 siblings, 0 replies; 81+ messages in thread
From: Jann Horn via Containers @ 2020-09-24  0:41 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, YiFei Zhu,
	Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> For systems that provide multiple syscall maps based on audit
> architectures (e.g. AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 via
> CONFIG_COMPAT) or via syscall masks (e.g. x86_x32), allow a fast way
> to pin the process to a specific syscall table, instead of needing
> to generate all filters with an architecture check as the first filter
> action.
>
> This creates the internal representation that seccomp itself can use
> (which is separate from the filters, which need to stay runtime
> agnostic). Additionally paves the way for constant-action bitmaps.

I don't really see the point in providing this UAPI - the syscall
number checking will probably have much more performance cost than the
architecture number check, and it's not like this lets us avoid the
check, we're just moving it over into C code.

> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  include/linux/seccomp.h                       |  9 +++
>  include/uapi/linux/seccomp.h                  |  1 +
>  kernel/seccomp.c                              | 79 ++++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 33 ++++++++
>  4 files changed, 120 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 02aef2844c38..0be20bc81ea9 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -20,12 +20,18 @@
>  #include <linux/atomic.h>
>  #include <asm/seccomp.h>
>
> +#define SECCOMP_ARCH_IS_NATIVE         1
> +#define SECCOMP_ARCH_IS_COMPAT         2

FYI, mips has three different possible "arch" values (per kernel build
config; the __AUDIT_ARCH_LE flag can also be set, but that's fixed
based on the config):

 - AUDIT_ARCH_MIPS
 - AUDIT_ARCH_MIPS | __AUDIT_ARCH_64BIT
 - AUDIT_ARCH_MIPS | __AUDIT_ARCH_64BIT | __AUDIT_ARCH_CONVENTION_MIPS64_N32

But I guess we can deal with that once someone wants to actually add
support for this on mips.

> +#define SECCOMP_ARCH_IS_MULTIPLEX      3

Why should X32 be handled specially? If the seccomp filter allows
specific syscalls (as it should), we don't have to care about X32.
Only in weird cases where the seccomp filter wants to deny specific
syscalls (a horrible idea) is X32 a concern, and in such cases the
userspace code can generate a single conditional jump to deal with it.

And when seccomp is used properly to allow specific syscalls, the
kernel will just waste time uselessly checking this X32 stuff.

[...]
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
[...]
> +static long seccomp_pin_architecture(void)
> +{
> +#ifdef SECCOMP_ARCH
> +       struct task_struct *task = current;
> +
> +       u8 arch = seccomp_get_arch(syscall_get_arch(task),
> +                                  syscall_get_nr(task, task_pt_regs(task)));
> +
> +       /* How did you even get here? */

Via a racing TSYNC, that's how.

> +       if (task->seccomp.arch && task->seccomp.arch != arch)
> +               return -EBUSY;
> +
> +       task->seccomp.arch = arch;
> +#endif
> +       return 0;
> +}

Why does this return 0 if SECCOMP_ARCH is not defined? That suggests
to userspace that we have successfully pinned the ABI, even though
we're actually unable to do so.
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/6] x86: Enable seccomp architecture tracking
  2020-09-23 23:29   ` Kees Cook
@ 2020-09-24  0:45     ` Jann Horn
  -1 siblings, 0 replies; 81+ messages in thread
From: Jann Horn via Containers @ 2020-09-24  0:45 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, YiFei Zhu,
	Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> Provide seccomp internals with the details to calculate which syscall
> table the running kernel is expecting to deal with. This allows for
> efficient architecture pinning and paves the way for constant-action
> bitmaps.
[...]
> diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
[...]
> +#ifdef CONFIG_X86_64
[...]
> +#else /* !CONFIG_X86_64 */
> +# define SECCOMP_ARCH                                  AUDIT_ARCH_I386
> +#endif

If we are on a 32-bit kernel, performing architecture number checks in
the kernel is completely pointless, because we know that there is only
a single architecture identifier under which syscalls can happen.

While this patch is useful for enabling the bitmap logic in the
following patches, I think it adds unnecessary overhead in the context
of the previous patch.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/6] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE
  2020-09-24  0:41     ` Jann Horn
@ 2020-09-24  7:11       ` Kees Cook
  -1 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-24  7:11 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, YiFei Zhu,
	Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 02:41:36AM +0200, Jann Horn wrote:
> On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> > For systems that provide multiple syscall maps based on audit
> > architectures (e.g. AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 via
> > CONFIG_COMPAT) or via syscall masks (e.g. x86_x32), allow a fast way
> > to pin the process to a specific syscall table, instead of needing
> > to generate all filters with an architecture check as the first filter
> > action.
> >
> > This creates the internal representation that seccomp itself can use
> > (which is separate from the filters, which need to stay runtime
> > agnostic). Additionally paves the way for constant-action bitmaps.
> 
> I don't really see the point in providing this UAPI - the syscall
> number checking will probably have much more performance cost than the
> architecture number check, and it's not like this lets us avoid the
> check, we're just moving it over into C code.

It's desirable for libseccomp and is a request from systemd (which is,
at this point, the largest seccomp user I know of), as they have no way
to force an arch without doing it in filters, which doesn't help much
with reducing filter runtime.

> 
> > Signed-off-by: Kees Cook <keescook@chromium.org>
> > ---
> >  include/linux/seccomp.h                       |  9 +++
> >  include/uapi/linux/seccomp.h                  |  1 +
> >  kernel/seccomp.c                              | 79 ++++++++++++++++++-
> >  tools/testing/selftests/seccomp/seccomp_bpf.c | 33 ++++++++
> >  4 files changed, 120 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> > index 02aef2844c38..0be20bc81ea9 100644
> > --- a/include/linux/seccomp.h
> > +++ b/include/linux/seccomp.h
> > @@ -20,12 +20,18 @@
> >  #include <linux/atomic.h>
> >  #include <asm/seccomp.h>
> >
> > +#define SECCOMP_ARCH_IS_NATIVE         1
> > +#define SECCOMP_ARCH_IS_COMPAT         2
> 
> FYI, mips has three different possible "arch" values (per kernel build
> config; the __AUDIT_ARCH_LE flag can also be set, but that's fixed
> based on the config):
> 
>  - AUDIT_ARCH_MIPS
>  - AUDIT_ARCH_MIPS | __AUDIT_ARCH_64BIT
>  - AUDIT_ARCH_MIPS | __AUDIT_ARCH_64BIT | __AUDIT_ARCH_CONVENTION_MIPS64_N32
> 
> But I guess we can deal with that once someone wants to actually add
> support for this on mips.

Yup!

> 
> > +#define SECCOMP_ARCH_IS_MULTIPLEX      3
> 
> Why should X32 be handled specially? If the seccomp filter allows

Because it's a masked lookup into a separate table; the syscalls don't
map to x86_64's table; so for seccomp to correctly figure out which
bitmap to use, it has to do this decoding.

> specific syscalls (as it should), we don't have to care about X32.
> Only in weird cases where the seccomp filter wants to deny specific
> syscalls (a horrible idea), X32 is a concern, and in such cases, the
> userspace code can generate a single conditional jump to deal with it.

I feel like I must not understand what you mean. The x32-aware seccomp
filters are using syscall tests with 0x40000000 included in the values.
So seccomp's bitmap cannot handle this because it must know how many
syscalls to include in a linearly-allocated bitmap.
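
To make that masking concrete, here is a minimal userspace sketch of the decode (helper names are made up; the bit value matches x86's __X32_SYSCALL_BIT):

```c
#include <assert.h>
#include <stdbool.h>

/* x86's __X32_SYSCALL_BIT; defined locally so the sketch stands alone. */
#define X32_SYSCALL_BIT 0x40000000U

/* An x32 syscall arrives with AUDIT_ARCH_X86_64 but with bit 30 set in
 * seccomp_data->nr, so the raw nr does not index the x86_64 table. */
static bool nr_is_x32(unsigned int nr)
{
	return nr & X32_SYSCALL_BIT;
}

/* Strip the marker to get an index into a linearly-allocated x32 bitmap. */
static unsigned int x32_table_index(unsigned int nr)
{
	return nr & ~X32_SYSCALL_BIT;
}
```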

> And when seccomp is used properly to allow specific syscalls, the
> kernel will just waste time uselessly checking this X32 stuff.

It's not measurable in my tests -- seccomp_data::nr is rather hot in the
cache. ;) That said, if it's unwanted, then CONFIG_X86_X32=n is the way
to go.

> [...]
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> [...]
> > +static long seccomp_pin_architecture(void)
> > +{
> > +#ifdef SECCOMP_ARCH
> > +       struct task_struct *task = current;
> > +
> > +       u8 arch = seccomp_get_arch(syscall_get_arch(task),
> > +                                  syscall_get_nr(task, task_pt_regs(task)));
> > +
> > +       /* How did you even get here? */
> 
> Via a racing TSYNC, that's how.

Yes; thanks. This will need to take &current->sighand->siglock.

> 
> > +       if (task->seccomp.arch && task->seccomp.arch != arch)
> > +               return -EBUSY;
> > +
> > +       task->seccomp.arch = arch;
> > +#endif
> > +       return 0;
> > +}
> 
> Why does this return 0 if SECCOMP_ARCH is not defined? That suggests
> to userspace that we have successfully pinned the ABI, even though
> we're actually unable to do so.

Yup; thanks for the catch. This is a logical leftover from the RFC. This
should be, I think:

+       task->seccomp.arch = arch;
+       return 0;
+#else
+	return -EINVAL;
+#endif
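
Modeled in userspace, the intended semantics would be something like this (illustrative names; 0 means "not yet pinned"):

```c
#include <assert.h>
#include <errno.h>

/* Userspace model of the pinning semantics sketched above: the first
 * pin records the ABI; pinning the same ABI again succeeds, and
 * pinning a different ABI fails with -EBUSY. */
struct task_model {
	unsigned char seccomp_arch;	/* 0 == not yet pinned */
};

static long pin_architecture(struct task_model *t, unsigned char arch)
{
	if (t->seccomp_arch && t->seccomp_arch != arch)
		return -EBUSY;
	t->seccomp_arch = arch;
	return 0;
}
```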


-- 
Kees Cook

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/6] x86: Enable seccomp architecture tracking
  2020-09-24  0:45     ` Jann Horn
@ 2020-09-24  7:12       ` Kees Cook
  -1 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-24  7:12 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, YiFei Zhu,
	Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 02:45:45AM +0200, Jann Horn wrote:
> On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> > Provide seccomp internals with the details to calculate which syscall
> > table the running kernel is expecting to deal with. This allows for
> > efficient architecture pinning and paves the way for constant-action
> > bitmaps.
> [...]
> > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
> [...]
> > +#ifdef CONFIG_X86_64
> [...]
> > +#else /* !CONFIG_X86_64 */
> > +# define SECCOMP_ARCH                                  AUDIT_ARCH_I386
> > +#endif
> 
> If we are on a 32-bit kernel, performing architecture number checks in
> the kernel is completely pointless, because we know that there is only
> a single architecture identifier under which syscalls can happen.
> 
> While this patch is useful for enabling the bitmap logic in the
> following patches, I think it adds unnecessary overhead in the context
> of the previous patch.

That's what the RFC was trying to do (avoid the logic if there is only a
single arch known to the kernel). I will rework this a bit harder. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
  2020-09-24  0:25     ` Jann Horn
@ 2020-09-24  7:36       ` Kees Cook
  -1 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-24  7:36 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, YiFei Zhu,
	Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 02:25:03AM +0200, Jann Horn wrote:
> On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> > +/* When no bits are set for a syscall, filters are run. */
> > +struct seccomp_bitmaps {
> > +#ifdef SECCOMP_ARCH
> > +       /* "allow" are initialized to set and only ever get cleared. */
> > +       DECLARE_BITMAP(allow, NR_syscalls);
> 
> This bitmap makes sense.
> 
> > +       /* These are initialized to clear and only ever get set. */
> > +       DECLARE_BITMAP(kill_thread, NR_syscalls);
> > +       DECLARE_BITMAP(kill_process, NR_syscalls);
> 
> I don't think these bitmaps make sense, this is not part of any fastpath.

That's a fair point. I think I arrived at this design because it ended
up making filter addition faster ("don't bother processing this one,
it's already 'kill'"), but it's likely not worth the memory-usage
trade-off.

> (However, a "which syscalls have a fixed result" bitmap might make
> sense if we want to export the list of permitted syscalls as a text
> file in procfs, as I mentioned over at
> <https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/>.)

I haven't found a data structure I'm happy with for this. It seemed like
NR_syscalls * sizeof(u32) was rather a lot (i.e. to store the BPF_RET
value). However, let me discuss that more in the "why one bitmap per
thread?" question below...

> The "NR_syscalls" part assumes that the compat syscall tables will not
> be bigger than the native syscall table, right? I guess that's usually
> mostly true nowadays, thanks to the syscall table unification...
> (might be worth a comment though)

Hrm, I had convinced myself it was a max() of compat. But I see no
evidence of that now. Which means that I can add these to the per-arch
seccomp defines with something like:

# define SECCOMP_NR_NATIVE	NR_syscalls
# define SECCOMP_NR_COMPAT	X32_NR_syscalls
...

> > +#endif
> > +};
> > +
> >  struct seccomp_filter;
> >  /**
> >   * struct seccomp - the state of a seccomp'ed process
> > @@ -45,6 +56,13 @@ struct seccomp {
> >  #endif
> >         atomic_t filter_count;
> >         struct seccomp_filter *filter;
> > +       struct seccomp_bitmaps native;
> > +#ifdef CONFIG_COMPAT
> > +       struct seccomp_bitmaps compat;
> > +#endif
> > +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> > +       struct seccomp_bitmaps multiplex;
> > +#endif
> 
> Why do we have one bitmap per thread (in struct seccomp) instead of
> putting the bitmap for a given filter and all its ancestors into the
> seccomp_filter?

I explicitly didn't want to add code that was run per-filter; I wanted
O(1), not O(n), even if the per-filter work was a small constant. There is
obviously a memory/perf tradeoff here. I wonder if the middle ground
would be to put a bitmap and "constant action" results in the filter....
oh duh. The "top" filter is already going to be composed with its
ancestors. That's all that needs to be checked. Then the tri-state can
be:

bitmap accept[NR_syscalls]: accept or check "known" bitmap
bitmap filter[NR_syscalls]: run filter or return known action
u32 known_action[NR_syscalls];

(times syscall numbering "architecture" counts)

Though perhaps it would be just as fast as:

bitmap run_filter[NR_syscalls]: run filter or return known_action
u32 known_action[NR_syscalls];

where accept isn't treated special...
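
A userspace model of that two-array layout (names and the table size are illustrative; 0x7fff0000 happens to match the uapi value of SECCOMP_RET_ALLOW):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative table size, not the kernel's NR_syscalls. */
#define NR_SYSCALLS 64
#define RET_ALLOW 0x7fff0000U	/* stand-in for SECCOMP_RET_ALLOW */

struct action_cache {
	/* bit set: the BPF filters must run for this syscall */
	unsigned long run_filter[NR_SYSCALLS / (8 * sizeof(unsigned long))];
	/* constant result when the corresponding bit is clear */
	unsigned int known_action[NR_SYSCALLS];
};

static bool test_bit_ul(const unsigned long *map, unsigned int nr)
{
	return map[nr / (8 * sizeof(unsigned long))] &
	       (1UL << (nr % (8 * sizeof(unsigned long))));
}

/* Fast path: if the bit is clear, return the cached constant action
 * without running any BPF; otherwise tell the caller to run filters. */
static bool cache_lookup(const struct action_cache *c, unsigned int nr,
			 unsigned int *action)
{
	if (nr >= NR_SYSCALLS || test_bit_ul(c->run_filter, nr))
		return false;	/* caller must run the filters */
	*action = c->known_action[nr];
	return true;
}
```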

> 
> >  };
> >
> >  #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 0a3ff8eb8aea..111a238bc532 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -318,7 +318,7 @@ static inline u8 seccomp_get_arch(u32 syscall_arch, u32 syscall_nr)
> >
> >  #ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> >         if (syscall_arch == SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH) {
> > -               seccomp_arch |= (sd->nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
> > +               seccomp_arch |= (syscall_nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
> >                                 SECCOMP_MULTIPLEXED_SYSCALL_TABLE_SHIFT;
> 
> This belongs over into patch 1.

Thanks! I was rushing to get this posted so YiFei Zhu wouldn't spend
time fighting with arch and Kconfig stuff. :) I'll clean this (and the
other random cruft) up.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
  2020-09-24  0:36       ` YiFei Zhu
@ 2020-09-24  7:38         ` Kees Cook
  -1 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-24  7:38 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, Jann Horn,
	YiFei Zhu, linux-api, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

On Wed, Sep 23, 2020 at 07:36:47PM -0500, YiFei Zhu wrote:
> On Wed, Sep 23, 2020 at 6:29 PM Kees Cook <keescook@chromium.org> wrote:
> > In order to optimize these cases from O(n) to O(1), seccomp can
> > use bitmaps to immediately determine the desired action. A critical
> > observation in the prior paragraph bears repeating: the common case for
> > syscall tests does not check arguments. For any given filter, there is a
> > constant mapping from the combination of architecture and syscall to the
> > seccomp action result. (For kernels/architectures without CONFIG_COMPAT,
> > there is a single architecture.) As such, it is possible to construct
> > a mapping of arch/syscall to action, which can be updated as new filters
> > are attached to a process.
> 
> Would you mind educating me how this patch plans on handling MIPS? For
> one kernel they seem to have up to three arch numbers per build,
> AUDIT_ARCH_MIPS{,64,64N32}. Though ARCH_TRACE_IGNORE_COMPAT_SYSCALLS
> does not seem to be defined for MIPS so I'm assuming the syscall
> numbers are the same, but I think it is possible some client uses that
> arch number to pose different constraints for different processes, so
> it would better not accelerate them rather than break them.

I'll take a look, but I'm hoping it won't be too hard to fit into what
I've got designed so far to deal with x86_x32. (Will MIPS want this
optimization at all?)

-- 
Kees Cook
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 4/6] seccomp: Emulate basic filters for constant action results
  2020-09-23 23:47     ` Jann Horn
@ 2020-09-24  7:46       ` Kees Cook
  -1 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-24  7:46 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Paul Moore,
	YiFei Zhu, Linux API, Linux Containers, bpf,
	Tobin Feldman-Fitzthum, Hubertus Franke, Andy Lutomirski,
	Valentin Rothberg, Dimitrios Skarlatos, Jack Chen,
	Josep Torrellas, Tianyin Xu, kernel list

On Thu, Sep 24, 2020 at 01:47:47AM +0200, Jann Horn wrote:
> On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> > This emulates absolutely the most basic seccomp filters to figure out
> > if they will always give the same results for a given arch/nr combo.
> >
> > Nearly all seccomp filters are built from the following ops:
> >
> > BPF_LD  | BPF_W    | BPF_ABS
> > BPF_JMP | BPF_JEQ  | BPF_K
> > BPF_JMP | BPF_JGE  | BPF_K
> > BPF_JMP | BPF_JGT  | BPF_K
> > BPF_JMP | BPF_JSET | BPF_K
> > BPF_JMP | BPF_JA
> > BPF_RET | BPF_K
> >
> > These are now emulated to check for accesses beyond seccomp_data::arch
> > or unknown instructions.
> >
> > Not yet implemented are:
> >
> > BPF_ALU | BPF_AND (generated by libseccomp and Chrome)
> 
> BPF_AND is normally only used on syscall arguments, not on the syscall
> number or the architecture, right? And when a syscall argument is
> loaded, we abort execution anyway. So I think there is no need to
> implement those?

Is that right? I can't actually tell what libseccomp is doing with
ALU|AND. It looks like it's using it for building jump lists?

Paul, Tom, under what cases does libseccomp emit ALU|AND into filters?

> > Suggested-by: Jann Horn <jannh@google.com>
> > Link: https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
> > Signed-off-by: Kees Cook <keescook@chromium.org>
> > ---
> >  kernel/seccomp.c  | 82 ++++++++++++++++++++++++++++++++++++++++++++---
> >  net/core/filter.c |  3 +-
> >  2 files changed, 79 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 111a238bc532..9921f6f39d12 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -610,7 +610,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
> >  {
> >         struct seccomp_filter *sfilter;
> >         int ret;
> > -       const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
> > +       const bool save_orig =
> > +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH)
> > +               true;
> > +#else
> > +               false;
> > +#endif
> 
> You could probably write this as something like:
> 
> const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
> __is_defined(SECCOMP_ARCH);

Ah! Thank you. I went looking for __is_defined() and failed. :)

> 
> [...]
> > diff --git a/net/core/filter.c b/net/core/filter.c
> [...]
> > -static void bpf_release_orig_filter(struct bpf_prog *fp)
> > +void bpf_release_orig_filter(struct bpf_prog *fp)
> >  {
> >         struct sock_fprog_kern *fprog = fp->orig_prog;
> >
> > @@ -1154,6 +1154,7 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
> >                 kfree(fprog);
> >         }
> >  }
> > +EXPORT_SYMBOL_GPL(bpf_release_orig_filter);
> 
> If this change really belongs into this patch (which I don't think it
> does), please describe why in the commit message.

Yup, more cruft I failed to remove.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
  2020-09-24  7:38         ` Kees Cook
@ 2020-09-24  7:51           ` YiFei Zhu
  -1 siblings, 0 replies; 81+ messages in thread
From: YiFei Zhu @ 2020-09-24  7:51 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, Jann Horn,
	YiFei Zhu, linux-api, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 2:38 AM Kees Cook <keescook@chromium.org> wrote:
> > Would you mind educating me how this patch plans on handling MIPS? For
> > one kernel they seem to have up to three arch numbers per build,
> > AUDIT_ARCH_MIPS{,64,64N32}. Though ARCH_TRACE_IGNORE_COMPAT_SYSCALLS
> > does not seem to be defined for MIPS so I'm assuming the syscall
> > numbers are the same, but I think it is possible some client uses that
> > arch number to pose different constraints for different processes, so
> > it would better not accelerate them rather than break them.
>
> I'll take a look, but I'm hoping it won't be too hard to fit into what
> I've got designed so far to deal with x86_x32. (Will MIPS want this
> optimization at all?)

I just took a slightly closer look at MIPS and it seems that they have
sparse syscall numbers (defines HAVE_SPARSE_SYSCALL_NR). I don't know
how the different "regions of syscall numbers" are affected by arch
numbers, however...

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
  2020-09-24  7:36       ` Kees Cook
@ 2020-09-24  8:07         ` YiFei Zhu
  -1 siblings, 0 replies; 81+ messages in thread
From: YiFei Zhu @ 2020-09-24  8:07 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Jann Horn,
	YiFei Zhu, Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Dimitrios Skarlatos, Andy Lutomirski, Valentin Rothberg,
	Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 2:37 AM Kees Cook <keescook@chromium.org> wrote:
> >
> > This belongs over into patch 1.
>
> Thanks! I was rushing to get this posted so YiFei Zhu wouldn't spend
> time fighting with arch and Kconfig stuff. :) I'll clean this (and the
> other random cruft) up.

Wait, what? I'm sorry. We have already begun fixing the mentioned
issues (mostly the split bitmaps for different arches). Although yes
it's nice to have another implementation to refer to so we get the
best of both worlds (and yes I'm already copying some of the code I
think are better here over there), don't you think it's not nice to
say "Hey I've worked on this in June, it needed rework but I didn't
send the newer version. Now you sent yours so I'll rush mine so your
work is redundant."?

That said, I do think this should be configurable. Users would be free
to experiment with the bitmap on or off, just like users may turn
seccomp off entirely. A choice also allows users to select different
implementations, a few whom I work with have ideas on how to
accelerate / cache argument dependent syscalls, for example.

YiFei Zhu

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
  2020-09-24  8:07         ` YiFei Zhu
@ 2020-09-24  8:15           ` Kees Cook
  -1 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-24  8:15 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Jann Horn,
	YiFei Zhu, Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Dimitrios Skarlatos, Andy Lutomirski, Valentin Rothberg,
	Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 03:07:23AM -0500, YiFei Zhu wrote:
> On Thu, Sep 24, 2020 at 2:37 AM Kees Cook <keescook@chromium.org> wrote:
> > >
> > > This belongs over into patch 1.
> >
> > Thanks! I was rushing to get this posted so YiFei Zhu wouldn't spend
> > time fighting with arch and Kconfig stuff. :) I'll clean this (and the
> > other random cruft) up.
> 
> Wait, what? I'm sorry. We have already begun fixing the mentioned
> issues (mostly the split bitmaps for different arches). Although yes
> it's nice to have another implementation to refer to so we get the
> best of both worlds (and yes I'm already copying some of the code I
> think are better here over there), don't you think it's not nice to
> say "Hey I've worked on this in June, it needed rework but I didn't
> send the newer version. Now you sent yours so I'll rush mine so your
> work is redundant."?

I was trying to be helpful: you hadn't seen the RFC, and it was missing
the emulator piece, which I wanted to be small, so I got it out the
door today. I didn't want you to think you needed to port the larger
emulator over, for example.

> That said, I do think this should be configurable. Users would be free
> to experiment with the bitmap on or off, just like users may turn
> seccomp off entirely. A choice also allows users to select different
> implementations, a few whom I work with have ideas on how to
> accelerate / cache argument dependent syscalls, for example.

I'm open to ideas, but I want to have a non-optional performance
improvement as the first step. :) "seccomp is magically faster" was my
driving goal.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
  2020-09-24  8:15           ` Kees Cook
@ 2020-09-24  8:22             ` YiFei Zhu
  -1 siblings, 0 replies; 81+ messages in thread
From: YiFei Zhu @ 2020-09-24  8:22 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Jann Horn,
	YiFei Zhu, Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Dimitrios Skarlatos, Andy Lutomirski, Valentin Rothberg,
	Hubertus Franke, Jack Chen, Josep Torrellas, bpf, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 3:15 AM Kees Cook <keescook@chromium.org> wrote:
> I was trying to be helpful: you hadn't seen the RFC, and it was missing
> the emulator piece, which I wanted to be small, so I put got it out the
> door today. I didn't want you to think you needed to port the larger
> emulator over, for example.

There's no architecture-dependent code in the emulator. It just has to
iterate through all the arch numbers. So I don't know what you are
referring to by "port ... over".
The logic is simple. If the emulator determines the filter must be an
allow for a given arch / syscall pair, then it is "cached by bitmap".

> I'm open to ideas, but I want to have a non-optional performance
> improvement as the first step. :)

How about "performance improvement by default"? It's not like most end
users / distros would turn off something that's enabled by default
when they upgrade to a new kernel.

YiFei Zhu

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
  2020-09-24  7:36       ` Kees Cook
@ 2020-09-24 12:28         ` Jann Horn
  -1 siblings, 0 replies; 81+ messages in thread
From: Jann Horn via Containers @ 2020-09-24 12:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, YiFei Zhu,
	Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 9:37 AM Kees Cook <keescook@chromium.org> wrote:
> On Thu, Sep 24, 2020 at 02:25:03AM +0200, Jann Horn wrote:
> > On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
[...]
> > (However, a "which syscalls have a fixed result" bitmap might make
> > sense if we want to export the list of permitted syscalls as a text
> > file in procfs, as I mentioned over at
> > <https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/>.)
>
> I haven't found a data structure I'm happy with for this. It seemed like
> NR_syscalls * sizeof(u32) was rather a lot (i.e. to store the BPF_RET
> value). However, let me discuss that more in the "why in thread?"
> below...
[...]
> > > +#endif
> > > +};
> > > +
> > >  struct seccomp_filter;
> > >  /**
> > >   * struct seccomp - the state of a seccomp'ed process
> > > @@ -45,6 +56,13 @@ struct seccomp {
> > >  #endif
> > >         atomic_t filter_count;
> > >         struct seccomp_filter *filter;
> > > +       struct seccomp_bitmaps native;
> > > +#ifdef CONFIG_COMPAT
> > > +       struct seccomp_bitmaps compat;
> > > +#endif
> > > +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> > > +       struct seccomp_bitmaps multiplex;
> > > +#endif
> >
> > Why do we have one bitmap per thread (in struct seccomp) instead of
> > putting the bitmap for a given filter and all its ancestors into the
> > seccomp_filter?
>
> I explicitly didn't want to add code that was run per-filter; I wanted
> O(1), not O(n) even if the n work was a small constant. There is
> obviously a memory/perf tradeoff here. I wonder if the middle ground
> would be to put a bitmap and "constant action" results in the filter....
> oh duh. The "top" filter is already going to be composed with its
> ancestors. That's all that needs to be checked.

Yeah - when adding a new filter, you can evaluate each syscall for the
newly added filter. For both the "accept" bitmap and the "constant
action" bitmap, you can AND the bitmap of the existing filter into the
new filter's bitmap.

Although actually, I think my "constant action" bitmap proposal was a
stupid idea... when someone asks for an analysis of the filter via
procfs (which shouldn't be a common action, so speed doesn't really
matter there), we can just dynamically evaluate the entire filter tree
using our filter-evaluation helper. Let's drop the "constant action"
bitmap idea.

> Then the tri-state can be:
>
> bitmap accept[NR_syscalls]: accept or check "known" bitmap
> bitmap filter[NR_syscalls]: run filter or return known action
> u32 known_action[NR_syscalls];

Actually, maybe we should just have an "accept" list, nothing else, to
keep it straightforward and with minimal memory usage...

> (times syscall numbering "architecture" counts)
>
> Though perhaps it would be just as fast as:
>
> bitmap run_filter[NR_syscalls]: run filter or return known_action
> u32 known_action[NR_syscalls];
>
> where accept isn't treated special...

Using a bitset for accepted syscalls instead of a big array would
probably have far less cache impact on the syscall entry path. If we
just have an "accept" bitmask, we can store information about 512
syscalls per cache line - that's almost the entire syscall table. In
contrast, a known_action list can only store information about 16
syscalls in a cache line, and we'd additionally still have to query
the "filter" bitmap.

I think our goal here should be that if a syscall is always allowed,
seccomp should execute the smallest amount of instructions we can get
away with, and touch the smallest amount of memory possible (and
preferably that memory should be shared between threads). The bitmap
fastpath should probably also avoid populate_seccomp_data().

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
@ 2020-09-24 12:28         ` Jann Horn
  0 siblings, 0 replies; 81+ messages in thread
From: Jann Horn @ 2020-09-24 12:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: YiFei Zhu, Christian Brauner, Tycho Andersen, Andy Lutomirski,
	Will Drewry, Andrea Arcangeli, Giuseppe Scrivano,
	Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg,
	Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, bpf,
	Linux Containers, Linux API, kernel list

On Thu, Sep 24, 2020 at 9:37 AM Kees Cook <keescook@chromium.org> wrote:
> On Thu, Sep 24, 2020 at 02:25:03AM +0200, Jann Horn wrote:
> > On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
[...]
> > (However, a "which syscalls have a fixed result" bitmap might make
> > sense if we want to export the list of permitted syscalls as a text
> > file in procfs, as I mentioned over at
> > <https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/>.)
>
> I haven't found a data structure I'm happy with for this. It seemed like
> NR_syscalls * sizeof(u32) was rather a lot (i.e. to store the BPF_RET
> value). However, let me discuss that more in the "why in thread?"
> below...
[...]
> > > +#endif
> > > +};
> > > +
> > >  struct seccomp_filter;
> > >  /**
> > >   * struct seccomp - the state of a seccomp'ed process
> > > @@ -45,6 +56,13 @@ struct seccomp {
> > >  #endif
> > >         atomic_t filter_count;
> > >         struct seccomp_filter *filter;
> > > +       struct seccomp_bitmaps native;
> > > +#ifdef CONFIG_COMPAT
> > > +       struct seccomp_bitmaps compat;
> > > +#endif
> > > +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> > > +       struct seccomp_bitmaps multiplex;
> > > +#endif
> >
> > Why do we have one bitmap per thread (in struct seccomp) instead of
> > putting the bitmap for a given filter and all its ancestors into the
> > seccomp_filter?
>
> I explicitly didn't want to add code that was run per-filter; I wanted
> O(1), not O(n) even if the n work was a small constant. There is
> obviously a memory/perf tradeoff here. I wonder if the middle ground
> would be to put a bitmap and "constant action" results in the filter....
> oh duh. The "top" filter is already going to be composed with its
> ancestors. That's all that needs to be checked.

Yeah - when adding a new filter, you can evaluate each syscall for the
newly added filter. For both the "accept" bitmap and the "constant
action" bitmap, you can AND the bitmap of the existing filter into the
new filter's bitmap.
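A minimal sketch of that composition step (hypothetical structure and helper names, not the patch's actual code):

```c
#include <assert.h>
#include <stddef.h>

#define NR_SYSCALLS 440
#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define BITMAP_WORDS ((NR_SYSCALLS + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical per-filter "always allow" bitmap: one bit per syscall. */
struct filter_bitmap {
	unsigned long allow[BITMAP_WORDS];
};

/*
 * On attach, AND the ancestor's bitmap into the new filter's bitmap:
 * a syscall stays "always allowed" only if every filter in the chain
 * always allows it, so the newest filter's bitmap summarizes the
 * whole stack and only one bitmap needs checking at syscall time.
 */
static void bitmap_compose(struct filter_bitmap *child,
			   const struct filter_bitmap *ancestor)
{
	for (size_t i = 0; i < BITMAP_WORDS; i++)
		child->allow[i] &= ancestor->allow[i];
}
```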

Although actually, I think my "constant action" bitmap proposal was a
stupid idea... when someone asks for an analysis of the filter via
procfs (which shouldn't be a common action, so speed doesn't really
matter there), we can just dynamically evaluate the entire filter tree
using our filter-evaluation helper. Let's drop the "constant action"
bitmap idea.

> Then the tri-state can be:
>
> bitmap accept[NR_syscalls]: accept or check "known" bitmap
> bitmap filter[NR_syscalls]: run filter or return known action
> u32 known_action[NR_syscalls];

Actually, maybe we should just have an "accept" list, nothing else, to
keep it straightforward and with minimal memory usage...

> (times syscall numbering "architecture" counts)
>
> Though perhaps it would be just as fast as:
>
> bitmap run_filter[NR_syscalls]: run filter or return known_action
> u32 known_action[NR_syscalls];
>
> where accept isn't treated special...

Using a bitset for accepted syscalls instead of a big array would
probably have far less cache impact on the syscall entry path. If we
just have an "accept" bitmask, we can store information about 512
syscalls per cache line - that's almost the entire syscall table. In
contrast, a known_action list can only store information about 16
syscalls in a cache line, and we'd additionally still have to query
the "filter" bitmap.

I think our goal here should be that if a syscall is always allowed,
seccomp should execute the smallest amount of instructions we can get
away with, and touch the smallest amount of memory possible (and
preferably that memory should be shared between threads). The bitmap
fastpath should probably also avoid populate_seccomp_data().
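As a sketch (made-up helper name, not the series' code), that fast path is one load and one bit test before any seccomp_data is built; with one bit per syscall, a 64-byte cache line covers 8 * 64 = 512 syscalls, while a u32 action per syscall would cover only 64 / 4 = 16:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SYSCALLS 440
#define BITS_PER_LONG (8 * sizeof(unsigned long))

/*
 * Hypothetical fast path: consult the shared "always allow" bitmap
 * before populating seccomp_data or running any BPF. Out-of-range
 * syscall numbers conservatively fall through to the full filter.
 */
static bool bitmap_allows(const unsigned long *allow, unsigned int nr)
{
	if (nr >= NR_SYSCALLS)
		return false;
	return allow[nr / BITS_PER_LONG] & (1UL << (nr % BITS_PER_LONG));
}
```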

^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH 3/6] seccomp: Implement constant action bitmaps
  2020-09-24 12:28         ` Jann Horn
@ 2020-09-24 12:37           ` David Laight
  -1 siblings, 0 replies; 81+ messages in thread
From: David Laight @ 2020-09-24 12:37 UTC (permalink / raw)
  To: 'Jann Horn', Kees Cook
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, YiFei Zhu,
	Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

From: Jann Horn
> Sent: 24 September 2020 13:29
...
> I think our goal here should be that if a syscall is always allowed,
> seccomp should execute the smallest amount of instructions we can get
> away with, and touch the smallest amount of memory possible (and
> preferably that memory should be shared between threads). The bitmap
> fastpath should probably also avoid populate_seccomp_data().

If most syscalls are expected to be allowed then an initial:
	if (global_mask & (1ull << (syscall_number & 63)))
test can be used to skip any further lookups.

Although ISTR someone suggesting that the global_mask should
be per-cpu because even shared read-only cache lines were
expensive on some architecture.
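Reading the suggestion literally (a hypothetical sketch, and note the fold): because the syscall number is reduced mod 64, one shared 64-bit mask can only act as a conservative prefilter. A bit may be set only if every syscall number mapping to that bit is always allowed; a clear bit falls through to the full lookup:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the suggested global prefilter. Syscall numbers alias
 * mod 64, so bit b may only be set when every syscall with
 * (nr & 63) == b is always allowed; a set bit skips further lookups.
 */
static bool prefilter_allows(unsigned long long global_mask, unsigned int nr)
{
	return global_mask & (1ull << (nr & 63));
}
```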

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
  2020-09-24 12:37           ` David Laight
@ 2020-09-24 12:56             ` Jann Horn
  -1 siblings, 0 replies; 81+ messages in thread
From: Jann Horn via Containers @ 2020-09-24 12:56 UTC (permalink / raw)
  To: David Laight
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	YiFei Zhu, Linux API, Linux Containers, bpf,
	Tobin Feldman-Fitzthum, Hubertus Franke, Andy Lutomirski,
	Valentin Rothberg, Dimitrios Skarlatos, Jack Chen,
	Josep Torrellas, Tianyin Xu, kernel list

On Thu, Sep 24, 2020 at 2:37 PM David Laight <David.Laight@aculab.com> wrote:
> From: Jann Horn
> > Sent: 24 September 2020 13:29
> ...
> > I think our goal here should be that if a syscall is always allowed,
> > seccomp should execute the smallest amount of instructions we can get
> > away with, and touch the smallest amount of memory possible (and
> > preferably that memory should be shared between threads). The bitmap
> > fastpath should probably also avoid populate_seccomp_data().
>
> If most syscalls are expected to be allowed

E.g. OpenSSH's privilege-separated network process only permits
something like 26 specific syscalls.

> then an initial:
>         if (global_mask & (1ull << (syscall_number & 63)))
> test can be used to skip any further lookups.

I guess that would work in principle, but I'm not convinced that it's
worth adding another layer of global caching just to avoid one load
instruction for locating the correct bitmask from the current process.
Especially when it only really provides a benefit when people use
seccomp improperly - for application sandboxing, you're supposed to
only permit a list of specific syscalls, the smaller the better.

> Although ISTR someone suggesting that the global_mask should
> be per-cpu because even shared read-only cache lines were
> expensive on some architecture.

If an architecture did make that expensive, I think we have bigger
problems to worry about than a little bitmap in seccomp. (Like the
system call table.) So I think we don't have to worry about that here.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 4/6] seccomp: Emulate basic filters for constant action results
  2020-09-23 23:29   ` Kees Cook
@ 2020-09-24 13:16   ` kernel test robot
  -1 siblings, 0 replies; 81+ messages in thread
From: kernel test robot @ 2020-09-24 13:16 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2578 bytes --]

Hi Kees,

I love your patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]
[also build test WARNING on bpf/master kees/for-next/pstore v5.9-rc6 next-20200924]
[cannot apply to tip/x86/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Kees-Cook/seccomp-Implement-constant-action-bitmaps/20200924-073617
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: x86_64-randconfig-a011-20200924 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project d6ac649ccda289ecc2d2c0cb51892d57e8ec328c)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> net/core/filter.c:1148:6: warning: no previous prototype for function 'bpf_release_orig_filter' [-Wmissing-prototypes]
   void bpf_release_orig_filter(struct bpf_prog *fp)
        ^
   net/core/filter.c:1148:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void bpf_release_orig_filter(struct bpf_prog *fp)
   ^
   static 
   1 warning generated.

# https://github.com/0day-ci/linux/commit/84e4ee5c09f6f448c7036b919ec40a779189db77
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Kees-Cook/seccomp-Implement-constant-action-bitmaps/20200924-073617
git checkout 84e4ee5c09f6f448c7036b919ec40a779189db77
vim +/bpf_release_orig_filter +1148 net/core/filter.c

  1147	
> 1148	void bpf_release_orig_filter(struct bpf_prog *fp)
  1149	{
  1150		struct sock_fprog_kern *fprog = fp->orig_prog;
  1151	
  1152		if (fprog) {
  1153			kfree(fprog->filter);
  1154			kfree(fprog);
  1155		}
  1156	}
  1157	EXPORT_SYMBOL_GPL(bpf_release_orig_filter);
  1158	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 33047 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
  2020-09-23 23:29 ` Kees Cook
@ 2020-09-24 13:40   ` Rasmus Villemoes
  -1 siblings, 0 replies; 81+ messages in thread
From: Rasmus Villemoes @ 2020-09-24 13:40 UTC (permalink / raw)
  To: Kees Cook, YiFei Zhu
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, Jann Horn,
	linux-api, containers, Tobin Feldman-Fitzthum, Hubertus Franke,
	Andy Lutomirski, Valentin Rothberg, Dimitrios Skarlatos,
	Jack Chen, Josep Torrellas, Tianyin Xu, linux-kernel

On 24/09/2020 01.29, Kees Cook wrote:
> rfc: https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/
> alternative: https://lore.kernel.org/containers/cover.1600661418.git.yifeifz2@illinois.edu/
> v1:
> - rebase to for-next/seccomp
> - finish X86_X32 support for both pinning and bitmaps
> - replace TLB magic with Jann's emulator
> - add JSET insn
> 
> TODO:
> - add ALU|AND insn
> - significantly more testing
> 
> Hi,
> 
> This is a refresh of my earlier constant action bitmap series. It looks
> like the RFC was missed on the container list, so I've CCed it now. :)
> I'd like to work from this series, as it handles the multi-architecture
> stuff.

So, I agree with Jann's point that the only thing that matters is that
always-allowed syscalls are indeed allowed fast.

But one thing I'm wondering about and I haven't seen addressed anywhere:
Why build the bitmap on the kernel side (with all the complexity of
having to emulate the filter for all syscalls)? Why can't userspace just
hand the kernel "here's a new filter: the syscalls in this bitmap are
always allowed noquestionsasked, for the rest, run this bpf". Sure, that
might require a new syscall or extending seccomp(2) somewhat, but isn't
that a _lot_ simpler? It would probably also mean that the bpf we do get
handed is a lot smaller. Userspace might need to pass a couple of
bitmaps, one for each relevant arch, but you get the overall idea.

I'm also a bit worried about the performance of doing that emulation;
that's constant extra overhead for, say, launching a docker container.

Regardless of how the kernel's bitmap gets created, something like

+	if (nr < NR_syscalls) {
+		if (test_bit(nr, bitmaps->allow)) {
+			*filter_ret = SECCOMP_RET_ALLOW;
+			return true;
+		}

probably wants some nospec protection somewhere to avoid the irony of
seccomp() being used actively by bad guys.
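For illustration, a userspace model of the kernel's array_index_nospec() idea (the real helpers live in <linux/nospec.h>; the names below are stand-ins): the index is clamped with a branchless mask so a mispredicted bounds check cannot steer an out-of-bounds bitmap load:

```c
#include <assert.h>

#define LONG_BITS (8 * sizeof(long))

/*
 * Model of the generic array_index_mask_nospec(): all-ones when
 * index < size, zero otherwise, computed without a branch so it also
 * holds under misspeculation.
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
	return ~(long)(index | (size - 1UL - index)) >> (LONG_BITS - 1);
}

/* Clamp index to 0 when it is out of bounds. */
static unsigned long index_nospec(unsigned long index, unsigned long size)
{
	return index & index_mask_nospec(index, size);
}
```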

Rasmus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
  2020-09-24 13:40   ` Rasmus Villemoes
@ 2020-09-24 13:58     ` YiFei Zhu
  -1 siblings, 0 replies; 81+ messages in thread
From: YiFei Zhu @ 2020-09-24 13:58 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	Jann Horn, YiFei Zhu, Linux API, Linux Containers,
	Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Andy Lutomirski,
	Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas,
	bpf, Tianyin Xu, kernel list

On Thu, Sep 24, 2020 at 8:46 AM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
> But one thing I'm wondering about and I haven't seen addressed anywhere:
> Why build the bitmap on the kernel side (with all the complexity of
> having to emulate the filter for all syscalls)? Why can't userspace just
> hand the kernel "here's a new filter: the syscalls in this bitmap are
> always allowed noquestionsasked, for the rest, run this bpf". Sure, that
> might require a new syscall or extending seccomp(2) somewhat, but isn't
> that a _lot_ simpler? It would probably also mean that the bpf we do get
> handed is a lot smaller. Userspace might need to pass a couple of
> bitmaps, one for each relevant arch, but you get the overall idea.

Perhaps. The thing is, the current API expects any filter attaches to
be "additive". If a new filter gets attached that says "disallow read"
then no matter whatever has been attached already, "read" shall not be
allowed at the next syscall, bypassing all previous allowlist bitmaps
(so you need to emulate the bpf anyway here?). We should also not
have an API that could let anyone escape the seccomp jail. Say "prctl"
is permitted but "read" is not permitted, one must not be allowed to
attach a bitmap so that "read" now appears in the allowlist. The only
way this could potentially work is to attach a BPF filter and a bitmap
at the same time in the same syscall, which might mean API redesign?

> I'm also a bit worried about the performance of doing that emulation;
> that's constant extra overhead for, say, launching a docker container.

IMO, launching a docker container is so expensive this should be negligible.

YiFei Zhu

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
  2020-09-24 13:40   ` Rasmus Villemoes
@ 2020-09-24 14:05     ` Jann Horn
  -1 siblings, 0 replies; 81+ messages in thread
From: Jann Horn via Containers @ 2020-09-24 14:05 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	YiFei Zhu, Linux API, Linux Containers, bpf,
	Tobin Feldman-Fitzthum, Hubertus Franke, Andy Lutomirski,
	Valentin Rothberg, Dimitrios Skarlatos, Jack Chen,
	Josep Torrellas, Tianyin Xu, kernel list

On Thu, Sep 24, 2020 at 3:40 PM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
> On 24/09/2020 01.29, Kees Cook wrote:
> > rfc: https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/
> > alternative: https://lore.kernel.org/containers/cover.1600661418.git.yifeifz2@illinois.edu/
> > v1:
> > - rebase to for-next/seccomp
> > - finish X86_X32 support for both pinning and bitmaps
> > - replace TLB magic with Jann's emulator
> > - add JSET insn
> >
> > TODO:
> > - add ALU|AND insn
> > - significantly more testing
> >
> > Hi,
> >
> > This is a refresh of my earlier constant action bitmap series. It looks
> > like the RFC was missed on the container list, so I've CCed it now. :)
> > I'd like to work from this series, as it handles the multi-architecture
> > stuff.
>
> So, I agree with Jann's point that the only thing that matters is that
> always-allowed syscalls are indeed allowed fast.
>
> But one thing I'm wondering about and I haven't seen addressed anywhere:
> Why build the bitmap on the kernel side (with all the complexity of
> having to emulate the filter for all syscalls)? Why can't userspace just
> hand the kernel "here's a new filter: the syscalls in this bitmap are
> always allowed noquestionsasked, for the rest, run this bpf". Sure, that
> might require a new syscall or extending seccomp(2) somewhat, but isn't
> that a _lot_ simpler? It would probably also mean that the bpf we do get
> handed is a lot smaller. Userspace might need to pass a couple of
> bitmaps, one for each relevant arch, but you get the overall idea.

It's not really a lot of logic though; and there are a bunch of
different things in userspace that talk to the seccomp() syscall that
would have to be updated if we made this part of the UAPI. libseccomp,
Chrome, Android, OpenSSH, bubblewrap, ... - overall, if we can make
the existing interface faster, it'll be less effort, and there will be
less code duplication (because otherwise every user of seccomp will
have to implement the same thing in userspace).

Doing this internally with the old UAPI also means that we're not
creating any additional commitments in terms of UAPI - if we come up
with something better in the future, we can just rip this stuff out.
If we created a new UAPI, we'd have to stay, in some form, compatible
with it forever.

> I'm also a bit worried about the performance of doing that emulation;
> that's constant extra overhead for, say, launching a docker container.
>
> Regardless of how the kernel's bitmap gets created, something like
>
> +       if (nr < NR_syscalls) {
> +               if (test_bit(nr, bitmaps->allow)) {
> +                       *filter_ret = SECCOMP_RET_ALLOW;
> +                       return true;
> +               }
>
> probably wants some nospec protection somewhere to avoid the irony of
> seccomp() being used actively by bad guys.

Good point...

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 4/6] seccomp: Emulate basic filters for constant action results
  2020-09-24  7:46       ` Kees Cook
@ 2020-09-24 15:28         ` Paul Moore
  -1 siblings, 0 replies; 81+ messages in thread
From: Paul Moore @ 2020-09-24 15:28 UTC (permalink / raw)
  To: Kees Cook, Tom Hromatka
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, bpf, Jann Horn,
	YiFei Zhu, Linux API, Linux Containers, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Jack Chen, Josep Torrellas, Tianyin Xu,
	kernel list

On Thu, Sep 24, 2020 at 3:46 AM Kees Cook <keescook@chromium.org> wrote:
> On Thu, Sep 24, 2020 at 01:47:47AM +0200, Jann Horn wrote:
> > On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> > > This emulates absolutely the most basic seccomp filters to figure out
> > > if they will always give the same results for a given arch/nr combo.
> > >
> > > Nearly all seccomp filters are built from the following ops:
> > >
> > > BPF_LD  | BPF_W    | BPF_ABS
> > > BPF_JMP | BPF_JEQ  | BPF_K
> > > BPF_JMP | BPF_JGE  | BPF_K
> > > BPF_JMP | BPF_JGT  | BPF_K
> > > BPF_JMP | BPF_JSET | BPF_K
> > > BPF_JMP | BPF_JA
> > > BPF_RET | BPF_K
> > >
> > > These are now emulated to check for accesses beyond seccomp_data::arch
> > > or unknown instructions.
> > >
> > > Not yet implemented are:
> > >
> > > BPF_ALU | BPF_AND (generated by libseccomp and Chrome)
> >
> > BPF_AND is normally only used on syscall arguments, not on the syscall
> > number or the architecture, right? And when a syscall argument is
> > loaded, we abort execution anyway. So I think there is no need to
> > implement those?
>
> Is that right? I can't actually tell what libseccomp is doing with
> ALU|AND. It looks like it's using it for building jump lists?

There is an ALU|AND op in the jump resolution code, but that is really
just for when libseccomp needs to fix up the accumulator because a code
block is expecting a masked value (right now that would only be a
syscall argument, not the syscall number itself).

> Paul, Tom, under what cases does libseccomp emit ALU|AND into filters?

Presently the only place where libseccomp uses ALU|AND is when the
masked equality comparison is used for comparing syscall arguments
(SCMP_CMP_MASKED_EQ).  I can't honestly say I have any good
information about how often that is used by libseccomp callers, but if
I do a quick search on GitHub for "SCMP_CMP_MASKED_EQ" I see 2k worth
of code hits; take that for whatever it is worth.  Tom may have some
more/better information.

Of course no promises on future use :)  As one quick example, I keep
thinking about adding the instruction pointer to the list of things
that can be compared as part of a libseccomp rule, and if we do that I
would expect that we would want to also allow a masked comparison (and
utilize another ALU|AND bpf op there).  However, I'm not sure how
useful that would be in practice.
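
[Editorial aside: for readers unfamiliar with the op being discussed, a masked equality comparison like SCMP_CMP_MASKED_EQ lowers to a classic-BPF ALU|AND on the accumulator followed by a JEQ against the (masked) constant. The sketch below models only those instruction semantics in userspace; it is not the libseccomp code generator, and libseccomp may pre-mask the datum at rule-build time rather than at match time.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified accumulator semantics for the instruction pair that
 * implements a masked equality test on a syscall argument:
 *
 *   BPF_LD  | BPF_W   | BPF_ABS   load the argument into A
 *   BPF_ALU | BPF_AND | BPF_K     A &= mask
 *   BPF_JMP | BPF_JEQ | BPF_K     compare A against the masked datum
 */
static int masked_eq(uint32_t arg, uint32_t mask, uint32_t datum)
{
	uint32_t acc = arg;            /* BPF_LD  | BPF_W   | BPF_ABS */
	acc &= mask;                   /* BPF_ALU | BPF_AND | BPF_K   */
	return acc == (datum & mask);  /* BPF_JMP | BPF_JEQ | BPF_K   */
}
```

A typical use is matching a single flag bit in, say, an open(2) flags argument while ignoring the rest of the word.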

-- 
paul moore
www.paul-moore.com
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
  2020-09-23 23:29 ` Kees Cook
@ 2020-09-24 18:57   ` Andrea Arcangeli
  -1 siblings, 0 replies; 81+ messages in thread
From: Andrea Arcangeli @ 2020-09-24 18:57 UTC (permalink / raw)
  To: Kees Cook
  Cc: Giuseppe Scrivano, Will Drewry, bpf, Jann Horn, YiFei Zhu,
	linux-api, containers, Tobin Feldman-Fitzthum, Hubertus Franke,
	Andy Lutomirski, Valentin Rothberg, Dimitrios Skarlatos,
	Jack Chen, Josep Torrellas, Tianyin Xu, linux-kernel

Hello,

I'm posting this only for the record, feel free to ignore.

On Wed, Sep 23, 2020 at 04:29:17PM -0700, Kees Cook wrote:
> rfc: https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/
> alternative: https://lore.kernel.org/containers/cover.1600661418.git.yifeifz2@illinois.edu/
> v1:
> - rebase to for-next/seccomp
> - finish X86_X32 support for both pinning and bitmaps

It's pretty clear the O(1) seccomp filter bitmap was first
proposed by your RFC in June (albeit it was located in the wrong place
and is still in the wrong place in v1).

> - replace TLB magic with Jann's emulator
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
That's a pretty fundamental change in v1 compared to the
non-competing TLB magic technique you used in the RFC last June.

The bitmap isn't the clever part of the patch, the bitmap can be
reviewed in seconds, the difficult part to implement and to review is
how you fill the bitmap and in that respect there's absolutely nothing
in common between the "rfc:" and the "alternative" link.

In June your bitmap-filling engine was this:

https://lore.kernel.org/lkml/20200616074934.1600036-5-keescook@chromium.org/

Then on Sep 21 YiFei Zhu posted his innovative BPF emulation
that obsoleted your TLB magic of June:

https://lists.linuxfoundation.org/pipermail/containers/2020-September/042153.html

And on Sep 23 instead of collaborating and helping YiFei Zhu to
improve his BPF emulator, you posted the same technique that looks
remarkably similar without giving YiFei Zhu any attribution and you
instead attribute the whole idea to Jann Horn:

https://lkml.kernel.org/r/20200923232923.3142503-5-keescook@chromium.org

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
  2020-09-24 18:57   ` Andrea Arcangeli
@ 2020-09-24 19:18     ` Jann Horn
  -1 siblings, 0 replies; 81+ messages in thread
From: Jann Horn via Containers @ 2020-09-24 19:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Giuseppe Scrivano, Will Drewry, Kees Cook, YiFei Zhu, Linux API,
	Linux Containers, bpf, Tobin Feldman-Fitzthum, Hubertus Franke,
	Andy Lutomirski, Valentin Rothberg, Dimitrios Skarlatos,
	Jack Chen, Josep Torrellas, Tianyin Xu, kernel list

On Thu, Sep 24, 2020 at 8:57 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Hello,
>
> I'm posting this only for the record, feel free to ignore.
>
> On Wed, Sep 23, 2020 at 04:29:17PM -0700, Kees Cook wrote:
> > rfc: https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/
> > alternative: https://lore.kernel.org/containers/cover.1600661418.git.yifeifz2@illinois.edu/
> > v1:
> > - rebase to for-next/seccomp
> > - finish X86_X32 support for both pinning and bitmaps
>
> It's pretty clear the O(1) seccomp filter bitmap was first
> proposed by your RFC in June (albeit it was located in the wrong place
> and is still in the wrong place in v1).
>
> > - replace TLB magic with Jann's emulator
>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> That's a pretty fundamental change in v1 compared to the
> non-competing TLB magic technique you used in the RFC last June.
>
> The bitmap isn't the clever part of the patch, the bitmap can be
> reviewed in seconds, the difficult part to implement and to review is
> how you fill the bitmap and in that respect there's absolutely nothing
> in common between the "rfc:" and the "alternative" link.
>
> In June your bitmap-filling engine was this:
>
> https://lore.kernel.org/lkml/20200616074934.1600036-5-keescook@chromium.org/
>
> Then on Sep 21 YiFei Zhu posted his innovative BPF emulation
> that obsoleted your TLB magic of June:
>
> https://lists.linuxfoundation.org/pipermail/containers/2020-September/042153.html
>
> And on Sep 23 instead of collaborating and helping YiFei Zhu to
> improve his BPF emulator, you posted the same technique that looks
> remarkably similar without giving YiFei Zhu any attribution and you
> instead attribute the whole idea to Jann Horn:
>
> https://lkml.kernel.org/r/20200923232923.3142503-5-keescook@chromium.org

You're missing that I did suggest the BPF emulation approach (with
code very similar to Kees' current code) back in June:
https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
       [not found]   ` <9dbe8e3bbdad43a1872202ff38c34ca2@DM5PR11MB1692.namprd11.prod.outlook.com>
@ 2020-09-24 19:48       ` Tianyin Xu
  0 siblings, 0 replies; 81+ messages in thread
From: Tianyin Xu @ 2020-09-24 19:48 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook, Zhu,
	YiFei, Linux API, Linux Containers, bpf, Tobin Feldman-Fitzthum,
	Hubertus Franke, Andy Lutomirski, Valentin Rothberg,
	Dimitrios Skarlatos, Chen, Jianyan, Torrellas, Josep,
	kernel list

On Thu, Sep 24, 2020 at 2:19 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Sep 24, 2020 at 8:57 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Hello,
> >
> > I'm posting this only for the record, feel free to ignore.
> >
> > On Wed, Sep 23, 2020 at 04:29:17PM -0700, Kees Cook wrote:
> > > rfc: https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/
> > > alternative: https://lore.kernel.org/containers/cover.1600661418.git.yifeifz2@illinois.edu/
> > > v1:
> > > - rebase to for-next/seccomp
> > > - finish X86_X32 support for both pinning and bitmaps
> >
> > It's pretty clear the O(1) seccomp filter bitmap was first
> > proposed by your RFC in June (albeit it was located in the wrong place
> > and is still in the wrong place in v1).
> >
> > > - replace TLB magic with Jann's emulator
> >     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > That's a pretty fundamental change in v1 compared to the
> > non-competing TLB magic technique you used in the RFC last June.
> >
> > The bitmap isn't the clever part of the patch, the bitmap can be
> > reviewed in seconds, the difficult part to implement and to review is
> > how you fill the bitmap and in that respect there's absolutely nothing
> > in common between the "rfc:" and the "alternative" link.
> >
> > In June your bitmap-filling engine was this:
> >
> > https://lore.kernel.org/lkml/20200616074934.1600036-5-keescook@chromium.org/
> >
> > Then on Sep 21 YiFei Zhu posted his innovative BPF emulation
> > that obsoleted your TLB magic of June:
> >
> > https://lists.linuxfoundation.org/pipermail/containers/2020-September/042153.html
> >
> > And on Sep 23 instead of collaborating and helping YiFei Zhu to
> > improve his BPF emulator, you posted the same technique that looks
> > remarkably similar without giving YiFei Zhu any attribution and you
> > instead attribute the whole idea to Jann Horn:
> >
> > https://lkml.kernel.org/r/20200923232923.3142503-5-keescook@chromium.org
>
> You're missing that I did suggest the BPF emulation approach (with
> code very similar to Kees' current code) back in June:
> https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/

I don't see it as a bad thing that two (or three?) teams came up with
the same ideas, and I'm actually happy that the final solution is
largely converging, thanks to all the discussions so far.

It's better to collaborate and help each other, instead of racing on
two separate patches, and everyone involved should be acknowledged.

Not sure if it matters, but we actually started working on a seccomp
cache at the end of 2018, and our idea is to also support arguments in
the cache. We still have the paper draft sent to an academic conference
in Apr 2019 :)  Unfortunately, our paper kept being rejected until
recently. Sadly, as academics, we prioritized papers over upstream.

I'm disclosing this not to dismiss anyone's innovations and hard work.
I really do think we should work together to merge the right pieces of
code, instead of competing with or ignoring others' efforts.

--
Tianyin Xu
University of Illinois at Urbana-Champaign
https://tianyin.github.io/

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 4/6] seccomp: Emulate basic filters for constant action results
  2020-09-24 15:28         ` Paul Moore
@ 2020-09-24 19:52           ` Kees Cook
  -1 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-24 19:52 UTC (permalink / raw)
  To: Paul Moore
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Jann Horn,
	YiFei Zhu, Linux API, Linux Containers, bpf,
	Tobin Feldman-Fitzthum, Hubertus Franke, Andy Lutomirski,
	Valentin Rothberg, Dimitrios Skarlatos, Jack Chen,
	Josep Torrellas, Tianyin Xu, kernel list

On Thu, Sep 24, 2020 at 11:28:55AM -0400, Paul Moore wrote:
> On Thu, Sep 24, 2020 at 3:46 AM Kees Cook <keescook@chromium.org> wrote:
> > On Thu, Sep 24, 2020 at 01:47:47AM +0200, Jann Horn wrote:
> > > On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> > > > This emulates absolutely the most basic seccomp filters to figure out
> > > > if they will always give the same results for a given arch/nr combo.
> > > >
> > > > Nearly all seccomp filters are built from the following ops:
> > > >
> > > > BPF_LD  | BPF_W    | BPF_ABS
> > > > BPF_JMP | BPF_JEQ  | BPF_K
> > > > BPF_JMP | BPF_JGE  | BPF_K
> > > > BPF_JMP | BPF_JGT  | BPF_K
> > > > BPF_JMP | BPF_JSET | BPF_K
> > > > BPF_JMP | BPF_JA
> > > > BPF_RET | BPF_K
> > > >
> > > > These are now emulated to check for accesses beyond seccomp_data::arch
> > > > or unknown instructions.
> > > >
> > > > Not yet implemented are:
> > > >
> > > > BPF_ALU | BPF_AND (generated by libseccomp and Chrome)
> > >
> > > BPF_AND is normally only used on syscall arguments, not on the syscall
> > > number or the architecture, right? And when a syscall argument is
> > > loaded, we abort execution anyway. So I think there is no need to
> > > implement those?
> >
> > Is that right? I can't actually tell what libseccomp is doing with
> > ALU|AND. It looks like it's using it for building jump lists?
> 
> There is an ALU|AND op in the jump resolution code, but that is really
> just for when libseccomp needs to fix up the accumulator because a code block
> is expecting a masked value (right now that would only be a syscall
> argument, not the syscall number itself).
> 
> > Paul, Tom, under what cases does libseccomp emit ALU|AND into filters?
> 
> Presently the only place where libseccomp uses ALU|AND is when the
> masked equality comparison is used for comparing syscall arguments
> (SCMP_CMP_MASKED_EQ).  I can't honestly say I have any good
> information about how often that is used by libseccomp callers, but if
> I do a quick search on GitHub for "SCMP_CMP_MASKED_EQ" I see 2k worth
> of code hits; take that for whatever it is worth.  Tom may have some
> more/better information.
> 
> Of course no promises on future use :)  As one quick example, I keep
> thinking about adding the instruction pointer to the list of things
> that can be compared as part of a libseccomp rule, and if we do that I
> would expect that we would want to also allow a masked comparison (and
> utilize another ALU|AND bpf op there).  However, I'm not sure how
> useful that would be in practice.

Okay, cool. Thanks for checking on that. It sounds like the arg-less
bitmap optimization can continue to ignore ALU|AND for now. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
  2020-09-24 18:57   ` Andrea Arcangeli
@ 2020-09-24 20:00     ` Kees Cook
  -1 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-24 20:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Giuseppe Scrivano, Will Drewry, bpf, Jann Horn, YiFei Zhu,
	linux-api, containers, Tobin Feldman-Fitzthum, Hubertus Franke,
	Andy Lutomirski, Valentin Rothberg, Dimitrios Skarlatos,
	Jack Chen, Josep Torrellas, Tianyin Xu, linux-kernel

On Thu, Sep 24, 2020 at 02:57:02PM -0400, Andrea Arcangeli wrote:
> Hello,
> 
> I'm posting this only for the record, feel free to ignore.
> 
> On Wed, Sep 23, 2020 at 04:29:17PM -0700, Kees Cook wrote:
> > rfc: https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/
> > alternative: https://lore.kernel.org/containers/cover.1600661418.git.yifeifz2@illinois.edu/
> > v1:
> > - rebase to for-next/seccomp
> > - finish X86_X32 support for both pinning and bitmaps
> 
> It's pretty clear the O(1) seccomp filter bitmap was first
> proposed by your RFC in June (albeit it was located in the wrong place
> and is still in the wrong place in v1).
> 
> > - replace TLB magic with Jann's emulator
>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>     
> That's a pretty fundamental change in v1 compared to the
> non-competing TLB magic technique you used in the RFC last June.
> 
> The bitmap isn't the clever part of the patch, the bitmap can be
> reviewed in seconds, the difficult part to implement and to review is
> how you fill the bitmap and in that respect there's absolutely nothing
> in common between the "rfc:" and the "alternative" link.
> 
> In June your bitmap-filling engine was this:
> 
> https://lore.kernel.org/lkml/20200616074934.1600036-5-keescook@chromium.org/
> 
> Then on Sep 21 YiFei Zhu posted his innovative BPF emulation
> that obsoleted your TLB magic of June:
> 
> https://lists.linuxfoundation.org/pipermail/containers/2020-September/042153.html
> 
> And on Sep 23 instead of collaborating and helping YiFei Zhu to
> improve his BPF emulator, you posted the same technique that looks
> remarkably similar without giving YiFei Zhu any attribution and you
> instead attribute the whole idea to Jann Horn:
> 
> https://lkml.kernel.org/r/20200923232923.3142503-5-keescook@chromium.org

?? Because it IS literally Jann's code:
https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
As the first reply to 20200616074934.1600036-5-keescook@chromium.org. In
June. Which I agreed was the way to go. In June.

And when YiFei Zhu sent their series, I saw they were headed in
a direction that looked functionally similar, but significantly
over-engineered, and done without building on the June RFC and its
discussion. So I raised the priority of putting Jann's code into the
RFC, so I could send out an update demonstrating both how small I would
like the emulator to be, and how to handle things like x32.

How, exactly, am I not collaborating? I was literally trying to
thread-merge and avoid (more) extra work on YiFei Zhu's end.
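The emulator idea under discussion is small enough to sketch in plain C. The following is a toy model for illustration only, not the actual patch: it defines the classic-BPF opcode values locally, handles only loads of seccomp_data's nr/arch words plus JA/JEQ/RET, and bails out (meaning "just run the filter at syscall time") on anything else.

```c
#include <stdint.h>
#include <stddef.h>

/* Classic-BPF opcode encodings, values as in the cBPF UAPI. */
#define LD_W_ABS  0x20	/* BPF_LD  | BPF_W   | BPF_ABS */
#define JMP_JA    0x05	/* BPF_JMP | BPF_JA            */
#define JMP_JEQ_K 0x15	/* BPF_JMP | BPF_JEQ | BPF_K   */
#define RET_K     0x06	/* BPF_RET | BPF_K             */

struct insn { uint16_t code; uint8_t jt, jf; uint32_t k; };

/*
 * Emulate a filter for one fixed (arch, nr) pair, permitting loads
 * only from the first two words of seccomp_data.  Returns 1 and
 * stores the constant action when the result is provable; returns 0
 * ("cannot prove, run the filter") on an argument load, an unhandled
 * instruction, or falling off the end of the program.
 */
static int emul_const(const struct insn *prog, size_t len,
		      uint32_t arch, uint32_t nr, uint32_t *action)
{
	uint32_t acc = 0;
	size_t pc;

	for (pc = 0; pc < len; pc++) {
		const struct insn *i = &prog[pc];

		switch (i->code) {
		case LD_W_ABS:
			if (i->k == 0)		/* offsetof(seccomp_data, nr) */
				acc = nr;
			else if (i->k == 4)	/* offsetof(seccomp_data, arch) */
				acc = arch;
			else
				return 0;	/* argument load: not constant */
			break;
		case JMP_JA:
			pc += i->k;		/* jumps are relative to pc + 1 */
			break;
		case JMP_JEQ_K:
			pc += (acc == i->k) ? i->jt : i->jf;
			break;
		case RET_K:
			*action = i->k;
			return 1;
		default:
			return 0;		/* unhandled insn: give up */
		}
	}
	return 0;
}
```

A real version additionally needs JGE/JGT/JSET and bounds checking on jump targets, which is what the patch being discussed supplies.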

-- 
Kees Cook
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 4/6] seccomp: Emulate basic filters for constant action results
  2020-09-24 19:52           ` Kees Cook
@ 2020-09-24 20:46             ` Paul Moore
  0 siblings, 0 replies; 81+ messages in thread
From: Paul Moore @ 2020-09-24 20:46 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Jann Horn,
	YiFei Zhu, Linux API, Linux Containers, bpf,
	Tobin Feldman-Fitzthum, Hubertus Franke, Andy Lutomirski,
	Valentin Rothberg, Dimitrios Skarlatos, Jack Chen,
	Josep Torrellas, Tianyin Xu, kernel list

On Thu, Sep 24, 2020 at 3:52 PM Kees Cook <keescook@chromium.org> wrote:
> On Thu, Sep 24, 2020 at 11:28:55AM -0400, Paul Moore wrote:
> > On Thu, Sep 24, 2020 at 3:46 AM Kees Cook <keescook@chromium.org> wrote:
> > > On Thu, Sep 24, 2020 at 01:47:47AM +0200, Jann Horn wrote:
> > > > On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> > > > > This emulates absolutely the most basic seccomp filters to figure out
> > > > > if they will always give the same results for a given arch/nr combo.
> > > > >
> > > > > Nearly all seccomp filters are built from the following ops:
> > > > >
> > > > > BPF_LD  | BPF_W    | BPF_ABS
> > > > > BPF_JMP | BPF_JEQ  | BPF_K
> > > > > BPF_JMP | BPF_JGE  | BPF_K
> > > > > BPF_JMP | BPF_JGT  | BPF_K
> > > > > BPF_JMP | BPF_JSET | BPF_K
> > > > > BPF_JMP | BPF_JA
> > > > > BPF_RET | BPF_K
> > > > >
> > > > > These are now emulated to check for accesses beyond seccomp_data::arch
> > > > > or unknown instructions.
> > > > >
> > > > > Not yet implemented are:
> > > > >
> > > > > BPF_ALU | BPF_AND (generated by libseccomp and Chrome)
> > > >
> > > > BPF_AND is normally only used on syscall arguments, not on the syscall
> > > > number or the architecture, right? And when a syscall argument is
> > > > loaded, we abort execution anyway. So I think there is no need to
> > > > implement those?
> > >
> > > Is that right? I can't actually tell what libseccomp is doing with
> > > ALU|AND. It looks like it's using it for building jump lists?
> >
> > There is an ALU|AND op in the jump resolution code, but that is really
> > just if libseccomp needs to fixup the accumulator because a code block
> > is expecting a masked value (right now that would only be a syscall
> > argument, not the syscall number itself).
> >
> > > Paul, Tom, under what cases does libseccomp emit ALU|AND into filters?
> >
> > Presently the only place where libseccomp uses ALU|AND is when the
> > masked equality comparison is used for comparing syscall arguments
> > (SCMP_CMP_MASKED_EQ).  I can't honestly say I have any good
> > information about how often that is used by libseccomp callers, but if
> > I do a quick search on GitHub for "SCMP_CMP_MASKED_EQ" I see 2k worth
> > of code hits; take that for whatever it is worth.  Tom may have some
> > more/better information.
> >
> > Of course no promises on future use :)  As one quick example, I keep
> > thinking about adding the instruction pointer to the list of things
> > that can be compared as part of a libseccomp rule, and if we do that I
> > would expect that we would want to also allow a masked comparison (and
> > utilize another ALU|AND bpf op there).  However, I'm not sure how
> > useful that would be in practice.
>
> Okay, cool. Thanks for checking on that. It sounds like the arg-less
> bitmap optimization can continue to ignore ALU|AND for now. :)

What's really the worst that could happen anyways? (/me ducks)  The
worst case is the filter falls back to the current performance levels
right?
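As background for the SCMP_CMP_MASKED_EQ discussion above: the masked comparison lowers to an ALU|AND on the loaded argument followed by a JEQ, which behaves like this plain-C model (a toy sketch of the lowering, with illustrative flag values in the usage below):

```c
#include <stdint.h>

/*
 * What libseccomp's SCMP_CMP_MASKED_EQ lowers to, modeled in C:
 * an ALU|AND on the loaded syscall argument, then a JMP|JEQ
 * against the expected datum.
 */
static int masked_eq(uint64_t arg, uint64_t mask, uint64_t datum)
{
	arg &= mask;		/* BPF_ALU | BPF_AND | BPF_K */
	return arg == datum;	/* BPF_JMP | BPF_JEQ | BPF_K */
}
```

A typical use is matching only the access-mode bits of open(2) flags, e.g. masked_eq(flags, O_ACCMODE, O_WRONLY). Since these comparisons apply to argument words, a constant-action emulator that aborts on argument loads never needs to execute them.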

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 4/6] seccomp: Emulate basic filters for constant action results
  2020-09-24 20:46             ` Paul Moore
@ 2020-09-24 21:35               ` Kees Cook
  0 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-24 21:35 UTC (permalink / raw)
  To: Paul Moore
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Jann Horn,
	YiFei Zhu, Linux API, Linux Containers, bpf,
	Tobin Feldman-Fitzthum, Hubertus Franke, Andy Lutomirski,
	Valentin Rothberg, Dimitrios Skarlatos, Jack Chen,
	Josep Torrellas, Tianyin Xu, kernel list

On Thu, Sep 24, 2020 at 04:46:05PM -0400, Paul Moore wrote:
> On Thu, Sep 24, 2020 at 3:52 PM Kees Cook <keescook@chromium.org> wrote:
> > On Thu, Sep 24, 2020 at 11:28:55AM -0400, Paul Moore wrote:
> > > On Thu, Sep 24, 2020 at 3:46 AM Kees Cook <keescook@chromium.org> wrote:
> > > > On Thu, Sep 24, 2020 at 01:47:47AM +0200, Jann Horn wrote:
> > > > > On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@chromium.org> wrote:
> > > > > > This emulates absolutely the most basic seccomp filters to figure out
> > > > > > if they will always give the same results for a given arch/nr combo.
> > > > > >
> > > > > > Nearly all seccomp filters are built from the following ops:
> > > > > >
> > > > > > BPF_LD  | BPF_W    | BPF_ABS
> > > > > > BPF_JMP | BPF_JEQ  | BPF_K
> > > > > > BPF_JMP | BPF_JGE  | BPF_K
> > > > > > BPF_JMP | BPF_JGT  | BPF_K
> > > > > > BPF_JMP | BPF_JSET | BPF_K
> > > > > > BPF_JMP | BPF_JA
> > > > > > BPF_RET | BPF_K
> > > > > >
> > > > > > These are now emulated to check for accesses beyond seccomp_data::arch
> > > > > > or unknown instructions.
> > > > > >
> > > > > > Not yet implemented are:
> > > > > >
> > > > > > BPF_ALU | BPF_AND (generated by libseccomp and Chrome)
> > > > >
> > > > > BPF_AND is normally only used on syscall arguments, not on the syscall
> > > > > number or the architecture, right? And when a syscall argument is
> > > > > loaded, we abort execution anyway. So I think there is no need to
> > > > > implement those?
> > > >
> > > > Is that right? I can't actually tell what libseccomp is doing with
> > > > ALU|AND. It looks like it's using it for building jump lists?
> > >
> > > There is an ALU|AND op in the jump resolution code, but that is really
> > > just if libseccomp needs to fixup the accumulator because a code block
> > > is expecting a masked value (right now that would only be a syscall
> > > argument, not the syscall number itself).
> > >
> > > > Paul, Tom, under what cases does libseccomp emit ALU|AND into filters?
> > >
> > > Presently the only place where libseccomp uses ALU|AND is when the
> > > masked equality comparison is used for comparing syscall arguments
> > > (SCMP_CMP_MASKED_EQ).  I can't honestly say I have any good
> > > information about how often that is used by libseccomp callers, but if
> > > I do a quick search on GitHub for "SCMP_CMP_MASKED_EQ" I see 2k worth
> > > of code hits; take that for whatever it is worth.  Tom may have some
> > > more/better information.
> > >
> > > Of course no promises on future use :)  As one quick example, I keep
> > > thinking about adding the instruction pointer to the list of things
> > > that can be compared as part of a libseccomp rule, and if we do that I
> > > would expect that we would want to also allow a masked comparison (and
> > > utilize another ALU|AND bpf op there).  However, I'm not sure how
> > > useful that would be in practice.
> >
> > Okay, cool. Thanks for checking on that. It sounds like the arg-less
> > bitmap optimization can continue to ignore ALU|AND for now. :)
> 
> What's really the worst that could happen anyways? (/me ducks)  The
> worst case is the filter falls back to the current performance levels
> right?

Worst case for adding complexity to the verifier is that the bitmaps can be
tricked into a bad state, but I've tried to design this so that it can
only fail toward just running the filter. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
  2020-09-24 13:58     ` YiFei Zhu
@ 2020-09-25  5:56       ` Rasmus Villemoes
  0 siblings, 0 replies; 81+ messages in thread
From: Rasmus Villemoes @ 2020-09-25  5:56 UTC (permalink / raw)
  To: YiFei Zhu, Rasmus Villemoes
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	Jann Horn, YiFei Zhu, Linux API, Linux Containers,
	Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Andy Lutomirski,
	Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas,
	bpf, Tianyin Xu, kernel list

On 24/09/2020 15.58, YiFei Zhu wrote:
> On Thu, Sep 24, 2020 at 8:46 AM Rasmus Villemoes
> <linux@rasmusvillemoes.dk> wrote:
>> But one thing I'm wondering about and I haven't seen addressed anywhere:
>> Why build the bitmap on the kernel side (with all the complexity of
>> having to emulate the filter for all syscalls)? Why can't userspace just
>> hand the kernel "here's a new filter: the syscalls in this bitmap are
>> always allowed noquestionsasked, for the rest, run this bpf". Sure, that
>> might require a new syscall or extending seccomp(2) somewhat, but isn't
>> that a _lot_ simpler? It would probably also mean that the bpf we do get
>> handed is a lot smaller. Userspace might need to pass a couple of
>> bitmaps, one for each relevant arch, but you get the overall idea.
> 
> Perhaps. The thing is, the current API expects any filter attaches to
> be "additive". If a new filter gets attached that says "disallow read"
> then no matter what has been attached already, "read" shall not be
> allowed at the next syscall, bypassing all previous allowlist bitmaps
> (so you need to emulate the bpf anyways here?). We should also not
> have an API that could let anyone escape the seccomp jail. Say "prctl"
> is permitted but "read" is not permitted, one must not be allowed to
> attach a bitmap so that "read" now appears in the allowlist. The only
> way this could potentially work is to attach a BPF filter and a bitmap
> at the same time in the same syscall, which might mean API redesign?

Yes, the man page would read something like

       SECCOMP_SET_MODE_FILTER_BITMAP
              The system calls allowed are defined by a pointer to a
Berkeley Packet Filter (BPF) passed  via  args.
              This argument is a pointer to a struct sock_fprog_bitmap;

with that struct containing whatever information/extra pointers needed
for passing the bitmap(s) in addition to the bpf prog.

And SECCOMP_SET_MODE_FILTER would internally just be updated to work
as-if all-zero allow-bitmaps were passed along. The internal kernel
bitmap would just be the and of the bitmaps in the filter stack.

Sure, it's UAPI, so it would certainly need more careful thought on the
details of just what the arg struct looks like, etc., but I was wondering
why it hadn't been discussed at all.
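The "AND of the bitmaps in the filter stack" rule is easy to model. This is a toy sketch with a made-up table size, not kernel code:

```c
#include <stdint.h>
#include <stddef.h>

#define NR_SYSCALLS 448			/* illustrative size */
#define BITMAP_WORDS ((NR_SYSCALLS + 63) / 64)

struct filter {
	uint64_t allow[BITMAP_WORDS];	/* bit set => always-allow */
	struct filter *prev;		/* previously attached filter */
};

/*
 * Attaching can only narrow the always-allow set: the effective
 * bitmap is the AND of every bitmap in the stack, so a later filter
 * can never re-allow a syscall an earlier filter did not allow.
 */
static void attach(struct filter *f, struct filter *prev)
{
	size_t i;

	f->prev = prev;
	if (prev)
		for (i = 0; i < BITMAP_WORDS; i++)
			f->allow[i] &= prev->allow[i];
}

static int always_allowed(const struct filter *f, unsigned int nr)
{
	return nr < NR_SYSCALLS && ((f->allow[nr / 64] >> (nr % 64)) & 1);
}
```

This is the property that makes the userspace-supplied variant safe against jail escape: a bit survives only if every filter in the stack set it.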

>> I'm also a bit worried about the performance of doing that emulation;
>> that's constant extra overhead for, say, launching a docker container.
> 
> IMO, launching a docker container is so expensive this should be negligible.

Regardless, I'd like to see some numbers, certainly for the "how much
faster does a getpid() or read() or any of the other syscalls that
nobody disallows" get, but also "what's the cost of doing that emulation
at seccomp(2) time".

Rasmus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
  2020-09-25  5:56       ` Rasmus Villemoes
@ 2020-09-25  7:07         ` YiFei Zhu
  0 siblings, 0 replies; 81+ messages in thread
From: YiFei Zhu @ 2020-09-25  7:07 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	Jann Horn, YiFei Zhu, Linux API, Linux Containers,
	Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Andy Lutomirski,
	Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas,
	bpf, Tianyin Xu, kernel list

On Fri, Sep 25, 2020 at 12:56 AM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
> Yes, the man page would read something like
>
>        SECCOMP_SET_MODE_FILTER_BITMAP
>               The system calls allowed are defined by a pointer to a
> Berkeley Packet Filter (BPF) passed  via  args.
>               This argument is a pointer to a struct sock_fprog_bitmap;
>
> with that struct containing whatever information/extra pointers needed
> for passing the bitmap(s) in addition to the bpf prog.
>
> And SECCOMP_SET_MODE_FILTER would internally just be updated to work
> as-if all-zero allow-bitmaps were passed along. The internal kernel
> bitmap would just be the and of the bitmaps in the filter stack.
>
> Sure, it's UAPI, so would certainly need more careful thought on details
> of just how the arg struct looks like etc. etc., but I was wondering why
> it hadn't been discussed at all.

If SECCOMP_SET_MODE_FILTER is attached before / after
SECCOMP_SET_MODE_FILTER_BITMAP, does that mean all the bitmaps get voided?

Would it make sense to have SECCOMP_SET_MODE_FILTER run through the
emulator to see if we can construct a bitmap anyways for "legacy
no-bitmap" support?

Another thing to consider is that in both patch series we only
construct one final bitmap such that, if a bit is set, seccomp will not
call into the BPF filter.
called in sequence, even if some of them "must allow the syscall".
With SECCOMP_SET_MODE_FILTER_BITMAP, the filter BPF code will no
longer have the "if it's this syscall" for any syscalls that are given
in the bitmaps, and calling into these filters will be a false
negative. So we would need extra logic to make "does this filter have
a bitmap? if so check bitmap first". Probably won't be too
complicated, but idk if it is actually worth the complexity. wdyt?
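A per-filter-bitmap dispatch along those lines might look like the following toy model. Here bpf_ret stands in for actually running that filter's BPF program, and "numerically smaller action wins" is a simplification that holds only for the two actions used here (the kernel's real precedence comparison is signed and covers more actions):

```c
#include <stdint.h>

#define RET_ALLOW 0x7fff0000U	/* SECCOMP_RET_ALLOW */
#define RET_KILL  0x00000000U	/* SECCOMP_RET_KILL_THREAD */

struct filt {
	const struct filt *prev;	/* previously attached filter */
	uint64_t allow;			/* toy 64-syscall always-allow bitmap */
	uint32_t bpf_ret;		/* stand-in for running this filter's BPF */
};

/*
 * The variant discussed above: each filter carries its own bitmap,
 * consulted before calling into that filter's BPF, rather than one
 * final bitmap gating the whole stack.
 */
static uint32_t run_stack(const struct filt *f, unsigned int nr)
{
	uint32_t worst = RET_ALLOW;

	for (; f; f = f->prev) {
		if (nr < 64 && ((f->allow >> nr) & 1))
			continue;		/* bitmap hit: skip this filter's BPF */
		if (f->bpf_ret < worst)		/* smaller value = more restrictive */
			worst = f->bpf_ret;
	}
	return worst;
}
```

The extra logic is just the one bitmap test per filter; the open question above is whether that buys enough over a single front-of-stack bitmap to justify it.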

> Regardless, I'd like to see some numbers, certainly for the "how much
> faster does a getpid() or read() or any of the other syscalls that
> nobody disallows" get, but also "what's the cost of doing that emulation
> at seccomp(2) time".

The former has been given in my RFC patch [1]. In an extreme case of
no side channel mitigations, in the same amount of time, unixbench
syscall mixed runs 33295685 syscalls without seccomp, 20661056
syscalls with docker profile, 25719937 syscalls with bitmapped docker
profile. Though, I think Jack was running on Ubuntu and it did not
have a libseccomp shipped with the distro that's new enough to do the
binary decision tree generation [2].

I'll try to profile the latter later on my qemu-kvm, with a recent
libseccomp with binary tree and docker's profile, probably both direct
filter attaches and filter attaches with fork(). I'm guessing if I
have fork() the cost of fork() will overshadow seccomp() though.

[1] https://lore.kernel.org/containers/cover.1600661418.git.yifeifz2@illinois.edu/
[2] https://github.com/seccomp/libseccomp/pull/152

YiFei Zhu

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
@ 2020-09-25  7:07         ` YiFei Zhu
  0 siblings, 0 replies; 81+ messages in thread
From: YiFei Zhu @ 2020-09-25  7:07 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Kees Cook, YiFei Zhu, Andrea Arcangeli, Giuseppe Scrivano,
	Will Drewry, bpf, Jann Horn, Linux API, Linux Containers,
	Tobin Feldman-Fitzthum, Hubertus Franke, Andy Lutomirski,
	Valentin Rothberg, Dimitrios Skarlatos, Jack Chen,
	Josep Torrellas, Tianyin Xu, kernel list

On Fri, Sep 25, 2020 at 12:56 AM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
> Yes, the man page would read something like
>
>        SECCOMP_SET_MODE_FILTER_BITMAP
>               The system calls allowed are defined by a pointer to a
> Berkeley Packet Filter (BPF) passed  via  args.
>               This argument is a pointer to a struct sock_fprog_bitmap;
>
> with that struct containing whatever information/extra pointers needed
> for passing the bitmap(s) in addition to the bpf prog.
>
> And SECCOMP_SET_MODE_FILTER would internally just be updated to work
> as-if all-zero allow-bitmaps were passed along. The internal kernel
> bitmap would just be the and of the bitmaps in the filter stack.
>
> Sure, it's UAPI, so it would certainly need more careful thought on the
> details of what the arg struct looks like, etc., but I was wondering why
> it hadn't been discussed at all.

If SECCOMP_SET_MODE_FILTER is attached before / after
SECCOMP_SET_MODE_FILTER_BITMAP, does that mean all bitmaps become void?

Would it make sense to have SECCOMP_SET_MODE_FILTER run through the
emulator to see if we can construct a bitmap anyways for "legacy
no-bitmap" support?

Another thing to consider: in both patch series we only construct one
final bitmap. If the bit is set, seccomp will not call into the BPF
filter; if it is not set, all filters are called in sequence, even if
some of them must allow the syscall. With
SECCOMP_SET_MODE_FILTER_BITMAP, the filter's BPF code would no longer
contain the "if it's this syscall" checks for any syscalls given in
the bitmaps, so calling into those filters would be a false negative.
We would therefore need extra logic along the lines of "does this
filter have a bitmap? If so, check the bitmap first". Probably won't
be too complicated, but I don't know if it is actually worth the
complexity. What do you think?

> Regardless, I'd like to see some numbers, certainly for the "how much
> faster does a getpid() or read() or any of the other syscalls that
> nobody disallows" get, but also "what's the cost of doing that emulation
> at seccomp(2) time".

The former has been given in my RFC patch [1]. In an extreme case with
no side-channel mitigations, in the same amount of time, unixbench's
syscall-mixed test runs 33295685 syscalls without seccomp, 20661056
with the docker profile, and 25719937 with the bitmapped docker
profile. Though, I think Jack was running on Ubuntu, and it did not
have a libseccomp shipped with the distro that's new enough to do the
binary decision tree generation [2].

I'll try to profile the latter later on my qemu-kvm, with a recent
libseccomp with binary tree and docker's profile, probably both direct
filter attaches and filter attaches with fork(). I'm guessing if I
have fork() the cost of fork() will overshadow seccomp() though.

[1] https://lore.kernel.org/containers/cover.1600661418.git.yifeifz2@illinois.edu/
[2] https://github.com/seccomp/libseccomp/pull/152

YiFei Zhu

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
  2020-09-25  7:07         ` YiFei Zhu
@ 2020-09-26 18:11           ` YiFei Zhu
  -1 siblings, 0 replies; 81+ messages in thread
From: YiFei Zhu @ 2020-09-26 18:11 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Will Drewry, Kees Cook,
	Jann Horn, YiFei Zhu, Linux API, Linux Containers,
	Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Andy Lutomirski,
	Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas,
	bpf, Tianyin Xu, kernel list

On Fri, Sep 25, 2020 at 2:07 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> I'll try to profile the latter later on my qemu-kvm, with a recent
> libseccomp with binary tree and docker's profile, probably both direct
> filter attaches and filter attaches with fork(). I'm guessing if I
> have fork() the cost of fork() will overshadow seccomp() though.

I'm surprised. That is not the case as far as I can tell.

I wrote a benchmark [1] that fork()s and, in the child, attaches a
seccomp filter and takes the CLOCK_MONOTONIC difference, then adds it
to a struct timespec shared with the parent. It measures against
timestamps taken before prctl() and before fork(). I used
CLOCK_MONOTONIC instead of CLOCK_PROCESS_CPUTIME_ID because of the fork().

I ran `./seccomp_emu_bench 100000` in my qemu-kvm and here are the results:
without emulator:
Benchmarking 100000 syscalls...
19799663603 (19.8s)
seccomp attach without fork: 197996 ns
33911173847 (33.9s)
seccomp attach with fork: 339111 ns

with emulator:
Benchmarking 100000 syscalls...
54428289147 (54.4s)
seccomp attach without fork: 544282 ns
69494235408 (69.5s)
seccomp attach with fork: 694942 ns

fork seems to take around 150us, seccomp attach takes around 200us,
and the filter emulation overhead is around 350us. I had no idea that
fork was this fast. If I wrote my benchmark badly please criticise.

Given that we are doubling the time to fork() + seccomp attach filter,
I think yeah running the emulator on the first instance of a syscall,
holding a lock, is a much better idea. If I naively divide 350us by
the number of syscall + arch pairs emulated the overhead is less than
1 us and that should be okay since it only happens for the first
invocation of the particular syscall.

[1] https://gist.github.com/zhuyifei1999/d7bee62bea14187e150fef59db8e30b1

YiFei Zhu

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
  2020-09-26 18:11           ` YiFei Zhu
@ 2020-09-28 20:04             ` Kees Cook
  -1 siblings, 0 replies; 81+ messages in thread
From: Kees Cook @ 2020-09-28 20:04 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Tobin Feldman-Fitzthum,
	Will Drewry, Jann Horn, YiFei Zhu, Linux API, Linux Containers,
	Rasmus Villemoes, Dimitrios Skarlatos, Andy Lutomirski,
	Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas,
	bpf, Tianyin Xu, kernel list

On Sat, Sep 26, 2020 at 01:11:50PM -0500, YiFei Zhu wrote:
> On Fri, Sep 25, 2020 at 2:07 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > I'll try to profile the latter later on my qemu-kvm, with a recent
> > libseccomp with binary tree and docker's profile, probably both direct
> > filter attaches and filter attaches with fork(). I'm guessing if I
> > have fork() the cost of fork() will overshadow seccomp() though.
> 
> I'm surprised. That is not the case as far as I can tell.
> 
> I wrote a benchmark [1] that fork()s and, in the child, attaches a
> seccomp filter and takes the CLOCK_MONOTONIC difference, then adds it
> to a struct timespec shared with the parent. It measures against
> timestamps taken before prctl() and before fork(). I used
> CLOCK_MONOTONIC instead of CLOCK_PROCESS_CPUTIME_ID because of the fork().
> 
> I ran `./seccomp_emu_bench 100000` in my qemu-kvm and here are the results:
> without emulator:
> Benchmarking 100000 syscalls...
> 19799663603 (19.8s)
> seccomp attach without fork: 197996 ns
> 33911173847 (33.9s)
> seccomp attach with fork: 339111 ns
> 
> with emulator:
> Benchmarking 100000 syscalls...
> 54428289147 (54.4s)
> seccomp attach without fork: 544282 ns
> 69494235408 (69.5s)
> seccomp attach with fork: 694942 ns
> 
> fork seems to take around 150us, seccomp attach takes around 200us,
> and the filter emulation overhead is around 350us. I had no idea that
> fork was this fast. If I wrote my benchmark badly please criticise.

You're calling clock_gettime() inside your loop. That might change the
numbers. Why not just measure outside the loop, or better yet, use
"perf" to measure the time in prctl().

> Given that we are doubling the time to fork() + seccomp attach filter,
> I think yeah running the emulator on the first instance of a syscall,
> holding a lock, is a much better idea. If I naively divide 350us by
> the number of syscall + arch pairs emulated the overhead is less than
> 1 us and that should be okay since it only happens for the first
> invocation of the particular syscall.
> 
> [1] https://gist.github.com/zhuyifei1999/d7bee62bea14187e150fef59db8e30b1

Regardless, let's take things one step at a time. First, let's do
the simplest version of the feature, and then let's look at further
optimizations.

Can you send a v3 and we can continue from there?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v1 0/6] seccomp: Implement constant action bitmaps
  2020-09-28 20:04             ` Kees Cook
@ 2020-09-28 20:16               ` YiFei Zhu
  -1 siblings, 0 replies; 81+ messages in thread
From: YiFei Zhu @ 2020-09-28 20:16 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrea Arcangeli, Giuseppe Scrivano, Tobin Feldman-Fitzthum,
	Will Drewry, Jann Horn, YiFei Zhu, Linux API, Linux Containers,
	Rasmus Villemoes, Dimitrios Skarlatos, Andy Lutomirski,
	Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas,
	bpf, Tianyin Xu, kernel list

On Mon, Sep 28, 2020 at 3:04 PM Kees Cook <keescook@chromium.org> wrote:
> Regardless, let's take things one step at a time. First, let's do
> the simplest version of the feature, and then let's look at further
> optimizations.
>
> Can you send a v3 and we can continue from there?

ok, will do later tonight / tomorrow.

YiFei Zhu

^ permalink raw reply	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2020-09-28 20:16 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-23 23:29 [PATCH v1 0/6] seccomp: Implement constant action bitmaps Kees Cook
2020-09-23 23:29 ` Kees Cook
2020-09-23 23:29 ` [PATCH 1/6] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE Kees Cook
2020-09-23 23:29   ` Kees Cook
2020-09-24  0:41   ` Jann Horn via Containers
2020-09-24  0:41     ` Jann Horn
2020-09-24  7:11     ` Kees Cook
2020-09-24  7:11       ` Kees Cook
2020-09-23 23:29 ` [PATCH 2/6] x86: Enable seccomp architecture tracking Kees Cook
2020-09-23 23:29   ` Kees Cook
2020-09-24  0:45   ` Jann Horn via Containers
2020-09-24  0:45     ` Jann Horn
2020-09-24  7:12     ` Kees Cook
2020-09-24  7:12       ` Kees Cook
2020-09-23 23:29 ` [PATCH 3/6] seccomp: Implement constant action bitmaps Kees Cook
2020-09-23 23:29   ` Kees Cook
2020-09-24  0:25   ` Jann Horn via Containers
2020-09-24  0:25     ` Jann Horn
2020-09-24  7:36     ` Kees Cook
2020-09-24  7:36       ` Kees Cook
2020-09-24  8:07       ` YiFei Zhu
2020-09-24  8:07         ` YiFei Zhu
2020-09-24  8:15         ` Kees Cook
2020-09-24  8:15           ` Kees Cook
2020-09-24  8:22           ` YiFei Zhu
2020-09-24  8:22             ` YiFei Zhu
2020-09-24 12:28       ` Jann Horn via Containers
2020-09-24 12:28         ` Jann Horn
2020-09-24 12:37         ` David Laight
2020-09-24 12:37           ` David Laight
2020-09-24 12:56           ` Jann Horn via Containers
2020-09-24 12:56             ` Jann Horn
     [not found]   ` <DM6PR11MB271492D0565E91475D949F5DEF390@DM6PR11MB2714.namprd11.prod.outlook.com>
2020-09-24  0:36     ` YiFei Zhu
2020-09-24  0:36       ` YiFei Zhu
2020-09-24  7:38       ` Kees Cook
2020-09-24  7:38         ` Kees Cook
2020-09-24  7:51         ` YiFei Zhu
2020-09-24  7:51           ` YiFei Zhu
2020-09-23 23:29 ` [PATCH 4/6] seccomp: Emulate basic filters for constant action results Kees Cook
2020-09-23 23:29   ` Kees Cook
2020-09-23 23:47   ` Jann Horn via Containers
2020-09-23 23:47     ` Jann Horn
2020-09-24  7:46     ` Kees Cook
2020-09-24  7:46       ` Kees Cook
2020-09-24 15:28       ` Paul Moore
2020-09-24 15:28         ` Paul Moore
2020-09-24 19:52         ` Kees Cook
2020-09-24 19:52           ` Kees Cook
2020-09-24 20:46           ` Paul Moore
2020-09-24 20:46             ` Paul Moore
2020-09-24 21:35             ` Kees Cook
2020-09-24 21:35               ` Kees Cook
2020-09-24 13:16   ` kernel test robot
2020-09-23 23:29 ` [PATCH 5/6] selftests/seccomp: Compare bitmap vs filter overhead Kees Cook
2020-09-23 23:29   ` Kees Cook
2020-09-23 23:29 ` [PATCH 6/6] [DEBUG] seccomp: Report bitmap coverage ranges Kees Cook
2020-09-23 23:29   ` Kees Cook
2020-09-24 13:40 ` [PATCH v1 0/6] seccomp: Implement constant action bitmaps Rasmus Villemoes
2020-09-24 13:40   ` Rasmus Villemoes
2020-09-24 13:58   ` YiFei Zhu
2020-09-24 13:58     ` YiFei Zhu
2020-09-25  5:56     ` Rasmus Villemoes
2020-09-25  5:56       ` Rasmus Villemoes
2020-09-25  7:07       ` YiFei Zhu
2020-09-25  7:07         ` YiFei Zhu
2020-09-26 18:11         ` YiFei Zhu
2020-09-26 18:11           ` YiFei Zhu
2020-09-28 20:04           ` Kees Cook
2020-09-28 20:04             ` Kees Cook
2020-09-28 20:16             ` YiFei Zhu
2020-09-28 20:16               ` YiFei Zhu
2020-09-24 14:05   ` Jann Horn via Containers
2020-09-24 14:05     ` Jann Horn
2020-09-24 18:57 ` Andrea Arcangeli
2020-09-24 18:57   ` Andrea Arcangeli
2020-09-24 19:18   ` Jann Horn via Containers
2020-09-24 19:18     ` Jann Horn
     [not found]   ` <9dbe8e3bbdad43a1872202ff38c34ca2@DM5PR11MB1692.namprd11.prod.outlook.com>
2020-09-24 19:48     ` Tianyin Xu
2020-09-24 19:48       ` Tianyin Xu
2020-09-24 20:00   ` Kees Cook
2020-09-24 20:00     ` Kees Cook
