BPF Archive on lore.kernel.org
* [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
@ 2020-09-21  5:35 YiFei Zhu
  2020-09-21  5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
                   ` (8 more replies)
  0 siblings, 9 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-21  5:35 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg

From: YiFei Zhu <yifeifz2@illinois.edu>

This series adds a bitmap to cache seccomp filter results if the
result permits a syscall and is independent of syscall arguments.
This visibly decreases seccomp overhead for most common seccomp
filters with very little memory footprint.

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.

We propose SECCOMP_CACHE, a cache-based solution to minimize the
Seccomp overhead. The basic idea is to cache the result of each
syscall check to save the subsequent overhead of executing the
filters. This is feasible because the check in Seccomp is
stateless: the checking result for the same syscall ID and
arguments remains the same.

We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
number, and, as past discussions considered, a bitmap where each bit
represents a syscall makes the most sense for these filters.

In the past, Kees proposed [2] an "add this syscall to the reject
bitmask" interface. It is indeed much easier to securely build a
reject accelerator that pre-filters syscalls before passing them to
the BPF filters, since it can only strengthen the security provided
by the filter. Ultimately, however, filter rejections are an
exceptional / rare case. Here, instead of accelerating what is
rejected, we accelerate what is allowed. In order not to compromise
the security rules the BPF filters define, any accept-side
accelerator must complement the BPF filters rather than replace them.

Statically analyzing BPF bytecode to see whether each syscall will
always land in allow or reject is more of a rabbit hole, especially
since there is no current in-kernel infrastructure to enumerate all
the possible architecture numbers for a given machine. So rather than
doing that, we propose to cache the results after the BPF filters are
run. Since there are filters, like docker's, that check the
arguments of some syscalls but not others, when a filter is loaded
we analyze it to find whether each syscall is cacheable (does not
access the syscall arguments or instruction pointer) by following
its control flow graph, and store the result for each filter in a
bitmap. Changes to the architecture number or the filter are expected
to be rare and simply cause the cache to be cleared. This solution
is fully transparent to userspace.
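To make the "nr-only" case concrete, here is a sketch of a minimal
filter of that shape, built with the standard cBPF macros (a userspace
illustration; the allowed syscall and return values are arbitrary
examples, not taken from any real profile):

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/*
 * This filter only ever loads the arch and nr fields of
 * struct seccomp_data, so every syscall it allows is cacheable
 * under the proposed scheme.
 */
static struct sock_filter nr_only_filter[] = {
	/* Kill if the architecture number does not match. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	/* Decide purely on the syscall number. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | 1),
};
```

A filter that additionally loads args[] for certain syscall numbers,
like docker's, would be marked cacheable only for the syscall numbers
whose emulated paths never reach such a load.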

Ongoing work is to further support arguments with fast hash table
lookups. We are investigating the performance of doing so [6], and how
to best integrate with the existing seccomp infrastructure.

We have done some benchmarks with patch applied against bpf-next
commit 2e80be60c465 ("libbpf: Fix compilation warnings for 64-bit printf args").

Me, in qemu-kvm x86_64 VM, on Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz,
average results:

Without cache, seccomp_benchmark:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Calibrating sample size for 15 seconds worth of syscalls ...
  Benchmarking 23486415 syscalls...
  16.079642020 - 1.013345439 = 15066296581 (15.1s)
  getpid native: 641 ns
  32.080237410 - 16.080763500 = 15999473910 (16.0s)
  getpid RET_ALLOW 1 filter: 681 ns
  48.609461618 - 32.081296173 = 16528165445 (16.5s)
  getpid RET_ALLOW 2 filters: 703 ns
  Estimated total seccomp overhead for 1 filter: 40 ns
  Estimated total seccomp overhead for 2 filters: 62 ns
  Estimated seccomp per-filter overhead: 22 ns
  Estimated seccomp entry overhead: 18 ns

With cache:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Calibrating sample size for 15 seconds worth of syscalls ...
  Benchmarking 23486415 syscalls...
  16.059512499 - 1.014108434 = 15045404065 (15.0s)
  getpid native: 640 ns
  31.651075934 - 16.060637323 = 15590438611 (15.6s)
  getpid RET_ALLOW 1 filter: 663 ns
  47.367316169 - 31.652302661 = 15715013508 (15.7s)
  getpid RET_ALLOW 2 filters: 669 ns
  Estimated total seccomp overhead for 1 filter: 23 ns
  Estimated total seccomp overhead for 2 filters: 29 ns
  Estimated seccomp per-filter overhead: 6 ns
  Estimated seccomp entry overhead: 17 ns

Depending on the run, the estimated seccomp overhead for 2 filters can
be less than the overhead for 1 filter, in which case the estimated
seccomp per-filter overhead underflows:
  Estimated total seccomp overhead for 1 filter: 27 ns
  Estimated total seccomp overhead for 2 filters: 21 ns
  Estimated seccomp per-filter overhead: 18446744073709551610 ns
  Estimated seccomp entry overhead: 33 ns

Jack Chen has also run some benchmarks on a bare metal
Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz, with side channel
mitigations off (spec_store_bypass_disable=off spectre_v2=off mds=off
pti=off l1tf=off), with BPF JIT on and docker default profile,
and reported:

  unixbench syscall mix (https://github.com/kdlucas/byte-unixbench)
  unconfined:              33295685
  docker default:          20661056  (60% overhead)
  docker default + cache:  25719937  (30% overhead)

Patch 1 introduces the static analyzer, which checks, for a given
filter and each syscall number, whether the control flow graph loads
the syscall arguments.

Patch 2 implements the bitmap cache.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020

YiFei Zhu (2):
  seccomp/cache: Add "emulator" to check if filter is arg-dependent
  seccomp/cache: Cache filter results that allow syscalls

 arch/x86/Kconfig        |  27 +++
 include/linux/seccomp.h |  22 +++
 kernel/seccomp.c        | 400 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 446 insertions(+), 3 deletions(-)

--
2.28.0


* [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-21  5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
@ 2020-09-21  5:35 ` YiFei Zhu
  2020-09-21 17:47   ` Jann Horn
  2020-09-21  5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-21  5:35 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg

From: YiFei Zhu <yifeifz2@illinois.edu>

SECCOMP_CACHE_NR_ONLY will only operate on syscalls whose filter
results do not depend on the syscall arguments or instruction
pointer. To facilitate this, we need a static analyzer to know
whether a filter will access them. This is implemented here with a
pseudo-emulator, and the result is stored in a per-filter bitmap.
Each seccomp cBPF instruction, aside from ALU (which should rarely
be used in seccomp), gets a naive best-effort emulation for each
syscall number.

The emulator works by following all possible (without SAT solving)
paths the filter can take. Every cBPF register / memory position
records whether it holds a constant and, if so, the value of that
constant. A load from struct seccomp_data is considered constant
if it reads the syscall number; otherwise the value is unknown. For
each conditional jump, if both arguments can be resolved to
constants, the jump is followed after computing the result of the
condition; else both directions are followed, by pushing one of
the next states onto a linked list of states to process. We
keep a finite number of pending states.

The emulation halts if it reaches a return, or if it reaches a
read from struct seccomp_data at an offset that is neither the
syscall number nor the architecture number. In the latter case, we
mark the syscall number as not okay for seccomp to cache. If a
filter depends on other filters, and a filter it depends on cannot
cache the syscall, then the dependent filter is also marked as
unable to cache that syscall.

We also do a single pass over all the filter instructions before
performing emulation. If none of the instructions load from
the troublesome offsets, the filter is considered "trivial",
and all syscalls are marked okay for seccomp to cache.
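The constant-propagation rule for conditional jumps can be sketched in
plain C (a toy userspace model with illustrative names, not the kernel
implementation below):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy emulator state: is the accumulator a known constant? */
struct toy_state {
	bool a_known;
	uint32_t a_const;
};

/*
 * Resolve a "jump if A == k" instruction. If A is a known constant
 * (e.g. it was loaded from the nr field while emulating a fixed
 * syscall number), exactly one successor offset is returned.
 * Otherwise *both is set, mirroring how the real emulator pushes a
 * second pending state and follows both directions.
 */
static int toy_resolve_jeq(const struct toy_state *st, uint32_t k,
			   int jt, int jf, bool *both)
{
	if (!st->a_known) {
		*both = true;	/* value unknown: explore both branches */
		return 0;
	}
	*both = false;
	return st->a_const == k ? jt : jf;
}
```

With the syscall number known, only one path per comparison is
explored; an unknown load (such as a syscall argument) forces the
pending-state fork described above.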

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/x86/Kconfig |  27 ++++
 kernel/seccomp.c | 323 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 349 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..9e6891812053 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1984,6 +1984,33 @@ config SECCOMP
 
 	  If unsure, say Y. Only embedded should say N here.
 
+choice
+	prompt "Seccomp filter cache"
+	default SECCOMP_CACHE_NONE
+	depends on SECCOMP
+	depends on SECCOMP_FILTER
+	help
+	  Seccomp filters can potentially incur large overhead for each
+	  system call. Caching can alleviate some of that overhead.
+
+	  If in doubt, select 'none'.
+
+config SECCOMP_CACHE_NONE
+	bool "None"
+	help
+	  No caching is done. Seccomp filters will be called each time
+	  a system call occurs in a seccomp-guarded task.
+
+config SECCOMP_CACHE_NR_ONLY
+	bool "Syscall number only"
+	help
+	  This enables a bitmap to cache the results of seccomp
+	  filters, if the filter allows the syscall and is independent
+	  of the syscall arguments. This requires around 60 bytes per
+	  filter and 70 bytes per task.
+
+endchoice
+
 source "kernel/Kconfig.hz"
 
 config KEXEC
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 3ee59ce0a323..d8c30901face 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,6 +143,27 @@ struct notification {
 	struct list_head notifications;
 };
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * struct seccomp_cache_filter_data - container for cache's per-filter data
+ *
+ * @syscall_ok: A bitmap where each bit represents whether seccomp is allowed
+ *	        to cache the results of this syscall.
+ */
+struct seccomp_cache_filter_data {
+	DECLARE_BITMAP(syscall_ok, NR_syscalls);
+};
+
+#define SECCOMP_EMU_MAX_PENDING_STATES 64
+#else
+struct seccomp_cache_filter_data { };
+
+static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+	return 0;
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -185,6 +206,7 @@ struct seccomp_filter {
 	struct notification *notif;
 	struct mutex notify_lock;
 	wait_queue_head_t wqh;
+	struct seccomp_cache_filter_data cache;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -530,6 +552,297 @@ static inline void seccomp_sync_threads(unsigned long flags)
 	}
 }
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * struct seccomp_emu_env - container for seccomp emulator environment
+ *
+ * @filter: The cBPF filter instructions.
+ * @next_state: The next pending state to start emulating from.
+ * @next_state_len: Length of the next state linked list. This is used to
+ *		    enforce the maximum number of pending states.
+ * @nr: The syscall number we are emulating.
+ * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the
+ *		syscall.
+ */
+struct seccomp_emu_env {
+	struct sock_filter *filter;
+	struct seccomp_emu_state *next_state;
+	int next_state_len;
+	int nr;
+	bool syscall_ok;
+};
+
+/**
+ * struct seccomp_emu_state - container for seccomp emulator state
+ *
+ * @next: The next pending state. This structure is a linked list.
+ * @pc: The current program counter.
+ * @reg_known: Whether each cBPF register / memory location is a constant.
+ * @reg_const: When a cBPF register / memory location is a constant, the value
+ *	       of that constant.
+ */
+struct seccomp_emu_state {
+	struct seccomp_emu_state *next;
+	int pc;
+	bool reg_known[2 + BPF_MEMWORDS];
+	u32 reg_const[2 + BPF_MEMWORDS];
+};
+
+/**
+ * seccomp_emu_step - step one instruction in the emulator
+ * @env: The emulator environment
+ * @state: The emulator state
+ *
+ * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred.
+ */
+static int seccomp_emu_step(struct seccomp_emu_env *env,
+			    struct seccomp_emu_state *state)
+{
+	struct sock_filter *ftest = &env->filter[state->pc++];
+	struct seccomp_emu_state *new_state;
+	u16 code = ftest->code;
+	u32 k = ftest->k;
+	u32 operand;
+	bool compare;
+	int reg_idx;
+
+	switch (BPF_CLASS(code)) {
+	case BPF_LD:
+	case BPF_LDX:
+		reg_idx = BPF_CLASS(code) == BPF_LDX;
+
+		switch (BPF_MODE(code)) {
+		case BPF_IMM:
+			state->reg_known[reg_idx] = true;
+			state->reg_const[reg_idx] = k;
+			break;
+		case BPF_ABS:
+			if (k == offsetof(struct seccomp_data, nr)) {
+				state->reg_known[reg_idx] = true;
+				state->reg_const[reg_idx] = env->nr;
+			} else {
+				state->reg_known[reg_idx] = false;
+
+				if (k != offsetof(struct seccomp_data, arch)) {
+					env->syscall_ok = false;
+					return 1;
+				}
+			}
+
+			break;
+		case BPF_MEM:
+			state->reg_known[reg_idx] = state->reg_known[2 + k];
+			state->reg_const[reg_idx] = state->reg_const[2 + k];
+			break;
+		default:
+			state->reg_known[reg_idx] = false;
+		}
+
+		return 0;
+	case BPF_ST:
+	case BPF_STX:
+		reg_idx = BPF_CLASS(code) == BPF_STX;
+
+		state->reg_known[2 + k] = state->reg_known[reg_idx];
+		state->reg_const[2 + k] = state->reg_const[reg_idx];
+
+		return 0;
+	case BPF_ALU:
+		state->reg_known[0] = false;
+		return 0;
+	case BPF_JMP:
+		if (BPF_OP(code) == BPF_JA) {
+			state->pc += k;
+			return 0;
+		}
+
+		if (ftest->jt == ftest->jf) {
+			state->pc += ftest->jt;
+			return 0;
+		}
+
+		if (!state->reg_known[0])
+			goto both_cases;
+
+		switch (BPF_SRC(code)) {
+		case BPF_K:
+			operand = k;
+			break;
+		case BPF_X:
+			if (!state->reg_known[1])
+				goto both_cases;
+			operand = state->reg_const[1];
+			break;
+		default:
+			WARN_ON(true);
+			return -EINVAL;
+		}
+
+		switch (BPF_OP(code)) {
+		case BPF_JEQ:
+			compare = state->reg_const[0] == operand;
+			break;
+		case BPF_JGT:
+			compare = state->reg_const[0] > operand;
+			break;
+		case BPF_JGE:
+			compare = state->reg_const[0] >= operand;
+			break;
+		case BPF_JSET:
+			compare = state->reg_const[0] & operand;
+			break;
+		default:
+			WARN_ON(true);
+			return -EINVAL;
+		}
+
+		state->pc += compare ? ftest->jt : ftest->jf;
+
+		return 0;
+
+both_cases:
+		if (env->next_state_len >= SECCOMP_EMU_MAX_PENDING_STATES)
+			return -E2BIG;
+
+		new_state = kmalloc(sizeof(*new_state), GFP_KERNEL);
+		if (!new_state)
+			return -ENOMEM;
+
+		*new_state = *state;
+		new_state->next = env->next_state;
+		new_state->pc += ftest->jt;
+		env->next_state = new_state;
+		env->next_state_len++;
+
+		state->pc += ftest->jf;
+
+		return 0;
+	case BPF_RET:
+		return 1;
+	case BPF_MISC:
+		switch (BPF_MISCOP(code)) {
+		case BPF_TAX:
+			state->reg_known[1] = state->reg_known[0];
+			state->reg_const[1] = state->reg_const[0];
+			break;
+		case BPF_TXA:
+			state->reg_known[0] = state->reg_known[1];
+			state->reg_const[0] = state->reg_const[1];
+			break;
+		default:
+			WARN_ON(true);
+			return -EINVAL;
+		}
+
+		return 0;
+	default:
+		BUILD_BUG();
+		unreachable();
+	}
+}
+
+/**
+ * seccomp_cache_filter_trivial - check if the program does not load arguments.
+ * @fprog: The cBPF program code
+ *
+ * Returns true if the filter is trivial.
+ */
+static bool seccomp_cache_filter_trivial(struct sock_fprog_kern *fprog)
+{
+	int pc;
+
+	for (pc = 0; pc < fprog->len; pc++) {
+		struct sock_filter *ftest = &fprog->filter[pc];
+		u16 code = ftest->code;
+		u32 k = ftest->k;
+
+		if (BPF_CLASS(code) == BPF_LD && BPF_MODE(code) == BPF_ABS) {
+			if (k != offsetof(struct seccomp_data, nr) &&
+			    k != offsetof(struct seccomp_data, arch))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/**
+ * seccomp_cache_prepare - emulate the filter to find cacheable syscalls
+ * @sfilter: The seccomp filter
+ *
+ * Returns 0 if successful or -errno if error occurred.
+ */
+static int seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+	struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
+	struct seccomp_filter *prev = sfilter->prev;
+	struct sock_filter *filter = fprog->filter;
+	struct seccomp_emu_state *state;
+	int nr, res = 0;
+
+	if (seccomp_cache_filter_trivial(fprog)) {
+		if (prev)
+			bitmap_copy(sfilter->cache.syscall_ok,
+				    prev->cache.syscall_ok, NR_syscalls);
+		else
+			bitmap_fill(sfilter->cache.syscall_ok, NR_syscalls);
+
+		return 0;
+	}
+
+	for (nr = 0; nr < NR_syscalls; nr++) {
+		struct seccomp_emu_env env = {0};
+
+		env.syscall_ok = !prev || test_bit(nr, prev->cache.syscall_ok);
+		if (!env.syscall_ok)
+			continue;
+
+		env.filter = filter;
+		env.nr = nr;
+
+		env.next_state = kzalloc(sizeof(*env.next_state), GFP_KERNEL);
+		env.next_state_len = 1;
+		if (!env.next_state)
+			return -ENOMEM;
+
+		while (env.next_state) {
+			state = env.next_state;
+			env.next_state = state->next;
+			env.next_state_len--;
+
+			while (true) {
+				res = seccomp_emu_step(&env, state);
+
+				if (res)
+					break;
+			}
+
+			kfree(state);
+
+			if (res < 0)
+				goto free_states;
+		}
+
+free_states:
+		while (env.next_state) {
+			state = env.next_state;
+			env.next_state = state->next;
+
+			kfree(state);
+		}
+
+		if (res < 0)
+			goto out;
+
+		if (env.syscall_ok)
+			set_bit(nr, sfilter->cache.syscall_ok);
+	}
+
+out:
+	return res;
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_prepare_filter: Prepares a seccomp filter for use.
  * @fprog: BPF program to install
@@ -540,7 +853,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 {
 	struct seccomp_filter *sfilter;
 	int ret;
-	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
+	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
+			       IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY);
 
 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
 		return ERR_PTR(-EINVAL);
@@ -571,6 +885,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 		return ERR_PTR(ret);
 	}
 
+	ret = seccomp_cache_prepare(sfilter);
+	if (ret < 0) {
+		bpf_prog_destroy(sfilter->prog);
+		kfree(sfilter);
+		return ERR_PTR(ret);
+	}
+
 	refcount_set(&sfilter->refs, 1);
 	refcount_set(&sfilter->users, 1);
 	init_waitqueue_head(&sfilter->wqh);
-- 
2.28.0



* [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls
  2020-09-21  5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
  2020-09-21  5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
@ 2020-09-21  5:35 ` YiFei Zhu
  2020-09-21 18:08   ` Jann Horn
  2020-09-25  0:01   ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook
  2020-09-21  5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-21  5:35 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, Andrea Arcangeli, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Valentin Rothberg

From: YiFei Zhu <yifeifz2@illinois.edu>

The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.

We do this by creating a per-task bitmap of permitted syscalls.
When the seccomp filter is invoked, we check whether the syscall is
cached and, if so, directly return allow. Otherwise we call into
the cBPF filter and, if the result is an allow, we cache the result.

The cache is per-task to minimize thread-synchronization issues in
the hot path of cache lookup, and to avoid different architecture
numbers sharing the same cache.

To account for one thread changing the filter for another thread of
the same process, the per-task struct also contains a pointer to
the filter the cache was built on. When a cache lookup uses a
different filter than the last lookup, the per-task cache bitmap is
cleared.

An architecture number change also clears the per-task cache, since
it should be very unlikely for a given thread to change its
architecture.
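The per-task fast path and its invalidation rule can be sketched as
follows (a userspace toy model with illustrative names and a fixed
64-entry table; the real code below uses the kernel bitmap API and
NR_syscalls):

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_SYSCALLS 64	/* stand-in for NR_syscalls */

/* Toy per-task cache, cleared whenever the filter or arch changes. */
struct toy_task_cache {
	uint64_t syscall_ok;	/* one bit per syscall number */
	const void *last_filter;
	uint32_t last_arch;
};

static bool toy_cache_check(struct toy_task_cache *c,
			    const void *filter, uint32_t arch, int nr)
{
	if (nr < 0 || nr >= TOY_NR_SYSCALLS)
		return false;

	if (c->last_filter != filter || c->last_arch != arch) {
		/* Filter or arch changed: invalidate and re-prime. */
		c->last_filter = filter;
		c->last_arch = arch;
		c->syscall_ok = 0;
		return false;
	}

	return c->syscall_ok & (1ULL << nr);
}
```

A hit skips running the cBPF filters entirely; a miss falls through to
them, and an allow verdict sets the bit only when the filter's own
per-filter bitmap marks that syscall as cacheable.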

Benchmark results, on qemu-kvm x86_64 VM, on Intel(R) Core(TM)
i5-8250U CPU @ 1.60GHz, with seccomp_benchmark:

With SECCOMP_CACHE_NONE:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Calibrating sample size for 15 seconds worth of syscalls ...
  Benchmarking 23486415 syscalls...
  16.079642020 - 1.013345439 = 15066296581 (15.1s)
  getpid native: 641 ns
  32.080237410 - 16.080763500 = 15999473910 (16.0s)
  getpid RET_ALLOW 1 filter: 681 ns
  48.609461618 - 32.081296173 = 16528165445 (16.5s)
  getpid RET_ALLOW 2 filters: 703 ns
  Estimated total seccomp overhead for 1 filter: 40 ns
  Estimated total seccomp overhead for 2 filters: 62 ns
  Estimated seccomp per-filter overhead: 22 ns
  Estimated seccomp entry overhead: 18 ns

With SECCOMP_CACHE_NR_ONLY:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Calibrating sample size for 15 seconds worth of syscalls ...
  Benchmarking 23486415 syscalls...
  16.059512499 - 1.014108434 = 15045404065 (15.0s)
  getpid native: 640 ns
  31.651075934 - 16.060637323 = 15590438611 (15.6s)
  getpid RET_ALLOW 1 filter: 663 ns
  47.367316169 - 31.652302661 = 15715013508 (15.7s)
  getpid RET_ALLOW 2 filters: 669 ns
  Estimated total seccomp overhead for 1 filter: 23 ns
  Estimated total seccomp overhead for 2 filters: 29 ns
  Estimated seccomp per-filter overhead: 6 ns
  Estimated seccomp entry overhead: 17 ns

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 include/linux/seccomp.h | 22 ++++++++++++
 kernel/seccomp.c        | 77 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..08ec8b90c99d 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -21,6 +21,27 @@
 #include <asm/seccomp.h>
 
 struct seccomp_filter;
+
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * struct seccomp_cache_task_data - container for seccomp cache's per-task data
+ *
+ * @syscall_ok: A bitmap where each bit represents whether the syscall is cached
+ *		and that the filter allowed it.
+ * @last_filter: If the next cache lookup uses a different filter, the lookup
+ *		 will clear the cache.
+ * @last_arch: If the next cache lookup uses a different arch number, the
+ *	       lookup will clear the cache.
+ */
+struct seccomp_cache_task_data {
+	DECLARE_BITMAP(syscall_ok, NR_syscalls);
+	const struct seccomp_filter *last_filter;
+	u32 last_arch;
+};
+#else
+struct seccomp_cache_task_data { };
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * struct seccomp - the state of a seccomp'ed process
  *
@@ -36,6 +57,7 @@ struct seccomp {
 	int mode;
 	atomic_t filter_count;
 	struct seccomp_filter *filter;
+	struct seccomp_cache_task_data cache;
 };
 
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index d8c30901face..7096f8c86f71 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -162,6 +162,17 @@ static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter)
 {
 	return 0;
 }
+
+static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter,
+				       const struct seccomp_data *sd)
+{
+	return false;
+}
+
+static inline void seccomp_cache_insert(const struct seccomp_filter *sfilter,
+					const struct seccomp_data *sd)
+{
+}
 #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
 
 /**
@@ -316,6 +327,59 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 	return 0;
 }
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * seccomp_cache_check - lookup seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to lookup the cache with
+ *
+ * Returns true if the seccomp_data is cached and allowed.
+ */
+static bool seccomp_cache_check(const struct seccomp_filter *sfilter,
+				const struct seccomp_data *sd)
+{
+	struct seccomp_cache_task_data *thread_data;
+	int syscall_nr = sd->nr;
+
+	if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
+		return false;
+
+	thread_data = &current->seccomp.cache;
+	if (unlikely(thread_data->last_filter != sfilter ||
+		     thread_data->last_arch != sd->arch)) {
+		thread_data->last_filter = sfilter;
+		thread_data->last_arch = sd->arch;
+
+		bitmap_zero(thread_data->syscall_ok, NR_syscalls);
+		return false;
+	}
+
+	return test_bit(syscall_nr, thread_data->syscall_ok);
+}
+
+/**
+ * seccomp_cache_insert - insert into seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to insert into the cache
+ */
+static void seccomp_cache_insert(const struct seccomp_filter *sfilter,
+				 const struct seccomp_data *sd)
+{
+	struct seccomp_cache_task_data *thread_data;
+	int syscall_nr = sd->nr;
+
+	if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
+		return;
+
+	thread_data = &current->seccomp.cache;
+
+	if (!test_bit(syscall_nr, sfilter->cache.syscall_ok))
+		return;
+
+	set_bit(syscall_nr, thread_data->syscall_ok);
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_run_filters - evaluates all seccomp filters against @sd
  * @sd: optional seccomp data to be passed to filters
@@ -331,13 +395,18 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 {
 	u32 ret = SECCOMP_RET_ALLOW;
 	/* Make sure cross-thread synced filter points somewhere sane. */
-	struct seccomp_filter *f =
-			READ_ONCE(current->seccomp.filter);
+	struct seccomp_filter *f, *f_head;
+
+	f = READ_ONCE(current->seccomp.filter);
+	f_head = f;
 
 	/* Ensure unexpected behavior doesn't result in failing open. */
 	if (WARN_ON(f == NULL))
 		return SECCOMP_RET_KILL_PROCESS;
 
+	if (seccomp_cache_check(f_head, sd))
+		return SECCOMP_RET_ALLOW;
+
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
@@ -350,6 +419,10 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 			*match = f;
 		}
 	}
+
+	if (ret == SECCOMP_RET_ALLOW)
+		seccomp_cache_insert(f_head, sd);
+
 	return ret;
 }
 #endif /* CONFIG_SECCOMP_FILTER */
-- 
2.28.0



* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-21  5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
  2020-09-21  5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
  2020-09-21  5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu
@ 2020-09-21  5:48 ` Sargun Dhillon
  2020-09-21  7:13   ` YiFei Zhu
  2020-09-21  8:30 ` Christian Brauner
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 135+ messages in thread
From: Sargun Dhillon @ 2020-09-21  5:48 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook,
	YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos,
	Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas,
	bpf, Tianyin Xu

On Sun, Sep 20, 2020 at 10:35 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>
> From: YiFei Zhu <yifeifz2@illinois.edu>
>
> This series adds a bitmap to cache seccomp filter results if the
> result permits a syscall and is independent of syscall arguments.
> This visibly decreases seccomp overhead for most common seccomp
> filters with very little memory footprint.
>
> The overhead of running Seccomp filters has been part of some past
> discussions [1][2][3]. Oftentimes, the filters have a large number
> of instructions that check syscall numbers one by one and jump based
> on that. Some users chain BPF filters which further enlarge the
> overhead. A recent work [6] comprehensively measures the Seccomp
> overhead and shows that the overhead is non-negligible and has a
> non-trivial impact on application performance.
>
> We propose SECCOMP_CACHE, a cache-based solution to minimize the
> Seccomp overhead. The basic idea is to cache the result of each
> syscall check to save the subsequent overhead of executing the
> filters. This is feasible because the check in Seccomp is
> stateless: the checking result for the same syscall ID and
> arguments remains the same.
>
> We observed that some common filters, such as docker's [4] or
> systemd's [5], make most decisions based only on the syscall
> number, and, as past discussions considered, a bitmap where each bit
> represents a syscall makes the most sense for these filters.
>
> In the past, Kees proposed [2] an "add this syscall to the reject
> bitmask" interface. It is indeed much easier to securely build a
> reject accelerator that pre-filters syscalls before passing them to
> the BPF filters, since it can only strengthen the security provided
> by the filter. Ultimately, however, filter rejections are an
> exceptional / rare case. Here, instead of accelerating what is
> rejected, we accelerate what is allowed. In order not to compromise
> the security rules the BPF filters define, any accept-side
> accelerator must complement the BPF filters rather than replace them.
>
> Statically analyzing BPF bytecode to see whether each syscall will
> always land in allow or reject is more of a rabbit hole, especially
> since there is no current in-kernel infrastructure to enumerate all
> the possible architecture numbers for a given machine. So rather than
> doing that, we propose to cache the results after the BPF filters are
> run. Since there are filters, like docker's, that check the
> arguments of some syscalls but not others, when a filter is loaded
> we analyze it to find whether each syscall is cacheable (does not
> access the syscall arguments or instruction pointer) by following
> its control flow graph, and store the result for each filter in a
> bitmap. Changes to the architecture number or the filter are expected
> to be rare and simply cause the cache to be cleared. This solution
> is fully transparent to userspace.
Long-term, do you believe static analysis will be viable? I think that it is
the "ideal" solution here, but I agree in that it is more complex.

Is there a way to "prime" filters, by giving them a syscall #, and if it
has a terminal condition without inspecting args, turn that into a
bitmask entry?

>
> Ongoing work is to further support arguments with fast hash table
> lookups. We are investigating the performance of doing so [6], and how
> to best integrate with the existing seccomp infrastructure.
>
> We have done some benchmarks with patch applied against bpf-next
> commit 2e80be60c465 ("libbpf: Fix compilation warnings for 64-bit printf args").
>
> Me, in qemu-kvm x86_64 VM, on Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz,
> average results:
>
> Without cache, seccomp_benchmark:
>   Current BPF sysctl settings:
>   net.core.bpf_jit_enable = 1
>   net.core.bpf_jit_harden = 0
>   Calibrating sample size for 15 seconds worth of syscalls ...
>   Benchmarking 23486415 syscalls...
>   16.079642020 - 1.013345439 = 15066296581 (15.1s)
>   getpid native: 641 ns
>   32.080237410 - 16.080763500 = 15999473910 (16.0s)
>   getpid RET_ALLOW 1 filter: 681 ns
>   48.609461618 - 32.081296173 = 16528165445 (16.5s)
>   getpid RET_ALLOW 2 filters: 703 ns
>   Estimated total seccomp overhead for 1 filter: 40 ns
>   Estimated total seccomp overhead for 2 filters: 62 ns
>   Estimated seccomp per-filter overhead: 22 ns
>   Estimated seccomp entry overhead: 18 ns
>
> With cache:
>   Current BPF sysctl settings:
>   net.core.bpf_jit_enable = 1
>   net.core.bpf_jit_harden = 0
>   Calibrating sample size for 15 seconds worth of syscalls ...
>   Benchmarking 23486415 syscalls...
>   16.059512499 - 1.014108434 = 15045404065 (15.0s)
>   getpid native: 640 ns
>   31.651075934 - 16.060637323 = 15590438611 (15.6s)
>   getpid RET_ALLOW 1 filter: 663 ns
>   47.367316169 - 31.652302661 = 15715013508 (15.7s)
>   getpid RET_ALLOW 2 filters: 669 ns
>   Estimated total seccomp overhead for 1 filter: 23 ns
>   Estimated total seccomp overhead for 2 filters: 29 ns
>   Estimated seccomp per-filter overhead: 6 ns
>   Estimated seccomp entry overhead: 17 ns
>
> Depending on the run, the estimated seccomp overhead for 2 filters can
> be less than the overhead for 1 filter, resulting in an underflow in
> the estimated seccomp per-filter overhead:
>   Estimated total seccomp overhead for 1 filter: 27 ns
>   Estimated total seccomp overhead for 2 filters: 21 ns
>   Estimated seccomp per-filter overhead: 18446744073709551610 ns
>   Estimated seccomp entry overhead: 33 ns
>
> Jack Chen has also run some benchmarks on a bare metal
> Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz, with side channel
> mitigations off (spec_store_bypass_disable=off spectre_v2=off mds=off
> pti=off l1tf=off), with BPF JIT on and docker default profile,
> and reported:
>
>   unixbench syscall mix (https://github.com/kdlucas/byte-unixbench)
>   unconfined:      33295685
>   docker default:         20661056  60%
>   docker default + cache: 25719937  30%
>
> Patch 1 introduces the static analyzer to check, for a given filter,
> whether the CFG loads the syscall arguments for each syscall number.
>
> Patch 2 implements the bitmap cache.
>
> [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
> [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
> [3] https://github.com/seccomp/libseccomp/issues/116
> [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
> [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
> [6] Draco: Architectural and Operating System Support for System Call Security
>     https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
>
> YiFei Zhu (2):
>   seccomp/cache: Add "emulator" to check if filter is arg-dependent
>   seccomp/cache: Cache filter results that allow syscalls
>
>  arch/x86/Kconfig        |  27 +++
>  include/linux/seccomp.h |  22 +++
>  kernel/seccomp.c        | 400 +++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 446 insertions(+), 3 deletions(-)
>
> --
> 2.28.0
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-21  5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon
@ 2020-09-21  7:13   ` YiFei Zhu
  0 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-21  7:13 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook,
	YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos,
	Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas,
	bpf, Tianyin Xu

On Mon, Sep 21, 2020 at 12:49 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> On Sun, Sep 20, 2020 at 10:35 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >
> Long-term, do you believe static analysis will be viable? I think that it is
> the "ideal" solution here, but I agree in that it is more complex.
>
> Is there a way to "prime" filters, by giving them a syscall #, and if it
> has a terminal condition without inspecting args, turn that into a
> bitmask entry?

I think in theory one could follow the execution of the filter, and if
the filter is determined to return a pass for a given syscall number
under all circumstances, we record that syscall. We can then replace
the bitmap_zero call in seccomp_cache_check with a call to bitmap_copy
from the pre-primed bitmap. However, I don't know how much benefit
this would provide.
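For illustration, the check-and-prime flow sketched above could look
roughly like the following plain C. This is a hypothetical userspace
model with made-up `toy_*` names, not the kernel code; the actual series
uses the kernel's bitmap helpers and does a bitmap_zero() on flush,
which priming would replace with a copy from a pre-computed bitmap:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NR_SYSCALLS 440
#define BITS_PER_LONG (8 * (int)sizeof(unsigned long))
#define BITMAP_LONGS ((NR_SYSCALLS + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Stand-ins for the kernel's test_bit()/set_bit() bitmap helpers. */
static bool bm_test(const unsigned long *map, int nr)
{
	return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

static void bm_set(unsigned long *map, int nr)
{
	map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

struct toy_filter {
	/* Bits pre-primed at load time: syscalls proven to always allow. */
	unsigned long primed_allow[BITMAP_LONGS];
};

struct toy_thread_cache {
	const struct toy_filter *last_filter;
	unsigned int last_arch;
	unsigned long syscall_ok[BITMAP_LONGS];
};

/* On a filter or arch change, start from the primed bitmap instead of
 * an all-zero one, so "always allowed" syscalls hit the cache even on
 * their first invocation. */
static bool toy_cache_check(struct toy_thread_cache *tc,
			    const struct toy_filter *f,
			    unsigned int arch, int nr)
{
	if (nr < 0 || nr >= NR_SYSCALLS)
		return false;
	if (tc->last_filter != f || tc->last_arch != arch) {
		memcpy(tc->syscall_ok, f->primed_allow,
		       sizeof(tc->syscall_ok));
		tc->last_filter = f;
		tc->last_arch = arch;
	}
	return bm_test(tc->syscall_ok, nr);
}
```

The only behavioral difference from the unprimed design is the memcpy()
on flush; the per-syscall fast path is identical.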

One ugly part of the current situation is that the kernel has
absolutely no idea what arch numbers returned by syscall_get_arch may
be possible for the machine it is running on. For example, for an
x86_64 machine with IA32 emulation, the arch number can be either
AUDIT_ARCH_I386 or AUDIT_ARCH_X86_64. The seccomp filter will
typically have parts handling both cases. As a result, an uncertainty
for one syscall on one arch will affect the syscall under the same
number for the other arch. If a syscall number is not guaranteed to be
allowed under both arches, it won't be primed. Given that usually a
seccomp filter is a list of allowed syscalls, my guess is that there
won't be many syscall numbers that will fall under this case; though,
I have not tested this.

We could add an array of possible arch numbers so that the emulator
can refine its tracing. This is probably the best in effort, though,
seccomp_cache_prepare now has to iterate through all combinations of
syscall numbers and arch numbers. Given that seccomp_cache_prepare
should be relatively cold, it's probably not too much trouble.
Alternatively, we could employ constraint tracking, but that sounds
overly complex for what we are trying to do.

The other question would be, would pre-priming the cache be worth the
effort? The assumption is that the vast majority of cacheable syscalls
will be permitted. For them, only the first time a particular syscall
is invoked would experience the overhead of calling the filter, which
means that this part of the initial run we are going to optimize out
by pre-priming is going to be relatively cold. wdyt?

YiFei Zhu


* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-21  5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
                   ` (2 preceding siblings ...)
  2020-09-21  5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon
@ 2020-09-21  8:30 ` Christian Brauner
  2020-09-21  8:44   ` YiFei Zhu
  2020-09-21 13:51 ` Tycho Andersen
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 135+ messages in thread
From: Christian Brauner @ 2020-09-21  8:30 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski,
	Will Drewry, Jann Horn, Aleksa Sarai, linux-kernel

On Mon, Sep 21, 2020 at 12:35:16AM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <yifeifz2@illinois.edu>
> 
> This series adds a bitmap to cache seccomp filter results if the
> result permits a syscall and is independent of syscall arguments.
> This visibly decreases seccomp overhead for most common seccomp
> filters with very little memory footprint.

This is missing some people so expanding the Cc a little. Make sure to
run scripts/get_maintainers.pl next time, in case you forgot. (Adding
Andy, Will, Jann, Aleksa at least.)

Christian

> [...]


* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-21  8:30 ` Christian Brauner
@ 2020-09-21  8:44   ` YiFei Zhu
  0 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-21  8:44 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski,
	Will Drewry, Jann Horn, Aleksa Sarai, linux-kernel

On Mon, Sep 21, 2020 at 3:30 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> This is missing some people so expanding the Cc a little. Make sure to
> run scripts/get_maintainers.pl next time, in case you forgot. (Adding
> Andy, Will, Jann, Aleksa at least.)
>
> Christian

ok noted. Thanks!

YiFei Zhu


* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-21  5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
                   ` (3 preceding siblings ...)
  2020-09-21  8:30 ` Christian Brauner
@ 2020-09-21 13:51 ` Tycho Andersen
  2020-09-21 15:27   ` YiFei Zhu
  2020-09-21 19:16 ` Jann Horn
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 135+ messages in thread
From: Tycho Andersen @ 2020-09-21 13:51 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook,
	YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos,
	Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas,
	bpf, Tianyin Xu

On Mon, Sep 21, 2020 at 12:35:16AM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <yifeifz2@illinois.edu>
> 
> This series adds a bitmap to cache seccomp filter results if the
> result permits a syscall and is independent of syscall arguments.
> This visibly decreases seccomp overhead for most common seccomp
> filters with very little memory footprint.
> 
> [...]
> 
> We observed some common filters, such as docker's [4] or
> systemd's [5], will make most decisions based only on the syscall
> numbers, and as past discussions considered, a bitmap where each bit
> represents a syscall makes most sense for these filters.

One problem with a kernel config setting is that it's for all tasks.
While docker and systemd may make decisions based on syscall number,
other applications may have more nuanced filters, and this cache would
yield incorrect results.

You could work around this by making this a filter flag instead;
filter authors would generally know whether their filter results can
be cached and probably be motivated to opt in if their users are
complaining about slow syscall execution.

Tycho


* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-21 13:51 ` Tycho Andersen
@ 2020-09-21 15:27   ` YiFei Zhu
  2020-09-21 16:39     ` Tycho Andersen
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-21 15:27 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook,
	YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos,
	Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas,
	bpf, Tianyin Xu, Andy Lutomirski, Will Drewry, Jann Horn,
	Aleksa Sarai, linux-kernel

On Mon, Sep 21, 2020 at 8:51 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> One problem with a kernel config setting is that it's for all tasks.
> While docker and systemd may make decsisions based on syscall number,
> other applications may have more nuanced filters, and this cache would
> yield incorrect results.
>
> You could work around this by making this a filter flag instead;
> filter authors would generally know whether their filter results can
> be cached and probably be motivated to opt in if their users are
> complaining about slow syscall execution.
>
> Tycho

Yielding incorrect results should not be possible. The purpose of the
"emulator" (for lack of a better term) is to determine whether the
filter reads any syscall arguments. A read from a syscall argument
must go through the BPF_LD | BPF_ABS instruction, where the 32 bit
multiuse field "k" is an offset to struct seccomp_data.

struct seccomp_data contains four components [1]: syscall number,
architecture number, instruction pointer at the time of syscall, and
syscall arguments. The syscall number is enumerated by the emulator.
The arch number is treated by the cache as 'if arch number is
different from cached arch number, flush cache' (this is in
seccomp_cache_check). The last two (ip and args) are treated exactly
the same way in this patch: if the filter loads the arguments at all,
the syscall is marked non-cacheable for any architecture number.

The struct seccomp_data is the only external thing the filter may
access. It is also cBPF so it cannot contain maps to store special
states between runs. Therefore a seccomp filter is a pure function. If
we know given some inputs (the syscall number and arch number) the
function will not evaluate any other inputs before returning, then we
can safely cache with just the inputs in concern.
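The purity argument can be made concrete with a toy model (hypothetical
filter and names, not from the patch): because the result depends only
on struct seccomp_data, a syscall whose result is the same for every
argument value can be cached keyed on (nr, arch) alone. Here the
arg-independence is checked by brute-force probing just for brevity;
the series establishes it statically by emulation instead:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a seccomp filter as a pure function of seccomp_data. */
struct toy_seccomp_data {
	int nr;
	uint32_t arch;
	uint64_t args[6];
};

#define TOY_ALLOW 1
#define TOY_KILL  0
#define TOY_ARCH_X86_64 0xc000003eu

/* Hypothetical filter: allows getpid (39) and write (1)
 * unconditionally, allows openat (257) only for one flags value,
 * and kills everything else. */
static int toy_filter(const struct toy_seccomp_data *sd)
{
	if (sd->arch != TOY_ARCH_X86_64)
		return TOY_KILL;
	if (sd->nr == 39 || sd->nr == 1)
		return TOY_ALLOW;
	if (sd->nr == 257)
		return sd->args[2] == 0 ? TOY_ALLOW : TOY_KILL;
	return TOY_KILL;
}

/* A syscall may be cached as "always allowed" only if the result is
 * independent of the arguments (and the ip). */
static bool toy_cacheable_allow(int nr, uint32_t arch)
{
	struct toy_seccomp_data sd = { .nr = nr, .arch = arch };
	uint64_t probe[] = { 0, 1, ~0ull };
	int first = -1;

	for (unsigned int i = 0; i < 3; i++) {
		sd.args[2] = probe[i];
		int r = toy_filter(&sd);
		if (first == -1)
			first = r;
		else if (r != first)
			return false; /* arg-dependent: not cacheable */
	}
	return first == TOY_ALLOW;
}
```

So getpid is cacheable, while openat is not, matching the "does not
access syscall arguments" criterion described above.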

As for the overhead, on my x86_64 with gcc 10.2.0, seccomp_cache_check
compiles into:

    if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
        return false;
0xffffffff8120fdb3 <+99>:    movsxd rdx,DWORD PTR [r12]
0xffffffff8120fdb7 <+103>:    cmp    edx,0x1b7
0xffffffff8120fdbd <+109>:    ja     0xffffffff8120fdf9 <__seccomp_filter+169>
    if (unlikely(thread_data->last_filter != sfilter ||
             thread_data->last_arch != sd->arch)) {
0xffffffff8120fdbf <+111>:    mov    rdi,QWORD PTR [rbp-0xb8]
0xffffffff8120fdc6 <+118>:    lea    rsi,[rax+0x6f0]
0xffffffff8120fdcd <+125>:    cmp    rdi,QWORD PTR [rax+0x728]
0xffffffff8120fdd4 <+132>:    jne    0xffffffff812101f0 <__seccomp_filter+1184>
0xffffffff8120fdda <+138>:    mov    ebx,DWORD PTR [r12+0x4]
0xffffffff8120fddf <+143>:    cmp    DWORD PTR [rax+0x730],ebx
0xffffffff8120fde5 <+149>:    jne    0xffffffff812101f0 <__seccomp_filter+1184>
    return test_bit(syscall_nr, thread_data->syscall_ok);
0xffffffff8120fdeb <+155>:    bt     QWORD PTR [rax+0x6f0],rdx
0xffffffff8120fdf3 <+163>:    jb     0xffffffff8120ffb7 <__seccomp_filter+615>
[... unlikely path of cache flush omitted]

and seccomp_cache_insert compiles into:

    if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
        return;
0xffffffff8121021b <+1227>:    movsxd rax,DWORD PTR [r12]
0xffffffff8121021f <+1231>:    cmp    eax,0x1b7
0xffffffff81210224 <+1236>:    ja     0xffffffff8120ffb7 <__seccomp_filter+615>
    if (!test_bit(syscall_nr, sfilter->cache.syscall_ok))
        return;
0xffffffff8121022a <+1242>:    mov    rbx,QWORD PTR [rbp-0xb8]
0xffffffff81210231 <+1249>:    mov    rdx,QWORD PTR gs:0x17000
0xffffffff8121023a <+1258>:    bt     QWORD PTR [rbx+0x108],rax
0xffffffff81210242 <+1266>:    jae    0xffffffff8120ffb7 <__seccomp_filter+615>
    set_bit(syscall_nr, thread_data->syscall_ok);
0xffffffff81210248 <+1272>:    lock bts QWORD PTR [rdx+0x6f0],rax
0xffffffff81210251 <+1281>:    jmp    0xffffffff8120ffb7 <__seccomp_filter+615>

In the circumstance of a non-cacheable syscall happening over and
over, the code path would go through the syscall_nr bound check, then
the filter flush check, then the test_bit, then another syscall_nr
bound check and one more test_bit in seccomp_cache_insert. Considering
that they are either stack variables, elements of the current
task_struct, or elements of the filter struct, I imagine they would
likely be in the CPU data cache and not incur much overhead. The CPU is also free to
branch predict and reorder memory accesses (there are no hardware
memory barriers here) to further increase the efficiency, whereas a
normal filter execution would be impaired by things like retpoline.

If one were to add an additional flag for
does-userspace-want-us-to-cache, it would still be a member of the
filter struct. What would be loaded into the CPU data cache originally
would still be loaded. Correct me if I'm wrong, but I don't think that
check will reduce any significant overhead of the seccomp cache
itself.

That said, I have not profiled the exact overhead this patch adds to
uncacheable syscalls; I can report back with numbers if you would
like to see them.

Does that answer your concern?

YiFei Zhu

[1] https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/seccomp.h#L60


* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-21 15:27   ` YiFei Zhu
@ 2020-09-21 16:39     ` Tycho Andersen
  2020-09-21 22:57       ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Tycho Andersen @ 2020-09-21 16:39 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook,
	YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos,
	Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas,
	bpf, Tianyin Xu, Andy Lutomirski, Will Drewry, Jann Horn,
	Aleksa Sarai, linux-kernel

On Mon, Sep 21, 2020 at 10:27:56AM -0500, YiFei Zhu wrote:
> On Mon, Sep 21, 2020 at 8:51 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > One problem with a kernel config setting is that it's for all tasks.
> > While docker and systemd may make decsisions based on syscall number,
> > other applications may have more nuanced filters, and this cache would
> > yield incorrect results.
> >
> > You could work around this by making this a filter flag instead;
> > filter authors would generally know whether their filter results can
> > be cached and probably be motivated to opt in if their users are
> > complaining about slow syscall execution.
> >
> > Tycho
> 
> Yielding incorrect results should not be possible. The purpose of the
> "emulator" (for the lack of a better term) is to determine whether the
> filter reads any syscall arguments. A read from a syscall argument
> must go through the BPF_LD | BPF_ABS instruction, where the 32 bit
> multiuse field "k" is an offset to struct seccomp_data.

I see, I missed this somehow. So is there a reason to hide this behind
a config option? Isn't it just always better?

Tycho


* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-21  5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
@ 2020-09-21 17:47   ` Jann Horn
  2020-09-21 18:38     ` Jann Horn
  2020-09-21 23:44     ` YiFei Zhu
  0 siblings, 2 replies; 135+ messages in thread
From: Jann Horn @ 2020-09-21 17:47 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski,
	Will Drewry, Jann Horn, Aleksa Sarai, kernel list

On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not
> access any syscall arguments or instruction pointer. To facilitate
> this we need a static analyzer to know whether a filter will
> access them. This is implemented here with a pseudo-emulator, and
> stored in a per-filter bitmap. Each seccomp cBPF instruction,
> aside from ALU (which should rarely be used in seccomp), gets a
> naive best-effort emulation for each syscall number.
>
> The emulator works by following all possible (without SAT solving)
> paths the filter can take. Every cBPF register / memory position
> records whether it is a constant, and if so, the value of the
> constant. Loading from struct seccomp_data is considered constant
> if it is a syscall number, else it is an unknown. For each
> conditional jump, if both arguments can be resolved to a
> constant, the jump is followed after computing the result of the
> condition; else both directions are followed, by pushing one of
> the next states to a linked list of next states to process. We
> keep a finite number of pending states to process.

Is this actually necessary, or can we just bail out on any branch that
we can't statically resolve?

struct seccomp_data only contains the syscall number (constant for a
given filter evaluation), the architecture number (also constant), the
instruction pointer (basically never used in seccomp filters), and the
syscall arguments. Any normal seccomp filter first branches on the
architecture, then branches on the syscall number, and then branches
on arguments if necessary.

This optimization could only be improved by the "follow both branches"
logic if a seccomp program branches on either the instruction pointer
or an argument *before* looking at the syscall number, and later comes
to the same conclusion on *both* sides of the check. It would have to
be something like:

if (instruction_pointer == 0xasdf1234) {
  if (nr == mmap) return ACCEPT;
  [...]
  return KILL;
} else {
  if (nr == mmap) return ACCEPT;
  [...]
  return KILL;
}

I've never seen anyone do something like this. And the proposed patch
would still bail out on such a filter because of the load from the
instruction_pointer field; I don't think it would even be possible to
reach a branch with an unknown condition with this patch. So I think
we should probably get rid of this extra logic for keeping track of
multiple execution states for now. That would make the code a lot
simpler.
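The bail-out-on-unknown-branch approach could look roughly like this
toy single-path evaluator (invented, reduced instruction set for
illustration, not the real cBPF encodings or the patch's code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reduced toy instruction set. */
enum toy_op { T_LD_NR, T_LD_ARCH, T_LD_ARG, T_JEQ, T_RET_ALLOW, T_RET_KILL };

struct toy_insn {
	enum toy_op op;
	uint32_t k;     /* comparison constant for T_JEQ */
	uint8_t jt, jf; /* relative jump offsets */
};

/* Single-path evaluation for a fixed (nr, arch): returns true iff the
 * filter provably always allows this syscall. Any load of an argument
 * (an unknown value) immediately bails out to "don't cache". */
static bool toy_always_allows(const struct toy_insn *prog, int len,
			      uint32_t nr, uint32_t arch)
{
	uint32_t acc = 0;

	for (int pc = 0; pc < len; ) {
		const struct toy_insn *i = &prog[pc];

		switch (i->op) {
		case T_LD_NR:   acc = nr;   pc++; break;
		case T_LD_ARCH: acc = arch; pc++; break;
		case T_LD_ARG:  return false; /* unknown: bail out */
		case T_JEQ:     pc += 1 + (acc == i->k ? i->jt : i->jf); break;
		case T_RET_ALLOW: return true;
		case T_RET_KILL:  return false;
		}
	}
	return false; /* fell off the end: be conservative */
}
```

A typical allow-list filter (check arch, then nr) only ever branches on
known values here, so no multi-state tracking is needed.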


Also: If it turns out that the time spent in seccomp_cache_prepare()
is measurable for large filters, a possible improvement would be to
keep track of the last syscall number for which the result would be
the same as for the current one, such that instead of evaluating the
filter for one instruction at a time, it would effectively be
evaluated for a range at a time. That should be pretty straightforward
to implement, I think.
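As a sketch of that range idea (hypothetical names; the per-number
predicate here is brute force for illustration, whereas the point of
the suggestion is that the emulator itself would report how far the
current result extends):

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_SYSCALLS 16

/* Example predicate standing in for "filter always allows nr". */
static bool toy_allows(int nr)
{
	return nr >= 3 && nr <= 7;
}

/* Collapse the syscall-number space into maximal runs with the same
 * result, so the bitmap can be filled a range at a time instead of one
 * number at a time. Returns the number of runs found. */
static int toy_collect_runs(int starts[], int ends[], bool allowed[])
{
	int n = 0;

	for (int nr = 0; nr < TOY_NR_SYSCALLS; ) {
		bool v = toy_allows(nr);
		int end = nr;

		while (end + 1 < TOY_NR_SYSCALLS && toy_allows(end + 1) == v)
			end++;
		starts[n] = nr;
		ends[n] = end;
		allowed[n] = v;
		n++;
		nr = end + 1;
	}
	return n;
}
```

For filters built as long if/else chains over contiguous syscall
numbers, the run count would be far smaller than NR_syscalls.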

> The emulation is halted if it reaches a return, or if it reaches a
> read from struct seccomp_data that reads an offset that is neither
> syscall number or architecture number. In the latter case, we mark
> the syscall number as not okay for seccomp to cache. If a filter
> depends on more filters, then if its dependee cannot process the
> syscall then the depender is also marked not to process the syscall.
>
> We also do a single pass on the entire filter instructions before
> performing emulation. If none of the filter instructions load from
> the troublesome offsets, then the filter is considered "trivial",
> and all syscalls are marked okay for seccomp to cache.
>
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> ---
>  arch/x86/Kconfig |  27 ++++
>  kernel/seccomp.c | 323 ++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 349 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
[...]
> +choice
> +       prompt "Seccomp filter cache"
> +       default SECCOMP_CACHE_NONE

I think this should be on by default.

> +       depends on SECCOMP
> +       depends on SECCOMP_FILTER

SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP"
line is unnecessary.

> +       help
> +         Seccomp filters can potentially incur large overhead for each
> +         system call. This can alleviate some of the overhead.
> +
> +         If in doubt, select 'none'.

This should not be in arch/x86. Other architectures, such as arm64,
should also be able to use this without extra work.

> +config SECCOMP_CACHE_NONE
> +       bool "None"
> +       help
> +         No caching is done. Seccomp filters will be called each time
> +         a system call occurs in a seccomp-guarded task.
> +
> +config SECCOMP_CACHE_NR_ONLY
> +       bool "Syscall number only"
> +       help
> +         This is enables a bitmap to cache the results of seccomp
> +         filters, if the filter allows the syscall and is independent
> +         of the syscall arguments.

Maybe reword this as something like: "For each syscall number, if the
seccomp filter has a fixed result, store that result in a bitmap to
speed up system calls."

> This requires around 60 bytes per
> +         filter and 70 bytes per task.
> +
> +endchoice
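Folding these review points together (default on, simpler help text, and the redundant SECCOMP dependency dropped), the entry might end up looking roughly like the following. This is only a sketch of where the review seems to be heading, not the final patch:

```
config SECCOMP_CACHE_NR_ONLY
	bool "Cache seccomp filter results that depend only on the syscall number"
	default y
	depends on SECCOMP_FILTER
	help
	  For each syscall number, if the seccomp filter has a fixed
	  allow result, store that result in a bitmap to speed up
	  system calls.
```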
> +
>  source "kernel/Kconfig.hz"
>
>  config KEXEC
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 3ee59ce0a323..d8c30901face 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -143,6 +143,27 @@ struct notification {
>         struct list_head notifications;
>  };
>
> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +/**
> + * struct seccomp_cache_filter_data - container for cache's per-filter data
> + *
> + * @syscall_ok: A bitmap where each bit represent whether seccomp is allowed to

nit: represents

> + *             cache the results of this syscall.
> + */
> +struct seccomp_cache_filter_data {
> +       DECLARE_BITMAP(syscall_ok, NR_syscalls);
> +};
> +
> +#define SECCOMP_EMU_MAX_PENDING_STATES 64
> +#else
> +struct seccomp_cache_filter_data { };
> +
> +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter)
> +{
> +       return 0;
> +}
> +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
[...]
> +/**
> + * seccomp_emu_step - step one instruction in the emulator
> + * @env: The emulator environment
> + * @state: The emulator state
> + *
> + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred.
> + */
> +static int seccomp_emu_step(struct seccomp_emu_env *env,
> +                           struct seccomp_emu_state *state)
> +{
> +       struct sock_filter *ftest = &env->filter[state->pc++];
> +       struct seccomp_emu_state *new_state;
> +       u16 code = ftest->code;
> +       u32 k = ftest->k;
> +       u32 operand;
> +       bool compare;
> +       int reg_idx;
> +
> +       switch (BPF_CLASS(code)) {
> +       case BPF_LD:
> +       case BPF_LDX:
> +               reg_idx = BPF_CLASS(code) == BPF_LDX;
> +
> +               switch (BPF_MODE(code)) {
> +               case BPF_IMM:
> +                       state->reg_known[reg_idx] = true;
> +                       state->reg_const[reg_idx] = k;
> +                       break;
> +               case BPF_ABS:
> +                       if (k == offsetof(struct seccomp_data, nr)) {
> +                               state->reg_known[reg_idx] = true;
> +                               state->reg_const[reg_idx] = env->nr;
> +                       } else {
> +                               state->reg_known[reg_idx] = false;

This is completely broken. This emulation logic *needs* to run with
the proper architecture identifier. (And for platforms like x86-64
that have compatibility support for a second ABI, the emulation should
probably also be done for that ABI, and there should be separate
bitmasks for that ABI.)

With the current logic, you will (almost) never actually have
permitted syscalls in the bitmask, because filters fundamentally have
to return different results for different ABIs - the syscall numbers
mean completely different things under different ABIs.

> +                               if (k != offsetof(struct seccomp_data, arch)) {
> +                                       env->syscall_ok = false;
> +                                       return 1;
> +                               }
> +                       }

This would read nicer as:

if (k == offsetof(struct seccomp_data, nr)) {

} else if (k == offsetof(struct seccomp_data, arch)) {

} else {
  env->syscall_ok = false;
  return 1;
}

> +
> +                       break;
> +               case BPF_MEM:
> +                       state->reg_known[reg_idx] = state->reg_known[2 + k];
> +                       state->reg_const[reg_idx] = state->reg_const[2 + k];
> +                       break;
> +               default:
> +                       state->reg_known[reg_idx] = false;
> +               }
> +
> +               return 0;
> +       case BPF_ST:
> +       case BPF_STX:
> +               reg_idx = BPF_CLASS(code) == BPF_STX;
> +
> +               state->reg_known[2 + k] = state->reg_known[reg_idx];
> +               state->reg_const[2 + k] = state->reg_const[reg_idx];

I think we should probably just bail out if we see anything that's
BPF_ST/BPF_STX. I've never seen seccomp filters that actually use that
part of cBPF.

But in case we do need this, maybe instead of using "2 +" for all
these things, the cBPF memory slots should be in a separate array.

> +               return 0;
> +       case BPF_ALU:
> +               state->reg_known[0] = false;
> +               return 0;
> +       case BPF_JMP:
> +               if (BPF_OP(code) == BPF_JA) {
> +                       state->pc += k;
> +                       return 0;
> +               }
> +
> +               if (ftest->jt == ftest->jf) {
> +                       state->pc += ftest->jt;
> +                       return 0;
> +               }

Why is this check here? Is anyone actually creating filters with such
obviously nonsensical branches? I know that there are highly ludicrous
filters out there, but I don't think I've ever seen this specific kind
of useless code.

> +               if (!state->reg_known[0])
> +                       goto both_cases;
[...]
> +both_cases:
> +               if (env->next_state_len >= SECCOMP_EMU_MAX_PENDING_STATES)
> +                       return -E2BIG;

Even if we cap the maximum number of pending states, this could still
run for an almost unbounded amount of time, I think. Which is bad. If
this code was actually necessary, we'd probably want to track
separately the total number of branches we've seen and so on.

But as I said, I think this code should just be removed instead.

[...]
> +       }
> +}
[...]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls
  2020-09-21  5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu
@ 2020-09-21 18:08   ` Jann Horn
  2020-09-21 22:50     ` YiFei Zhu
  2020-09-25  0:01   ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook
  1 sibling, 1 reply; 135+ messages in thread
From: Jann Horn @ 2020-09-21 18:08 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski,
	Will Drewry, Aleksa Sarai, kernel list

On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
[...]
> We do this by creating a per-task bitmap of permitted syscalls.
> If seccomp filter is invoked we check if it is cached and if so
> directly return allow. Else we call into the cBPF filter, and if
> the result is an allow then we cache the results.

What? Why? We already have code to statically evaluate the filter for
all syscall numbers. We should be using the results of that instead of
re-running the filter and separately caching the results.

> The cache is per-task

Please don't. The static results are per-filter, so the bitmask(s)
should also be per-filter and immutable.

> minimize thread-synchronization issues in
> the hot path of cache lookup

There should be no need for synchronization because those bitmasks
should be immutable.

> and to avoid different architecture
> numbers sharing the same cache.

There should be separate caches for separate architectures, and we
should precompute the results for all architectures. (We only have
around 2 different architectures max, so it's completely reasonable to
precompute and store all that.)
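The layout being suggested - one immutable allow bitmap per ABI, computed once at filter-attach time - makes the syscall fast path a single bit test with no synchronization. A user-space sketch (struct and field names here are assumptions for illustration, not the eventual kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_SYSCALLS   512			/* assumed size for the sketch */
#define BITMAP_WORDS  ((NR_SYSCALLS + 63) / 64)

/*
 * One "always allowed" bitmap per ABI the kernel can handle (e.g. native
 * x86-64 plus i386 compat).  Both are filled in once when the filter is
 * attached and never written again, so readers need no locking.
 */
struct seccomp_cache {
	uint64_t allow_native[BITMAP_WORDS];
	uint64_t allow_compat[BITMAP_WORDS];	/* second ABI, if any */
};

/* Prepare-time helper: mark one syscall as having a fixed allow result. */
static void cache_set_allow(uint64_t *bm, unsigned int nr)
{
	bm[nr / 64] |= 1ULL << (nr % 64);
}

/* Syscall fast path: one bounds check plus one load-and-mask. */
static bool cache_check_allow(const uint64_t *bm, unsigned int nr)
{
	return nr < NR_SYSCALLS && ((bm[nr / 64] >> (nr % 64)) & 1);
}
```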

> To account for one thread changing the filter for another thread of
> the same process, the per-task struct also contains a pointer to
> the filter the cache is built on. When the cache lookup uses a
> different filter then the last lookup, the per-task cache bitmap is
> cleared.

Unnecessary complexity, we don't need that if we make the bitmasks immutable.


* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-21 17:47   ` Jann Horn
@ 2020-09-21 18:38     ` Jann Horn
  2020-09-21 23:44     ` YiFei Zhu
  1 sibling, 0 replies; 135+ messages in thread
From: Jann Horn @ 2020-09-21 18:38 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski,
	Will Drewry, Aleksa Sarai, kernel list

On Mon, Sep 21, 2020 at 7:47 PM Jann Horn <jannh@google.com> wrote:
> On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not
> > access any syscall arguments or instruction pointer. To facilitate
> > this, we need a static analyser that knows whether a filter will
> > access them. This is implemented here with a pseudo-emulator, and
> > stored in a per-filter bitmap. Each seccomp cBPF instruction,
> > aside from ALU (which should rarely be used in seccomp), gets a
> > naive best-effort emulation for each syscall number.
> >
> > The emulator works by following all possible (without SAT solving)
> > paths the filter can take. Every cBPF register / memory position
> > records whether it is a constant, and if so, the value of the
> > constant. Loading from struct seccomp_data is considered constant
> > if it is a syscall number, else it is an unknown. For each
> > conditional jump, if both arguments can be resolved to a
> > constant, the jump is followed after computing the result of the
> > condition; else both directions are followed, by pushing one of
> > the next states to a linked list of next states to process. We
> > keep a finite number of pending states to process.
>
> Is this actually necessary, or can we just bail out on any branch that
> we can't statically resolve?

Aaaah, now I get what's going on. You statically compute a bitmask
that says whether a given syscall number always has a fixed result
*per architecture number*, and then use that later to decide whether
results can be cached for the combination of a specific seccomp filter
and a specific architecture number. Which mostly works, except that it
means you end up with weird per-thread caches and you get interference
between ABIs (so if a process e.g. filters the argument numbers for
syscall 123 in ABI 1, the results for syscall 123 in ABI 2 also can't
be cached).

Anyway, even though this works, I think it's the wrong way to go about it.


* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-21  5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
                   ` (4 preceding siblings ...)
  2020-09-21 13:51 ` Tycho Andersen
@ 2020-09-21 19:16 ` Jann Horn
       [not found]   ` <OF8837FC1A.5C0D4D64-ON852585EA.006B677F-852585EA.006BA663@notes.na.collabserv.com>
  2020-09-23 19:26 ` Kees Cook
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 135+ messages in thread
From: Jann Horn @ 2020-09-21 19:16 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski,
	Will Drewry, Aleksa Sarai, kernel list

On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> This series adds a bitmap to cache seccomp filter results if the
> result permits a syscall and is independent of syscall arguments.
> This visibly decreases seccomp overhead for most common seccomp
> filters with very little memory footprint.

It would be really nice if, based on this, we could have a new entry
in procfs that has one line per entry in each syscall table. Maybe
something that looks vaguely like:

X86_64 0 (read): ALLOW
X86_64 1 (write): ALLOW
X86_64 2 (open): ERRNO -1
X86_64 3 (close): ALLOW
X86_64 4 (stat): <argument-dependent>
[...]
I386 0 (restart_syscall): ALLOW
I386 1 (exit): ALLOW
I386 2 (fork): KILL
[...]

This would be useful both for inspectability (at the moment it's
pretty hard to figure out what seccomp rules really apply to a given
task) and for testing (so that we could easily write unit tests to
verify that the bitmap calculation works as expected).

But if you don't want to implement that right now, we can do that at a
later point - while it would be nice for making it easier to write
tests for this functionality, I don't see it as a blocker.


> The overhead of running Seccomp filters has been part of some past
> discussions [1][2][3]. Oftentimes, the filters have a large number
> of instructions that check syscall numbers one by one and jump based
> on that. Some users chain BPF filters which further enlarge the
> overhead. A recent work [6] comprehensively measures the Seccomp
> overhead and shows that the overhead is non-negligible and has a
> non-trivial impact on application performance.
>
> We propose SECCOMP_CACHE, a cache-based solution to minimize the
> Seccomp overhead. The basic idea is to cache the result of each
> syscall check to save the subsequent overhead of executing the
> filters. This is feasible, because the check in Seccomp is stateless.
> The checking results for the same syscall ID and arguments remain
> the same.
>
> We observed that some common filters, such as docker's [4] or
> systemd's [5], make most decisions based only on the syscall
> number, and as past discussions concluded, a bitmap where each bit
> represents a syscall makes the most sense for these filters.
[...]
> Statically analyzing BPF bytecode to see if each syscall is going to
> always land in allow or reject is more of a rabbit hole, especially
> since there is no existing in-kernel infrastructure to enumerate all
> the possible architecture numbers for a given machine.

You could add that though. Or if you think that that's too much work,
you could just do it for x86 and arm64 and then use a Kconfig
dependency to limit this to those architectures for now.

> So rather than
> doing that, we propose to cache the results after the BPF filters are
> run.

Please don't add extra complexity just to work around a limitation in
existing code if you could instead remove that limitation in existing
code. Otherwise, code will become unnecessarily hard to understand and
inefficient.

You could let struct seccomp_filter contain three bitmasks - one for
the "native" architecture and up to two for "compat" architectures
(gated on some Kconfig flag).

alpha has 1 architecture number, arc has 1 (per build config), arm
has 1, arm64 has 2, c6x has 1 (per build config), csky has 1, h8300
has 1, hexagon has 1, ia64 has 1, m68k has 1, microblaze has 1, mips
has 3 (per build config), nds32 has 1 (per build config), nios2 has 1,
openrisc has 1, parisc has 2, powerpc has 2 (per build config), riscv
has 1 (per build config), s390 has 2, sh has 1 (per build config),
sparc has 2, x86 has 2, xtensa has 1.

> And since there are filters, like docker's, that will check the
> arguments of some syscalls but not all of them, when
> a filter is loaded we analyze it to find whether each syscall is
> cacheable (does not access syscall argument or instruction pointer) by
> following its control flow graph, and store the result for each filter
> in a bitmap. Changes to architecture number or the filter are expected
> to be rare and simply cause the cache to be cleared. This solution
> shall be fully transparent to userspace.

Caching whether a given syscall number has fixed per-architecture
results across all architectures is a pretty gross hack, please don't.



> Ongoing work is to further support arguments with fast hash table
> lookups. We are investigating the performance of doing so [6], and how
> to best integrate with the existing seccomp infrastructure.


* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
       [not found]   ` <OF8837FC1A.5C0D4D64-ON852585EA.006B677F-852585EA.006BA663@notes.na.collabserv.com>
@ 2020-09-21 19:45     ` Jann Horn
  0 siblings, 0 replies; 135+ messages in thread
From: Jann Horn @ 2020-09-21 19:45 UTC (permalink / raw)
  To: Hubertus Franke
  Cc: Andrea Arcangeli, bpf, Linux Containers, Aleksa Sarai,
	Dimitrios Skarlatos, Giuseppe Scrivano, Jack Chen, Kees Cook,
	kernel list, Andy Lutomirski, Tobin Feldman-Fitzthum,
	Josep Torrellas, Tianyin Xu, Valentin Rothberg, Will Drewry,
	YiFei Zhu, YiFei Zhu

On Mon, Sep 21, 2020 at 9:35 PM Hubertus Franke <frankeh@us.ibm.com> wrote:
> I suggest we first bring it down to the minimal features we want and successively build the functions as these ideas evolve.
> We asked YiFei to prepare a minimal set that brings home the basic features. Might not be 100% optimal but having the hooks, the basic cache in place and getting a good benefit should be a good starting point
> to get this integrated into a linux kernel and then enable a larger experimentation.
> Does it make sense to approach it from that point?

Sure. As I said, I don't think that the procfs part is a blocker - if
YiFei doesn't want to implement it now, I don't think it's necessary.
(But it would make it possible to write more precise tests.)

By the way: Please don't top-post on mailing lists - instead, quote
specific parts of a message and reply below those quotes. Also, don't
send HTML mail to kernel mailing lists, because they will reject it.


* Re: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls
  2020-09-21 18:08   ` Jann Horn
@ 2020-09-21 22:50     ` YiFei Zhu
  2020-09-21 22:57       ` Jann Horn
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-21 22:50 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski,
	Will Drewry, Aleksa Sarai, kernel list

On Mon, Sep 21, 2020 at 1:09 PM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> [...]
> > We do this by creating a per-task bitmap of permitted syscalls.
> > If seccomp filter is invoked we check if it is cached and if so
> > directly return allow. Else we call into the cBPF filter, and if
> > the result is an allow then we cache the results.
>
> What? Why? We already have code to statically evaluate the filter for
> all syscall numbers. We should be using the results of that instead of
> re-running the filter and separately caching the results.
>
> > The cache is per-task
>
> Please don't. The static results are per-filter, so the bitmask(s)
> should also be per-filter and immutable.

I do agree that an immutable bitmask is faster and easier to reason
about its correctness. However, I did not find the "code to statically
evaluate the filter for all syscall numbers" while reading seccomp.c.
Would you give me a pointer to that and I will see how to best make
use of it?

YiFei Zhu


* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-21 16:39     ` Tycho Andersen
@ 2020-09-21 22:57       ` YiFei Zhu
  0 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-21 22:57 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Linux Containers, Andrea Arcangeli, Giuseppe Scrivano, Kees Cook,
	YiFei Zhu, Tobin Feldman-Fitzthum, Dimitrios Skarlatos,
	Valentin Rothberg, Hubertus Franke, Jack Chen, Josep Torrellas,
	bpf, Tianyin Xu, Andy Lutomirski, Will Drewry, Jann Horn,
	Aleksa Sarai, kernel list

On Mon, Sep 21, 2020 at 11:39 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> I see, I missed this somehow. So is there a reason to hide this behind
> a config option? Isn't it just always better?
>
> Tycho

You have a good point; though, I think keeping a config would allow
people to "test the differences" in the unlikely case that some issue
occurs. Jann pointed out that it should be on by default, so I'll do
that.

YiFei Zhu


* Re: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls
  2020-09-21 22:50     ` YiFei Zhu
@ 2020-09-21 22:57       ` Jann Horn
  2020-09-21 23:08         ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Jann Horn @ 2020-09-21 22:57 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski,
	Will Drewry, Aleksa Sarai, kernel list

On Tue, Sep 22, 2020 at 12:51 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> On Mon, Sep 21, 2020 at 1:09 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > [...]
> > > We do this by creating a per-task bitmap of permitted syscalls.
> > > If seccomp filter is invoked we check if it is cached and if so
> > > directly return allow. Else we call into the cBPF filter, and if
> > > the result is an allow then we cache the results.
> >
> > What? Why? We already have code to statically evaluate the filter for
> > all syscall numbers. We should be using the results of that instead of
> > re-running the filter and separately caching the results.
> >
> > > The cache is per-task
> >
> > Please don't. The static results are per-filter, so the bitmask(s)
> > should also be per-filter and immutable.
>
> I do agree that an immutable bitmask is faster and easier to reason
> about its correctness. However, I did not find the "code to statically
> evaluate the filter for all syscall numbers" while reading seccomp.c.
> Would you give me a pointer to that and I will see how to best make
> use of it?

I'm talking about the code you're adding in the other patch ("[RFC
PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is
arg-dependent"). Sorry, that was a bit unclear.


* Re: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls
  2020-09-21 22:57       ` Jann Horn
@ 2020-09-21 23:08         ` YiFei Zhu
  0 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-21 23:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski,
	Will Drewry, Aleksa Sarai, kernel list

On Mon, Sep 21, 2020 at 5:58 PM Jann Horn <jannh@google.com> wrote:
> > I do agree that an immutable bitmask is faster and easier to reason
> > about its correctness. However, I did not find the "code to statically
> > evaluate the filter for all syscall numbers" while reading seccomp.c.
> > Would you give me a pointer to that and I will see how to best make
> > use of it?
>
> I'm talking about the code you're adding in the other patch ("[RFC
> PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is
> arg-dependent"). Sorry, that was a bit unclear.

I see: build an immutable accept bitmask when preparing the filter and
then just use that when running it. I guess if the arch number issue
is resolved this should be more doable. Will do.

YiFei Zhu


* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-21 17:47   ` Jann Horn
  2020-09-21 18:38     ` Jann Horn
@ 2020-09-21 23:44     ` YiFei Zhu
  2020-09-22  0:25       ` Jann Horn
  1 sibling, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-21 23:44 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Valentin Rothberg, Andy Lutomirski,
	Will Drewry, Aleksa Sarai, kernel list

On Mon, Sep 21, 2020 at 12:47 PM Jann Horn <jannh@google.com> wrote:
> Is this actually necessary, or can we just bail out on any branch that
> we can't statically resolve?

I think this would make much more sense after we enumerate the arch
numbers: with the arch number and syscall number both fixed, a branch
on either of them can be resolved statically, so only one direction
needs to be followed.

> Also: If it turns out that the time spent in seccomp_cache_prepare()
> is measurable for large filters, a possible improvement would be to
> keep track of the last syscall number for which the result would be
> the same as for the current one, such that instead of evaluating the
> filter for one instruction at a time, it would effectively be
> evaluated for a range at a time. That should be pretty straightforward
> to implement, I think.

My concern was more about the possibly-exponential amount of time &
memory needed to evaluate an adversarial filter containing full of
unresolveable branches, hence the max pending states. If we never
follow both branches then evaluation should not be much of a concern.

> > +       depends on SECCOMP
> > +       depends on SECCOMP_FILTER
>
> SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP"
> line is unnecessary.

The reason this is here is how it looks in menuconfig. SECCOMP is the
immediately preceding entry, so if this depends on SECCOMP, the config
gets indented under it. Is that appearance not worth keeping, or is
there some better way to do this?

> > +       help
> > +         Seccomp filters can potentially incur large overhead for each
> > +         system call. This can alleviate some of the overhead.
> > +
> > +         If in doubt, select 'none'.
>
> This should not be in arch/x86. Other architectures, such as arm64,
> should also be able to use this without extra work.

In the initial RFC patch I only added to x86. I could add it to any
arch that has seccomp filters. Though, I'm wondering, why is SECCOMP
in the arch-specific Kconfigs?

> I think we should probably just bail out if we see anything that's
> BPF_ST/BPF_STX. I've never seen seccomp filters that actually use that
> part of cBPF.
>
> But in case we do need this, maybe instead of using "2 +" for all
> these things, the cBPF memory slots should be in a separate array.

Ok I'll just bail.

YiFei Zhu


* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-21 23:44     ` YiFei Zhu
@ 2020-09-22  0:25       ` Jann Horn
  2020-09-22  0:47         ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Jann Horn @ 2020-09-22  0:25 UTC (permalink / raw)
  To: YiFei Zhu, Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai,
	kernel list

On Tue, Sep 22, 2020 at 1:44 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> On Mon, Sep 21, 2020 at 12:47 PM Jann Horn <jannh@google.com> wrote:
> > > +       depends on SECCOMP
> > > +       depends on SECCOMP_FILTER
> >
> > SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP"
> > line is unnecessary.
>
> The reason this is here is how it looks in menuconfig. SECCOMP is the
> immediately preceding entry, so if this depends on SECCOMP, the config
> gets indented under it. Is that appearance not worth keeping, or is
> there some better way to do this?

Ah, I didn't realize this.

> > > +       help
> > > +         Seccomp filters can potentially incur large overhead for each
> > > +         system call. This can alleviate some of the overhead.
> > > +
> > > +         If in doubt, select 'none'.
> >
> > This should not be in arch/x86. Other architectures, such as arm64,
> > should also be able to use this without extra work.
>
> In the initial RFC patch I only added to x86. I could add it to any
> arch that has seccomp filters. Though, I'm wondering, why is SECCOMP
> in the arch-specific Kconfigs?

Ugh, yeah, the existing code is already bad... as far as I can tell,
SECCOMP shouldn't be there, and instead the arch-specific Kconfig
should define something like HAVE_ARCH_SECCOMP and then arch/Kconfig
would define SECCOMP and let it depend on HAVE_ARCH_SECCOMP. It's
really gross how the SECCOMP config description has been copypasted
into a dozen different Kconfig files; and looking around a bit, you
can actually see that e.g. s390 has an utterly outdated help text
which still claims that seccomp is controlled via the ancient
"/proc/<pid>/seccomp". I guess this very nicely illustrates why
putting such options into arch-specific Kconfig is a bad idea. :P


* Re: [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-22  0:25       ` Jann Horn
@ 2020-09-22  0:47         ` YiFei Zhu
  0 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-22  0:47 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Valentin Rothberg, Andy Lutomirski, Will Drewry, Aleksa Sarai,
	kernel list

On Mon, Sep 21, 2020 at 7:26 PM Jann Horn <jannh@google.com> wrote:
> > In the initial RFC patch I only added to x86. I could add it to any
> > arch that has seccomp filters. Though, I'm wondering, why is SECCOMP
> > in the arch-specific Kconfigs?
>
> Ugh, yeah, the existing code is already bad... as far as I can tell,
> SECCOMP shouldn't be there, and instead the arch-specific Kconfig
> should define something like HAVE_ARCH_SECCOMP and then arch/Kconfig
> would define SECCOMP and let it depend on HAVE_ARCH_SECCOMP. It's
> really gross how the SECCOMP config description has been copypasted
> into a dozen different Kconfig files; and looking around a bit, you
> can actually see that e.g. s390 has an utterly outdated help text
> which still claims that seccomp is controlled via the ancient
> "/proc/<pid>/seccomp". I guess this very nicely illustrates why
> putting such options into arch-specific Kconfig is a bad idea. :P

Ah, time to fix this then.

YiFei Zhu


* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-21  5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
                   ` (5 preceding siblings ...)
  2020-09-21 19:16 ` Jann Horn
@ 2020-09-23 19:26 ` Kees Cook
  2020-09-23 22:54   ` YiFei Zhu
  2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu
  2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  8 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-23 19:26 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Valentin Rothberg

On Mon, Sep 21, 2020 at 12:35:16AM -0500, YiFei Zhu wrote:
> In the past Kees proposed [2] to have an "add this syscall to the
> reject bitmask". It is indeed much easier to securely make a reject
> accelerator to pre-filter syscalls before passing to the BPF
> filters, considering it could only strengthen the security provided
> by the filter. However, ultimately, filter rejections are an
> exceptional / rare case. Here, instead of accelerating what is
> rejected, we accelerate what is allowed. In order not to compromise
> the security rules the BPF filters defined, any accept-side
> accelerator must complement the BPF filters rather than replacing them.

Did you see the RFC series for this?

https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/

> Without cache, seccomp_benchmark:
>   Current BPF sysctl settings:
>   net.core.bpf_jit_enable = 1
>   net.core.bpf_jit_harden = 0
>   Calibrating sample size for 15 seconds worth of syscalls ...
>   Benchmarking 23486415 syscalls...
>   16.079642020 - 1.013345439 = 15066296581 (15.1s)
>   getpid native: 641 ns
>   32.080237410 - 16.080763500 = 15999473910 (16.0s)
>   getpid RET_ALLOW 1 filter: 681 ns
>   48.609461618 - 32.081296173 = 16528165445 (16.5s)
>   getpid RET_ALLOW 2 filters: 703 ns
>   Estimated total seccomp overhead for 1 filter: 40 ns
>   Estimated total seccomp overhead for 2 filters: 62 ns
>   Estimated seccomp per-filter overhead: 22 ns
>   Estimated seccomp entry overhead: 18 ns
> 
> With cache:
>   Current BPF sysctl settings:
>   net.core.bpf_jit_enable = 1
>   net.core.bpf_jit_harden = 0
>   Calibrating sample size for 15 seconds worth of syscalls ...
>   Benchmarking 23486415 syscalls...
>   16.059512499 - 1.014108434 = 15045404065 (15.0s)
>   getpid native: 640 ns
>   31.651075934 - 16.060637323 = 15590438611 (15.6s)
>   getpid RET_ALLOW 1 filter: 663 ns
>   47.367316169 - 31.652302661 = 15715013508 (15.7s)
>   getpid RET_ALLOW 2 filters: 669 ns
>   Estimated total seccomp overhead for 1 filter: 23 ns
>   Estimated total seccomp overhead for 2 filters: 29 ns
>   Estimated seccomp per-filter overhead: 6 ns
>   Estimated seccomp entry overhead: 17 ns
> 
> Depending on the run estimated seccomp overhead for 2 filters can be
> less than seccomp overhead for 1 filter, resulting in underflow to
> estimated seccomp per-filter overhead:
>   Estimated total seccomp overhead for 1 filter: 27 ns
>   Estimated total seccomp overhead for 2 filters: 21 ns
>   Estimated seccomp per-filter overhead: 18446744073709551610 ns
>   Estimated seccomp entry overhead: 33 ns

Which also includes updated benchmarking:

https://lore.kernel.org/lkml/20200616074934.1600036-6-keescook@chromium.org/

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-23 19:26 ` Kees Cook
@ 2020-09-23 22:54   ` YiFei Zhu
  2020-09-24  6:52     ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-23 22:54 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Valentin Rothberg

On Wed, Sep 23, 2020 at 2:26 PM Kees Cook <keescook@chromium.org> wrote:
> Did you see the RFC series for this?
>
> https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/
> [...]
> Which also includes updated benchmarking:
>
> https://lore.kernel.org/lkml/20200616074934.1600036-6-keescook@chromium.org/

Nice. I was not aware of that series. Looking at it, it seems that our
reasoning, checking arch and nr only and verifying whether the filter
accesses anything else, is the same. However, the approach used in that
RFC was some page-table dark magic, and it has been concluded that an
emulator is superior. Was there a separate patch series with the
emulator? If not, would you mind me cherry-picking some of your changes
in that series?

Also, I see BPF_AND mentioned in the discussion of the linked series.
I think it wouldn't hurt to emulate a few BPF_ALU instructions, so
I'll add that.

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-23 22:54   ` YiFei Zhu
@ 2020-09-24  6:52     ` Kees Cook
  0 siblings, 0 replies; 135+ messages in thread
From: Kees Cook @ 2020-09-24  6:52 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, Andrea Arcangeli,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Valentin Rothberg

On Wed, Sep 23, 2020 at 05:54:51PM -0500, YiFei Zhu wrote:
> On Wed, Sep 23, 2020 at 2:26 PM Kees Cook <keescook@chromium.org> wrote:
> > Did you see the RFC series for this?
> >
> > https://lore.kernel.org/lkml/20200616074934.1600036-1-keescook@chromium.org/
> > [...]
> > Which also includes updated benchmarking:
> >
> > https://lore.kernel.org/lkml/20200616074934.1600036-6-keescook@chromium.org/
> 
> Nice. I was not aware of that series. Looking at it, it seems that our
> reasoning, checking arch and nr only and verifying whether the filter
> accesses anything else, is the same. However, the approach used in that
> RFC was some page-table dark magic, and it has been concluded that an
> emulator is superior. Was there a separate patch series with the
> emulator? If not, would you mind me cherry-picking some of your changes
> in that series?

I've sent that series refreshed with Jann's emulator now[1]. (Which I
see you've replied to as well, but I figured I'd just link these threads
for any future archaeology. ;)

> Also, I see BPF_AND mentioned in the discussion of the linked series.
> I think it wouldn't hurt to emulate a few BPF_ALU instructions, so
> I'll add that.

If you could add ALU|AND, that would get us complete coverage for
libseccomp and Chrome. I don't want the emulator to get any more complex
than that, as I view it as a fairly high-risk part. As you can see, I
tried really hard to _not_ use an emulator in the RFC. ;)

[1] https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig
  2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu
@ 2020-09-24 12:06   ` YiFei Zhu
  2020-09-24 12:06     ` YiFei Zhu
  2020-09-24 12:06   ` [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

In order to make adding configurable features into seccomp
easier, it's better to have the options in one single location,
especially considering that the bulk of the seccomp code is
arch-independent. A quick look also shows that many SECCOMP
descriptions are outdated; they talk about /proc rather than
prctl.

As a result of moving the config option and keeping it default
on, the architectures arm, arm64, csky, riscv, sh, and xtensa,
which did not have SECCOMP on by default prior to this, will now
have SECCOMP default to on.

Architectures microblaze, mips, powerpc, s390, sh, and sparc
have an outdated dependency on PROC_FS; this dependency is removed
in this change.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig            | 21 +++++++++++++++++++++
 arch/arm/Kconfig        | 15 +--------------
 arch/arm64/Kconfig      | 13 -------------
 arch/csky/Kconfig       | 13 -------------
 arch/microblaze/Kconfig | 18 +-----------------
 arch/mips/Kconfig       | 17 -----------------
 arch/parisc/Kconfig     | 16 ----------------
 arch/powerpc/Kconfig    | 17 -----------------
 arch/riscv/Kconfig      | 13 -------------
 arch/s390/Kconfig       | 17 -----------------
 arch/sh/Kconfig         | 16 ----------------
 arch/sparc/Kconfig      | 18 +-----------------
 arch/um/Kconfig         | 16 ----------------
 arch/x86/Kconfig        | 16 ----------------
 arch/xtensa/Kconfig     | 14 --------------
 15 files changed, 24 insertions(+), 216 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..6dfc5673215d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -444,8 +444,12 @@ config ARCH_WANT_OLD_COMPAT_IPC
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
 	bool
 
+config HAVE_ARCH_SECCOMP
+	bool
+
 config HAVE_ARCH_SECCOMP_FILTER
 	bool
+	select HAVE_ARCH_SECCOMP
 	help
 	  An arch should select this symbol if it provides all of these things:
 	  - syscall_get_arch()
@@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER
 	    results in the system call being skipped immediately.
 	  - seccomp syscall wired up
 
+config SECCOMP
+	def_bool y
+	depends on HAVE_ARCH_SECCOMP
+	prompt "Enable seccomp to safely compute untrusted bytecode"
+	help
+	  This kernel feature is useful for number crunching applications
+	  that may need to compute untrusted bytecode during their
+	  execution. By using pipes or other transports made available to
+	  the process as file descriptors supporting the read/write
+	  syscalls, it's possible to isolate those applications in
+	  their own address space using seccomp. Once seccomp is
+	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
+	  and the task is only allowed to execute a few safe syscalls
+	  defined by each seccomp mode.
+
+	  If unsure, say Y. Only embedded should say N here.
+
 config SECCOMP_FILTER
 	def_bool y
 	depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e00d94b16658..e26c19a16284 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -67,6 +67,7 @@ config ARM
 	select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
+	select HAVE_ARCH_SECCOMP
 	select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
@@ -1617,20 +1618,6 @@ config UACCESS_WITH_MEMCPY
 	  However, if the CPU data cache is using a write-allocate mode,
 	  this option is unlikely to provide any performance gain.
 
-config SECCOMP
-	bool
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config PARAVIRT
 	bool "Enable paravirtualization code"
 	help
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..98c4e34cbec1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1033,19 +1033,6 @@ config ARCH_ENABLE_SPLIT_PMD_PTLOCK
 config CC_HAVE_SHADOW_CALL_STACK
 	def_bool $(cc-option, -fsanitize=shadow-call-stack -ffixed-x18)
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config PARAVIRT
 	bool "Enable paravirtualization code"
 	help
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 3d5afb5f5685..7f424c85772c 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -309,16 +309,3 @@ endmenu
 source "arch/csky/Kconfig.platforms"
 
 source "kernel/Kconfig.hz"
-
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index d262ac0c8714..37bd6a5f38fb 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -26,6 +26,7 @@ config MICROBLAZE
 	select GENERIC_SCHED_CLOCK
 	select HAVE_ARCH_HASH
 	select HAVE_ARCH_KGDB
+	select HAVE_ARCH_SECCOMP
 	select HAVE_DEBUG_KMEMLEAK
 	select HAVE_DMA_CONTIGUOUS
 	select HAVE_DYNAMIC_FTRACE
@@ -120,23 +121,6 @@ config CMDLINE_FORCE
 	  Set this to have arguments from the default kernel command string
 	  override those passed by the boot loader.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 endmenu
 
 menu "Kernel features"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index c95fa3a2484c..5f88a8fc11fc 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -3004,23 +3004,6 @@ config PHYSICAL_START
 	  specified in the "crashkernel=YM@XM" command line boot parameter
 	  passed to the panic-ed kernel).
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config MIPS_O32_FP64_SUPPORT
 	bool "Support for O32 binaries using 64-bit FP" if !CPU_MIPSR6
 	depends on 32BIT || MIPS32_O32
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 3b0f53dd70bc..cd4afe1e7a6c 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -378,19 +378,3 @@ endmenu
 
 
 source "drivers/parisc/Kconfig"
-
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..136fe860caef 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -934,23 +934,6 @@ config ARCH_WANTS_FREEZER_CONTROL
 
 source "kernel/power/Kconfig"
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config PPC_MEM_KEYS
 	prompt "PowerPC Memory Protection Keys"
 	def_bool y
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index df18372861d8..c456b558fab9 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -333,19 +333,6 @@ menu "Kernel features"
 
 source "kernel/Kconfig.hz"
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config RISCV_SBI_V01
 	bool "SBI v0.1 support"
 	default y
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 3d86e12e8e3c..7f7b40ec699e 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -791,23 +791,6 @@ config CRASH_DUMP
 
 endmenu
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y.
-
 config CCW
 	def_bool y
 
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index d20927128fce..18278152c91c 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -600,22 +600,6 @@ config PHYSICAL_START
 	  where the fail safe kernel needs to run at a different address
 	  than the panic-ed kernel.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl, it cannot be disabled and the task is only
-	  allowed to execute a few safe syscalls defined by each seccomp
-	  mode.
-
-	  If unsure, say N.
-
 config SMP
 	bool "Symmetric multi-processing support"
 	depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index efeff2c896a5..d62ce83cf009 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -23,6 +23,7 @@ config SPARC
 	select HAVE_OPROFILE
 	select HAVE_ARCH_KGDB if !SMP || SPARC64
 	select HAVE_ARCH_TRACEHOOK
+	select HAVE_ARCH_SECCOMP if SPARC64
 	select HAVE_EXIT_THREAD
 	select HAVE_PCI
 	select SYSCTL_EXCEPTION_TRACE
@@ -226,23 +227,6 @@ config EARLYFB
 	help
 	  Say Y here to enable a faster early framebuffer boot console.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on SPARC64 && PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config HOTPLUG_CPU
 	bool "Support for hot-pluggable CPUs"
 	depends on SPARC64 && SMP
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index eb51fec75948..d49f471b02e3 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -173,22 +173,6 @@ config PGTABLE_LEVELS
 	default 3 if 3_LEVEL_PGTABLES
 	default 2
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y.
-
 config UML_TIME_TRAVEL_SUPPORT
 	bool
 	prompt "Support time-travel mode (e.g. for test execution)"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..1ab22869a765 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1968,22 +1968,6 @@ config EFI_MIXED
 
 	   If unsure, say N.
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 source "kernel/Kconfig.hz"
 
 config KEXEC
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index e997e0119c02..d8a29dc5a284 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -217,20 +217,6 @@ config HOTPLUG_CPU
 
 	  Say N if you want to disable CPU hotplug.
 
-config SECCOMP
-	bool
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config FAST_SYSCALL_XTENSA
 	bool "Enable fast atomic syscalls"
 	default n
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-21  5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
                   ` (6 preceding siblings ...)
  2020-09-23 19:26 ` Kees Cook
@ 2020-09-24 12:06 ` YiFei Zhu
  2020-09-24 12:06   ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
                     ` (6 more replies)
  2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  8 siblings, 7 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/

Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.
* Architectures supported by default through arch number array,
  except for MIPS with its sparse syscall numbers.
* Configurable per-build for future different cache modes.

This series adds a bitmap to cache seccomp filter results if the
result permits a syscall and is independent of syscall arguments.
This visibly decreases seccomp overhead for most common seccomp
filters with very little memory footprint.

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.

We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
numbers; as past discussions concluded, a bitmap where each bit
represents a syscall makes the most sense for these filters.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent of syscall arguments.

When it is concluded that an allow must occur for the given
architecture and syscall pair, seccomp will immediately allow
the syscall, bypassing further BPF execution.

Ongoing work is to further support arguments with fast hash table
lookups. We are investigating the performance of doing so [6], and how
to best integrate with the existing seccomp infrastructure.

Some benchmarks are performed with results in patch 5, copied below:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Benchmarking 100000000 syscalls...
  63.896255358 - 0.008504529 = 63887750829 (63.9s)
  getpid native: 638 ns
  130.383312423 - 63.897315189 = 66485997234 (66.5s)
  getpid RET_ALLOW 1 filter (bitmap): 664 ns
  196.789080421 - 130.384414983 = 66404665438 (66.4s)
  getpid RET_ALLOW 2 filters (bitmap): 664 ns
  268.844643304 - 196.790234168 = 72054409136 (72.1s)
  getpid RET_ALLOW 3 filters (full): 720 ns
  342.627472515 - 268.845799103 = 73781673412 (73.8s)
  getpid RET_ALLOW 4 filters (full): 737 ns
  Estimated total seccomp overhead for 1 bitmapped filter: 26 ns
  Estimated total seccomp overhead for 2 bitmapped filters: 26 ns
  Estimated total seccomp overhead for 3 full filters: 82 ns
  Estimated total seccomp overhead for 4 full filters: 99 ns
  Estimated seccomp entry overhead: 26 ns
  Estimated seccomp per-filter overhead (last 2 diff): 17 ns
  Estimated seccomp per-filter overhead (filters / 4): 18 ns
  Expectations:
  	native ≤ 1 bitmap (638 ≤ 664): ✔️
  	native ≤ 1 filter (638 ≤ 720): ✔️
  	per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️
  	1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️
  	entry ≈ 1 bitmapped (26 ≈ 26): ✔️
  	entry ≈ 2 bitmapped (26 ≈ 26): ✔️
  	native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️

RFC -> v1:
* Config made on by default across all arches that could support it.
* Added arch numbers array and emulate filter for each arch number, and
  have a per-arch bitmap.
* Massively simplified the emulator so it would only support the common
  instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL
  during prepare).
* Stole the selftest from Kees.
* Added /proc/pid/seccomp_cache, per Jann's suggestion.

Patch 1 moves the SECCOMP Kconfig option to arch/Kconfig.

Patch 2 adds a syscall_arches array so the emulator can enumerate it.

Patch 3 implements the emulator that determines whether a filter must return allow.

Patch 4 implements the test_bit against the bitmaps.

Patch 5 updates the selftest to better show the new semantics.

Patch 6 implements /proc/pid/seccomp_cache.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020

Kees Cook (1):
  selftests/seccomp: Compare bitmap vs filter overhead

YiFei Zhu (5):
  seccomp: Move config option SECCOMP to arch/Kconfig
  asm/syscall.h: Add syscall_arches[] array
  seccomp/cache: Add "emulator" to check if filter is arg-dependent
  seccomp/cache: Lookup syscall allowlist for fast path
  seccomp/cache: Report cache data through /proc/pid/seccomp_cache

 arch/Kconfig                                  |  56 ++++
 arch/alpha/include/asm/syscall.h              |   4 +
 arch/arc/include/asm/syscall.h                |  24 +-
 arch/arm/Kconfig                              |  15 +-
 arch/arm/include/asm/syscall.h                |   4 +
 arch/arm64/Kconfig                            |  13 -
 arch/arm64/include/asm/syscall.h              |   4 +
 arch/c6x/include/asm/syscall.h                |  13 +-
 arch/csky/Kconfig                             |  13 -
 arch/csky/include/asm/syscall.h               |   4 +
 arch/h8300/include/asm/syscall.h              |   4 +
 arch/hexagon/include/asm/syscall.h            |   4 +
 arch/ia64/include/asm/syscall.h               |   4 +
 arch/m68k/include/asm/syscall.h               |   4 +
 arch/microblaze/Kconfig                       |  18 +-
 arch/microblaze/include/asm/syscall.h         |   4 +
 arch/mips/Kconfig                             |  17 --
 arch/mips/include/asm/syscall.h               |  16 ++
 arch/nds32/include/asm/syscall.h              |  13 +-
 arch/nios2/include/asm/syscall.h              |   4 +
 arch/openrisc/include/asm/syscall.h           |   4 +
 arch/parisc/Kconfig                           |  16 --
 arch/parisc/include/asm/syscall.h             |   7 +
 arch/powerpc/Kconfig                          |  17 --
 arch/powerpc/include/asm/syscall.h            |  14 +
 arch/riscv/Kconfig                            |  13 -
 arch/riscv/include/asm/syscall.h              |  14 +-
 arch/s390/Kconfig                             |  17 --
 arch/s390/include/asm/syscall.h               |   7 +
 arch/sh/Kconfig                               |  16 --
 arch/sh/include/asm/syscall_32.h              |  17 +-
 arch/sparc/Kconfig                            |  18 +-
 arch/sparc/include/asm/syscall.h              |   9 +
 arch/um/Kconfig                               |  16 --
 arch/x86/Kconfig                              |  16 --
 arch/x86/include/asm/syscall.h                |  11 +
 arch/x86/um/asm/syscall.h                     |  14 +-
 arch/xtensa/Kconfig                           |  14 -
 arch/xtensa/include/asm/syscall.h             |   4 +
 fs/proc/base.c                                |   7 +-
 include/linux/seccomp.h                       |   5 +
 kernel/seccomp.c                              | 259 +++++++++++++++++-
 .../selftests/seccomp/seccomp_benchmark.c     | 151 ++++++++--
 tools/testing/selftests/seccomp/settings      |   2 +-
 44 files changed, 641 insertions(+), 265 deletions(-)

--
2.28.0

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig
  2020-09-24 12:06   ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
@ 2020-09-24 12:06     ` YiFei Zhu
  0 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

In order to make adding configurable features into seccomp
easier, it's better to have the options in one single location,
especially considering that the bulk of the seccomp code is
arch-independent. A quick look also shows that many SECCOMP
descriptions are outdated; they talk about /proc rather than
prctl.

As a result of moving the config option and keeping it default
on, architectures arm, arm64, csky, riscv, sh, and xtensa,
which did not have SECCOMP on by default prior to this, will
now have SECCOMP default to on.

Architectures microblaze, mips, powerpc, s390, sh, and sparc
have an outdated dependency on PROC_FS; this dependency is
removed in this change.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig            | 21 +++++++++++++++++++++
 arch/arm/Kconfig        | 15 +--------------
 arch/arm64/Kconfig      | 13 -------------
 arch/csky/Kconfig       | 13 -------------
 arch/microblaze/Kconfig | 18 +-----------------
 arch/mips/Kconfig       | 17 -----------------
 arch/parisc/Kconfig     | 16 ----------------
 arch/powerpc/Kconfig    | 17 -----------------
 arch/riscv/Kconfig      | 13 -------------
 arch/s390/Kconfig       | 17 -----------------
 arch/sh/Kconfig         | 16 ----------------
 arch/sparc/Kconfig      | 18 +-----------------
 arch/um/Kconfig         | 16 ----------------
 arch/x86/Kconfig        | 16 ----------------
 arch/xtensa/Kconfig     | 14 --------------
 15 files changed, 24 insertions(+), 216 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..6dfc5673215d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -444,8 +444,12 @@ config ARCH_WANT_OLD_COMPAT_IPC
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
 	bool
 
+config HAVE_ARCH_SECCOMP
+	bool
+
 config HAVE_ARCH_SECCOMP_FILTER
 	bool
+	select HAVE_ARCH_SECCOMP
 	help
 	  An arch should select this symbol if it provides all of these things:
 	  - syscall_get_arch()
@@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER
 	    results in the system call being skipped immediately.
 	  - seccomp syscall wired up
 
+config SECCOMP
+	def_bool y
+	depends on HAVE_ARCH_SECCOMP
+	prompt "Enable seccomp to safely compute untrusted bytecode"
+	help
+	  This kernel feature is useful for number crunching applications
+	  that may need to compute untrusted bytecode during their
+	  execution. By using pipes or other transports made available to
+	  the process as file descriptors supporting the read/write
+	  syscalls, it's possible to isolate those applications in
+	  their own address space using seccomp. Once seccomp is
+	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
+	  and the task is only allowed to execute a few safe syscalls
+	  defined by each seccomp mode.
+
+	  If unsure, say Y. Only embedded should say N here.
+
 config SECCOMP_FILTER
 	def_bool y
 	depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e00d94b16658..e26c19a16284 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -67,6 +67,7 @@ config ARM
 	select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
+	select HAVE_ARCH_SECCOMP
 	select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
@@ -1617,20 +1618,6 @@ config UACCESS_WITH_MEMCPY
 	  However, if the CPU data cache is using a write-allocate mode,
 	  this option is unlikely to provide any performance gain.
 
-config SECCOMP
-	bool
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config PARAVIRT
 	bool "Enable paravirtualization code"
 	help
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..98c4e34cbec1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1033,19 +1033,6 @@ config ARCH_ENABLE_SPLIT_PMD_PTLOCK
 config CC_HAVE_SHADOW_CALL_STACK
 	def_bool $(cc-option, -fsanitize=shadow-call-stack -ffixed-x18)
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config PARAVIRT
 	bool "Enable paravirtualization code"
 	help
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 3d5afb5f5685..7f424c85772c 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -309,16 +309,3 @@ endmenu
 source "arch/csky/Kconfig.platforms"
 
 source "kernel/Kconfig.hz"
-
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index d262ac0c8714..37bd6a5f38fb 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -26,6 +26,7 @@ config MICROBLAZE
 	select GENERIC_SCHED_CLOCK
 	select HAVE_ARCH_HASH
 	select HAVE_ARCH_KGDB
+	select HAVE_ARCH_SECCOMP
 	select HAVE_DEBUG_KMEMLEAK
 	select HAVE_DMA_CONTIGUOUS
 	select HAVE_DYNAMIC_FTRACE
@@ -120,23 +121,6 @@ config CMDLINE_FORCE
 	  Set this to have arguments from the default kernel command string
 	  override those passed by the boot loader.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 endmenu
 
 menu "Kernel features"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index c95fa3a2484c..5f88a8fc11fc 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -3004,23 +3004,6 @@ config PHYSICAL_START
 	  specified in the "crashkernel=YM@XM" command line boot parameter
 	  passed to the panic-ed kernel).
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config MIPS_O32_FP64_SUPPORT
 	bool "Support for O32 binaries using 64-bit FP" if !CPU_MIPSR6
 	depends on 32BIT || MIPS32_O32
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 3b0f53dd70bc..cd4afe1e7a6c 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -378,19 +378,3 @@ endmenu
 
 
 source "drivers/parisc/Kconfig"
-
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..136fe860caef 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -934,23 +934,6 @@ config ARCH_WANTS_FREEZER_CONTROL
 
 source "kernel/power/Kconfig"
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config PPC_MEM_KEYS
 	prompt "PowerPC Memory Protection Keys"
 	def_bool y
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index df18372861d8..c456b558fab9 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -333,19 +333,6 @@ menu "Kernel features"
 
 source "kernel/Kconfig.hz"
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config RISCV_SBI_V01
 	bool "SBI v0.1 support"
 	default y
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 3d86e12e8e3c..7f7b40ec699e 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -791,23 +791,6 @@ config CRASH_DUMP
 
 endmenu
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y.
-
 config CCW
 	def_bool y
 
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index d20927128fce..18278152c91c 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -600,22 +600,6 @@ config PHYSICAL_START
 	  where the fail safe kernel needs to run at a different address
 	  than the panic-ed kernel.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl, it cannot be disabled and the task is only
-	  allowed to execute a few safe syscalls defined by each seccomp
-	  mode.
-
-	  If unsure, say N.
-
 config SMP
 	bool "Symmetric multi-processing support"
 	depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index efeff2c896a5..d62ce83cf009 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -23,6 +23,7 @@ config SPARC
 	select HAVE_OPROFILE
 	select HAVE_ARCH_KGDB if !SMP || SPARC64
 	select HAVE_ARCH_TRACEHOOK
+	select HAVE_ARCH_SECCOMP if SPARC64
 	select HAVE_EXIT_THREAD
 	select HAVE_PCI
 	select SYSCTL_EXCEPTION_TRACE
@@ -226,23 +227,6 @@ config EARLYFB
 	help
 	  Say Y here to enable a faster early framebuffer boot console.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on SPARC64 && PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config HOTPLUG_CPU
 	bool "Support for hot-pluggable CPUs"
 	depends on SPARC64 && SMP
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index eb51fec75948..d49f471b02e3 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -173,22 +173,6 @@ config PGTABLE_LEVELS
 	default 3 if 3_LEVEL_PGTABLES
 	default 2
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y.
-
 config UML_TIME_TRAVEL_SUPPORT
 	bool
 	prompt "Support time-travel mode (e.g. for test execution)"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..1ab22869a765 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1968,22 +1968,6 @@ config EFI_MIXED
 
 	   If unsure, say N.
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 source "kernel/Kconfig.hz"
 
 config KEXEC
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index e997e0119c02..d8a29dc5a284 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -217,20 +217,6 @@ config HOTPLUG_CPU
 
 	  Say N if you want to disable CPU hotplug.
 
-config SECCOMP
-	bool
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config FAST_SYSCALL_XTENSA
 	bool "Enable fast atomic syscalls"
 	default n
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu
  2020-09-24 12:06   ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
@ 2020-09-24 12:06   ` YiFei Zhu
  2020-09-24 12:06   ` [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Seccomp cache emulator needs to know all the architecture numbers
that syscall_get_arch() could return for the kernel build in order
to generate a cache for all of them.

The array is declared in the header as static __maybe_unused const
to maximize compiler optimization opportunities such as loop
unrolling.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/alpha/include/asm/syscall.h      |  4 ++++
 arch/arc/include/asm/syscall.h        | 24 +++++++++++++++++++-----
 arch/arm/include/asm/syscall.h        |  4 ++++
 arch/arm64/include/asm/syscall.h      |  4 ++++
 arch/c6x/include/asm/syscall.h        | 13 +++++++++++--
 arch/csky/include/asm/syscall.h       |  4 ++++
 arch/h8300/include/asm/syscall.h      |  4 ++++
 arch/hexagon/include/asm/syscall.h    |  4 ++++
 arch/ia64/include/asm/syscall.h       |  4 ++++
 arch/m68k/include/asm/syscall.h       |  4 ++++
 arch/microblaze/include/asm/syscall.h |  4 ++++
 arch/mips/include/asm/syscall.h       | 16 ++++++++++++++++
 arch/nds32/include/asm/syscall.h      | 13 +++++++++++--
 arch/nios2/include/asm/syscall.h      |  4 ++++
 arch/openrisc/include/asm/syscall.h   |  4 ++++
 arch/parisc/include/asm/syscall.h     |  7 +++++++
 arch/powerpc/include/asm/syscall.h    | 14 ++++++++++++++
 arch/riscv/include/asm/syscall.h      | 14 ++++++++++----
 arch/s390/include/asm/syscall.h       |  7 +++++++
 arch/sh/include/asm/syscall_32.h      | 17 +++++++++++------
 arch/sparc/include/asm/syscall.h      |  9 +++++++++
 arch/x86/include/asm/syscall.h        | 11 +++++++++++
 arch/x86/um/asm/syscall.h             | 14 ++++++++++----
 arch/xtensa/include/asm/syscall.h     |  4 ++++
 24 files changed, 184 insertions(+), 23 deletions(-)

diff --git a/arch/alpha/include/asm/syscall.h b/arch/alpha/include/asm/syscall.h
index 11c688c1d7ec..625ac9b23f37 100644
--- a/arch/alpha/include/asm/syscall.h
+++ b/arch/alpha/include/asm/syscall.h
@@ -4,6 +4,10 @@
 
 #include <uapi/linux/audit.h>
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_ALPHA
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_ALPHA;
diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h
index 94529e89dff0..899c13cbf5cc 100644
--- a/arch/arc/include/asm/syscall.h
+++ b/arch/arc/include/asm/syscall.h
@@ -65,14 +65,28 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
 	}
 }
 
+#ifdef CONFIG_ISA_ARCOMPACT
+# ifdef CONFIG_CPU_BIG_ENDIAN
+#  define SYSCALL_ARCH AUDIT_ARCH_ARCOMPACTBE
+# else
+#  define SYSCALL_ARCH AUDIT_ARCH_ARCOMPACT
+# endif /* CONFIG_CPU_BIG_ENDIAN */
+#else
+# ifdef CONFIG_CPU_BIG_ENDIAN
+#  define SYSCALL_ARCH AUDIT_ARCH_ARCV2BE
+# else
+#  define SYSCALL_ARCH AUDIT_ARCH_ARCV2
+# endif /* CONFIG_CPU_BIG_ENDIAN */
+#endif /* CONFIG_ISA_ARCOMPACT */
+
+static __maybe_unused const int syscall_arches[] = {
+	SYSCALL_ARCH
+};
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {
-	return IS_ENABLED(CONFIG_ISA_ARCOMPACT)
-		? (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)
-			? AUDIT_ARCH_ARCOMPACTBE : AUDIT_ARCH_ARCOMPACT)
-		: (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)
-			? AUDIT_ARCH_ARCV2BE : AUDIT_ARCH_ARCV2);
+	return SYSCALL_ARCH;
 }
 
 #endif
diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
index fd02761ba06c..33ade26e3956 100644
--- a/arch/arm/include/asm/syscall.h
+++ b/arch/arm/include/asm/syscall.h
@@ -73,6 +73,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	memcpy(&regs->ARM_r0 + 1, args, 5 * sizeof(args[0]));
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_ARM
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	/* ARM tasks don't change audit architectures on the fly. */
diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index cfc0672013f6..77f3d300e7a0 100644
--- a/arch/arm64/include/asm/syscall.h
+++ b/arch/arm64/include/asm/syscall.h
@@ -82,6 +82,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	memcpy(&regs->regs[1], args, 5 * sizeof(args[0]));
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_ARM, AUDIT_ARCH_AARCH64
+};
+
 /*
  * We don't care about endianness (__AUDIT_ARCH_LE bit) here because
  * AArch64 has the same system calls both on little- and big- endian.
diff --git a/arch/c6x/include/asm/syscall.h b/arch/c6x/include/asm/syscall.h
index 38f3e2284ecd..0d78c67ee1fc 100644
--- a/arch/c6x/include/asm/syscall.h
+++ b/arch/c6x/include/asm/syscall.h
@@ -66,10 +66,19 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	regs->a9 = *args;
 }
 
+#ifdef CONFIG_CPU_BIG_ENDIAN
+#define SYSCALL_ARCH AUDIT_ARCH_C6XBE
+#else
+#define SYSCALL_ARCH AUDIT_ARCH_C6X
+#endif
+
+static __maybe_unused const int syscall_arches[] = {
+	SYSCALL_ARCH
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
-	return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)
-		? AUDIT_ARCH_C6XBE : AUDIT_ARCH_C6X;
+	return SYSCALL_ARCH;
 }
 
 #endif /* __ASM_C6X_SYSCALLS_H */
diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h
index f624fa3bbc22..86242d2850d7 100644
--- a/arch/csky/include/asm/syscall.h
+++ b/arch/csky/include/asm/syscall.h
@@ -68,6 +68,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
 	memcpy(&regs->a1, args, 5 * sizeof(regs->a1));
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_CSKY
+};
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {
diff --git a/arch/h8300/include/asm/syscall.h b/arch/h8300/include/asm/syscall.h
index 01666b8bb263..775f6ac8fde3 100644
--- a/arch/h8300/include/asm/syscall.h
+++ b/arch/h8300/include/asm/syscall.h
@@ -28,6 +28,10 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
 	*args   = regs->er6;
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_H8300
+};
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {
diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h
index f6e454f18038..6ee21a76f6a3 100644
--- a/arch/hexagon/include/asm/syscall.h
+++ b/arch/hexagon/include/asm/syscall.h
@@ -45,6 +45,10 @@ static inline long syscall_get_return_value(struct task_struct *task,
 	return regs->r00;
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_HEXAGON
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_HEXAGON;
diff --git a/arch/ia64/include/asm/syscall.h b/arch/ia64/include/asm/syscall.h
index 6c6f16e409a8..19456125c89a 100644
--- a/arch/ia64/include/asm/syscall.h
+++ b/arch/ia64/include/asm/syscall.h
@@ -71,6 +71,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	ia64_syscall_get_set_arguments(task, regs, args, 1);
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_IA64
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_IA64;
diff --git a/arch/m68k/include/asm/syscall.h b/arch/m68k/include/asm/syscall.h
index 465ac039be09..031b051f9026 100644
--- a/arch/m68k/include/asm/syscall.h
+++ b/arch/m68k/include/asm/syscall.h
@@ -4,6 +4,10 @@
 
 #include <uapi/linux/audit.h>
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_M68K
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_M68K;
diff --git a/arch/microblaze/include/asm/syscall.h b/arch/microblaze/include/asm/syscall.h
index 3a6924f3cbde..28cde14056d1 100644
--- a/arch/microblaze/include/asm/syscall.h
+++ b/arch/microblaze/include/asm/syscall.h
@@ -105,6 +105,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
 asmlinkage unsigned long do_syscall_trace_enter(struct pt_regs *regs);
 asmlinkage void do_syscall_trace_leave(struct pt_regs *regs);
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_MICROBLAZE
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_MICROBLAZE;
diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h
index 25fa651c937d..29e4c1c47c54 100644
--- a/arch/mips/include/asm/syscall.h
+++ b/arch/mips/include/asm/syscall.h
@@ -140,6 +140,22 @@ extern const unsigned long sys_call_table[];
 extern const unsigned long sys32_call_table[];
 extern const unsigned long sysn32_call_table[];
 
+static __maybe_unused const int syscall_arches[] = {
+#ifdef __LITTLE_ENDIAN
+	AUDIT_ARCH_MIPSEL,
+# ifdef CONFIG_64BIT
+	AUDIT_ARCH_MIPSEL64,
+	AUDIT_ARCH_MIPSEL64N32,
+# endif /* CONFIG_64BIT */
+#else
+	AUDIT_ARCH_MIPS,
+# ifdef CONFIG_64BIT
+	AUDIT_ARCH_MIPS64,
+	AUDIT_ARCH_MIPS64N32,
+# endif /* CONFIG_64BIT */
+#endif /* __LITTLE_ENDIAN */
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	int arch = AUDIT_ARCH_MIPS;
diff --git a/arch/nds32/include/asm/syscall.h b/arch/nds32/include/asm/syscall.h
index 7b5180d78e20..2dd5e33bcfcb 100644
--- a/arch/nds32/include/asm/syscall.h
+++ b/arch/nds32/include/asm/syscall.h
@@ -154,11 +154,20 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
 	memcpy(&regs->uregs[0] + 1, args, 5 * sizeof(args[0]));
 }
 
+#ifdef CONFIG_CPU_BIG_ENDIAN
+#define SYSCALL_ARCH AUDIT_ARCH_NDS32BE
+#else
+#define SYSCALL_ARCH AUDIT_ARCH_NDS32
+#endif
+
+static __maybe_unused const int syscall_arches[] = {
+	SYSCALL_ARCH
+};
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {
-	return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)
-		? AUDIT_ARCH_NDS32BE : AUDIT_ARCH_NDS32;
+	return SYSCALL_ARCH;
 }
 
 #endif /* _ASM_NDS32_SYSCALL_H */
diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h
index 526449edd768..8fa2716cac5a 100644
--- a/arch/nios2/include/asm/syscall.h
+++ b/arch/nios2/include/asm/syscall.h
@@ -69,6 +69,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	regs->r9 = *args;
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_NIOS2
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_NIOS2;
diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h
index e6383be2a195..4eb28ad08042 100644
--- a/arch/openrisc/include/asm/syscall.h
+++ b/arch/openrisc/include/asm/syscall.h
@@ -64,6 +64,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
 	memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_OPENRISC
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_OPENRISC;
diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h
index 00b127a5e09b..2915f140c9fd 100644
--- a/arch/parisc/include/asm/syscall.h
+++ b/arch/parisc/include/asm/syscall.h
@@ -55,6 +55,13 @@ static inline void syscall_rollback(struct task_struct *task,
 	/* do nothing */
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_PARISC,
+#ifdef CONFIG_64BIT
+	AUDIT_ARCH_PARISC64,
+#endif
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	int arch = AUDIT_ARCH_PARISC;
diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
index fd1b518eed17..781deb211e3d 100644
--- a/arch/powerpc/include/asm/syscall.h
+++ b/arch/powerpc/include/asm/syscall.h
@@ -104,6 +104,20 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	regs->orig_gpr3 = args[0];
 }
 
+static __maybe_unused const int syscall_arches[] = {
+#ifdef __LITTLE_ENDIAN__
+	AUDIT_ARCH_PPC | __AUDIT_ARCH_LE,
+# ifdef CONFIG_PPC64
+	AUDIT_ARCH_PPC64LE,
+# endif /* CONFIG_PPC64 */
+#else
+	AUDIT_ARCH_PPC,
+# ifdef CONFIG_PPC64
+	AUDIT_ARCH_PPC64,
+# endif /* CONFIG_PPC64 */
+#endif /* __LITTLE_ENDIAN__ */
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	int arch;
diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h
index 49350c8bd7b0..4b36d358243e 100644
--- a/arch/riscv/include/asm/syscall.h
+++ b/arch/riscv/include/asm/syscall.h
@@ -73,13 +73,19 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	memcpy(&regs->a1, args, 5 * sizeof(regs->a1));
 }
 
-static inline int syscall_get_arch(struct task_struct *task)
-{
 #ifdef CONFIG_64BIT
-	return AUDIT_ARCH_RISCV64;
+#define SYSCALL_ARCH AUDIT_ARCH_RISCV64
 #else
-	return AUDIT_ARCH_RISCV32;
+#define SYSCALL_ARCH AUDIT_ARCH_RISCV32
 #endif
+
+static __maybe_unused const int syscall_arches[] = {
+	SYSCALL_ARCH
+};
+
+static inline int syscall_get_arch(struct task_struct *task)
+{
+	return SYSCALL_ARCH;
 }
 
 #endif	/* _ASM_RISCV_SYSCALL_H */
diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h
index d9d5de0f67ff..4cb9da36610a 100644
--- a/arch/s390/include/asm/syscall.h
+++ b/arch/s390/include/asm/syscall.h
@@ -89,6 +89,13 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	regs->orig_gpr2 = args[0];
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_S390X,
+#ifdef CONFIG_COMPAT
+	AUDIT_ARCH_S390,
+#endif
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 #ifdef CONFIG_COMPAT
diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h
index cb51a7528384..4780f2339c72 100644
--- a/arch/sh/include/asm/syscall_32.h
+++ b/arch/sh/include/asm/syscall_32.h
@@ -69,13 +69,18 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	regs->regs[4] = args[0];
 }
 
-static inline int syscall_get_arch(struct task_struct *task)
-{
-	int arch = AUDIT_ARCH_SH;
-
 #ifdef CONFIG_CPU_LITTLE_ENDIAN
-	arch |= __AUDIT_ARCH_LE;
+#define SYSCALL_ARCH AUDIT_ARCH_SHEL
+#else
+#define SYSCALL_ARCH AUDIT_ARCH_SH
 #endif
-	return arch;
+
+static __maybe_unused const int syscall_arches[] = {
+	SYSCALL_ARCH
+};
+
+static inline int syscall_get_arch(struct task_struct *task)
+{
+	return SYSCALL_ARCH;
 }
 #endif /* __ASM_SH_SYSCALL_32_H */
diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h
index 62a5a78804c4..a458992cdcfe 100644
--- a/arch/sparc/include/asm/syscall.h
+++ b/arch/sparc/include/asm/syscall.h
@@ -127,6 +127,15 @@ static inline void syscall_set_arguments(struct task_struct *task,
 		regs->u_regs[UREG_I0 + i] = args[i];
 }
 
+static __maybe_unused const int syscall_arches[] = {
+#ifdef CONFIG_SPARC64
+	AUDIT_ARCH_SPARC64,
+#endif
+#if !defined(CONFIG_SPARC64) || defined(CONFIG_COMPAT)
+	AUDIT_ARCH_SPARC,
+#endif
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 #if defined(CONFIG_SPARC64) && defined(CONFIG_COMPAT)
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 7cbf733d11af..e13bb2a65b6f 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	memcpy(&regs->bx + i, args, n * sizeof(args[0]));
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_I386
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_I386;
@@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	}
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_X86_64,
+#ifdef CONFIG_IA32_EMULATION
+	AUDIT_ARCH_I386,
+#endif
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	/* x32 tasks should be considered AUDIT_ARCH_X86_64. */
diff --git a/arch/x86/um/asm/syscall.h b/arch/x86/um/asm/syscall.h
index 56a2f0913e3c..590a31e22b99 100644
--- a/arch/x86/um/asm/syscall.h
+++ b/arch/x86/um/asm/syscall.h
@@ -9,13 +9,19 @@ typedef asmlinkage long (*sys_call_ptr_t)(unsigned long, unsigned long,
 					  unsigned long, unsigned long,
 					  unsigned long, unsigned long);
 
-static inline int syscall_get_arch(struct task_struct *task)
-{
 #ifdef CONFIG_X86_32
-	return AUDIT_ARCH_I386;
+#define SYSCALL_ARCH AUDIT_ARCH_I386
 #else
-	return AUDIT_ARCH_X86_64;
+#define SYSCALL_ARCH AUDIT_ARCH_X86_64
 #endif
+
+static __maybe_unused const int syscall_arches[] = {
+	SYSCALL_ARCH
+};
+
+static inline int syscall_get_arch(struct task_struct *task)
+{
+	return SYSCALL_ARCH;
 }
 
 #endif /* __UM_ASM_SYSCALL_H */
diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h
index f9a671cbf933..3d334fb0d329 100644
--- a/arch/xtensa/include/asm/syscall.h
+++ b/arch/xtensa/include/asm/syscall.h
@@ -14,6 +14,10 @@
 #include <asm/ptrace.h>
 #include <uapi/linux/audit.h>
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_XTENSA
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_XTENSA;
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu
  2020-09-24 12:06   ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
  2020-09-24 12:06   ` [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu
@ 2020-09-24 12:06   ` YiFei Zhu
  2020-09-24 12:06   ` [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

SECCOMP_CACHE_NR_ONLY will only cache results for syscalls whose
filter outcome does not depend on the syscall arguments or
instruction pointer. To facilitate this, we need a static analyser
that can tell whether a filter will return allow regardless of
syscall arguments for a given architecture number / syscall number
pair. This is implemented here with a pseudo-emulator, and the
result is stored in a per-filter bitmap.

Each common BPF instruction (stolen from Kees's list [1]) is
emulated. Any weirdness, or a load from a syscall argument, causes
the emulator to bail.

The emulation is also halted if it reaches a return. In that case,
if it returns SECCOMP_RET_ALLOW, the syscall is marked as good.

Filter dependency is resolved at attach time. If a filter depends
on earlier filters, we AND its bitmap with its dependee's; if the
dependee does not guarantee to allow the syscall, the depender is
also marked as not guaranteed to allow the syscall.

[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig     |  25 ++++++
 kernel/seccomp.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 220 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 6dfc5673215d..8cc3dc87f253 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -489,6 +489,31 @@ config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+choice
+	prompt "Seccomp filter cache"
+	default SECCOMP_CACHE_NONE
+	depends on SECCOMP_FILTER
+	help
+	  Seccomp filters can potentially incur large overhead for each
+	  system call. Caching filter results can alleviate some of it.
+
+	  If in doubt, select 'syscall numbers only'.
+
+config SECCOMP_CACHE_NONE
+	bool "None"
+	help
+	  No caching is done. Seccomp filters will be called each time
+	  a system call occurs in a seccomp-guarded task.
+
+config SECCOMP_CACHE_NR_ONLY
+	bool "Syscall number only"
+	depends on !HAVE_SPARSE_SYSCALL_NR
+	help
+	  For each syscall number, if the seccomp filter has a fixed
+	  result, store that result in a bitmap to speed up system calls.
+
+endchoice
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 3ee59ce0a323..7c286d66f983 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,6 +143,32 @@ struct notification {
 	struct list_head notifications;
 };
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * struct seccomp_cache_filter_data - container for cache's per-filter data
+ *
+ * @syscall_ok: A bitmap for each architecture number, where each bit
+ *		represents whether the filter will always allow the syscall.
+ */
+struct seccomp_cache_filter_data {
+	DECLARE_BITMAP(syscall_ok[ARRAY_SIZE(syscall_arches)], NR_syscalls);
+};
+
+#define SECCOMP_EMU_MAX_PENDING_STATES 64
+#else
+struct seccomp_cache_filter_data { };
+
+static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+	return 0;
+}
+
+static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter,
+					 const struct seccomp_filter *prev)
+{
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -185,6 +211,7 @@ struct seccomp_filter {
 	struct notification *notif;
 	struct mutex notify_lock;
 	wait_queue_head_t wqh;
+	struct seccomp_cache_filter_data cache;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -530,6 +557,139 @@ static inline void seccomp_sync_threads(unsigned long flags)
 	}
 }
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * struct seccomp_emu_env - container for seccomp emulator environment
+ *
+ * @filter: The cBPF filter instructions.
+ * @nr: The syscall number we are emulating.
+ * @arch: The architecture number we are emulating.
+ * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the
+ *		syscall.
+ */
+struct seccomp_emu_env {
+	struct sock_filter *filter;
+	int arch;
+	int nr;
+	bool syscall_ok;
+};
+
+/**
+ * struct seccomp_emu_state - container for seccomp emulator state
+ *
+ * @next: The next pending state. This structure is a linked list.
+ * @pc: The current program counter.
+ * @areg: The value of the A register.
+ */
+struct seccomp_emu_state {
+	struct seccomp_emu_state *next;
+	int pc;
+	u32 areg;
+};
+
+/**
+ * seccomp_emu_step - step one instruction in the emulator
+ * @env: The emulator environment
+ * @state: The emulator state
+ *
+ * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred.
+ */
+static int seccomp_emu_step(struct seccomp_emu_env *env,
+			    struct seccomp_emu_state *state)
+{
+	struct sock_filter *ftest = &env->filter[state->pc++];
+	u16 code = ftest->code;
+	u32 k = ftest->k;
+	bool compare;
+
+	switch (code) {
+	case BPF_LD | BPF_W | BPF_ABS:
+		if (k == offsetof(struct seccomp_data, nr))
+			state->areg = env->nr;
+		else if (k == offsetof(struct seccomp_data, arch))
+			state->areg = env->arch;
+		else
+			return 1;
+
+		return 0;
+	case BPF_JMP | BPF_JA:
+		state->pc += k;
+		return 0;
+	case BPF_JMP | BPF_JEQ | BPF_K:
+	case BPF_JMP | BPF_JGE | BPF_K:
+	case BPF_JMP | BPF_JGT | BPF_K:
+	case BPF_JMP | BPF_JSET | BPF_K:
+		switch (BPF_OP(code)) {
+		case BPF_JEQ:
+			compare = state->areg == k;
+			break;
+		case BPF_JGT:
+			compare = state->areg > k;
+			break;
+		case BPF_JGE:
+			compare = state->areg >= k;
+			break;
+		case BPF_JSET:
+			compare = state->areg & k;
+			break;
+		default:
+			WARN_ON(true);
+			return -EINVAL;
+		}
+
+		state->pc += compare ? ftest->jt : ftest->jf;
+		return 0;
+	case BPF_ALU | BPF_AND | BPF_K:
+		state->areg &= k;
+		return 0;
+	case BPF_RET | BPF_K:
+		env->syscall_ok = k == SECCOMP_RET_ALLOW;
+		return 1;
+	default:
+		return 1;
+	}
+}
+
+/**
+ * seccomp_cache_prepare - emulate the filter to find cachable syscalls
+ * @sfilter: The seccomp filter
+ *
+ * Returns 0 if successful or -errno if error occurred.
+ */
+static int seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+	struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
+	struct sock_filter *filter = fprog->filter;
+	int arch, nr, res = 0;
+
+	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
+		for (nr = 0; nr < NR_syscalls; nr++) {
+			struct seccomp_emu_env env = {0};
+			struct seccomp_emu_state state = {0};
+
+			env.filter = filter;
+			env.arch = syscall_arches[arch];
+			env.nr = nr;
+
+			while (true) {
+				res = seccomp_emu_step(&env, &state);
+				if (res)
+					break;
+			}
+
+			if (res < 0)
+				goto out;
+
+			if (env.syscall_ok)
+				set_bit(nr, sfilter->cache.syscall_ok[arch]);
+		}
+	}
+
+out:
+	return res;
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_prepare_filter: Prepares a seccomp filter for use.
  * @fprog: BPF program to install
@@ -540,7 +700,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 {
 	struct seccomp_filter *sfilter;
 	int ret;
-	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
+	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
+			       IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY);
 
 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
 		return ERR_PTR(-EINVAL);
@@ -571,6 +732,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 		return ERR_PTR(ret);
 	}
 
+	ret = seccomp_cache_prepare(sfilter);
+	if (ret < 0) {
+		bpf_prog_destroy(sfilter->prog);
+		kfree(sfilter);
+		return ERR_PTR(ret);
+	}
+
 	refcount_set(&sfilter->refs, 1);
 	refcount_set(&sfilter->users, 1);
 	init_waitqueue_head(&sfilter->wqh);
@@ -606,6 +774,31 @@ seccomp_prepare_user_filter(const char __user *user_filter)
 	return filter;
 }
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * seccomp_cache_inherit - inherit the cache from the previous filter
+ * @sfilter: The seccomp filter being attached
+ * @prev: The previous filter that @sfilter depends on
+ *
+ * ANDs @sfilter's bitmaps with @prev's, so that @sfilter never claims
+ * to allow a syscall that @prev might not allow.
+ */
+static void seccomp_cache_inherit(struct seccomp_filter *sfilter,
+				  const struct seccomp_filter *prev)
+{
+	int arch;
+
+	if (!prev)
+		return;
+
+	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
+		bitmap_and(sfilter->cache.syscall_ok[arch],
+			   sfilter->cache.syscall_ok[arch],
+			   prev->cache.syscall_ok[arch],
+			   NR_syscalls);
+	}
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_attach_filter: validate and attach filter
  * @flags:  flags to change filter behavior
@@ -655,6 +848,7 @@ static long seccomp_attach_filter(unsigned int flags,
 	 * task reference.
 	 */
 	filter->prev = current->seccomp.filter;
+	seccomp_cache_inherit(filter, filter->prev);
 	current->seccomp.filter = filter;
 	atomic_inc(&current->seccomp.filter_count);
 
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path
  2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu
                     ` (2 preceding siblings ...)
  2020-09-24 12:06   ` [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
@ 2020-09-24 12:06   ` YiFei Zhu
  2020-09-24 12:06   ` [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.

This first finds the matching allow bitmap by iterating through the
syscall_arches[] array and comparing each entry to the arch in struct
seccomp_data; this loop is expected to be unrolled. It then
does a test_bit against the bitmask. If the bit is set, then
there is no need to run the full filter; it returns
SECCOMP_RET_ALLOW immediately.

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 7c286d66f983..5b1bd8329e9c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -167,6 +167,12 @@ static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter,
 					 const struct seccomp_filter *prev)
 {
 }
+
+static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter,
+				       const struct seccomp_data *sd)
+{
+	return false;
+}
 #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
 
 /**
@@ -321,6 +327,34 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 	return 0;
 }
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * seccomp_cache_check - lookup seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to lookup the cache with
+ *
+ * Returns true if the seccomp_data is cached and allowed.
+ */
+static bool seccomp_cache_check(const struct seccomp_filter *sfilter,
+				const struct seccomp_data *sd)
+{
+	int syscall_nr = sd->nr;
+	int arch;
+
+	if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
+		return false;
+
+	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
+		if (likely(syscall_arches[arch] == sd->arch))
+			return test_bit(syscall_nr,
+					sfilter->cache.syscall_ok[arch]);
+	}
+
+	WARN_ON_ONCE(true);
+	return false;
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_run_filters - evaluates all seccomp filters against @sd
  * @sd: optional seccomp data to be passed to filters
@@ -343,6 +377,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 	if (WARN_ON(f == NULL))
 		return SECCOMP_RET_KILL_PROCESS;
 
+	if (seccomp_cache_check(f, sd))
+		return SECCOMP_RET_ALLOW;
+
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead
  2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu
                     ` (3 preceding siblings ...)
  2020-09-24 12:06   ` [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
@ 2020-09-24 12:06   ` YiFei Zhu
  2020-09-24 12:06   ` [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
  2020-09-24 12:44   ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
  6 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: Kees Cook <keescook@chromium.org>

As part of the seccomp benchmarking, include the expectations with
regard to the timing behavior of the constant action bitmaps, and report
inconsistencies better.

Example output with constant action bitmaps on x86:

$ sudo ./seccomp_benchmark 100000000
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 100000000 syscalls...
63.896255358 - 0.008504529 = 63887750829 (63.9s)
getpid native: 638 ns
130.383312423 - 63.897315189 = 66485997234 (66.5s)
getpid RET_ALLOW 1 filter (bitmap): 664 ns
196.789080421 - 130.384414983 = 66404665438 (66.4s)
getpid RET_ALLOW 2 filters (bitmap): 664 ns
268.844643304 - 196.790234168 = 72054409136 (72.1s)
getpid RET_ALLOW 3 filters (full): 720 ns
342.627472515 - 268.845799103 = 73781673412 (73.8s)
getpid RET_ALLOW 4 filters (full): 737 ns
Estimated total seccomp overhead for 1 bitmapped filter: 26 ns
Estimated total seccomp overhead for 2 bitmapped filters: 26 ns
Estimated total seccomp overhead for 3 full filters: 82 ns
Estimated total seccomp overhead for 4 full filters: 99 ns
Estimated seccomp entry overhead: 26 ns
Estimated seccomp per-filter overhead (last 2 diff): 17 ns
Estimated seccomp per-filter overhead (filters / 4): 18 ns
Expectations:
	native ≤ 1 bitmap (638 ≤ 664): ✔️
	native ≤ 1 filter (638 ≤ 720): ✔️
	per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️
	1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️
	entry ≈ 1 bitmapped (26 ≈ 26): ✔️
	entry ≈ 2 bitmapped (26 ≈ 26): ✔️
	native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++++++++++---
 tools/testing/selftests/seccomp/settings      |   2 +-
 2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 91f5a89cadac..fcc806585266 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,12 +4,16 @@
  */
 #define _GNU_SOURCE
 #include <assert.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <unistd.h>
 #include <linux/filter.h>
 #include <linux/seccomp.h>
+#include <sys/param.h>
 #include <sys/prctl.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
@@ -70,18 +74,74 @@ unsigned long long calibrate(void)
 	return samples * seconds;
 }
 
+bool approx(int i_one, int i_two)
+{
+	double one = i_one, one_bump = one * 0.01;
+	double two = i_two, two_bump = two * 0.01;
+
+	one_bump = one + MAX(one_bump, 2.0);
+	two_bump = two + MAX(two_bump, 2.0);
+
+	/* Equal to, or within 1% or 2 digits */
+	if (one == two ||
+	    (one > two && one <= two_bump) ||
+	    (two > one && two <= one_bump))
+		return true;
+	return false;
+}
+
+bool le(int i_one, int i_two)
+{
+	if (i_one <= i_two)
+		return true;
+	return false;
+}
+
+long compare(const char *name_one, const char *name_eval, const char *name_two,
+	     unsigned long long one, bool (*eval)(int, int), unsigned long long two)
+{
+	bool good;
+
+	printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two,
+	       (long long)one, name_eval, (long long)two);
+	if (one > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)one);
+		return 1;
+	}
+	if (two > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)two);
+		return 1;
+	}
+
+	good = eval(one, two);
+	printf("%s\n", good ? "✔️" : "❌");
+
+	return good ? 0 : 1;
+}
+
 int main(int argc, char *argv[])
 {
+	struct sock_filter bitmap_filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog bitmap_prog = {
+		.len = (unsigned short)ARRAY_SIZE(bitmap_filter),
+		.filter = bitmap_filter,
+	};
 	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])),
 		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
 	};
 	struct sock_fprog prog = {
 		.len = (unsigned short)ARRAY_SIZE(filter),
 		.filter = filter,
 	};
-	long ret;
-	unsigned long long samples;
-	unsigned long long native, filter1, filter2;
+
+	long ret, bits;
+	unsigned long long samples, calc;
+	unsigned long long native, filter1, filter2, bitmap1, bitmap2;
+	unsigned long long entry, per_filter1, per_filter2;
 
 	printf("Current BPF sysctl settings:\n");
 	system("sysctl net.core.bpf_jit_enable");
@@ -101,35 +161,82 @@ int main(int argc, char *argv[])
 	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
 	assert(ret == 0);
 
-	/* One filter */
-	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
+	/* One filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
 	assert(ret == 0);
 
-	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1);
+	bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1);
+
+	/* Second filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	if (filter1 == native)
-		printf("No overhead measured!? Try running again with more samples.\n");
+	bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2);
 
-	/* Two filters */
+	/* Third filter, can no longer be converted to bitmap */
 	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
 	assert(ret == 0);
 
-	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2);
-
-	/* Calculations */
-	printf("Estimated total seccomp overhead for 1 filter: %llu ns\n",
-		filter1 - native);
+	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1);
 
-	printf("Estimated total seccomp overhead for 2 filters: %llu ns\n",
-		filter2 - native);
+	/* Fourth filter, cannot be converted to bitmap because of filter 3 */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	printf("Estimated seccomp per-filter overhead: %llu ns\n",
-		filter2 - filter1);
+	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2);
+
+	/* Estimations */
+#define ESTIMATE(fmt, var, what)	do {			\
+		var = (what);					\
+		printf("Estimated " fmt ": %llu ns\n", var);	\
+		if (var > INT_MAX)				\
+			goto more_samples;			\
+	} while (0)
+
+	ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc,
+		 bitmap1 - native);
+	ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc,
+		 bitmap2 - native);
+	ESTIMATE("total seccomp overhead for 3 full filters", calc,
+		 filter1 - native);
+	ESTIMATE("total seccomp overhead for 4 full filters", calc,
+		 filter2 - native);
+	ESTIMATE("seccomp entry overhead", entry,
+		 bitmap1 - native - (bitmap2 - bitmap1));
+	ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1,
+		 filter2 - filter1);
+	ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2,
+		 (filter2 - native - entry) / 4);
+
+	printf("Expectations:\n");
+	ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1);
+	bits = compare("native", "≤", "1 filter", native, le, filter1);
+	if (bits)
+		goto more_samples;
+
+	ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)",
+			per_filter1, approx, per_filter2);
+
+	bits = compare("1 bitmapped", "≈", "2 bitmapped",
+			bitmap1 - native, approx, bitmap2 - native);
+	if (bits) {
+		printf("Skipping constant action bitmap expectations: they appear unsupported.\n");
+		goto out;
+	}
 
-	printf("Estimated seccomp entry overhead: %llu ns\n",
-		filter1 - native - (filter2 - filter1));
+	ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native);
+	ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native);
+	ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total",
+			entry + (per_filter1 * 4) + native, approx, filter2);
+	if (ret == 0)
+		goto out;
 
+more_samples:
+	printf("Saw unexpected benchmark result. Try running again with more samples?\n");
+out:
 	return 0;
 }
diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings
index ba4d85f74cd6..6091b45d226b 100644
--- a/tools/testing/selftests/seccomp/settings
+++ b/tools/testing/selftests/seccomp/settings
@@ -1 +1 @@
-timeout=90
+timeout=120
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu
                     ` (4 preceding siblings ...)
  2020-09-24 12:06   ` [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
@ 2020-09-24 12:06   ` YiFei Zhu
  2020-09-24 12:44   ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
  6 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:06 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Currently the kernel does not provide an infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through FTRACE_SYSCALL
infrastructure but it does not provide support for compat syscalls.

This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<hex arch number> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and FILTER means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:
c000003e 0 ALLOW
c000003e 1 ALLOW
c000003e 2 ALLOW
c000003e 3 ALLOW
[...]
c000003e 132 ALLOW
c000003e 133 ALLOW
c000003e 134 FILTER
c000003e 135 FILTER
c000003e 136 FILTER
c000003e 137 ALLOW
c000003e 138 ALLOW
c000003e 139 FILTER
c000003e 140 ALLOW
c000003e 141 ALLOW
[...]

This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable.

I'm not sure if adding all the "human readable names" is worthwhile,
considering it can be easily done in userspace.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig            | 10 ++++++++++
 fs/proc/base.c          |  7 +++++--
 include/linux/seccomp.h |  5 +++++
 kernel/seccomp.c        | 26 ++++++++++++++++++++++++++
 4 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc3dc87f253..dbfd897e5dc0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -514,6 +514,16 @@ config SECCOMP_CACHE_NR_ONLY
 
 endchoice
 
+config PROC_SECCOMP_CACHE
+	bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
+	depends on SECCOMP_CACHE_NR_ONLY
+	depends on PROC_FS
+	help
+	  This enables the /proc/pid/seccomp_cache interface to monitor
+	  seccomp cache data. The file format is subject to change.
+
+	  If unsure, say N.
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617db4e0faa0..2af626f69fa1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2615,7 +2615,7 @@ static struct dentry *proc_pident_instantiate(struct dentry *dentry,
 	return d_splice_alias(inode, dentry);
 }
 
-static struct dentry *proc_pident_lookup(struct inode *dir, 
+static struct dentry *proc_pident_lookup(struct inode *dir,
 					 struct dentry *dentry,
 					 const struct pid_entry *p,
 					 const struct pid_entry *end)
@@ -2815,7 +2815,7 @@ static const struct pid_entry attr_dir_stuff[] = {
 
 static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx)
 {
-	return proc_pident_readdir(file, ctx, 
+	return proc_pident_readdir(file, ctx,
 				   attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff));
 }
 
@@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
 #endif
+#ifdef CONFIG_PROC_SECCOMP_CACHE
+	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..3cedec824365 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task,
 	return -EINVAL;
 }
 #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_PROC_SECCOMP_CACHE
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+			   struct pid *pid, struct task_struct *task);
+#endif
 #endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 5b1bd8329e9c..c5697d9483ae 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2295,3 +2295,29 @@ static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_PROC_SECCOMP_CACHE
+/* Currently CONFIG_PROC_SECCOMP_CACHE implies CONFIG_SECCOMP_CACHE_NR_ONLY */
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+			   struct pid *pid, struct task_struct *task)
+{
+	struct seccomp_filter *f = READ_ONCE(task->seccomp.filter);
+	int arch, nr;
+
+	if (!f)
+		return 0;
+
+	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
+		for (nr = 0; nr < NR_syscalls; nr++) {
+			bool cached = test_bit(nr, f->cache.syscall_ok[arch]);
+			char *status = cached ? "ALLOW" : "FILTER";
+
+			seq_printf(m, "%08x %d %s\n", syscall_arches[arch],
+				   nr, status
+			);
+		}
+	}
+
+	return 0;
+}
+#endif /* CONFIG_PROC_SECCOMP_CACHE */
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls
  2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu
                     ` (5 preceding siblings ...)
  2020-09-24 12:06   ` [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
@ 2020-09-24 12:44   ` YiFei Zhu
  2020-09-24 12:44     ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
                       ` (5 more replies)
  6 siblings, 6 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/

Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.
* Architectures supported by default through arch number array,
  except for MIPS with its sparse syscall numbers.
* Configurable per-build for future different cache modes.

This series adds a bitmap to cache seccomp filter results if the
result permits a syscall and is independent of syscall arguments.
This visibly decreases seccomp overhead for most common seccomp
filters with very little memory footprint.

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters, which further increases the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.

We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
number, and as past discussions concluded, a bitmap where each bit
represents a syscall makes the most sense for these filters.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses to struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independently of the syscall arguments.

When it is concluded that an allow must occur for the given
architecture and syscall pair, seccomp will immediately allow
the syscall, bypassing further BPF execution.

Ongoing work is to further support arguments with fast hash table
lookups. We are investigating the performance of doing so [6], and how
to best integrate with the existing seccomp infrastructure.

Some benchmarks are performed with results in patch 5, copied below:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Benchmarking 100000000 syscalls...
  63.896255358 - 0.008504529 = 63887750829 (63.9s)
  getpid native: 638 ns
  130.383312423 - 63.897315189 = 66485997234 (66.5s)
  getpid RET_ALLOW 1 filter (bitmap): 664 ns
  196.789080421 - 130.384414983 = 66404665438 (66.4s)
  getpid RET_ALLOW 2 filters (bitmap): 664 ns
  268.844643304 - 196.790234168 = 72054409136 (72.1s)
  getpid RET_ALLOW 3 filters (full): 720 ns
  342.627472515 - 268.845799103 = 73781673412 (73.8s)
  getpid RET_ALLOW 4 filters (full): 737 ns
  Estimated total seccomp overhead for 1 bitmapped filter: 26 ns
  Estimated total seccomp overhead for 2 bitmapped filters: 26 ns
  Estimated total seccomp overhead for 3 full filters: 82 ns
  Estimated total seccomp overhead for 4 full filters: 99 ns
  Estimated seccomp entry overhead: 26 ns
  Estimated seccomp per-filter overhead (last 2 diff): 17 ns
  Estimated seccomp per-filter overhead (filters / 4): 18 ns
  Expectations:
  	native ≤ 1 bitmap (638 ≤ 664): ✔️
  	native ≤ 1 filter (638 ≤ 720): ✔️
  	per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️
  	1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️
  	entry ≈ 1 bitmapped (26 ≈ 26): ✔️
  	entry ≈ 2 bitmapped (26 ≈ 26): ✔️
  	native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️

RFC -> v1:
* Config made on by default across all arches that could support it.
* Added arch numbers array and emulate filter for each arch number, and
  have a per-arch bitmap.
* Massively simplified the emulator so it would only support the common
  instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL
  during prepare).
* Stole the selftest from Kees.
* Added a /proc/pid/seccomp_cache by Jann's suggestion.

v1 -> v2:
* Corrected one outdated function documentation.

Patch 1 moves the SECCOMP Kconfig option to arch/Kconfig.

Patch 2 adds a syscall_arches array so the emulator can enumerate it.

Patch 3 implements the emulator that determines whether a filter must return allow.

Patch 4 implements the test_bit() check against the bitmaps.

Patch 5 updates the selftest to better show the new semantics.

Patch 6 implements /proc/pid/seccomp_cache.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020

Kees Cook (1):
  selftests/seccomp: Compare bitmap vs filter overhead

YiFei Zhu (5):
  seccomp: Move config option SECCOMP to arch/Kconfig
  asm/syscall.h: Add syscall_arches[] array
  seccomp/cache: Add "emulator" to check if filter is arg-dependent
  seccomp/cache: Lookup syscall allowlist for fast path
  seccomp/cache: Report cache data through /proc/pid/seccomp_cache

 arch/Kconfig                                  |  56 ++++
 arch/alpha/include/asm/syscall.h              |   4 +
 arch/arc/include/asm/syscall.h                |  24 +-
 arch/arm/Kconfig                              |  15 +-
 arch/arm/include/asm/syscall.h                |   4 +
 arch/arm64/Kconfig                            |  13 -
 arch/arm64/include/asm/syscall.h              |   4 +
 arch/c6x/include/asm/syscall.h                |  13 +-
 arch/csky/Kconfig                             |  13 -
 arch/csky/include/asm/syscall.h               |   4 +
 arch/h8300/include/asm/syscall.h              |   4 +
 arch/hexagon/include/asm/syscall.h            |   4 +
 arch/ia64/include/asm/syscall.h               |   4 +
 arch/m68k/include/asm/syscall.h               |   4 +
 arch/microblaze/Kconfig                       |  18 +-
 arch/microblaze/include/asm/syscall.h         |   4 +
 arch/mips/Kconfig                             |  17 --
 arch/mips/include/asm/syscall.h               |  16 ++
 arch/nds32/include/asm/syscall.h              |  13 +-
 arch/nios2/include/asm/syscall.h              |   4 +
 arch/openrisc/include/asm/syscall.h           |   4 +
 arch/parisc/Kconfig                           |  16 --
 arch/parisc/include/asm/syscall.h             |   7 +
 arch/powerpc/Kconfig                          |  17 --
 arch/powerpc/include/asm/syscall.h            |  14 +
 arch/riscv/Kconfig                            |  13 -
 arch/riscv/include/asm/syscall.h              |  14 +-
 arch/s390/Kconfig                             |  17 --
 arch/s390/include/asm/syscall.h               |   7 +
 arch/sh/Kconfig                               |  16 --
 arch/sh/include/asm/syscall_32.h              |  17 +-
 arch/sparc/Kconfig                            |  18 +-
 arch/sparc/include/asm/syscall.h              |   9 +
 arch/um/Kconfig                               |  16 --
 arch/x86/Kconfig                              |  16 --
 arch/x86/include/asm/syscall.h                |  11 +
 arch/x86/um/asm/syscall.h                     |  14 +-
 arch/xtensa/Kconfig                           |  14 -
 arch/xtensa/include/asm/syscall.h             |   4 +
 fs/proc/base.c                                |   7 +-
 include/linux/seccomp.h                       |   5 +
 kernel/seccomp.c                              | 257 +++++++++++++++++-
 .../selftests/seccomp/seccomp_benchmark.c     | 151 ++++++++--
 tools/testing/selftests/seccomp/settings      |   2 +-
 44 files changed, 639 insertions(+), 265 deletions(-)

--
2.28.0

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig
  2020-09-24 12:44   ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
@ 2020-09-24 12:44     ` YiFei Zhu
  2020-09-24 19:11       ` Kees Cook
  2020-09-24 12:44     ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu
                       ` (4 subsequent siblings)
  5 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

In order to make adding configurable features into seccomp
easier, it's better to have the options in a single location,
especially considering that the bulk of the seccomp code is
arch-independent. A quick look also shows that many SECCOMP
descriptions are outdated; they talk about /proc rather than
prctl.

As a result of moving the config option and keeping it default
on, the architectures arm, arm64, csky, riscv, sh, and xtensa,
which did not have SECCOMP on by default prior to this change,
will now have it default to on.

The architectures microblaze, mips, powerpc, s390, sh, and sparc
have an outdated dependency on PROC_FS; this dependency is
removed in this change.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig            | 21 +++++++++++++++++++++
 arch/arm/Kconfig        | 15 +--------------
 arch/arm64/Kconfig      | 13 -------------
 arch/csky/Kconfig       | 13 -------------
 arch/microblaze/Kconfig | 18 +-----------------
 arch/mips/Kconfig       | 17 -----------------
 arch/parisc/Kconfig     | 16 ----------------
 arch/powerpc/Kconfig    | 17 -----------------
 arch/riscv/Kconfig      | 13 -------------
 arch/s390/Kconfig       | 17 -----------------
 arch/sh/Kconfig         | 16 ----------------
 arch/sparc/Kconfig      | 18 +-----------------
 arch/um/Kconfig         | 16 ----------------
 arch/x86/Kconfig        | 16 ----------------
 arch/xtensa/Kconfig     | 14 --------------
 15 files changed, 24 insertions(+), 216 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..6dfc5673215d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -444,8 +444,12 @@ config ARCH_WANT_OLD_COMPAT_IPC
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
 	bool
 
+config HAVE_ARCH_SECCOMP
+	bool
+
 config HAVE_ARCH_SECCOMP_FILTER
 	bool
+	select HAVE_ARCH_SECCOMP
 	help
 	  An arch should select this symbol if it provides all of these things:
 	  - syscall_get_arch()
@@ -458,6 +462,23 @@ config HAVE_ARCH_SECCOMP_FILTER
 	    results in the system call being skipped immediately.
 	  - seccomp syscall wired up
 
+config SECCOMP
+	def_bool y
+	depends on HAVE_ARCH_SECCOMP
+	prompt "Enable seccomp to safely compute untrusted bytecode"
+	help
+	  This kernel feature is useful for number crunching applications
+	  that may need to compute untrusted bytecode during their
+	  execution. By using pipes or other transports made available to
+	  the process as file descriptors supporting the read/write
+	  syscalls, it's possible to isolate those applications in
+	  their own address space using seccomp. Once seccomp is
+	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
+	  and the task is only allowed to execute a few safe syscalls
+	  defined by each seccomp mode.
+
+	  If unsure, say Y. Only embedded should say N here.
+
 config SECCOMP_FILTER
 	def_bool y
 	depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e00d94b16658..e26c19a16284 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -67,6 +67,7 @@ config ARM
 	select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
+	select HAVE_ARCH_SECCOMP
 	select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
@@ -1617,20 +1618,6 @@ config UACCESS_WITH_MEMCPY
 	  However, if the CPU data cache is using a write-allocate mode,
 	  this option is unlikely to provide any performance gain.
 
-config SECCOMP
-	bool
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config PARAVIRT
 	bool "Enable paravirtualization code"
 	help
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..98c4e34cbec1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1033,19 +1033,6 @@ config ARCH_ENABLE_SPLIT_PMD_PTLOCK
 config CC_HAVE_SHADOW_CALL_STACK
 	def_bool $(cc-option, -fsanitize=shadow-call-stack -ffixed-x18)
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config PARAVIRT
 	bool "Enable paravirtualization code"
 	help
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 3d5afb5f5685..7f424c85772c 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -309,16 +309,3 @@ endmenu
 source "arch/csky/Kconfig.platforms"
 
 source "kernel/Kconfig.hz"
-
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index d262ac0c8714..37bd6a5f38fb 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -26,6 +26,7 @@ config MICROBLAZE
 	select GENERIC_SCHED_CLOCK
 	select HAVE_ARCH_HASH
 	select HAVE_ARCH_KGDB
+	select HAVE_ARCH_SECCOMP
 	select HAVE_DEBUG_KMEMLEAK
 	select HAVE_DMA_CONTIGUOUS
 	select HAVE_DYNAMIC_FTRACE
@@ -120,23 +121,6 @@ config CMDLINE_FORCE
 	  Set this to have arguments from the default kernel command string
 	  override those passed by the boot loader.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 endmenu
 
 menu "Kernel features"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index c95fa3a2484c..5f88a8fc11fc 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -3004,23 +3004,6 @@ config PHYSICAL_START
 	  specified in the "crashkernel=YM@XM" command line boot parameter
 	  passed to the panic-ed kernel).
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config MIPS_O32_FP64_SUPPORT
 	bool "Support for O32 binaries using 64-bit FP" if !CPU_MIPSR6
 	depends on 32BIT || MIPS32_O32
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 3b0f53dd70bc..cd4afe1e7a6c 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -378,19 +378,3 @@ endmenu
 
 
 source "drivers/parisc/Kconfig"
-
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..136fe860caef 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -934,23 +934,6 @@ config ARCH_WANTS_FREEZER_CONTROL
 
 source "kernel/power/Kconfig"
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config PPC_MEM_KEYS
 	prompt "PowerPC Memory Protection Keys"
 	def_bool y
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index df18372861d8..c456b558fab9 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -333,19 +333,6 @@ menu "Kernel features"
 
 source "kernel/Kconfig.hz"
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config RISCV_SBI_V01
 	bool "SBI v0.1 support"
 	default y
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 3d86e12e8e3c..7f7b40ec699e 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -791,23 +791,6 @@ config CRASH_DUMP
 
 endmenu
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y.
-
 config CCW
 	def_bool y
 
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index d20927128fce..18278152c91c 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -600,22 +600,6 @@ config PHYSICAL_START
 	  where the fail safe kernel needs to run at a different address
 	  than the panic-ed kernel.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on PROC_FS
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl, it cannot be disabled and the task is only
-	  allowed to execute a few safe syscalls defined by each seccomp
-	  mode.
-
-	  If unsure, say N.
-
 config SMP
 	bool "Symmetric multi-processing support"
 	depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index efeff2c896a5..d62ce83cf009 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -23,6 +23,7 @@ config SPARC
 	select HAVE_OPROFILE
 	select HAVE_ARCH_KGDB if !SMP || SPARC64
 	select HAVE_ARCH_TRACEHOOK
+	select HAVE_ARCH_SECCOMP if SPARC64
 	select HAVE_EXIT_THREAD
 	select HAVE_PCI
 	select SYSCTL_EXCEPTION_TRACE
@@ -226,23 +227,6 @@ config EARLYFB
 	help
 	  Say Y here to enable a faster early framebuffer boot console.
 
-config SECCOMP
-	bool "Enable seccomp to safely compute untrusted bytecode"
-	depends on SPARC64 && PROC_FS
-	default y
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via /proc/<pid>/seccomp, it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 config HOTPLUG_CPU
 	bool "Support for hot-pluggable CPUs"
 	depends on SPARC64 && SMP
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index eb51fec75948..d49f471b02e3 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -173,22 +173,6 @@ config PGTABLE_LEVELS
 	default 3 if 3_LEVEL_PGTABLES
 	default 2
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y.
-
 config UML_TIME_TRAVEL_SUPPORT
 	bool
 	prompt "Support time-travel mode (e.g. for test execution)"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..1ab22869a765 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1968,22 +1968,6 @@ config EFI_MIXED
 
 	   If unsure, say N.
 
-config SECCOMP
-	def_bool y
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
-	  If unsure, say Y. Only embedded should say N here.
-
 source "kernel/Kconfig.hz"
 
 config KEXEC
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index e997e0119c02..d8a29dc5a284 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -217,20 +217,6 @@ config HOTPLUG_CPU
 
 	  Say N if you want to disable CPU hotplug.
 
-config SECCOMP
-	bool
-	prompt "Enable seccomp to safely compute untrusted bytecode"
-	help
-	  This kernel feature is useful for number crunching applications
-	  that may need to compute untrusted bytecode during their
-	  execution. By using pipes or other transports made available to
-	  the process as file descriptors supporting the read/write
-	  syscalls, it's possible to isolate those applications in
-	  their own address space using seccomp. Once seccomp is
-	  enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
-	  and the task is only allowed to execute a few safe syscalls
-	  defined by each seccomp mode.
-
 config FAST_SYSCALL_XTENSA
 	bool "Enable fast atomic syscalls"
 	default n
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-24 12:44   ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
  2020-09-24 12:44     ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
@ 2020-09-24 12:44     ` YiFei Zhu
  2020-09-24 13:47       ` David Laight
  2020-09-24 12:44     ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
                       ` (3 subsequent siblings)
  5 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Seccomp cache emulator needs to know all the architecture numbers
that syscall_get_arch() could return for the kernel build in order
to generate a cache for all of them.

The array is declared in the header as static __maybe_unused const
to maximize compiler optimization opportunities such as loop
unrolling.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/alpha/include/asm/syscall.h      |  4 ++++
 arch/arc/include/asm/syscall.h        | 24 +++++++++++++++++++-----
 arch/arm/include/asm/syscall.h        |  4 ++++
 arch/arm64/include/asm/syscall.h      |  4 ++++
 arch/c6x/include/asm/syscall.h        | 13 +++++++++++--
 arch/csky/include/asm/syscall.h       |  4 ++++
 arch/h8300/include/asm/syscall.h      |  4 ++++
 arch/hexagon/include/asm/syscall.h    |  4 ++++
 arch/ia64/include/asm/syscall.h       |  4 ++++
 arch/m68k/include/asm/syscall.h       |  4 ++++
 arch/microblaze/include/asm/syscall.h |  4 ++++
 arch/mips/include/asm/syscall.h       | 16 ++++++++++++++++
 arch/nds32/include/asm/syscall.h      | 13 +++++++++++--
 arch/nios2/include/asm/syscall.h      |  4 ++++
 arch/openrisc/include/asm/syscall.h   |  4 ++++
 arch/parisc/include/asm/syscall.h     |  7 +++++++
 arch/powerpc/include/asm/syscall.h    | 14 ++++++++++++++
 arch/riscv/include/asm/syscall.h      | 14 ++++++++++----
 arch/s390/include/asm/syscall.h       |  7 +++++++
 arch/sh/include/asm/syscall_32.h      | 17 +++++++++++------
 arch/sparc/include/asm/syscall.h      |  9 +++++++++
 arch/x86/include/asm/syscall.h        | 11 +++++++++++
 arch/x86/um/asm/syscall.h             | 14 ++++++++++----
 arch/xtensa/include/asm/syscall.h     |  4 ++++
 24 files changed, 184 insertions(+), 23 deletions(-)

diff --git a/arch/alpha/include/asm/syscall.h b/arch/alpha/include/asm/syscall.h
index 11c688c1d7ec..625ac9b23f37 100644
--- a/arch/alpha/include/asm/syscall.h
+++ b/arch/alpha/include/asm/syscall.h
@@ -4,6 +4,10 @@
 
 #include <uapi/linux/audit.h>
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_ALPHA
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_ALPHA;
diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h
index 94529e89dff0..899c13cbf5cc 100644
--- a/arch/arc/include/asm/syscall.h
+++ b/arch/arc/include/asm/syscall.h
@@ -65,14 +65,28 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
 	}
 }
 
+#ifdef CONFIG_ISA_ARCOMPACT
+# ifdef CONFIG_CPU_BIG_ENDIAN
+#  define SYSCALL_ARCH AUDIT_ARCH_ARCOMPACTBE
+# else
+#  define SYSCALL_ARCH AUDIT_ARCH_ARCOMPACT
+# endif /* CONFIG_CPU_BIG_ENDIAN */
+#else
+# ifdef CONFIG_CPU_BIG_ENDIAN
+#  define SYSCALL_ARCH AUDIT_ARCH_ARCV2BE
+# else
+#  define SYSCALL_ARCH AUDIT_ARCH_ARCV2
+# endif /* CONFIG_CPU_BIG_ENDIAN */
+#endif /* CONFIG_ISA_ARCOMPACT */
+
+static __maybe_unused const int syscall_arches[] = {
+	SYSCALL_ARCH
+};
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {
-	return IS_ENABLED(CONFIG_ISA_ARCOMPACT)
-		? (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)
-			? AUDIT_ARCH_ARCOMPACTBE : AUDIT_ARCH_ARCOMPACT)
-		: (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)
-			? AUDIT_ARCH_ARCV2BE : AUDIT_ARCH_ARCV2);
+	return SYSCALL_ARCH;
 }
 
 #endif
diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
index fd02761ba06c..33ade26e3956 100644
--- a/arch/arm/include/asm/syscall.h
+++ b/arch/arm/include/asm/syscall.h
@@ -73,6 +73,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	memcpy(&regs->ARM_r0 + 1, args, 5 * sizeof(args[0]));
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_ARM
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	/* ARM tasks don't change audit architectures on the fly. */
diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index cfc0672013f6..77f3d300e7a0 100644
--- a/arch/arm64/include/asm/syscall.h
+++ b/arch/arm64/include/asm/syscall.h
@@ -82,6 +82,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	memcpy(&regs->regs[1], args, 5 * sizeof(args[0]));
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_ARM, AUDIT_ARCH_AARCH64
+};
+
 /*
  * We don't care about endianness (__AUDIT_ARCH_LE bit) here because
  * AArch64 has the same system calls both on little- and big- endian.
diff --git a/arch/c6x/include/asm/syscall.h b/arch/c6x/include/asm/syscall.h
index 38f3e2284ecd..0d78c67ee1fc 100644
--- a/arch/c6x/include/asm/syscall.h
+++ b/arch/c6x/include/asm/syscall.h
@@ -66,10 +66,19 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	regs->a9 = *args;
 }
 
+#ifdef CONFIG_CPU_BIG_ENDIAN
+#define SYSCALL_ARCH AUDIT_ARCH_C6XBE
+#else
+#define SYSCALL_ARCH AUDIT_ARCH_C6X
+#endif
+
+static __maybe_unused const int syscall_arches[] = {
+	SYSCALL_ARCH
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
-	return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)
-		? AUDIT_ARCH_C6XBE : AUDIT_ARCH_C6X;
+	return SYSCALL_ARCH;
 }
 
 #endif /* __ASM_C6X_SYSCALLS_H */
diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h
index f624fa3bbc22..86242d2850d7 100644
--- a/arch/csky/include/asm/syscall.h
+++ b/arch/csky/include/asm/syscall.h
@@ -68,6 +68,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
 	memcpy(&regs->a1, args, 5 * sizeof(regs->a1));
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_CSKY
+};
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {
diff --git a/arch/h8300/include/asm/syscall.h b/arch/h8300/include/asm/syscall.h
index 01666b8bb263..775f6ac8fde3 100644
--- a/arch/h8300/include/asm/syscall.h
+++ b/arch/h8300/include/asm/syscall.h
@@ -28,6 +28,10 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
 	*args   = regs->er6;
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_H8300
+};
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {
diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h
index f6e454f18038..6ee21a76f6a3 100644
--- a/arch/hexagon/include/asm/syscall.h
+++ b/arch/hexagon/include/asm/syscall.h
@@ -45,6 +45,10 @@ static inline long syscall_get_return_value(struct task_struct *task,
 	return regs->r00;
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_HEXAGON
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_HEXAGON;
diff --git a/arch/ia64/include/asm/syscall.h b/arch/ia64/include/asm/syscall.h
index 6c6f16e409a8..19456125c89a 100644
--- a/arch/ia64/include/asm/syscall.h
+++ b/arch/ia64/include/asm/syscall.h
@@ -71,6 +71,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	ia64_syscall_get_set_arguments(task, regs, args, 1);
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_IA64
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_IA64;
diff --git a/arch/m68k/include/asm/syscall.h b/arch/m68k/include/asm/syscall.h
index 465ac039be09..031b051f9026 100644
--- a/arch/m68k/include/asm/syscall.h
+++ b/arch/m68k/include/asm/syscall.h
@@ -4,6 +4,10 @@
 
 #include <uapi/linux/audit.h>
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_M68K
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_M68K;
diff --git a/arch/microblaze/include/asm/syscall.h b/arch/microblaze/include/asm/syscall.h
index 3a6924f3cbde..28cde14056d1 100644
--- a/arch/microblaze/include/asm/syscall.h
+++ b/arch/microblaze/include/asm/syscall.h
@@ -105,6 +105,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
 asmlinkage unsigned long do_syscall_trace_enter(struct pt_regs *regs);
 asmlinkage void do_syscall_trace_leave(struct pt_regs *regs);
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_MICROBLAZE
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_MICROBLAZE;
diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h
index 25fa651c937d..29e4c1c47c54 100644
--- a/arch/mips/include/asm/syscall.h
+++ b/arch/mips/include/asm/syscall.h
@@ -140,6 +140,22 @@ extern const unsigned long sys_call_table[];
 extern const unsigned long sys32_call_table[];
 extern const unsigned long sysn32_call_table[];
 
+static __maybe_unused const int syscall_arches[] = {
+#ifdef __LITTLE_ENDIAN
+	AUDIT_ARCH_MIPSEL,
+# ifdef CONFIG_64BIT
+	AUDIT_ARCH_MIPSEL64,
+	AUDIT_ARCH_MIPSEL64N32,
+# endif /* CONFIG_64BIT */
+#else
+	AUDIT_ARCH_MIPS,
+# ifdef CONFIG_64BIT
+	AUDIT_ARCH_MIPS64,
+	AUDIT_ARCH_MIPS64N32,
+# endif /* CONFIG_64BIT */
+#endif /* __LITTLE_ENDIAN */
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	int arch = AUDIT_ARCH_MIPS;
diff --git a/arch/nds32/include/asm/syscall.h b/arch/nds32/include/asm/syscall.h
index 7b5180d78e20..2dd5e33bcfcb 100644
--- a/arch/nds32/include/asm/syscall.h
+++ b/arch/nds32/include/asm/syscall.h
@@ -154,11 +154,20 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
 	memcpy(&regs->uregs[0] + 1, args, 5 * sizeof(args[0]));
 }
 
+#ifdef CONFIG_CPU_BIG_ENDIAN
+#define SYSCALL_ARCH AUDIT_ARCH_NDS32BE
+#else
+#define SYSCALL_ARCH AUDIT_ARCH_NDS32
+#endif
+
+static __maybe_unused const int syscall_arches[] = {
+	SYSCALL_ARCH
+};
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {
-	return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)
-		? AUDIT_ARCH_NDS32BE : AUDIT_ARCH_NDS32;
+	return SYSCALL_ARCH;
 }
 
 #endif /* _ASM_NDS32_SYSCALL_H */
diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h
index 526449edd768..8fa2716cac5a 100644
--- a/arch/nios2/include/asm/syscall.h
+++ b/arch/nios2/include/asm/syscall.h
@@ -69,6 +69,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	regs->r9 = *args;
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_NIOS2
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_NIOS2;
diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h
index e6383be2a195..4eb28ad08042 100644
--- a/arch/openrisc/include/asm/syscall.h
+++ b/arch/openrisc/include/asm/syscall.h
@@ -64,6 +64,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
 	memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_OPENRISC
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_OPENRISC;
diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h
index 00b127a5e09b..2915f140c9fd 100644
--- a/arch/parisc/include/asm/syscall.h
+++ b/arch/parisc/include/asm/syscall.h
@@ -55,6 +55,13 @@ static inline void syscall_rollback(struct task_struct *task,
 	/* do nothing */
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_PARISC,
+#ifdef CONFIG_64BIT
+	AUDIT_ARCH_PARISC64,
+#endif
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	int arch = AUDIT_ARCH_PARISC;
diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
index fd1b518eed17..781deb211e3d 100644
--- a/arch/powerpc/include/asm/syscall.h
+++ b/arch/powerpc/include/asm/syscall.h
@@ -104,6 +104,20 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	regs->orig_gpr3 = args[0];
 }
 
+static __maybe_unused const int syscall_arches[] = {
+#ifdef __LITTLE_ENDIAN__
+	AUDIT_ARCH_PPC | __AUDIT_ARCH_LE,
+# ifdef CONFIG_PPC64
+	AUDIT_ARCH_PPC64LE,
+# endif /* CONFIG_PPC64 */
+#else
+	AUDIT_ARCH_PPC,
+# ifdef CONFIG_PPC64
+	AUDIT_ARCH_PPC64,
+# endif /* CONFIG_PPC64 */
+#endif /* __LITTLE_ENDIAN__ */
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	int arch;
diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h
index 49350c8bd7b0..4b36d358243e 100644
--- a/arch/riscv/include/asm/syscall.h
+++ b/arch/riscv/include/asm/syscall.h
@@ -73,13 +73,19 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	memcpy(&regs->a1, args, 5 * sizeof(regs->a1));
 }
 
-static inline int syscall_get_arch(struct task_struct *task)
-{
 #ifdef CONFIG_64BIT
-	return AUDIT_ARCH_RISCV64;
+#define SYSCALL_ARCH AUDIT_ARCH_RISCV64
 #else
-	return AUDIT_ARCH_RISCV32;
+#define SYSCALL_ARCH AUDIT_ARCH_RISCV32
 #endif
+
+static __maybe_unused const int syscall_arches[] = {
+	SYSCALL_ARCH
+};
+
+static inline int syscall_get_arch(struct task_struct *task)
+{
+	return SYSCALL_ARCH;
 }
 
 #endif	/* _ASM_RISCV_SYSCALL_H */
diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h
index d9d5de0f67ff..4cb9da36610a 100644
--- a/arch/s390/include/asm/syscall.h
+++ b/arch/s390/include/asm/syscall.h
@@ -89,6 +89,13 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	regs->orig_gpr2 = args[0];
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_S390X,
+#ifdef CONFIG_COMPAT
+	AUDIT_ARCH_S390,
+#endif
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 #ifdef CONFIG_COMPAT
diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h
index cb51a7528384..4780f2339c72 100644
--- a/arch/sh/include/asm/syscall_32.h
+++ b/arch/sh/include/asm/syscall_32.h
@@ -69,13 +69,18 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	regs->regs[4] = args[0];
 }
 
-static inline int syscall_get_arch(struct task_struct *task)
-{
-	int arch = AUDIT_ARCH_SH;
-
 #ifdef CONFIG_CPU_LITTLE_ENDIAN
-	arch |= __AUDIT_ARCH_LE;
+#define SYSCALL_ARCH AUDIT_ARCH_SHEL
+#else
+#define SYSCALL_ARCH AUDIT_ARCH_SH
 #endif
-	return arch;
+
+static __maybe_unused const int syscall_arches[] = {
+	SYSCALL_ARCH
+};
+
+static inline int syscall_get_arch(struct task_struct *task)
+{
+	return SYSCALL_ARCH;
 }
 #endif /* __ASM_SH_SYSCALL_32_H */
diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h
index 62a5a78804c4..a458992cdcfe 100644
--- a/arch/sparc/include/asm/syscall.h
+++ b/arch/sparc/include/asm/syscall.h
@@ -127,6 +127,15 @@ static inline void syscall_set_arguments(struct task_struct *task,
 		regs->u_regs[UREG_I0 + i] = args[i];
 }
 
+static __maybe_unused const int syscall_arches[] = {
+#ifdef CONFIG_SPARC64
+	AUDIT_ARCH_SPARC64,
+#endif
+#if !defined(CONFIG_SPARC64) || defined(CONFIG_COMPAT)
+	AUDIT_ARCH_SPARC,
+#endif
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 #if defined(CONFIG_SPARC64) && defined(CONFIG_COMPAT)
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 7cbf733d11af..e13bb2a65b6f 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	memcpy(&regs->bx + i, args, n * sizeof(args[0]));
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_I386
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_I386;
@@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task,
 	}
 }
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_X86_64,
+#ifdef CONFIG_IA32_EMULATION
+	AUDIT_ARCH_I386,
+#endif
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	/* x32 tasks should be considered AUDIT_ARCH_X86_64. */
diff --git a/arch/x86/um/asm/syscall.h b/arch/x86/um/asm/syscall.h
index 56a2f0913e3c..590a31e22b99 100644
--- a/arch/x86/um/asm/syscall.h
+++ b/arch/x86/um/asm/syscall.h
@@ -9,13 +9,19 @@ typedef asmlinkage long (*sys_call_ptr_t)(unsigned long, unsigned long,
 					  unsigned long, unsigned long,
 					  unsigned long, unsigned long);
 
-static inline int syscall_get_arch(struct task_struct *task)
-{
 #ifdef CONFIG_X86_32
-	return AUDIT_ARCH_I386;
+#define SYSCALL_ARCH AUDIT_ARCH_I386
 #else
-	return AUDIT_ARCH_X86_64;
+#define SYSCALL_ARCH AUDIT_ARCH_X86_64
 #endif
+
+static __maybe_unused const int syscall_arches[] = {
+	SYSCALL_ARCH
+};
+
+static inline int syscall_get_arch(struct task_struct *task)
+{
+	return SYSCALL_ARCH;
 }
 
 #endif /* __UM_ASM_SYSCALL_H */
diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h
index f9a671cbf933..3d334fb0d329 100644
--- a/arch/xtensa/include/asm/syscall.h
+++ b/arch/xtensa/include/asm/syscall.h
@@ -14,6 +14,10 @@
 #include <asm/ptrace.h>
 #include <uapi/linux/audit.h>
 
+static __maybe_unused const int syscall_arches[] = {
+	AUDIT_ARCH_XTENSA
+};
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
 	return AUDIT_ARCH_XTENSA;
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-24 12:44   ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
  2020-09-24 12:44     ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
  2020-09-24 12:44     ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu
@ 2020-09-24 12:44     ` YiFei Zhu
  2020-09-24 23:25       ` Kees Cook
  2020-09-24 12:44     ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
                       ` (2 subsequent siblings)
  5 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

SECCOMP_CACHE_NR_ONLY caches results only for syscalls whose
filter outcome does not depend on the syscall arguments or the
instruction pointer. To facilitate this we need a static analyser
that can tell whether a filter will return allow regardless of
syscall arguments for a given architecture number / syscall number
pair. This is implemented here with a pseudo-emulator, and the
results are stored in a per-filter bitmap.

Each common BPF instruction (stolen from Kees's list [1]) is
emulated. Any unusual instruction, or a load from a syscall
argument, will cause the emulator to bail.

Emulation also halts when it reaches a return. In that case, if
the filter returns SECCOMP_RET_ALLOW, the syscall is marked as
allowed.

Filter dependency is resolved at attach time. If a filter is
stacked on top of other filters, we AND its bitmap with that of
its dependee: if the dependee does not guarantee that a syscall is
allowed, then the depender is also marked as not guaranteeing to
allow the syscall.

[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig     |  25 ++++++
 kernel/seccomp.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 218 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 6dfc5673215d..8cc3dc87f253 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -489,6 +489,31 @@ config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+choice
+	prompt "Seccomp filter cache"
+	default SECCOMP_CACHE_NONE
+	depends on SECCOMP_FILTER
+	help
+	  Seccomp filters can potentially incur large overhead for each
+	  system call. Enabling a cache can alleviate some of that overhead.
+
+	  If in doubt, select 'syscall numbers only'.
+
+config SECCOMP_CACHE_NONE
+	bool "None"
+	help
+	  No caching is done. Seccomp filters will be called each time
+	  a system call occurs in a seccomp-guarded task.
+
+config SECCOMP_CACHE_NR_ONLY
+	bool "Syscall number only"
+	depends on !HAVE_SPARSE_SYSCALL_NR
+	help
+	  For each syscall number, if the seccomp filter has a fixed
+	  result, store that result in a bitmap to speed up system calls.
+
+endchoice
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 3ee59ce0a323..20d33378a092 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,6 +143,32 @@ struct notification {
 	struct list_head notifications;
 };
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * struct seccomp_cache_filter_data - container for cache's per-filter data
+ *
+ * @syscall_ok: A bitmap for each architecture number, where each bit
+ *		represents whether the filter will always allow the syscall.
+ */
+struct seccomp_cache_filter_data {
+	DECLARE_BITMAP(syscall_ok[ARRAY_SIZE(syscall_arches)], NR_syscalls);
+};
+
+#define SECCOMP_EMU_MAX_PENDING_STATES 64
+#else
+struct seccomp_cache_filter_data { };
+
+static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+	return 0;
+}
+
+static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter,
+					 const struct seccomp_filter *prev)
+{
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -185,6 +211,7 @@ struct seccomp_filter {
 	struct notification *notif;
 	struct mutex notify_lock;
 	wait_queue_head_t wqh;
+	struct seccomp_cache_filter_data cache;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -530,6 +557,139 @@ static inline void seccomp_sync_threads(unsigned long flags)
 	}
 }
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * struct seccomp_emu_env - container for seccomp emulator environment
+ *
+ * @filter: The cBPF filter instructions.
+ * @nr: The syscall number we are emulating.
+ * @arch: The architecture number we are emulating.
+ * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the
+ *		syscall.
+ */
+struct seccomp_emu_env {
+	struct sock_filter *filter;
+	int arch;
+	int nr;
+	bool syscall_ok;
+};
+
+/**
+ * struct seccomp_emu_state - container for seccomp emulator state
+ *
+ * @next: The next pending state. This structure is a linked list.
+ * @pc: The current program counter.
+ * @areg: The value of the A register.
+ */
+struct seccomp_emu_state {
+	struct seccomp_emu_state *next;
+	int pc;
+	u32 areg;
+};
+
+/**
+ * seccomp_emu_step - step one instruction in the emulator
+ * @env: The emulator environment
+ * @state: The emulator state
+ *
+ * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred.
+ */
+static int seccomp_emu_step(struct seccomp_emu_env *env,
+			    struct seccomp_emu_state *state)
+{
+	struct sock_filter *ftest = &env->filter[state->pc++];
+	u16 code = ftest->code;
+	u32 k = ftest->k;
+	bool compare;
+
+	switch (code) {
+	case BPF_LD | BPF_W | BPF_ABS:
+		if (k == offsetof(struct seccomp_data, nr))
+			state->areg = env->nr;
+		else if (k == offsetof(struct seccomp_data, arch))
+			state->areg = env->arch;
+		else
+			return 1;
+
+		return 0;
+	case BPF_JMP | BPF_JA:
+		state->pc += k;
+		return 0;
+	case BPF_JMP | BPF_JEQ | BPF_K:
+	case BPF_JMP | BPF_JGE | BPF_K:
+	case BPF_JMP | BPF_JGT | BPF_K:
+	case BPF_JMP | BPF_JSET | BPF_K:
+		switch (BPF_OP(code)) {
+		case BPF_JEQ:
+			compare = state->areg == k;
+			break;
+		case BPF_JGT:
+			compare = state->areg > k;
+			break;
+		case BPF_JGE:
+			compare = state->areg >= k;
+			break;
+		case BPF_JSET:
+			compare = state->areg & k;
+			break;
+		default:
+			WARN_ON(true);
+			return -EINVAL;
+		}
+
+		state->pc += compare ? ftest->jt : ftest->jf;
+		return 0;
+	case BPF_ALU | BPF_AND | BPF_K:
+		state->areg &= k;
+		return 0;
+	case BPF_RET | BPF_K:
+		env->syscall_ok = k == SECCOMP_RET_ALLOW;
+		return 1;
+	default:
+		return 1;
+	}
+}
+
+/**
+ * seccomp_cache_prepare - emulate the filter to find cachable syscalls
+ * @sfilter: The seccomp filter
+ *
+ * Returns 0 if successful or -errno if error occurred.
+ */
+int seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+	struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
+	struct sock_filter *filter = fprog->filter;
+	int arch, nr, res = 0;
+
+	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
+		for (nr = 0; nr < NR_syscalls; nr++) {
+			struct seccomp_emu_env env = {0};
+			struct seccomp_emu_state state = {0};
+
+			env.filter = filter;
+			env.arch = syscall_arches[arch];
+			env.nr = nr;
+
+			while (true) {
+				res = seccomp_emu_step(&env, &state);
+				if (res)
+					break;
+			}
+
+			if (res < 0)
+				goto out;
+
+			if (env.syscall_ok)
+				set_bit(nr, sfilter->cache.syscall_ok[arch]);
+		}
+	}
+
+out:
+	return res;
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_prepare_filter: Prepares a seccomp filter for use.
  * @fprog: BPF program to install
@@ -540,7 +700,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 {
 	struct seccomp_filter *sfilter;
 	int ret;
-	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
+	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
+			       IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY);
 
 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
 		return ERR_PTR(-EINVAL);
@@ -571,6 +732,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 		return ERR_PTR(ret);
 	}
 
+	ret = seccomp_cache_prepare(sfilter);
+	if (ret < 0) {
+		bpf_prog_destroy(sfilter->prog);
+		kfree(sfilter);
+		return ERR_PTR(ret);
+	}
+
 	refcount_set(&sfilter->refs, 1);
 	refcount_set(&sfilter->users, 1);
 	init_waitqueue_head(&sfilter->wqh);
@@ -606,6 +774,29 @@ seccomp_prepare_user_filter(const char __user *user_filter)
 	return filter;
 }
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * seccomp_cache_inherit - mask accept bitmap against previous filter
+ * @sfilter: The seccomp filter
+ * @prev: The previous seccomp filter
+ */
+static void seccomp_cache_inherit(struct seccomp_filter *sfilter,
+				  const struct seccomp_filter *prev)
+{
+	int arch;
+
+	if (!prev)
+		return;
+
+	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
+		bitmap_and(sfilter->cache.syscall_ok[arch],
+			   sfilter->cache.syscall_ok[arch],
+			   prev->cache.syscall_ok[arch],
+			   NR_syscalls);
+	}
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_attach_filter: validate and attach filter
  * @flags:  flags to change filter behavior
@@ -655,6 +846,7 @@ static long seccomp_attach_filter(unsigned int flags,
 	 * task reference.
 	 */
 	filter->prev = current->seccomp.filter;
+	seccomp_cache_inherit(filter, filter->prev);
 	current->seccomp.filter = filter;
 	atomic_inc(&current->seccomp.filter_count);
 
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path
  2020-09-24 12:44   ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
                       ` (2 preceding siblings ...)
  2020-09-24 12:44     ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
@ 2020-09-24 12:44     ` YiFei Zhu
  2020-09-24 23:46       ` Kees Cook
  2020-09-24 12:44     ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
  2020-09-24 12:44     ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
  5 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.

The lookup first finds the matching allow bitmap by iterating
through the syscall_arches[] array and comparing each entry to the
architecture in struct seccomp_data; this loop is expected to be
unrolled. It then does a test_bit against that bitmap. If the bit
is set, there is no need to run the full filter; it returns
SECCOMP_RET_ALLOW immediately.

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 20d33378a092..ac0266b6d18a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -167,6 +167,12 @@ static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter,
 					 const struct seccomp_filter *prev)
 {
 }
+
+static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter,
+				       const struct seccomp_data *sd)
+{
+	return false;
+}
 #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
 
 /**
@@ -321,6 +327,34 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 	return 0;
 }
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * seccomp_cache_check - lookup seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to lookup the cache with
+ *
+ * Returns true if the seccomp_data is cached and allowed.
+ */
+static bool seccomp_cache_check(const struct seccomp_filter *sfilter,
+				const struct seccomp_data *sd)
+{
+	int syscall_nr = sd->nr;
+	int arch;
+
+	if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
+		return false;
+
+	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
+		if (likely(syscall_arches[arch] == sd->arch))
+			return test_bit(syscall_nr,
+					sfilter->cache.syscall_ok[arch]);
+	}
+
+	WARN_ON_ONCE(true);
+	return false;
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_run_filters - evaluates all seccomp filters against @sd
  * @sd: optional seccomp data to be passed to filters
@@ -343,6 +377,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 	if (WARN_ON(f == NULL))
 		return SECCOMP_RET_KILL_PROCESS;
 
+	if (seccomp_cache_check(f, sd))
+		return SECCOMP_RET_ALLOW;
+
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead
  2020-09-24 12:44   ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
                       ` (3 preceding siblings ...)
  2020-09-24 12:44     ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
@ 2020-09-24 12:44     ` YiFei Zhu
  2020-09-24 23:47       ` Kees Cook
  2020-09-24 12:44     ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
  5 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: Kees Cook <keescook@chromium.org>

As part of the seccomp benchmarking, include the expectations with
regard to the timing behavior of the constant action bitmaps, and report
inconsistencies better.

Example output with constant action bitmaps on x86:

$ sudo ./seccomp_benchmark 100000000
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 100000000 syscalls...
63.896255358 - 0.008504529 = 63887750829 (63.9s)
getpid native: 638 ns
130.383312423 - 63.897315189 = 66485997234 (66.5s)
getpid RET_ALLOW 1 filter (bitmap): 664 ns
196.789080421 - 130.384414983 = 66404665438 (66.4s)
getpid RET_ALLOW 2 filters (bitmap): 664 ns
268.844643304 - 196.790234168 = 72054409136 (72.1s)
getpid RET_ALLOW 3 filters (full): 720 ns
342.627472515 - 268.845799103 = 73781673412 (73.8s)
getpid RET_ALLOW 4 filters (full): 737 ns
Estimated total seccomp overhead for 1 bitmapped filter: 26 ns
Estimated total seccomp overhead for 2 bitmapped filters: 26 ns
Estimated total seccomp overhead for 3 full filters: 82 ns
Estimated total seccomp overhead for 4 full filters: 99 ns
Estimated seccomp entry overhead: 26 ns
Estimated seccomp per-filter overhead (last 2 diff): 17 ns
Estimated seccomp per-filter overhead (filters / 4): 18 ns
Expectations:
	native ≤ 1 bitmap (638 ≤ 664): ✔️
	native ≤ 1 filter (638 ≤ 720): ✔️
	per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️
	1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️
	entry ≈ 1 bitmapped (26 ≈ 26): ✔️
	entry ≈ 2 bitmapped (26 ≈ 26): ✔️
	native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++++++++++---
 tools/testing/selftests/seccomp/settings      |   2 +-
 2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 91f5a89cadac..fcc806585266 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,12 +4,16 @@
  */
 #define _GNU_SOURCE
 #include <assert.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <unistd.h>
 #include <linux/filter.h>
 #include <linux/seccomp.h>
+#include <sys/param.h>
 #include <sys/prctl.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
@@ -70,18 +74,74 @@ unsigned long long calibrate(void)
 	return samples * seconds;
 }
 
+bool approx(int i_one, int i_two)
+{
+	double one = i_one, one_bump = one * 0.01;
+	double two = i_two, two_bump = two * 0.01;
+
+	one_bump = one + MAX(one_bump, 2.0);
+	two_bump = two + MAX(two_bump, 2.0);
+
+	/* Equal to, or within 1% or 2 digits */
+	if (one == two ||
+	    (one > two && one <= two_bump) ||
+	    (two > one && two <= one_bump))
+		return true;
+	return false;
+}
+
+bool le(int i_one, int i_two)
+{
+	if (i_one <= i_two)
+		return true;
+	return false;
+}
+
+long compare(const char *name_one, const char *name_eval, const char *name_two,
+	     unsigned long long one, bool (*eval)(int, int), unsigned long long two)
+{
+	bool good;
+
+	printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two,
+	       (long long)one, name_eval, (long long)two);
+	if (one > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)one);
+		return 1;
+	}
+	if (two > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)two);
+		return 1;
+	}
+
+	good = eval(one, two);
+	printf("%s\n", good ? "✔️" : "❌");
+
+	return good ? 0 : 1;
+}
+
 int main(int argc, char *argv[])
 {
+	struct sock_filter bitmap_filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog bitmap_prog = {
+		.len = (unsigned short)ARRAY_SIZE(bitmap_filter),
+		.filter = bitmap_filter,
+	};
 	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])),
 		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
 	};
 	struct sock_fprog prog = {
 		.len = (unsigned short)ARRAY_SIZE(filter),
 		.filter = filter,
 	};
-	long ret;
-	unsigned long long samples;
-	unsigned long long native, filter1, filter2;
+
+	long ret, bits;
+	unsigned long long samples, calc;
+	unsigned long long native, filter1, filter2, bitmap1, bitmap2;
+	unsigned long long entry, per_filter1, per_filter2;
 
 	printf("Current BPF sysctl settings:\n");
 	system("sysctl net.core.bpf_jit_enable");
@@ -101,35 +161,82 @@ int main(int argc, char *argv[])
 	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
 	assert(ret == 0);
 
-	/* One filter */
-	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
+	/* One filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
 	assert(ret == 0);
 
-	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1);
+	bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1);
+
+	/* Second filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	if (filter1 == native)
-		printf("No overhead measured!? Try running again with more samples.\n");
+	bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2);
 
-	/* Two filters */
+	/* Third filter, can no longer be converted to bitmap */
 	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
 	assert(ret == 0);
 
-	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2);
-
-	/* Calculations */
-	printf("Estimated total seccomp overhead for 1 filter: %llu ns\n",
-		filter1 - native);
+	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1);
 
-	printf("Estimated total seccomp overhead for 2 filters: %llu ns\n",
-		filter2 - native);
+	/* Fourth filter, can not be converted to bitmap because of filter 3 */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	printf("Estimated seccomp per-filter overhead: %llu ns\n",
-		filter2 - filter1);
+	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2);
+
+	/* Estimations */
+#define ESTIMATE(fmt, var, what)	do {			\
+		var = (what);					\
+		printf("Estimated " fmt ": %llu ns\n", var);	\
+		if (var > INT_MAX)				\
+			goto more_samples;			\
+	} while (0)
+
+	ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc,
+		 bitmap1 - native);
+	ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc,
+		 bitmap2 - native);
+	ESTIMATE("total seccomp overhead for 3 full filters", calc,
+		 filter1 - native);
+	ESTIMATE("total seccomp overhead for 4 full filters", calc,
+		 filter2 - native);
+	ESTIMATE("seccomp entry overhead", entry,
+		 bitmap1 - native - (bitmap2 - bitmap1));
+	ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1,
+		 filter2 - filter1);
+	ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2,
+		 (filter2 - native - entry) / 4);
+
+	printf("Expectations:\n");
+	ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1);
+	bits = compare("native", "≤", "1 filter", native, le, filter1);
+	if (bits)
+		goto more_samples;
+
+	ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)",
+			per_filter1, approx, per_filter2);
+
+	bits = compare("1 bitmapped", "≈", "2 bitmapped",
+			bitmap1 - native, approx, bitmap2 - native);
+	if (bits) {
+		printf("Skipping constant action bitmap expectations: they appear unsupported.\n");
+		goto out;
+	}
 
-	printf("Estimated seccomp entry overhead: %llu ns\n",
-		filter1 - native - (filter2 - filter1));
+	ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native);
+	ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native);
+	ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total",
+			entry + (per_filter1 * 4) + native, approx, filter2);
+	if (ret == 0)
+		goto out;
 
+more_samples:
+	printf("Saw unexpected benchmark result. Try running again with more samples?\n");
+out:
 	return 0;
 }
diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings
index ba4d85f74cd6..6091b45d226b 100644
--- a/tools/testing/selftests/seccomp/settings
+++ b/tools/testing/selftests/seccomp/settings
@@ -1 +1 @@
-timeout=90
+timeout=120
-- 
2.28.0


* [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-09-24 12:44   ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
                       ` (4 preceding siblings ...)
  2020-09-24 12:44     ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
@ 2020-09-24 12:44     ` YiFei Zhu
  2020-09-24 23:56       ` Kees Cook
  5 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 12:44 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Currently the kernel does not provide infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through the FTRACE_SYSCALL
infrastructure, but it does not support compat syscalls.

This creates a file for each PID, /proc/pid/seccomp_cache.
The file is empty when no seccomp filters are loaded; otherwise
each line is in the format:
<hex arch number> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and FILTER means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:
c000003e 0 ALLOW
c000003e 1 ALLOW
c000003e 2 ALLOW
c000003e 3 ALLOW
[...]
c000003e 132 ALLOW
c000003e 133 ALLOW
c000003e 134 FILTER
c000003e 135 FILTER
c000003e 136 FILTER
c000003e 137 ALLOW
c000003e 138 ALLOW
c000003e 139 FILTER
c000003e 140 ALLOW
c000003e 141 ALLOW
[...]

This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable.

I'm not sure if adding all the "human readable names" is worthwhile,
considering it can be easily done in userspace.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig            | 10 ++++++++++
 fs/proc/base.c          |  7 +++++--
 include/linux/seccomp.h |  5 +++++
 kernel/seccomp.c        | 26 ++++++++++++++++++++++++++
 4 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc3dc87f253..dbfd897e5dc0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -514,6 +514,16 @@ config SECCOMP_CACHE_NR_ONLY
 
 endchoice
 
+config PROC_SECCOMP_CACHE
+	bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
+	depends on SECCOMP_CACHE_NR_ONLY
+	depends on PROC_FS
+	help
	  This enables the /proc/pid/seccomp_cache interface to monitor
+	  seccomp cache data. The file format is subject to change.
+
+	  If unsure, say N.
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617db4e0faa0..2af626f69fa1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2615,7 +2615,7 @@ static struct dentry *proc_pident_instantiate(struct dentry *dentry,
 	return d_splice_alias(inode, dentry);
 }
 
-static struct dentry *proc_pident_lookup(struct inode *dir, 
+static struct dentry *proc_pident_lookup(struct inode *dir,
 					 struct dentry *dentry,
 					 const struct pid_entry *p,
 					 const struct pid_entry *end)
@@ -2815,7 +2815,7 @@ static const struct pid_entry attr_dir_stuff[] = {
 
 static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx)
 {
-	return proc_pident_readdir(file, ctx, 
+	return proc_pident_readdir(file, ctx,
 				   attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff));
 }
 
@@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
 #endif
+#ifdef CONFIG_PROC_SECCOMP_CACHE
+	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..3cedec824365 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task,
 	return -EINVAL;
 }
 #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_PROC_SECCOMP_CACHE
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+			   struct pid *pid, struct task_struct *task);
+#endif
 #endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ac0266b6d18a..d97ec1876b4e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2293,3 +2293,29 @@ static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_PROC_SECCOMP_CACHE
+/* Currently CONFIG_PROC_SECCOMP_CACHE implies CONFIG_SECCOMP_CACHE_NR_ONLY */
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+			   struct pid *pid, struct task_struct *task)
+{
+	struct seccomp_filter *f = READ_ONCE(task->seccomp.filter);
+	int arch, nr;
+
+	if (!f)
+		return 0;
+
+	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
+		for (nr = 0; nr < NR_syscalls; nr++) {
+			bool cached = test_bit(nr, f->cache.syscall_ok[arch]);
+			char *status = cached ? "ALLOW" : "FILTER";
+
+			seq_printf(m, "%08x %d %s\n", syscall_arches[arch],
+				   nr, status
+			);
+		}
+	}
+
+	return 0;
+}
+#endif /* CONFIG_PROC_SECCOMP_CACHE */
-- 
2.28.0


* RE: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-24 12:44     ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu
@ 2020-09-24 13:47       ` David Laight
  2020-09-24 14:16         ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: David Laight @ 2020-09-24 13:47 UTC (permalink / raw)
  To: 'YiFei Zhu', containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jack Chen, Jann Horn, Josep Torrellas,
	Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum, Tycho Andersen,
	Valentin Rothberg, Will Drewry

From: YiFei Zhu 
> Sent: 24 September 2020 13:44
> 
> Seccomp cache emulator needs to know all the architecture numbers
> that syscall_get_arch() could return for the kernel build in order
> to generate a cache for all of them.
> 
> The array is declared in header as static __maybe_unused const
> to maximize compiler optimization opportunities such as loop
> unrolling.

I doubt the compiler will do what you want.
Looking at it, in most cases there are one or two entries.
I think only MIPS has three.

So a static inline function that contains a list of
conditionals will generate better code than any kind of
array lookup.
For x86-64 you end up with something like:

#ifdef CONFIG_IA32_EMULATION
	if (sd->arch == AUDIT_ARCH_I386) return xxx;
#endif
	return yyy;

Probably saves you having multiple arrays that need to be
kept carefully in step.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-24 13:47       ` David Laight
@ 2020-09-24 14:16         ` YiFei Zhu
  2020-09-24 14:20           ` David Laight
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 14:16 UTC (permalink / raw)
  To: David Laight
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 8:47 AM David Laight <David.Laight@aculab.com> wrote:
> I doubt the compiler will do what you want.
> Looking at it, in most cases there are one or two entries.
> I think only MIPS has three.

It does ;) GCC 10.2.0:

$ objdump -d kernel/seccomp.o | less
[...]
0000000000001520 <__seccomp_filter>:
[...]
    1587:       41 8b 54 24 04          mov    0x4(%r12),%edx
    158c:       b9 08 01 00 00          mov    $0x108,%ecx
    1591:       81 fa 3e 00 00 c0       cmp    $0xc000003e,%edx
    1597:       75 2e                   jne    15c7 <__seccomp_filter+0xa7>
[...]
    15c7:       81 fa 03 00 00 40       cmp    $0x40000003,%edx
    15cd:       b9 40 01 00 00          mov    $0x140,%ecx
    15d2:       74 c5                   je     1599 <__seccomp_filter+0x79>
    15d4:       0f 0b                   ud2
[...]
0000000000001cb0 <seccomp_cache_prepare>:
[...]
    1cc4:       41 b9 3e 00 00 c0       mov    $0xc000003e,%r9d
[...]
    1dba:       41 b9 03 00 00 40       mov    $0x40000003,%r9d
[...]
0000000000002e30 <proc_pid_seccomp_cache>:
[...]
    2e72:       ba 3e 00 00 c0          mov    $0xc000003e,%edx
[...]
    2eb5:       ba 03 00 00 40          mov    $0x40000003,%edx

Granted, I have CC_OPTIMIZE_FOR_PERFORMANCE rather than
CC_OPTIMIZE_FOR_SIZE, but this patch itself is trying to sacrifice
some of the memory for speed.

YiFei Zhu

* RE: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-24 14:16         ` YiFei Zhu
@ 2020-09-24 14:20           ` David Laight
  2020-09-24 14:37             ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: David Laight @ 2020-09-24 14:20 UTC (permalink / raw)
  To: 'YiFei Zhu'
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu
> Sent: 24 September 2020 15:17
> 
> On Thu, Sep 24, 2020 at 8:47 AM David Laight <David.Laight@aculab.com> wrote:
> > I doubt the compiler will do what you want.
> > Looking at it, in most cases there are one or two entries.
> > I think only MIPS has three.
> 
> It does ;) GCC 10.2.0:
> 
> $ objdump -d kernel/seccomp.o | less
> [...]
> 0000000000001520 <__seccomp_filter>:
> [...]
>     1587:       41 8b 54 24 04          mov    0x4(%r12),%edx
>     158c:       b9 08 01 00 00          mov    $0x108,%ecx
>     1591:       81 fa 3e 00 00 c0       cmp    $0xc000003e,%edx
>     1597:       75 2e                   jne    15c7 <__seccomp_filter+0xa7>
> [...]
>     15c7:       81 fa 03 00 00 40       cmp    $0x40000003,%edx
>     15cd:       b9 40 01 00 00          mov    $0x140,%ecx
>     15d2:       74 c5                   je     1599 <__seccomp_filter+0x79>
>     15d4:       0f 0b                   ud2
> [...]
> 0000000000001cb0 <seccomp_cache_prepare>:
> [...]
>     1cc4:       41 b9 3e 00 00 c0       mov    $0xc000003e,%r9d
> [...]
>     1dba:       41 b9 03 00 00 40       mov    $0x40000003,%r9d
> [...]
> 0000000000002e30 <proc_pid_seccomp_cache>:
> [...]
>     2e72:       ba 3e 00 00 c0          mov    $0xc000003e,%edx
> [...]
>     2eb5:       ba 03 00 00 40          mov    $0x40000003,%edx
> 
> Granted, I have CC_OPTIMIZE_FOR_PERFORMANCE rather than
> CC_OPTIMIZE_FOR_SIZE, but this patch itself is trying to sacrifice
> some of the memory for speed.

Don't both CC_OPTIMIZE_FOR_PERFORMANCE (-??) and CC_OPTIMIZE_FOR_SIZE (-s)
generate terrible code?

Try with a slightly older gcc.
I think that entire optimisation (discarding const arrays)
is very recent.

	David
 


* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-24 14:20           ` David Laight
@ 2020-09-24 14:37             ` YiFei Zhu
  2020-09-24 16:02               ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 14:37 UTC (permalink / raw)
  To: David Laight
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 9:20 AM David Laight <David.Laight@aculab.com> wrote:
> > Granted, I have CC_OPTIMIZE_FOR_PERFORMANCE rather than
> > CC_OPTIMIZE_FOR_SIZE, but this patch itself is trying to sacrifice
> > some of the memory for speed.
>
> Don't both CC_OPTIMIZE_FOR_PERFORMANCE (-??) and CC_OPTIMIZE_FOR_SIZE (-s)
> generate terrible code?

You have to choose one for "Compiler optimization level" in "General Setup", no?
The former is -O2 and the latter is -Os.

> Try with a slightly older gcc.
> I think that entire optimisation (discarding const arrays)
> is very recent.

Will try, will take a while to get an old GCC to run, however :/

YiFei Zhu

* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-24 14:37             ` YiFei Zhu
@ 2020-09-24 16:02               ` YiFei Zhu
  0 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-24 16:02 UTC (permalink / raw)
  To: David Laight
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 9:37 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > Try with a slightly older gcc.
> > I think that entire optimisation (discarding const arrays)
> > is very recent.
>
> Will try, will take a while to get an old GCC to run, however :/

Possibly one of the oldest I can easily get to work is GCC 6.5.0, and
unrolling still seems to be the case:

0000000000001560 <__seccomp_filter>:
[...]
    15d4:       41 8b 74 24 04          mov    0x4(%r12),%esi
    15d9:       bf 08 01 00 00          mov    $0x108,%edi
    15de:       81 fe 3e 00 00 c0       cmp    $0xc000003e,%esi
    15e4:       75 30                   jne    1616 <__seccomp_filter+0xb6>
[...]
    1616:       81 fe 03 00 00 40       cmp    $0x40000003,%esi
    161c:       bf 40 01 00 00          mov    $0x140,%edi
    1621:       74 c3                   je     15e6 <__seccomp_filter+0x86>
    1623:       0f 0b                   ud2

Am I overlooking something or should I go further back in the compiler version?

YiFei Zhu

* Re: [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig
  2020-09-24 12:44     ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
@ 2020-09-24 19:11       ` Kees Cook
  0 siblings, 0 replies; 135+ messages in thread
From: Kees Cook @ 2020-09-24 19:11 UTC (permalink / raw)
  To: containers, YiFei Zhu
  Cc: Kees Cook, Tycho Andersen, Valentin Rothberg, Aleksa Sarai,
	Giuseppe Scrivano, Jann Horn, Tobin Feldman-Fitzthum,
	Josep Torrellas, Tianyin Xu, Hubertus Franke, linux-kernel, bpf,
	YiFei Zhu, Dimitrios Skarlatos, Jack Chen, Andrea Arcangeli,
	Andy Lutomirski, Will Drewry

On Thu, 24 Sep 2020 07:44:15 -0500, YiFei Zhu wrote:
> In order to make adding configurable features into seccomp
> easier, it's better to have the options at one single location,
> considering easpecially that the bulk of seccomp code is
> arch-independent. An quick look also show that many SECCOMP
> descriptions are outdated; they talk about /proc rather than
> prctl.
> 
> Architectures arm, arm64, csky, riscv, sh, and xtensa did not
> have SECCOMP on by default prior to this; as a result of moving
> the config option and keeping it default on, SECCOMP now defaults
> to on for those architectures as well.
> 
> Architectures microblaze, mips, powerpc, s390, sh, and sparc
> have an outdated depend on PROC_FS and this dependency is removed
> in this change.
> 
> Suggested-by: Jann Horn <jannh@google.com>
> Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> [...]

Yes; I've been meaning to do this for a while now. Thank you! I tweaked
the help text a bit.

Applied, thanks!

[1/1] seccomp: Move config option SECCOMP to arch/Kconfig
      https://git.kernel.org/kees/c/c3c9c2df3636

-- 
Kees Cook


* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-24 12:44     ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
@ 2020-09-24 23:25       ` Kees Cook
  2020-09-25  3:04         ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-24 23:25 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 07:44:18AM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <yifeifz2@illinois.edu>
> 
> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not
> access any syscall arguments or instruction pointer. To facilitate
> this we need a static analyser to know whether a filter will
> return allow regardless of syscall arguments for a given
> architecture number / syscall number pair. This is implemented
> here with a pseudo-emulator, and stored in a per-filter bitmap.
> 
> Each common BPF instruction (stolen from Kees's list [1]) is
> emulated. Anything unusual, or a load from a syscall argument, will
> cause the emulator to bail.
> 
> The emulation is also halted if it reaches a return. In that case,
> if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
> 
> Filter dependency is resolved at attach time. If a filter depends
> on earlier filters, then we perform an AND of its bitmap with its
> dependee's; if the dependee does not guarantee to allow the syscall,
> then the depender is also marked not to guarantee to allow the
> syscall.
> 
> [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
> 
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> ---
>  arch/Kconfig     |  25 ++++++
>  kernel/seccomp.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 218 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 6dfc5673215d..8cc3dc87f253 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -489,6 +489,31 @@ config SECCOMP_FILTER
>  
>  	  See Documentation/userspace-api/seccomp_filter.rst for details.
>  
> +choice
> +	prompt "Seccomp filter cache"
> +	default SECCOMP_CACHE_NONE
> +	depends on SECCOMP_FILTER
> +	help
> +	  Seccomp filters can potentially incur large overhead for each
> +	  system call. This can alleviate some of the overhead.
> +
> +	  If in doubt, select 'syscall numbers only'.
> +
> +config SECCOMP_CACHE_NONE
> +	bool "None"
> +	help
> +	  No caching is done. Seccomp filters will be called each time
> +	  a system call occurs in a seccomp-guarded task.
> +
> +config SECCOMP_CACHE_NR_ONLY
> +	bool "Syscall number only"
> +	depends on !HAVE_SPARSE_SYSCALL_NR
> +	help
> +	  For each syscall number, if the seccomp filter has a fixed
> +	  result, store that result in a bitmap to speed up system calls.
> +
> +endchoice

I'm not interested in seccomp having a config option for this. It should
either exist or not, and that depends on the per-architecture support.
You mentioned in another thread that you wanted it to let people play
with this support in some way. Can you elaborate on this? My perspective
is that of distro and vendor kernels: there is _one_ config and end
users can't really do anything about it without rolling their own
kernels.

> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +/**
> + * struct seccomp_cache_filter_data - container for cache's per-filter data
> + *
> + * @syscall_ok: A bitmap for each architecture number, where each bit
> + *		represents whether the filter will always allow the syscall.
> + */
> +struct seccomp_cache_filter_data {
> +	DECLARE_BITMAP(syscall_ok[ARRAY_SIZE(syscall_arches)], NR_syscalls);
> +};

So, as Jann pointed out, using NR_syscalls only accidentally works --
they're actually different sizes and there isn't strictly any reason to
expect one to be smaller than another. So, we need to either choose the
max() in asm/linux/seccomp.h or be more efficient with space usage and
use explicitly named bitmaps (how my v1 does things).

> +
> +#define SECCOMP_EMU_MAX_PENDING_STATES 64

This isn't used in this patch; likely leftover/in need of moving?

> +#else
> +struct seccomp_cache_filter_data { };
> +
> +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter)
> +{
> +	return 0;
> +}
> +
> +static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter,
> +					 const struct seccomp_filter *prev)
> +{
> +}
> +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
> +
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
>   *
> @@ -185,6 +211,7 @@ struct seccomp_filter {
>  	struct notification *notif;
>  	struct mutex notify_lock;
>  	wait_queue_head_t wqh;
> +	struct seccomp_cache_filter_data cache;

I moved this up in the structure to see if I could benefit from cache
line sharing. In either case, we must verify (with "pahole") that we do
not induce massive padding in the struct.

But yes, attaching this to the filter is the right way to go.

>  };
>  
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -530,6 +557,139 @@ static inline void seccomp_sync_threads(unsigned long flags)
>  	}
>  }
>  
> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +/**
> + * struct seccomp_emu_env - container for seccomp emulator environment
> + *
> + * @filter: The cBPF filter instructions.
> + * @nr: The syscall number we are emulating.
> + * @arch: The architecture number we are emulating.
> + * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the
> + *		syscall.
> + */
> +struct seccomp_emu_env {
> +	struct sock_filter *filter;
> +	int arch;
> +	int nr;
> +	bool syscall_ok;

nit: "ok" is too vague. We mean either "constant action" or "allow" (or
"filter" in the negative case).

> +};
> +
> +/**
> + * struct seccomp_emu_state - container for seccomp emulator state
> + *
> + * @next: The next pending state. This structure is a linked list.
> + * @pc: The current program counter.
> + * @areg: the value of that A register.
> + */
> +struct seccomp_emu_state {
> +	struct seccomp_emu_state *next;
> +	int pc;
> +	u32 areg;
> +};

Why is this split out? (i.e. why is it not just a self-contained loop
the way Jann wrote it?)

> +
> +/**
> + * seccomp_emu_step - step one instruction in the emulator
> + * @env: The emulator environment
> + * @state: The emulator state
> + *
> + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred.

I appreciate the -errno intent, but it actually risks making these
changes break existing userspace filters: if something is unhandled in
the emulator in a way we don't find during design and testing, the
filter load will actually _fail_ instead of just falling back to "run
filter". Failures should be reported (WARN_ON_ONCE()), but my v1
intentionally lets this continue.

> + */
> +static int seccomp_emu_step(struct seccomp_emu_env *env,
> +			    struct seccomp_emu_state *state)
> +{
> +	struct sock_filter *ftest = &env->filter[state->pc++];
> +	u16 code = ftest->code;
> +	u32 k = ftest->k;
> +	bool compare;
> +
> +	switch (code) {
> +	case BPF_LD | BPF_W | BPF_ABS:
> +		if (k == offsetof(struct seccomp_data, nr))
> +			state->areg = env->nr;
> +		else if (k == offsetof(struct seccomp_data, arch))
> +			state->areg = env->arch;
> +		else
> +			return 1;
> +
> +		return 0;
> +	case BPF_JMP | BPF_JA:
> +		state->pc += k;
> +		return 0;
> +	case BPF_JMP | BPF_JEQ | BPF_K:
> +	case BPF_JMP | BPF_JGE | BPF_K:
> +	case BPF_JMP | BPF_JGT | BPF_K:
> +	case BPF_JMP | BPF_JSET | BPF_K:
> +		switch (BPF_OP(code)) {
> +		case BPF_JEQ:
> +			compare = state->areg == k;
> +			break;
> +		case BPF_JGT:
> +			compare = state->areg > k;
> +			break;
> +		case BPF_JGE:
> +			compare = state->areg >= k;
> +			break;
> +		case BPF_JSET:
> +			compare = state->areg & k;
> +			break;
> +		default:
> +			WARN_ON(true);
> +			return -EINVAL;
> +		}
> +
> +		state->pc += compare ? ftest->jt : ftest->jf;
> +		return 0;
> +	case BPF_ALU | BPF_AND | BPF_K:
> +		state->areg &= k;
> +		return 0;
> +	case BPF_RET | BPF_K:
> +		env->syscall_ok = k == SECCOMP_RET_ALLOW;
> +		return 1;
> +	default:
> +		return 1;
> +	}
> +}

This version appears to have removed all the comments; I liked Jann's
comments and I had rearranged things a bit to make it more readable
(IMO) for people that do not immediately understand BPF. :)
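To make the "self-contained loop" shape being discussed concrete, here is a rough userspace sketch of the same emulation idea. The opcode macros, struct layout, and function name are simplified stand-ins rather than the kernel's actual definitions, and only the handful of opcodes needed for the illustration are handled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's cBPF opcodes and constants. */
#define OP_LD_W_ABS 0x20	/* BPF_LD | BPF_W | BPF_ABS */
#define OP_JEQ_K    0x15	/* BPF_JMP | BPF_JEQ | BPF_K */
#define OP_RET_K    0x06	/* BPF_RET | BPF_K */
#define RET_ALLOW   0x7fff0000u	/* SECCOMP_RET_ALLOW */

struct insn { uint16_t code; uint8_t jt, jf; uint32_t k; };

/*
 * Run the filter for a fixed syscall number. Anything the emulator
 * does not understand (argument loads, unknown opcodes) means
 * "cannot cache": report false and let the real filter run at
 * syscall time, which is the fail-safe direction.
 */
static bool emu_allows(const struct insn *f, size_t len, uint32_t nr)
{
	uint32_t a = 0;		/* the cBPF A register */
	size_t pc;

	for (pc = 0; pc < len; pc++) {
		const struct insn *i = &f[pc];

		switch (i->code) {
		case OP_LD_W_ABS:
			if (i->k != 0)	/* only offsetof(..., nr) == 0 */
				return false;
			a = nr;
			break;
		case OP_JEQ_K:
			pc += (a == i->k) ? i->jt : i->jf;
			break;
		case OP_RET_K:
			return i->k == RET_ALLOW;
		case OP_RET_K + 1:	/* placeholder: extend per opcode */
		default:
			return false;	/* unknown insn: do not cache */
		}
	}
	return false;		/* fell off the end: do not cache */
}
```

Note the loop carries all state (pc, A register) in locals, with no linked list of pending states; that is the self-contained style referred to above.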

> +
> +/**
> + * seccomp_cache_prepare - emulate the filter to find cachable syscalls
> + * @sfilter: The seccomp filter
> + *
> + * Returns 0 if successful or -errno if error occurred.
> + */
> +int seccomp_cache_prepare(struct seccomp_filter *sfilter)
> +{
> +	struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
> +	struct sock_filter *filter = fprog->filter;
> +	int arch, nr, res = 0;
> +
> +	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
> +		for (nr = 0; nr < NR_syscalls; nr++) {
> +			struct seccomp_emu_env env = {0};
> +			struct seccomp_emu_state state = {0};
> +
> +			env.filter = filter;
> +			env.arch = syscall_arches[arch];
> +			env.nr = nr;
> +
> +			while (true) {
> +				res = seccomp_emu_step(&env, &state);
> +				if (res)
> +					break;
> +			}
> +
> +			if (res < 0)
> +				goto out;
> +
> +			if (env.syscall_ok)
> +				set_bit(nr, sfilter->cache.syscall_ok[arch]);

I don't really like the complexity here, passing around syscall_ok, etc.
I feel like seccomp_emu_step() should be self-contained to say "allow or
filter" directly.

I also prefer an inversion to the logic: if we start bitmaps as "default
allow", we only ever increase the filtering cases: we can never
accidentally ADD an allow to the bitmap. (This was an intentional design
in the RFC and v1 to do as much as possible to fail safe.)

> +		}
> +	}
> +
> +out:
> +	return res;
> +}
> +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
> +
>  /**
>   * seccomp_prepare_filter: Prepares a seccomp filter for use.
>   * @fprog: BPF program to install
> @@ -540,7 +700,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>  {
>  	struct seccomp_filter *sfilter;
>  	int ret;
> -	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
> +	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
> +			       IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY);
>  
>  	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
>  		return ERR_PTR(-EINVAL);
> @@ -571,6 +732,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>  		return ERR_PTR(ret);
>  	}
>  
> +	ret = seccomp_cache_prepare(sfilter);
> +	if (ret < 0) {
> +		bpf_prog_destroy(sfilter->prog);
> +		kfree(sfilter);
> +		return ERR_PTR(ret);
> +	}

Why do the prepare here instead of during attach? (And note that it
should not be written to fail.)

> +
>  	refcount_set(&sfilter->refs, 1);
>  	refcount_set(&sfilter->users, 1);
>  	init_waitqueue_head(&sfilter->wqh);
> @@ -606,6 +774,29 @@ seccomp_prepare_user_filter(const char __user *user_filter)
>  	return filter;
>  }
>  
> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +/**
> + * seccomp_cache_inherit - mask accept bitmap against previous filter
> + * @sfilter: The seccomp filter
> + * @prev: The previous seccomp filter
> + */
> +static void seccomp_cache_inherit(struct seccomp_filter *sfilter,
> +				  const struct seccomp_filter *prev)
> +{
> +	int arch;
> +
> +	if (!prev)
> +		return;
> +
> +	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
> +		bitmap_and(sfilter->cache.syscall_ok[arch],
> +			   sfilter->cache.syscall_ok[arch],
> +			   prev->cache.syscall_ok[arch],
> +			   NR_syscalls);
> +	}

And, as per being as defensive as I can imagine, this should be a
one-way mask: we can only remove bits from syscall_ok, never add them.
sfilter must be constructed so that it can only ever have fewer or the
same bits set as prev.
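The one-way property falls out of AND-ing bitmap words: the result can never have a bit the parent lacked. A minimal userspace sketch (the word count and helper name are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_WORDS 7	/* e.g. enough 64-bit words for ~440 syscalls */

/*
 * AND the child's allow-bitmap against the parent's. Bits can only
 * be cleared, never set, so the child's allow set is always a
 * subset of every ancestor's -- the fail-safe direction.
 */
static void cache_inherit(uint64_t child[], const uint64_t parent[])
{
	int i;

	for (i = 0; i < CACHE_WORDS; i++)
		child[i] &= parent[i];
}
```

A child that starts "allow everything" collapses to exactly the parent's set; a child that starts with fewer bits only loses more.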

> +}
> +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
> +
>  /**
>   * seccomp_attach_filter: validate and attach filter
>   * @flags:  flags to change filter behavior
> @@ -655,6 +846,7 @@ static long seccomp_attach_filter(unsigned int flags,
>  	 * task reference.
>  	 */
>  	filter->prev = current->seccomp.filter;
> +	seccomp_cache_inherit(filter, filter->prev);

In the RFC I did this inherit earlier (in the emulation stage) to
benefit from the RET_KILL results, but that's not very useful any more.
However, I think it's still code-locality better to keep the bit
manipulation logic as close together as possible for readability.

>  	current->seccomp.filter = filter;
>  	atomic_inc(&current->seccomp.filter_count);
>  
> -- 
> 2.28.0
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path
  2020-09-24 12:44     ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
@ 2020-09-24 23:46       ` Kees Cook
  2020-09-25  1:55         ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-24 23:46 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 07:44:19AM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <yifeifz2@illinois.edu>
> 
> The fast (common) path for seccomp should be that the filter permits
> the syscall to pass through, and failing seccomp is expected to be
> an exceptional case; it is not expected for userspace to call a
> denylisted syscall over and over.
> 
> This first finds the current allow bitmask by iterating through
> syscall_arches[] array and comparing it to the one in struct
> seccomp_data; this loop is expected to be unrolled. It then
> does a test_bit against the bitmask. If the bit is set, then
> there is no need to run the full filter; it returns
> SECCOMP_RET_ALLOW immediately.
> 
> Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> ---
>  kernel/seccomp.c | 37 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 20d33378a092..ac0266b6d18a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -167,6 +167,12 @@ static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter,
>  					 const struct seccomp_filter *prev)
>  {
>  }
> +
> +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter,
> +				       const struct seccomp_data *sd)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
>  
>  /**
> @@ -321,6 +327,34 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +/**
> + * seccomp_cache_check - lookup seccomp cache
> + * @sfilter: The seccomp filter
> + * @sd: The seccomp data to lookup the cache with
> + *
> + * Returns true if the seccomp_data is cached and allowed.
> + */
> +static bool seccomp_cache_check(const struct seccomp_filter *sfilter,
> +				const struct seccomp_data *sd)
> +{
> +	int syscall_nr = sd->nr;
> +	int arch;
> +
> +	if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
> +		return false;

This protects us from x32 (i.e. syscall_nr will have 0x40000000 bit
set), but given the effort needed to support compat, I think supporting
x32 isn't much more. (Though again, I note that NR_syscalls differs in
size, so this test needs to be per-arch and obviously after
arch-discovery.)

That said, if it really does turn out that x32 is literally the only
architecture doing these shenanigans (and I suspect not, given the MIPS
case), okay, fine, I'll give in. :) You and Jann both seem to think this
isn't worth it.

> +
> +	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
> +		if (likely(syscall_arches[arch] == sd->arch))

I think this linear search for the matching arch can be made O(1) (this
is what I was trying to do in v1): we can map all possible combos to a
distinct bitmap, so there is just math and lookup rather than a linear
compare search. In the one-arch case, it can also be easily collapsed
into a no-op (though my v1 didn't do this correctly).
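One way to sketch this arch-to-slot mapping (the macro names and values here are illustrative stand-ins; the real constants live in the UAPI audit headers, and the slot enum is made up):

```c
#include <assert.h>
#include <stdint.h>

#define ARCH_NATIVE 0xc000003eu	/* e.g. AUDIT_ARCH_X86_64 */
#define ARCH_COMPAT 0x40000003u	/* e.g. AUDIT_ARCH_I386 */

enum cache_slot { SLOT_NATIVE, SLOT_COMPAT, SLOT_NONE };

/*
 * Map the reported syscall arch to a bitmap slot. A switch over a
 * couple of known constants compiles to at most two compares; in a
 * single-arch build the compat case disappears and the whole helper
 * can fold away to SLOT_NATIVE.
 */
static enum cache_slot arch_to_slot(uint32_t arch)
{
	switch (arch) {
	case ARCH_NATIVE:
		return SLOT_NATIVE;
	case ARCH_COMPAT:
		return SLOT_COMPAT;
	default:
		return SLOT_NONE;	/* unknown: fall back to the filter */
	}
}
```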

> +			return test_bit(syscall_nr,
> +					sfilter->cache.syscall_ok[arch]);
> +	}
> +
> +	WARN_ON_ONCE(true);
> +	return false;
> +}
> +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
> +
>  /**
>   * seccomp_run_filters - evaluates all seccomp filters against @sd
>   * @sd: optional seccomp data to be passed to filters
> @@ -343,6 +377,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
>  	if (WARN_ON(f == NULL))
>  		return SECCOMP_RET_KILL_PROCESS;
>  
> +	if (seccomp_cache_check(f, sd))
> +		return SECCOMP_RET_ALLOW;
> +
>  	/*
>  	 * All filters in the list are evaluated and the lowest BPF return
>  	 * value always takes priority (ignoring the DATA).
> -- 
> 2.28.0
> 

-- 
Kees Cook


* Re: [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead
  2020-09-24 12:44     ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
@ 2020-09-24 23:47       ` Kees Cook
  2020-09-25  1:35         ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-24 23:47 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 07:44:20AM -0500, YiFei Zhu wrote:
> From: Kees Cook <keescook@chromium.org>
> 
> As part of the seccomp benchmarking, include the expectations with
> regard to the timing behavior of the constant action bitmaps, and report
> inconsistencies better.
> 
> Example output with constant action bitmaps on x86:
> 
> $ sudo ./seccomp_benchmark 100000000
> Current BPF sysctl settings:
> net.core.bpf_jit_enable = 1
> net.core.bpf_jit_harden = 0
> Benchmarking 100000000 syscalls...
> 63.896255358 - 0.008504529 = 63887750829 (63.9s)
> getpid native: 638 ns
> 130.383312423 - 63.897315189 = 66485997234 (66.5s)
> getpid RET_ALLOW 1 filter (bitmap): 664 ns
> 196.789080421 - 130.384414983 = 66404665438 (66.4s)
> getpid RET_ALLOW 2 filters (bitmap): 664 ns
> 268.844643304 - 196.790234168 = 72054409136 (72.1s)
> getpid RET_ALLOW 3 filters (full): 720 ns
> 342.627472515 - 268.845799103 = 73781673412 (73.8s)
> getpid RET_ALLOW 4 filters (full): 737 ns
> Estimated total seccomp overhead for 1 bitmapped filter: 26 ns
> Estimated total seccomp overhead for 2 bitmapped filters: 26 ns
> Estimated total seccomp overhead for 3 full filters: 82 ns
> Estimated total seccomp overhead for 4 full filters: 99 ns
> Estimated seccomp entry overhead: 26 ns
> Estimated seccomp per-filter overhead (last 2 diff): 17 ns
> Estimated seccomp per-filter overhead (filters / 4): 18 ns
> Expectations:
> 	native ≤ 1 bitmap (638 ≤ 664): ✔️
> 	native ≤ 1 filter (638 ≤ 720): ✔️
> 	per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️
> 	1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️
> 	entry ≈ 1 bitmapped (26 ≈ 26): ✔️
> 	entry ≈ 2 bitmapped (26 ≈ 26): ✔️
> 	native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️
> 
> Signed-off-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>

BTW, did this benchmark tool's results match your expectations from what
you saw with your RFC? (I assume it helped since you've included it
here.)

-- 
Kees Cook


* Re: [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-09-24 12:44     ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
@ 2020-09-24 23:56       ` Kees Cook
  2020-09-25  3:11         ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-24 23:56 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 07:44:21AM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <yifeifz2@illinois.edu>
> 
> Currently the kernel does not provide an infrastructure to translate
> architecture numbers to a human-readable name. Translating syscall
> numbers to syscall names is possible through FTRACE_SYSCALL
> infrastructure but it does not provide support for compat syscalls.
> 
> This will create a file for each PID as /proc/pid/seccomp_cache.
> The file will be empty when no seccomp filters are loaded, or be
> in the format of:
> <hex arch number> <decimal syscall number> <ALLOW | FILTER>
> where ALLOW means the cache is guaranteed to allow the syscall,
> and FILTER means the cache will pass the syscall to the BPF filter.
> 
> For the docker default profile on x86_64 it looks like:
> c000003e 0 ALLOW
> c000003e 1 ALLOW
> c000003e 2 ALLOW
> c000003e 3 ALLOW
> [...]
> c000003e 132 ALLOW
> c000003e 133 ALLOW
> c000003e 134 FILTER
> c000003e 135 FILTER
> c000003e 136 FILTER
> c000003e 137 ALLOW
> c000003e 138 ALLOW
> c000003e 139 FILTER
> c000003e 140 ALLOW
> c000003e 141 ALLOW
> [...]
> 
> This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default
> of N because I think certain users of seccomp might not want the
> application to know which syscalls are definitely usable.
> 
> I'm not sure if adding all the "human readable names" is worthwhile,
> considering it can be easily done in userspace.
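For a userspace consumer of the proposed format, a minimal line parser might look like the following sketch (the format string matches the "<hex arch> <decimal syscall number> <ALLOW | FILTER>" layout from the commit message above; the helper name is made up):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one "<hex arch> <decimal nr> <ALLOW|FILTER>" line from the
 * proposed /proc/<pid>/seccomp_cache format. Returns 0 on success,
 * -1 on a malformed line.
 */
static int parse_cache_line(const char *line, unsigned int *arch,
			    int *nr, int *allowed)
{
	char verdict[8];

	if (sscanf(line, "%x %d %7s", arch, nr, verdict) != 3)
		return -1;
	*allowed = strcmp(verdict, "ALLOW") == 0;
	return 0;
}
```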

The question of permissions is my central concern here: who should see
this? Some contained processes have been intentionally blocked from
self-introspection so even the "standard" high bar of "ptrace attach
allowed?" can't always be sufficient.

My compromise about filter visibility in the past was saying that
CAP_SYS_ADMIN was required (see seccomp_get_filter()). I'm nervous to
weaken this. (There is some work that hasn't been sent upstream yet that
is looking to expose the filter _contents_ via /proc that has been
nervous too.)

Now full contents vs "allow"/"filter" are certainly different things,
but I don't feel like I've got enough evidence to show that this
introspection would help debugging enough to justify the partially
imagined safety of not exposing it to potential attackers.

I suspect it _is_ the right thing to do (just look at my own RFC's
"debug" patch), but I'd like this to be well justified in the commit
log.

And yes, while it does hide behind a CONFIG, I'd still want it justified,
especially since distros have a tendency to just turn everything on
anyway. ;)

> +	for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
> +		for (nr = 0; nr < NR_syscalls; nr++) {
> +			bool cached = test_bit(nr, f->cache.syscall_ok[arch]);
> +			char *status = cached ? "ALLOW" : "FILTER";
> +
> +			seq_printf(m, "%08x %d %s\n", syscall_arches[arch],
> +				   nr, status
> +			);
> +		}
> +	}

But behavior-wise, yeah, I like it; I'm fine with human-readable and
full AUDIT_ARCH values. (Though, as devil's advocate again, to repeat
Jann's own words back: do we want to add this only to have a new UAPI to
support going forward?)

-- 
Kees Cook


* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-21  5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu
  2020-09-21 18:08   ` Jann Horn
@ 2020-09-25  0:01   ` Kees Cook
  2020-09-25  0:15     ` Jann Horn
  2020-09-25  1:27     ` YiFei Zhu
  1 sibling, 2 replies; 135+ messages in thread
From: Kees Cook @ 2020-09-25  0:01 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: YiFei Zhu, containers, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

[resend, argh, I didn't reply-all, sorry for the noise]

On Thu, Sep 24, 2020 at 07:44:17AM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <yifeifz2@illinois.edu>
> 
> Seccomp cache emulator needs to know all the architecture numbers
> that syscall_get_arch() could return for the kernel build in order
> to generate a cache for all of them.
> 
> The array is declared in header as static __maybe_unused const
> to maximize compiler optimization opportunities such as loop
> unrolling.

Disregarding the "how" of this, yeah, we'll certainly need something to
tell seccomp about the arrangement of syscall tables and how to find
them.

However, I'd still prefer to do this on a per-arch basis, and include
more detail, as I've got in my v1.

Something missing from both styles, though, is a consolidation of
values, where the AUDIT_ARCH* isn't reused in both the seccomp info and
the syscall_get_arch() return. The problems here were two-fold:

1) putting this in syscall.h meant you do not have full NR_syscall*
   visibility on some architectures (e.g. arm64 plays weird games with
   header include order).

2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros
   haven't removed CONFIG_X86_X32 widely yet, so it is a reality that
   it must be dealt with), which means seccomp's idea of the arch
   "number" can't be the same as the AUDIT_ARCH.

So, likely a combo of approaches is needed: an array (or more likely,
enum), declared in the per-arch seccomp.h file. And I don't see a way
to solve #1 cleanly.

Regardless, it needs to be split per architecture so that regressions
can be bisected/reverted/isolated cleanly. And if we can't actually test
it at runtime (or find someone who can) it's not a good idea to make the
change. :)

> [...]
> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
> index 7cbf733d11af..e13bb2a65b6f 100644
> --- a/arch/x86/include/asm/syscall.h
> +++ b/arch/x86/include/asm/syscall.h
> @@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
>  	memcpy(&regs->bx + i, args, n * sizeof(args[0]));
>  }
>  
> +static __maybe_unused const int syscall_arches[] = {
> +	AUDIT_ARCH_I386
> +};
> +
>  static inline int syscall_get_arch(struct task_struct *task)
>  {
>  	return AUDIT_ARCH_I386;
> @@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task,
>  	}
>  }
>  
> +static __maybe_unused const int syscall_arches[] = {
> +	AUDIT_ARCH_X86_64,
> +#ifdef CONFIG_IA32_EMULATION
> +	AUDIT_ARCH_I386,
> +#endif
> +};

I'm leaving this section quoted because I'll refer to it in a later
patch review...

-- 
Kees Cook


* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-25  0:01   ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook
@ 2020-09-25  0:15     ` Jann Horn
  2020-09-25  0:18       ` Al Viro
  2020-09-25  1:27     ` YiFei Zhu
  1 sibling, 1 reply; 135+ messages in thread
From: Jann Horn @ 2020-09-25  0:15 UTC (permalink / raw)
  To: Kees Cook
  Cc: YiFei Zhu, YiFei Zhu, Linux Containers, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Sep 25, 2020 at 2:01 AM Kees Cook <keescook@chromium.org> wrote:
> 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros
>    haven't removed CONFIG_X86_X32 widely yet, so it is a reality that
>    it must be dealt with), which means seccomp's idea of the arch
>    "number" can't be the same as the AUDIT_ARCH.

Sure, distros ship it; but basically nobody uses it, it doesn't have
to be fast. As long as we don't *break* it, everything's fine. And if
we ignore the existence of X32 in the fastpath, that'll just mean that
syscalls with the X32 marker bit always hit the seccomp slowpath
(because it'll look like the syscall number is out-of-bounds) - no
problem.


* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-25  0:15     ` Jann Horn
@ 2020-09-25  0:18       ` Al Viro
  2020-09-25  0:24         ` Jann Horn
  0 siblings, 1 reply; 135+ messages in thread
From: Al Viro @ 2020-09-25  0:18 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, YiFei Zhu, YiFei Zhu, Linux Containers, bpf,
	kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Sep 25, 2020 at 02:15:50AM +0200, Jann Horn wrote:
> On Fri, Sep 25, 2020 at 2:01 AM Kees Cook <keescook@chromium.org> wrote:
> > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros
> >    haven't removed CONFIG_X86_X32 widely yet, so it is a reality that
> >    it must be dealt with), which means seccomp's idea of the arch
> >    "number" can't be the same as the AUDIT_ARCH.
> 
> Sure, distros ship it; but basically nobody uses it, it doesn't have
> to be fast. As long as we don't *break* it, everything's fine. And if
> we ignore the existence of X32 in the fastpath, that'll just mean that
> syscalls with the X32 marker bit always hit the seccomp slowpath
> (because it'll look like the syscall number is out-of-bounds) - no
> problem.

You do realize that X32 is the amd64 counterpart of MIPS n32, right?  And
that's not "basically nobody uses it"...


* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-25  0:18       ` Al Viro
@ 2020-09-25  0:24         ` Jann Horn
  0 siblings, 0 replies; 135+ messages in thread
From: Jann Horn @ 2020-09-25  0:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Kees Cook, YiFei Zhu, YiFei Zhu, Linux Containers, bpf,
	kernel list, Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Sep 25, 2020 at 2:18 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Sep 25, 2020 at 02:15:50AM +0200, Jann Horn wrote:
> > On Fri, Sep 25, 2020 at 2:01 AM Kees Cook <keescook@chromium.org> wrote:
> > > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros
> > >    haven't removed CONFIG_X86_X32 widely yet, so it is a reality that
> > >    it must be dealt with), which means seccomp's idea of the arch
> > >    "number" can't be the same as the AUDIT_ARCH.
> >
> > Sure, distros ship it; but basically nobody uses it, it doesn't have
> > to be fast. As long as we don't *break* it, everything's fine. And if
> > we ignore the existence of X32 in the fastpath, that'll just mean that
> > syscalls with the X32 marker bit always hit the seccomp slowpath
> > (because it'll look like the syscall number is out-of-bounds) - no
> > problem.
>
> You do realize that X32 is amd64 counterpart of mips n32, right?  And that's
> not "basically nobody uses it"...

What makes X32 weird for seccomp is that it has the syscall tables for
X86-64 and X32 mushed together, using the single architecture
identifier AUDIT_ARCH_X86_64. I believe that's what Kees referred to
by "multiplexed tables".

As far as I can tell, MIPS is more well-behaved there and uses the
separate architecture identifiers
AUDIT_ARCH_MIPS|__AUDIT_ARCH_64BIT
and
AUDIT_ARCH_MIPS|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_CONVENTION_MIPS64_N32.

(But no, I did not actually realize that that's what N32 is. Thanks
for the explanation, I was wondering why MIPS was the only
architecture with three architecture identifiers...)


* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-25  0:01   ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook
  2020-09-25  0:15     ` Jann Horn
@ 2020-09-25  1:27     ` YiFei Zhu
  2020-09-25  3:09       ` Kees Cook
  1 sibling, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-25  1:27 UTC (permalink / raw)
  To: Kees Cook
  Cc: YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

[resending this too]

On Thu, Sep 24, 2020 at 6:01 PM Kees Cook <keescook@chromium.org> wrote:
> Disregarding the "how" of this, yeah, we'll certainly need something to
> tell seccomp about the arrangement of syscall tables and how to find
> them.
>
> However, I'd still prefer to do this on a per-arch basis, and include
> more detail, as I've got in my v1.
>
> Something missing from both styles, though, is a consolidation of
> values, where the AUDIT_ARCH* isn't reused in both the seccomp info and
> the syscall_get_arch() return. The problems here were two-fold:
>
> 1) putting this in syscall.h meant you do not have full NR_syscall*
>    visibility on some architectures (e.g. arm64 plays weird games with
>    header include order).

I don't get this one -- I'm not playing with NR_syscall here.

> 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros
>    haven't removed CONFIG_X86_X32 widely yet, so it is a reality that
>    it must be dealt with), which means seccomp's idea of the arch
>    "number" can't be the same as the AUDIT_ARCH.

Why so? Does anyone actually use x32 in a container? The memory cost
and analysis cost is on everyone. The worst case scenario if we don't
support it is that the syscall is not accelerated.

> So, likely a combo of approaches is needed: an array (or more likely,
> enum), declared in the per-arch seccomp.h file. And I don't see a way
> to solve #1 cleanly.
>
> Regardless, it needs to be split per architecture so that regressions
> can be bisected/reverted/isolated cleanly. And if we can't actually test
> it at runtime (or find someone who can) it's not a good idea to make the
> change. :)

You have a good point regarding tests. Don't see how it affects
regressions though. Only one file here is ever included per-build.

> > [...]
> > diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
> > index 7cbf733d11af..e13bb2a65b6f 100644
> > --- a/arch/x86/include/asm/syscall.h
> > +++ b/arch/x86/include/asm/syscall.h
> > @@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
> >       memcpy(&regs->bx + i, args, n * sizeof(args[0]));
> >  }
> >
> > +static __maybe_unused const int syscall_arches[] = {
> > +     AUDIT_ARCH_I386
> > +};
> > +
> >  static inline int syscall_get_arch(struct task_struct *task)
> >  {
> >       return AUDIT_ARCH_I386;
> > @@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task,
> >       }
> >  }
> >
> > +static __maybe_unused const int syscall_arches[] = {
> > +     AUDIT_ARCH_X86_64,
> > +#ifdef CONFIG_IA32_EMULATION
> > +     AUDIT_ARCH_I386,
> > +#endif
> > +};
>
> I'm leaving this section quoted because I'll refer to it in a later
> patch review...
>
> --
> Kees Cook


* Re: [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead
  2020-09-24 23:47       ` Kees Cook
@ 2020-09-25  1:35         ` YiFei Zhu
  0 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-25  1:35 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 6:47 PM Kees Cook <keescook@chromium.org> wrote:
> BTW, did this benchmark tool's results match your expectations from what
> you saw with your RFC? (I assume it helped since you've included in
> here.)

Yes, I updated the commit message with the benchmarks of this patch
series. Though, given that I'm running in a qemu-kvm on my laptop that
has a lot of stuffs running on it (and with the cursed ThinkPad T480
CPU throttling), I had to throw much more syscalls at it to pass the
"approximately equals" expectation... though no idea about what's
going on with 732 vs 737.

Or if you mean if I expected these results, yes.

YiFei Zhu


* Re: [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path
  2020-09-24 23:46       ` Kees Cook
@ 2020-09-25  1:55         ` YiFei Zhu
  0 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-25  1:55 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 6:46 PM Kees Cook <keescook@chromium.org> wrote:
> This protects us from x32 (i.e. syscall_nr will have 0x40000000 bit
> set), but given the effort needed to support compat, I think supporting
> x32 isn't much more. (Though again, I note that NR_syscalls differs in
> size, so this test needs to be per-arch and obviously after
> arch-discovery.)
>
> That said, if it really does turn out that x32 is literally the only
> architecture doing these shenanigans (and I suspect not, given the MIPS
> case), okay, fine, I'll give in. :) You and Jann both seem to think this
> isn't worth it.

MIPS has the sparse syscall shenanigans... I don't even know how that
works. Maybe someone can clarify?
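For reference, the x32 marking you describe, as a userspace sketch: the constant matches the kernel's __X32_SYSCALL_BIT, but the helper names here are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* x32 tags its syscall numbers by setting bit 30; the kernel calls
 * this __X32_SYSCALL_BIT (0x40000000). The helpers below are made-up
 * names for illustration, not kernel API. */
#define X32_SYSCALL_BIT 0x40000000u

static int is_x32_nr(uint32_t nr)
{
	return (nr & X32_SYSCALL_BIT) != 0;
}

static uint32_t x32_table_index(uint32_t nr)
{
	return nr & ~X32_SYSCALL_BIT;	/* strip the tag to index a table */
}
```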

> I think this linear search for the matching arch can be made O(1) (this
> is what I was trying to do in v1: we can map all possible combos to a
> distinct bitmap, so there is just math and lookup rather than a linear
> compare search. In the one-arch case, it can also be easily collapsed
> into a no-op (though my v1 didn't do this correctly).

I remember yours was:

static inline u8 seccomp_get_arch(u32 syscall_arch, u32 syscall_nr)
{
[...]
        switch (syscall_arch) {
        case SECCOMP_ARCH:
                seccomp_arch = SECCOMP_ARCH_IS_NATIVE;
                break;
#ifdef CONFIG_COMPAT
        case SECCOMP_ARCH_COMPAT:
                seccomp_arch = SECCOMP_ARCH_IS_COMPAT;
                break;
#endif
        default:
                seccomp_arch = SECCOMP_ARCH_IS_UNKNOWN;
        }

What I'm relying on here is that the compiler will unroll the loop.
How do compilers implement switch statements? I was imagining it
would be similar, with each "case" corresponding to a compare against
the immediate, the assignment to a move into a register, and the
"break" to a jump. This would also be O(n) in the number of arches.
Yes, compilers can also do an O(1) table lookup, but that is
nonsensical here -- the arch numbers occupy the MSBs.

That said, does O(1) vs O(n) even matter here? Given that n is at most
3, you might as well consider it a constant.
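Roughly what I have in mind, as a userspace sketch (the arch constants are made up stand-ins, not the real AUDIT_ARCH_* values):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in arch identifiers for illustration only. */
#define ARCH_NATIVE 0xc000003eu
#define ARCH_COMPAT 0x40000003u

static const uint32_t syscall_arches[] = { ARCH_NATIVE, ARCH_COMPAT };

/* Linear scan over the (tiny) arch array; with n <= 3 entries the
 * compiler can unroll this into much the same compare-and-branch
 * sequence a switch statement would produce. Returns the index of
 * the matching arch's bitmap, or -1 if unknown. */
static int arch_index(uint32_t syscall_arch)
{
	unsigned int i;

	for (i = 0; i < sizeof(syscall_arches) / sizeof(syscall_arches[0]); i++)
		if (syscall_arches[i] == syscall_arch)
			return (int)i;
	return -1;
}
```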

Also, is collapsing the one-arch case actually worth it? Given that
there's a likely(), and the other side is a WARN_ON_ONCE(), the
compiler will lay out the likely path as the fast path and branch
prediction will be in our favor, right?

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-24 23:25       ` Kees Cook
@ 2020-09-25  3:04         ` YiFei Zhu
  2020-09-25 16:45           ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-25  3:04 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

[resending this, forgot to hit reply all...]

On Thu, Sep 24, 2020 at 6:25 PM Kees Cook <keescook@chromium.org> wrote:
> I'm not interested in seccomp having a config option for this. It should
> either exist entirely or not at all, and that depends on the
> per-architecture support.
> You mentioned in another thread that you wanted it to let people play
> with this support in some way. Can you elaborate on this? My perspective
> is that of distro and vendor kernels: there is _one_ config and end
> users can't really do anything about it without rolling their own
> kernels.

That's one. The other is to allow future optional extensions, like
syscall-argument-capable accelerators.

Distro / vendor kernels will keep the defaults anyway, no?

> So, as Jann pointed out, using NR_syscalls only accidentally works --
> they're actually different sizes and there isn't strictly any reason to
> expect one to be smaller than another. So, we need to either choose the
> max() in asm/linux/seccomp.h or be more efficient with space usage and
> use explicitly named bitmaps (how my v1 does things).

Right.

> This isn't used in this patch; likely leftover/in need of moving?

Correct. Will remove.

> I moved this up in the structure to see if I could benefit from cache
> line sharing. In either case, we must verify (with "pahole") that we do
> not induce massive padding in the struct.
>
> But yes, attaching this to the filter is the right way to go.

Right. I don't think it would cause massive padding, based on what I
know about padding from [1].

I'm used to using gdb to look at structure layout, and this is what I see:
(gdb) ptype /o struct seccomp_filter
/* offset    |  size */  type = struct seccomp_filter {
/*    0      |     4 */    refcount_t refs;
/*    4      |     4 */    refcount_t users;
/*    8      |     1 */    bool log;
/* XXX  7-byte hole  */
/*   16      |     8 */    struct seccomp_filter *prev;
[...]
/*  264      |   112 */    struct seccomp_cache_filter_data {
/*  264      |   112 */        unsigned long syscall_ok[2][7];

                               /* total size (bytes):  112 */
                           } cache;

                           /* total size (bytes):  376 */
                         }

The bitmaps are long-aligned; so is the prev pointer. If we want, we
can put the cache struct right before prev, and that should not
introduce any new holes. It's the refcounts and the bool that aren't
cooperative.
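A sketch with simplified stand-in types (not the real kernel structs) to show why the reordering is size-neutral:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types under discussion. */
typedef struct { int counter; } refcount_t;

struct cache_data {
	unsigned long syscall_ok[2][7];
};

/* Cache at the tail, after prev (roughly the layout in the gdb dump:
 * the bool is followed by a hole up to the pointer's alignment). */
struct filter_tail {
	refcount_t refs;
	refcount_t users;
	bool log;
	struct filter_tail *prev;
	struct cache_data cache;
};

/* Cache moved right before prev: the bitmap and the pointer are both
 * long-aligned, so swapping them introduces no new holes; the hole
 * after the bool stays either way. */
struct filter_mid {
	refcount_t refs;
	refcount_t users;
	bool log;
	struct cache_data cache;
	struct filter_mid *prev;
};
```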

> nit: "ok" is too vague. We mean either "constant action" or "allow" (or
> "filter" in the negative case).

Right.

> Why is this split out? (i.e. why is it not just a self-contained loop
> the way Jann wrote it?)

Because my brain thinks like a finite state machine and this function
is a state transition. ;) Though yeah I agree a loop is probably more
readable.

> I appreciate the -errno intent, but it actually risks making these
> changes break existing userspace filters: if something is unhandled in
> the emulator in a way we don't find during design and testing, the
> filter load will actually _fail_ instead of just falling back to "run
> filter". Failures should be reported (WARN_ON_ONCE()), but my v1
> intentionally lets this continue.

Right.

> This version appears to have removed all the comments; I liked Jann's
> comments and I had rearranged things a bit to make it more readable
> (IMO) for people who do not immediately understand BPF. :)

Right.

> > +/**
> > + * seccomp_cache_prepare - emulate the filter to find cachable syscalls
> > + * @sfilter: The seccomp filter
> > + *
> > + * Returns 0 if successful or -errno if error occurred.
> > + */
> > +int seccomp_cache_prepare(struct seccomp_filter *sfilter)
> > +{
> > +     struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
> > +     struct sock_filter *filter = fprog->filter;
> > +     int arch, nr, res = 0;
> > +
> > +     for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
> > +             for (nr = 0; nr < NR_syscalls; nr++) {
> > +                     struct seccomp_emu_env env = {0};

Btw, do you know what the initial state of the A register is at the
start of BPF execution? In my RFC I assumed it's unknown, but then in
v1, after the "reg_known" removal, the register is assumed to be 0. I
don't know if it is correct to assume so.

> I don't really like the complexity here, passing around syscall_ok, etc.
> I feel like seccomp_emu_step() should be self-contained to say "allow or
> filter" directly.

Ok.

> I also prefer an inversion to the logic: if we start bitmaps as "default
> allow", we only ever increase the filtering cases: we can never
> accidentally ADD an allow to the bitmap. (This was an intentional design
> in the RFC and v1 to do as much as possible to fail safe.)

Wait, why? If it's default allow, what happens if you hit an error? You
could accidentally fail to remove an allow from the bitmap, and that is
much more of an issue than accidentally failing to add one. I don't
understand your reasoning about "accidentally ADD an allow": an action
should only occur when everything is right, whereas an action might not
occur if some random shenanigans happen. Hence, the non-action /
default side should be the fail-safe side, rather than the action
side.
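A toy illustration of what I mean (the bail-out condition and all names are made up):

```c
#include <assert.h>
#include <string.h>

#define NR 16	/* toy syscall-table size */

/* Start from all-zero ("must run filter") and only set allow bits for
 * syscalls the walk fully evaluates. A surprise bail-out mid-walk
 * then leaves the remaining syscalls on the slow path, which is
 * merely slow, never wrong. With a default-allow start, the same
 * bail-out would leave spurious allow bits behind. */
static void prepare_default_deny(unsigned char *allow)
{
	int nr;

	memset(allow, 0, NR);
	for (nr = 0; nr < NR; nr++) {
		if (nr == 7)
			return;	/* made-up "unexpected instruction" bail-out */
		allow[nr] = 1;
	}
}
```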

> Why do the prepare here instead of during attach? (And note that it
> should not be written to fail.)

Right.

> And, as per being as defensive as I can imagine, this should be a
> one-way mask: we can only remove bits from syscall_ok, never add them.
> sfilter must be constructed so that it can only ever have fewer or the
> same bits set as prev.

Right.

> In the RFC I did this inherit earlier (in the emulation stage) to
> benefit from the RET_KILL results, but that's not very useful any more.
> However, I think it's still code-locality better to keep the bit
> manipulation logic as close together as possible for readability.

Right.

[1] http://www.catb.org/esr/structure-packing/#_structure_alignment_and_padding

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-25  1:27     ` YiFei Zhu
@ 2020-09-25  3:09       ` Kees Cook
  2020-09-25  3:28         ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-25  3:09 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 08:27:40PM -0500, YiFei Zhu wrote:
> [resending this too]
> 
> On Thu, Sep 24, 2020 at 6:01 PM Kees Cook <keescook@chromium.org> wrote:
> > Disregarding the "how" of this, yeah, we'll certainly need something to
> > tell seccomp about the arrangement of syscall tables and how to find
> > them.
> >
> > However, I'd still prefer to do this on a per-arch basis, and include
> > more detail, as I've got in my v1.
> >
> > Something missing from both styles, though, is a consolidation of
> > values, where the AUDIT_ARCH* isn't reused in both the seccomp info and
> > the syscall_get_arch() return. The problems here were two-fold:
> >
> > 1) putting this in syscall.h meant you do not have full NR_syscall*
> >    visibility on some architectures (e.g. arm64 plays weird games with
> >    header include order).
> 
> I don't get this one -- I'm not playing with NR_syscall here.

Right, sorry, I may not have been clear. When building my RFC I noticed
that I couldn't use NR_syscall very "early" in the header file include
stack on arm64, which complicated things. So I guess what I mean is
something like "it's probably better to do all these seccomp-specific
macros/etc in asm/include/seccomp.h rather than in syscall.h because I
know at least one architecture that might cause trouble."

> > 2) seccomp needs to handle "multiplexed" tables like x86_x32 (distros
> >    haven't removed CONFIG_X86_X32 widely yet, so it is a reality that
> >    it must be dealt with), which means seccomp's idea of the arch
> >    "number" can't be the same as the AUDIT_ARCH.
> 
> Why so? Does anyone actually use x32 in a container? The memory cost
> and analysis cost is on everyone. The worst case scenario if we don't
> support it is that the syscall is not accelerated.

Ironically, that's the only place where I actually know for sure people
are using x32, because it shows measurable (10%) speed-up for builders:
https://lore.kernel.org/lkml/CAOesGMgu1i3p7XMZuCEtj63T-ST_jh+BfaHy-K6LhgqNriKHAA@mail.gmail.com

So, yes, as you and Jann both point out, it wouldn't be terrible to just
ignore x32, but it seems a shame to penalize it. That said, if the masking
step from my v1 is actually noticeable on a native workload, then yeah,
probably x32 should be ignored. My instinct (not measured) is that it's
faster than walking a small array.[citation needed]

> > So, likely a combo of approaches is needed: an array (or more likely,
> > enum), declared in the per-arch seccomp.h file. And I don't see a way
> > to solve #1 cleanly.
> >
> > Regardless, it needs to be split per architecture so that regressions
> > can be bisected/reverted/isolated cleanly. And if we can't actually test
> > it at runtime (or find someone who can) it's not a good idea to make the
> > change. :)
> 
> You have a good point regarding tests. Don't see how it affects
> regressions though. Only one file here is ever included per-build.

It's easier to do a per-arch revert (i.e. all the -stable tree
machinery, etc) with a single SHA instead of having to write a partial
revert, etc.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-09-24 23:56       ` Kees Cook
@ 2020-09-25  3:11         ` YiFei Zhu
  2020-09-25  3:26           ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-25  3:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 6:56 PM Kees Cook <keescook@chromium.org> wrote:
> > This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default
> The question of permissions is my central concern here: who should see
> this? Some contained processes have been intentionally blocked from
> self-introspection so even the "standard" high bar of "ptrace attach
> allowed?" can't always be sufficient.
>
> My compromise about filter visibility in the past was saying that
> CAP_SYS_ADMIN was required (see seccomp_get_filter()). I'm nervous to
> weaken this. (There is some work that hasn't been sent upstream yet that
> is looking to expose the filter _contents_ via /proc that has been
> nervous too.)
>
> Now full contents vs "allow"/"filter" are certainly different things,
> but I don't feel like I've got enough evidence to show that this
> introspection would help debugging enough to justify the partially
> imagined safety of not exposing it to potential attackers.

Agreed. I'm inclined to make it CONFIG_DEBUG_SECCOMP_CACHE and guarded
by a CAP just to make it "debug only".

> I suspect it _is_ the right thing to do (just look at my own RFC's
> "debug" patch), but I'd like this to be well justified in the commit
> log.
>
> And yes, while it does hide behind a CONFIG, I'd still want it justified,
> especially since distros have a tendency to just turn everything on
> anyway. ;)

Is there something to stop a config from being enabled in an
allyesconfig? I remember seeing something like that. Otherwise, if
someone is manually selecting it, we can add help text with a big
banner...

> But behavior-wise, yeah, I like it; I'm fine with human-readable and
> full AUDIT_ARCH values. (Though, as devil's advocate again, to repeat
> Jann's own words back: do we want to add this only to have a new UAPI to
> support going forward?)

Is this something we want to keep stable?

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-09-25  3:11         ` YiFei Zhu
@ 2020-09-25  3:26           ` Kees Cook
  0 siblings, 0 replies; 135+ messages in thread
From: Kees Cook @ 2020-09-25  3:26 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 10:11:17PM -0500, YiFei Zhu wrote:
> On Thu, Sep 24, 2020 at 6:56 PM Kees Cook <keescook@chromium.org> wrote:
> > > This file is guarded by CONFIG_PROC_SECCOMP_CACHE with a default
> > The question of permissions is my central concern here: who should see
> > this? Some contained processes have been intentionally blocked from
> > self-introspection so even the "standard" high bar of "ptrace attach
> > allowed?" can't always be sufficient.
> >
> > My compromise about filter visibility in the past was saying that
> > CAP_SYS_ADMIN was required (see seccomp_get_filter()). I'm nervous to
> > weaken this. (There is some work that hasn't been sent upstream yet that
> > is looking to expose the filter _contents_ via /proc that has been
> > nervous too.)
> >
> > Now full contents vs "allow"/"filter" are certainly different things,
> > but I don't feel like I've got enough evidence to show that this
> > introspection would help debugging enough to justify the partially
> > imagined safety of not exposing it to potential attackers.
> 
> Agreed. I'm inclined to make it CONFIG_DEBUG_SECCOMP_CACHE and guarded
> by a CAP just to make it "debug only".

Yeah; I just can't quite see what the best direction is here. I will
ponder this more. As I mentioned, it does seem handy. :)

> Is there something to stop a config from being enabled in an
> allyesconfig? I remember seeing something like that. Else if someone
> is manually selecting we can add a help text with a big banner...

Yeah, allyesconfig and allmodconfig both effectively set
CONFIG_COMPILE_TEST. Anyway, likely a caps test will end up being the
way to do it.

> 
> > But behavior-wise, yeah, I like it; I'm fine with human-readable and
> > full AUDIT_ARCH values. (Though, as devil's advocate again, to repeat
> > Jann's own words back: do we want to add this only to have a new UAPI to
> > support going forward?)
> 
> Is this something we want to keep stable?

The Prime Directive of "never break userspace" is really "never break
userspace in a way that someone notices". So if nothing ever parses that
file, then we don't have to keep it stable, but if something does, and
we change it, we have to fix it.

So, a capability test means very few things will touch it, and if we
decide it's not a big deal, we can relax permissions in the future.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-25  3:09       ` Kees Cook
@ 2020-09-25  3:28         ` YiFei Zhu
  2020-09-25 16:39           ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-25  3:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 10:09 PM Kees Cook <keescook@chromium.org> wrote:
> Right, sorry, I may not have been clear. When building my RFC I noticed
> that I couldn't use NR_syscall very "early" in the header file include
> stack on arm64, which complicated things. So I guess what I mean is
> something like "it's probably better to do all these seccomp-specific
> macros/etc in asm/include/seccomp.h rather than in syscall.h because I
> know at least one architecture that might cause trouble."

Ah. Makes sense.

> Ironically, that's the only place where I actually know for sure people
> are using x32, because it shows measurable (10%) speed-up for builders:
> https://lore.kernel.org/lkml/CAOesGMgu1i3p7XMZuCEtj63T-ST_jh+BfaHy-K6LhgqNriKHAA@mail.gmail.com

Wow. 10% is significant. Makes you wonder why x32 hasn't conquered the world.

> So, yes, as you and Jann both point out, it wouldn't be terrible to just
> ignore x32, but it seems a shame to penalize it. That said, if the masking
> step from my v1 is actually noticeable on a native workload, then yeah,
> probably x32 should be ignored. My instinct (not measured) is that it's
> faster than walking a small array.[citation needed]

My instinct: should be pretty similar, with the loop unrolled.

You've convinced me that penalizing x32 would be a pity :( The
10% is so nice I want it.

> It's easier to do a per-arch revert (i.e. all the -stable tree
> machinery, etc) with a single SHA instead of having to write a partial
> revert, etc.

I see. Thanks for clarifying.

How about this? Rather than specifically designing names for the
bitmasks (native, compat, multiplex), just have SECCOMP_ARCH_{1,2,3}.
Each arch number would provide the size of its bitmap and a static
inline function to check whether the given seccomp_data belongs to
that arch and, if so, the order of the bit in the bitmap. There would
be no need for the shifts and madness in seccomp.c; it's
arch-dependent code in their own seccomp.h. We let the preprocessor
and compiler optimize things.

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array
  2020-09-25  3:28         ` YiFei Zhu
@ 2020-09-25 16:39           ` YiFei Zhu
  0 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-25 16:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: YiFei Zhu, Linux Containers, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 10:28 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> Ah. Makes sense.
>
> > Ironically, that's the only place where I actually know for sure people
> > are using x32, because it shows measurable (10%) speed-up for builders:
> > https://lore.kernel.org/lkml/CAOesGMgu1i3p7XMZuCEtj63T-ST_jh+BfaHy-K6LhgqNriKHAA@mail.gmail.com
>
> Wow. 10% is significant. Makes you wonder why x32 hasn't conquered the world.
>
> > So, yes, as you and Jann both point out, it wouldn't be terrible to just
> > ignore x32, but it seems a shame to penalize it. That said, if the masking
> > step from my v1 is actually noticeable on a native workload, then yeah,
> > probably x32 should be ignored. My instinct (not measured) is that it's
> > faster than walking a small array.[citation needed]
>
> You've convinced me that penalizing x32 would be a pity :( The
> 10% is so nice I want it.

I'm rethinking this -- the majority of our users will not use x32. I
don't think it's worth making the majority run all the emulation and
pay the memory footprint when only a small minority will use it.

I also just checked Debian, and it has boot-time disabling of the x32
arch downstream [1]:
CONFIG_X86_X32=y
CONFIG_X86_X32_DISABLED=y

Which means we will still generate all the code for x32 in seccomp
even though people probably won't be using it...

I also talked to some of my peers, and they had a point about how
x32's limiting the address space to 4GiB is very harsh on many modern
language runtimes. So even though it provides a 10% speed boost, its
adoption is hard -- one has to compile all the C libraries for x32 in
addition to x86_64, since any program needing > 4GiB of address space
still needs the x86_64 versions of the libraries.

[1] https://wiki.debian.org/X32Port

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-25  3:04         ` YiFei Zhu
@ 2020-09-25 16:45           ` YiFei Zhu
  2020-09-25 19:42             ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-25 16:45 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > Why do the prepare here instead of during attach? (And note that it
> > should not be written to fail.)
>
> Right.

During attach a spinlock (current->sighand->siglock) is held. Do we
really want to put the emulator in the "atomic section"?

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-25 16:45           ` YiFei Zhu
@ 2020-09-25 19:42             ` Kees Cook
  2020-09-25 19:51               ` Andy Lutomirski
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-25 19:42 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote:
> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > Why do the prepare here instead of during attach? (And note that it
> > > should not be written to fail.)
> >
> > Right.
> 
> During attach a spinlock (current->sighand->siglock) is held. Do we
> really want to put the emulator in the "atomic section"?

It's a good point, but I had some other ideas around it that led me to
a different conclusion. Here's what I've got in my head:

I don't view filter attach (nor the siglock) as fastpath: the lock is
rarely contested and the "long time" will only be during filter attach.

When performing filter emulation, all the syscalls that are already
marked as "must run filter" on the previous filter can be skipped for
the new filter, since it cannot change the outcome, which makes the
emulation step faster.

The previous filter's bitmap isn't "stable" until siglock is held.

If we do the emulation step before siglock, we have to always do full
evaluation of all syscalls, and then merge the bitmap during attach.
That means all filters ever attached will take maximal time to perform
emulation.

I prefer the idea of the emulation step taking advantage of the bitmap
optimization, since the kernel spends less time doing work over the life
of the process tree. It's certainly marginal, but it also lets all the
bitmap manipulation stay in one place (as opposed to being split between
"prepare" and "attach").

What do you think?
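A toy sketch of the inheritance I mean (made-up types and names, not the actual kernel code):

```c
#include <assert.h>

#define NR 64	/* toy syscall-table size */

/* Emulate the new filter only for syscalls the previous filter's
 * bitmap still allows. A syscall already marked "must run filter"
 * cannot become allowed again, so its emulation is skipped entirely
 * and the bit is simply inherited. */
static void cache_prepare(const unsigned char *prev_allow,
			  unsigned char *new_allow,
			  int (*emulate)(int nr))
{
	int nr;

	for (nr = 0; nr < NR; nr++) {
		if (!prev_allow[nr]) {
			new_allow[nr] = 0;	/* inherit the filter path */
			continue;		/* skip emulation */
		}
		new_allow[nr] = emulate(nr) ? 1 : 0;
	}
}

/* Toy emulator standing in for the real per-filter evaluation. */
static int emulate_allow_even(int nr)
{
	return nr % 2 == 0;
}
```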

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-25 19:42             ` Kees Cook
@ 2020-09-25 19:51               ` Andy Lutomirski
  2020-09-25 20:37                 ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: Andy Lutomirski @ 2020-09-25 19:51 UTC (permalink / raw)
  To: Kees Cook
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry



> On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote:
> 
> On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote:
>> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>>>> Why do the prepare here instead of during attach? (And note that it
>>>> should not be written to fail.)
>>> 
>>> Right.
>> 
>> During attach a spinlock (current->sighand->siglock) is held. Do we
>> really want to put the emulator in the "atomic section"?
> 
> It's a good point, but I had some other ideas around it that led me to
> a different conclusion. Here's what I've got in my head:
> 
> I don't view filter attach (nor the siglock) as fastpath: the lock is
> rarely contested and the "long time" will only be during filter attach.
> 
> When performing filter emulation, all the syscalls that are already
> marked as "must run filter" on the previous filter can be skipped for
> the new filter, since it cannot change the outcome, which makes the
> emulation step faster.
> 
> The previous filter's bitmap isn't "stable" until siglock is held.
> 
> If we do the emulation step before siglock, we have to always do full
> evaluation of all syscalls, and then merge the bitmap during attach.
> That means all filters ever attached will take maximal time to perform
> emulation.
> 
> I prefer the idea of the emulation step taking advantage of the bitmap
> optimization, since the kernel spends less time doing work over the life
> of the process tree. It's certainly marginal, but it also lets all the
> bitmap manipulation stay in one place (as opposed to being split between
> "prepare" and "attach").
> 
> What do you think?
> 
> 

I’m wondering if we should be much, much lazier. We could potentially wait until someone actually tries to do a given syscall before we try to evaluate whether the result is fixed.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-25 19:51               ` Andy Lutomirski
@ 2020-09-25 20:37                 ` Kees Cook
  2020-09-25 21:07                   ` Andy Lutomirski
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-25 20:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Sep 25, 2020 at 12:51:20PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote:
> > 
> > On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote:
> >> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >>>> Why do the prepare here instead of during attach? (And note that it
> >>>> should not be written to fail.)
> >>> 
> >>> Right.
> >> 
> >> During attach a spinlock (current->sighand->siglock) is held. Do we
> >> really want to put the emulator in the "atomic section"?
> > 
> > It's a good point, but I had some other ideas around it that led me to
> > a different conclusion. Here's what I've got in my head:
> > 
> > I don't view filter attach (nor the siglock) as fastpath: the lock is
> > rarely contested and the "long time" will only be during filter attach.
> > 
> > When performing filter emulation, all the syscalls that are already
> > marked as "must run filter" on the previous filter can be skipped for
> > the new filter, since it cannot change the outcome, which makes the
> > emulation step faster.
> > 
> > The previous filter's bitmap isn't "stable" until siglock is held.
> > 
> > If we do the emulation step before siglock, we have to always do full
> > evaluation of all syscalls, and then merge the bitmap during attach.
> > That means all filters ever attached will take maximal time to perform
> > emulation.
> > 
> > I prefer the idea of the emulation step taking advantage of the bitmap
> > optimization, since the kernel spends less time doing work over the life
> > of the process tree. It's certainly marginal, but it also lets all the
> > bitmap manipulation stay in one place (as opposed to being split between
> > "prepare" and "attach").
> > 
> > What do you think?
> > 
> > 
> 
> I’m wondering if we should be much much lazier. We could potentially wait until someone actually tries to do a given syscall before we try to evaluate whether the result is fixed.

That seems like we'd need to track yet another bitmap of "did we emulate
this yet?" And it means the filter isn't really "done" until you run
another syscall? Eeh, I'm not a fan: it scratches at my desire for
determinism. ;) Or maybe my implementation imagination is missing
something?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-25 20:37                 ` Kees Cook
@ 2020-09-25 21:07                   ` Andy Lutomirski
  2020-09-25 23:49                     ` Kees Cook
  2020-09-26  1:23                     ` YiFei Zhu
  0 siblings, 2 replies; 135+ messages in thread
From: Andy Lutomirski @ 2020-09-25 21:07 UTC (permalink / raw)
  To: Kees Cook
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Sep 25, 2020 at 1:37 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Fri, Sep 25, 2020 at 12:51:20PM -0700, Andy Lutomirski wrote:
> >
> >
> > > On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote:
> > >
> > > On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote:
> > >> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > >>>> Why do the prepare here instead of during attach? (And note that it
> > >>>> should not be written to fail.)
> > >>>
> > >>> Right.
> > >>
> > >> During attach a spinlock (current->sighand->siglock) is held. Do we
> > >> really want to put the emulator in the "atomic section"?
> > >
> > > It's a good point, but I had some other ideas around it that led me to
> > > a different conclusion. Here's what I've got in my head:
> > >
> > > I don't view filter attach (nor the siglock) as fastpath: the lock is
> > > rarely contested and the "long time" will only be during filter attach.
> > >
> > > When performing filter emulation, all the syscalls that are already
> > > marked as "must run filter" on the previous filter can be skipped for
> > > the new filter, since it cannot change the outcome, which makes the
> > > emulation step faster.
> > >
> > > The previous filter's bitmap isn't "stable" until siglock is held.
> > >
> > > If we do the emulation step before siglock, we have to always do full
> > > evaluation of all syscalls, and then merge the bitmap during attach.
> > > That means all filters ever attached will take maximal time to perform
> > > emulation.
> > >
> > > I prefer the idea of the emulation step taking advantage of the bitmap
> > > optimization, since the kernel spends less time doing work over the life
> > > of the process tree. It's certainly marginal, but it also lets all the
> > > bitmap manipulation stay in one place (as opposed to being split between
> > > "prepare" and "attach").
> > >
> > > What do you think?
> > >
> > >
> >
> > I’m wondering if we should be much much lazier. We could potentially wait until someone actually tries to do a given syscall before we try to evaluate whether the result is fixed.
>
> That seems like we'd need to track yet another bitmap of "did we emulate
> this yet?" And it means the filter isn't really "done" until you run
> another syscall? eeh, I'm not a fan: it scratches at my desire for
> determinism. ;) Or maybe my implementation imagination is missing
> something?
>

We'd need at least three states per syscall: unknown, always-allow,
and need-to-run-filter.

The downsides are less determinism and a bit of an uglier
implementation.  The upside is that we don't need to loop over all
syscalls at load -- instead the time that each operation takes is
independent of the total number of syscalls on the system.  And we can
entirely avoid, say, evaluating the x32 case until the task tries an
x32 syscall.

I think it's at least worth considering.

--Andy

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-25 21:07                   ` Andy Lutomirski
@ 2020-09-25 23:49                     ` Kees Cook
  2020-09-26  0:34                       ` Andy Lutomirski
  2020-09-26  1:23                     ` YiFei Zhu
  1 sibling, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-25 23:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Sep 25, 2020 at 02:07:46PM -0700, Andy Lutomirski wrote:
> On Fri, Sep 25, 2020 at 1:37 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > On Fri, Sep 25, 2020 at 12:51:20PM -0700, Andy Lutomirski wrote:
> > >
> > >
> > > > On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote:
> > > >
> > > > On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote:
> > > >> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > >>>> Why do the prepare here instead of during attach? (And note that it
> > > >>>> should not be written to fail.)
> > > >>>
> > > >>> Right.
> > > >>
> > > >> During attach a spinlock (current->sighand->siglock) is held. Do we
> > > >> really want to put the emulator in the "atomic section"?
> > > >
> > > > It's a good point, but I had some other ideas around it that led me to
> > > > a different conclusion. Here's what I've got in my head:
> > > >
> > > > I don't view filter attach (nor the siglock) as fastpath: the lock is
> > > > rarely contested and the "long time" will only be during filter attach.
> > > >
> > > > When performing filter emulation, all the syscalls that are already
> > > > marked as "must run filter" on the previous filter can be skipped for
> > > > the new filter, since it cannot change the outcome, which makes the
> > > > emulation step faster.
> > > >
> > > > The previous filter's bitmap isn't "stable" until siglock is held.
> > > >
> > > > If we do the emulation step before siglock, we have to always do full
> > > > evaluation of all syscalls, and then merge the bitmap during attach.
> > > > That means all filters ever attached will take maximal time to perform
> > > > emulation.
> > > >
> > > > I prefer the idea of the emulation step taking advantage of the bitmap
> > > > optimization, since the kernel spends less time doing work over the life
> > > > of the process tree. It's certainly marginal, but it also lets all the
> > > > bitmap manipulation stay in one place (as opposed to being split between
> > > > "prepare" and "attach").
> > > >
> > > > What do you think?
> > > >
> > > >
> > >
> > > I’m wondering if we should be much much lazier. We could potentially wait until someone actually tries to do a given syscall before we try to evaluate whether the result is fixed.
> >
> > That seems like we'd need to track yet another bitmap of "did we emulate
> > this yet?" And it means the filter isn't really "done" until you run
> > another syscall? eeh, I'm not a fan: it scratches at my desire for
> > determinism. ;) Or maybe my implementation imagination is missing
> > something?
> >
> 
> We'd need at least three states per syscall: unknown, always-allow,
> and need-to-run-filter.
> 
> The downsides are less determinism and a bit of an uglier
> implementation.  The upside is that we don't need to loop over all
> syscalls at load -- instead the time that each operation takes is
> independent of the total number of syscalls on the system.  And we can
> entirely avoid, say, evaluating the x32 case until the task tries an
> x32 syscall.
> 
> I think it's at least worth considering.

Yeah, worth considering. I do still think the time spent in emulation is
SO small that it doesn't matter running all of the syscalls at attach
time. The filters are tiny and fail quickly if anything "interesting"
starts to happen. ;)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-25 23:49                     ` Kees Cook
@ 2020-09-26  0:34                       ` Andy Lutomirski
  0 siblings, 0 replies; 135+ messages in thread
From: Andy Lutomirski @ 2020-09-26  0:34 UTC (permalink / raw)
  To: Kees Cook
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry



> On Sep 25, 2020, at 4:49 PM, Kees Cook <keescook@chromium.org> wrote:
> 
> On Fri, Sep 25, 2020 at 02:07:46PM -0700, Andy Lutomirski wrote:
>>> On Fri, Sep 25, 2020 at 1:37 PM Kees Cook <keescook@chromium.org> wrote:
>>> 
>>> On Fri, Sep 25, 2020 at 12:51:20PM -0700, Andy Lutomirski wrote:
>>>> 
>>>> 
>>>>> On Sep 25, 2020, at 12:42 PM, Kees Cook <keescook@chromium.org> wrote:
>>>>> 
>>>>> On Fri, Sep 25, 2020 at 11:45:05AM -0500, YiFei Zhu wrote:
>>>>>> On Thu, Sep 24, 2020 at 10:04 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>>>>>>>> Why do the prepare here instead of during attach? (And note that it
>>>>>>>> should not be written to fail.)
>>>>>>> 
>>>>>>> Right.
>>>>>> 
>>>>>> During attach a spinlock (current->sighand->siglock) is held. Do we
>>>>>> really want to put the emulator in the "atomic section"?
>>>>> 
>>>>> It's a good point, but I had some other ideas around it that led me to
>>>>> a different conclusion. Here's what I've got in my head:
>>>>> 
>>>>> I don't view filter attach (nor the siglock) as fastpath: the lock is
>>>>> rarely contested and the "long time" will only be during filter attach.
>>>>> 
>>>>> When performing filter emulation, all the syscalls that are already
>>>>> marked as "must run filter" on the previous filter can be skipped for
>>>>> the new filter, since it cannot change the outcome, which makes the
>>>>> emulation step faster.
>>>>> 
>>>>> The previous filter's bitmap isn't "stable" until siglock is held.
>>>>> 
>>>>> If we do the emulation step before siglock, we have to always do full
>>>>> evaluation of all syscalls, and then merge the bitmap during attach.
>>>>> That means all filters ever attached will take maximal time to perform
>>>>> emulation.
>>>>> 
>>>>> I prefer the idea of the emulation step taking advantage of the bitmap
>>>>> optimization, since the kernel spends less time doing work over the life
>>>>> of the process tree. It's certainly marginal, but it also lets all the
>>>>> bitmap manipulation stay in one place (as opposed to being split between
>>>>> "prepare" and "attach").
>>>>> 
>>>>> What do you think?
>>>>> 
>>>>> 
>>>> 
>>>> I’m wondering if we should be much much lazier. We could potentially wait until someone actually tries to do a given syscall before we try to evaluate whether the result is fixed.
>>> 
>>> That seems like we'd need to track yet another bitmap of "did we emulate
>>> this yet?" And it means the filter isn't really "done" until you run
>>> another syscall? eeh, I'm not a fan: it scratches at my desire for
>>> determinism. ;) Or maybe my implementation imagination is missing
>>> something?
>>> 
>> 
>> We'd need at least three states per syscall: unknown, always-allow,
>> and need-to-run-filter.
>> 
>> The downsides are less determinism and a bit of an uglier
>> implementation.  The upside is that we don't need to loop over all
>> syscalls at load -- instead the time that each operation takes is
>> independent of the total number of syscalls on the system.  And we can
>> entirely avoid, say, evaluating the x32 case until the task tries an
>> x32 syscall.
>> 
>> I think it's at least worth considering.
> 
> Yeah, worth considering. I do still think the time spent in emulation is
> SO small that it doesn't matter running all of the syscalls at attach
> time. The filters are tiny and fail quickly if anything "interesting"
> starts to happen. ;)
> 

There’s a middle ground, too: do it lazily per arch.  So we would allocate and populate the compat bitmap the first time a compat syscall is attempted and do the same for x32. This may help avoid the annoying extra memory usage and 3x startup overhead while retaining full functionality.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-25 21:07                   ` Andy Lutomirski
  2020-09-25 23:49                     ` Kees Cook
@ 2020-09-26  1:23                     ` YiFei Zhu
  2020-09-26  2:47                       ` Andy Lutomirski
  1 sibling, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-26  1:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Sep 25, 2020 at 4:07 PM Andy Lutomirski <luto@amacapital.net> wrote:
> We'd need at least three states per syscall: unknown, always-allow,
> and need-to-run-filter.
>
> The downsides are less determinism and a bit of an uglier
> implementation.  The upside is that we don't need to loop over all
> syscalls at load -- instead the time that each operation takes is
> independent of the total number of syscalls on the system.  And we can
> entirely avoid, say, evaluating the x32 case until the task tries an
> x32 syscall.

I was really afraid of multiple tasks writing to the bitmaps at once,
hence I used bitmap-per-task. Now that I think about it, if this stays
lockless, the worst that can happen is that a write undoes a bit set
by another task. In that case, if the "known" bit is cleared, the worst
outcome is that the emulation runs many times. But if the "always
allow" bit is cleared while the "known" bit is not, then we have an
issue: the syscall will always be executed in BPF.

Is it worth holding a spinlock here?

Though I'll try to get the benchmark numbers for the emulator later tonight.

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-26  1:23                     ` YiFei Zhu
@ 2020-09-26  2:47                       ` Andy Lutomirski
  2020-09-26  4:35                         ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: Andy Lutomirski @ 2020-09-26  2:47 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Kees Cook, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry


> On Sep 25, 2020, at 6:23 PM, YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> 
> On Fri, Sep 25, 2020 at 4:07 PM Andy Lutomirski <luto@amacapital.net> wrote:
>> We'd need at least three states per syscall: unknown, always-allow,
>> and need-to-run-filter.
>> 
>> The downsides are less determinism and a bit of an uglier
>> implementation.  The upside is that we don't need to loop over all
>> syscalls at load -- instead the time that each operation takes is
>> independent of the total number of syscalls on the system.  And we can
>> entirely avoid, say, evaluating the x32 case until the task tries an
>> x32 syscall.
> 
> I was really afraid of multiple tasks writing to the bitmaps at once,
> hence I used bitmap-per-task. Now that I think about it, if this stays
> lockless, the worst that can happen is that a write undoes a bit set
> by another task. In that case, if the "known" bit is cleared, the worst
> outcome is that the emulation runs many times. But if the "always
> allow" bit is cleared while the "known" bit is not, then we have an
> issue: the syscall will always be executed in BPF.
> 

If you interleave the bits, then you can read and write them atomically — both bits for any given syscall will be in the same word.

> Is it worth holding a spinlock here?
> 
> Though I'll try to get the benchmark numbers for the emulator later tonight.
> 
> YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent
  2020-09-26  2:47                       ` Andy Lutomirski
@ 2020-09-26  4:35                         ` Kees Cook
  0 siblings, 0 replies; 135+ messages in thread
From: Kees Cook @ 2020-09-26  4:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Sep 25, 2020 at 07:47:47PM -0700, Andy Lutomirski wrote:
> 
> > On Sep 25, 2020, at 6:23 PM, YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > 
> > On Fri, Sep 25, 2020 at 4:07 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >> We'd need at least three states per syscall: unknown, always-allow,
> >> and need-to-run-filter.
> >> 
> >> The downsides are less determinism and a bit of an uglier
> >> implementation.  The upside is that we don't need to loop over all
> >> syscalls at load -- instead the time that each operation takes is
> >> independent of the total number of syscalls on the system.  And we can
> >> entirely avoid, say, evaluating the x32 case until the task tries an
> >> x32 syscall.
> > 
> > I was really afraid of multiple tasks writing to the bitmaps at once,
> > hence I used bitmap-per-task. Now that I think about it, if this stays
> > lockless, the worst that can happen is that a write undoes a bit set
> > by another task. In that case, if the "known" bit is cleared, the worst
> > outcome is that the emulation runs many times. But if the "always
> > allow" bit is cleared while the "known" bit is not, then we have an
> > issue: the syscall will always be executed in BPF.
> > 
> 
> If you interleave the bits, then you can read and write them atomically — both bits for any given syscall will be in the same word.

I think we can just hold the spinlock. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results
  2020-09-21  5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
                   ` (7 preceding siblings ...)
  2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu
@ 2020-09-30 15:19 ` YiFei Zhu
  2020-09-30 15:19   ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu
                     ` (5 more replies)
  8 siblings, 6 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/

Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.
* Architectures supported by default through arch number array,
  except for MIPS with its sparse syscall numbers.
* Configurable per-build for future different cache modes.

This series adds a bitmap to cache seccomp filter results if the
result permits a syscall and is independent of the syscall arguments.
This visibly decreases seccomp overhead for most common seccomp
filters with very little memory footprint.

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.

We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
numbers; as past discussions concluded, a bitmap where each bit
represents a syscall makes the most sense for these filters.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent of the syscall arguments.

When it is concluded that an allow must occur for the given
architecture and syscall pair, seccomp will immediately allow
the syscall, bypassing further BPF execution.

Ongoing work is to further support arguments with fast hash table
lookups. We are investigating the performance of doing so [6], and how
to best integrate with the existing seccomp infrastructure.

Some benchmarks are performed with results in patch 5, copied below:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Benchmarking 200000000 syscalls...
  129.359381409 - 0.008724424 = 129350656985 (129.4s)
  getpid native: 646 ns
  264.385890006 - 129.360453229 = 135025436777 (135.0s)
  getpid RET_ALLOW 1 filter (bitmap): 675 ns
  399.400511893 - 264.387045901 = 135013465992 (135.0s)
  getpid RET_ALLOW 2 filters (bitmap): 675 ns
  545.872866260 - 399.401718327 = 146471147933 (146.5s)
  getpid RET_ALLOW 3 filters (full): 732 ns
  696.337101319 - 545.874097681 = 150463003638 (150.5s)
  getpid RET_ALLOW 4 filters (full): 752 ns
  Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
  Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
  Estimated total seccomp overhead for 3 full filters: 86 ns
  Estimated total seccomp overhead for 4 full filters: 106 ns
  Estimated seccomp entry overhead: 29 ns
  Estimated seccomp per-filter overhead (last 2 diff): 20 ns
  Estimated seccomp per-filter overhead (filters / 4): 19 ns
  Expectations:
  	native ≤ 1 bitmap (646 ≤ 675): ✔️
  	native ≤ 1 filter (646 ≤ 732): ✔️
  	per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
  	1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
  	entry ≈ 1 bitmapped (29 ≈ 29): ✔️
  	entry ≈ 2 bitmapped (29 ≈ 29): ✔️
  	native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

v2 -> v3:
* Added array_index_nospec guards
* No more syscall_arches[] array and expecting on loop unrolling. Arches
  are configured with per-arch seccomp.h.
* Moved filter emulation to attach time (from prepare time).
* Further simplified emulator, basing on Kees's code.
* Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN.

v1 -> v2:
* Corrected one outdated function documentation.

RFC -> v1:
* Config made on by default across all arches that could support it.
* Added arch numbers array and emulate filter for each arch number, and
  have a per-arch bitmap.
* Massively simplified the emulator so it would only support the common
  instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL
  during prepare).
* Stole the selftest from Kees.
* Added a /proc/pid/seccomp_cache by Jann's suggestion.

Patch 1 adds the arch macros for x86.

Patch 2 implements the emulator that finds if a filter must return allow,

Patch 3 implements the test_bit against the bitmaps.

Patch 4 updates the selftest to better show the new semantics.

Patch 5 implements /proc/pid/seccomp_cache.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020

Kees Cook (2):
  x86: Enable seccomp architecture tracking
  selftests/seccomp: Compare bitmap vs filter overhead

YiFei Zhu (3):
  seccomp/cache: Add "emulator" to check if filter is constant allow
  seccomp/cache: Lookup syscall allowlist for fast path
  seccomp/cache: Report cache data through /proc/pid/seccomp_cache

 arch/Kconfig                                  |  49 ++++
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/seccomp.h                |  15 +
 fs/proc/base.c                                |   3 +
 include/linux/seccomp.h                       |   5 +
 kernel/seccomp.c                              | 265 +++++++++++++++++-
 .../selftests/seccomp/seccomp_benchmark.c     | 151 ++++++++--
 tools/testing/selftests/seccomp/settings      |   2 +-
 8 files changed, 467 insertions(+), 24 deletions(-)

--
2.28.0

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking
  2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
@ 2020-09-30 15:19   ` YiFei Zhu
  2020-09-30 21:21     ` Kees Cook
  2020-09-30 15:19   ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: Kees Cook <keescook@chromium.org>

Provide seccomp internals with the details to calculate which syscall
table the running kernel is expecting to deal with. This allows for
efficient architecture pinning and paves the way for constant-action
bitmaps.

Signed-off-by: Kees Cook <keescook@chromium.org>
[YiFei: Removed x32, added macro for nr_syscalls]
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/x86/include/asm/seccomp.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 2bd1338de236..7b3a58271656 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -16,6 +16,18 @@
 #define __NR_seccomp_sigreturn_32	__NR_ia32_sigreturn
 #endif
 
+#ifdef CONFIG_X86_64
+# define SECCOMP_ARCH_DEFAULT			AUDIT_ARCH_X86_64
+# define SECCOMP_ARCH_DEFAULT_NR		NR_syscalls
+# ifdef CONFIG_COMPAT
+#  define SECCOMP_ARCH_COMPAT			AUDIT_ARCH_I386
+#  define SECCOMP_ARCH_COMPAT_NR		IA32_NR_syscalls
+# endif
+#else /* !CONFIG_X86_64 */
+# define SECCOMP_ARCH_DEFAULT		AUDIT_ARCH_I386
+# define SECCOMP_ARCH_DEFAULT_NR	NR_syscalls
+#endif
+
 #include <asm-generic/seccomp.h>
 
 #endif /* _ASM_X86_SECCOMP_H */
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  2020-09-30 15:19   ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu
@ 2020-09-30 15:19   ` YiFei Zhu
  2020-09-30 22:24     ` Jann Horn
                       ` (2 more replies)
  2020-09-30 15:19   ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
                     ` (3 subsequent siblings)
  5 siblings, 3 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

SECCOMP_CACHE_NR_ONLY will only cache results for syscalls whose
filter outcome does not depend on the syscall arguments or the
instruction pointer. To facilitate this we need a static analyser
to know whether a filter will return allow regardless of syscall
arguments for a given architecture number / syscall number pair.
This is implemented here with a pseudo-emulator, and the results
are stored in a per-filter bitmap.

Each common BPF instruction is emulated. Any weirdness or load
from a syscall argument will cause the emulator to bail.

Emulation also halts if it reaches a return. In that case, if the
filter returns SECCOMP_RET_ALLOW, the syscall is marked as good.

Emulator structure and comments are from Kees [1] and Jann [2].

Emulation is done at attach time. If a filter is stacked on top of
previous filters, and a previous filter does not guarantee to allow
the syscall, then we skip the emulation of that syscall.

[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig     |  34 ++++++++++
 arch/x86/Kconfig |   1 +
 kernel/seccomp.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 201 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 21a3675a7a3a..ca867b2a5d71 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -471,6 +471,14 @@ config HAVE_ARCH_SECCOMP_FILTER
 	    results in the system call being skipped immediately.
 	  - seccomp syscall wired up
 
+config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
+	bool
+	help
+	  An arch should select this symbol if it provides all of these things:
+	  - all the requirements for HAVE_ARCH_SECCOMP_FILTER
+	  - SECCOMP_ARCH_DEFAULT
+	  - SECCOMP_ARCH_DEFAULT_NR
+
 config SECCOMP
 	prompt "Enable seccomp to safely execute untrusted bytecode"
 	def_bool y
@@ -498,6 +506,32 @@ config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+choice
+	prompt "Seccomp filter cache"
+	default SECCOMP_CACHE_NONE
+	depends on SECCOMP_FILTER
+	depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
+	help
+	  Seccomp filters can potentially incur large overhead for each
+	  system call. Caching filter results can alleviate some of it.
+
+	  If in doubt, select 'syscall numbers only'.
+
+config SECCOMP_CACHE_NONE
+	bool "None"
+	help
+	  No caching is done. Seccomp filters will be called each time
+	  a system call occurs in a seccomp-guarded task.
+
+config SECCOMP_CACHE_NR_ONLY
+	bool "Syscall number only"
+	depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
+	help
+	  For each syscall number, if the seccomp filter has a fixed
+	  result, store that result in a bitmap to speed up system calls.
+
+endchoice
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1ab22869a765..ff5289228ea5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -150,6 +150,7 @@ config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ae6b40cc39f4..f09c9e74ae05 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,6 +143,37 @@ struct notification {
 	struct list_head notifications;
 };
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * struct seccomp_cache_filter_data - container for cache's per-filter data
+ *
+ * This struct is ordered to minimize padding holes.
+ *
+ * @syscall_allow_default: A bitmap where each bit represents whether the
+ *			   filter will always allow the syscall, for the
+ *			   default architecture.
+ * @syscall_allow_compat: A bitmap where each bit represents whether the
+ *		          filter will always allow the syscall, for the
+ *			  compat architecture.
+ */
+struct seccomp_cache_filter_data {
+#ifdef SECCOMP_ARCH_DEFAULT
+	DECLARE_BITMAP(syscall_allow_default, SECCOMP_ARCH_DEFAULT_NR);
+#endif
+#ifdef SECCOMP_ARCH_COMPAT
+	DECLARE_BITMAP(syscall_allow_compat, SECCOMP_ARCH_COMPAT_NR);
+#endif
+};
+
+#define SECCOMP_EMU_MAX_PENDING_STATES 64
+#else
+struct seccomp_cache_filter_data { };
+
+static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -159,6 +190,7 @@ struct notification {
  *	   this filter after reaching 0. The @users count is always smaller
  *	   or equal to @refs. Hence, reaching 0 for @users does not mean
  *	   the filter can be freed.
+ * @cache: container for cache-related data.
  * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
  * @prev: points to a previously installed, or inherited, filter
  * @prog: the BPF program to evaluate
@@ -180,6 +212,7 @@ struct seccomp_filter {
 	refcount_t refs;
 	refcount_t users;
 	bool log;
+	struct seccomp_cache_filter_data cache;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
 	struct notification *notif;
@@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 {
 	struct seccomp_filter *sfilter;
 	int ret;
-	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
+	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
+			       IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY);
 
 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
 		return ERR_PTR(-EINVAL);
@@ -610,6 +644,136 @@ seccomp_prepare_user_filter(const char __user *user_filter)
 	return filter;
 }
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * seccomp_emu_is_const_allow - check if filter is constant allow with given data
+ * @fprog: The BPF programs
+ * @sd: The seccomp data to check against, only syscall number are arch
+ *      number are considered constant.
+ */
+static bool seccomp_emu_is_const_allow(struct sock_fprog_kern *fprog,
+				       struct seccomp_data *sd)
+{
+	unsigned int insns;
+	unsigned int reg_value = 0;
+	unsigned int pc;
+	bool op_res;
+
+	if (WARN_ON_ONCE(!fprog))
+		return false;
+
+	insns = fprog->len; /* number of instructions, not bytes */
+	for (pc = 0; pc < insns; pc++) {
+		struct sock_filter *insn = &fprog->filter[pc];
+		u16 code = insn->code;
+		u32 k = insn->k;
+
+		switch (code) {
+		case BPF_LD | BPF_W | BPF_ABS:
+			switch (k) {
+			case offsetof(struct seccomp_data, nr):
+				reg_value = sd->nr;
+				break;
+			case offsetof(struct seccomp_data, arch):
+				reg_value = sd->arch;
+				break;
+			default:
+				/* can't optimize (non-constant value load) */
+				return false;
+			}
+			break;
+		case BPF_RET | BPF_K:
+			/* reached return with constant values only, check allow */
+			return k == SECCOMP_RET_ALLOW;
+		case BPF_JMP | BPF_JA:
+			pc += insn->k;
+			break;
+		case BPF_JMP | BPF_JEQ | BPF_K:
+		case BPF_JMP | BPF_JGE | BPF_K:
+		case BPF_JMP | BPF_JGT | BPF_K:
+		case BPF_JMP | BPF_JSET | BPF_K:
+			switch (BPF_OP(code)) {
+			case BPF_JEQ:
+				op_res = reg_value == k;
+				break;
+			case BPF_JGE:
+				op_res = reg_value >= k;
+				break;
+			case BPF_JGT:
+				op_res = reg_value > k;
+				break;
+			case BPF_JSET:
+				op_res = !!(reg_value & k);
+				break;
+			default:
+				/* can't optimize (unknown jump) */
+				return false;
+			}
+
+			pc += op_res ? insn->jt : insn->jf;
+			break;
+		case BPF_ALU | BPF_AND | BPF_K:
+			reg_value &= k;
+			break;
+		default:
+			/* can't optimize (unknown insn) */
+			return false;
+		}
+	}
+
+	/* ran off the end of the filter?! */
+	WARN_ON(1);
+	return false;
+}
+
+static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
+					 void *bitmap, const void *bitmap_prev,
+					 size_t bitmap_size, int arch)
+{
+	struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
+	struct seccomp_data sd;
+	int nr;
+
+	for (nr = 0; nr < bitmap_size; nr++) {
+		if (bitmap_prev && !test_bit(nr, bitmap_prev))
+			continue;
+
+		sd.nr = nr;
+		sd.arch = arch;
+
+		if (seccomp_emu_is_const_allow(fprog, &sd))
+			set_bit(nr, bitmap);
+	}
+}
+
+/**
+ * seccomp_cache_prepare - emulate the filter to find cacheable syscalls
+ * @sfilter: The seccomp filter
+ *
+ * The result of the emulation is stored in @sfilter->cache.
+ */
+static void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+	struct seccomp_cache_filter_data *cache = &sfilter->cache;
+	const struct seccomp_cache_filter_data *cache_prev =
+		sfilter->prev ? &sfilter->prev->cache : NULL;
+
+#ifdef SECCOMP_ARCH_DEFAULT
+	seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_default,
+				     cache_prev ? cache_prev->syscall_allow_default : NULL,
+				     SECCOMP_ARCH_DEFAULT_NR,
+				     SECCOMP_ARCH_DEFAULT);
+#endif /* SECCOMP_ARCH_DEFAULT */
+
+#ifdef SECCOMP_ARCH_COMPAT
+	seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_compat,
+				     cache_prev ? cache_prev->syscall_allow_compat : NULL,
+				     SECCOMP_ARCH_COMPAT_NR,
+				     SECCOMP_ARCH_COMPAT);
+#endif /* SECCOMP_ARCH_COMPAT */
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_attach_filter: validate and attach filter
  * @flags:  flags to change filter behavior
@@ -659,6 +823,7 @@ static long seccomp_attach_filter(unsigned int flags,
 	 * task reference.
 	 */
 	filter->prev = current->seccomp.filter;
+	seccomp_cache_prepare(filter);
 	current->seccomp.filter = filter;
 	atomic_inc(&current->seccomp.filter_count);
 
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path
  2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  2020-09-30 15:19   ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu
  2020-09-30 15:19   ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
@ 2020-09-30 15:19   ` YiFei Zhu
  2020-09-30 21:32     ` Kees Cook
  2020-09-30 15:19   ` [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.

This first finds the current allow bitmask by iterating through
syscall_arches[] array and comparing it to the one in struct
seccomp_data; this loop is expected to be unrolled. It then
does a test_bit against the bitmask. If the bit is set, then
there is no need to run the full filter; it returns
SECCOMP_RET_ALLOW immediately.

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index f09c9e74ae05..bed3b2a7f6c8 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { };
 static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
 {
 }
+
+static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter,
+				       const struct seccomp_data *sd)
+{
+	return false;
+}
 #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
 
 /**
@@ -331,6 +337,49 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 	return 0;
 }
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+static bool seccomp_cache_check_bitmap(const void *bitmap, size_t bitmap_size,
+				       int syscall_nr)
+{
+	if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))
+		return false;
+	syscall_nr = array_index_nospec(syscall_nr, bitmap_size);
+
+	return test_bit(syscall_nr, bitmap);
+}
+
+/**
+ * seccomp_cache_check - lookup seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to lookup the cache with
+ *
+ * Returns true if the seccomp_data is cached and allowed.
+ */
+static bool seccomp_cache_check(const struct seccomp_filter *sfilter,
+				const struct seccomp_data *sd)
+{
+	int syscall_nr = sd->nr;
+	const struct seccomp_cache_filter_data *cache = &sfilter->cache;
+
+#ifdef SECCOMP_ARCH_DEFAULT
+	if (likely(sd->arch == SECCOMP_ARCH_DEFAULT))
+		return seccomp_cache_check_bitmap(cache->syscall_allow_default,
+						  SECCOMP_ARCH_DEFAULT_NR,
+						  syscall_nr);
+#endif /* SECCOMP_ARCH_DEFAULT */
+
+#ifdef SECCOMP_ARCH_COMPAT
+	if (likely(sd->arch == SECCOMP_ARCH_COMPAT))
+		return seccomp_cache_check_bitmap(cache->syscall_allow_compat,
+						  SECCOMP_ARCH_COMPAT_NR,
+						  syscall_nr);
+#endif /* SECCOMP_ARCH_COMPAT */
+
+	WARN_ON_ONCE(true);
+	return false;
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_run_filters - evaluates all seccomp filters against @sd
  * @sd: optional seccomp data to be passed to filters
@@ -353,6 +402,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 	if (WARN_ON(f == NULL))
 		return SECCOMP_RET_KILL_PROCESS;
 
+	if (seccomp_cache_check(f, sd))
+		return SECCOMP_RET_ALLOW;
+
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead
  2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
                     ` (2 preceding siblings ...)
  2020-09-30 15:19   ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
@ 2020-09-30 15:19   ` YiFei Zhu
  2020-09-30 15:19   ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
  2020-10-09 17:14   ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  5 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: Kees Cook <keescook@chromium.org>

As part of the seccomp benchmarking, include the expectations with
regard to the timing behavior of the constant action bitmaps, and report
inconsistencies better.

Example output with constant action bitmaps on x86:

$ sudo ./seccomp_benchmark 100000000
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
	native ≤ 1 bitmap (646 ≤ 675): ✔️
	native ≤ 1 filter (646 ≤ 732): ✔️
	per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
	1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
	entry ≈ 1 bitmapped (29 ≈ 29): ✔️
	entry ≈ 2 bitmapped (29 ≈ 29): ✔️
	native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

Signed-off-by: Kees Cook <keescook@chromium.org>
[YiFei: Changed commit message to show stats for this patch series]
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++++++++++---
 tools/testing/selftests/seccomp/settings      |   2 +-
 2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 91f5a89cadac..fcc806585266 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,12 +4,16 @@
  */
 #define _GNU_SOURCE
 #include <assert.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <unistd.h>
 #include <linux/filter.h>
 #include <linux/seccomp.h>
+#include <sys/param.h>
 #include <sys/prctl.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
@@ -70,18 +74,74 @@ unsigned long long calibrate(void)
 	return samples * seconds;
 }
 
+bool approx(int i_one, int i_two)
+{
+	double one = i_one, one_bump = one * 0.01;
+	double two = i_two, two_bump = two * 0.01;
+
+	one_bump = one + MAX(one_bump, 2.0);
+	two_bump = two + MAX(two_bump, 2.0);
+
+	/* Equal to, or within 1% or 2 digits */
+	if (one == two ||
+	    (one > two && one <= two_bump) ||
+	    (two > one && two <= one_bump))
+		return true;
+	return false;
+}
+
+bool le(int i_one, int i_two)
+{
+	if (i_one <= i_two)
+		return true;
+	return false;
+}
+
+long compare(const char *name_one, const char *name_eval, const char *name_two,
+	     unsigned long long one, bool (*eval)(int, int), unsigned long long two)
+{
+	bool good;
+
+	printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two,
+	       (long long)one, name_eval, (long long)two);
+	if (one > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)one);
+		return 1;
+	}
+	if (two > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)two);
+		return 1;
+	}
+
+	good = eval(one, two);
+	printf("%s\n", good ? "✔️" : "❌");
+
+	return good ? 0 : 1;
+}
+
 int main(int argc, char *argv[])
 {
+	struct sock_filter bitmap_filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog bitmap_prog = {
+		.len = (unsigned short)ARRAY_SIZE(bitmap_filter),
+		.filter = bitmap_filter,
+	};
 	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])),
 		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
 	};
 	struct sock_fprog prog = {
 		.len = (unsigned short)ARRAY_SIZE(filter),
 		.filter = filter,
 	};
-	long ret;
-	unsigned long long samples;
-	unsigned long long native, filter1, filter2;
+
+	long ret, bits;
+	unsigned long long samples, calc;
+	unsigned long long native, filter1, filter2, bitmap1, bitmap2;
+	unsigned long long entry, per_filter1, per_filter2;
 
 	printf("Current BPF sysctl settings:\n");
 	system("sysctl net.core.bpf_jit_enable");
@@ -101,35 +161,82 @@ int main(int argc, char *argv[])
 	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
 	assert(ret == 0);
 
-	/* One filter */
-	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
+	/* One filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
 	assert(ret == 0);
 
-	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1);
+	bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1);
+
+	/* Second filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	if (filter1 == native)
-		printf("No overhead measured!? Try running again with more samples.\n");
+	bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2);
 
-	/* Two filters */
+	/* Third filter, can no longer be converted to bitmap */
 	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
 	assert(ret == 0);
 
-	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2);
-
-	/* Calculations */
-	printf("Estimated total seccomp overhead for 1 filter: %llu ns\n",
-		filter1 - native);
+	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1);
 
-	printf("Estimated total seccomp overhead for 2 filters: %llu ns\n",
-		filter2 - native);
+	/* Fourth filter, can not be converted to bitmap because of filter 3 */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	printf("Estimated seccomp per-filter overhead: %llu ns\n",
-		filter2 - filter1);
+	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2);
+
+	/* Estimations */
+#define ESTIMATE(fmt, var, what)	do {			\
+		var = (what);					\
+		printf("Estimated " fmt ": %llu ns\n", var);	\
+		if (var > INT_MAX)				\
+			goto more_samples;			\
+	} while (0)
+
+	ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc,
+		 bitmap1 - native);
+	ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc,
+		 bitmap2 - native);
+	ESTIMATE("total seccomp overhead for 3 full filters", calc,
+		 filter1 - native);
+	ESTIMATE("total seccomp overhead for 4 full filters", calc,
+		 filter2 - native);
+	ESTIMATE("seccomp entry overhead", entry,
+		 bitmap1 - native - (bitmap2 - bitmap1));
+	ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1,
+		 filter2 - filter1);
+	ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2,
+		 (filter2 - native - entry) / 4);
+
+	printf("Expectations:\n");
+	ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1);
+	bits = compare("native", "≤", "1 filter", native, le, filter1);
+	if (bits)
+		goto more_samples;
+
+	ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)",
+			per_filter1, approx, per_filter2);
+
+	bits = compare("1 bitmapped", "≈", "2 bitmapped",
+			bitmap1 - native, approx, bitmap2 - native);
+	if (bits) {
+		printf("Skipping constant action bitmap expectations: they appear unsupported.\n");
+		goto out;
+	}
 
-	printf("Estimated seccomp entry overhead: %llu ns\n",
-		filter1 - native - (filter2 - filter1));
+	ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native);
+	ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native);
+	ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total",
+			entry + (per_filter1 * 4) + native, approx, filter2);
+	if (ret == 0)
+		goto out;
 
+more_samples:
+	printf("Saw unexpected benchmark result. Try running again with more samples?\n");
+out:
 	return 0;
 }
diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings
index ba4d85f74cd6..6091b45d226b 100644
--- a/tools/testing/selftests/seccomp/settings
+++ b/tools/testing/selftests/seccomp/settings
@@ -1 +1 @@
-timeout=90
+timeout=120
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
                     ` (3 preceding siblings ...)
  2020-09-30 15:19   ` [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
@ 2020-09-30 15:19   ` YiFei Zhu
  2020-09-30 22:00     ` Jann Horn
  2020-09-30 22:59     ` Kees Cook
  2020-10-09 17:14   ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  5 siblings, 2 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-09-30 15:19 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Currently the kernel does not provide an infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through FTRACE_SYSCALL
infrastructure but it does not provide support for compat syscalls.

This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<arch name> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and FILTER means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:
x86_64 0 ALLOW
x86_64 1 ALLOW
x86_64 2 ALLOW
x86_64 3 ALLOW
[...]
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
x86_64 138 ALLOW
x86_64 139 FILTER
x86_64 140 ALLOW
x86_64 141 ALLOW
[...]

This file is guarded by CONFIG_DEBUG_SECCOMP_CACHE with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable. For
the same reason, it is also guarded by CAP_SYS_ADMIN.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig                   | 15 +++++++++++
 arch/x86/include/asm/seccomp.h |  3 +++
 fs/proc/base.c                 |  3 +++
 include/linux/seccomp.h        |  5 ++++
 kernel/seccomp.c               | 46 ++++++++++++++++++++++++++++++++++
 5 files changed, 72 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index ca867b2a5d71..b840cadcc882 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -478,6 +478,7 @@ config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
 	  - all the requirements for HAVE_ARCH_SECCOMP_FILTER
 	  - SECCOMP_ARCH_DEFAULT
 	  - SECCOMP_ARCH_DEFAULT_NR
+	  - SECCOMP_ARCH_DEFAULT_NAME
 
 config SECCOMP
 	prompt "Enable seccomp to safely execute untrusted bytecode"
@@ -532,6 +533,20 @@ config SECCOMP_CACHE_NR_ONLY
 
 endchoice
 
+config DEBUG_SECCOMP_CACHE
+	bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
+	depends on SECCOMP_CACHE_NR_ONLY
+	depends on PROC_FS
+	help
+	  This enables the /proc/pid/seccomp_cache interface to monitor
+	  seccomp cache data. The file format is subject to change. Reading
+	  the file requires CAP_SYS_ADMIN.
+
+	  This option is for debugging only. Enabling it presents the risk that
+	  an adversary may be able to infer the seccomp filter logic.
+
+	  If unsure, say N.
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 7b3a58271656..33ccc074be7a 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -19,13 +19,16 @@
 #ifdef CONFIG_X86_64
 # define SECCOMP_ARCH_DEFAULT			AUDIT_ARCH_X86_64
 # define SECCOMP_ARCH_DEFAULT_NR		NR_syscalls
+# define SECCOMP_ARCH_DEFAULT_NAME		"x86_64"
 # ifdef CONFIG_COMPAT
 #  define SECCOMP_ARCH_COMPAT			AUDIT_ARCH_I386
 #  define SECCOMP_ARCH_COMPAT_NR		IA32_NR_syscalls
+#  define SECCOMP_ARCH_COMPAT_NAME		"x86_32"
 # endif
 #else /* !CONFIG_X86_64 */
 # define SECCOMP_ARCH_DEFAULT		AUDIT_ARCH_I386
 # define SECCOMP_ARCH_DEFAULT_NR	NR_syscalls
+# define SECCOMP_ARCH_DEFAULT_NAME	"x86_32"
 #endif
 
 #include <asm-generic/seccomp.h>
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617db4e0faa0..c60c5fce70fa 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
 #endif
+#ifdef CONFIG_DEBUG_SECCOMP_CACHE
+	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..c35430f5f553 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task,
 	return -EINVAL;
 }
 #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_DEBUG_SECCOMP_CACHE
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+			   struct pid *pid, struct task_struct *task);
+#endif
 #endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index bed3b2a7f6c8..c5ca5e30281b 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2297,3 +2297,49 @@ static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_DEBUG_SECCOMP_CACHE
+/* Currently CONFIG_DEBUG_SECCOMP_CACHE implies CONFIG_SECCOMP_CACHE_NR_ONLY */
+static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name,
+					const void *bitmap, size_t bitmap_size)
+{
+	int nr;
+
+	for (nr = 0; nr < bitmap_size; nr++) {
+		bool cached = test_bit(nr, bitmap);
+		char *status = cached ? "ALLOW" : "FILTER";
+
+		seq_printf(m, "%s %d %s\n", name, nr, status);
+	}
+}
+
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+			   struct pid *pid, struct task_struct *task)
+{
+	struct seccomp_filter *f;
+
+	/*
+	 * We don't want a sandboxed process to know what its seccomp
+	 * filters consist of.
+	 */
+	if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN))
+		return -EACCES;
+
+	f = READ_ONCE(task->seccomp.filter);
+	if (!f)
+		return 0;
+
+#ifdef SECCOMP_ARCH_DEFAULT
+	proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_DEFAULT_NAME,
+				    f->cache.syscall_allow_default,
+				    SECCOMP_ARCH_DEFAULT_NR);
+#endif /* SECCOMP_ARCH_DEFAULT */
+
+#ifdef SECCOMP_ARCH_COMPAT
+	proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME,
+				    f->cache.syscall_allow_compat,
+				    SECCOMP_ARCH_COMPAT_NR);
+#endif /* SECCOMP_ARCH_COMPAT */
+	return 0;
+}
+#endif /* CONFIG_DEBUG_SECCOMP_CACHE */
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking
  2020-09-30 15:19   ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu
@ 2020-09-30 21:21     ` Kees Cook
  2020-09-30 21:33       ` Jann Horn
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-30 21:21 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Wed, Sep 30, 2020 at 10:19:12AM -0500, YiFei Zhu wrote:
> From: Kees Cook <keescook@chromium.org>
> 
> Provide seccomp internals with the details to calculate which syscall
> table the running kernel is expecting to deal with. This allows for
> efficient architecture pinning and paves the way for constant-action
> bitmaps.
> 
> Signed-off-by: Kees Cook <keescook@chromium.org>
> [YiFei: Removed x32, added macro for nr_syscalls]
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> ---
>  arch/x86/include/asm/seccomp.h | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
> index 2bd1338de236..7b3a58271656 100644
> --- a/arch/x86/include/asm/seccomp.h
> +++ b/arch/x86/include/asm/seccomp.h
> @@ -16,6 +16,18 @@
>  #define __NR_seccomp_sigreturn_32	__NR_ia32_sigreturn
>  #endif
>  
> +#ifdef CONFIG_X86_64
> +# define SECCOMP_ARCH_DEFAULT			AUDIT_ARCH_X86_64
> +# define SECCOMP_ARCH_DEFAULT_NR		NR_syscalls

bikeshedding: let's call these SECCOMP_ARCH_NATIVE* -- I think it's more
descriptive.

> +# ifdef CONFIG_COMPAT
> +#  define SECCOMP_ARCH_COMPAT			AUDIT_ARCH_I386
> +#  define SECCOMP_ARCH_COMPAT_NR		IA32_NR_syscalls
> +# endif
> +#else /* !CONFIG_X86_64 */
> +# define SECCOMP_ARCH_DEFAULT		AUDIT_ARCH_I386
> +# define SECCOMP_ARCH_DEFAULT_NR	NR_syscalls
> +#endif
> +
>  #include <asm-generic/seccomp.h>
>  
>  #endif /* _ASM_X86_SECCOMP_H */
> -- 
> 2.28.0
> 

But otherwise, yes, looks good to me. For this patch, I think the S-o-b chain is probably more
accurately captured as:

Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>


-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path
  2020-09-30 15:19   ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
@ 2020-09-30 21:32     ` Kees Cook
  2020-10-09  0:17       ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-30 21:32 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Wed, Sep 30, 2020 at 10:19:14AM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <yifeifz2@illinois.edu>
> 
> The fast (common) path for seccomp should be that the filter permits
> the syscall to pass through, and failing seccomp is expected to be
> an exceptional case; it is not expected for userspace to call a
> denylisted syscall over and over.
> 
> This first finds the current allow bitmask by iterating through
> syscall_arches[] array and comparing it to the one in struct
> seccomp_data; this loop is expected to be unrolled. It then
> does a test_bit against the bitmask. If the bit is set, then
> there is no need to run the full filter; it returns
> SECCOMP_RET_ALLOW immediately.
> 
> Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>

I'd like the content/ordering of this and the emulator patch to be
reorganized a bit. I'd like to see the infrastructure of the cache
added first (along with the "always allow" test logic in this patch),
with the emulator missing: i.e. the patch is a logical no-op: no
behavior changes because nothing ever changes the cache bits, but all
the operational logic, structure changes, etc, is in place. Then the
next patch would be replacing the no-op with the emulator.

> ---
>  kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 52 insertions(+)
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index f09c9e74ae05..bed3b2a7f6c8 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { };
>  static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
>  {
>  }
> +
> +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter,

bikeshedding: "cache check" doesn't tell me anything about what it's
actually checking for. How about calling this seccomp_is_constant_allow() or
something that reflects both the "bool" return ("is") and what that bool
means ("should always be allowed").

> +				       const struct seccomp_data *sd)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
>  
>  /**
> @@ -331,6 +337,49 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +static bool seccomp_cache_check_bitmap(const void *bitmap, size_t bitmap_size,

Please also mark as "inline".

> +				       int syscall_nr)
> +{
> +	if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))
> +		return false;
> +	syscall_nr = array_index_nospec(syscall_nr, bitmap_size);
> +
> +	return test_bit(syscall_nr, bitmap);
> +}
> +
> +/**
> + * seccomp_cache_check - lookup seccomp cache
> + * @sfilter: The seccomp filter
> + * @sd: The seccomp data to lookup the cache with
> + *
> + * Returns true if the seccomp_data is cached and allowed.
> + */
> +static bool seccomp_cache_check(const struct seccomp_filter *sfilter,

inline too.

> +				const struct seccomp_data *sd)
> +{
> +	int syscall_nr = sd->nr;
> +	const struct seccomp_cache_filter_data *cache = &sfilter->cache;
> +
> +#ifdef SECCOMP_ARCH_DEFAULT
> +	if (likely(sd->arch == SECCOMP_ARCH_DEFAULT))
> +		return seccomp_cache_check_bitmap(cache->syscall_allow_default,
> +						  SECCOMP_ARCH_DEFAULT_NR,
> +						  syscall_nr);
> +#endif /* SECCOMP_ARCH_DEFAULT */
> +
> +#ifdef SECCOMP_ARCH_COMPAT
> +	if (likely(sd->arch == SECCOMP_ARCH_COMPAT))
> +		return seccomp_cache_check_bitmap(cache->syscall_allow_compat,
> +						  SECCOMP_ARCH_COMPAT_NR,
> +						  syscall_nr);
> +#endif /* SECCOMP_ARCH_COMPAT */
> +
> +	WARN_ON_ONCE(true);
> +	return false;
> +}
> +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
> +
>  /**
>   * seccomp_run_filters - evaluates all seccomp filters against @sd
>   * @sd: optional seccomp data to be passed to filters
> @@ -353,6 +402,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
>  	if (WARN_ON(f == NULL))
>  		return SECCOMP_RET_KILL_PROCESS;
>  
> +	if (seccomp_cache_check(f, sd))
> +		return SECCOMP_RET_ALLOW;
> +
>  	/*
>  	 * All filters in the list are evaluated and the lowest BPF return
>  	 * value always takes priority (ignoring the DATA).
> -- 
> 2.28.0
> 

Otherwise, yup, looks good.

-- 
Kees Cook


* Re: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking
  2020-09-30 21:21     ` Kees Cook
@ 2020-09-30 21:33       ` Jann Horn
  2020-09-30 22:53         ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: Jann Horn @ 2020-09-30 21:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Wed, Sep 30, 2020 at 11:21 PM Kees Cook <keescook@chromium.org> wrote:
> On Wed, Sep 30, 2020 at 10:19:12AM -0500, YiFei Zhu wrote:
> > From: Kees Cook <keescook@chromium.org>
> >
> > Provide seccomp internals with the details to calculate which syscall
> > table the running kernel is expecting to deal with. This allows for
> > efficient architecture pinning and paves the way for constant-action
> > bitmaps.
> >
> > Signed-off-by: Kees Cook <keescook@chromium.org>
> > [YiFei: Removed x32, added macro for nr_syscalls]
> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
[...]
> But otherwise, yes, looks good to me. For this patch, I think the S-o-b chain is probably more
> accurately captured as:
>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>

(Technically, https://www.kernel.org/doc/html/latest/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by
says that "every Co-developed-by: must be immediately followed by a
Signed-off-by: of the associated co-author" (and has an example of how
that should look).)
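For reference, the example in that document (using the documentation's
placeholder names, not real contributors) orders the tags roughly like
this:

```
<changelog>

Co-developed-by: First Co-Author <first@coauthor.example.org>
Signed-off-by: First Co-Author <first@coauthor.example.org>
Co-developed-by: Second Co-Author <second@coauthor.example.org>
Signed-off-by: Second Co-Author <second@coauthor.example.org>
Signed-off-by: From Author <from@author.example.org>
```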


* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-09-30 15:19   ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
@ 2020-09-30 22:00     ` Jann Horn
  2020-09-30 23:12       ` Kees Cook
  2020-10-01 12:06       ` YiFei Zhu
  2020-09-30 22:59     ` Kees Cook
  1 sibling, 2 replies; 135+ messages in thread
From: Jann Horn @ 2020-09-30 22:00 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> Currently the kernel does not provide an infrastructure to translate
> architecture numbers to a human-readable name. Translating syscall
> numbers to syscall names is possible through FTRACE_SYSCALL
> infrastructure but it does not provide support for compat syscalls.
>
> This will create a file for each PID as /proc/pid/seccomp_cache.
> The file will be empty when no seccomp filters are loaded, or be
> in the format of:
> <arch name> <decimal syscall number> <ALLOW | FILTER>
> where ALLOW means the cache is guaranteed to allow the syscall,
> and FILTER means the cache will pass the syscall to the BPF filter.
>
> For the docker default profile on x86_64 it looks like:
> x86_64 0 ALLOW
> x86_64 1 ALLOW
> x86_64 2 ALLOW
> x86_64 3 ALLOW
> [...]
> x86_64 132 ALLOW
> x86_64 133 ALLOW
> x86_64 134 FILTER
> x86_64 135 FILTER
> x86_64 136 FILTER
> x86_64 137 ALLOW
> x86_64 138 ALLOW
> x86_64 139 FILTER
> x86_64 140 ALLOW
> x86_64 141 ALLOW
> [...]

Oooh, neat! :) Thanks!

> Suggested-by: Jann Horn <jannh@google.com>
> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> ---
>  arch/Kconfig                   | 15 +++++++++++
>  arch/x86/include/asm/seccomp.h |  3 +++
>  fs/proc/base.c                 |  3 +++
>  include/linux/seccomp.h        |  5 ++++
>  kernel/seccomp.c               | 46 ++++++++++++++++++++++++++++++++++
>  5 files changed, 72 insertions(+)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index ca867b2a5d71..b840cadcc882 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -478,6 +478,7 @@ config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
>           - all the requirements for HAVE_ARCH_SECCOMP_FILTER
>           - SECCOMP_ARCH_DEFAULT
>           - SECCOMP_ARCH_DEFAULT_NR
> +         - SECCOMP_ARCH_DEFAULT_NAME
>
>  config SECCOMP
>         prompt "Enable seccomp to safely execute untrusted bytecode"
> @@ -532,6 +533,20 @@ config SECCOMP_CACHE_NR_ONLY
>
>  endchoice
>
> +config DEBUG_SECCOMP_CACHE
> +       bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
> +       depends on SECCOMP_CACHE_NR_ONLY
> +       depends on PROC_FS
> +       help
> +         This is enables /proc/pid/seccomp_cache interface to monitor

nit: s/is enables/enables/

> +         seccomp cache data. The file format is subject to change. Reading
> +         the file requires CAP_SYS_ADMIN.
> +
> +         This option is for debugging only. Enabling it presents the risk that
> +         an adversary may be able to infer the seccomp filter logic.
> +
> +         If unsure, say N.
> +
[...]
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
[...]
> +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
> +                          struct pid *pid, struct task_struct *task)
> +{
> +       struct seccomp_filter *f;
> +
> +       /*
> +        * We don't want some sandboxed process to know what its seccomp
> +        * filters consist of.
> +        */
> +       if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN))
> +               return -EACCES;
> +
> +       f = READ_ONCE(task->seccomp.filter);
> +       if (!f)
> +               return 0;

Hmm, this won't work, because the task could be exiting, and seccomp
filters are detached in release_task() (using
seccomp_filter_release()). And at the moment, seccomp_filter_release()
just locklessly NULLs out the tsk->seccomp.filter pointer and drops
the reference.

The locking here is kind of gross, but basically I think you can
change this code to use lock_task_sighand() / unlock_task_sighand()
(see the other examples in fs/proc/base.c), and bail out if
lock_task_sighand() returns NULL. And in seccomp_filter_release(), add
something like this:

/* We are effectively holding the siglock by not having any sighand. */
WARN_ON(tsk->sighand != NULL);
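Concretely, the lock_task_sighand() variant might look something like
this (an illustrative sketch only, not compile-tested; it reuses the
proc_pid_seccomp_cache_arch() helper from the patch):

```c
int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
			   struct pid *pid, struct task_struct *task)
{
	struct seccomp_filter *f;
	unsigned long flags;

	if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN))
		return -EACCES;

	/*
	 * The filter is detached in release_task() under siglock, so take
	 * the same lock and bail out if the task has already lost its
	 * sighand.
	 */
	if (!lock_task_sighand(task, &flags))
		return -ESRCH;

	f = READ_ONCE(task->seccomp.filter);
	if (!f) {
		unlock_task_sighand(task, &flags);
		return 0;
	}

#ifdef SECCOMP_ARCH_DEFAULT
	proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_DEFAULT_NAME,
				    f->cache.syscall_allow_default,
				    SECCOMP_ARCH_DEFAULT_NR);
#endif
#ifdef SECCOMP_ARCH_COMPAT
	proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME,
				    f->cache.syscall_allow_compat,
				    SECCOMP_ARCH_COMPAT_NR);
#endif
	unlock_task_sighand(task, &flags);
	return 0;
}
```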

> +#ifdef SECCOMP_ARCH_DEFAULT
> +       proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_DEFAULT_NAME,
> +                                   f->cache.syscall_allow_default,
> +                                   SECCOMP_ARCH_DEFAULT_NR);
> +#endif /* SECCOMP_ARCH_DEFAULT */
> +
> +#ifdef SECCOMP_ARCH_COMPAT
> +       proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME,
> +                                   f->cache.syscall_allow_compat,
> +                                   SECCOMP_ARCH_COMPAT_NR);
> +#endif /* SECCOMP_ARCH_COMPAT */
> +       return 0;
> +}
> +#endif /* CONFIG_DEBUG_SECCOMP_CACHE */
> --
> 2.28.0
>


* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-09-30 15:19   ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
@ 2020-09-30 22:24     ` Jann Horn
  2020-09-30 22:49       ` Kees Cook
  2020-10-01 11:28       ` YiFei Zhu
  2020-09-30 22:40     ` Kees Cook
  2020-10-09  4:47     ` YiFei Zhu
  2 siblings, 2 replies; 135+ messages in thread
From: Jann Horn @ 2020-09-30 22:24 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not
> access any syscall arguments or instruction pointer. To facilitate
> this we need a static analyser to know whether a filter will
> return allow regardless of syscall arguments for a given
> architecture number / syscall number pair. This is implemented
> here with a pseudo-emulator, and stored in a per-filter bitmap.
>
> Each common BPF instruction is emulated. Any weirdness or loading
> from a syscall argument will cause the emulator to bail.
>
> The emulation is also halted if it reaches a return. In that case,
> if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
>
> Emulator structure and comments are from Kees [1] and Jann [2].
>
> Emulation is done at attach time. If a filter depends on more
> filters, and if the dependee does not guarantee to allow the
> syscall, then we skip the emulation of this syscall.
>
> [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
> [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
[...]
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 1ab22869a765..ff5289228ea5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -150,6 +150,7 @@ config X86
>         select HAVE_ARCH_COMPAT_MMAP_BASES      if MMU && COMPAT
>         select HAVE_ARCH_PREL32_RELOCATIONS
>         select HAVE_ARCH_SECCOMP_FILTER
> +       select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
>         select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>         select HAVE_ARCH_STACKLEAK
>         select HAVE_ARCH_TRACEHOOK

If you did the architecture enablement for X86 later in the series,
you could move this part over into that patch, that'd be cleaner.

> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index ae6b40cc39f4..f09c9e74ae05 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -143,6 +143,37 @@ struct notification {
>         struct list_head notifications;
>  };
>
> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +/**
> + * struct seccomp_cache_filter_data - container for cache's per-filter data
> + *
> + * Tis struct is ordered to minimize padding holes.

I think this comment can probably go away, there isn't really much
trickery around padding holes in the struct as it is now.

> + * @syscall_allow_default: A bitmap where each bit represents whether the
> + *                        filter willalways allow the syscall, for the

nit: s/willalways/will always/

[...]
> +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
> +                                        void *bitmap, const void *bitmap_prev,
> +                                        size_t bitmap_size, int arch)
> +{
> +       struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
> +       struct seccomp_data sd;
> +       int nr;
> +
> +       for (nr = 0; nr < bitmap_size; nr++) {
> +               if (bitmap_prev && !test_bit(nr, bitmap_prev))
> +                       continue;
> +
> +               sd.nr = nr;
> +               sd.arch = arch;
> +
> +               if (seccomp_emu_is_const_allow(fprog, &sd))
> +                       set_bit(nr, bitmap);

set_bit() is atomic, but since we only do this at filter setup, before
the filter becomes globally visible, we don't need atomicity here. So
this should probably use __set_bit() instead.


* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-09-30 15:19   ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
  2020-09-30 22:24     ` Jann Horn
@ 2020-09-30 22:40     ` Kees Cook
  2020-10-01 11:52       ` YiFei Zhu
  2020-10-09  4:47     ` YiFei Zhu
  2 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-30 22:40 UTC (permalink / raw)
  To: YiFei Zhu, Jann Horn
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Wed, Sep 30, 2020 at 10:19:13AM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <yifeifz2@illinois.edu>
> 
> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not
> access any syscall arguments or instruction pointer. To facilitate
> this we need a static analyser to know whether a filter will
> return allow regardless of syscall arguments for a given
> architecture number / syscall number pair. This is implemented
> here with a pseudo-emulator, and stored in a per-filter bitmap.
> 
> Each common BPF instruction is emulated. Any weirdness or loading
> from a syscall argument will cause the emulator to bail.
> 
> The emulation is also halted if it reaches a return. In that case,
> if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
> 
> Emulator structure and comments are from Kees [1] and Jann [2].
> 
> Emulation is done at attach time. If a filter depends on more
> filters, and if the dependee does not guarantee to allow the
> syscall, then we skip the emulation of this syscall.
> 
> [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
> [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
> 
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>

See comments on patch 3 for reorganizing this a bit for the next
version.

For the infrastructure patch, I'd like to see much of the cover letter
in the commit log (otherwise those details are harder for people to
find). That will describe the _why_ for preparing this change, etc.

For the emulator patch, I'd like to see the discussion about how the
subset of BPF instructions was selected, what libraries Jann and I
examined, etc.

(For all of these commit logs, I try to pretend that whoever is reading
it has not followed any lkml thread of discussion, etc.)

> ---
>  arch/Kconfig     |  34 ++++++++++
>  arch/x86/Kconfig |   1 +
>  kernel/seccomp.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 201 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 21a3675a7a3a..ca867b2a5d71 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -471,6 +471,14 @@ config HAVE_ARCH_SECCOMP_FILTER
>  	    results in the system call being skipped immediately.
>  	  - seccomp syscall wired up
>  
> +config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
> +	bool
> +	help
> +	  An arch should select this symbol if it provides all of these things:
> +	  - all the requirements for HAVE_ARCH_SECCOMP_FILTER
> +	  - SECCOMP_ARCH_DEFAULT
> +	  - SECCOMP_ARCH_DEFAULT_NR
> +

There's no need for this config and the per-arch Kconfig clutter:
SECCOMP_ARCH_NATIVE will be a sufficient gate.

>  config SECCOMP
>  	prompt "Enable seccomp to safely execute untrusted bytecode"
>  	def_bool y
> @@ -498,6 +506,32 @@ config SECCOMP_FILTER
>  
>  	  See Documentation/userspace-api/seccomp_filter.rst for details.
>  
> +choice
> +	prompt "Seccomp filter cache"
> +	default SECCOMP_CACHE_NONE
> +	depends on SECCOMP_FILTER
> +	depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
> +	help
> +	  Seccomp filters can potentially incur large overhead for each
> +	  system call. This can alleviate some of the overhead.
> +
> +	  If in doubt, select 'syscall numbers only'.
> +
> +config SECCOMP_CACHE_NONE
> +	bool "None"
> +	help
> +	  No caching is done. Seccomp filters will be called each time
> +	  a system call occurs in a seccomp-guarded task.
> +
> +config SECCOMP_CACHE_NR_ONLY
> +	bool "Syscall number only"
> +	depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
> +	help
> +	  For each syscall number, if the seccomp filter has a fixed
> +	  result, store that result in a bitmap to speed up system calls.
> +
> +endchoice

I don't want this config: there is only 1 caching mechanism happening
in this series and I do not want to have it buildable as "off": it
should be available for all supported architectures. When further caching
methods happen, the config can be introduced then (though I'll likely
argue it should then be a boot param to allow distro kernels to make it
selectable).

> +
>  config HAVE_ARCH_STACKLEAK
>  	bool
>  	help
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 1ab22869a765..ff5289228ea5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -150,6 +150,7 @@ config X86
>  	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
>  	select HAVE_ARCH_PREL32_RELOCATIONS
>  	select HAVE_ARCH_SECCOMP_FILTER
> +	select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
>  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>  	select HAVE_ARCH_STACKLEAK
>  	select HAVE_ARCH_TRACEHOOK
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index ae6b40cc39f4..f09c9e74ae05 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -143,6 +143,37 @@ struct notification {
>  	struct list_head notifications;
>  };
>  
> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +/**
> + * struct seccomp_cache_filter_data - container for cache's per-filter data

naming nits: "data" doesn't tell me anything. "seccomp_action_cache"
might be better. Or since it's an internal struct, maybe just
"action_cache". And let's not use the word "container" for the kerndoc. ;)
How about "per-filter cache of seccomp actions per arch/syscall pair"

> + *
> + * Tis struct is ordered to minimize padding holes.

typo: This

> + *
> + * @syscall_allow_default: A bitmap where each bit represents whether the
> + *			   filter willalways allow the syscall, for the

typo: missing space

> + *			   default architecture.

default -> native

> + * @syscall_allow_compat: A bitmap where each bit represents whether the
> + *		          filter will always allow the syscall, for the
> + *			  compat architecture.
> + */
> +struct seccomp_cache_filter_data {
> +#ifdef SECCOMP_ARCH_DEFAULT
> +	DECLARE_BITMAP(syscall_allow_default, SECCOMP_ARCH_DEFAULT_NR);

naming nit: "syscall" is redundant here, IMO. "allow_native" should be
fine.

> +#endif
> +#ifdef SECCOMP_ARCH_COMPAT
> +	DECLARE_BITMAP(syscall_allow_compat, SECCOMP_ARCH_COMPAT_NR);
> +#endif
> +};
> +
> +#define SECCOMP_EMU_MAX_PENDING_STATES 64
> +#else
> +struct seccomp_cache_filter_data { };
> +
> +static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
> +{
> +}
> +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
> +
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
>   *
> @@ -159,6 +190,7 @@ struct notification {
>   *	   this filter after reaching 0. The @users count is always smaller
>   *	   or equal to @refs. Hence, reaching 0 for @users does not mean
>   *	   the filter can be freed.
> + * @cache: container for cache-related data.

more descriptive: "cache of arch/syscall mappings to actions"

>   * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
>   * @prev: points to a previously installed, or inherited, filter
>   * @prog: the BPF program to evaluate
> @@ -180,6 +212,7 @@ struct seccomp_filter {
>  	refcount_t refs;
>  	refcount_t users;
>  	bool log;
> +	struct seccomp_cache_filter_data cache;
>  	struct seccomp_filter *prev;
>  	struct bpf_prog *prog;
>  	struct notification *notif;
> @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>  {
>  	struct seccomp_filter *sfilter;
>  	int ret;
> -	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
> +	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
> +			       IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY);
>  
>  	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
>  		return ERR_PTR(-EINVAL);
> @@ -610,6 +644,136 @@ seccomp_prepare_user_filter(const char __user *user_filter)
>  	return filter;
>  }
>  
> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +/**
> + * seccomp_emu_is_const_allow - check if filter is constant allow with given data
> + * @fprog: The BPF programs
> + * @sd: The seccomp data to check against; only the syscall number and
> + *      arch number are considered constant.
> + */
> +static bool seccomp_emu_is_const_allow(struct sock_fprog_kern *fprog,
> +				       struct seccomp_data *sd)

naming: I would drop "emu" from here. The caller doesn't care how it is
determined. ;)

> +{
> +	unsigned int insns;
> +	unsigned int reg_value = 0;
> +	unsigned int pc;
> +	bool op_res;
> +
> +	if (WARN_ON_ONCE(!fprog))
> +		return false;
> +
> +	insns = bpf_classic_proglen(fprog);
> +	for (pc = 0; pc < insns; pc++) {
> +		struct sock_filter *insn = &fprog->filter[pc];
> +		u16 code = insn->code;
> +		u32 k = insn->k;
> +
> +		switch (code) {
> +		case BPF_LD | BPF_W | BPF_ABS:
> +			switch (k) {
> +			case offsetof(struct seccomp_data, nr):
> +				reg_value = sd->nr;
> +				break;
> +			case offsetof(struct seccomp_data, arch):
> +				reg_value = sd->arch;
> +				break;
> +			default:
> +				/* can't optimize (non-constant value load) */
> +				return false;
> +			}
> +			break;
> +		case BPF_RET | BPF_K:
> +			/* reached return with constant values only, check allow */
> +			return k == SECCOMP_RET_ALLOW;
> +		case BPF_JMP | BPF_JA:
> +			pc += insn->k;
> +			break;
> +		case BPF_JMP | BPF_JEQ | BPF_K:
> +		case BPF_JMP | BPF_JGE | BPF_K:
> +		case BPF_JMP | BPF_JGT | BPF_K:
> +		case BPF_JMP | BPF_JSET | BPF_K:
> +			switch (BPF_OP(code)) {
> +			case BPF_JEQ:
> +				op_res = reg_value == k;
> +				break;
> +			case BPF_JGE:
> +				op_res = reg_value >= k;
> +				break;
> +			case BPF_JGT:
> +				op_res = reg_value > k;
> +				break;
> +			case BPF_JSET:
> +				op_res = !!(reg_value & k);
> +				break;
> +			default:
> +				/* can't optimize (unknown jump) */
> +				return false;
> +			}
> +
> +			pc += op_res ? insn->jt : insn->jf;
> +			break;
> +		case BPF_ALU | BPF_AND | BPF_K:
> +			reg_value &= k;
> +			break;
> +		default:
> +			/* can't optimize (unknown insn) */
> +			return false;
> +		}
> +	}
> +
> +	/* ran off the end of the filter?! */
> +	WARN_ON(1);
> +	return false;
> +}

For the emulator patch, you'll want to include these tags in the commit
log:

Suggested-by: Jann Horn <jannh@google.com>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>

> +
> +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
> +					 void *bitmap, const void *bitmap_prev,
> +					 size_t bitmap_size, int arch)
> +{
> +	struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
> +	struct seccomp_data sd;
> +	int nr;
> +
> +	for (nr = 0; nr < bitmap_size; nr++) {
> +		if (bitmap_prev && !test_bit(nr, bitmap_prev))
> +			continue;
> +
> +		sd.nr = nr;
> +		sd.arch = arch;
> +
> +		if (seccomp_emu_is_const_allow(fprog, &sd))
> +			set_bit(nr, bitmap);

The guiding principle with seccomp's designs is to always make things
_more_ restrictive, never less. While we can never escape the
consequences of having seccomp_is_const_allow() report the wrong
answer, we can at least follow the basic principles, hopefully
minimizing the impact.

When the bitmap starts with "always allowed" and we only flip it towards
"run full filters", we're only ever making things more restrictive. If
we instead go from "run full filters" towards "always allowed", we run
the risk of making things less restrictive. For example: a process that
maliciously adds a filter that the emulator mistakenly evaluates to
"always allow" doesn't suddenly cause all the prior filters to stop running.
(i.e. this isolates the flaw outcome, and doesn't depend on the early
"do not emulate if we already know we have to run filters" case before
the emulation call: there is no code path that allows the cache to
weaken: it can only maintain it being wrong).

Without any seccomp filter installed, all syscalls are "always allowed"
(from the perspective of the seccomp boundary), so the default of the
cache needs to be "always allowed".


	if (bitmap_prev) {
		/* The new filter must be as restrictive as the last. */
		bitmap_copy(bitmap, bitmap_prev, bitmap_size);
	} else {
		/* Before any filters, all syscalls are always allowed. */
		bitmap_fill(bitmap, bitmap_size);
	}

	for (nr = 0; nr < bitmap_size; nr++) {
		/* No bitmap change: not a cacheable action. */
		if (bitmap_prev && !test_bit(nr, bitmap_prev))
			continue;

		/* No bitmap change: continue to always allow. */
		if (seccomp_is_const_allow(fprog, &sd))
			continue;

		/* Not a cacheable action: always run filters. */
		clear_bit(nr, bitmap);

> +	}
> +}
> +
> +/**
> + * seccomp_cache_prepare - emulate the filter to find cachable syscalls
> + * @sfilter: The seccomp filter
> + *
> + * Returns 0 if successful or -errno if error occurred.
> + */
> +static void seccomp_cache_prepare(struct seccomp_filter *sfilter)
> +{
> +	struct seccomp_cache_filter_data *cache = &sfilter->cache;
> +	const struct seccomp_cache_filter_data *cache_prev =
> +		sfilter->prev ? &sfilter->prev->cache : NULL;
> +
> +#ifdef SECCOMP_ARCH_DEFAULT
> +	seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_default,
> +				     cache_prev ? cache_prev->syscall_allow_default : NULL,
> +				     SECCOMP_ARCH_DEFAULT_NR,
> +				     SECCOMP_ARCH_DEFAULT);
> +#endif /* SECCOMP_ARCH_DEFAULT */
> +
> +#ifdef SECCOMP_ARCH_COMPAT
> +	seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_compat,
> +				     cache_prev ? cache_prev->syscall_allow_compat : NULL,
> +				     SECCOMP_ARCH_COMPAT_NR,
> +				     SECCOMP_ARCH_COMPAT);
> +#endif /* SECCOMP_ARCH_COMPAT */
> +}
> +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
> +
>  /**
>   * seccomp_attach_filter: validate and attach filter
>   * @flags:  flags to change filter behavior
> @@ -659,6 +823,7 @@ static long seccomp_attach_filter(unsigned int flags,
>  	 * task reference.
>  	 */
>  	filter->prev = current->seccomp.filter;
> +	seccomp_cache_prepare(filter);
>  	current->seccomp.filter = filter;

Jann, do we need a WRITE_ONCE() or something when writing
current->seccomp.filter here? I think the rmb() in __seccomp_filter() will
cover the cache bitmap writes having finished before the filter pointer
is followed in the TSYNC case.

>  	atomic_inc(&current->seccomp.filter_count);
>  
> -- 
> 2.28.0
> 

Otherwise, yes, I'm looking forward to having this for everyone to use!
:)

-- 
Kees Cook


* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-09-30 22:24     ` Jann Horn
@ 2020-09-30 22:49       ` Kees Cook
  2020-10-01 11:28       ` YiFei Zhu
  1 sibling, 0 replies; 135+ messages in thread
From: Kees Cook @ 2020-09-30 22:49 UTC (permalink / raw)
  To: Jann Horn
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Oct 01, 2020 at 12:24:32AM +0200, Jann Horn wrote:
> On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not
> > access any syscall arguments or instruction pointer. To facilitate
> > this we need a static analyser to know whether a filter will
> > return allow regardless of syscall arguments for a given
> > architecture number / syscall number pair. This is implemented
> > here with a pseudo-emulator, and stored in a per-filter bitmap.
> >
> > Each common BPF instruction is emulated. Any weirdness or loading
> > from a syscall argument will cause the emulator to bail.
> >
> > The emulation is also halted if it reaches a return. In that case,
> > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
> >
> > Emulator structure and comments are from Kees [1] and Jann [2].
> >
> > Emulation is done at attach time. If a filter depends on more
> > filters, and if the dependee does not guarantee to allow the
> > syscall, then we skip the emulation of this syscall.
> >
> > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
> > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
> [...]
> > +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
> > +                                        void *bitmap, const void *bitmap_prev,
> > +                                        size_t bitmap_size, int arch)
> > +{
> > +       struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
> > +       struct seccomp_data sd;
> > +       int nr;
> > +
> > +       for (nr = 0; nr < bitmap_size; nr++) {
> > +               if (bitmap_prev && !test_bit(nr, bitmap_prev))
> > +                       continue;
> > +
> > +               sd.nr = nr;
> > +               sd.arch = arch;
> > +
> > +               if (seccomp_emu_is_const_allow(fprog, &sd))
> > +                       set_bit(nr, bitmap);
> 
> set_bit() is atomic, but since we only do this at filter setup, before
> the filter becomes globally visible, we don't need atomicity here. So
> this should probably use __set_bit() instead.

Oh yes, excellent point! That will speed this up a bit. When you do
this, please include a comment here describing why it's safe to do it
non-atomically. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking
  2020-09-30 21:33       ` Jann Horn
@ 2020-09-30 22:53         ` Kees Cook
  2020-09-30 23:15           ` Jann Horn
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-30 22:53 UTC (permalink / raw)
  To: Jann Horn
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Wed, Sep 30, 2020 at 11:33:15PM +0200, Jann Horn wrote:
> On Wed, Sep 30, 2020 at 11:21 PM Kees Cook <keescook@chromium.org> wrote:
> > On Wed, Sep 30, 2020 at 10:19:12AM -0500, YiFei Zhu wrote:
> > > From: Kees Cook <keescook@chromium.org>
> > >
> > > Provide seccomp internals with the details to calculate which syscall
> > > table the running kernel is expecting to deal with. This allows for
> > > efficient architecture pinning and paves the way for constant-action
> > > bitmaps.
> > >
> > > Signed-off-by: Kees Cook <keescook@chromium.org>
> > > [YiFei: Removed x32, added macro for nr_syscalls]
> > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> [...]
> > But otherwise, yes, looks good to me. For this patch, I think the S-o-b chain is probably more
> > accurately captured as:
> >
> > Signed-off-by: Kees Cook <keescook@chromium.org>
> > Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> 
> (Technically, https://www.kernel.org/doc/html/latest/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by
> says that "every Co-developed-by: must be immediately followed by a
> Signed-off-by: of the associated co-author" (and has an example of how
> that should look).)

Right, but it is not needed for the commit author (here, the From:),
the second example given in the docs shows this:

	From: From Author <from@author.example.org>

	<changelog>

	Co-developed-by: Random Co-Author <random@coauthor.example.org>
	Signed-off-by: Random Co-Author <random@coauthor.example.org>
	Signed-off-by: From Author <from@author.example.org>
	Co-developed-by: Submitting Co-Author <sub@coauthor.example.org>
	Signed-off-by: Submitting Co-Author <sub@coauthor.example.org>

and there is no third co-developer, so it's:

	From: From Author <from@author.example.org>

	<changelog>

	Signed-off-by: From Author <from@author.example.org>
	Co-developed-by: Submitting Co-Author <sub@coauthor.example.org>
	Signed-off-by: Submitting Co-Author <sub@coauthor.example.org>

If I'm the From, and YiFei Zhu is the submitting co-developer, then
it's:

	From: Kees Cook <keescook@chromium.org>

	<changelog>

	Signed-off-by: Kees Cook <keescook@chromium.org>
	Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
	Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>

which is what I suggested.

-- 
Kees Cook


* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-09-30 15:19   ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
  2020-09-30 22:00     ` Jann Horn
@ 2020-09-30 22:59     ` Kees Cook
  2020-09-30 23:08       ` Jann Horn
  1 sibling, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-09-30 22:59 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Wed, Sep 30, 2020 at 10:19:16AM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <yifeifz2@illinois.edu>
> 
> Currently the kernel does not provide an infrastructure to translate
> architecture numbers to a human-readable name. Translating syscall
> numbers to syscall names is possible through FTRACE_SYSCALL
> infrastructure but it does not provide support for compat syscalls.
> 
> This will create a file for each PID as /proc/pid/seccomp_cache.
> The file will be empty when no seccomp filters are loaded, or be
> in the format of:
> <arch name> <decimal syscall number> <ALLOW | FILTER>
> where ALLOW means the cache is guaranteed to allow the syscall,
> and FILTER means the cache will pass the syscall to the BPF filter.
> 
> For the docker default profile on x86_64 it looks like:
> x86_64 0 ALLOW
> x86_64 1 ALLOW
> x86_64 2 ALLOW
> x86_64 3 ALLOW
> [...]
> x86_64 132 ALLOW
> x86_64 133 ALLOW
> x86_64 134 FILTER
> x86_64 135 FILTER
> x86_64 136 FILTER
> x86_64 137 ALLOW
> x86_64 138 ALLOW
> x86_64 139 FILTER
> x86_64 140 ALLOW
> x86_64 141 ALLOW
> [...]
> 
> This file is guarded by CONFIG_DEBUG_SECCOMP_CACHE with a default
> of N because I think certain users of seccomp might not want the
> application to know which syscalls are definitely usable. For
> the same reason, it is also guarded by CAP_SYS_ADMIN.
> 
> Suggested-by: Jann Horn <jannh@google.com>
> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> ---
>  arch/Kconfig                   | 15 +++++++++++
>  arch/x86/include/asm/seccomp.h |  3 +++
>  fs/proc/base.c                 |  3 +++
>  include/linux/seccomp.h        |  5 ++++
>  kernel/seccomp.c               | 46 ++++++++++++++++++++++++++++++++++
>  5 files changed, 72 insertions(+)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index ca867b2a5d71..b840cadcc882 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -478,6 +478,7 @@ config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
>  	  - all the requirements for HAVE_ARCH_SECCOMP_FILTER
>  	  - SECCOMP_ARCH_DEFAULT
>  	  - SECCOMP_ARCH_DEFAULT_NR
> +	  - SECCOMP_ARCH_DEFAULT_NAME
>  
>  config SECCOMP
>  	prompt "Enable seccomp to safely execute untrusted bytecode"
> @@ -532,6 +533,20 @@ config SECCOMP_CACHE_NR_ONLY
>  
>  endchoice
>  
> +config DEBUG_SECCOMP_CACHE

naming nit: I prefer where-what-how order, so SECCOMP_CACHE_DEBUG.

> +	bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
> +	depends on SECCOMP_CACHE_NR_ONLY
> +	depends on PROC_FS
> +	help
> +	  This enables the /proc/pid/seccomp_cache interface to monitor
> +	  seccomp cache data. The file format is subject to change. Reading
> +	  the file requires CAP_SYS_ADMIN.
> +
> +	  This option is for debugging only. Enabling it presents the risk that
> +	  an adversary may be able to infer the seccomp filter logic.
> +
> +	  If unsure, say N.
> +
>  config HAVE_ARCH_STACKLEAK
>  	bool
>  	help
> diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
> index 7b3a58271656..33ccc074be7a 100644
> --- a/arch/x86/include/asm/seccomp.h
> +++ b/arch/x86/include/asm/seccomp.h
> @@ -19,13 +19,16 @@
>  #ifdef CONFIG_X86_64
>  # define SECCOMP_ARCH_DEFAULT			AUDIT_ARCH_X86_64
>  # define SECCOMP_ARCH_DEFAULT_NR		NR_syscalls
> +# define SECCOMP_ARCH_DEFAULT_NAME		"x86_64"
>  # ifdef CONFIG_COMPAT
>  #  define SECCOMP_ARCH_COMPAT			AUDIT_ARCH_I386
>  #  define SECCOMP_ARCH_COMPAT_NR		IA32_NR_syscalls
> +#  define SECCOMP_ARCH_COMPAT_NAME		"x86_32"

I think this should be "ia32"? Is there a good definitive guide on this
naming convention?

-- 
Kees Cook


* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-09-30 22:59     ` Kees Cook
@ 2020-09-30 23:08       ` Jann Horn
  2020-09-30 23:21         ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: Jann Horn @ 2020-09-30 23:08 UTC (permalink / raw)
  To: Kees Cook
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers

[adding x86 folks to enhance bikeshedding]

On Thu, Oct 1, 2020 at 12:59 AM Kees Cook <keescook@chromium.org> wrote:
> On Wed, Sep 30, 2020 at 10:19:16AM -0500, YiFei Zhu wrote:
> > From: YiFei Zhu <yifeifz2@illinois.edu>
> >
> > Currently the kernel does not provide an infrastructure to translate
> > architecture numbers to a human-readable name. Translating syscall
> > numbers to syscall names is possible through FTRACE_SYSCALL
> > infrastructure but it does not provide support for compat syscalls.
> >
> > This will create a file for each PID as /proc/pid/seccomp_cache.
> > The file will be empty when no seccomp filters are loaded, or be
> > in the format of:
> > <arch name> <decimal syscall number> <ALLOW | FILTER>
> > where ALLOW means the cache is guaranteed to allow the syscall,
> > and FILTER means the cache will pass the syscall to the BPF filter.
> >
> > For the docker default profile on x86_64 it looks like:
> > x86_64 0 ALLOW
> > x86_64 1 ALLOW
> > x86_64 2 ALLOW
> > x86_64 3 ALLOW
> > [...]
> > x86_64 132 ALLOW
> > x86_64 133 ALLOW
> > x86_64 134 FILTER
> > x86_64 135 FILTER
> > x86_64 136 FILTER
> > x86_64 137 ALLOW
> > x86_64 138 ALLOW
> > x86_64 139 FILTER
> > x86_64 140 ALLOW
> > x86_64 141 ALLOW
[...]
> > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
> > index 7b3a58271656..33ccc074be7a 100644
> > --- a/arch/x86/include/asm/seccomp.h
> > +++ b/arch/x86/include/asm/seccomp.h
> > @@ -19,13 +19,16 @@
> >  #ifdef CONFIG_X86_64
> >  # define SECCOMP_ARCH_DEFAULT                        AUDIT_ARCH_X86_64
> >  # define SECCOMP_ARCH_DEFAULT_NR             NR_syscalls
> > +# define SECCOMP_ARCH_DEFAULT_NAME           "x86_64"
> >  # ifdef CONFIG_COMPAT
> >  #  define SECCOMP_ARCH_COMPAT                        AUDIT_ARCH_I386
> >  #  define SECCOMP_ARCH_COMPAT_NR             IA32_NR_syscalls
> > +#  define SECCOMP_ARCH_COMPAT_NAME           "x86_32"
>
> I think this should be "ia32"? Is there a good definitive guide on this
> naming convention?

"man 2 syscall" calls them "x86-64" and "i386". The syscall table
files use ABI names "i386" and "64". The syscall stub prefixes use
"x64" and "ia32".

I don't think we have a good consistent naming strategy here. :P


* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-09-30 22:00     ` Jann Horn
@ 2020-09-30 23:12       ` Kees Cook
  2020-10-01 12:06       ` YiFei Zhu
  1 sibling, 0 replies; 135+ messages in thread
From: Kees Cook @ 2020-09-30 23:12 UTC (permalink / raw)
  To: Jann Horn
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Oct 01, 2020 at 12:00:46AM +0200, Jann Horn wrote:
> On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> [...]
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> [...]
> > +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
> > +                          struct pid *pid, struct task_struct *task)
> > +{
> > +       struct seccomp_filter *f;
> > +
> > +       /*
> > +        * We don't want a sandboxed process to know what its seccomp
> > +        * filters consist of.
> > +        */
> > +       if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN))
> > +               return -EACCES;
> > +
> > +       f = READ_ONCE(task->seccomp.filter);
> > +       if (!f)
> > +               return 0;
> 
> Hmm, this won't work, because the task could be exiting, and seccomp
> filters are detached in release_task() (using
> seccomp_filter_release()). And at the moment, seccomp_filter_release()
> just locklessly NULLs out the tsk->seccomp.filter pointer and drops
> the reference.

Oh nice catch. Yeah, this would only happen if it was the only filter
remaining on a process with no children, etc.

> 
> The locking here is kind of gross, but basically I think you can
> change this code to use lock_task_sighand() / unlock_task_sighand()
> (see the other examples in fs/proc/base.c), and bail out if
> lock_task_sighand() returns NULL. And in seccomp_filter_release(), add
> something like this:
> 
> /* We are effectively holding the siglock by not having any sighand. */
> WARN_ON(tsk->sighand != NULL);

Yeah, good idea.

-- 
Kees Cook


* Re: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking
  2020-09-30 22:53         ` Kees Cook
@ 2020-09-30 23:15           ` Jann Horn
  0 siblings, 0 replies; 135+ messages in thread
From: Jann Horn @ 2020-09-30 23:15 UTC (permalink / raw)
  To: Kees Cook
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Oct 1, 2020 at 12:53 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Wed, Sep 30, 2020 at 11:33:15PM +0200, Jann Horn wrote:
> > On Wed, Sep 30, 2020 at 11:21 PM Kees Cook <keescook@chromium.org> wrote:
> > > On Wed, Sep 30, 2020 at 10:19:12AM -0500, YiFei Zhu wrote:
> > > > From: Kees Cook <keescook@chromium.org>
> > > >
> > > > Provide seccomp internals with the details to calculate which syscall
> > > > table the running kernel is expecting to deal with. This allows for
> > > > efficient architecture pinning and paves the way for constant-action
> > > > bitmaps.
> > > >
> > > > Signed-off-by: Kees Cook <keescook@chromium.org>
> > > > [YiFei: Removed x32, added macro for nr_syscalls]
> > > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> > [...]
> > > But otherwise, yes, looks good to me. For this patch, I think the S-o-b chain is probably more
> > > accurately captured as:
> > >
> > > Signed-off-by: Kees Cook <keescook@chromium.org>
> > > Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
> > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> >
> > (Technically, https://www.kernel.org/doc/html/latest/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by
> > says that "every Co-developed-by: must be immediately followed by a
> > Signed-off-by: of the associated co-author" (and has an example of how
> > that should look).)
>
> Right, but it is not needed for the commit author (here, the From:),
> the second example given in the docs shows this:

Aah, right. Nevermind, sorry for the noise.


* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-09-30 23:08       ` Jann Horn
@ 2020-09-30 23:21         ` Kees Cook
  0 siblings, 0 replies; 135+ messages in thread
From: Kees Cook @ 2020-09-30 23:21 UTC (permalink / raw)
  To: Jann Horn
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers

On Thu, Oct 01, 2020 at 01:08:04AM +0200, Jann Horn wrote:
> [adding x86 folks to enhance bikeshedding]
> 
> On Thu, Oct 1, 2020 at 12:59 AM Kees Cook <keescook@chromium.org> wrote:
> > On Wed, Sep 30, 2020 at 10:19:16AM -0500, YiFei Zhu wrote:
> > > From: YiFei Zhu <yifeifz2@illinois.edu>
> > >
> > > Currently the kernel does not provide an infrastructure to translate
> > > architecture numbers to a human-readable name. Translating syscall
> > > numbers to syscall names is possible through FTRACE_SYSCALL
> > > infrastructure but it does not provide support for compat syscalls.
> > >
> > > This will create a file for each PID as /proc/pid/seccomp_cache.
> > > The file will be empty when no seccomp filters are loaded, or be
> > > in the format of:
> > > <arch name> <decimal syscall number> <ALLOW | FILTER>
> > > where ALLOW means the cache is guaranteed to allow the syscall,
> > > and FILTER means the cache will pass the syscall to the BPF filter.
> > >
> > > For the docker default profile on x86_64 it looks like:
> > > x86_64 0 ALLOW
> > > x86_64 1 ALLOW
> > > x86_64 2 ALLOW
> > > x86_64 3 ALLOW
> > > [...]
> > > x86_64 132 ALLOW
> > > x86_64 133 ALLOW
> > > x86_64 134 FILTER
> > > x86_64 135 FILTER
> > > x86_64 136 FILTER
> > > x86_64 137 ALLOW
> > > x86_64 138 ALLOW
> > > x86_64 139 FILTER
> > > x86_64 140 ALLOW
> > > x86_64 141 ALLOW
> [...]
> > > diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
> > > index 7b3a58271656..33ccc074be7a 100644
> > > --- a/arch/x86/include/asm/seccomp.h
> > > +++ b/arch/x86/include/asm/seccomp.h
> > > @@ -19,13 +19,16 @@
> > >  #ifdef CONFIG_X86_64
> > >  # define SECCOMP_ARCH_DEFAULT                        AUDIT_ARCH_X86_64
> > >  # define SECCOMP_ARCH_DEFAULT_NR             NR_syscalls
> > > +# define SECCOMP_ARCH_DEFAULT_NAME           "x86_64"
> > >  # ifdef CONFIG_COMPAT
> > >  #  define SECCOMP_ARCH_COMPAT                        AUDIT_ARCH_I386
> > >  #  define SECCOMP_ARCH_COMPAT_NR             IA32_NR_syscalls
> > > +#  define SECCOMP_ARCH_COMPAT_NAME           "x86_32"
> >
> > I think this should be "ia32"? Is there a good definitive guide on this
> > naming convention?
> 
> "man 2 syscall" calls them "x86-64" and "i386". The syscall table
> files use ABI names "i386" and "64". The syscall stub prefixes use
> "x64" and "ia32".
> 
> I don't think we have a good consistent naming strategy here. :P

Agreed. And with "i386" being so hopelessly inaccurate, I prefer
"ia32" ... *shrug*

I would hope we don't have to be super-pedantic and call them "x86-64" and "IA-32". :P

-- 
Kees Cook


* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-09-30 22:24     ` Jann Horn
  2020-09-30 22:49       ` Kees Cook
@ 2020-10-01 11:28       ` YiFei Zhu
  2020-10-01 21:08         ` Jann Horn
  1 sibling, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-01 11:28 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Wed, Sep 30, 2020 at 5:24 PM Jann Horn <jannh@google.com> wrote:
> If you did the architecture enablement for X86 later in the series,
> you could move this part over into that patch, that'd be cleaner.

As in, patch 1: bitmap check logic. patch 2: emulator. patch 3: enable for x86?

> > + * This struct is ordered to minimize padding holes.
>
> I think this comment can probably go away, there isn't really much
> trickery around padding holes in the struct as it is now.

Oh right, I was experimenting with locks and adding bits to indicate
whether certain arches are primed, then I undid that.

> > +                       set_bit(nr, bitmap);
>
> set_bit() is atomic, but since we only do this at filter setup, before
> the filter becomes globally visible, we don't need atomicity here. So
> this should probably use __set_bit() instead.

Right

YiFei Zhu


* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-09-30 22:40     ` Kees Cook
@ 2020-10-01 11:52       ` YiFei Zhu
  2020-10-01 21:05         ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-01 11:52 UTC (permalink / raw)
  To: Kees Cook
  Cc: Jann Horn, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Wed, Sep 30, 2020 at 5:40 PM Kees Cook <keescook@chromium.org> wrote:
> I don't want this config: there is only 1 caching mechanism happening
> in this series and I do not want to have it buildable as "off": it
> should be available for all supported architectures. When further caching
> methods happen, the config can be introduced then (though I'll likely
> argue it should then be a boot param to allow distro kernels to make it
> selectable).

Alright, we can think about configuration (or boot param) when more
methods happen then.

> The guiding principle with seccomp's designs is to always make things
> _more_ restrictive, never less. While we can never escape the
> consequences of having seccomp_is_const_allow() report the wrong
> answer, we can at least follow the basic principles, hopefully
> minimizing the impact.
>
> When the bitmap starts with "always allowed" and we only flip it towards
> "run full filters", we're only ever making things more restrictive. If
> we instead go from "run full filters" towards "always allowed", we run
> the risk of making things less restrictive. For example: a process that
> maliciously adds a filter that the emulator mistakenly evaluates to
> "always allow" doesn't suddenly cause all the prior filters to stop running.
> (i.e. this isolates the flaw outcome, and doesn't depend on the early
> "do not emulate if we already know we have to run filters" case before
> the emulation call: there is no code path that allows the cache to
> weaken: it can only maintain it being wrong).
>
> Without any seccomp filter installed, all syscalls are "always allowed"
> (from the perspective of the seccomp boundary), so the default of the
> cache needs to be "always allowed".

I cannot follow this. If a 'process that maliciously adds a filter
that the emulator mistakenly evaluates to "always allow" doesn't
suddenly cause all the prior filters to stop running', then you want
the cache to be as transparent as possible by default, and to lift
the restriction only when you are absolutely sure doing so has no
impact.

In this patch, if there are prior filters, it goes through this logic:

        if (bitmap_prev && !test_bit(nr, bitmap_prev))
            continue;

Hence, if the malicious filter were to happen, and prior filters were
supposed to run, then seccomp_is_const_allow is simply not invoked --
what it returns cannot be used maliciously by an adversary.

>         if (bitmap_prev) {
>                 /* The new filter must be as restrictive as the last. */
>                 bitmap_copy(bitmap, bitmap_prev, bitmap_size);
>         } else {
>                 /* Before any filters, all syscalls are always allowed. */
>                 bitmap_fill(bitmap, bitmap_size);
>         }
>
>         for (nr = 0; nr < bitmap_size; nr++) {
>                 /* No bitmap change: not a cacheable action. */
>                 if (!test_bit(nr, bitmap_prev) ||
>                         continue;
>
>                 /* No bitmap change: continue to always allow. */
>                 if (seccomp_is_const_allow(fprog, &sd))
>                         continue;
>
>                 /* Not a cacheable action: always run filters. */
>                 clear_bit(nr, bitmap);

I'm not strongly against this logic. I just feel unconvinced that this
is any different with a slightly increased complexity.

YiFei Zhu


* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-09-30 22:00     ` Jann Horn
  2020-09-30 23:12       ` Kees Cook
@ 2020-10-01 12:06       ` YiFei Zhu
  2020-10-01 16:05         ` Jann Horn
  1 sibling, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-01 12:06 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Wed, Sep 30, 2020 at 5:01 PM Jann Horn <jannh@google.com> wrote:
> Hmm, this won't work, because the task could be exiting, and seccomp
> filters are detached in release_task() (using
> seccomp_filter_release()). And at the moment, seccomp_filter_release()
> just locklessly NULLs out the tsk->seccomp.filter pointer and drops
> the reference.
>
> The locking here is kind of gross, but basically I think you can
> change this code to use lock_task_sighand() / unlock_task_sighand()
> (see the other examples in fs/proc/base.c), and bail out if
> lock_task_sighand() returns NULL. And in seccomp_filter_release(), add
> something like this:
>
> /* We are effectively holding the siglock by not having any sighand. */
> WARN_ON(tsk->sighand != NULL);

Ah thanks. I was thinking about how tasks exit and get freed, and how
this would race against them. The last time I worked with procfs
there was some magic going on that I could not figure out, so I was
wondering whether some magic would stop the task_struct from being
released, given that it is passed in as an argument here.

I just looked at release_task and related functions; it looks like it
will, at the end, decrease the reference count of the task_struct.
Does procfs increase the refcount while calling the procfs functions?
If so, procfs functions can rely on the task_struct still being a
task_struct, but any direct effects of release_task may happen while
the procfs functions are running?

YiFei Zhu


* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-01 12:06       ` YiFei Zhu
@ 2020-10-01 16:05         ` Jann Horn
  2020-10-01 16:18           ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Jann Horn @ 2020-10-01 16:05 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Thu, Oct 1, 2020 at 2:06 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> On Wed, Sep 30, 2020 at 5:01 PM Jann Horn <jannh@google.com> wrote:
> > Hmm, this won't work, because the task could be exiting, and seccomp
> > filters are detached in release_task() (using
> > seccomp_filter_release()). And at the moment, seccomp_filter_release()
> > just locklessly NULLs out the tsk->seccomp.filter pointer and drops
> > the reference.
> >
> > The locking here is kind of gross, but basically I think you can
> > change this code to use lock_task_sighand() / unlock_task_sighand()
> > (see the other examples in fs/proc/base.c), and bail out if
> > lock_task_sighand() returns NULL. And in seccomp_filter_release(), add
> > something like this:
> >
> > /* We are effectively holding the siglock by not having any sighand. */
> > WARN_ON(tsk->sighand != NULL);
>
> Ah thanks. I was thinking about how tasks exit and get freed and that
> sort of stuff, and how this would race against them. The last time I
> worked with procfs there was some magic going on that I could not
> figure out, so I was thinking if some magic will stop the task_struct
> from being released, considering it's an argument here.
>
> I just looked at release_task and related functions; looks like it
> will, at the end, decrease the reference count of the task_struct.
> Does procfs increase the refcount while calling the procfs functions?
> Hence, in procfs functions one can rely on the task_struct still being
> a task_struct, but any direct effects of release_task may happen while
> the procfs functions are running?

Yeah.

The ONE() entry you're adding to tgid_base_stuff is used to help
instantiate a "struct inode" when someone looks up the path
"/proc/$tgid/seccomp_cache"; then when that path is opened, a "struct
file" is created that holds a reference to the inode; and while that
file exists, your proc_pid_seccomp_cache() can be invoked.

proc_pid_seccomp_cache() is invoked from proc_single_show()
("PROC_I(inode)->op.proc_show" is proc_pid_seccomp_cache), and
proc_single_show() obtains a temporary reference to the task_struct
using get_pid_task() on a "struct pid" and drops that reference
afterwards with put_task_struct(). The "struct pid" is obtained from
the "struct proc_inode", which is essentially a subclass of "struct
inode". The "struct pid" is kept referenced until the inode goes
away, via proc_pid_evict_inode(), called by proc_evict_inode().

By looking at put_task_struct() and its callees, you can figure out
which parts of the "struct task" are kept alive by the reference to
it.

By the way, maybe it'd make sense to add this to tid_base_stuff as
well? That should just be one extra line of code. Seccomp filters are
technically per-thread, so it would make sense to have them visible in
the per-thread subdirectories /proc/$pid/task/$tid/.


* Re: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-01 16:05         ` Jann Horn
@ 2020-10-01 16:18           ` YiFei Zhu
  0 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-10-01 16:18 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Thu, Oct 1, 2020 at 11:05 AM Jann Horn <jannh@google.com> wrote:
> Yeah.
>
> The ONE() entry you're adding to tgid_base_stuff is used to help
> instantiate a "struct inode" when someone looks up the path
> "/proc/$tgid/seccomp_cache"; then when that path is opened, a "struct
> file" is created that holds a reference to the inode; and while that
> file exists, your proc_pid_seccomp_cache() can be invoked.
>
> proc_pid_seccomp_cache() is invoked from proc_single_show()
> ("PROC_I(inode)->op.proc_show" is proc_pid_seccomp_cache), and
> proc_single_show() obtains a temporary reference to the task_struct
> using get_pid_task() on a "struct pid" and drops that reference
> afterwards with put_task_struct(). The "struct pid" is obtained from
> the "struct proc_inode", which is essentially a subclass of "struct
> inode". The "struct pid" is kept referenced until the inode goes
> away, via proc_pid_evict_inode(), called by proc_evict_inode().
>
> By looking at put_task_struct() and its callees, you can figure out
> which parts of the "struct task" are kept alive by the reference to
> it.

Ah I see. Thanks for the explanation.

> By the way, maybe it'd make sense to add this to tid_base_stuff as
> well? That should just be one extra line of code. Seccomp filters are
> technically per-thread, so it would make sense to have them visible in
> the per-thread subdirectories /proc/$pid/task/$tid/.

Right. Will do.

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-10-01 11:52       ` YiFei Zhu
@ 2020-10-01 21:05         ` Kees Cook
  2020-10-02 11:08           ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-10-01 21:05 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Jann Horn, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Oct 01, 2020 at 06:52:50AM -0500, YiFei Zhu wrote:
> On Wed, Sep 30, 2020 at 5:40 PM Kees Cook <keescook@chromium.org> wrote:
> > The guiding principle with seccomp's designs is to always make things
> > _more_ restrictive, never less. While we can never escape the
> > consequences of having seccomp_is_const_allow() report the wrong
> > answer, we can at least follow the basic principles, hopefully
> > minimizing the impact.
> >
> > When the bitmap starts with "always allowed" and we only flip it towards
> > "run full filters", we're only ever making things more restrictive. If
> > we instead go from "run full filters" towards "always allowed", we run
> > the risk of making things less restrictive. For example: a process that
> > maliciously adds a filter that the emulator mistakenly evaluates to
> > "always allow" doesn't suddenly cause all the prior filters to stop running.
> > (i.e. this isolates the flaw outcome, and doesn't depend on the early
> > "do not emulate if we already know we have to run filters" case before
> > the emulation call: there is no code path that allows the cache to
> > weaken: it can only maintain it being wrong).
> >
> > Without any seccomp filter installed, all syscalls are "always allowed"
> > (from the perspective of the seccomp boundary), so the default of the
> > cache needs to be "always allowed".
> 
> I cannot follow this. If a 'process that maliciously adds a filter
> that the emulator mistakenly evaluates to "always allow" doesn't
> suddenly cause all the prior filters to stop running', hence, you
> want, by default, the cache to be as transparent as possible. You
> would lift the restriction if and only if you are absolutely sure it
> does not cause an impact.

Yes, right now, the v3 code pattern is entirely safe.

> 
> In this patch, if there are prior filters, it goes through this logic:
> 
>         if (bitmap_prev && !test_bit(nr, bitmap_prev))
>             continue;
> 
> Hence, if the malicious filter were to happen, and prior filters were
> supposed to run, then seccomp_is_const_allow is simply not invoked --
> what it returns cannot be used maliciously by an adversary.

Right, but we depend on that test always doing the correct thing (and
continuing to do so into the future). I'm looking at this from the
perspective of future changes, maintenance, etc. I want the actions to
match the design principles as closely as possible so that future
evolutions of the code have a lower risk of bugs causing security
failures. Right now, the code is simple. I want to design this so that
when it is complex, it will still fail toward safety in the face of
bugs.

> >         if (bitmap_prev) {
> >                 /* The new filter must be as restrictive as the last. */
> >                 bitmap_copy(bitmap, bitmap_prev, bitmap_size);
> >         } else {
> >                 /* Before any filters, all syscalls are always allowed. */
> >                 bitmap_fill(bitmap, bitmap_size);
> >         }
> >
> >         for (nr = 0; nr < bitmap_size; nr++) {
> >                 /* No bitmap change: not a cacheable action. */
> >                 if (!test_bit(nr, bitmap_prev))
> >                         continue;
> >
> >                 /* No bitmap change: continue to always allow. */
> >                 if (seccomp_is_const_allow(fprog, &sd))
> >                         continue;
> >
> >                 /* Not a cacheable action: always run filters. */
> >                 clear_bit(nr, bitmap);
> 
> I'm not strongly against this logic. I just feel unconvinced that this
> is any different with a slightly increased complexity.

I'd prefer this way because for the loop, the tests, and the results only
make the bitmap more restrictive. The worst thing a bug in here can do is
leave the bitmap unchanged (which is certainly bad), but it can't _undo_
an earlier restriction.

The proposed loop's leading test_bit() becomes only an optimization,
rather than being required for policy enforcement.

In other words, I prefer:

	inherit all prior bitmap restrictions
	for all syscalls
		if this filter not restricted
			continue
		set bitmap restricted

	within this loop (where the bulk of future logic may get added),
	the worst-case future bug-induced failure mode for the syscall
	bitmap is "skip *this* filter".


Instead of:

	set bitmap all restricted
	for all syscalls
		if previous bitmap not restricted and
		   filter not restricted
			set bitmap unrestricted

	within this loop the worst-case future bug-induced failure mode
	for the syscall bitmap is "skip *all* filters".




Or, to reword again, this:

	retain restrictions from previous caching decisions
	for all syscalls
		[evaluate this filter, maybe continue]
		set restricted

instead of:

	set new cache to all restricted
	for all syscalls
		[evaluate prior cache and this filter, maybe continue]
		set unrestricted

I expect the future code changes for caching to be in the "evaluate"
step, so I'd like the code designed to make things MORE restrictive not
less from the start, and remove any prior cache state tests from the
loop.

At the end of the day I believe changing the design like this now lays
the groundwork to the caching mechanism being more robust against having
future bugs introduce security flaws.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-10-01 11:28       ` YiFei Zhu
@ 2020-10-01 21:08         ` Jann Horn
  0 siblings, 0 replies; 135+ messages in thread
From: Jann Horn @ 2020-10-01 21:08 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Thu, Oct 1, 2020 at 1:28 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> On Wed, Sep 30, 2020 at 5:24 PM Jann Horn <jannh@google.com> wrote:
> > If you did the architecture enablement for X86 later in the series,
> > you could move this part over into that patch, that'd be cleaner.
>
> As in, patch 1: bitmap check logic. patch 2: emulator. patch 3: enable for x86?

Yeah.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-10-01 21:05         ` Kees Cook
@ 2020-10-02 11:08           ` YiFei Zhu
  0 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-10-02 11:08 UTC (permalink / raw)
  To: Kees Cook
  Cc: Jann Horn, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Thu, Oct 1, 2020 at 4:05 PM Kees Cook <keescook@chromium.org> wrote:
> Right, but we depend on that test always doing the correct thing (and
> continuing to do so into the future). I'm looking at this from the
> perspective of future changes, maintenance, etc. I want the actions to
> match the design principles as closely as possible so that future
> evolutions of the code have a lower risk of bugs causing security
> failures. Right now, the code is simple. I want to design this so that
> when it is complex, it will still fail toward safety in the face of
> bugs.
>
> I'd prefer this way because for the loop, the tests, and the results only
> make the bitmap more restrictive. The worst thing a bug in here can do is
> leave the bitmap unchanged (which is certainly bad), but it can't _undo_
> an earlier restriction.
>
> The proposed loop's leading test_bit() becomes only an optimization,
> rather than being required for policy enforcement.
>
> In other words, I prefer:
>
>         inherit all prior bitmap restrictions
>         for all syscalls
>                 if this filter not restricted
>                         continue
>                 set bitmap restricted
>
>         within this loop (where the bulk of future logic may get added),
>         the worst-case future bug-induced failure mode for the syscall
>         bitmap is "skip *this* filter".
>
>
> Instead of:
>
>         set bitmap all restricted
>         for all syscalls
>                 if previous bitmap not restricted and
>                    filter not restricted
>                         set bitmap unrestricted
>
>         within this loop the worst-case future bug-induced failure mode
>         for the syscall bitmap is "skip *all* filters".
>
>
>
>
> Or, to reword again, this:
>
>         retain restrictions from previous caching decisions
>         for all syscalls
>                 [evaluate this filter, maybe continue]
>                 set restricted
>
> instead of:
>
>         set new cache to all restricted
>         for all syscalls
>                 [evaluate prior cache and this filter, maybe continue]
>                 set unrestricted
>
> I expect the future code changes for caching to be in the "evaluate"
> step, so I'd like the code designed to make things MORE restrictive not
> less from the start, and remove any prior cache state tests from the
> loop.
>
> At the end of the day I believe changing the design like this now lays
> the groundwork to the caching mechanism being more robust against having
> future bugs introduce security flaws.
>

I see. Makes sense. Thanks. Will do that in v4.

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path
  2020-09-30 21:32     ` Kees Cook
@ 2020-10-09  0:17       ` YiFei Zhu
  2020-10-09  5:35         ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-09  0:17 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Wed, Sep 30, 2020 at 4:32 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Wed, Sep 30, 2020 at 10:19:14AM -0500, YiFei Zhu wrote:
> > From: YiFei Zhu <yifeifz2@illinois.edu>
> >
> > The fast (common) path for seccomp should be that the filter permits
> > the syscall to pass through, and failing seccomp is expected to be
> > an exceptional case; it is not expected for userspace to call a
> > denylisted syscall over and over.
> >
> > This first finds the current allow bitmask by iterating through
> > syscall_arches[] array and comparing it to the one in struct
> > seccomp_data; this loop is expected to be unrolled. It then
> > does a test_bit against the bitmask. If the bit is set, then
> > there is no need to run the full filter; it returns
> > SECCOMP_RET_ALLOW immediately.
> >
> > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
>
> I'd like the content/ordering of this and the emulator patch to be reorganized a bit.
> I'd like to see the infrastructure of the cache added first (along with
> the "always allow" test logic in this patch), with the emulator missing:
> i.e. the patch is a logical no-op: no behavior changes because nothing
> ever changes the cache bits, but all the operational logic, structure
> changes, etc, is in place. Then the next patch would be replacing the
> no-op with the emulator.
>
> > ---
> >  kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 52 insertions(+)
> >
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index f09c9e74ae05..bed3b2a7f6c8 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { };
> >  static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
> >  {
> >  }
> > +
> > +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter,
>
> bikeshedding: "cache check" doesn't tell me anything about what it's
> actually checking for. How about calling this seccomp_is_constant_allow() or
> something that reflects both the "bool" return ("is") and what that bool
> means ("should always be allowed").

We have a naming conflict here. I'm about to rename
seccomp_emu_is_const_allow to seccomp_is_const_allow. Adding another
seccomp_is_constant_allow is confusing. Suggestions?

I think I would prefer to change seccomp_cache_check to
seccomp_cache_check_allow. While in this patch set seccomp_cache_check
does imply the filter is "constant" allow, argument-processing cache
may change this, and specifying an "allow" in the name specifies the
'what that bool means ("should always be allowed")'.

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-09-30 15:19   ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
  2020-09-30 22:24     ` Jann Horn
  2020-09-30 22:40     ` Kees Cook
@ 2020-10-09  4:47     ` YiFei Zhu
  2020-10-09  5:41       ` Kees Cook
  2 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-09  4:47 UTC (permalink / raw)
  To: Linux Containers
  Cc: YiFei Zhu, bpf, kernel list, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Wed, Sep 30, 2020 at 10:20 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>  {
>         struct seccomp_filter *sfilter;
>         int ret;
> -       const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
> +       const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
> +                              IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY);
>
>         if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
>                 return ERR_PTR(-EINVAL);

I'm trying to use __is_defined(SECCOMP_ARCH_NATIVE) here, and got this message:

kernel/seccomp.c: In function ‘seccomp_prepare_filter’:
././include/linux/kconfig.h:44:44: error: pasting "__ARG_PLACEHOLDER_"
and "(" does not give a valid preprocessing token
   44 | #define ___is_defined(val)  ____is_defined(__ARG_PLACEHOLDER_##val)
      |                                            ^~~~~~~~~~~~~~~~~~
././include/linux/kconfig.h:43:27: note: in expansion of macro ‘___is_defined’
   43 | #define __is_defined(x)   ___is_defined(x)
      |                           ^~~~~~~~~~~~~
kernel/seccomp.c:629:11: note: in expansion of macro ‘__is_defined’
  629 |           __is_defined(SECCOMP_ARCH_NATIVE);
      |           ^~~~~~~~~~~~

Looking at the implementation of __is_defined, it is:

#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define __is_defined(x) ___is_defined(x)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)

Hence, when FOO is defined to be 1, then the expansion would be
__is_defined(FOO) -> ___is_defined(1) ->
____is_defined(__ARG_PLACEHOLDER_1) -> __take_second_arg(0, 1, 0) ->
1,
and when FOO is not defined, the expansion would be __is_defined(FOO)
-> ___is_defined(FOO) -> ____is_defined(__ARG_PLACEHOLDER_FOO) ->
__take_second_arg(__ARG_PLACEHOLDER_FOO 1, 0) -> 0

However, here SECCOMP_ARCH_NATIVE is an expression from an OR of some
bits, and __is_defined(SECCOMP_ARCH_NATIVE) would not expand to
__ARG_PLACEHOLDER_1 during any stage in the preprocessing.

Is there any better way to do this? I'm thinking of just doing #if
defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE)
like in Kees's patch.

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path
  2020-10-09  0:17       ` YiFei Zhu
@ 2020-10-09  5:35         ` Kees Cook
  0 siblings, 0 replies; 135+ messages in thread
From: Kees Cook @ 2020-10-09  5:35 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Thu, Oct 08, 2020 at 07:17:39PM -0500, YiFei Zhu wrote:
> On Wed, Sep 30, 2020 at 4:32 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > On Wed, Sep 30, 2020 at 10:19:14AM -0500, YiFei Zhu wrote:
> > > From: YiFei Zhu <yifeifz2@illinois.edu>
> > >
> > > The fast (common) path for seccomp should be that the filter permits
> > > the syscall to pass through, and failing seccomp is expected to be
> > > an exceptional case; it is not expected for userspace to call a
> > > denylisted syscall over and over.
> > >
> > > This first finds the current allow bitmask by iterating through
> > > syscall_arches[] array and comparing it to the one in struct
> > > seccomp_data; this loop is expected to be unrolled. It then
> > > does a test_bit against the bitmask. If the bit is set, then
> > > there is no need to run the full filter; it returns
> > > SECCOMP_RET_ALLOW immediately.
> > >
> > > Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> > > Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> >
> > I'd like the content/ordering of this and the emulator patch to be reorganized a bit.
> > I'd like to see the infrastructure of the cache added first (along with
> > the "always allow" test logic in this patch), with the emulator missing:
> > i.e. the patch is a logical no-op: no behavior changes because nothing
> > ever changes the cache bits, but all the operational logic, structure
> > changes, etc, is in place. Then the next patch would be replacing the
> > no-op with the emulator.
> >
> > > ---
> > >  kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 52 insertions(+)
> > >
> > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > index f09c9e74ae05..bed3b2a7f6c8 100644
> > > --- a/kernel/seccomp.c
> > > +++ b/kernel/seccomp.c
> > > @@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { };
> > >  static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
> > >  {
> > >  }
> > > +
> > > +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter,
> >
> > bikeshedding: "cache check" doesn't tell me anything about what it's
> > actually checking for. How about calling this seccomp_is_constant_allow() or
> > something that reflects both the "bool" return ("is") and what that bool
> > means ("should always be allowed").
> 
> We have a naming conflict here. I'm about to rename
> seccomp_emu_is_const_allow to seccomp_is_const_allow. Adding another
> seccomp_is_constant_allow is confusing. Suggestions?
> 
> I think I would prefer to change seccomp_cache_check to
> seccomp_cache_check_allow. While in this patch set seccomp_cache_check
> does imply the filter is "constant" allow, argument-processing cache
> may change this, and specifying an "allow" in the name specifies the
> 'what that bool means ("should always be allowed")'.

Yeah, that seems good.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-10-09  4:47     ` YiFei Zhu
@ 2020-10-09  5:41       ` Kees Cook
  0 siblings, 0 replies; 135+ messages in thread
From: Kees Cook @ 2020-10-09  5:41 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Thu, Oct 08, 2020 at 11:47:17PM -0500, YiFei Zhu wrote:
> On Wed, Sep 30, 2020 at 10:20 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
> >  {
> >         struct seccomp_filter *sfilter;
> >         int ret;
> > -       const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
> > +       const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
> > +                              IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY);
> >
> >         if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
> >                 return ERR_PTR(-EINVAL);
> 
> I'm trying to use __is_defined(SECCOMP_ARCH_NATIVE) here, and got this message:
> 
> kernel/seccomp.c: In function ‘seccomp_prepare_filter’:
> ././include/linux/kconfig.h:44:44: error: pasting "__ARG_PLACEHOLDER_"
> and "(" does not give a valid preprocessing token
>    44 | #define ___is_defined(val)  ____is_defined(__ARG_PLACEHOLDER_##val)
>       |                                            ^~~~~~~~~~~~~~~~~~
> ././include/linux/kconfig.h:43:27: note: in expansion of macro ‘___is_defined’
>    43 | #define __is_defined(x)   ___is_defined(x)
>       |                           ^~~~~~~~~~~~~
> kernel/seccomp.c:629:11: note: in expansion of macro ‘__is_defined’
>   629 |           __is_defined(SECCOMP_ARCH_NATIVE);
>       |           ^~~~~~~~~~~~
> 
> Looking at the implementation of __is_defined, it is:
> 
> #define __ARG_PLACEHOLDER_1 0,
> #define __take_second_arg(__ignored, val, ...) val
> #define __is_defined(x) ___is_defined(x)
> #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
> #define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
> 
> Hence, when FOO is defined to be 1, then the expansion would be
> __is_defined(FOO) -> ___is_defined(1) ->
> ____is_defined(__ARG_PLACEHOLDER_1) -> __take_second_arg(0, 1, 0) ->
> 1,
> and when FOO is not defined, the expansion would be __is_defined(FOO)
> -> ___is_defined(FOO) -> ____is_defined(__ARG_PLACEHOLDER_FOO) ->
> __take_second_arg(__ARG_PLACEHOLDER_FOO 1, 0) -> 0
> 
> However, here SECCOMP_ARCH_NATIVE is an expression from an OR of some
> bits, and __is_defined(SECCOMP_ARCH_NATIVE) would not expand to
> __ARG_PLACEHOLDER_1 during any stage in the preprocessing.
> 
> Is there any better way to do this? I'm thinking of just doing #if
> defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE)
> like in Kees's patch.

Yeah, I think that's simplest.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results
  2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
                     ` (4 preceding siblings ...)
  2020-09-30 15:19   ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
@ 2020-10-09 17:14   ` YiFei Zhu
  2020-10-09 17:14     ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
                       ` (5 more replies)
  5 siblings, 6 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/

Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.

This series adds a bitmap to cache seccomp filter results if the
result permits a syscall and is independent of syscall arguments.
This visibly decreases seccomp overhead for most common seccomp
filters with very little memory footprint.

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.

We observed some common filters, such as docker's [4] or
systemd's [5], will make most decisions based only on the syscall
numbers, and as past discussions considered, a bitmap where each bit
represents a syscall makes most sense for these filters.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent from syscall arguments.

When it is concluded that an allow must occur for the given
architecture and syscall pair, seccomp will immediately allow
the syscall, bypassing further BPF execution.

Ongoing work is to further support arguments with fast hash table
lookups. We are investigating the performance of doing so [6], and how
to best integrate with the existing seccomp infrastructure.

Some benchmarks are performed with results in patch 5, copied below:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Benchmarking 200000000 syscalls...
  129.359381409 - 0.008724424 = 129350656985 (129.4s)
  getpid native: 646 ns
  264.385890006 - 129.360453229 = 135025436777 (135.0s)
  getpid RET_ALLOW 1 filter (bitmap): 675 ns
  399.400511893 - 264.387045901 = 135013465992 (135.0s)
  getpid RET_ALLOW 2 filters (bitmap): 675 ns
  545.872866260 - 399.401718327 = 146471147933 (146.5s)
  getpid RET_ALLOW 3 filters (full): 732 ns
  696.337101319 - 545.874097681 = 150463003638 (150.5s)
  getpid RET_ALLOW 4 filters (full): 752 ns
  Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
  Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
  Estimated total seccomp overhead for 3 full filters: 86 ns
  Estimated total seccomp overhead for 4 full filters: 106 ns
  Estimated seccomp entry overhead: 29 ns
  Estimated seccomp per-filter overhead (last 2 diff): 20 ns
  Estimated seccomp per-filter overhead (filters / 4): 19 ns
  Expectations:
  	native ≤ 1 bitmap (646 ≤ 675): ✔️
  	native ≤ 1 filter (646 ≤ 732): ✔️
  	per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
  	1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
  	entry ≈ 1 bitmapped (29 ≈ 29): ✔️
  	entry ≈ 2 bitmapped (29 ≈ 29): ✔️
  	native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

v3 -> v4:
* Reordered patches
* Naming changes
* Fixed racing in /proc/pid/seccomp_cache against filter being released
  from task, using Jann's suggestion of sighand spinlock.
* Cache no longer configurable.
* Copied some description from cover letter to commit messages.
* Used Kees's logic to clear bits from the bitmap, rather than set bits.

v2 -> v3:
* Added array_index_nospec guards
* No more syscall_arches[] array or reliance on loop unrolling. Arches
  are configured with per-arch seccomp.h.
* Moved filter emulation to attach time (from prepare time).
* Further simplified emulator, basing on Kees's code.
* Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN.

v1 -> v2:
* Corrected one outdated function documentation.

RFC -> v1:
* Config made on by default across all arches that could support it.
* Added arch numbers array and emulate filter for each arch number, and
  have a per-arch bitmap.
* Massively simplified the emulator so it would only support the common
  instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL
  during prepare).
* Stole the selftest from Kees.
* Added a /proc/pid/seccomp_cache by Jann's suggestion.

Patch 1 implements the test_bit against the bitmaps.

Patch 2 implements the emulator that finds if a filter must return allow,

Patch 3 adds the arch macros for x86.

Patch 4 updates the selftest to better show the new semantics.

Patch 5 implements /proc/pid/seccomp_cache.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020

Kees Cook (2):
  x86: Enable seccomp architecture tracking
  selftests/seccomp: Compare bitmap vs filter overhead

YiFei Zhu (3):
  seccomp/cache: Lookup syscall allowlist bitmap for fast path
  seccomp/cache: Add "emulator" to check if filter is constant allow
  seccomp/cache: Report cache data through /proc/pid/seccomp_cache

 arch/Kconfig                                  |  24 ++
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/seccomp.h                |  15 +
 fs/proc/base.c                                |   6 +
 include/linux/seccomp.h                       |   5 +
 kernel/seccomp.c                              | 289 +++++++++++++++++-
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++--
 tools/testing/selftests/seccomp/settings      |   2 +-
 8 files changed, 469 insertions(+), 24 deletions(-)

--
2.28.0

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path
  2020-10-09 17:14   ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
@ 2020-10-09 17:14     ` YiFei Zhu
  2020-10-09 21:30       ` Jann Horn
  2020-10-09 23:18       ` Kees Cook
  2020-10-09 17:14     ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
                       ` (4 subsequent siblings)
  5 siblings, 2 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. Recent work [6] comprehensively measures the Seccomp
overhead and shows that it is non-negligible and has a
non-trivial impact on application performance.

We observed that some common filters, such as docker's [4] or
systemd's [5], make most of their decisions based only on the
syscall number; as past discussions concluded, a bitmap in which
each bit represents a syscall makes the most sense for these filters.

The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.

When it can be concluded that an allow must occur for the given
architecture and syscall pair (this determination is introduced in
the next commit), seccomp will immediately allow the syscall,
bypassing further BPF execution.

Each architecture number has its own bitmap. The architecture
number in seccomp_data is first checked against the defined
architecture number constant; the syscall number is then used as
the index of the bit to test in that architecture's bitmap. If
the bit is set, seccomp returns allow. The bitmaps are all clear
in this patch and will be initialized in the next commit.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ae6b40cc39f4..73f6b6e9a3b0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,6 +143,34 @@ struct notification {
 	struct list_head notifications;
 };
 
+#ifdef SECCOMP_ARCH_NATIVE
+/**
+ * struct action_cache - per-filter cache of seccomp actions per
+ * arch/syscall pair
+ *
+ * @allow_native: A bitmap where each bit represents whether the
+ *		  filter will always allow the syscall, for the
+ *		  native architecture.
+ * @allow_compat: A bitmap where each bit represents whether the
+ *		  filter will always allow the syscall, for the
+ *		  compat architecture.
+ */
+struct action_cache {
+	DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR);
+#ifdef SECCOMP_ARCH_COMPAT
+	DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR);
+#endif
+};
+#else
+struct action_cache { };
+
+static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
+					     const struct seccomp_data *sd)
+{
+	return false;
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -298,6 +326,47 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 	return 0;
 }
 
+#ifdef SECCOMP_ARCH_NATIVE
+static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap,
+						    size_t bitmap_size,
+						    int syscall_nr)
+{
+	if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))
+		return false;
+	syscall_nr = array_index_nospec(syscall_nr, bitmap_size);
+
+	return test_bit(syscall_nr, bitmap);
+}
+
+/**
+ * seccomp_cache_check_allow - lookup seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to lookup the cache with
+ *
+ * Returns true if the seccomp_data is cached and allowed.
+ */
+static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
+					     const struct seccomp_data *sd)
+{
+	int syscall_nr = sd->nr;
+	const struct action_cache *cache = &sfilter->cache;
+
+	if (likely(sd->arch == SECCOMP_ARCH_NATIVE))
+		return seccomp_cache_check_allow_bitmap(cache->allow_native,
+							SECCOMP_ARCH_NATIVE_NR,
+							syscall_nr);
+#ifdef SECCOMP_ARCH_COMPAT
+	if (likely(sd->arch == SECCOMP_ARCH_COMPAT))
+		return seccomp_cache_check_allow_bitmap(cache->allow_compat,
+							SECCOMP_ARCH_COMPAT_NR,
+							syscall_nr);
+#endif /* SECCOMP_ARCH_COMPAT */
+
+	WARN_ON_ONCE(true);
+	return false;
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
 /**
  * seccomp_run_filters - evaluates all seccomp filters against @sd
  * @sd: optional seccomp data to be passed to filters
@@ -320,6 +389,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 	if (WARN_ON(f == NULL))
 		return SECCOMP_RET_KILL_PROCESS;
 
+	if (seccomp_cache_check_allow(f, sd))
+		return SECCOMP_RET_ALLOW;
+
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-10-09 17:14   ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  2020-10-09 17:14     ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
@ 2020-10-09 17:14     ` YiFei Zhu
  2020-10-09 21:30       ` Jann Horn
  2020-10-09 17:14     ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
                       ` (3 subsequent siblings)
  5 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

SECCOMP_CACHE will only operate on syscalls that do not access
any syscall arguments or the instruction pointer. To facilitate
this, we need a static analyser to know whether a filter will
return allow regardless of syscall arguments for a given
architecture number / syscall number pair. This is implemented
here with a pseudo-emulator, and stored in a per-filter bitmap.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent from syscall arguments.

Nearly all seccomp filters are built from these cBPF instructions:

BPF_LD  | BPF_W    | BPF_ABS
BPF_JMP | BPF_JEQ  | BPF_K
BPF_JMP | BPF_JGE  | BPF_K
BPF_JMP | BPF_JGT  | BPF_K
BPF_JMP | BPF_JSET | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K
BPF_ALU | BPF_AND  | BPF_K

Each of these instructions is emulated. Any weirdness or a load
from a syscall argument will cause the emulator to bail.

Emulation also halts when it reaches a return. In that case,
if it returns SECCOMP_RET_ALLOW, the syscall is marked as good.

Emulator structure and comments are from Kees [1] and Jann [2].

Emulation is done at attach time. If a filter is stacked on
earlier filters, and an earlier filter does not guarantee that it
allows the syscall, then emulation of that syscall is skipped.

[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/

Suggested-by: Jann Horn <jannh@google.com>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 157 insertions(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 73f6b6e9a3b0..51032b41fe59 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -169,6 +169,10 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte
 {
 	return false;
 }
+
+static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+}
 #endif /* SECCOMP_ARCH_NATIVE */
 
 /**
@@ -187,6 +191,7 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte
  *	   this filter after reaching 0. The @users count is always smaller
  *	   or equal to @refs. Hence, reaching 0 for @users does not mean
  *	   the filter can be freed.
+ * @cache: cache of arch/syscall mappings to actions
  * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
  * @prev: points to a previously installed, or inherited, filter
  * @prog: the BPF program to evaluate
@@ -208,6 +213,7 @@ struct seccomp_filter {
 	refcount_t refs;
 	refcount_t users;
 	bool log;
+	struct action_cache cache;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
 	struct notification *notif;
@@ -616,7 +622,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 {
 	struct seccomp_filter *sfilter;
 	int ret;
-	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
+	const bool save_orig =
+#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE)
+		true;
+#else
+		false;
+#endif
 
 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
 		return ERR_PTR(-EINVAL);
@@ -682,6 +693,150 @@ seccomp_prepare_user_filter(const char __user *user_filter)
 	return filter;
 }
 
+#ifdef SECCOMP_ARCH_NATIVE
+/**
+ * seccomp_is_const_allow - check if filter is constant allow with given data
+ * @fprog: The BPF programs
+ * @sd: The seccomp data to check against, only syscall number and arch
+ *      number are considered constant.
+ */
+static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
+				   struct seccomp_data *sd)
+{
+	unsigned int insns;
+	unsigned int reg_value = 0;
+	unsigned int pc;
+	bool op_res;
+
+	if (WARN_ON_ONCE(!fprog))
+		return false;
+
+	insns = bpf_classic_proglen(fprog);
+	for (pc = 0; pc < insns; pc++) {
+		struct sock_filter *insn = &fprog->filter[pc];
+		u16 code = insn->code;
+		u32 k = insn->k;
+
+		switch (code) {
+		case BPF_LD | BPF_W | BPF_ABS:
+			switch (k) {
+			case offsetof(struct seccomp_data, nr):
+				reg_value = sd->nr;
+				break;
+			case offsetof(struct seccomp_data, arch):
+				reg_value = sd->arch;
+				break;
+			default:
+				/* can't optimize (non-constant value load) */
+				return false;
+			}
+			break;
+		case BPF_RET | BPF_K:
+			/* reached return with constant values only, check allow */
+			return k == SECCOMP_RET_ALLOW;
+		case BPF_JMP | BPF_JA:
+			pc += insn->k;
+			break;
+		case BPF_JMP | BPF_JEQ | BPF_K:
+		case BPF_JMP | BPF_JGE | BPF_K:
+		case BPF_JMP | BPF_JGT | BPF_K:
+		case BPF_JMP | BPF_JSET | BPF_K:
+			switch (BPF_OP(code)) {
+			case BPF_JEQ:
+				op_res = reg_value == k;
+				break;
+			case BPF_JGE:
+				op_res = reg_value >= k;
+				break;
+			case BPF_JGT:
+				op_res = reg_value > k;
+				break;
+			case BPF_JSET:
+				op_res = !!(reg_value & k);
+				break;
+			default:
+				/* can't optimize (unknown jump) */
+				return false;
+			}
+
+			pc += op_res ? insn->jt : insn->jf;
+			break;
+		case BPF_ALU | BPF_AND | BPF_K:
+			reg_value &= k;
+			break;
+		default:
+			/* can't optimize (unknown insn) */
+			return false;
+		}
+	}
+
+	/* ran off the end of the filter?! */
+	WARN_ON(1);
+	return false;
+}
+
+static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
+					 void *bitmap, const void *bitmap_prev,
+					 size_t bitmap_size, int arch)
+{
+	struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
+	struct seccomp_data sd;
+	int nr;
+
+	if (bitmap_prev) {
+		/* The new filter must be as restrictive as the last. */
+		bitmap_copy(bitmap, bitmap_prev, bitmap_size);
+	} else {
+		/* Before any filters, all syscalls are always allowed. */
+		bitmap_fill(bitmap, bitmap_size);
+	}
+
+	for (nr = 0; nr < bitmap_size; nr++) {
+		/* No bitmap change: not a cacheable action. */
+		if (!test_bit(nr, bitmap))
+			continue;
+
+		sd.nr = nr;
+		sd.arch = arch;
+
+		/* No bitmap change: continue to always allow. */
+		if (seccomp_is_const_allow(fprog, &sd))
+			continue;
+
+		/*
+		 * Not a cacheable action: always run filters.
+		 * atomic clear_bit() not needed, filter not visible yet.
+		 */
+		__clear_bit(nr, bitmap);
+	}
+}
+
+/**
+ * seccomp_cache_prepare - emulate the filter to find cacheable syscalls
+ * @sfilter: The seccomp filter
+ */
+static void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+	struct action_cache *cache = &sfilter->cache;
+	const struct action_cache *cache_prev =
+		sfilter->prev ? &sfilter->prev->cache : NULL;
+
+	seccomp_cache_prepare_bitmap(sfilter, cache->allow_native,
+				     cache_prev ? cache_prev->allow_native : NULL,
+				     SECCOMP_ARCH_NATIVE_NR,
+				     SECCOMP_ARCH_NATIVE);
+
+#ifdef SECCOMP_ARCH_COMPAT
+	seccomp_cache_prepare_bitmap(sfilter, cache->allow_compat,
+				     cache_prev ? cache_prev->allow_compat : NULL,
+				     SECCOMP_ARCH_COMPAT_NR,
+				     SECCOMP_ARCH_COMPAT);
+#endif /* SECCOMP_ARCH_COMPAT */
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
 /**
  * seccomp_attach_filter: validate and attach filter
  * @flags:  flags to change filter behavior
@@ -731,6 +886,7 @@ static long seccomp_attach_filter(unsigned int flags,
 	 * task reference.
 	 */
 	filter->prev = current->seccomp.filter;
+	seccomp_cache_prepare(filter);
 	current->seccomp.filter = filter;
 	atomic_inc(&current->seccomp.filter_count);
 
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking
  2020-10-09 17:14   ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  2020-10-09 17:14     ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
  2020-10-09 17:14     ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
@ 2020-10-09 17:14     ` YiFei Zhu
  2020-10-09 17:25       ` Andy Lutomirski
  2020-10-09 17:14     ` [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
                       ` (2 subsequent siblings)
  5 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: Kees Cook <keescook@chromium.org>

Provide seccomp internals with the details to calculate which syscall
table the running kernel is expecting to deal with. This allows for
efficient architecture pinning and paves the way for constant-action
bitmaps.

Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/x86/include/asm/seccomp.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 2bd1338de236..03365af6165d 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -16,6 +16,18 @@
 #define __NR_seccomp_sigreturn_32	__NR_ia32_sigreturn
 #endif
 
+#ifdef CONFIG_X86_64
+# define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_X86_64
+# define SECCOMP_ARCH_NATIVE_NR		NR_syscalls
+# ifdef CONFIG_COMPAT
+#  define SECCOMP_ARCH_COMPAT		AUDIT_ARCH_I386
+#  define SECCOMP_ARCH_COMPAT_NR	IA32_NR_syscalls
+# endif
+#else /* !CONFIG_X86_64 */
+# define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_I386
+# define SECCOMP_ARCH_NATIVE_NR	        NR_syscalls
+#endif
+
 #include <asm-generic/seccomp.h>
 
 #endif /* _ASM_X86_SECCOMP_H */
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead
  2020-10-09 17:14   ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
                       ` (2 preceding siblings ...)
  2020-10-09 17:14     ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
@ 2020-10-09 17:14     ` YiFei Zhu
  2020-10-09 17:14     ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
  2020-10-11 15:47     ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  5 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: Kees Cook <keescook@chromium.org>

As part of the seccomp benchmarking, include the expectations with
regard to the timing behavior of the constant action bitmaps, and report
inconsistencies better.

Example output with constant action bitmaps on x86:

$ sudo ./seccomp_benchmark 100000000
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
	native ≤ 1 bitmap (646 ≤ 675): ✔️
	native ≤ 1 filter (646 ≤ 732): ✔️
	per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
	1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
	entry ≈ 1 bitmapped (29 ≈ 29): ✔️
	entry ≈ 2 bitmapped (29 ≈ 29): ✔️
	native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

Signed-off-by: Kees Cook <keescook@chromium.org>
[YiFei: Changed commit message to show stats for this patch series]
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++++++++++---
 tools/testing/selftests/seccomp/settings      |   2 +-
 2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 91f5a89cadac..fcc806585266 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,12 +4,16 @@
  */
 #define _GNU_SOURCE
 #include <assert.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <unistd.h>
 #include <linux/filter.h>
 #include <linux/seccomp.h>
+#include <sys/param.h>
 #include <sys/prctl.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
@@ -70,18 +74,74 @@ unsigned long long calibrate(void)
 	return samples * seconds;
 }
 
+bool approx(int i_one, int i_two)
+{
+	double one = i_one, one_bump = one * 0.01;
+	double two = i_two, two_bump = two * 0.01;
+
+	one_bump = one + MAX(one_bump, 2.0);
+	two_bump = two + MAX(two_bump, 2.0);
+
+	/* Equal to, or within 1% or 2 digits */
+	if (one == two ||
+	    (one > two && one <= two_bump) ||
+	    (two > one && two <= one_bump))
+		return true;
+	return false;
+}
+
+bool le(int i_one, int i_two)
+{
+	if (i_one <= i_two)
+		return true;
+	return false;
+}
+
+long compare(const char *name_one, const char *name_eval, const char *name_two,
+	     unsigned long long one, bool (*eval)(int, int), unsigned long long two)
+{
+	bool good;
+
+	printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two,
+	       (long long)one, name_eval, (long long)two);
+	if (one > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)one);
+		return 1;
+	}
+	if (two > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)two);
+		return 1;
+	}
+
+	good = eval(one, two);
+	printf("%s\n", good ? "✔️" : "❌");
+
+	return good ? 0 : 1;
+}
+
 int main(int argc, char *argv[])
 {
+	struct sock_filter bitmap_filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog bitmap_prog = {
+		.len = (unsigned short)ARRAY_SIZE(bitmap_filter),
+		.filter = bitmap_filter,
+	};
 	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])),
 		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
 	};
 	struct sock_fprog prog = {
 		.len = (unsigned short)ARRAY_SIZE(filter),
 		.filter = filter,
 	};
-	long ret;
-	unsigned long long samples;
-	unsigned long long native, filter1, filter2;
+
+	long ret, bits;
+	unsigned long long samples, calc;
+	unsigned long long native, filter1, filter2, bitmap1, bitmap2;
+	unsigned long long entry, per_filter1, per_filter2;
 
 	printf("Current BPF sysctl settings:\n");
 	system("sysctl net.core.bpf_jit_enable");
@@ -101,35 +161,82 @@ int main(int argc, char *argv[])
 	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
 	assert(ret == 0);
 
-	/* One filter */
-	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
+	/* One filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
 	assert(ret == 0);
 
-	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1);
+	bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1);
+
+	/* Second filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	if (filter1 == native)
-		printf("No overhead measured!? Try running again with more samples.\n");
+	bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2);
 
-	/* Two filters */
+	/* Third filter, can no longer be converted to bitmap */
 	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
 	assert(ret == 0);
 
-	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2);
-
-	/* Calculations */
-	printf("Estimated total seccomp overhead for 1 filter: %llu ns\n",
-		filter1 - native);
+	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1);
 
-	printf("Estimated total seccomp overhead for 2 filters: %llu ns\n",
-		filter2 - native);
+	/* Fourth filter, can not be converted to bitmap because of filter 3 */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	printf("Estimated seccomp per-filter overhead: %llu ns\n",
-		filter2 - filter1);
+	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2);
+
+	/* Estimations */
+#define ESTIMATE(fmt, var, what)	do {			\
+		var = (what);					\
+		printf("Estimated " fmt ": %llu ns\n", var);	\
+		if (var > INT_MAX)				\
+			goto more_samples;			\
+	} while (0)
+
+	ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc,
+		 bitmap1 - native);
+	ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc,
+		 bitmap2 - native);
+	ESTIMATE("total seccomp overhead for 3 full filters", calc,
+		 filter1 - native);
+	ESTIMATE("total seccomp overhead for 4 full filters", calc,
+		 filter2 - native);
+	ESTIMATE("seccomp entry overhead", entry,
+		 bitmap1 - native - (bitmap2 - bitmap1));
+	ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1,
+		 filter2 - filter1);
+	ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2,
+		 (filter2 - native - entry) / 4);
+
+	printf("Expectations:\n");
+	ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1);
+	bits = compare("native", "≤", "1 filter", native, le, filter1);
+	if (bits)
+		goto more_samples;
+
+	ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)",
+			per_filter1, approx, per_filter2);
+
+	bits = compare("1 bitmapped", "≈", "2 bitmapped",
+			bitmap1 - native, approx, bitmap2 - native);
+	if (bits) {
+		printf("Skipping constant action bitmap expectations: they appear unsupported.\n");
+		goto out;
+	}
 
-	printf("Estimated seccomp entry overhead: %llu ns\n",
-		filter1 - native - (filter2 - filter1));
+	ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native);
+	ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native);
+	ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total",
+			entry + (per_filter1 * 4) + native, approx, filter2);
+	if (ret == 0)
+		goto out;
 
+more_samples:
+	printf("Saw unexpected benchmark result. Try running again with more samples?\n");
+out:
 	return 0;
 }
diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings
index ba4d85f74cd6..6091b45d226b 100644
--- a/tools/testing/selftests/seccomp/settings
+++ b/tools/testing/selftests/seccomp/settings
@@ -1 +1 @@
-timeout=90
+timeout=120
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-09 17:14   ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
                       ` (3 preceding siblings ...)
  2020-10-09 17:14     ` [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
@ 2020-10-09 17:14     ` YiFei Zhu
  2020-10-09 21:45       ` Jann Horn
  2020-10-09 23:14       ` Kees Cook
  2020-10-11 15:47     ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  5 siblings, 2 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-10-09 17:14 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Currently the kernel does not provide infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through the FTRACE_SYSCALL
infrastructure, but it does not support compat syscalls.

This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<arch name> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and FILTER means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:
x86_64 0 ALLOW
x86_64 1 ALLOW
x86_64 2 ALLOW
x86_64 3 ALLOW
[...]
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
x86_64 138 ALLOW
x86_64 139 FILTER
x86_64 140 ALLOW
x86_64 141 ALLOW
[...]

This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable. For
the same reason, it is also guarded by CAP_SYS_ADMIN.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig                   | 24 ++++++++++++++
 arch/x86/Kconfig               |  1 +
 arch/x86/include/asm/seccomp.h |  3 ++
 fs/proc/base.c                 |  6 ++++
 include/linux/seccomp.h        |  5 +++
 kernel/seccomp.c               | 59 ++++++++++++++++++++++++++++++++++
 6 files changed, 98 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 21a3675a7a3a..85239a974f04 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -471,6 +471,15 @@ config HAVE_ARCH_SECCOMP_FILTER
 	    results in the system call being skipped immediately.
 	  - seccomp syscall wired up
 
+config HAVE_ARCH_SECCOMP_CACHE
+	bool
+	help
+	  An arch should select this symbol if it provides all of these things:
+	  - all the requirements for HAVE_ARCH_SECCOMP_FILTER
+	  - SECCOMP_ARCH_NATIVE
+	  - SECCOMP_ARCH_NATIVE_NR
+	  - SECCOMP_ARCH_NATIVE_NAME
+
 config SECCOMP
 	prompt "Enable seccomp to safely execute untrusted bytecode"
 	def_bool y
@@ -498,6 +507,21 @@ config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+config SECCOMP_CACHE_DEBUG
+	bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
+	depends on SECCOMP
+	depends on SECCOMP_FILTER
+	depends on PROC_FS
+	help
+	  This enables the /proc/pid/seccomp_cache interface to monitor
+	  seccomp cache data. The file format is subject to change. Reading
+	  the file requires CAP_SYS_ADMIN.
+
+	  This option is for debugging only. Enabling it presents the risk that
+	  an adversary may be able to infer the seccomp filter logic.
+
+	  If unsure, say N.
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1ab22869a765..1a807f89ac77 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -150,6 +150,7 @@ config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_SECCOMP_CACHE
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 03365af6165d..cd57c3eabab5 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -19,13 +19,16 @@
 #ifdef CONFIG_X86_64
 # define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_X86_64
 # define SECCOMP_ARCH_NATIVE_NR		NR_syscalls
+# define SECCOMP_ARCH_NATIVE_NAME	"x86_64"
 # ifdef CONFIG_COMPAT
 #  define SECCOMP_ARCH_COMPAT		AUDIT_ARCH_I386
 #  define SECCOMP_ARCH_COMPAT_NR	IA32_NR_syscalls
+#  define SECCOMP_ARCH_COMPAT_NAME	"ia32"
 # endif
 #else /* !CONFIG_X86_64 */
 # define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_I386
 # define SECCOMP_ARCH_NATIVE_NR	        NR_syscalls
+# define SECCOMP_ARCH_NATIVE_NAME	"ia32"
 #endif
 
 #include <asm-generic/seccomp.h>
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617db4e0faa0..a4990410ff05 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
 #endif
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3587,6 +3590,9 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
 #endif
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..1f028d55142a 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task,
 	return -EINVAL;
 }
 #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+			   struct pid *pid, struct task_struct *task);
+#endif
 #endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 51032b41fe59..a75746d259a5 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -548,6 +548,9 @@ void seccomp_filter_release(struct task_struct *tsk)
 {
 	struct seccomp_filter *orig = tsk->seccomp.filter;
 
+	/* We are effectively holding the siglock by not having any sighand. */
+	WARN_ON(tsk->sighand != NULL);
+
 	/* Detach task from its filter tree. */
 	tsk->seccomp.filter = NULL;
 	__seccomp_filter_release(orig);
@@ -2308,3 +2311,59 @@ static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+/* Currently CONFIG_SECCOMP_CACHE_DEBUG implies SECCOMP_ARCH_NATIVE */
+static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name,
+					const void *bitmap, size_t bitmap_size)
+{
+	int nr;
+
+	for (nr = 0; nr < bitmap_size; nr++) {
+		bool cached = test_bit(nr, bitmap);
+		char *status = cached ? "ALLOW" : "FILTER";
+
+		seq_printf(m, "%s %d %s\n", name, nr, status);
+	}
+}
+
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+			   struct pid *pid, struct task_struct *task)
+{
+	struct seccomp_filter *f;
+	unsigned long flags;
+
+	/*
+	 * We don't want sandboxed processes to know what their seccomp
+	 * filters consist of.
+	 */
+	if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (!lock_task_sighand(task, &flags))
+		return 0;
+
+	f = READ_ONCE(task->seccomp.filter);
+	if (!f) {
+		unlock_task_sighand(task, &flags);
+		return 0;
+	}
+
+	/* prevent filter from being freed while we are printing it */
+	__get_seccomp_filter(f);
+	unlock_task_sighand(task, &flags);
+
+	proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_NATIVE_NAME,
+				    f->cache.allow_native,
+				    SECCOMP_ARCH_NATIVE_NR);
+
+#ifdef SECCOMP_ARCH_COMPAT
+	proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME,
+				    f->cache.allow_compat,
+				    SECCOMP_ARCH_COMPAT_NR);
+#endif /* SECCOMP_ARCH_COMPAT */
+
+	__put_seccomp_filter(f);
+	return 0;
+}
+#endif /* CONFIG_SECCOMP_CACHE_DEBUG */
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking
  2020-10-09 17:14     ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
@ 2020-10-09 17:25       ` Andy Lutomirski
  2020-10-09 18:32         ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Andy Lutomirski @ 2020-10-09 17:25 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, LKML, Aleksa Sarai,
	Andrea Arcangeli, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Oct 9, 2020 at 10:15 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>
> From: Kees Cook <keescook@chromium.org>
>
> Provide seccomp internals with the details to calculate which syscall
> table the running kernel is expecting to deal with. This allows for
> efficient architecture pinning and paves the way for constant-action
> bitmaps.
>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> ---
>  arch/x86/include/asm/seccomp.h | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
> index 2bd1338de236..03365af6165d 100644
> --- a/arch/x86/include/asm/seccomp.h
> +++ b/arch/x86/include/asm/seccomp.h
> @@ -16,6 +16,18 @@
>  #define __NR_seccomp_sigreturn_32      __NR_ia32_sigreturn
>  #endif
>
> +#ifdef CONFIG_X86_64
> +# define SECCOMP_ARCH_NATIVE           AUDIT_ARCH_X86_64
> +# define SECCOMP_ARCH_NATIVE_NR                NR_syscalls
> +# ifdef CONFIG_COMPAT
> +#  define SECCOMP_ARCH_COMPAT          AUDIT_ARCH_I386
> +#  define SECCOMP_ARCH_COMPAT_NR       IA32_NR_syscalls
> +# endif
> +#else /* !CONFIG_X86_64 */
> +# define SECCOMP_ARCH_NATIVE           AUDIT_ARCH_I386
> +# define SECCOMP_ARCH_NATIVE_NR                NR_syscalls
> +#endif

Is the idea that any syscall that's out of range for this (e.g. all of
the x32 syscalls) is unoptimized?  I'm okay with this, but I think it
could use a comment.

> +
>  #include <asm-generic/seccomp.h>
>
>  #endif /* _ASM_X86_SECCOMP_H */
> --
> 2.28.0
>


-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking
  2020-10-09 17:25       ` Andy Lutomirski
@ 2020-10-09 18:32         ` YiFei Zhu
  2020-10-09 20:59           ` Andy Lutomirski
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-09 18:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux Containers, YiFei Zhu, bpf, LKML, Aleksa Sarai,
	Andrea Arcangeli, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Oct 9, 2020 at 12:25 PM Andy Lutomirski <luto@amacapital.net> wrote:
> Is the idea that any syscall that's out of range for this (e.g. all of
> the x32 syscalls) is unoptimized?  I'm okay with this, but I think it
> could use a comment.

Yes, any syscall number that is out of range is unoptimized. Where do
you think I should put a comment? seccomp_cache_check_allow_bitmap
above `if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))`,
with something like "any syscall number out of range is unoptimized"?

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking
  2020-10-09 18:32         ` YiFei Zhu
@ 2020-10-09 20:59           ` Andy Lutomirski
  0 siblings, 0 replies; 135+ messages in thread
From: Andy Lutomirski @ 2020-10-09 20:59 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, LKML, Aleksa Sarai,
	Andrea Arcangeli, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Oct 9, 2020 at 11:32 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 12:25 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > Is the idea that any syscall that's out of range for this (e.g. all of
> > the x32 syscalls) is unoptimized?  I'm okay with this, but I think it
> > could use a comment.
>
> Yes, any syscall number that is out of range is unoptimized. Where do
> you think I should put a comment? seccomp_cache_check_allow_bitmap
> above `if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))`,
> with something like "any syscall number out of range is unoptimized"?
>

I was imagining a comment near the new macros explaining that this is
the range of syscalls that seccomp will optimize, that behavior is
still correct (albeit slower) for out of range syscalls, and that x32
is intentionally not optimized.

This avoids people like future me reading this code, not remembering
the context, and thinking it looks buggy.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-10-09 17:14     ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
@ 2020-10-09 21:30       ` Jann Horn
  2020-10-09 22:47         ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: Jann Horn @ 2020-10-09 21:30 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>
> From: YiFei Zhu <yifeifz2@illinois.edu>
>
> SECCOMP_CACHE will only operate on syscalls that do not access
> any syscall arguments or instruction pointer. To facilitate
> this we need a static analyser to know whether a filter will
> return allow regardless of syscall arguments for a given
> architecture number / syscall number pair. This is implemented
> here with a pseudo-emulator, and stored in a per-filter bitmap.
>
> In order to build this bitmap at filter attach time, each filter is
> emulated for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
> the program returns allow, then we can be sure that the filter must
> return allow independent from syscall arguments.
>
> Nearly all seccomp filters are built from these cBPF instructions:
>
> BPF_LD  | BPF_W    | BPF_ABS
> BPF_JMP | BPF_JEQ  | BPF_K
> BPF_JMP | BPF_JGE  | BPF_K
> BPF_JMP | BPF_JGT  | BPF_K
> BPF_JMP | BPF_JSET | BPF_K
> BPF_JMP | BPF_JA
> BPF_RET | BPF_K
> BPF_ALU | BPF_AND  | BPF_K
>
> Each of these instructions are emulated. Any weirdness or loading
> from a syscall argument will cause the emulator to bail.
>
> The emulation is also halted if it reaches a return. In that case,
> if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
>
> Emulator structure and comments are from Kees [1] and Jann [2].
>
> Emulation is done at attach time. If a filter depends on more
> filters, and if the dependee does not guarantee to allow the
> syscall, then we skip the emulation of this syscall.
>
> [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
> [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
[...]
> @@ -682,6 +693,150 @@ seccomp_prepare_user_filter(const char __user *user_filter)
>         return filter;
>  }
>
> +#ifdef SECCOMP_ARCH_NATIVE
> +/**
> + * seccomp_is_const_allow - check if filter is constant allow with given data
> + * @fprog: The BPF programs
> + * @sd: The seccomp data to check against, only syscall number are arch
> + *      number are considered constant.

nit: s/syscall number are arch number/syscall number and arch number/

> + */
> +static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
> +                                  struct seccomp_data *sd)
> +{
> +       unsigned int insns;
> +       unsigned int reg_value = 0;
> +       unsigned int pc;
> +       bool op_res;
> +
> +       if (WARN_ON_ONCE(!fprog))
> +               return false;
> +
> +       insns = bpf_classic_proglen(fprog);

bpf_classic_proglen() is defined as:

#define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))

so this is wrong - what you want is the number of instructions in the
program, what you actually have is the size of the program in bytes.
Please instead check for `pc < fprog->len` in the loop condition.

> +       for (pc = 0; pc < insns; pc++) {
> +               struct sock_filter *insn = &fprog->filter[pc];
> +               u16 code = insn->code;
> +               u32 k = insn->k;
[...]

> +       }
> +
> +       /* ran off the end of the filter?! */
> +       WARN_ON(1);
> +       return false;
> +}

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path
  2020-10-09 17:14     ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
@ 2020-10-09 21:30       ` Jann Horn
  2020-10-09 23:18       ` Kees Cook
  1 sibling, 0 replies; 135+ messages in thread
From: Jann Horn @ 2020-10-09 21:30 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> The overhead of running Seccomp filters has been part of some past
> discussions [1][2][3]. Oftentimes, the filters have a large number
> of instructions that check syscall numbers one by one and jump based
> on that. Some users chain BPF filters which further enlarge the
> overhead. A recent work [6] comprehensively measures the Seccomp
> overhead and shows that the overhead is non-negligible and has a
> non-trivial impact on application performance.
>
> We observed some common filters, such as docker's [4] or
> systemd's [5], will make most decisions based only on the syscall
> numbers, and as past discussions considered, a bitmap where each bit
> represents a syscall makes most sense for these filters.
>
> The fast (common) path for seccomp should be that the filter permits
> the syscall to pass through, and failing seccomp is expected to be
> an exceptional case; it is not expected for userspace to call a
> denylisted syscall over and over.
>
> When it can be concluded that an allow must occur for the given
> architecture and syscall pair (this determination is introduced in
> the next commit), seccomp will immediately allow the syscall,
> bypassing further BPF execution.
>
> Each architecture number has its own bitmap. The architecture
> number in seccomp_data is checked against the defined architecture
> number constant before proceeding to test the bit against the
> bitmap with the syscall number as the index of the bit in the
> bitmap, and if the bit is set, seccomp returns allow. The bitmaps
> are all clear in this patch and will be initialized in the next
> commit.
[...]
> Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>

Reviewed-by: Jann Horn <jannh@google.com>

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-09 17:14     ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
@ 2020-10-09 21:45       ` Jann Horn
  2020-10-09 23:14       ` Kees Cook
  1 sibling, 0 replies; 135+ messages in thread
From: Jann Horn @ 2020-10-09 21:45 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> Currently the kernel does not provide an infrastructure to translate
> architecture numbers to a human-readable name. Translating syscall
> numbers to syscall names is possible through FTRACE_SYSCALL
> infrastructure but it does not provide support for compat syscalls.
>
> This will create a file for each PID as /proc/pid/seccomp_cache.
> The file will be empty when no seccomp filters are loaded, or be
> in the format of:
> <arch name> <decimal syscall number> <ALLOW | FILTER>
> where ALLOW means the cache is guaranteed to allow the syscall,
> and filter means the cache will pass the syscall to the BPF filter.
>
> For the docker default profile on x86_64 it looks like:
> x86_64 0 ALLOW
> x86_64 1 ALLOW
> x86_64 2 ALLOW
> x86_64 3 ALLOW
> [...]
> x86_64 132 ALLOW
> x86_64 133 ALLOW
> x86_64 134 FILTER
> x86_64 135 FILTER
> x86_64 136 FILTER
> x86_64 137 ALLOW
> x86_64 138 ALLOW
> x86_64 139 FILTER
> x86_64 140 ALLOW
> x86_64 141 ALLOW
> [...]
>
> This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
> of N because I think certain users of seccomp might not want the
> application to know which syscalls are definitely usable. For
> the same reason, it is also guarded by CAP_SYS_ADMIN.
>
> Suggested-by: Jann Horn <jannh@google.com>
> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
[...]
> diff --git a/arch/Kconfig b/arch/Kconfig
[...]
> +config SECCOMP_CACHE_DEBUG
> +       bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
> +       depends on SECCOMP
> +       depends on SECCOMP_FILTER
> +       depends on PROC_FS
> +       help
> +         This is enables /proc/pid/seccomp_cache interface to monitor

nit: s/This is enables/This enables the/

> +         seccomp cache data. The file format is subject to change. Reading
> +         the file requires CAP_SYS_ADMIN.
> +
> +         This option is for debugging only. Enabling present the risk that

nit: *presents

> +         an adversary may be able to infer the seccomp filter logic.

[...]
> +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
> +                          struct pid *pid, struct task_struct *task)
> +{
> +       struct seccomp_filter *f;
> +       unsigned long flags;
> +
> +       /*
> +        * We don't want some sandboxed process know what their seccomp

s/know/to know/

> +        * filters consist of.
> +        */
> +       if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN))
> +               return -EACCES;
> +
> +       if (!lock_task_sighand(task, &flags))
> +               return 0;

maybe return -ESRCH here so that userspace can distinguish between an
exiting process and a process with no filters?

> +       f = READ_ONCE(task->seccomp.filter);
> +       if (!f) {
> +               unlock_task_sighand(task, &flags);
> +               return 0;
> +       }
[...]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-10-09 21:30       ` Jann Horn
@ 2020-10-09 22:47         ` Kees Cook
  0 siblings, 0 replies; 135+ messages in thread
From: Kees Cook @ 2020-10-09 22:47 UTC (permalink / raw)
  To: Jann Horn
  Cc: YiFei Zhu, Linux Containers, YiFei Zhu, bpf, kernel list,
	Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

On Fri, Oct 09, 2020 at 11:30:18PM +0200, Jann Horn wrote:
> On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >
> > From: YiFei Zhu <yifeifz2@illinois.edu>
> >
> > SECCOMP_CACHE will only operate on syscalls that do not access
> > any syscall arguments or instruction pointer. To facilitate
> > this we need a static analyser to know whether a filter will
> > return allow regardless of syscall arguments for a given
> > architecture number / syscall number pair. This is implemented
> > here with a pseudo-emulator, and stored in a per-filter bitmap.
> >
> > In order to build this bitmap at filter attach time, each filter is
> > emulated for every syscall (under each possible architecture), and
> > checked for any accesses of struct seccomp_data that are not the "arch"
> > nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
> > the program returns allow, then we can be sure that the filter must
> > return allow independent from syscall arguments.
> >
> > Nearly all seccomp filters are built from these cBPF instructions:
> >
> > BPF_LD  | BPF_W    | BPF_ABS
> > BPF_JMP | BPF_JEQ  | BPF_K
> > BPF_JMP | BPF_JGE  | BPF_K
> > BPF_JMP | BPF_JGT  | BPF_K
> > BPF_JMP | BPF_JSET | BPF_K
> > BPF_JMP | BPF_JA
> > BPF_RET | BPF_K
> > BPF_ALU | BPF_AND  | BPF_K
> >
> > Each of these instructions are emulated. Any weirdness or loading
> > from a syscall argument will cause the emulator to bail.
> >
> > The emulation is also halted if it reaches a return. In that case,
> > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
> >
> > Emulator structure and comments are from Kees [1] and Jann [2].
> >
> > Emulation is done at attach time. If a filter depends on more
> > filters, and if the dependee does not guarantee to allow the
> > syscall, then we skip the emulation of this syscall.
> >
> > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
> > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
> [...]
> > @@ -682,6 +693,150 @@ seccomp_prepare_user_filter(const char __user *user_filter)
> >         return filter;
> >  }
> >
> > +#ifdef SECCOMP_ARCH_NATIVE
> > +/**
> > + * seccomp_is_const_allow - check if filter is constant allow with given data
> > + * @fprog: The BPF programs
> > + * @sd: The seccomp data to check against, only syscall number are arch
> > + *      number are considered constant.
> 
> nit: s/syscall number are arch number/syscall number and arch number/
> 
> > + */
> > +static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
> > +                                  struct seccomp_data *sd)
> > +{
> > +       unsigned int insns;
> > +       unsigned int reg_value = 0;
> > +       unsigned int pc;
> > +       bool op_res;
> > +
> > +       if (WARN_ON_ONCE(!fprog))
> > +               return false;
> > +
> > +       insns = bpf_classic_proglen(fprog);
> 
> bpf_classic_proglen() is defined as:
> 
> #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))
> 
> so this is wrong - what you want is the number of instructions in the
> program, what you actually have is the size of the program in bytes.
> Please instead check for `pc < fprog->len` in the loop condition.

Oh yes, good catch. I had this wrong in my v1.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-09 17:14     ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
  2020-10-09 21:45       ` Jann Horn
@ 2020-10-09 23:14       ` Kees Cook
  2020-10-10 13:26         ` YiFei Zhu
  1 sibling, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-10-09 23:14 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Fri, Oct 09, 2020 at 12:14:33PM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <yifeifz2@illinois.edu>
> 
> Currently the kernel does not provide an infrastructure to translate
> architecture numbers to a human-readable name. Translating syscall
> numbers to syscall names is possible through FTRACE_SYSCALL
> infrastructure but it does not provide support for compat syscalls.
> 
> This will create a file for each PID as /proc/pid/seccomp_cache.
> The file will be empty when no seccomp filters are loaded, or be
> in the format of:
> <arch name> <decimal syscall number> <ALLOW | FILTER>
> where ALLOW means the cache is guaranteed to allow the syscall,
> and filter means the cache will pass the syscall to the BPF filter.
> 
> For the docker default profile on x86_64 it looks like:
> x86_64 0 ALLOW
> x86_64 1 ALLOW
> x86_64 2 ALLOW
> x86_64 3 ALLOW
> [...]
> x86_64 132 ALLOW
> x86_64 133 ALLOW
> x86_64 134 FILTER
> x86_64 135 FILTER
> x86_64 136 FILTER
> x86_64 137 ALLOW
> x86_64 138 ALLOW
> x86_64 139 FILTER
> x86_64 140 ALLOW
> x86_64 141 ALLOW
> [...]
> 
> This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
> of N because I think certain users of seccomp might not want the
> application to know which syscalls are definitely usable. For
> the same reason, it is also guarded by CAP_SYS_ADMIN.
> 
> Suggested-by: Jann Horn <jannh@google.com>
> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> ---
>  arch/Kconfig                   | 24 ++++++++++++++
>  arch/x86/Kconfig               |  1 +
>  arch/x86/include/asm/seccomp.h |  3 ++
>  fs/proc/base.c                 |  6 ++++
>  include/linux/seccomp.h        |  5 +++
>  kernel/seccomp.c               | 59 ++++++++++++++++++++++++++++++++++
>  6 files changed, 98 insertions(+)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 21a3675a7a3a..85239a974f04 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -471,6 +471,15 @@ config HAVE_ARCH_SECCOMP_FILTER
>  	    results in the system call being skipped immediately.
>  	  - seccomp syscall wired up
>  
> +config HAVE_ARCH_SECCOMP_CACHE
> +	bool
> +	help
> +	  An arch should select this symbol if it provides all of these things:
> +	  - all the requirements for HAVE_ARCH_SECCOMP_FILTER
> +	  - SECCOMP_ARCH_NATIVE
> +	  - SECCOMP_ARCH_NATIVE_NR
> +	  - SECCOMP_ARCH_NATIVE_NAME
> +
> [...]
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 1ab22869a765..1a807f89ac77 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -150,6 +150,7 @@ config X86
>  	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
>  	select HAVE_ARCH_PREL32_RELOCATIONS
>  	select HAVE_ARCH_SECCOMP_FILTER
> +	select HAVE_ARCH_SECCOMP_CACHE
>  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>  	select HAVE_ARCH_STACKLEAK
>  	select HAVE_ARCH_TRACEHOOK

HAVE_ARCH_SECCOMP_CACHE isn't used any more. I think this was left over
from before.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path
  2020-10-09 17:14     ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
  2020-10-09 21:30       ` Jann Horn
@ 2020-10-09 23:18       ` Kees Cook
  1 sibling, 0 replies; 135+ messages in thread
From: Kees Cook @ 2020-10-09 23:18 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, YiFei Zhu, bpf, linux-kernel, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Fri, Oct 09, 2020 at 12:14:29PM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <yifeifz2@illinois.edu>
> 
> The overhead of running Seccomp filters has been part of some past
> discussions [1][2][3]. Oftentimes, the filters have a large number
> of instructions that check syscall numbers one by one and jump based
> on that. Some users chain BPF filters which further enlarge the
> overhead. A recent work [6] comprehensively measures the Seccomp
> overhead and shows that the overhead is non-negligible and has a
> non-trivial impact on application performance.
> 
> We observed some common filters, such as docker's [4] or
> systemd's [5], will make most decisions based only on the syscall
> numbers, and as past discussions considered, a bitmap where each bit
> represents a syscall makes most sense for these filters.
> 
> The fast (common) path for seccomp should be that the filter permits
> the syscall to pass through, and failing seccomp is expected to be
> an exceptional case; it is not expected for userspace to call a
> denylisted syscall over and over.
> 
> When it can be concluded that an allow must occur for the given
> architecture and syscall pair (this determination is introduced in
> the next commit), seccomp will immediately allow the syscall,
> bypassing further BPF execution.
> 
> Each architecture number has its own bitmap. The architecture
> number in seccomp_data is checked against the defined architecture
> number constant before proceeding to test the bit against the
> bitmap with the syscall number as the index of the bit in the
> bitmap, and if the bit is set, seccomp returns allow. The bitmaps
> are all clear in this patch and will be initialized in the next
> commit.
> 
> [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
> [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
> [3] https://github.com/seccomp/libseccomp/issues/116
> [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
> [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
> [6] Draco: Architectural and Operating System Support for System Call Security
>     https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
> 
> Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> ---
>  kernel/seccomp.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 72 insertions(+)
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index ae6b40cc39f4..73f6b6e9a3b0 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -143,6 +143,34 @@ struct notification {
>  	struct list_head notifications;
>  };
>  
> +#ifdef SECCOMP_ARCH_NATIVE
> +/**
> + * struct action_cache - per-filter cache of seccomp actions per
> + * arch/syscall pair
> + *
> + * @allow_native: A bitmap where each bit represents whether the
> + *		  filter will always allow the syscall, for the
> + *		  native architecture.
> + * @allow_compat: A bitmap where each bit represents whether the
> + *		  filter will always allow the syscall, for the
> + *		  compat architecture.
> + */
> +struct action_cache {
> +	DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR);
> +#ifdef SECCOMP_ARCH_COMPAT
> +	DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR);
> +#endif
> +};
> +#else
> +struct action_cache { };
> +
> +static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
> +					     const struct seccomp_data *sd)
> +{
> +	return false;
> +}
> +#endif /* SECCOMP_ARCH_NATIVE */
> +
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
>   *
> @@ -298,6 +326,47 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>  	return 0;
>  }
>  
> +#ifdef SECCOMP_ARCH_NATIVE
> +static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap,
> +						    size_t bitmap_size,
> +						    int syscall_nr)
> +{
> +	if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))
> +		return false;
> +	syscall_nr = array_index_nospec(syscall_nr, bitmap_size);
> +
> +	return test_bit(syscall_nr, bitmap);
> +}
> +
> +/**
> + * seccomp_cache_check_allow - lookup seccomp cache
> + * @sfilter: The seccomp filter
> + * @sd: The seccomp data to lookup the cache with
> + *
> + * Returns true if the seccomp_data is cached and allowed.
> + */
> +static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
> +					     const struct seccomp_data *sd)
> +{
> +	int syscall_nr = sd->nr;
> +	const struct action_cache *cache = &sfilter->cache;
> +
> +	if (likely(sd->arch == SECCOMP_ARCH_NATIVE))
> +		return seccomp_cache_check_allow_bitmap(cache->allow_native,
> +							SECCOMP_ARCH_NATIVE_NR,
> +							syscall_nr);
> +#ifdef SECCOMP_ARCH_COMPAT
> +	if (likely(sd->arch == SECCOMP_ARCH_COMPAT))
> +		return seccomp_cache_check_allow_bitmap(cache->allow_compat,
> +							SECCOMP_ARCH_COMPAT_NR,
> +							syscall_nr);
> +#endif /* SECCOMP_ARCH_COMPAT */
> +
> +	WARN_ON_ONCE(true);
> +	return false;
> +}
> +#endif /* SECCOMP_ARCH_NATIVE */

A small optimization for the non-compat case might be to do this to
avoid the sd->arch test (which should have no way to ever change in such
builds):

static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
                                             const struct seccomp_data *sd)
{
        const struct action_cache *cache = &sfilter->cache;

#ifndef SECCOMP_ARCH_COMPAT
        /* A native-only architecture doesn't need to check sd->arch. */
        return seccomp_cache_check_allow_bitmap(cache->allow_native,
                                                SECCOMP_ARCH_NATIVE_NR,
                                                sd->nr);
#else /* SECCOMP_ARCH_COMPAT */
        if (likely(sd->arch == SECCOMP_ARCH_NATIVE))
                return seccomp_cache_check_allow_bitmap(cache->allow_native,
                                                        SECCOMP_ARCH_NATIVE_NR,
                                                        sd->nr);
        if (likely(sd->arch == SECCOMP_ARCH_COMPAT))
                return seccomp_cache_check_allow_bitmap(cache->allow_compat,
                                                        SECCOMP_ARCH_COMPAT_NR,
                                                        sd->nr);
#endif

        WARN_ON_ONCE(true);
        return false;
}

> +
>  /**
>   * seccomp_run_filters - evaluates all seccomp filters against @sd
>   * @sd: optional seccomp data to be passed to filters
> @@ -320,6 +389,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
>  	if (WARN_ON(f == NULL))
>  		return SECCOMP_RET_KILL_PROCESS;
>  
> +	if (seccomp_cache_check_allow(f, sd))
> +		return SECCOMP_RET_ALLOW;
> +
>  	/*
>  	 * All filters in the list are evaluated and the lowest BPF return
>  	 * value always takes priority (ignoring the DATA).
> -- 
> 2.28.0
> 

This is all looking good; thank you! I'm doing some test builds/runs
now. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-09 23:14       ` Kees Cook
@ 2020-10-10 13:26         ` YiFei Zhu
  2020-10-12 22:57           ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-10 13:26 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Fri, Oct 9, 2020 at 6:14 PM Kees Cook <keescook@chromium.org> wrote:
> HAVE_ARCH_SECCOMP_CACHE isn't used any more. I think this was left over
> from before.

Oh, I was meant to add this to the dependencies of
SECCOMP_CACHE_DEBUG. Is this something that would make sense?

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results
  2020-10-09 17:14   ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
                       ` (4 preceding siblings ...)
  2020-10-09 17:14     ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
@ 2020-10-11 15:47     ` YiFei Zhu
  2020-10-11 15:47       ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
                         ` (4 more replies)
  5 siblings, 5 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/

Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.

This series adds a bitmap to cache seccomp filter results if the
result permits a syscall and is independent of syscall arguments.
This visibly decreases seccomp overhead for most common seccomp
filters with very little memory footprint.

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters, which further increases the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.

We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
number, and, as past discussions concluded, a bitmap where each bit
represents a syscall makes the most sense for these filters.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent from syscall arguments.

When it is concluded that an allow must occur for the given
architecture and syscall pair, seccomp will immediately allow
the syscall, bypassing further BPF execution.

Ongoing work is to further support arguments with fast hash table
lookups. We are investigating the performance of doing so [6], and how
to best integrate with the existing seccomp infrastructure.

Some benchmarks are performed with results in patch 5, copied below:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Benchmarking 200000000 syscalls...
  129.359381409 - 0.008724424 = 129350656985 (129.4s)
  getpid native: 646 ns
  264.385890006 - 129.360453229 = 135025436777 (135.0s)
  getpid RET_ALLOW 1 filter (bitmap): 675 ns
  399.400511893 - 264.387045901 = 135013465992 (135.0s)
  getpid RET_ALLOW 2 filters (bitmap): 675 ns
  545.872866260 - 399.401718327 = 146471147933 (146.5s)
  getpid RET_ALLOW 3 filters (full): 732 ns
  696.337101319 - 545.874097681 = 150463003638 (150.5s)
  getpid RET_ALLOW 4 filters (full): 752 ns
  Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
  Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
  Estimated total seccomp overhead for 3 full filters: 86 ns
  Estimated total seccomp overhead for 4 full filters: 106 ns
  Estimated seccomp entry overhead: 29 ns
  Estimated seccomp per-filter overhead (last 2 diff): 20 ns
  Estimated seccomp per-filter overhead (filters / 4): 19 ns
  Expectations:
  	native ≤ 1 bitmap (646 ≤ 675): ✔️
  	native ≤ 1 filter (646 ≤ 732): ✔️
  	per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
  	1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
  	entry ≈ 1 bitmapped (29 ≈ 29): ✔️
  	entry ≈ 2 bitmapped (29 ≈ 29): ✔️
  	native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

v4 -> v5:
* Typo and wording fixes
* Skip arch number test when there are only one arch
* Fixed prog instruction number check.
* Added comment about the behavior of x32.
* /proc/pid/seccomp_cache now returns -ESRCH for an exiting process.
* Fixed /proc/pid/seccomp_cache to depend on the architecture support.
* Fixed struct seq_file visibility reported by kernel test robot.

v3 -> v4:
* Reordered patches
* Naming changes
* Fixed racing in /proc/pid/seccomp_cache against filter being released
  from task, using Jann's suggestion of sighand spinlock.
* Cache no longer configurable.
* Copied some description from cover letter to commit messages.
* Used Kees's logic to set clear bits from bitmap, rather than set bits.

v2 -> v3:
* Added array_index_nospec guards
* No more syscall_arches[] array or reliance on loop unrolling. Arches
  are configured with per-arch seccomp.h.
* Moved filter emulation to attach time (from prepare time).
* Further simplified emulator, basing on Kees's code.
* Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN.

v1 -> v2:
* Corrected one outdated function documentation.

RFC -> v1:
* Config enabled by default across all arches that could support it.
* Added an arch numbers array; the filter is emulated for each arch
  number, with a per-arch bitmap.
* Massively simplified the emulator so it would only support the common
  instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL
  during prepare).
* Stole the selftest from Kees.
* Added a /proc/pid/seccomp_cache by Jann's suggestion.

Patch 1 implements the test_bit lookup against the bitmaps.

Patch 2 implements the emulator that determines whether a filter must return allow.

Patch 3 adds the arch macros for x86.

Patch 4 updates the selftest to better show the new semantics.

Patch 5 implements /proc/pid/seccomp_cache.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020

Kees Cook (2):
  x86: Enable seccomp architecture tracking
  selftests/seccomp: Compare bitmap vs filter overhead

YiFei Zhu (3):
  seccomp/cache: Lookup syscall allowlist bitmap for fast path
  seccomp/cache: Add "emulator" to check if filter is constant allow
  seccomp/cache: Report cache data through /proc/pid/seccomp_cache

 arch/Kconfig                                  |  24 ++
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/seccomp.h                |  20 ++
 fs/proc/base.c                                |   6 +
 include/linux/seccomp.h                       |   7 +
 kernel/seccomp.c                              | 292 +++++++++++++++++-
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++--
 tools/testing/selftests/seccomp/settings      |   2 +-
 8 files changed, 479 insertions(+), 24 deletions(-)

--
2.28.0

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path
  2020-10-11 15:47     ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
@ 2020-10-11 15:47       ` YiFei Zhu
  2020-10-12  6:42         ` Jann Horn
  2020-10-11 15:47       ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters, which further increases the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.

We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
number, and, as past discussions concluded, a bitmap where each bit
represents a syscall makes the most sense for these filters.

The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.

When it can be concluded that an allow must occur for the given
architecture and syscall pair (this determination is introduced in
the next commit), seccomp will immediately allow the syscall,
bypassing further BPF execution.

Each architecture number has its own bitmap. The architecture
number in seccomp_data is checked against the defined architecture
number constant, then the syscall number is used as the index of
the bit in the bitmap; if that bit is set, seccomp returns allow.
The bitmaps are all clear in this patch and will be initialized in
the next commit.

When only one architecture exists, the check against the architecture
number is skipped, as suggested by Kees Cook [7].

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
[7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ae6b40cc39f4..d67a8b61f2bf 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,6 +143,34 @@ struct notification {
 	struct list_head notifications;
 };
 
+#ifdef SECCOMP_ARCH_NATIVE
+/**
+ * struct action_cache - per-filter cache of seccomp actions per
+ * arch/syscall pair
+ *
+ * @allow_native: A bitmap where each bit represents whether the
+ *		  filter will always allow the syscall, for the
+ *		  native architecture.
+ * @allow_compat: A bitmap where each bit represents whether the
+ *		  filter will always allow the syscall, for the
+ *		  compat architecture.
+ */
+struct action_cache {
+	DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR);
+#ifdef SECCOMP_ARCH_COMPAT
+	DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR);
+#endif
+};
+#else
+struct action_cache { };
+
+static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
+					     const struct seccomp_data *sd)
+{
+	return false;
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -298,6 +326,52 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 	return 0;
 }
 
+#ifdef SECCOMP_ARCH_NATIVE
+static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap,
+						    size_t bitmap_size,
+						    int syscall_nr)
+{
+	if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))
+		return false;
+	syscall_nr = array_index_nospec(syscall_nr, bitmap_size);
+
+	return test_bit(syscall_nr, bitmap);
+}
+
+/**
+ * seccomp_cache_check_allow - lookup seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to lookup the cache with
+ *
+ * Returns true if the seccomp_data is cached and allowed.
+ */
+static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
+					     const struct seccomp_data *sd)
+{
+	int syscall_nr = sd->nr;
+	const struct action_cache *cache = &sfilter->cache;
+
+#ifndef SECCOMP_ARCH_COMPAT
+	/* A native-only architecture doesn't need to check sd->arch. */
+	return seccomp_cache_check_allow_bitmap(cache->allow_native,
+						SECCOMP_ARCH_NATIVE_NR,
+						syscall_nr);
+#else
+	if (likely(sd->arch == SECCOMP_ARCH_NATIVE))
+		return seccomp_cache_check_allow_bitmap(cache->allow_native,
+							SECCOMP_ARCH_NATIVE_NR,
+							syscall_nr);
+	if (likely(sd->arch == SECCOMP_ARCH_COMPAT))
+		return seccomp_cache_check_allow_bitmap(cache->allow_compat,
+							SECCOMP_ARCH_COMPAT_NR,
+							syscall_nr);
+#endif /* SECCOMP_ARCH_COMPAT */
+
+	WARN_ON_ONCE(true);
+	return false;
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
 /**
  * seccomp_run_filters - evaluates all seccomp filters against @sd
  * @sd: optional seccomp data to be passed to filters
@@ -320,6 +394,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 	if (WARN_ON(f == NULL))
 		return SECCOMP_RET_KILL_PROCESS;
 
+	if (seccomp_cache_check_allow(f, sd))
+		return SECCOMP_RET_ALLOW;
+
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-10-11 15:47     ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  2020-10-11 15:47       ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
@ 2020-10-11 15:47       ` YiFei Zhu
  2020-10-12  6:46         ` Jann Horn
  2020-10-11 15:47       ` [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
                         ` (2 subsequent siblings)
  4 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

SECCOMP_CACHE will only operate on syscalls whose filtering does
not access any syscall argument or the instruction pointer. To
facilitate this, we need a static analyser to know whether a filter
will return allow regardless of syscall arguments for a given
architecture number / syscall number pair. This is implemented
here with a pseudo-emulator, and the result stored in a per-filter
bitmap.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent from syscall arguments.

Nearly all seccomp filters are built from these cBPF instructions:

BPF_LD  | BPF_W    | BPF_ABS
BPF_JMP | BPF_JEQ  | BPF_K
BPF_JMP | BPF_JGE  | BPF_K
BPF_JMP | BPF_JGT  | BPF_K
BPF_JMP | BPF_JSET | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K
BPF_ALU | BPF_AND  | BPF_K

Each of these instructions is emulated. Any weirdness or a load
from a syscall argument will cause the emulator to bail.

The emulation is also halted if it reaches a return. In that case,
if it returns SECCOMP_RET_ALLOW, the syscall is marked as good.

Emulator structure and comments are from Kees [1] and Jann [2].

Emulation is done at attach time. If a filter is stacked on top of
prior filters, and a prior filter does not guarantee to allow the
syscall, then we skip emulating this syscall.

[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/

Suggested-by: Jann Horn <jannh@google.com>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 155 insertions(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index d67a8b61f2bf..236e7b367d4e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -169,6 +169,10 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte
 {
 	return false;
 }
+
+static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+}
 #endif /* SECCOMP_ARCH_NATIVE */
 
 /**
@@ -187,6 +191,7 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte
  *	   this filter after reaching 0. The @users count is always smaller
  *	   or equal to @refs. Hence, reaching 0 for @users does not mean
  *	   the filter can be freed.
+ * @cache: cache of arch/syscall mappings to actions
  * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
  * @prev: points to a previously installed, or inherited, filter
  * @prog: the BPF program to evaluate
@@ -208,6 +213,7 @@ struct seccomp_filter {
 	refcount_t refs;
 	refcount_t users;
 	bool log;
+	struct action_cache cache;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
 	struct notification *notif;
@@ -621,7 +627,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 {
 	struct seccomp_filter *sfilter;
 	int ret;
-	const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
+	const bool save_orig =
+#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE)
+		true;
+#else
+		false;
+#endif
 
 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
 		return ERR_PTR(-EINVAL);
@@ -687,6 +698,148 @@ seccomp_prepare_user_filter(const char __user *user_filter)
 	return filter;
 }
 
+#ifdef SECCOMP_ARCH_NATIVE
+/**
+ * seccomp_is_const_allow - check if filter is constant allow with given data
+ * @fprog: The BPF programs
+ * @sd: The seccomp data to check against, only syscall number and arch
+ *      number are considered constant.
+ */
+static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
+				   struct seccomp_data *sd)
+{
+	unsigned int reg_value = 0;
+	unsigned int pc;
+	bool op_res;
+
+	if (WARN_ON_ONCE(!fprog))
+		return false;
+
+	for (pc = 0; pc < fprog->len; pc++) {
+		struct sock_filter *insn = &fprog->filter[pc];
+		u16 code = insn->code;
+		u32 k = insn->k;
+
+		switch (code) {
+		case BPF_LD | BPF_W | BPF_ABS:
+			switch (k) {
+			case offsetof(struct seccomp_data, nr):
+				reg_value = sd->nr;
+				break;
+			case offsetof(struct seccomp_data, arch):
+				reg_value = sd->arch;
+				break;
+			default:
+				/* can't optimize (non-constant value load) */
+				return false;
+			}
+			break;
+		case BPF_RET | BPF_K:
+			/* reached return with constant values only, check allow */
+			return k == SECCOMP_RET_ALLOW;
+		case BPF_JMP | BPF_JA:
+			pc += insn->k;
+			break;
+		case BPF_JMP | BPF_JEQ | BPF_K:
+		case BPF_JMP | BPF_JGE | BPF_K:
+		case BPF_JMP | BPF_JGT | BPF_K:
+		case BPF_JMP | BPF_JSET | BPF_K:
+			switch (BPF_OP(code)) {
+			case BPF_JEQ:
+				op_res = reg_value == k;
+				break;
+			case BPF_JGE:
+				op_res = reg_value >= k;
+				break;
+			case BPF_JGT:
+				op_res = reg_value > k;
+				break;
+			case BPF_JSET:
+				op_res = !!(reg_value & k);
+				break;
+			default:
+				/* can't optimize (unknown jump) */
+				return false;
+			}
+
+			pc += op_res ? insn->jt : insn->jf;
+			break;
+		case BPF_ALU | BPF_AND | BPF_K:
+			reg_value &= k;
+			break;
+		default:
+			/* can't optimize (unknown insn) */
+			return false;
+		}
+	}
+
+	/* ran off the end of the filter?! */
+	WARN_ON(1);
+	return false;
+}
+
+static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
+					 void *bitmap, const void *bitmap_prev,
+					 size_t bitmap_size, int arch)
+{
+	struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
+	struct seccomp_data sd;
+	int nr;
+
+	if (bitmap_prev) {
+		/* The new filter must be as restrictive as the last. */
+		bitmap_copy(bitmap, bitmap_prev, bitmap_size);
+	} else {
+		/* Before any filters, all syscalls are always allowed. */
+		bitmap_fill(bitmap, bitmap_size);
+	}
+
+	for (nr = 0; nr < bitmap_size; nr++) {
+		/* No bitmap change: not a cacheable action. */
+		if (!test_bit(nr, bitmap))
+			continue;
+
+		sd.nr = nr;
+		sd.arch = arch;
+
+		/* No bitmap change: continue to always allow. */
+		if (seccomp_is_const_allow(fprog, &sd))
+			continue;
+
+		/*
+		 * Not a cacheable action: always run filters.
+		 * atomic clear_bit() not needed, filter not visible yet.
+		 */
+		__clear_bit(nr, bitmap);
+	}
+}
+
+/**
> + * seccomp_cache_prepare - emulate the filter to find cacheable syscalls
> + * @sfilter: The seccomp filter
+ */
+static void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+	struct action_cache *cache = &sfilter->cache;
+	const struct action_cache *cache_prev =
+		sfilter->prev ? &sfilter->prev->cache : NULL;
+
+	seccomp_cache_prepare_bitmap(sfilter, cache->allow_native,
+				     cache_prev ? cache_prev->allow_native : NULL,
+				     SECCOMP_ARCH_NATIVE_NR,
+				     SECCOMP_ARCH_NATIVE);
+
+#ifdef SECCOMP_ARCH_COMPAT
+	seccomp_cache_prepare_bitmap(sfilter, cache->allow_compat,
+				     cache_prev ? cache_prev->allow_compat : NULL,
+				     SECCOMP_ARCH_COMPAT_NR,
+				     SECCOMP_ARCH_COMPAT);
+#endif /* SECCOMP_ARCH_COMPAT */
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
 /**
  * seccomp_attach_filter: validate and attach filter
  * @flags:  flags to change filter behavior
@@ -736,6 +889,7 @@ static long seccomp_attach_filter(unsigned int flags,
 	 * task reference.
 	 */
 	filter->prev = current->seccomp.filter;
+	seccomp_cache_prepare(filter);
 	current->seccomp.filter = filter;
 	atomic_inc(&current->seccomp.filter_count);
 
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking
  2020-10-11 15:47     ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
  2020-10-11 15:47       ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
  2020-10-11 15:47       ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
@ 2020-10-11 15:47       ` YiFei Zhu
  2020-10-11 15:47       ` [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
  2020-10-11 15:47       ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
  4 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: Kees Cook <keescook@chromium.org>

Provide seccomp internals with the details to calculate which syscall
table the running kernel is expecting to deal with. This allows for
efficient architecture pinning and paves the way for constant-action
bitmaps.

Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/x86/include/asm/seccomp.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 2bd1338de236..b17d037c72ce 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -16,6 +16,23 @@
 #define __NR_seccomp_sigreturn_32	__NR_ia32_sigreturn
 #endif
 
+#ifdef CONFIG_X86_64
+# define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_X86_64
+# define SECCOMP_ARCH_NATIVE_NR		NR_syscalls
+# ifdef CONFIG_COMPAT
+#  define SECCOMP_ARCH_COMPAT		AUDIT_ARCH_I386
+#  define SECCOMP_ARCH_COMPAT_NR	IA32_NR_syscalls
+# endif
+/*
+ * x32 will have __X32_SYSCALL_BIT set in syscall number. We don't support
+ * caching them and they are treated as out of range syscalls, which will
+ * always pass through the BPF filter.
+ */
+#else /* !CONFIG_X86_64 */
+# define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_I386
+# define SECCOMP_ARCH_NATIVE_NR	        NR_syscalls
+#endif
+
 #include <asm-generic/seccomp.h>
 
 #endif /* _ASM_X86_SECCOMP_H */
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead
  2020-10-11 15:47     ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
                         ` (2 preceding siblings ...)
  2020-10-11 15:47       ` [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
@ 2020-10-11 15:47       ` YiFei Zhu
  2020-10-11 15:47       ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
  4 siblings, 0 replies; 135+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: Kees Cook <keescook@chromium.org>

As part of the seccomp benchmarking, include the expectations with
regard to the timing behavior of the constant action bitmaps, and report
inconsistencies better.

Example output with constant action bitmaps on x86:

$ sudo ./seccomp_benchmark 100000000
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
	native ≤ 1 bitmap (646 ≤ 675): ✔️
	native ≤ 1 filter (646 ≤ 732): ✔️
	per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
	1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
	entry ≈ 1 bitmapped (29 ≈ 29): ✔️
	entry ≈ 2 bitmapped (29 ≈ 29): ✔️
	native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

Signed-off-by: Kees Cook <keescook@chromium.org>
[YiFei: Changed commit message to show stats for this patch series]
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++++++++++---
 tools/testing/selftests/seccomp/settings      |   2 +-
 2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 91f5a89cadac..fcc806585266 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,12 +4,16 @@
  */
 #define _GNU_SOURCE
 #include <assert.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <unistd.h>
 #include <linux/filter.h>
 #include <linux/seccomp.h>
+#include <sys/param.h>
 #include <sys/prctl.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
@@ -70,18 +74,74 @@ unsigned long long calibrate(void)
 	return samples * seconds;
 }
 
+bool approx(int i_one, int i_two)
+{
+	double one = i_one, one_bump = one * 0.01;
+	double two = i_two, two_bump = two * 0.01;
+
+	one_bump = one + MAX(one_bump, 2.0);
+	two_bump = two + MAX(two_bump, 2.0);
+
+	/* Equal to, or within 1% or 2 digits */
+	if (one == two ||
+	    (one > two && one <= two_bump) ||
+	    (two > one && two <= one_bump))
+		return true;
+	return false;
+}
+
+bool le(int i_one, int i_two)
+{
+	if (i_one <= i_two)
+		return true;
+	return false;
+}
+
+long compare(const char *name_one, const char *name_eval, const char *name_two,
+	     unsigned long long one, bool (*eval)(int, int), unsigned long long two)
+{
+	bool good;
+
+	printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two,
+	       (long long)one, name_eval, (long long)two);
+	if (one > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)one);
+		return 1;
+	}
+	if (two > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)two);
+		return 1;
+	}
+
+	good = eval(one, two);
+	printf("%s\n", good ? "✔️" : "❌");
+
+	return good ? 0 : 1;
+}
+
 int main(int argc, char *argv[])
 {
+	struct sock_filter bitmap_filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog bitmap_prog = {
+		.len = (unsigned short)ARRAY_SIZE(bitmap_filter),
+		.filter = bitmap_filter,
+	};
 	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])),
 		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
 	};
 	struct sock_fprog prog = {
 		.len = (unsigned short)ARRAY_SIZE(filter),
 		.filter = filter,
 	};
-	long ret;
-	unsigned long long samples;
-	unsigned long long native, filter1, filter2;
+
+	long ret, bits;
+	unsigned long long samples, calc;
+	unsigned long long native, filter1, filter2, bitmap1, bitmap2;
+	unsigned long long entry, per_filter1, per_filter2;
 
 	printf("Current BPF sysctl settings:\n");
 	system("sysctl net.core.bpf_jit_enable");
@@ -101,35 +161,82 @@ int main(int argc, char *argv[])
 	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
 	assert(ret == 0);
 
-	/* One filter */
-	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
+	/* One filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
 	assert(ret == 0);
 
-	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1);
+	bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1);
+
+	/* Second filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	if (filter1 == native)
-		printf("No overhead measured!? Try running again with more samples.\n");
+	bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2);
 
-	/* Two filters */
+	/* Third filter, can no longer be converted to bitmap */
 	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
 	assert(ret == 0);
 
-	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2);
-
-	/* Calculations */
-	printf("Estimated total seccomp overhead for 1 filter: %llu ns\n",
-		filter1 - native);
+	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1);
 
-	printf("Estimated total seccomp overhead for 2 filters: %llu ns\n",
-		filter2 - native);
+	/* Fourth filter, cannot be converted to bitmap because of filter 3 */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	printf("Estimated seccomp per-filter overhead: %llu ns\n",
-		filter2 - filter1);
+	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2);
+
+	/* Estimations */
+#define ESTIMATE(fmt, var, what)	do {			\
+		var = (what);					\
+		printf("Estimated " fmt ": %llu ns\n", var);	\
+		if (var > INT_MAX)				\
+			goto more_samples;			\
+	} while (0)
+
+	ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc,
+		 bitmap1 - native);
+	ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc,
+		 bitmap2 - native);
+	ESTIMATE("total seccomp overhead for 3 full filters", calc,
+		 filter1 - native);
+	ESTIMATE("total seccomp overhead for 4 full filters", calc,
+		 filter2 - native);
+	ESTIMATE("seccomp entry overhead", entry,
+		 bitmap1 - native - (bitmap2 - bitmap1));
+	ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1,
+		 filter2 - filter1);
+	ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2,
+		 (filter2 - native - entry) / 4);
+
+	printf("Expectations:\n");
+	ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1);
+	bits = compare("native", "≤", "1 filter", native, le, filter1);
+	if (bits)
+		goto more_samples;
+
+	ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)",
+			per_filter1, approx, per_filter2);
+
+	bits = compare("1 bitmapped", "≈", "2 bitmapped",
+			bitmap1 - native, approx, bitmap2 - native);
+	if (bits) {
+		printf("Skipping constant action bitmap expectations: they appear unsupported.\n");
+		goto out;
+	}
 
-	printf("Estimated seccomp entry overhead: %llu ns\n",
-		filter1 - native - (filter2 - filter1));
+	ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native);
+	ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native);
+	ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total",
+			entry + (per_filter1 * 4) + native, approx, filter2);
+	if (ret == 0)
+		goto out;
 
+more_samples:
+	printf("Saw unexpected benchmark result. Try running again with more samples?\n");
+out:
 	return 0;
 }
diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings
index ba4d85f74cd6..6091b45d226b 100644
--- a/tools/testing/selftests/seccomp/settings
+++ b/tools/testing/selftests/seccomp/settings
@@ -1 +1 @@
-timeout=90
+timeout=120
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-11 15:47     ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
                         ` (3 preceding siblings ...)
  2020-10-11 15:47       ` [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
@ 2020-10-11 15:47       ` YiFei Zhu
  2020-10-12  6:49         ` Jann Horn
  4 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-11 15:47 UTC (permalink / raw)
  To: containers
  Cc: YiFei Zhu, bpf, linux-kernel, Aleksa Sarai, Andrea Arcangeli,
	Andy Lutomirski, David Laight, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jack Chen, Jann Horn,
	Josep Torrellas, Kees Cook, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tycho Andersen, Valentin Rothberg, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Currently the kernel does not provide infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through the FTRACE_SYSCALL
infrastructure, but it does not support compat syscalls.

This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<arch name> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and FILTER means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:
x86_64 0 ALLOW
x86_64 1 ALLOW
x86_64 2 ALLOW
x86_64 3 ALLOW
[...]
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
x86_64 138 ALLOW
x86_64 139 FILTER
x86_64 140 ALLOW
x86_64 141 ALLOW
[...]

This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable. For
the same reason, it is also guarded by CAP_SYS_ADMIN.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig                   | 24 ++++++++++++++
 arch/x86/Kconfig               |  1 +
 arch/x86/include/asm/seccomp.h |  3 ++
 fs/proc/base.c                 |  6 ++++
 include/linux/seccomp.h        |  7 ++++
 kernel/seccomp.c               | 59 ++++++++++++++++++++++++++++++++++
 6 files changed, 100 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 21a3675a7a3a..6157c3ce0662 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -471,6 +471,15 @@ config HAVE_ARCH_SECCOMP_FILTER
 	    results in the system call being skipped immediately.
 	  - seccomp syscall wired up
 
+config HAVE_ARCH_SECCOMP_CACHE
+	bool
+	help
+	  An arch should select this symbol if it provides all of these things:
+	  - all the requirements for HAVE_ARCH_SECCOMP_FILTER
+	  - SECCOMP_ARCH_NATIVE
+	  - SECCOMP_ARCH_NATIVE_NR
+	  - SECCOMP_ARCH_NATIVE_NAME
+
 config SECCOMP
 	prompt "Enable seccomp to safely execute untrusted bytecode"
 	def_bool y
@@ -498,6 +507,21 @@ config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+config SECCOMP_CACHE_DEBUG
+	bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
+	depends on SECCOMP
+	depends on SECCOMP_FILTER && HAVE_ARCH_SECCOMP_CACHE
+	depends on PROC_FS
+	help
+	  This enables the /proc/pid/seccomp_cache interface to monitor
+	  seccomp cache data. The file format is subject to change. Reading
+	  the file requires CAP_SYS_ADMIN.
+
+	  This option is for debugging only. Enabling presents the risk that
+	  an adversary may be able to infer the seccomp filter logic.
+
+	  If unsure, say N.
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1ab22869a765..1a807f89ac77 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -150,6 +150,7 @@ config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_SECCOMP_CACHE
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index b17d037c72ce..fef16e398161 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -19,9 +19,11 @@
 #ifdef CONFIG_X86_64
 # define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_X86_64
 # define SECCOMP_ARCH_NATIVE_NR		NR_syscalls
+# define SECCOMP_ARCH_NATIVE_NAME	"x86_64"
 # ifdef CONFIG_COMPAT
 #  define SECCOMP_ARCH_COMPAT		AUDIT_ARCH_I386
 #  define SECCOMP_ARCH_COMPAT_NR	IA32_NR_syscalls
+#  define SECCOMP_ARCH_COMPAT_NAME	"ia32"
 # endif
 /*
  * x32 will have __X32_SYSCALL_BIT set in syscall number. We don't support
@@ -31,6 +33,7 @@
 #else /* !CONFIG_X86_64 */
 # define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_I386
 # define SECCOMP_ARCH_NATIVE_NR	        NR_syscalls
+# define SECCOMP_ARCH_NATIVE_NAME	"ia32"
 #endif
 
 #include <asm-generic/seccomp.h>
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617db4e0faa0..a4990410ff05 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
 #endif
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3587,6 +3590,9 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
 #endif
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..76963ec4641a 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -121,4 +121,11 @@ static inline long seccomp_get_metadata(struct task_struct *task,
 	return -EINVAL;
 }
 #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+struct seq_file;
+
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+			   struct pid *pid, struct task_struct *task);
+#endif
 #endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 236e7b367d4e..1df2fac281da 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -553,6 +553,9 @@ void seccomp_filter_release(struct task_struct *tsk)
 {
 	struct seccomp_filter *orig = tsk->seccomp.filter;
 
+	/* We are effectively holding the siglock by not having any sighand. */
+	WARN_ON(tsk->sighand != NULL);
+
 	/* Detach task from its filter tree. */
 	tsk->seccomp.filter = NULL;
 	__seccomp_filter_release(orig);
@@ -2311,3 +2314,59 @@ static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+/* Currently CONFIG_SECCOMP_CACHE_DEBUG implies SECCOMP_ARCH_NATIVE */
+static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name,
+					const void *bitmap, size_t bitmap_size)
+{
+	int nr;
+
+	for (nr = 0; nr < bitmap_size; nr++) {
+		bool cached = test_bit(nr, bitmap);
+		char *status = cached ? "ALLOW" : "FILTER";
+
+		seq_printf(m, "%s %d %s\n", name, nr, status);
+	}
+}
+
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+			   struct pid *pid, struct task_struct *task)
+{
+	struct seccomp_filter *f;
+	unsigned long flags;
+
+	/*
+	 * We don't want some sandboxed process to know what their seccomp
+	 * filters consist of.
+	 */
+	if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (!lock_task_sighand(task, &flags))
+		return -ESRCH;
+
+	f = READ_ONCE(task->seccomp.filter);
+	if (!f) {
+		unlock_task_sighand(task, &flags);
+		return 0;
+	}
+
+	/* prevent filter from being freed while we are printing it */
+	__get_seccomp_filter(f);
+	unlock_task_sighand(task, &flags);
+
+	proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_NATIVE_NAME,
+				    f->cache.allow_native,
+				    SECCOMP_ARCH_NATIVE_NR);
+
+#ifdef SECCOMP_ARCH_COMPAT
+	proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME,
+				    f->cache.allow_compat,
+				    SECCOMP_ARCH_COMPAT_NR);
+#endif /* SECCOMP_ARCH_COMPAT */
+
+	__put_seccomp_filter(f);
+	return 0;
+}
+#endif /* CONFIG_SECCOMP_CACHE_DEBUG */
-- 
2.28.0


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path
  2020-10-11 15:47       ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
@ 2020-10-12  6:42         ` Jann Horn
  0 siblings, 0 replies; 135+ messages in thread
From: Jann Horn @ 2020-10-12  6:42 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Sun, Oct 11, 2020 at 5:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> The overhead of running Seccomp filters has been part of some past
> discussions [1][2][3]. Oftentimes, the filters have a large number
> of instructions that check syscall numbers one by one and jump based
> on that. Some users chain BPF filters which further enlarge the
> overhead. A recent work [6] comprehensively measures the Seccomp
> overhead and shows that the overhead is non-negligible and has a
> non-trivial impact on application performance.
>
> We observed some common filters, such as docker's [4] or
> systemd's [5], will make most decisions based only on the syscall
> numbers, and as past discussions considered, a bitmap where each bit
> represents a syscall makes most sense for these filters.
>
> The fast (common) path for seccomp should be that the filter permits
> the syscall to pass through, and failing seccomp is expected to be
> an exceptional case; it is not expected for userspace to call a
> denylisted syscall over and over.
>
> When it can be concluded that an allow must occur for the given
> architecture and syscall pair (this determination is introduced in
> the next commit), seccomp will immediately allow the syscall,
> bypassing further BPF execution.
>
> Each architecture number has its own bitmap. The architecture
> number in seccomp_data is checked against the defined architecture
> number constant before proceeding to test the bit against the
> bitmap with the syscall number as the index of the bit in the
> bitmap, and if the bit is set, seccomp returns allow. The bitmaps
> are all clear in this patch and will be initialized in the next
> commit.
>
> When only one architecture exists, the check against architecture
> number is skipped, suggested by Kees Cook [7].
>
> [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
> [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
> [3] https://github.com/seccomp/libseccomp/issues/116
> [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
> [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
> [6] Draco: Architectural and Operating System Support for System Call Security
>     https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
> [7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/
>
> Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>

Reviewed-by: Jann Horn <jannh@google.com>

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow
  2020-10-11 15:47       ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
@ 2020-10-12  6:46         ` Jann Horn
  0 siblings, 0 replies; 135+ messages in thread
From: Jann Horn @ 2020-10-12  6:46 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Sun, Oct 11, 2020 at 5:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> SECCOMP_CACHE will only operate on syscalls that do not access
> any syscall arguments or instruction pointer. To facilitate
> this we need a static analyser to know whether a filter will
> return allow regardless of syscall arguments for a given
> architecture number / syscall number pair. This is implemented
> here with a pseudo-emulator, and stored in a per-filter bitmap.
>
> In order to build this bitmap at filter attach time, each filter is
> emulated for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
> the program returns allow, then we can be sure that the filter must
> return allow independent from syscall arguments.
>
> Nearly all seccomp filters are built from these cBPF instructions:
>
> BPF_LD  | BPF_W    | BPF_ABS
> BPF_JMP | BPF_JEQ  | BPF_K
> BPF_JMP | BPF_JGE  | BPF_K
> BPF_JMP | BPF_JGT  | BPF_K
> BPF_JMP | BPF_JSET | BPF_K
> BPF_JMP | BPF_JA
> BPF_RET | BPF_K
> BPF_ALU | BPF_AND  | BPF_K
>
> Each of these instructions are emulated. Any weirdness or loading
> from a syscall argument will cause the emulator to bail.
>
> The emulation is also halted if it reaches a return. In that case,
> if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
>
> Emulator structure and comments are from Kees [1] and Jann [2].
>
> Emulation is done at attach time. If a filter depends on more
> filters, and if the dependee does not guarantee to allow the
> syscall, then we skip the emulation of this syscall.
>
> [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
> [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
>
> Suggested-by: Jann Horn <jannh@google.com>
> Co-developed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>

Reviewed-by: Jann Horn <jannh@google.com>

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-11 15:47       ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
@ 2020-10-12  6:49         ` Jann Horn
  0 siblings, 0 replies; 135+ messages in thread
From: Jann Horn @ 2020-10-12  6:49 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Josep Torrellas, Kees Cook, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Sun, Oct 11, 2020 at 5:48 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> Currently the kernel does not provide infrastructure to translate
> architecture numbers to a human-readable name. Translating syscall
> numbers to syscall names is possible through the FTRACE_SYSCALL
> infrastructure, but it does not support compat syscalls.
>
> This will create a file for each PID as /proc/pid/seccomp_cache.
> The file will be empty when no seccomp filters are loaded, or be
> in the format of:
> <arch name> <decimal syscall number> <ALLOW | FILTER>
> where ALLOW means the cache is guaranteed to allow the syscall,
> and FILTER means the cache will pass the syscall to the BPF filter.
>
> For the docker default profile on x86_64 it looks like:
> x86_64 0 ALLOW
> x86_64 1 ALLOW
> x86_64 2 ALLOW
> x86_64 3 ALLOW
> [...]
> x86_64 132 ALLOW
> x86_64 133 ALLOW
> x86_64 134 FILTER
> x86_64 135 FILTER
> x86_64 136 FILTER
> x86_64 137 ALLOW
> x86_64 138 ALLOW
> x86_64 139 FILTER
> x86_64 140 ALLOW
> x86_64 141 ALLOW
> [...]
>
> This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
> of N because I think certain users of seccomp might not want the
> application to know which syscalls are definitely usable. For
> the same reason, it is also guarded by CAP_SYS_ADMIN.
>
> Suggested-by: Jann Horn <jannh@google.com>
> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>

Reviewed-by: Jann Horn <jannh@google.com>

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-10 13:26         ` YiFei Zhu
@ 2020-10-12 22:57           ` Kees Cook
  2020-10-13  0:31             ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-10-12 22:57 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Sat, Oct 10, 2020 at 08:26:16AM -0500, YiFei Zhu wrote:
> On Fri, Oct 9, 2020 at 6:14 PM Kees Cook <keescook@chromium.org> wrote:
> > HAVE_ARCH_SECCOMP_CACHE isn't used any more. I think this was left over
> > from before.
> 
> Oh, I was meant to add this to the dependencies of
> SECCOMP_CACHE_DEBUG. Is this something that would make sense?

I think it's fine to just have this "dangle" with a help text update of
"if seccomp action caching is supported by the architecture, provide the
/proc/$pid ..."

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-12 22:57           ` Kees Cook
@ 2020-10-13  0:31             ` YiFei Zhu
  2020-10-22 20:52               ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-13  0:31 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Mon, Oct 12, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote:
> I think it's fine to just have this "dangle" with a help text update of
> "if seccomp action caching is supported by the architecture, provide the
> /proc/$pid ..."

I think it would be weird if someone sees this help text and wonders...
"hmm does my architecture support seccomp action caching" and without
a clear pointer to how seccomp action cache works, goes and compiles
the kernel with this config option on for the purpose of knowing if
their arch supports it... Or, is it a common practice in the kernel to
leave dangling configs?

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-13  0:31             ` YiFei Zhu
@ 2020-10-22 20:52               ` YiFei Zhu
  2020-10-22 22:32                 ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-22 20:52 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Mon, Oct 12, 2020 at 7:31 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>
> On Mon, Oct 12, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote:
> > I think it's fine to just have this "dangle" with a help text update of
> > "if seccomp action caching is supported by the architecture, provide the
> > /proc/$pid ..."
>
> I think it would be weird if someone sees this help text and wonders...
> "hmm does my architecture support seccomp action caching" and without
> a clear pointer to how seccomp action cache works, goes and compiles
> the kernel with this config option on for the purpose of knowing if
> their arch supports it... Or, is it a common practice in the kernel to
> leave dangling configs?

Bump, in case this question was missed. I don't really want to miss
the 5.10 merge window...

YiFei Zhu

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-22 20:52               ` YiFei Zhu
@ 2020-10-22 22:32                 ` Kees Cook
  2020-10-22 23:40                   ` YiFei Zhu
  0 siblings, 1 reply; 135+ messages in thread
From: Kees Cook @ 2020-10-22 22:32 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Thu, Oct 22, 2020 at 03:52:20PM -0500, YiFei Zhu wrote:
> On Mon, Oct 12, 2020 at 7:31 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >
> > On Mon, Oct 12, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote:
> > > I think it's fine to just have this "dangle" with a help text update of
> > > "if seccomp action caching is supported by the architecture, provide the
> > > /proc/$pid ..."
> >
> > I think it would be weird if someone sees this help text and wonders...
> > "hmm does my architecture support seccomp action caching" and without
> > a clear pointer to how seccomp action cache works, goes and compiles
> > the kernel with this config option on for the purpose of knowing if
> > their arch supports it... Or, is it a common practice in the kernel to
> > leave dangling configs?
> 
> Bump, in case this question was missed.

I've been going back and forth on this, and I think what I've settled
on is I'd like to avoid new CONFIG dependencies just for this feature.
Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT
for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the
cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It
should be relatively simple to extract those details and make
SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches?

> I don't really want to miss the 5.10 merge window...

Sorry, the 5.10 merge window is already closed for stuff that hasn't
already been in -next. Most subsystem maintainers (myself included)
don't take new features into their trees between roughly N-rc6 and
(N+1)-rc1. My plan is to put this in my -next tree after -rc1 is released
(expected to be Oct 25th).

I'd still like to get more specific workload performance numbers too.
The microbenchmark is nice, but getting things like build times under
docker's default seccomp filter, etc would be lovely. I've almost gotten
there, but my benchmarks are still really noisy and CPU isolation
continues to frustrate me. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-22 22:32                 ` Kees Cook
@ 2020-10-22 23:40                   ` YiFei Zhu
  2020-10-24  2:51                     ` Kees Cook
  0 siblings, 1 reply; 135+ messages in thread
From: YiFei Zhu @ 2020-10-22 23:40 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Thu, Oct 22, 2020 at 5:32 PM Kees Cook <keescook@chromium.org> wrote:
> I've been going back and forth on this, and I think what I've settled
> on is I'd like to avoid new CONFIG dependencies just for this feature.
> Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT
> for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the
> cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It
> should be relatively simple to extract those details and make
> SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches?

Hmm. So I could enable the cache logic for every architecture (one
patch per arch) that does not have sparse syscall numbers, and then
add the proc reporting after the arch patches? I could do that.
I don't have test machines for anything other than x86_64 or ia32,
so the other arch patches will need a closer look by people more
familiar with those arches.

> I'd still like to get more specific workload performance numbers too.
> The microbenchmark is nice, but getting things like build times under
> docker's default seccomp filter, etc would be lovely. I've almost gotten
> there, but my benchmarks are still really noisy and CPU isolation
> continues to frustrate me. :)

Ok, let me know if I can help.

YiFei Zhu


* Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
  2020-10-22 23:40                   ` YiFei Zhu
@ 2020-10-24  2:51                     ` Kees Cook
  0 siblings, 0 replies; 135+ messages in thread
From: Kees Cook @ 2020-10-24  2:51 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Linux Containers, YiFei Zhu, bpf, kernel list, Aleksa Sarai,
	Andrea Arcangeli, Andy Lutomirski, David Laight,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jack Chen, Jann Horn, Josep Torrellas, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg,
	Will Drewry

On Thu, Oct 22, 2020 at 06:40:08PM -0500, YiFei Zhu wrote:
> On Thu, Oct 22, 2020 at 5:32 PM Kees Cook <keescook@chromium.org> wrote:
> > I've been going back and forth on this, and I think what I've settled
> > on is I'd like to avoid new CONFIG dependencies just for this feature.
> > Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT
> > for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the
> > cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It
> > should be relatively simple to extract those details and make
> > SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches?
> 
> Hmm. So I could enable the cache logic for every architecture (one
> patch per arch) that does not have sparse syscall numbers, and then
> add the proc reporting after the arch patches? I could do that.
> I don't have test machines for anything other than x86_64 or ia32,
> so the other arch patches will need a closer look by people more
> familiar with those arches.

Cool, yes please. It looks like MIPS will need to be skipped for now. I
would have the debug cache reporting patch then depend on
!CONFIG_HAVE_SPARSE_SYSCALL_NR.
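Such a dependency would live directly in the Kconfig entry for the debug
reporting option; a sketch of what that entry might look like (the option
name and help text here are illustrative, not quoted from a patch):

	config SECCOMP_CACHE_DEBUG
		bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
		depends on SECCOMP_FILTER && !HAVE_SPARSE_SYSCALL_NR
		help
		  This enables the /proc/pid/seccomp_cache interface to
		  monitor seccomp cache data. The file format is subject
		  to change.

Tying it to SECCOMP_FILTER plus the sparse-syscall exclusion avoids
introducing any new CONFIG symbol just for this feature.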

> > I'd still like to get more specific workload performance numbers too.
> > The microbenchmark is nice, but getting things like build times under
> > docker's default seccomp filter, etc would be lovely. I've almost gotten
> > there, but my benchmarks are still really noisy and CPU isolation
> > continues to frustrate me. :)
> 
> Ok, let me know if I can help.

Do you have a test environment where you can compare the before/after
of repeated kernel build times (or some other sufficiently
complex/interesting workload) under these conditions:

bare metal
docker w/ seccomp policy disabled
docker w/ default seccomp policy

This is what I've been trying to construct, but it's really noisy, so
I've been trying to pin CPUs and NUMA memory nodes, but it's not really
helping yet. :P
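The three-way comparison could be driven by a small timing harness; a
hedged sketch follows. The docker invocations in the comments are
placeholders (image name `build-img` is made up), and the demo run at the
end uses a trivial workload just to show the harness working:

```shell
#!/bin/bash
# Time one labeled workload run, reporting wall-clock milliseconds.
# For less noise, real runs would be wrapped in e.g.
# `taskset -c 0-7` and `numactl --membind=0` to pin CPUs/memory.
bench() {
  local label=$1; shift
  local start end
  start=$(date +%s%N)            # GNU date: nanoseconds since epoch
  "$@" >/dev/null 2>&1
  end=$(date +%s%N)
  printf '%s: %d ms\n' "$label" $(( (end - start) / 1000000 ))
}

# Illustrative invocations for the three cases (untested placeholders):
#   bench bare-metal        make -j"$(nproc)"
#   bench docker-unconfined docker run --rm --security-opt seccomp=unconfined build-img make -j"$(nproc)"
#   bench docker-default    docker run --rm build-img make -j"$(nproc)"

# Demonstrate the harness with a trivial stand-in workload:
out=$(bench demo sleep 0.2)
echo "$out"
```

Repeating each case many times and comparing medians, rather than single
runs, helps with the noise problem described above.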

-- 
Kees Cook


Thread overview: 135+ messages
2020-09-21  5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
2020-09-21  5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
2020-09-21 17:47   ` Jann Horn
2020-09-21 18:38     ` Jann Horn
2020-09-21 23:44     ` YiFei Zhu
2020-09-22  0:25       ` Jann Horn
2020-09-22  0:47         ` YiFei Zhu
2020-09-21  5:35 ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls YiFei Zhu
2020-09-21 18:08   ` Jann Horn
2020-09-21 22:50     ` YiFei Zhu
2020-09-21 22:57       ` Jann Horn
2020-09-21 23:08         ` YiFei Zhu
2020-09-25  0:01   ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook
2020-09-25  0:15     ` Jann Horn
2020-09-25  0:18       ` Al Viro
2020-09-25  0:24         ` Jann Horn
2020-09-25  1:27     ` YiFei Zhu
2020-09-25  3:09       ` Kees Cook
2020-09-25  3:28         ` YiFei Zhu
2020-09-25 16:39           ` YiFei Zhu
2020-09-21  5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon
2020-09-21  7:13   ` YiFei Zhu
2020-09-21  8:30 ` Christian Brauner
2020-09-21  8:44   ` YiFei Zhu
2020-09-21 13:51 ` Tycho Andersen
2020-09-21 15:27   ` YiFei Zhu
2020-09-21 16:39     ` Tycho Andersen
2020-09-21 22:57       ` YiFei Zhu
2020-09-21 19:16 ` Jann Horn
     [not found]   ` <OF8837FC1A.5C0D4D64-ON852585EA.006B677F-852585EA.006BA663@notes.na.collabserv.com>
2020-09-21 19:45     ` Jann Horn
2020-09-23 19:26 ` Kees Cook
2020-09-23 22:54   ` YiFei Zhu
2020-09-24  6:52     ` Kees Cook
2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
2020-09-24 12:06     ` YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-09-24 12:44   ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
2020-09-24 12:44     ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
2020-09-24 19:11       ` Kees Cook
2020-09-24 12:44     ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu
2020-09-24 13:47       ` David Laight
2020-09-24 14:16         ` YiFei Zhu
2020-09-24 14:20           ` David Laight
2020-09-24 14:37             ` YiFei Zhu
2020-09-24 16:02               ` YiFei Zhu
2020-09-24 12:44     ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
2020-09-24 23:25       ` Kees Cook
2020-09-25  3:04         ` YiFei Zhu
2020-09-25 16:45           ` YiFei Zhu
2020-09-25 19:42             ` Kees Cook
2020-09-25 19:51               ` Andy Lutomirski
2020-09-25 20:37                 ` Kees Cook
2020-09-25 21:07                   ` Andy Lutomirski
2020-09-25 23:49                     ` Kees Cook
2020-09-26  0:34                       ` Andy Lutomirski
2020-09-26  1:23                     ` YiFei Zhu
2020-09-26  2:47                       ` Andy Lutomirski
2020-09-26  4:35                         ` Kees Cook
2020-09-24 12:44     ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
2020-09-24 23:46       ` Kees Cook
2020-09-25  1:55         ` YiFei Zhu
2020-09-24 12:44     ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-09-24 23:47       ` Kees Cook
2020-09-25  1:35         ` YiFei Zhu
2020-09-24 12:44     ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-09-24 23:56       ` Kees Cook
2020-09-25  3:11         ` YiFei Zhu
2020-09-25  3:26           ` Kees Cook
2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
2020-09-30 15:19   ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu
2020-09-30 21:21     ` Kees Cook
2020-09-30 21:33       ` Jann Horn
2020-09-30 22:53         ` Kees Cook
2020-09-30 23:15           ` Jann Horn
2020-09-30 15:19   ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
2020-09-30 22:24     ` Jann Horn
2020-09-30 22:49       ` Kees Cook
2020-10-01 11:28       ` YiFei Zhu
2020-10-01 21:08         ` Jann Horn
2020-09-30 22:40     ` Kees Cook
2020-10-01 11:52       ` YiFei Zhu
2020-10-01 21:05         ` Kees Cook
2020-10-02 11:08           ` YiFei Zhu
2020-10-09  4:47     ` YiFei Zhu
2020-10-09  5:41       ` Kees Cook
2020-09-30 15:19   ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
2020-09-30 21:32     ` Kees Cook
2020-10-09  0:17       ` YiFei Zhu
2020-10-09  5:35         ` Kees Cook
2020-09-30 15:19   ` [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-09-30 15:19   ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-09-30 22:00     ` Jann Horn
2020-09-30 23:12       ` Kees Cook
2020-10-01 12:06       ` YiFei Zhu
2020-10-01 16:05         ` Jann Horn
2020-10-01 16:18           ` YiFei Zhu
2020-09-30 22:59     ` Kees Cook
2020-09-30 23:08       ` Jann Horn
2020-09-30 23:21         ` Kees Cook
2020-10-09 17:14   ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
2020-10-09 17:14     ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
2020-10-09 21:30       ` Jann Horn
2020-10-09 23:18       ` Kees Cook
2020-10-09 17:14     ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
2020-10-09 21:30       ` Jann Horn
2020-10-09 22:47         ` Kees Cook
2020-10-09 17:14     ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
2020-10-09 17:25       ` Andy Lutomirski
2020-10-09 18:32         ` YiFei Zhu
2020-10-09 20:59           ` Andy Lutomirski
2020-10-09 17:14     ` [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-10-09 17:14     ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-10-09 21:45       ` Jann Horn
2020-10-09 23:14       ` Kees Cook
2020-10-10 13:26         ` YiFei Zhu
2020-10-12 22:57           ` Kees Cook
2020-10-13  0:31             ` YiFei Zhu
2020-10-22 20:52               ` YiFei Zhu
2020-10-22 22:32                 ` Kees Cook
2020-10-22 23:40                   ` YiFei Zhu
2020-10-24  2:51                     ` Kees Cook
2020-10-11 15:47     ` [PATCH v5 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
2020-10-11 15:47       ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
2020-10-12  6:42         ` Jann Horn
2020-10-11 15:47       ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
2020-10-12  6:46         ` Jann Horn
2020-10-11 15:47       ` [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
2020-10-11 15:47       ` [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-10-11 15:47       ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-10-12  6:49         ` Jann Horn
