bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: YiFei Zhu <zhuyifei1999@gmail.com>
To: containers@lists.linux-foundation.org
Cc: YiFei Zhu <yifeifz2@illinois.edu>,
	bpf@vger.kernel.org, Andrea Arcangeli <aarcange@redhat.com>,
	Dimitrios Skarlatos <dskarlat@cs.cmu.edu>,
	Giuseppe Scrivano <gscrivan@redhat.com>,
	Hubertus Franke <frankeh@us.ibm.com>,
	Jack Chen <jianyan2@illinois.edu>,
	Josep Torrellas <torrella@illinois.edu>,
	Kees Cook <keescook@chromium.org>, Tianyin Xu <tyxu@illinois.edu>,
	Tobin Feldman-Fitzthum <tobin@ibm.com>,
	Valentin Rothberg <vrothber@redhat.com>
Subject: [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls
Date: Mon, 21 Sep 2020 00:35:18 -0500	[thread overview]
Message-ID: <b792335294ee5598d0fb42702a49becbce2f925f.1600661419.git.yifeifz2@illinois.edu> (raw)
In-Reply-To: <cover.1600661418.git.yifeifz2@illinois.edu>

From: YiFei Zhu <yifeifz2@illinois.edu>

The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.

We do this by creating a per-task bitmap of permitted syscalls.
If seccomp filter is invoked we check if it is cached and if so
directly return allow. Else we call into the cBPF filter, and if
the result is an allow then we cache the results.

The cache is per-task to minimize thread-synchronization issues in
the hot path of cache lookup, and to avoid different architecture
numbers sharing the same cache.

To account for one thread changing the filter for another thread of
the same process, the per-task struct also contains a pointer to
the filter the cache is built on. When the cache lookup uses a
different filter then the last lookup, the per-task cache bitmap is
cleared.

Architecture number changes also invokes a clear of the per-task
cache, since it should be very unlikely for a given thread to change
its architecture.

Benchmark results, on qemu-kvm x86_64 VM, on Intel(R) Core(TM)
i5-8250U CPU @ 1.60GHz, with seccomp_benchmark:

With SECCOMP_CACHE_NONE:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Calibrating sample size for 15 seconds worth of syscalls ...
  Benchmarking 23486415 syscalls...
  16.079642020 - 1.013345439 = 15066296581 (15.1s)
  getpid native: 641 ns
  32.080237410 - 16.080763500 = 15999473910 (16.0s)
  getpid RET_ALLOW 1 filter: 681 ns
  48.609461618 - 32.081296173 = 16528165445 (16.5s)
  getpid RET_ALLOW 2 filters: 703 ns
  Estimated total seccomp overhead for 1 filter: 40 ns
  Estimated total seccomp overhead for 2 filters: 62 ns
  Estimated seccomp per-filter overhead: 22 ns
  Estimated seccomp entry overhead: 18 ns

With SECCOMP_CACHE_NR_ONLY:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Calibrating sample size for 15 seconds worth of syscalls ...
  Benchmarking 23486415 syscalls...
  16.059512499 - 1.014108434 = 15045404065 (15.0s)
  getpid native: 640 ns
  31.651075934 - 16.060637323 = 15590438611 (15.6s)
  getpid RET_ALLOW 1 filter: 663 ns
  47.367316169 - 31.652302661 = 15715013508 (15.7s)
  getpid RET_ALLOW 2 filters: 669 ns
  Estimated total seccomp overhead for 1 filter: 23 ns
  Estimated total seccomp overhead for 2 filters: 29 ns
  Estimated seccomp per-filter overhead: 6 ns
  Estimated seccomp entry overhead: 17 ns

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 include/linux/seccomp.h | 22 ++++++++++++
 kernel/seccomp.c        | 77 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..08ec8b90c99d 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -21,6 +21,27 @@
 #include <asm/seccomp.h>
 
 struct seccomp_filter;
+
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * struct seccomp_cache_task_data - container for seccomp cache's per-task data
+ *
+ * @syscall_ok: A bitmap where each bit represents whether the syscall is cached
+ *		and that the filter allowed it.
+ * @last_filter: If the next cache lookup uses a different filter, the lookup
+ *		 will clear cache.
+ * @last_arch: If the next cache lookup uses a different arch number, the
+ *	       lookup will clear cache.
+ */
+struct seccomp_cache_task_data {
+	DECLARE_BITMAP(syscall_ok, NR_syscalls);
+	const struct seccomp_filter *last_filter;
+	u32 last_arch;
+};
+#else
+struct seccomp_cache_task_data { };
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * struct seccomp - the state of a seccomp'ed process
  *
@@ -36,6 +57,7 @@ struct seccomp {
 	int mode;
 	atomic_t filter_count;
 	struct seccomp_filter *filter;
+	struct seccomp_cache_task_data cache;
 };
 
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index d8c30901face..7096f8c86f71 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -162,6 +162,17 @@ static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter)
 {
 	return 0;
 }
+
+static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter,
+				       const struct seccomp_data *sd)
+{
+	return 0;
+}
+
+static inline void seccomp_cache_insert(const struct seccomp_filter *sfilter,
+					const struct seccomp_data *sd)
+{
+}
 #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
 
 /**
@@ -316,6 +327,59 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 	return 0;
 }
 
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * seccomp_cache_check - lookup seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to lookup the cache with
+ *
+ * Returns true if the seccomp_data is cached and allowed.
+ */
+static bool seccomp_cache_check(const struct seccomp_filter *sfilter,
+				const struct seccomp_data *sd)
+{
+	struct seccomp_cache_task_data *thread_data;
+	int syscall_nr = sd->nr;
+
+	if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
+		return false;
+
+	thread_data = &current->seccomp.cache;
+	if (unlikely(thread_data->last_filter != sfilter ||
+		     thread_data->last_arch != sd->arch)) {
+		thread_data->last_filter = sfilter;
+		thread_data->last_arch = sd->arch;
+
+		bitmap_zero(thread_data->syscall_ok, NR_syscalls);
+		return false;
+	}
+
+	return test_bit(syscall_nr, thread_data->syscall_ok);
+}
+
+/**
+ * seccomp_cache_insert - insert into seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to insert into the cache
+ */
+static void seccomp_cache_insert(const struct seccomp_filter *sfilter,
+				 const struct seccomp_data *sd)
+{
+	struct seccomp_cache_task_data *thread_data;
+	int syscall_nr = sd->nr;
+
+	if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
+		return;
+
+	thread_data = &current->seccomp.cache;
+
+	if (!test_bit(syscall_nr, sfilter->cache.syscall_ok))
+		return;
+
+	set_bit(syscall_nr, thread_data->syscall_ok);
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
 /**
  * seccomp_run_filters - evaluates all seccomp filters against @sd
  * @sd: optional seccomp data to be passed to filters
@@ -331,13 +395,18 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 {
 	u32 ret = SECCOMP_RET_ALLOW;
 	/* Make sure cross-thread synced filter points somewhere sane. */
-	struct seccomp_filter *f =
-			READ_ONCE(current->seccomp.filter);
+	struct seccomp_filter *f, *f_head;
+
+	f = READ_ONCE(current->seccomp.filter);
+	f_head = f;
 
 	/* Ensure unexpected behavior doesn't result in failing open. */
 	if (WARN_ON(f == NULL))
 		return SECCOMP_RET_KILL_PROCESS;
 
+	if (seccomp_cache_check(f_head, sd))
+		return SECCOMP_RET_ALLOW;
+
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
@@ -350,6 +419,10 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 			*match = f;
 		}
 	}
+
+	if (ret == SECCOMP_RET_ALLOW)
+		seccomp_cache_insert(f_head, sd);
+
 	return ret;
 }
 #endif /* CONFIG_SECCOMP_FILTER */
-- 
2.28.0


  parent reply	other threads:[~2020-09-21  5:35 UTC|newest]

Thread overview: 149+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-21  5:35 [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
2020-09-21  5:35 ` [RFC PATCH seccomp 1/2] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
2020-09-21 17:47   ` Jann Horn
2020-09-21 18:38     ` Jann Horn
2020-09-21 23:44     ` YiFei Zhu
2020-09-22  0:25       ` Jann Horn
2020-09-22  0:47         ` YiFei Zhu
2020-09-21  5:35 ` YiFei Zhu [this message]
2020-09-21 18:08   ` [RFC PATCH seccomp 2/2] seccomp/cache: Cache filter results that allow syscalls Jann Horn
2020-09-21 22:50     ` YiFei Zhu
2020-09-21 22:57       ` Jann Horn
2020-09-21 23:08         ` YiFei Zhu
2020-09-25  0:01   ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array Kees Cook
2020-09-25  0:15     ` Jann Horn
2020-09-25  0:18       ` Al Viro
2020-09-25  0:24         ` Jann Horn
2020-09-25  1:27     ` YiFei Zhu
2020-09-25  3:09       ` Kees Cook
2020-09-25  3:28         ` YiFei Zhu
2020-09-25 16:39           ` YiFei Zhu
2020-09-21  5:48 ` [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls Sargun Dhillon
2020-09-21  7:13   ` YiFei Zhu
2020-09-21  8:30 ` Christian Brauner
2020-09-21  8:44   ` YiFei Zhu
2020-09-21 13:51 ` Tycho Andersen
2020-09-21 15:27   ` YiFei Zhu
2020-09-21 16:39     ` Tycho Andersen
2020-09-21 22:57       ` YiFei Zhu
2020-09-21 19:16 ` Jann Horn
     [not found]   ` <OF8837FC1A.5C0D4D64-ON852585EA.006B677F-852585EA.006BA663@notes.na.collabserv.com>
2020-09-21 19:45     ` Jann Horn
2020-09-23 19:26 ` Kees Cook
2020-09-23 22:54   ` YiFei Zhu
2020-09-24  6:52     ` Kees Cook
2020-09-24 12:06 ` [PATCH seccomp 0/6] " YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
2020-09-24 12:06     ` YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-09-24 12:06   ` [PATCH seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-09-24 12:44   ` [PATCH v2 seccomp 0/6] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls YiFei Zhu
2020-09-24 12:44     ` [PATCH v2 seccomp 1/6] seccomp: Move config option SECCOMP to arch/Kconfig YiFei Zhu
2020-09-24 19:11       ` Kees Cook
2020-10-27  9:52       ` Geert Uytterhoeven
2020-10-27 19:08         ` YiFei Zhu
2020-10-28  0:06         ` Kees Cook
2020-10-28  8:18           ` Geert Uytterhoeven
2020-10-28  9:34             ` Jann Horn
2020-09-24 12:44     ` [PATCH v2 seccomp 2/6] asm/syscall.h: Add syscall_arches[] array YiFei Zhu
2020-09-24 13:47       ` David Laight
2020-09-24 14:16         ` YiFei Zhu
2020-09-24 14:20           ` David Laight
2020-09-24 14:37             ` YiFei Zhu
2020-09-24 16:02               ` YiFei Zhu
2020-09-24 12:44     ` [PATCH v2 seccomp 3/6] seccomp/cache: Add "emulator" to check if filter is arg-dependent YiFei Zhu
2020-09-24 23:25       ` Kees Cook
2020-09-25  3:04         ` YiFei Zhu
2020-09-25 16:45           ` YiFei Zhu
2020-09-25 19:42             ` Kees Cook
2020-09-25 19:51               ` Andy Lutomirski
2020-09-25 20:37                 ` Kees Cook
2020-09-25 21:07                   ` Andy Lutomirski
2020-09-25 23:49                     ` Kees Cook
2020-09-26  0:34                       ` Andy Lutomirski
2020-09-26  1:23                     ` YiFei Zhu
2020-09-26  2:47                       ` Andy Lutomirski
2020-09-26  4:35                         ` Kees Cook
2020-09-24 12:44     ` [PATCH v2 seccomp 4/6] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
2020-09-24 23:46       ` Kees Cook
2020-09-25  1:55         ` YiFei Zhu
2020-09-24 12:44     ` [PATCH v2 seccomp 5/6] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-09-24 23:47       ` Kees Cook
2020-09-25  1:35         ` YiFei Zhu
2020-09-24 12:44     ` [PATCH v2 seccomp 6/6] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-09-24 23:56       ` Kees Cook
2020-09-25  3:11         ` YiFei Zhu
2020-09-25  3:26           ` Kees Cook
2020-09-30 15:19 ` [PATCH v3 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
2020-09-30 15:19   ` [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking YiFei Zhu
2020-09-30 21:21     ` Kees Cook
2020-09-30 21:33       ` Jann Horn
2020-09-30 22:53         ` Kees Cook
2020-09-30 23:15           ` Jann Horn
2020-09-30 15:19   ` [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
2020-09-30 22:24     ` Jann Horn
2020-09-30 22:49       ` Kees Cook
2020-10-01 11:28       ` YiFei Zhu
2020-10-01 21:08         ` Jann Horn
2020-09-30 22:40     ` Kees Cook
2020-10-01 11:52       ` YiFei Zhu
2020-10-01 21:05         ` Kees Cook
2020-10-02 11:08           ` YiFei Zhu
2020-10-09  4:47     ` YiFei Zhu
2020-10-09  5:41       ` Kees Cook
2020-09-30 15:19   ` [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path YiFei Zhu
2020-09-30 21:32     ` Kees Cook
2020-10-09  0:17       ` YiFei Zhu
2020-10-09  5:35         ` Kees Cook
2020-09-30 15:19   ` [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-09-30 15:19   ` [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-09-30 22:00     ` Jann Horn
2020-09-30 23:12       ` Kees Cook
2020-10-01 12:06       ` YiFei Zhu
2020-10-01 16:05         ` Jann Horn
2020-10-01 16:18           ` YiFei Zhu
2020-09-30 22:59     ` Kees Cook
2020-09-30 23:08       ` Jann Horn
2020-09-30 23:21         ` Kees Cook
2020-10-09 17:14   ` [PATCH v4 seccomp 0/5] seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
2020-10-09 17:14     ` [PATCH v4 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
2020-10-09 21:30       ` Jann Horn
2020-10-09 23:18       ` Kees Cook
2020-10-09 17:14     ` [PATCH v4 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
2020-10-09 21:30       ` Jann Horn
2020-10-09 22:47         ` Kees Cook
2020-10-09 17:14     ` [PATCH v4 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
2020-10-09 17:25       ` Andy Lutomirski
2020-10-09 18:32         ` YiFei Zhu
2020-10-09 20:59           ` Andy Lutomirski
2020-10-09 17:14     ` [PATCH v4 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-10-09 17:14     ` [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-10-09 21:45       ` Jann Horn
2020-10-09 23:14       ` Kees Cook
2020-10-10 13:26         ` YiFei Zhu
2020-10-12 22:57           ` Kees Cook
2020-10-13  0:31             ` YiFei Zhu
2020-10-22 20:52               ` YiFei Zhu
2020-10-22 22:32                 ` Kees Cook
2020-10-22 23:40                   ` YiFei Zhu
2020-10-24  2:51                     ` Kees Cook
2020-10-30 12:18                       ` YiFei Zhu
2020-11-03 13:00                         ` YiFei Zhu
2020-11-04  0:29                           ` Kees Cook
2020-11-04 11:40                             ` YiFei Zhu
2020-11-04 18:57                               ` Kees Cook
2020-10-11 15:47     ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results YiFei Zhu
2020-10-11 15:47       ` [PATCH v5 seccomp 1/5] seccomp/cache: Lookup syscall allowlist bitmap for fast path YiFei Zhu
2020-10-12  6:42         ` Jann Horn
2020-10-11 15:47       ` [PATCH v5 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow YiFei Zhu
2020-10-12  6:46         ` Jann Horn
2020-10-11 15:47       ` [PATCH v5 seccomp 3/5] x86: Enable seccomp architecture tracking YiFei Zhu
2020-10-11 15:47       ` [PATCH v5 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead YiFei Zhu
2020-10-11 15:47       ` [PATCH v5 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache YiFei Zhu
2020-10-12  6:49         ` Jann Horn
2020-12-17 12:14         ` Geert Uytterhoeven
2020-12-17 18:34           ` YiFei Zhu
2020-12-18 12:35             ` Geert Uytterhoeven
2020-10-27 19:14       ` [PATCH v5 seccomp 0/5]seccomp: Add bitmap cache of constant allow filter results Kees Cook

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b792335294ee5598d0fb42702a49becbce2f925f.1600661419.git.yifeifz2@illinois.edu \
    --to=zhuyifei1999@gmail.com \
    --cc=aarcange@redhat.com \
    --cc=bpf@vger.kernel.org \
    --cc=containers@lists.linux-foundation.org \
    --cc=dskarlat@cs.cmu.edu \
    --cc=frankeh@us.ibm.com \
    --cc=gscrivan@redhat.com \
    --cc=jianyan2@illinois.edu \
    --cc=keescook@chromium.org \
    --cc=tobin@ibm.com \
    --cc=torrella@illinois.edu \
    --cc=tyxu@illinois.edu \
    --cc=vrothber@redhat.com \
    --cc=yifeifz2@illinois.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).