* [PATCH v6 bpf-next 0/3] Introduce CAP_BPF
@ 2020-05-13  3:19 Alexei Starovoitov
From: Alexei Starovoitov @ 2020-05-13  3:19 UTC
  To: davem
  Cc: daniel, netdev, bpf, kernel-team, linux-security-module, acme,
	jamorris, jannh, kpsingh

From: Alexei Starovoitov <ast@kernel.org>

v5->v6:
- split allow_ptr_leaks into four flags.
- retain bpf_jit_limit under cap_sys_admin.
- fixed a few other issues spotted by Daniel.

v4->v5:

Split BPF operations that are allowed under CAP_SYS_ADMIN into combination of
CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN and keep some of them under CAP_SYS_ADMIN.

The user process has to have
- CAP_BPF to create maps and do other sys_bpf() commands,
- CAP_BPF and CAP_PERFMON to load tracing programs,
- CAP_BPF and CAP_NET_ADMIN to load networking programs
(or CAP_SYS_ADMIN for backward compatibility).
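For illustration, a minimal userspace sketch of dropping to such a
capability set (not part of the series; it assumes a libcap new enough to
define CAP_BPF, otherwise the raw bit twiddling from patch 3 applies;
compile with -lcap):

#include <sys/capability.h>

/* keep only CAP_BPF + CAP_NET_ADMIN in the effective set */
static int drop_to_bpf_net_admin(void)
{
	const cap_value_t keep[] = { CAP_BPF, CAP_NET_ADMIN };
	cap_t caps = cap_get_proc();
	int err = -1;

	if (!caps)
		return -1;
	if (!cap_clear_flag(caps, CAP_EFFECTIVE) &&
	    !cap_set_flag(caps, CAP_EFFECTIVE, 2, keep, CAP_SET) &&
	    !cap_set_proc(caps))
		err = 0;
	cap_free(caps);
	return err;
}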

CAP_BPF solves three main goals:
1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
   More on this below. This is the major difference vs the v4 set from Sep 2019.
2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
   prevents pointer leaks and arbitrary kernel memory access.
3. enables fuzzers to exercise all of the verifier logic, eventually finding bugs
   and making the BPF infra more secure. Currently fuzzers run unprivileged;
   they will be able to run with CAP_BPF.

The patchset is a long overdue follow-up from the last Plumbers conference.
Compared to what was discussed at LPC, the CAP* checks at attach time are gone.
For tracing progs the CAP_SYS_ADMIN check was done at load time only; there was
no check at attach time. For networking and cgroup progs CAP_SYS_ADMIN was
required at load time and CAP_NET_ADMIN at attach time, but there are several
ways to bypass CAP_NET_ADMIN:
- if a networking prog uses tail_call, writing an FD into a prog_array will
  effectively attach it, yet bpf_map_update_elem is an unprivileged operation.
- an freplace prog loaded with CAP_SYS_ADMIN can replace a networking prog.

Consolidating all CAP checks at load time makes the security model similar to
the open() syscall: once the user has an FD, it can do everything with it.
read/write/poll don't check permissions. In the same way, once the bpf_prog_load
command returns an FD, the user can do everything with it (including attaching,
detaching, and bpf_test_run).
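As a hedged illustration of this model (using the libbpf API of this era;
error handling trimmed): the capability check happens once at load time,
and the attach below is a plain setsockopt() that checks no capabilities,
only FD ownership:

#include <linux/bpf.h>
#include <sys/socket.h>
#include <bpf/bpf.h>	/* bpf_load_program() from libbpf of this era */

static int attach_trivial_filter(int sock)
{
	struct bpf_insn insns[] = {
		{ .code = BPF_ALU64 | BPF_MOV | BPF_K,
		  .dst_reg = BPF_REG_0, .imm = 0 },	/* r0 = 0 (drop) */
		{ .code = BPF_JMP | BPF_EXIT },		/* exit */
	};
	int prog_fd = bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, insns, 2,
				       "GPL", 0, NULL, 0);

	if (prog_fd < 0)
		return -1;
	/* no capability check here; owning both FDs is sufficient */
	return setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF,
			  &prog_fd, sizeof(prog_fd));
}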

The important design decision is to allow the ID->FD transition for
CAP_SYS_ADMIN only. This means that user processes can run
with CAP_BPF and CAP_NET_ADMIN and they will not be able to affect each
other unless they pass FDs via scm_rights or via pinning in bpffs.
ID->FD is a mechanism for human override and introspection.
An admin can do 'sudo bpftool prog ...'. It's possible to enforce via LSM that
only the bpftool binary does the bpf syscall with CAP_SYS_ADMIN while the rest
of the user space processes do the bpf syscall with CAP_BPF, isolating the bpf
objects (progs, maps, links) owned by such processes from each other.
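For example, deliberate sharing between two CAP_BPF processes could look
like the following sketch (the pin path is illustrative; libbpf's
bpf_obj_pin()/bpf_obj_get() wrap the BPF_OBJ_PIN/BPF_OBJ_GET commands):

#include <bpf/bpf.h>

/* process A: publish an object it owns */
static int publish(int prog_fd)
{
	return bpf_obj_pin(prog_fd, "/sys/fs/bpf/shared_prog");
}

/* process B: pick it up, subject to bpffs file permissions */
static int consume(void)
{
	return bpf_obj_get("/sys/fs/bpf/shared_prog");
}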

Another significant change from LPC is that the verifier checks are split into
four flags. The allow_ptr_leaks flag allows pointer manipulations. The
bpf_capable flag enables all modern verifier features like bpf-to-bpf calls,
BTF, bounded loops, dead code elimination, etc. All the goodness. The
bypass_spec_v1 flag enables indirect stack access from bpf programs and
disables speculative analysis and bpf array mitigations. The bypass_spec_v4
flag disables store sanitation. That allows networking progs with CAP_BPF +
CAP_NET_ADMIN to enjoy modern verifier features while being more secure.

Some networking progs may need CAP_BPF + CAP_NET_ADMIN + CAP_PERFMON,
since subtracting pointers (like skb->data_end - skb->data) is a pointer leak,
but the verifier may get smarter in the future.
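A minimal sketch of the distinction (BPF C, libbpf conventions of the time;
the section name is illustrative): the pointer comparison form verifies
under CAP_BPF + CAP_NET_ADMIN, while the commented-out subtraction is the
pattern that currently also requires CAP_PERFMON:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

SEC("classifier")
int pkt_len_check(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;

	if (data + 14 > data_end)	/* pointer comparison: allowed */
		return 0;
	/* __u64 len = data_end - data;  pointer subtraction turns a
	 * pointer into a scalar and needs perfmon_capable() today
	 */
	return 1;
}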

Please see patches for more details.

Alexei Starovoitov (3):
  bpf, capability: Introduce CAP_BPF
  bpf: implement CAP_BPF
  selftests/bpf: use CAP_BPF and CAP_PERFMON in tests

 drivers/media/rc/bpf-lirc.c                   |  2 +-
 include/linux/bpf.h                           | 18 +++-
 include/linux/bpf_verifier.h                  |  3 +
 include/linux/capability.h                    |  5 ++
 include/uapi/linux/capability.h               | 34 +++++++-
 kernel/bpf/arraymap.c                         | 10 +--
 kernel/bpf/bpf_struct_ops.c                   |  2 +-
 kernel/bpf/core.c                             |  2 +-
 kernel/bpf/cpumap.c                           |  2 +-
 kernel/bpf/hashtab.c                          |  4 +-
 kernel/bpf/helpers.c                          |  4 +-
 kernel/bpf/lpm_trie.c                         |  2 +-
 kernel/bpf/map_in_map.c                       |  2 +-
 kernel/bpf/queue_stack_maps.c                 |  2 +-
 kernel/bpf/reuseport_array.c                  |  2 +-
 kernel/bpf/stackmap.c                         |  2 +-
 kernel/bpf/syscall.c                          | 87 ++++++++++++++-----
 kernel/bpf/verifier.c                         | 37 ++++----
 kernel/trace/bpf_trace.c                      |  3 +
 net/core/bpf_sk_storage.c                     |  4 +-
 net/core/filter.c                             |  4 +-
 security/selinux/include/classmap.h           |  4 +-
 tools/testing/selftests/bpf/test_verifier.c   | 44 ++++++++--
 tools/testing/selftests/bpf/verifier/calls.c  | 16 ++--
 .../selftests/bpf/verifier/dead_code.c        | 10 +--
 25 files changed, 221 insertions(+), 84 deletions(-)

-- 
2.23.0



* [PATCH v6 bpf-next 1/3] bpf, capability: Introduce CAP_BPF
From: Alexei Starovoitov @ 2020-05-13  3:19 UTC
  To: davem
  Cc: daniel, netdev, bpf, kernel-team, linux-security-module, acme,
	jamorris, jannh, kpsingh

From: Alexei Starovoitov <ast@kernel.org>

Split BPF operations that are allowed under CAP_SYS_ADMIN into
combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN.
For backward compatibility include them in CAP_SYS_ADMIN as well.

The end result provides a simple safety model for applications that use BPF:
- to load tracing program types
  BPF_PROG_TYPE_{KPROBE, TRACEPOINT, PERF_EVENT, RAW_TRACEPOINT, etc}
  use CAP_BPF and CAP_PERFMON
- to load networking program types
  BPF_PROG_TYPE_{SCHED_CLS, XDP, CGROUP_SKB, SK_SKB, etc}
  use CAP_BPF and CAP_NET_ADMIN

There are a few exceptions to this rule:
- bpf_trace_printk() is allowed in networking programs, but since it uses the
  tracing mechanism, a networking program using this helper additionally
  needs CAP_PERFMON.
- BPF_F_ZERO_SEED flag for hash/lru map is allowed under CAP_SYS_ADMIN only
  to discourage production use.
- BPF HW offload is allowed under CAP_SYS_ADMIN.
- bpf_probe_write_user() is allowed under CAP_SYS_ADMIN only.

CAPs are not checked at attach/detach time with two exceptions:
- loading BPF_PROG_TYPE_CGROUP_SKB is allowed for unprivileged users,
  hence CAP_NET_ADMIN is required at attach time.
- flow_dissector detach doesn't check the prog FD,
  hence CAP_NET_ADMIN is required at detach time.

CAP_SYS_ADMIN is required to iterate BPF objects (progs, maps, links) via the
GET_NEXT_ID command and to convert their IDs to file descriptors via the
GET_FD_BY_ID command.
This restriction guarantees that multiple tasks with CAP_BPF are not able to
affect each other, which leads to clean isolation of tasks. For example:
task A with CAP_BPF and CAP_NET_ADMIN loads and attaches a firewall via bpf_link.
Task B with the same capabilities cannot detach that firewall unless
task A explicitly passed the link FD to task B via scm_rights or bpffs.
CAP_SYS_ADMIN can still detach/unload everything.
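Such an explicit hand-off could look like this sketch (standard SCM_RIGHTS
cmsg plumbing over a unix socket; illustrative only):

#include <string.h>
#include <sys/socket.h>

static int send_link_fd(int unix_sock, int link_fd)
{
	union {	/* aligned cmsg buffer, per cmsg(3) */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &link_fd, sizeof(int));
	return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
}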

Two networking user apps with CAP_SYS_ADMIN and CAP_NET_ADMIN can
accidentally mess with each other's programs and maps.
Two networking user apps with CAP_NET_ADMIN and CAP_BPF cannot affect each other.

CAP_NET_ADMIN + CAP_BPF allows networking programs to access only packet data.
Such networking progs cannot access arbitrary kernel memory or leak pointers.

bpftool, bpftrace, and bcc tool binaries should NOT be installed with
CAP_BPF and CAP_PERFMON file capabilities, since unprivileged users would then
be able to read kernel secrets. But users granted these two capabilities will
be able to use these tracing tools.

CAP_PERFMON is the least secure of the three, since it allows kprobes and
kernel memory access.
CAP_NET_ADMIN can stop network traffic via iproute2.
CAP_BPF is the safest from a security point of view and harmless on its own.

Having CAP_BPF and/or CAP_NET_ADMIN is not enough to write into an arbitrary
map, even if that map is used by a firewall-like bpf prog.
CAP_BPF allows many bpf prog_load commands in parallel. The verifier
may consume a large amount of memory and significantly slow down the system.

Existing unprivileged BPF operations are not affected.
In particular, unprivileged users are allowed to load socket_filter and cg_skb
program types and to create array, hash, prog_array, and map-in-map map types.
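For instance, this sketch (era-appropriate libbpf API) keeps working with no
capabilities at all, as long as sysctl kernel.unprivileged_bpf_disabled is 0:

#include <bpf/bpf.h>

static int create_unpriv_array(void)
{
	/* array of 256 u64 values; creatable without any capability */
	return bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(__u32),
			      sizeof(__u64), 256, 0);
}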

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/capability.h          |  5 +++++
 include/uapi/linux/capability.h     | 34 ++++++++++++++++++++++++++++-
 security/selinux/include/classmap.h |  4 ++--
 3 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 027d7e4a853b..b4345b38a6be 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -256,6 +256,11 @@ static inline bool perfmon_capable(void)
 	return capable(CAP_PERFMON) || capable(CAP_SYS_ADMIN);
 }
 
+static inline bool bpf_capable(void)
+{
+	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
+}
+
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index e58c9636741b..c7372180a0a9 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -274,6 +274,7 @@ struct vfs_ns_cap_data {
    arbitrary SCSI commands */
 /* Allow setting encryption key on loopback filesystem */
 /* Allow setting zone reclaim policy */
+/* Allow everything under CAP_BPF and CAP_PERFMON for backward compatibility */
 
 #define CAP_SYS_ADMIN        21
 
@@ -374,7 +375,38 @@ struct vfs_ns_cap_data {
 
 #define CAP_PERFMON		38
 
-#define CAP_LAST_CAP         CAP_PERFMON
+/*
+ * CAP_BPF allows the following BPF operations:
+ * - Creating all types of BPF maps
+ * - Advanced verifier features
+ *   - Indirect variable access
+ *   - Bounded loops
+ *   - BPF to BPF function calls
+ *   - Scalar precision tracking
+ *   - Larger complexity limits
+ *   - Dead code elimination
+ *   - And potentially other features
+ * - Loading BPF Type Format (BTF) data
+ * - Retrieve xlated and JITed code of BPF programs
+ * - Use bpf_spin_lock() helper
+ *
+ * CAP_PERFMON relaxes the verifier checks further:
 + * - BPF progs can use pointer-to-integer conversions
+ * - speculation attack hardening measures are bypassed
+ * - bpf_probe_read to read arbitrary kernel memory is allowed
+ * - bpf_trace_printk to print kernel memory is allowed
+ *
+ * CAP_SYS_ADMIN is required to use bpf_probe_write_user.
+ *
+ * CAP_SYS_ADMIN is required to iterate system wide loaded
+ * programs, maps, links, BTFs and convert their IDs to file descriptors.
+ *
+ * CAP_PERFMON and CAP_BPF are required to load tracing programs.
+ * CAP_NET_ADMIN and CAP_BPF are required to load networking programs.
+ */
+#define CAP_BPF			39
+
+#define CAP_LAST_CAP         CAP_BPF
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index d233ab3f1533..98e1513b608a 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,9 @@
 	    "audit_control", "setfcap"
 
 #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
-		"wake_alarm", "block_suspend", "audit_read", "perfmon"
+		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
 
-#if CAP_LAST_CAP > CAP_PERFMON
+#if CAP_LAST_CAP > CAP_BPF
 #error New capability defined, please update COMMON_CAP2_PERMS.
 #endif
 
-- 
2.23.0



* [PATCH v6 bpf-next 2/3] bpf: implement CAP_BPF
From: Alexei Starovoitov @ 2020-05-13  3:19 UTC
  To: davem
  Cc: daniel, netdev, bpf, kernel-team, linux-security-module, acme,
	jamorris, jannh, kpsingh

From: Alexei Starovoitov <ast@kernel.org>

Implement permissions as stated in uapi/linux/capability.h.
In order to do that, the verifier's allow_ptr_leaks flag is split
into four flags, which are set as:
  env->allow_ptr_leaks = bpf_allow_ptr_leaks();
  env->bypass_spec_v1 = bpf_bypass_spec_v1();
  env->bypass_spec_v4 = bpf_bypass_spec_v4();
  env->bpf_capable = bpf_capable();

The first three are currently equivalent to perfmon_capable(), since leaking
kernel pointers and reading kernel memory via side channel attacks is roughly
equivalent to reading kernel memory with cap_perfmon.

'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
other verifier features. 'allow_ptr_leaks' enables ptr leaks, ptr conversions,
subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
verifier, run time mitigations in bpf array, and enables indirect variable
access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
by the verifier.

That means that a networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
will have speculative checks done by the verifier and other spectre mitigations
applied. Such a networking BPF program will not be able to leak kernel pointers
or access arbitrary kernel memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 drivers/media/rc/bpf-lirc.c   |  2 +-
 include/linux/bpf.h           | 18 +++++++-
 include/linux/bpf_verifier.h  |  3 ++
 kernel/bpf/arraymap.c         | 10 ++--
 kernel/bpf/bpf_struct_ops.c   |  2 +-
 kernel/bpf/core.c             |  2 +-
 kernel/bpf/cpumap.c           |  2 +-
 kernel/bpf/hashtab.c          |  4 +-
 kernel/bpf/helpers.c          |  4 +-
 kernel/bpf/lpm_trie.c         |  2 +-
 kernel/bpf/map_in_map.c       |  2 +-
 kernel/bpf/queue_stack_maps.c |  2 +-
 kernel/bpf/reuseport_array.c  |  2 +-
 kernel/bpf/stackmap.c         |  2 +-
 kernel/bpf/syscall.c          | 87 ++++++++++++++++++++++++++---------
 kernel/bpf/verifier.c         | 37 ++++++++-------
 kernel/trace/bpf_trace.c      |  3 ++
 net/core/bpf_sk_storage.c     |  4 +-
 net/core/filter.c             |  4 +-
 19 files changed, 132 insertions(+), 60 deletions(-)

diff --git a/drivers/media/rc/bpf-lirc.c b/drivers/media/rc/bpf-lirc.c
index 069c42f22a8c..5bb144435c16 100644
--- a/drivers/media/rc/bpf-lirc.c
+++ b/drivers/media/rc/bpf-lirc.c
@@ -110,7 +110,7 @@ lirc_mode2_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_get_prandom_u32:
 		return &bpf_get_prandom_u32_proto;
 	case BPF_FUNC_trace_printk:
-		if (capable(CAP_SYS_ADMIN))
+		if (perfmon_capable())
 			return bpf_get_trace_printk_proto();
 		/* fall through */
 	default:
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cf4b6e44f2bc..7318b11b0298 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -19,6 +19,7 @@
 #include <linux/mutex.h>
 #include <linux/module.h>
 #include <linux/kallsyms.h>
+#include <linux/capability.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -119,7 +120,7 @@ struct bpf_map {
 	struct bpf_map_memory memory;
 	char name[BPF_OBJ_NAME_LEN];
 	u32 btf_vmlinux_value_type_id;
-	bool unpriv_array;
+	bool bypass_spec_v1;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
 	/* 22 bytes hole */
 
@@ -1088,6 +1089,21 @@ struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
 
 extern int sysctl_unprivileged_bpf_disabled;
 
+static inline bool bpf_allow_ptr_leaks(void)
+{
+	return perfmon_capable();
+}
+
+static inline bool bpf_bypass_spec_v1(void)
+{
+	return perfmon_capable();
+}
+
+static inline bool bpf_bypass_spec_v4(void)
+{
+	return perfmon_capable();
+}
+
 int bpf_map_new_fd(struct bpf_map *map, int flags);
 int bpf_prog_new_fd(struct bpf_prog *prog);
 
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 6abd5a778fcd..ea833087e853 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -375,6 +375,9 @@ struct bpf_verifier_env {
 	u32 used_map_cnt;		/* number of used maps */
 	u32 id_gen;			/* used to generate unique reg IDs */
 	bool allow_ptr_leaks;
+	bool bpf_capable;
+	bool bypass_spec_v1;
+	bool bypass_spec_v4;
 	bool seen_direct_write;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	const struct bpf_line_info *prev_linfo;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 95d77770353c..1d5bb0d983b2 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -77,7 +77,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
 	int ret, numa_node = bpf_map_attr_numa_node(attr);
 	u32 elem_size, index_mask, max_entries;
-	bool unpriv = !capable(CAP_SYS_ADMIN);
+	bool bypass_spec_v1 = bpf_bypass_spec_v1();
 	u64 cost, array_size, mask64;
 	struct bpf_map_memory mem;
 	struct bpf_array *array;
@@ -95,7 +95,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	mask64 -= 1;
 
 	index_mask = mask64;
-	if (unpriv) {
+	if (!bypass_spec_v1) {
 		/* round up array size to nearest power of 2,
 		 * since cpu will speculate within index_mask limits
 		 */
@@ -149,7 +149,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 	}
 	array->index_mask = index_mask;
-	array->map.unpriv_array = unpriv;
+	array->map.bypass_spec_v1 = bypass_spec_v1;
 
 	/* copy mandatory map attributes */
 	bpf_map_init_from_attr(&array->map, attr);
@@ -219,7 +219,7 @@ static u32 array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
 
 	*insn++ = BPF_ALU64_IMM(BPF_ADD, map_ptr, offsetof(struct bpf_array, value));
 	*insn++ = BPF_LDX_MEM(BPF_W, ret, index, 0);
-	if (map->unpriv_array) {
+	if (!map->bypass_spec_v1) {
 		*insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 4);
 		*insn++ = BPF_ALU32_IMM(BPF_AND, ret, array->index_mask);
 	} else {
@@ -1053,7 +1053,7 @@ static u32 array_of_map_gen_lookup(struct bpf_map *map,
 
 	*insn++ = BPF_ALU64_IMM(BPF_ADD, map_ptr, offsetof(struct bpf_array, value));
 	*insn++ = BPF_LDX_MEM(BPF_W, ret, index, 0);
-	if (map->unpriv_array) {
+	if (!map->bypass_spec_v1) {
 		*insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 6);
 		*insn++ = BPF_ALU32_IMM(BPF_AND, ret, array->index_mask);
 	} else {
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 26cb51f2db72..c6b0decaa46a 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -557,7 +557,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 	struct bpf_map *map;
 	int err;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 6aa11de67315..c40ff4cf9880 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -646,7 +646,7 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
 void bpf_prog_kallsyms_add(struct bpf_prog *fp)
 {
 	if (!bpf_prog_kallsyms_candidate(fp) ||
-	    !capable(CAP_SYS_ADMIN))
+	    !bpf_capable())
 		return;
 
 	bpf_prog_ksym_set_addr(fp);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 3fe0b006d2d2..be8fdceefe55 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -85,7 +85,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	u64 cost;
 	int ret;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	/* check sanity of attributes */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d541c8486c95..b4b288a3c3c9 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -359,9 +359,9 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 	BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
 		     offsetof(struct htab_elem, hash_node.pprev));
 
-	if (lru && !capable(CAP_SYS_ADMIN))
+	if (lru && !bpf_capable())
 		/* LRU implementation is much complicated than other
-		 * maps.  Hence, limit to CAP_SYS_ADMIN for now.
+		 * maps.  Hence, limit to CAP_BPF.
 		 */
 		return -EPERM;
 
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5c0290e0696e..886949fdcece 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -633,7 +633,7 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		break;
 	}
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return NULL;
 
 	switch (func_id) {
@@ -642,6 +642,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 	case BPF_FUNC_spin_unlock:
 		return &bpf_spin_unlock_proto;
 	case BPF_FUNC_trace_printk:
+		if (!perfmon_capable())
+			return NULL;
 		return bpf_get_trace_printk_proto();
 	case BPF_FUNC_jiffies64:
 		return &bpf_jiffies64_proto;
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 65c236cf341e..c8cc4e4cf98d 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -543,7 +543,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 	u64 cost = sizeof(*trie), cost_per_node;
 	int ret;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	/* check sanity of attributes */
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index b3c48d1533cb..17738c93bec8 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -60,7 +60,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 	/* Misc members not needed in bpf_map_meta_equal() check. */
 	inner_map_meta->ops = inner_map->ops;
 	if (inner_map->ops == &array_map_ops) {
-		inner_map_meta->unpriv_array = inner_map->unpriv_array;
+		inner_map_meta->bypass_spec_v1 = inner_map->bypass_spec_v1;
 		container_of(inner_map_meta, struct bpf_array, map)->index_mask =
 		     container_of(inner_map, struct bpf_array, map)->index_mask;
 	}
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index 30e1373fd437..05c8e043b9d2 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -45,7 +45,7 @@ static bool queue_stack_map_is_full(struct bpf_queue_stack *qs)
 /* Called from syscall */
 static int queue_stack_map_alloc_check(union bpf_attr *attr)
 {
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return -EPERM;
 
 	/* check sanity of attributes */
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 01badd3eda7a..21cde24386db 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -154,7 +154,7 @@ static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 	struct bpf_map_memory mem;
 	u64 array_size;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	array_size = sizeof(*array);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index db76339fe358..7b8381ce40a0 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -93,7 +93,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 	u64 cost, n_buckets;
 	int err;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	if (attr->map_flags & ~STACK_CREATE_FLAG_MASK)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index de2a75500233..422414cc7010 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1534,7 +1534,7 @@ static int map_freeze(const union bpf_attr *attr)
 		err = -EBUSY;
 		goto err_put;
 	}
-	if (!capable(CAP_SYS_ADMIN)) {
+	if (!bpf_capable()) {
 		err = -EPERM;
 		goto err_put;
 	}
@@ -2009,6 +2009,53 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 	}
 }
 
+static bool is_net_admin_prog_type(enum bpf_prog_type prog_type)
+{
+	switch (prog_type) {
+	case BPF_PROG_TYPE_SCHED_CLS:
+	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_XDP:
+	case BPF_PROG_TYPE_LWT_IN:
+	case BPF_PROG_TYPE_LWT_OUT:
+	case BPF_PROG_TYPE_LWT_XMIT:
+	case BPF_PROG_TYPE_LWT_SEG6LOCAL:
+	case BPF_PROG_TYPE_SK_SKB:
+	case BPF_PROG_TYPE_SK_MSG:
+	case BPF_PROG_TYPE_SK_REUSEPORT:
+	case BPF_PROG_TYPE_LIRC_MODE2:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
+	case BPF_PROG_TYPE_CGROUP_DEVICE:
+	case BPF_PROG_TYPE_CGROUP_SOCK:
+	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
+	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+	case BPF_PROG_TYPE_CGROUP_SYSCTL:
+	case BPF_PROG_TYPE_SOCK_OPS:
+	case BPF_PROG_TYPE_EXT: /* extends any prog */
+		return true;
+	case BPF_PROG_TYPE_CGROUP_SKB: /* always unpriv */
+	default:
+		return false;
+	}
+}
+
+static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
+{
+	switch (prog_type) {
+	case BPF_PROG_TYPE_KPROBE:
+	case BPF_PROG_TYPE_TRACEPOINT:
+	case BPF_PROG_TYPE_PERF_EVENT:
+	case BPF_PROG_TYPE_RAW_TRACEPOINT:
+	case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE:
+	case BPF_PROG_TYPE_TRACING:
+	case BPF_PROG_TYPE_LSM:
+	case BPF_PROG_TYPE_STRUCT_OPS: /* has access to struct sock */
+	case BPF_PROG_TYPE_EXT: /* extends any prog */
+		return true;
+	default:
+		return false;
+	}
+}
+
 /* last field in 'union bpf_attr' used by this command */
 #define	BPF_PROG_LOAD_LAST_FIELD attach_prog_fd
 
@@ -2031,7 +2078,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
 	    (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
-	    !capable(CAP_SYS_ADMIN))
+	    !bpf_capable())
 		return -EPERM;
 
 	/* copy eBPF program license from user space */
@@ -2044,11 +2091,16 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 	is_gpl = license_is_gpl_compatible(license);
 
 	if (attr->insn_cnt == 0 ||
-	    attr->insn_cnt > (capable(CAP_SYS_ADMIN) ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
+	    attr->insn_cnt > (bpf_capable() ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
 		return -E2BIG;
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
-	    !capable(CAP_SYS_ADMIN))
+	    !bpf_capable())
+		return -EPERM;
+
+	if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN))
+		return -EPERM;
+	if (is_perfmon_prog_type(type) && !perfmon_capable())
 		return -EPERM;
 
 	bpf_prog_load_fixup_attach_type(attr);
@@ -2682,6 +2734,11 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
 	case BPF_PROG_TYPE_CGROUP_SKB:
+		if (!capable(CAP_NET_ADMIN))
+			/* cg-skb progs can be loaded by unpriv user.
+			 * check permissions at attach time.
+			 */
+			return -EPERM;
 		return prog->enforce_expected_attach_type &&
 			prog->expected_attach_type != attach_type ?
 			-EINVAL : 0;
@@ -2747,9 +2804,6 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	struct bpf_prog *prog;
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	if (CHECK_ATTR(BPF_PROG_ATTACH))
 		return -EINVAL;
 
@@ -2804,9 +2858,6 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 {
 	enum bpf_prog_type ptype;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	if (CHECK_ATTR(BPF_PROG_DETACH))
 		return -EINVAL;
 
@@ -2819,6 +2870,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_LIRC_MODE2:
 		return lirc_prog_detach(attr);
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
+		if (!capable(CAP_NET_ADMIN))
+			return -EPERM;
 		return skb_flow_dissector_bpf_prog_detach(attr);
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 	case BPF_PROG_TYPE_CGROUP_SKB:
@@ -2882,8 +2935,6 @@ static int bpf_prog_test_run(const union bpf_attr *attr,
 	struct bpf_prog *prog;
 	int ret = -ENOTSUPP;
 
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
 	if (CHECK_ATTR(BPF_PROG_TEST_RUN))
 		return -EINVAL;
 
@@ -3184,7 +3235,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 	info.run_time_ns = stats.nsecs;
 	info.run_cnt = stats.cnt;
 
-	if (!capable(CAP_SYS_ADMIN)) {
+	if (!bpf_capable()) {
 		info.jited_prog_len = 0;
 		info.xlated_prog_len = 0;
 		info.nr_jited_ksyms = 0;
@@ -3543,7 +3594,7 @@ static int bpf_btf_load(const union bpf_attr *attr)
 	if (CHECK_ATTR(BPF_BTF_LOAD))
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return -EPERM;
 
 	return btf_new_fd(attr);
@@ -3766,9 +3817,6 @@ static int link_create(union bpf_attr *attr)
 	struct bpf_prog *prog;
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	if (CHECK_ATTR(BPF_LINK_CREATE))
 		return -EINVAL;
 
@@ -3817,9 +3865,6 @@ static int link_update(union bpf_attr *attr)
 	u32 flags;
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	if (CHECK_ATTR(BPF_LINK_UPDATE))
 		return -EINVAL;
 
@@ -3988,7 +4033,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	union bpf_attr attr;
 	int err;
 
-	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
+	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
 		return -EPERM;
 
 	err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2a1826c76bb6..2c34c44974d1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1295,7 +1295,7 @@ static void __mark_reg_unknown(const struct bpf_verifier_env *env,
 	reg->type = SCALAR_VALUE;
 	reg->var_off = tnum_unknown;
 	reg->frameno = 0;
-	reg->precise = env->subprog_cnt > 1 || !env->allow_ptr_leaks;
+	reg->precise = env->subprog_cnt > 1 || !env->bpf_capable;
 	__mark_reg_unbounded(reg);
 }
 
@@ -1427,8 +1427,9 @@ static int check_subprogs(struct bpf_verifier_env *env)
 			continue;
 		if (insn[i].src_reg != BPF_PSEUDO_CALL)
 			continue;
-		if (!env->allow_ptr_leaks) {
-			verbose(env, "function calls to other bpf functions are allowed for root only\n");
+		if (!env->bpf_capable) {
+			verbose(env,
+				"function calls to other bpf functions are allowed for CAP_BPF and CAP_SYS_ADMIN\n");
 			return -EPERM;
 		}
 		ret = add_subprog(env, i + insn[i].imm + 1);
@@ -1962,8 +1963,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
 	bool new_marks = false;
 	int i, err;
 
-	if (!env->allow_ptr_leaks)
-		/* backtracking is root only for now */
+	if (!env->bpf_capable)
 		return 0;
 
 	func = st->frame[st->curframe];
@@ -2211,7 +2211,7 @@ static int check_stack_write(struct bpf_verifier_env *env,
 		reg = &cur->regs[value_regno];
 
 	if (reg && size == BPF_REG_SIZE && register_is_const(reg) &&
-	    !register_is_null(reg) && env->allow_ptr_leaks) {
+	    !register_is_null(reg) && env->bpf_capable) {
 		if (dst_reg != BPF_REG_FP) {
 			/* The backtracking logic can only recognize explicit
 			 * stack slot address like [fp - 8]. Other spill of
@@ -2237,7 +2237,7 @@ static int check_stack_write(struct bpf_verifier_env *env,
 			return -EINVAL;
 		}
 
-		if (!env->allow_ptr_leaks) {
+		if (!env->bypass_spec_v4) {
 			bool sanitize = false;
 
 			if (state->stack[spi].slot_type[0] == STACK_SPILL &&
@@ -3432,7 +3432,7 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno,
 		 * Spectre masking for stack ALU.
 		 * See also retrieve_ptr_limit().
 		 */
-		if (!env->allow_ptr_leaks) {
+		if (!env->bypass_spec_v1) {
 			char tn_buf[48];
 
 			tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
@@ -4435,10 +4435,10 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 
 	if (!BPF_MAP_PTR(aux->map_ptr_state))
 		bpf_map_ptr_store(aux, meta->map_ptr,
-				  meta->map_ptr->unpriv_array);
+				  !meta->map_ptr->bypass_spec_v1);
 	else if (BPF_MAP_PTR(aux->map_ptr_state) != meta->map_ptr)
 		bpf_map_ptr_store(aux, BPF_MAP_PTR_POISON,
-				  meta->map_ptr->unpriv_array);
+				  !meta->map_ptr->bypass_spec_v1);
 	return 0;
 }
 
@@ -4807,7 +4807,7 @@ static int retrieve_ptr_limit(const struct bpf_reg_state *ptr_reg,
 static bool can_skip_alu_sanitation(const struct bpf_verifier_env *env,
 				    const struct bpf_insn *insn)
 {
-	return env->allow_ptr_leaks || BPF_SRC(insn->code) == BPF_K;
+	return env->bypass_spec_v1 || BPF_SRC(insn->code) == BPF_K;
 }
 
 static int update_alu_sanitation_state(struct bpf_insn_aux_data *aux,
@@ -5117,7 +5117,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 	/* For unprivileged we require that resulting offset must be in bounds
 	 * in order to be able to sanitize access later on.
 	 */
-	if (!env->allow_ptr_leaks) {
+	if (!env->bypass_spec_v1) {
 		if (dst_reg->type == PTR_TO_MAP_VALUE &&
 		    check_map_access(env, dst, dst_reg->off, 1, false)) {
 			verbose(env, "R%d pointer arithmetic of map value goes out of range, "
@@ -7244,7 +7244,7 @@ static int push_insn(int t, int w, int e, struct bpf_verifier_env *env,
 		insn_stack[env->cfg.cur_stack++] = w;
 		return 1;
 	} else if ((insn_state[w] & 0xF0) == DISCOVERED) {
-		if (loop_ok && env->allow_ptr_leaks)
+		if (loop_ok && env->bpf_capable)
 			return 0;
 		verbose_linfo(env, t, "%d: ", t);
 		verbose_linfo(env, w, "%d: ", w);
@@ -8353,7 +8353,7 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 	if (env->max_states_per_insn < states_cnt)
 		env->max_states_per_insn = states_cnt;
 
-	if (!env->allow_ptr_leaks && states_cnt > BPF_COMPLEXITY_LIMIT_STATES)
+	if (!env->bpf_capable && states_cnt > BPF_COMPLEXITY_LIMIT_STATES)
 		return push_jmp_history(env, cur);
 
 	if (!add_new_state)
@@ -10014,7 +10014,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			insn->code = BPF_JMP | BPF_TAIL_CALL;
 
 			aux = &env->insn_aux_data[i + delta];
-			if (env->allow_ptr_leaks && !expect_blinding &&
+			if (env->bpf_capable && !expect_blinding &&
 			    prog->jit_requested &&
 			    !bpf_map_key_poisoned(aux) &&
 			    !bpf_map_ptr_poisoned(aux) &&
@@ -10759,7 +10759,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
 		env->insn_aux_data[i].orig_idx = i;
 	env->prog = *prog;
 	env->ops = bpf_verifier_ops[env->prog->type];
-	is_priv = capable(CAP_SYS_ADMIN);
+	is_priv = bpf_capable();
 
 	if (!btf_vmlinux && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) {
 		mutex_lock(&bpf_verifier_lock);
@@ -10800,7 +10800,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
 	if (attr->prog_flags & BPF_F_ANY_ALIGNMENT)
 		env->strict_alignment = false;
 
-	env->allow_ptr_leaks = is_priv;
+	env->allow_ptr_leaks = bpf_allow_ptr_leaks();
+	env->bypass_spec_v1 = bpf_bypass_spec_v1();
+	env->bypass_spec_v4 = bpf_bypass_spec_v4();
+	env->bpf_capable = bpf_capable();
 
 	if (is_priv)
 		env->test_state_freq = attr->prog_flags & BPF_F_TEST_STATE_FREQ;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d961428fb5b6..9a84d7fb4869 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -315,6 +315,9 @@ static const struct bpf_func_proto bpf_probe_write_user_proto = {
 
 static const struct bpf_func_proto *bpf_get_probe_write_proto(void)
 {
+	if (!capable(CAP_SYS_ADMIN))
+		return NULL;
+
 	pr_warn_ratelimited("%s[%d] is installing a program with bpf_probe_write_user helper that may corrupt user memory!",
 			    current->comm, task_pid_nr(current));
 
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index 756b63b6f7b3..d2c4d16dadba 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -625,7 +625,7 @@ static int bpf_sk_storage_map_alloc_check(union bpf_attr *attr)
 	    !attr->btf_key_type_id || !attr->btf_value_type_id)
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return -EPERM;
 
 	if (attr->value_size > MAX_VALUE_SIZE)
@@ -978,7 +978,7 @@ bpf_sk_storage_diag_alloc(const struct nlattr *nla_stgs)
 	/* bpf_sk_storage_map is currently limited to CAP_SYS_ADMIN as
 	 * the map_alloc_check() side also does.
 	 */
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	nla_for_each_nested(nla, nla_stgs, rem) {
diff --git a/net/core/filter.c b/net/core/filter.c
index da0634979f53..c3d33908efc6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6624,7 +6624,7 @@ static bool cg_skb_is_valid_access(int off, int size,
 		return false;
 	case bpf_ctx_range(struct __sk_buff, data):
 	case bpf_ctx_range(struct __sk_buff, data_end):
-		if (!capable(CAP_SYS_ADMIN))
+		if (!bpf_capable())
 			return false;
 		break;
 	}
@@ -6636,7 +6636,7 @@ static bool cg_skb_is_valid_access(int off, int size,
 		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
 			break;
 		case bpf_ctx_range(struct __sk_buff, tstamp):
-			if (!capable(CAP_SYS_ADMIN))
+			if (!bpf_capable())
 				return false;
 			break;
 		default:
-- 
2.23.0



* [PATCH v6 bpf-next 3/3] selftests/bpf: use CAP_BPF and CAP_PERFMON in tests
From: Alexei Starovoitov @ 2020-05-13  3:19 UTC
  To: davem
  Cc: daniel, netdev, bpf, kernel-team, linux-security-module, acme,
	jamorris, jannh, kpsingh

From: Alexei Starovoitov <ast@kernel.org>

Make all test_verifier tests exercise CAP_BPF and CAP_PERFMON.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/test_verifier.c   | 44 +++++++++++++++----
 tools/testing/selftests/bpf/verifier/calls.c  | 16 +++----
 .../selftests/bpf/verifier/dead_code.c        | 10 ++---
 3 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 21a1ce219c1c..78a6bae56ea6 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -818,10 +818,18 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	}
 }
 
+struct libcap {
+	struct __user_cap_header_struct hdr;
+	struct __user_cap_data_struct data[2];
+};
+
 static int set_admin(bool admin)
 {
 	cap_t caps;
-	const cap_value_t cap_val = CAP_SYS_ADMIN;
+	/* need CAP_BPF, CAP_NET_ADMIN, CAP_PERFMON to load progs */
+	const cap_value_t cap_net_admin = CAP_NET_ADMIN;
+	const cap_value_t cap_sys_admin = CAP_SYS_ADMIN;
+	struct libcap *cap;
 	int ret = -1;
 
 	caps = cap_get_proc();
@@ -829,11 +837,26 @@ static int set_admin(bool admin)
 		perror("cap_get_proc");
 		return -1;
 	}
-	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_val,
+	cap = (struct libcap *)caps;
+	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_sys_admin, CAP_CLEAR)) {
+		perror("cap_set_flag clear admin");
+		goto out;
+	}
+	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_admin,
 				admin ? CAP_SET : CAP_CLEAR)) {
-		perror("cap_set_flag");
+		perror("cap_set_flag set_or_clear net");
 		goto out;
 	}
+	/* libcap is likely old and simply ignores CAP_BPF and CAP_PERFMON,
+	 * so update effective bits manually
+	 */
+	if (admin) {
+		cap->data[1].effective |= 1 << (38 /* CAP_PERFMON */ - 32);
+		cap->data[1].effective |= 1 << (39 /* CAP_BPF */ - 32);
+	} else {
+		cap->data[1].effective &= ~(1 << (38 - 32));
+		cap->data[1].effective &= ~(1 << (39 - 32));
+	}
 	if (cap_set_proc(caps)) {
 		perror("cap_set_proc");
 		goto out;
@@ -1067,9 +1090,11 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
 
 static bool is_admin(void)
 {
+	cap_flag_value_t net_priv = CAP_CLEAR;
+	bool perfmon_priv = false;
+	bool bpf_priv = false;
+	struct libcap *cap;
 	cap_t caps;
-	cap_flag_value_t sysadmin = CAP_CLEAR;
-	const cap_value_t cap_val = CAP_SYS_ADMIN;
 
 #ifdef CAP_IS_SUPPORTED
 	if (!CAP_IS_SUPPORTED(CAP_SETFCAP)) {
@@ -1082,11 +1107,14 @@ static bool is_admin(void)
 		perror("cap_get_proc");
 		return false;
 	}
-	if (cap_get_flag(caps, cap_val, CAP_EFFECTIVE, &sysadmin))
-		perror("cap_get_flag");
+	cap = (struct libcap *)caps;
+	bpf_priv = cap->data[1].effective & (1 << (39/* CAP_BPF */ - 32));
+	perfmon_priv = cap->data[1].effective & (1 << (38/* CAP_PERFMON */ - 32));
+	if (cap_get_flag(caps, CAP_NET_ADMIN, CAP_EFFECTIVE, &net_priv))
+		perror("cap_get_flag NET");
 	if (cap_free(caps))
 		perror("cap_free");
-	return (sysadmin == CAP_SET);
+	return bpf_priv && perfmon_priv && net_priv == CAP_SET;
 }
 
 static void get_unpriv_disabled()
diff --git a/tools/testing/selftests/bpf/verifier/calls.c b/tools/testing/selftests/bpf/verifier/calls.c
index 2d752c4f8d9d..7629a0cebb9b 100644
--- a/tools/testing/selftests/bpf/verifier/calls.c
+++ b/tools/testing/selftests/bpf/verifier/calls.c
@@ -19,7 +19,7 @@
 	BPF_MOV64_IMM(BPF_REG_0, 2),
 	BPF_EXIT_INSN(),
 	},
-	.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
+	.errstr_unpriv = "function calls to other bpf functions are allowed for",
 	.result_unpriv = REJECT,
 	.result = ACCEPT,
 	.retval = 1,
@@ -315,7 +315,7 @@
 	BPF_MOV64_REG(BPF_REG_0, BPF_REG_1),
 	BPF_EXIT_INSN(),
 	},
-	.errstr_unpriv = "allowed for root only",
+	.errstr_unpriv = "allowed for",
 	.result_unpriv = REJECT,
 	.result = ACCEPT,
 	.retval = POINTER_VALUE,
@@ -346,7 +346,7 @@
 	BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_2),
 	BPF_EXIT_INSN(),
 	},
-	.errstr_unpriv = "allowed for root only",
+	.errstr_unpriv = "allowed for",
 	.result_unpriv = REJECT,
 	.result = ACCEPT,
 	.retval = TEST_DATA_LEN + TEST_DATA_LEN - ETH_HLEN - ETH_HLEN,
@@ -397,7 +397,7 @@
 	BPF_MOV64_IMM(BPF_REG_0, 1),
 	BPF_EXIT_INSN(),
 	},
-	.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
+	.errstr_unpriv = "function calls to other bpf functions are allowed for",
 	.fixup_map_hash_48b = { 3 },
 	.result_unpriv = REJECT,
 	.result = ACCEPT,
@@ -1064,7 +1064,7 @@
 	BPF_MOV64_IMM(BPF_REG_0, 0),
 	BPF_EXIT_INSN(),
 	},
-	.errstr_unpriv = "allowed for root only",
+	.errstr_unpriv = "allowed for",
 	.result_unpriv = REJECT,
 	.errstr = "R0 !read_ok",
 	.result = REJECT,
@@ -1977,7 +1977,7 @@
 	BPF_EXIT_INSN(),
 	},
 	.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
-	.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
+	.errstr_unpriv = "function calls to other bpf functions are allowed for",
 	.result_unpriv = REJECT,
 	.result = ACCEPT,
 },
@@ -2003,7 +2003,7 @@
 	BPF_EXIT_INSN(),
 	},
 	.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
-	.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
+	.errstr_unpriv = "function calls to other bpf functions are allowed for",
 	.errstr = "!read_ok",
 	.result = REJECT,
 },
@@ -2028,7 +2028,7 @@
 	BPF_EXIT_INSN(),
 	},
 	.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
-	.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
+	.errstr_unpriv = "function calls to other bpf functions are allowed for",
 	.errstr = "!read_ok",
 	.result = REJECT,
 },
diff --git a/tools/testing/selftests/bpf/verifier/dead_code.c b/tools/testing/selftests/bpf/verifier/dead_code.c
index 50a8a63be4ac..5cf361d8eb1c 100644
--- a/tools/testing/selftests/bpf/verifier/dead_code.c
+++ b/tools/testing/selftests/bpf/verifier/dead_code.c
@@ -85,7 +85,7 @@
 	BPF_MOV64_IMM(BPF_REG_0, 12),
 	BPF_EXIT_INSN(),
 	},
-	.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
+	.errstr_unpriv = "function calls to other bpf functions are allowed for",
 	.result_unpriv = REJECT,
 	.result = ACCEPT,
 	.retval = 7,
@@ -103,7 +103,7 @@
 	BPF_MOV64_IMM(BPF_REG_0, 12),
 	BPF_EXIT_INSN(),
 	},
-	.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
+	.errstr_unpriv = "function calls to other bpf functions are allowed for",
 	.result_unpriv = REJECT,
 	.result = ACCEPT,
 	.retval = 7,
@@ -121,7 +121,7 @@
 	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, -5),
 	BPF_EXIT_INSN(),
 	},
-	.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
+	.errstr_unpriv = "function calls to other bpf functions are allowed for",
 	.result_unpriv = REJECT,
 	.result = ACCEPT,
 	.retval = 7,
@@ -137,7 +137,7 @@
 	BPF_MOV64_REG(BPF_REG_0, BPF_REG_1),
 	BPF_EXIT_INSN(),
 	},
-	.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
+	.errstr_unpriv = "function calls to other bpf functions are allowed for",
 	.result_unpriv = REJECT,
 	.result = ACCEPT,
 	.retval = 2,
@@ -152,7 +152,7 @@
 	BPF_MOV64_REG(BPF_REG_0, BPF_REG_1),
 	BPF_EXIT_INSN(),
 	},
-	.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
+	.errstr_unpriv = "function calls to other bpf functions are allowed for",
 	.result_unpriv = REJECT,
 	.result = ACCEPT,
 	.retval = 2,
-- 
2.23.0



* Re: [PATCH v6 bpf-next 0/3] Introduce CAP_BPF
From: Marek Majkowski @ 2020-05-13 10:50 UTC
  To: Alexei Starovoitov
  Cc: David Miller, Daniel Borkmann, network dev, bpf, kernel-team,
	linux-security-module, acme, jamorris, Jann Horn, kpsingh,
	kernel-team

On Wed, May 13, 2020 at 4:19 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> CAP_BPF solves three main goals:
> 1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
>    More on this below. This is the major difference vs v4 set back from Sep 2019.
> 2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
>    prevents pointer leaks and arbitrary kernel memory access.
> 3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
>    and making BPF infra more secure. Currently fuzzers run in unpriv.
>    They will be able to run with CAP_BPF.
>

Alexei, looking at this from a user point of view, this looks fine.

I'm slightly worried about REUSEPORT_EBPF. Currently without your
patch, as far as I understand it:

- You can load SOCKET_FILTER and SO_ATTACH_REUSEPORT_EBPF without any
permissions

- For loading BPF_PROG_TYPE_SK_REUSEPORT program and for SOCKARRAY map
creation CAP_SYS_ADMIN is needed. But again, no permissions check for
SO_ATTACH_REUSEPORT_EBPF later.

If I read the patchset correctly, the former SOCKET_FILTER case
remains as it is and is not affected in any way by presence or absence
of CAP_BPF.

The latter case is different. Presence of CAP_BPF is sufficient for
map creation, but not sufficient for loading SK_REUSEPORT program. It
still requires CAP_SYS_ADMIN. I think it's a good opportunity to relax
this CAP_SYS_ADMIN requirement. I think the presence of CAP_BPF should
be sufficient for loading BPF_PROG_TYPE_SK_REUSEPORT.

Our specific use case is simple - we want an application program -
like nginx - to control REUSEPORT programs. We will grant it CAP_BPF,
but we don't want to grant it CAP_SYS_ADMIN.

Marek


* Re: [PATCH v6 bpf-next 0/3] Introduce CAP_BPF
From: Alexei Starovoitov @ 2020-05-13 17:53 UTC
  To: Marek Majkowski
  Cc: David Miller, Daniel Borkmann, network dev, bpf, kernel-team,
	linux-security-module, acme, jamorris, Jann Horn, kpsingh,
	kernel-team

On Wed, May 13, 2020 at 11:50:42AM +0100, Marek Majkowski wrote:
> On Wed, May 13, 2020 at 4:19 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > CAP_BPF solves three main goals:
> > 1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
> >    More on this below. This is the major difference vs v4 set back from Sep 2019.
> > 2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
> >    prevents pointer leaks and arbitrary kernel memory access.
> > 3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
> >    and making BPF infra more secure. Currently fuzzers run in unpriv.
> >    They will be able to run with CAP_BPF.
> >
> 
> Alexei, looking at this from a user point of view, this looks fine.
> 
> I'm slightly worried about REUSEPORT_EBPF. Currently without your
> patch, as far as I understand it:
> 
> - You can load SOCKET_FILTER and SO_ATTACH_REUSEPORT_EBPF without any
> permissions

correct.

> - For loading BPF_PROG_TYPE_SK_REUSEPORT program and for SOCKARRAY map
> creation CAP_SYS_ADMIN is needed. But again, no permissions check for
> SO_ATTACH_REUSEPORT_EBPF later.

correct. With clarification that attaching process needs to own
FD of prog and FD of socket.

> If I read the patchset correctly, the former SOCKET_FILTER case
> remains as it is and is not affected in any way by presence or absence
> of CAP_BPF.

correct. As commit log says:
"Existing unprivileged BPF operations are not affected."

> The latter case is different. Presence of CAP_BPF is sufficient for
> map creation, but not sufficient for loading SK_REUSEPORT program. It
> still requires CAP_SYS_ADMIN. 

Not quite.
The patch will allow BPF_PROG_TYPE_SK_REUSEPORT progs to be loaded
with CAP_BPF + CAP_NET_ADMIN.
Since this type of progs is clearly networking type I figured it's
better to be consistent with the rest of networking types.
Two unpriv types SOCKET_FILTER and CGROUP_SKB is the only exception.

> I think it's a good opportunity to relax
> this CAP_SYS_ADMIN requirement. I think the presence of CAP_BPF should
> be sufficient for loading BPF_PROG_TYPE_SK_REUSEPORT.
> 
> Our specific use case is simple - we want an application program -
> like nginx - to control REUSEPORT programs. We will grant it CAP_BPF,
> but we don't want to grant it CAP_SYS_ADMIN.

You'll be able to grant nginx CAP_BPF + CAP_NET_ADMIN to load SK_REUSEPORT
and unpriv child process will be able to attach just like before if
it has right FDs.
I suspect your load balancer needs CAP_NET_ADMIN already anyway due to
use of XDP and TC progs.
So granting CAP_BPF + CAP_NET_ADMIN should cover all bpf prog needs.
Does it address your concern?


* Re: [PATCH v6 bpf-next 0/3] Introduce CAP_BPF
From: Marek Majkowski @ 2020-05-13 18:30 UTC
  To: Alexei Starovoitov
  Cc: David Miller, Daniel Borkmann, network dev, bpf, kernel-team,
	linux-security-module, acme, jamorris, Jann Horn, kpsingh,
	kernel-team

On Wed, May 13, 2020 at 6:53 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, May 13, 2020 at 11:50:42AM +0100, Marek Majkowski wrote:
> > On Wed, May 13, 2020 at 4:19 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > CAP_BPF solves three main goals:
> > > 1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
> > >    More on this below. This is the major difference vs v4 set back from Sep 2019.
> > > 2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
> > >    prevents pointer leaks and arbitrary kernel memory access.
> > > 3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
> > >    and making BPF infra more secure. Currently fuzzers run in unpriv.
> > >    They will be able to run with CAP_BPF.
> > >
> >
> > Alexei, looking at this from a user point of view, this looks fine.
> >
> > I'm slightly worried about REUSEPORT_EBPF. Currently without your
> > patch, as far as I understand it:
> >
> > - You can load SOCKET_FILTER and SO_ATTACH_REUSEPORT_EBPF without any
> > permissions
>
> correct.
>
> > - For loading BPF_PROG_TYPE_SK_REUSEPORT program and for SOCKARRAY map
> > creation CAP_SYS_ADMIN is needed. But again, no permissions check for
> > SO_ATTACH_REUSEPORT_EBPF later.
>
> correct. With clarification that attaching process needs to own
> FD of prog and FD of socket.
>
> > If I read the patchset correctly, the former SOCKET_FILTER case
> > remains as it is and is not affected in any way by presence or absence
> > of CAP_BPF.
>
> correct. As commit log says:
> "Existing unprivileged BPF operations are not affected."
>
> > The latter case is different. Presence of CAP_BPF is sufficient for
> > map creation, but not sufficient for loading SK_REUSEPORT program. It
> > still requires CAP_SYS_ADMIN.
>
> Not quite.
> The patch will allow BPF_PROG_TYPE_SK_REUSEPORT progs to be loaded
> with CAP_BPF + CAP_NET_ADMIN.
> Since this type of progs is clearly networking type I figured it's
> better to be consistent with the rest of networking types.
> Two unpriv types SOCKET_FILTER and CGROUP_SKB is the only exception.

Ok, this is the controversy. It made sense to restrict SK_REUSEPORT
programs in the past, because programs needed CAP_NET_ADMIN to create
SOCKARRAY anyway. Now we change this and CAP_BPF is sufficient for
maps - I don't see why CAP_BPF is not sufficient for SK_REUSEPORT
programs. From a user point of view I don't get why this additional
CAP_NET_ADMIN is needed.

> > I think it's a good opportunity to relax
> > this CAP_SYS_ADMIN requirement. I think the presence of CAP_BPF should
> > be sufficient for loading BPF_PROG_TYPE_SK_REUSEPORT.
> >
> > Our specific use case is simple - we want an application program -
> > like nginx - to control REUSEPORT programs. We will grant it CAP_BPF,
> > but we don't want to grant it CAP_SYS_ADMIN.
>
> You'll be able to grant nginx CAP_BPF + CAP_NET_ADMIN to load SK_REUSEPORT
> and unpriv child process will be able to attach just like before if
> it has right FDs.
> I suspect your load balancer needs CAP_NET_ADMIN already anyway due to
> use of XDP and TC progs.
> So granting CAP_BPF + CAP_NET_ADMIN should cover all bpf prog needs.
> Does it address your concern?

Load balancer (XDP+TC) is another layer and permissions there are not
a problem. The specific issue is nginx (port 443) and QUIC. QUIC is
UDP and due to the nginx design we must use REUSEPORT groups to
balance the load across workers. This is fine and could be done with a
simple SOCK_FILTER - we don't need to grant nginx any permissions,
apart from CAP_NET_BIND_SERVICE.

We would like to make the REUSEPORT program more complex to take
advantage of REUSEPORT_EBPF for stickiness (restarting the server without
interfering with existing flows), we are happy to grant nginx CAP_BPF,
but we are not happy to grant it CAP_NET_ADMIN. Requiring this CAP for
REUSEPORT severely restricts the API usability for us.

In my head REUSEPORT_EBPF is much closer to SOCKET_FILTER. I
understand why it needed capabilities before (map creation) and I
argue these reasons go away in CAP_BPF world. I assume that any
service (with CAP_BPF) should be able to use reuseport to distribute
packets within its own sockets.  Let me know if I'm missing something.
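A sketch of the flow being discussed (illustrative only; with this series
the SK_REUSEPORT prog load needs CAP_BPF + CAP_NET_ADMIN instead of
CAP_SYS_ADMIN, while the attach stays a plain setsockopt() on FDs the
process already owns; SO_ATTACH_REUSEPORT_EBPF availability depends on
kernel/libc headers):

#include <sys/socket.h>

static int attach_reuseport_prog(int udp_sock, int prog_fd)
{
	int one = 1;

	if (setsockopt(udp_sock, SOL_SOCKET, SO_REUSEPORT,
		       &one, sizeof(one)))
		return -1;
	return setsockopt(udp_sock, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
			  &prog_fd, sizeof(prog_fd));
}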

Marek


* Re: [PATCH v6 bpf-next 0/3] Introduce CAP_BPF
From: Alexei Starovoitov @ 2020-05-13 18:54 UTC
  To: Marek Majkowski
  Cc: David Miller, Daniel Borkmann, network dev, bpf, kernel-team,
	linux-security-module, acme, jamorris, Jann Horn, kpsingh,
	kernel-team

On Wed, May 13, 2020 at 07:30:05PM +0100, Marek Majkowski wrote:
> On Wed, May 13, 2020 at 6:53 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Wed, May 13, 2020 at 11:50:42AM +0100, Marek Majkowski wrote:
> > > On Wed, May 13, 2020 at 4:19 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > CAP_BPF solves three main goals:
> > > > 1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
> > > >    More on this below. This is the major difference vs v4 set back from Sep 2019.
> > > > 2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
> > > >    prevents pointer leaks and arbitrary kernel memory access.
> > > > 3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
> > > >    and making BPF infra more secure. Currently fuzzers run in unpriv.
> > > >    They will be able to run with CAP_BPF.
> > > >
> > >
> > > Alexei, looking at this from a user point of view, this looks fine.
> > >
> > > I'm slightly worried about REUSEPORT_EBPF. Currently without your
> > > patch, as far as I understand it:
> > >
> > > - You can load SOCKET_FILTER and SO_ATTACH_REUSEPORT_EBPF without any
> > > permissions
> >
> > correct.
> >
> > > - For loading BPF_PROG_TYPE_SK_REUSEPORT program and for SOCKARRAY map
> > > creation CAP_SYS_ADMIN is needed. But again, no permissions check for
> > > SO_ATTACH_REUSEPORT_EBPF later.
> >
> > correct. With the clarification that the attaching process needs
> > to own the FD of the prog and the FD of the socket.
> >
> > > If I read the patchset correctly, the former SOCKET_FILTER case
> > > remains as it is and is not affected in any way by presence or absence
> > > of CAP_BPF.
> >
> > correct. As commit log says:
> > "Existing unprivileged BPF operations are not affected."
> >
> > > The latter case is different. Presence of CAP_BPF is sufficient for
> > > map creation, but not sufficient for loading SK_REUSEPORT program. It
> > > still requires CAP_SYS_ADMIN.
> >
> > Not quite.
> > The patch will allow BPF_PROG_TYPE_SK_REUSEPORT progs to be loaded
> > with CAP_BPF + CAP_NET_ADMIN.
> > Since this prog type is clearly a networking type, I figured it's
> > better to be consistent with the rest of the networking types.
> > The two unpriv types, SOCKET_FILTER and CGROUP_SKB, are the only
> > exceptions.
> 
> Ok, this is the controversial part. It made sense to restrict
> SK_REUSEPORT programs in the past, because programs needed
> CAP_NET_ADMIN to create a SOCKARRAY anyway.

Not quite. Currently sockarray needs CAP_SYS_ADMIN to create, which
makes little sense from a security pov.
CAP_BPF relaxes it to CAP_BPF or CAP_SYS_ADMIN.
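
I.e. roughly (a sketch of the semantics, not the literal diff):

/* What "CAP_BPF or CAP_SYS_ADMIN" means in code; the sockarray
 * creation path then checks this instead of a bare
 * capable(CAP_SYS_ADMIN). CAP_SYS_ADMIN is kept for backward
 * compatibility. */
static inline bool bpf_capable(void)
{
	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
}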

> Now we are changing this and CAP_BPF is sufficient for
> maps - I don't see why CAP_BPF is not sufficient for SK_REUSEPORT
> programs. From a user point of view I don't get why this additional
> CAP_NET_ADMIN is needed.

That actually brings up another point. I'm not changing the sock_map,
sock_hash, and dev_map requirements yet. All three still require
CAP_NET_ADMIN. We can relax them to CAP_BPF _or_ CAP_NET_ADMIN in the
future, but I'd like to do that in a follow-up.
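
To spell out the resulting split, as a kernel-side sketch
(map_create_check_perms() is a made-up name for illustration, not a
function from the patch):

static int map_create_check_perms(enum bpf_map_type type)
{
	switch (type) {
	case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY:
		if (!bpf_capable())	/* CAP_BPF or CAP_SYS_ADMIN */
			return -EPERM;
		break;
	case BPF_MAP_TYPE_SOCKMAP:
	case BPF_MAP_TYPE_SOCKHASH:
	case BPF_MAP_TYPE_DEVMAP:
		if (!capable(CAP_NET_ADMIN))	/* unchanged for now */
			return -EPERM;
		break;
	default:
		break;
	}
	return 0;
}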

> 
> > > I think it's a good opportunity to relax
> > > this CAP_SYS_ADMIN requirement. I think the presence of CAP_BPF should
> > > be sufficient for loading BPF_PROG_TYPE_SK_REUSEPORT.
> > >
> > > Our specific use case is simple - we want an application program -
> > > like nginx - to control REUSEPORT programs. We will grant it CAP_BPF,
> > > but we don't want to grant it CAP_SYS_ADMIN.
> >
> > You'll be able to grant nginx CAP_BPF + CAP_NET_ADMIN to load SK_REUSEPORT
> > and an unpriv child process will be able to attach just like
> > before if it has the right FDs.
> > I suspect your load balancer needs CAP_NET_ADMIN already anyway due to
> > use of XDP and TC progs.
> > So granting CAP_BPF + CAP_NET_ADMIN should cover all bpf prog needs.
> > Does it address your concern?
> 
> The load balancer (XDP+TC) is another layer, and permissions there
> are not a problem. The specific issue is nginx (port 443) and QUIC.
> QUIC is UDP, and due to the nginx design we must use REUSEPORT groups
> to balance the load across workers. This is fine and could be done
> with a simple SOCKET_FILTER - we don't need to grant nginx any
> permissions apart from CAP_NET_BIND_SERVICE.
> 
> We would like to make the REUSEPORT program more complex to take
> advantage of REUSEPORT_EBPF for stickiness (restarting the server
> without interfering with existing flows). We are happy to grant nginx
> CAP_BPF, but we are not happy to grant it CAP_NET_ADMIN. Requiring
> this CAP for REUSEPORT severely restricts the API usability for us.
> 
> In my head REUSEPORT_EBPF is much closer to SOCKET_FILTER. I
> understand why it needed capabilities before (map creation), and I
> argue these reasons go away in the CAP_BPF world. I assume that any
> service (with CAP_BPF) should be able to use reuseport to distribute
> packets across its own sockets. Let me know if I'm missing something.

Fair enough. We can include the SK_REUSEPORT prog type under CAP_BPF alone.
But will it truly achieve what you want?
You still need CAP_NET_ADMIN for sock_hash, which you're using.
Are you saying that's part of a different process that has cap_net_admin,
and nginx will be fine with cap_bpf + cap_net_bind_service?


* Re: [PATCH v6 bpf-next 0/3] Introduce CAP_BPF
  2020-05-13 18:54       ` Alexei Starovoitov
@ 2020-05-13 21:14         ` Marek Majkowski
  0 siblings, 0 replies; 9+ messages in thread
From: Marek Majkowski @ 2020-05-13 21:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Daniel Borkmann, network dev, bpf, kernel-team,
	linux-security-module, acme, jamorris, Jann Horn, KP Singh,
	kernel-team

On Wed, May 13, 2020 at 7:54 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 07:30:05PM +0100, Marek Majkowski wrote:
> > On Wed, May 13, 2020 at 6:53 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > > On Wed, May 13, 2020 at 11:50:42AM +0100, Marek Majkowski wrote:
> > > > On Wed, May 13, 2020 at 4:19 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > CAP_BPF solves three main goals:
> > > > > 1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
> > > > >    More on this below. This is the major difference vs v4 set back from Sep 2019.
> > > > > 2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
> > > > >    prevents pointer leaks and arbitrary kernel memory access.
> > > > > 3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
> > > > >    and making BPF infra more secure. Currently fuzzers run in unpriv.
> > > > >    They will be able to run with CAP_BPF.
> > > > >
> > > >
> > > > Alexei, looking at this from a user point of view, this looks fine.
> > > >
> > > > I'm slightly worried about REUSEPORT_EBPF. Currently without your
> > > > patch, as far as I understand it:
> > > >
> > > > - You can load SOCKET_FILTER and SO_ATTACH_REUSEPORT_EBPF without any
> > > > permissions
> > >
> > > correct.
> > >
> > > > - For loading BPF_PROG_TYPE_SK_REUSEPORT program and for SOCKARRAY map
> > > > creation CAP_SYS_ADMIN is needed. But again, no permissions check for
> > > > SO_ATTACH_REUSEPORT_EBPF later.
> > >
> > > correct. With the clarification that the attaching process needs
> > > to own the FD of the prog and the FD of the socket.
> > >
> > > > If I read the patchset correctly, the former SOCKET_FILTER case
> > > > remains as it is and is not affected in any way by presence or absence
> > > > of CAP_BPF.
> > >
> > > correct. As commit log says:
> > > "Existing unprivileged BPF operations are not affected."
> > >
> > > > The latter case is different. Presence of CAP_BPF is sufficient for
> > > > map creation, but not sufficient for loading SK_REUSEPORT program. It
> > > > still requires CAP_SYS_ADMIN.
> > >
> > > Not quite.
> > > The patch will allow BPF_PROG_TYPE_SK_REUSEPORT progs to be loaded
> > > with CAP_BPF + CAP_NET_ADMIN.
> > > Since this prog type is clearly a networking type, I figured it's
> > > better to be consistent with the rest of the networking types.
> > > The two unpriv types, SOCKET_FILTER and CGROUP_SKB, are the only
> > > exceptions.
> >
> > Ok, this is the controversial part. It made sense to restrict
> > SK_REUSEPORT programs in the past, because programs needed
> > CAP_NET_ADMIN to create a SOCKARRAY anyway.
>
> Not quite. Currently sockarray needs CAP_SYS_ADMIN to create, which
> makes little sense from a security pov.
> CAP_BPF relaxes it to CAP_BPF or CAP_SYS_ADMIN.
>
> > Now we are changing this and CAP_BPF is sufficient for
> > maps - I don't see why CAP_BPF is not sufficient for SK_REUSEPORT
> > programs. From a user point of view I don't get why this additional
> > CAP_NET_ADMIN is needed.
>
> That actually brings up another point. I'm not changing the sock_map,
> sock_hash, and dev_map requirements yet. All three still require
> CAP_NET_ADMIN. We can relax them to CAP_BPF _or_ CAP_NET_ADMIN in the
> future, but I'd like to do that in a follow-up.

Agreed, we can discuss relaxation of SOCKMAP in the future.

> > > > I think it's a good opportunity to relax
> > > > this CAP_SYS_ADMIN requirement. I think the presence of CAP_BPF should
> > > > be sufficient for loading BPF_PROG_TYPE_SK_REUSEPORT.
> > > >
> > > > Our specific use case is simple - we want an application program -
> > > > like nginx - to control REUSEPORT programs. We will grant it CAP_BPF,
> > > > but we don't want to grant it CAP_SYS_ADMIN.
> > >
> > > You'll be able to grant nginx CAP_BPF + CAP_NET_ADMIN to load SK_REUSEPORT
> > > and an unpriv child process will be able to attach just like
> > > before if it has the right FDs.
> > > I suspect your load balancer needs CAP_NET_ADMIN already anyway due to
> > > use of XDP and TC progs.
> > > So granting CAP_BPF + CAP_NET_ADMIN should cover all bpf prog needs.
> > > Does it address your concern?
> >
> > The load balancer (XDP+TC) is another layer, and permissions there
> > are not a problem. The specific issue is nginx (port 443) and QUIC.
> > QUIC is UDP, and due to the nginx design we must use REUSEPORT groups
> > to balance the load across workers. This is fine and could be done
> > with a simple SOCKET_FILTER - we don't need to grant nginx any
> > permissions apart from CAP_NET_BIND_SERVICE.
> >
> > We would like to make the REUSEPORT program more complex to take
> > advantage of REUSEPORT_EBPF for stickiness (restarting the server
> > without interfering with existing flows). We are happy to grant nginx
> > CAP_BPF, but we are not happy to grant it CAP_NET_ADMIN. Requiring
> > this CAP for REUSEPORT severely restricts the API usability for us.
> >
> > In my head REUSEPORT_EBPF is much closer to SOCKET_FILTER. I
> > understand why it needed capabilities before (map creation), and I
> > argue these reasons go away in the CAP_BPF world. I assume that any
> > service (with CAP_BPF) should be able to use reuseport to distribute
> > packets across its own sockets. Let me know if I'm missing something.
>
> Fair enough. We can include the SK_REUSEPORT prog type under CAP_BPF alone.
> But will it truly achieve what you want?

It will make the security model much more useful and sane for me and
for other users of things that depend on SK_REUSEPORT (like nginx +
UDP). So yes, long-term it will help. Thanks.

> You still need CAP_NET_ADMIN for sock_hash, which you're using.
> Are you saying that's part of a different process that has cap_net_admin,
> and nginx will be fine with cap_bpf + cap_net_bind_service?

At this moment the good old SOCKARRAY is sufficient. Having both
SOCKARRAY and SK_REUSEPORT_EBPF depend only on CAP_BPF is a good
start. Thanks for considering that. We can discuss relaxation of
SOCKMAP in the future.
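
For the record, the per-socket flow we want to run without
CAP_NET_ADMIN then looks roughly like this (sketch; prog_fd and
map_fd are assumed to come from an already-loaded SK_REUSEPORT
program and its sockarray, error handling omitted):

#include <linux/bpf.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Publish a bound SO_REUSEPORT socket in the sockarray and attach
 * the SK_REUSEPORT program to its group. The attach itself is only
 * a setsockopt() on a socket we own - no extra capability check, as
 * discussed above. */
static void install_reuseport_prog(int sock_fd, int map_fd, int prog_fd)
{
	union bpf_attr attr;
	__u32 key = 0;
	__u64 value = sock_fd;

	/* Put the worker socket at index 0 of the sockarray. */
	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)&key;
	attr.value = (__u64)(unsigned long)&value;
	syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

	setsockopt(sock_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
		   &prog_fd, sizeof(prog_fd));
}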

Marek


end of thread

Thread overview: 9+ messages
2020-05-13  3:19 [PATCH v6 bpf-next 0/3] Introduce CAP_BPF Alexei Starovoitov
2020-05-13  3:19 ` [PATCH v6 bpf-next 1/3] bpf, capability: " Alexei Starovoitov
2020-05-13  3:19 ` [PATCH v6 bpf-next 2/3] bpf: implement CAP_BPF Alexei Starovoitov
2020-05-13  3:19 ` [PATCH v6 bpf-next 3/3] selftests/bpf: use CAP_BPF and CAP_PERFMON in tests Alexei Starovoitov
2020-05-13 10:50 ` [PATCH v6 bpf-next 0/3] Introduce CAP_BPF Marek Majkowski
2020-05-13 17:53   ` Alexei Starovoitov
2020-05-13 18:30     ` Marek Majkowski
2020-05-13 18:54       ` Alexei Starovoitov
2020-05-13 21:14         ` Marek Majkowski
