BPF Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
@ 2019-08-29  5:12 Alexei Starovoitov
  2019-08-29  5:12 ` [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF Alexei Starovoitov
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2019-08-29  5:12 UTC (permalink / raw)
  To: luto; +Cc: davem, peterz, rostedt, netdev, bpf, kernel-team, linux-api

CAP_BPF allows the following BPF operations:
- Loading all types of BPF programs
- Creating all types of BPF maps except:
   - stackmap that needs CAP_TRACING
   - devmap that needs CAP_NET_ADMIN
   - cpumap that needs CAP_SYS_ADMIN
- Advanced verifier features
  - Indirect variable access
  - Bounded loops
  - BPF to BPF function calls
  - Scalar precision tracking
  - Larger complexity limits
  - Dead code elimination
  - And potentially other features
- Use of pointer-to-integer conversions in BPF programs
- Bypassing of speculation attack hardening measures
- Loading BPF Type Format (BTF) data
- Iterate system wide loaded programs, maps, BTF objects
- Retrieve xlated and JITed code of BPF programs
- Access maps and programs via id
- Use bpf_spin_lock() helper

CAP_BPF and CAP_TRACING together allow the following:
- bpf_probe_read to read arbitrary kernel memory
- bpf_trace_printk to print data to ftrace ring buffer
- Attach to raw_tracepoint
- Query association between kprobe/tracepoint and bpf program

CAP_BPF and CAP_NET_ADMIN together allow the following:
- Attach to cgroup-bpf hooks and query
- skb, xdp, flow_dissector test_run command

CAP_NET_ADMIN allows:
- Attach networking bpf programs to xdp, tc, lwt, flow dissector

CAP_TRACING allows:
- Full use of perf_event_open(), similarly to the effect of
  kernel.perf_event_paranoid == -1
- Full use of tracefs
- Creation of [ku][ret]probe
- Accessing arbitrary kernel memory via kprobe + probe_kernel_read
- Attach tracing bpf programs to perf events
- Access to kallsyms

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/uapi/linux/capability.h     | 51 ++++++++++++++++++++++++++++-
 security/selinux/include/classmap.h |  4 +--
 2 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 240fdb9a60f6..664e07d12888 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -366,8 +366,57 @@ struct vfs_ns_cap_data {
 
 #define CAP_AUDIT_READ		37
 
+/*
+ * CAP_BPF allows the following BPF operations:
+ * - Loading all types of BPF programs
+ * - Creating all types of BPF maps except:
+ *    - stackmap that needs CAP_TRACING
+ *    - devmap that needs CAP_NET_ADMIN
+ *    - cpumap that needs CAP_SYS_ADMIN
+ * - Advanced verifier features
+ *   - Indirect variable access
+ *   - Bounded loops
+ *   - BPF to BPF function calls
+ *   - Scalar precision tracking
+ *   - Larger complexity limits
+ *   - Dead code elimination
+ *   - And potentially other features
+ * - Use of pointer-to-integer conversions in BPF programs
+ * - Bypassing of speculation attack hardening measures
+ * - Loading BPF Type Format (BTF) data
+ * - Iterate system wide loaded programs, maps, BTF objects
+ * - Retrieve xlated and JITed code of BPF programs
+ * - Access maps and programs via id
+ * - Use bpf_spin_lock() helper
+ *
+ * CAP_BPF and CAP_TRACING together allow the following:
+ * - bpf_probe_read to read arbitrary kernel memory
+ * - bpf_trace_printk to print data to ftrace ring buffer
+ * - Attach to raw_tracepoint
+ * - Query association between kprobe/tracepoint and bpf program
+ *
+ * CAP_BPF and CAP_NET_ADMIN together allow the following:
+ * - Attach to cgroup-bpf hooks and query
+ * - skb, xdp, flow_dissector test_run command
+ *
+ * CAP_NET_ADMIN allows:
+ * - Attach networking bpf programs to xdp, tc, lwt, flow dissector
+ */
+#define CAP_BPF			38
+
+/*
+ * CAP_TRACING allows:
+ * - Full use of perf_event_open(), similarly to the effect of
+ *   kernel.perf_event_paranoid == -1
+ * - Full use of tracefs
+ * - Creation of [ku][ret]probe
+ * - Accessing arbitrary kernel memory via kprobe + probe_kernel_read
+ * - Attach tracing bpf programs to perf events
+ * - Access to kallsyms
+ */
+#define CAP_TRACING		39
 
-#define CAP_LAST_CAP         CAP_AUDIT_READ
+#define CAP_LAST_CAP         CAP_TRACING
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 201f7e588a29..0b364e245163 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -26,9 +26,9 @@
 	    "audit_control", "setfcap"
 
 #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
-		"wake_alarm", "block_suspend", "audit_read"
+		"wake_alarm", "block_suspend", "audit_read", "bpf", "tracing"
 
-#if CAP_LAST_CAP > CAP_AUDIT_READ
+#if CAP_LAST_CAP > CAP_TRACING
 #error New capability defined, please update COMMON_CAP2_PERMS.
 #endif
 
-- 
2.20.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
  2019-08-29  5:12 [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING Alexei Starovoitov
@ 2019-08-29  5:12 ` Alexei Starovoitov
  2019-08-29  6:04   ` Song Liu
  2019-08-29 15:32   ` Daniel Borkmann
  2019-08-29  5:12 ` [PATCH v2 bpf-next 3/3] perf: implement CAP_TRACING Alexei Starovoitov
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2019-08-29  5:12 UTC (permalink / raw)
  To: luto; +Cc: davem, peterz, rostedt, netdev, bpf, kernel-team, linux-api

Implement permissions as stated in uapi/linux/capability.h

Note that CAP_SYS_ADMIN is replaced with CAP_BPF.
All existing applications that use BPF do not drop all caps
and keep only CAP_SYS_ADMIN before doing bpf() syscall.
Hence it's highly unlikely that existing code will break.
If there will be reports of breakage then CAP_SYS_ADMIN
would be allowed as well with "it's usage is deprecated" message
similar to commit ee24aebffb75 ("cap_syslog: accept CAP_SYS_ADMIN for now")

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/filter.h                      |  1 +
 kernel/bpf/arraymap.c                       |  2 +-
 kernel/bpf/cgroup.c                         |  2 +-
 kernel/bpf/core.c                           |  9 +++-
 kernel/bpf/hashtab.c                        |  4 +-
 kernel/bpf/lpm_trie.c                       |  2 +-
 kernel/bpf/queue_stack_maps.c               |  2 +-
 kernel/bpf/reuseport_array.c                |  2 +-
 kernel/bpf/stackmap.c                       |  2 +-
 kernel/bpf/syscall.c                        | 32 ++++++++------
 kernel/bpf/verifier.c                       |  4 +-
 kernel/trace/bpf_trace.c                    |  2 +-
 net/core/bpf_sk_storage.c                   |  2 +-
 net/core/filter.c                           | 10 +++--
 tools/testing/selftests/bpf/test_verifier.c | 46 +++++++++++++++++----
 15 files changed, 83 insertions(+), 39 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 92c6e31fb008..16cea50af014 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -857,6 +857,7 @@ static inline bool bpf_dump_raw_ok(void)
 	return kallsyms_show_value() == 1;
 }
 
+bool cap_bpf_tracing(void);
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 1c65ce0098a9..045e30b7160d 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -73,7 +73,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
 	int ret, numa_node = bpf_map_attr_numa_node(attr);
 	u32 elem_size, index_mask, max_entries;
-	bool unpriv = !capable(CAP_SYS_ADMIN);
+	bool unpriv = !capable(CAP_BPF);
 	u64 cost, array_size, mask64;
 	struct bpf_map_memory mem;
 	struct bpf_array *array;
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 6a6a154cfa7b..97f733354421 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -795,7 +795,7 @@ cgroup_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_get_current_cgroup_id:
 		return &bpf_get_current_cgroup_id_proto;
 	case BPF_FUNC_trace_printk:
-		if (capable(CAP_SYS_ADMIN))
+		if (cap_bpf_tracing())
 			return bpf_get_trace_printk_proto();
 		/* fall through */
 	default:
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8191a7db2777..16ed80835156 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -646,7 +646,7 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
 void bpf_prog_kallsyms_add(struct bpf_prog *fp)
 {
 	if (!bpf_prog_kallsyms_candidate(fp) ||
-	    !capable(CAP_SYS_ADMIN))
+	    !capable(CAP_BPF))
 		return;
 
 	spin_lock_bh(&bpf_lock);
@@ -768,7 +768,7 @@ static int bpf_jit_charge_modmem(u32 pages)
 {
 	if (atomic_long_add_return(pages, &bpf_jit_current) >
 	    (bpf_jit_limit >> PAGE_SHIFT)) {
-		if (!capable(CAP_SYS_ADMIN)) {
+		if (!capable(CAP_BPF)) {
 			atomic_long_sub(pages, &bpf_jit_current);
 			return -EPERM;
 		}
@@ -2104,6 +2104,11 @@ int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to,
 DEFINE_STATIC_KEY_FALSE(bpf_stats_enabled_key);
 EXPORT_SYMBOL(bpf_stats_enabled_key);
 
+bool cap_bpf_tracing(void)
+{
+	return capable(CAP_BPF) && capable(CAP_TRACING);
+}
+
 /* All definitions of tracepoints related to BPF. */
 #define CREATE_TRACE_POINTS
 #include <linux/bpf_trace.h>
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 22066a62c8c9..f459315625ac 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -244,9 +244,9 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 	BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
 		     offsetof(struct htab_elem, hash_node.pprev));
 
-	if (lru && !capable(CAP_SYS_ADMIN))
+	if (lru && !capable(CAP_BPF))
 		/* LRU implementation is much complicated than other
-		 * maps.  Hence, limit to CAP_SYS_ADMIN for now.
+		 * maps.  Hence, limit to CAP_BPF.
 		 */
 		return -EPERM;
 
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 56e6c75d354d..a45fa5464d98 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -543,7 +543,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 	u64 cost = sizeof(*trie), cost_per_node;
 	int ret;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_BPF))
 		return ERR_PTR(-EPERM);
 
 	/* check sanity of attributes */
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index f697647ceb54..ca0ba9edca86 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -45,7 +45,7 @@ static bool queue_stack_map_is_full(struct bpf_queue_stack *qs)
 /* Called from syscall */
 static int queue_stack_map_alloc_check(union bpf_attr *attr)
 {
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_BPF))
 		return -EPERM;
 
 	/* check sanity of attributes */
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 50c083ba978c..bfad7d41a061 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -154,7 +154,7 @@ static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 	struct bpf_map_memory mem;
 	u64 array_size;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_BPF))
 		return ERR_PTR(-EPERM);
 
 	array_size = sizeof(*array);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 052580c33d26..beaff32fccc5 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -90,7 +90,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 	u64 cost, n_buckets;
 	int err;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_TRACING))
 		return ERR_PTR(-EPERM);
 
 	if (attr->map_flags & ~STACK_CREATE_FLAG_MASK)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c0f62fd67c6b..ef7b06ca30e5 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1176,7 +1176,7 @@ static int map_freeze(const union bpf_attr *attr)
 		err = -EBUSY;
 		goto err_put;
 	}
-	if (!capable(CAP_SYS_ADMIN)) {
+	if (!capable(CAP_BPF)) {
 		err = -EPERM;
 		goto err_put;
 	}
@@ -1634,7 +1634,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
 	    (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
-	    !capable(CAP_SYS_ADMIN))
+	    !capable(CAP_BPF))
 		return -EPERM;
 
 	/* copy eBPF program license from user space */
@@ -1647,11 +1647,11 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 	is_gpl = license_is_gpl_compatible(license);
 
 	if (attr->insn_cnt == 0 ||
-	    attr->insn_cnt > (capable(CAP_SYS_ADMIN) ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
+	    attr->insn_cnt > (capable(CAP_BPF) ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
 		return -E2BIG;
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
-	    !capable(CAP_SYS_ADMIN))
+	    !capable(CAP_BPF))
 		return -EPERM;
 
 	bpf_prog_load_fixup_attach_type(attr);
@@ -1802,6 +1802,9 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 	char tp_name[128];
 	int tp_fd, err;
 
+	if (!cap_bpf_tracing())
+		return -EPERM;
+
 	if (strncpy_from_user(tp_name, u64_to_user_ptr(attr->raw_tracepoint.name),
 			      sizeof(tp_name) - 1) < 0)
 		return -EFAULT;
@@ -2080,7 +2083,10 @@ static int bpf_prog_test_run(const union bpf_attr *attr,
 	struct bpf_prog *prog;
 	int ret = -ENOTSUPP;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_NET_ADMIN) || !capable(CAP_BPF))
+		/* test_run callback is available for networking progs only.
+		 * Add cap_bpf_tracing() above when tracing progs become runable.
+		 */
 		return -EPERM;
 	if (CHECK_ATTR(BPF_PROG_TEST_RUN))
 		return -EINVAL;
@@ -2117,7 +2123,7 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
 	if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_BPF))
 		return -EPERM;
 
 	next_id++;
@@ -2143,7 +2149,7 @@ static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
 	if (CHECK_ATTR(BPF_PROG_GET_FD_BY_ID))
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_BPF))
 		return -EPERM;
 
 	spin_lock_bh(&prog_idr_lock);
@@ -2177,7 +2183,7 @@ static int bpf_map_get_fd_by_id(const union bpf_attr *attr)
 	    attr->open_flags & ~BPF_OBJ_FLAG_MASK)
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_BPF))
 		return -EPERM;
 
 	f_flags = bpf_get_file_flag(attr->open_flags);
@@ -2352,7 +2358,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 	info.run_time_ns = stats.nsecs;
 	info.run_cnt = stats.cnt;
 
-	if (!capable(CAP_SYS_ADMIN)) {
+	if (!capable(CAP_BPF)) {
 		info.jited_prog_len = 0;
 		info.xlated_prog_len = 0;
 		info.nr_jited_ksyms = 0;
@@ -2670,7 +2676,7 @@ static int bpf_btf_load(const union bpf_attr *attr)
 	if (CHECK_ATTR(BPF_BTF_LOAD))
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_BPF))
 		return -EPERM;
 
 	return btf_new_fd(attr);
@@ -2683,7 +2689,7 @@ static int bpf_btf_get_fd_by_id(const union bpf_attr *attr)
 	if (CHECK_ATTR(BPF_BTF_GET_FD_BY_ID))
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_BPF))
 		return -EPERM;
 
 	return btf_get_fd_by_id(attr->btf_id);
@@ -2752,7 +2758,7 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
 	if (CHECK_ATTR(BPF_TASK_FD_QUERY))
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!cap_bpf_tracing())
 		return -EPERM;
 
 	if (attr->task_fd_query.flags != 0)
@@ -2820,7 +2826,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	union bpf_attr attr = {};
 	int err;
 
-	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
+	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_BPF))
 		return -EPERM;
 
 	err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 10c0ff93f52b..5810e8cc9342 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -987,7 +987,7 @@ static void __mark_reg_unbounded(struct bpf_reg_state *reg)
 	reg->umax_value = U64_MAX;
 
 	/* constant backtracking is enabled for root only for now */
-	reg->precise = capable(CAP_SYS_ADMIN) ? false : true;
+	reg->precise = capable(CAP_BPF) ? false : true;
 }
 
 /* Mark a register as having a completely unknown (scalar) value. */
@@ -9233,7 +9233,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
 		env->insn_aux_data[i].orig_idx = i;
 	env->prog = *prog;
 	env->ops = bpf_verifier_ops[env->prog->type];
-	is_priv = capable(CAP_SYS_ADMIN);
+	is_priv = capable(CAP_BPF);
 
 	/* grab the mutex to protect few globals used by verifier */
 	if (!is_priv)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1255d14576..2bf58ff5bf75 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1246,7 +1246,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
 	u32 *ids, prog_cnt, ids_len;
 	int ret;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!cap_bpf_tracing())
 		return -EPERM;
 	if (event->attr.type != PERF_TYPE_TRACEPOINT)
 		return -EINVAL;
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index da5639a5bd3b..0b29f6abbeba 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -616,7 +616,7 @@ static int bpf_sk_storage_map_alloc_check(union bpf_attr *attr)
 	    !attr->btf_key_type_id || !attr->btf_value_type_id)
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_BPF))
 		return -EPERM;
 
 	if (attr->value_size >= KMALLOC_MAX_SIZE -
diff --git a/net/core/filter.c b/net/core/filter.c
index 0c1059cdad3d..986277abfde2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5990,7 +5990,7 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		break;
 	}
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_BPF))
 		return NULL;
 
 	switch (func_id) {
@@ -5999,7 +5999,9 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 	case BPF_FUNC_spin_unlock:
 		return &bpf_spin_unlock_proto;
 	case BPF_FUNC_trace_printk:
-		return bpf_get_trace_printk_proto();
+		if (cap_bpf_tracing())
+			return bpf_get_trace_printk_proto();
+		/* fall through */
 	default:
 		return NULL;
 	}
@@ -6563,7 +6565,7 @@ static bool cg_skb_is_valid_access(int off, int size,
 		return false;
 	case bpf_ctx_range(struct __sk_buff, data):
 	case bpf_ctx_range(struct __sk_buff, data_end):
-		if (!capable(CAP_SYS_ADMIN))
+		if (!capable(CAP_BPF))
 			return false;
 		break;
 	}
@@ -6575,7 +6577,7 @@ static bool cg_skb_is_valid_access(int off, int size,
 		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
 			break;
 		case bpf_ctx_range(struct __sk_buff, tstamp):
-			if (!capable(CAP_SYS_ADMIN))
+			if (!capable(CAP_BPF))
 				return false;
 			break;
 		default:
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 44e2d640b088..91a7f25512ca 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -805,10 +805,20 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	}
 }
 
+struct libcap {
+	struct __user_cap_header_struct hdr;
+	struct __user_cap_data_struct data[2];
+};
+
 static int set_admin(bool admin)
 {
 	cap_t caps;
-	const cap_value_t cap_val = CAP_SYS_ADMIN;
+	/* need CAP_BPF to load progs and CAP_NET_ADMIN to run networking progs,
+	 * and CAP_TRACING to create stackmap
+	 */
+	const cap_value_t cap_net_admin = CAP_NET_ADMIN;
+	const cap_value_t cap_sys_admin = CAP_SYS_ADMIN;
+	struct libcap *cap;
 	int ret = -1;
 
 	caps = cap_get_proc();
@@ -816,11 +826,26 @@ static int set_admin(bool admin)
 		perror("cap_get_proc");
 		return -1;
 	}
-	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_val,
+	cap = (struct libcap *)caps;
+	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_sys_admin, CAP_CLEAR)) {
+		perror("cap_set_flag clear admin");
+		goto out;
+	}
+	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_admin,
 				admin ? CAP_SET : CAP_CLEAR)) {
-		perror("cap_set_flag");
+		perror("cap_set_flag set_or_clear net");
 		goto out;
 	}
+	/* libcap is likely old and simply ignores CAP_BPF and CAP_TRACING,
+	 * so update effective bits manually
+	 */
+	if (admin) {
+		cap->data[1].effective |= 1 << (38 /* CAP_BPF */ - 32);
+		cap->data[1].effective |= 1 << (39 /* CAP_TRACING */ - 32);
+	} else {
+		cap->data[1].effective &= ~(1 << (38 - 32));
+		cap->data[1].effective &= ~(1 << (39 - 32));
+	}
 	if (cap_set_proc(caps)) {
 		perror("cap_set_proc");
 		goto out;
@@ -1012,9 +1037,11 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
 
 static bool is_admin(void)
 {
+	cap_flag_value_t net_priv = CAP_CLEAR;
+	bool tracing_priv = false;
+	bool bpf_priv = false;
+	struct libcap *cap;
 	cap_t caps;
-	cap_flag_value_t sysadmin = CAP_CLEAR;
-	const cap_value_t cap_val = CAP_SYS_ADMIN;
 
 #ifdef CAP_IS_SUPPORTED
 	if (!CAP_IS_SUPPORTED(CAP_SETFCAP)) {
@@ -1027,11 +1054,14 @@ static bool is_admin(void)
 		perror("cap_get_proc");
 		return false;
 	}
-	if (cap_get_flag(caps, cap_val, CAP_EFFECTIVE, &sysadmin))
-		perror("cap_get_flag");
+	cap = (struct libcap *)caps;
+	bpf_priv = cap->data[1].effective & (1 << (38/* CAP_BPF */ - 32));
+	tracing_priv = cap->data[1].effective & (1 << (39/* CAP_TRACING */ - 32));
+	if (cap_get_flag(caps, CAP_NET_ADMIN, CAP_EFFECTIVE, &net_priv))
+		perror("cap_get_flag NET");
 	if (cap_free(caps))
 		perror("cap_free");
-	return (sysadmin == CAP_SET);
+	return bpf_priv && tracing_priv && net_priv == CAP_SET;
 }
 
 static void get_unpriv_disabled()
-- 
2.20.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 bpf-next 3/3] perf: implement CAP_TRACING
  2019-08-29  5:12 [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING Alexei Starovoitov
  2019-08-29  5:12 ` [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF Alexei Starovoitov
@ 2019-08-29  5:12 ` Alexei Starovoitov
  2019-08-29  6:06   ` Song Liu
  2019-08-29  6:00 ` [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING Song Liu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2019-08-29  5:12 UTC (permalink / raw)
  To: luto; +Cc: davem, peterz, rostedt, netdev, bpf, kernel-team, linux-api

Implement permissions as stated in uapi/linux/capability.h

Similar to CAP_BPF it's highly unlikely that s/CAP_SYS_ADMIN/CAP_TRACING/
replacement will cause user breakage.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 arch/powerpc/perf/core-book3s.c |  4 ++--
 arch/x86/events/intel/bts.c     |  2 +-
 arch/x86/events/intel/core.c    |  2 +-
 arch/x86/events/intel/p4.c      |  2 +-
 kernel/events/core.c            | 14 +++++++-------
 kernel/events/hw_breakpoint.c   |  2 +-
 kernel/trace/trace_event_perf.c |  4 ++--
 7 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index ca92e01d0bd1..ddb5f4b81023 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -204,7 +204,7 @@ static inline void perf_get_data_addr(struct pt_regs *regs, u64 *addrp)
 	if (!(mmcra & MMCRA_SAMPLE_ENABLE) || sdar_valid)
 		*addrp = mfspr(SPRN_SDAR);
 
-	if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN) &&
+	if (perf_paranoid_kernel() && !capable(CAP_TRACING) &&
 		is_kernel_addr(mfspr(SPRN_SDAR)))
 		*addrp = 0;
 }
@@ -472,7 +472,7 @@ static void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw)
 			 * exporting it to userspace (avoid exposure of regions
 			 * where we could have speculative execution)
 			 */
-			if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN) &&
+			if (perf_paranoid_kernel() && !capable(CAP_TRACING) &&
 				is_kernel_addr(addr))
 				continue;
 
diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 5ee3fed881d3..848a104a0d33 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -550,7 +550,7 @@ static int bts_event_init(struct perf_event *event)
 	 * users to profile the kernel.
 	 */
 	if (event->attr.exclude_kernel && perf_paranoid_kernel() &&
-	    !capable(CAP_SYS_ADMIN))
+	    !capable(CAP_TRACING))
 		return -EACCES;
 
 	if (x86_add_exclusive(x86_lbr_exclusive_bts))
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 648260b5f367..08714d33e566 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3307,7 +3307,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	if (x86_pmu.version < 3)
 		return -EINVAL;
 
-	if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+	if (perf_paranoid_cpu() && !capable(CAP_TRACING))
 		return -EACCES;
 
 	event->hw.config |= ARCH_PERFMON_EVENTSEL_ANY;
diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
index dee579efb2b2..abf20e33b523 100644
--- a/arch/x86/events/intel/p4.c
+++ b/arch/x86/events/intel/p4.c
@@ -776,7 +776,7 @@ static int p4_validate_raw_event(struct perf_event *event)
 	 * the user needs special permissions to be able to use it
 	 */
 	if (p4_ht_active() && p4_event_bind_map[v].shared) {
-		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+		if (perf_paranoid_cpu() && !capable(CAP_TRACING))
 			return -EACCES;
 	}
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0463c1151bae..a0e5495cad53 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4134,7 +4134,7 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
 
 	if (!task) {
 		/* Must be root to operate on a CPU event: */
-		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+		if (perf_paranoid_cpu() && !capable(CAP_TRACING))
 			return ERR_PTR(-EACCES);
 
 		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
@@ -8741,7 +8741,7 @@ static int perf_kprobe_event_init(struct perf_event *event)
 	if (event->attr.type != perf_kprobe.type)
 		return -ENOENT;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_TRACING))
 		return -EACCES;
 
 	/*
@@ -8801,7 +8801,7 @@ static int perf_uprobe_event_init(struct perf_event *event)
 	if (event->attr.type != perf_uprobe.type)
 		return -ENOENT;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_TRACING))
 		return -EACCES;
 
 	/*
@@ -10588,7 +10588,7 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 		}
 		/* privileged levels capture (kernel, hv): check permissions */
 		if ((mask & PERF_SAMPLE_BRANCH_PERM_PLM)
-		    && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+		    && perf_paranoid_kernel() && !capable(CAP_TRACING))
 			return -EACCES;
 	}
 
@@ -10807,12 +10807,12 @@ SYSCALL_DEFINE5(perf_event_open,
 		return err;
 
 	if (!attr.exclude_kernel) {
-		if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+		if (perf_paranoid_kernel() && !capable(CAP_TRACING))
 			return -EACCES;
 	}
 
 	if (attr.namespaces) {
-		if (!capable(CAP_SYS_ADMIN))
+		if (!capable(CAP_TRACING))
 			return -EACCES;
 	}
 
@@ -10826,7 +10826,7 @@ SYSCALL_DEFINE5(perf_event_open,
 
 	/* Only privileged users can get physical addresses */
 	if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
-	    perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+	    perf_paranoid_kernel() && !capable(CAP_TRACING))
 		return -EACCES;
 
 	/*
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index c5cd852fe86b..6174d547699d 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -404,7 +404,7 @@ static int hw_breakpoint_parse(struct perf_event *bp,
 		 * Don't let unprivileged users set a breakpoint in the trap
 		 * path to avoid trap recursion attacks.
 		 */
-		if (!capable(CAP_SYS_ADMIN))
+		if (!capable(CAP_TRACING))
 			return -EPERM;
 	}
 
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 0892e38ed6fb..1ec3e08d4539 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -46,7 +46,7 @@ static int perf_trace_event_perm(struct trace_event_call *tp_event,
 
 	/* The ftrace function trace is allowed only for root. */
 	if (ftrace_event_is_function(tp_event)) {
-		if (perf_paranoid_tracepoint_raw() && !capable(CAP_SYS_ADMIN))
+		if (perf_paranoid_tracepoint_raw() && !capable(CAP_TRACING))
 			return -EPERM;
 
 		if (!is_sampling_event(p_event))
@@ -82,7 +82,7 @@ static int perf_trace_event_perm(struct trace_event_call *tp_event,
 	 * ...otherwise raw tracepoint data can be a severe data leak,
 	 * only allow root to have these.
 	 */
-	if (perf_paranoid_tracepoint_raw() && !capable(CAP_SYS_ADMIN))
+	if (perf_paranoid_tracepoint_raw() && !capable(CAP_TRACING))
 		return -EPERM;
 
 	return 0;
-- 
2.20.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
  2019-08-29  5:12 [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING Alexei Starovoitov
  2019-08-29  5:12 ` [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF Alexei Starovoitov
  2019-08-29  5:12 ` [PATCH v2 bpf-next 3/3] perf: implement CAP_TRACING Alexei Starovoitov
@ 2019-08-29  6:00 ` Song Liu
  2019-08-29  7:44 ` Toke Høiland-Jørgensen
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2019-08-29  6:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, davem, peterz, rostedt, netdev, bpf,
	Kernel Team, linux-api


> On Aug 28, 2019, at 10:12 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> 

[...]

> - Creation of [ku][ret]probe
> - Accessing arbitrary kernel memory via kprobe + probe_kernel_read
> - Attach tracing bpf programs to perf events
> - Access to kallsyms
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Song Liu <songliubraving@fb.com>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
  2019-08-29  5:12 ` [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF Alexei Starovoitov
@ 2019-08-29  6:04   ` Song Liu
  2019-08-29 17:28     ` Alexei Starovoitov
  2019-08-29 15:32   ` Daniel Borkmann
  1 sibling, 1 reply; 25+ messages in thread
From: Song Liu @ 2019-08-29  6:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, davem, peterz, rostedt, netdev, bpf,
	Kernel Team, linux-api



> On Aug 28, 2019, at 10:12 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> 

[...]

> diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
> index 44e2d640b088..91a7f25512ca 100644
> --- a/tools/testing/selftests/bpf/test_verifier.c
> +++ b/tools/testing/selftests/bpf/test_verifier.c
> @@ -805,10 +805,20 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
> 	}
> }
> 
> +struct libcap {
> +	struct __user_cap_header_struct hdr;
> +	struct __user_cap_data_struct data[2];
> +};
> +

I am confused by struct libcap. Why do we need it? 

> static int set_admin(bool admin)
> {
> 	cap_t caps;
> -	const cap_value_t cap_val = CAP_SYS_ADMIN;
> +	/* need CAP_BPF to load progs and CAP_NET_ADMIN to run networking progs,
> +	 * and CAP_TRACING to create stackmap
> +	 */
> +	const cap_value_t cap_net_admin = CAP_NET_ADMIN;
> +	const cap_value_t cap_sys_admin = CAP_SYS_ADMIN;
> +	struct libcap *cap;
> 	int ret = -1;
> 
> 	caps = cap_get_proc();
> @@ -816,11 +826,26 @@ static int set_admin(bool admin)
> 		perror("cap_get_proc");
> 		return -1;
> 	}
> -	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_val,
> +	cap = (struct libcap *)caps;
> +	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_sys_admin, CAP_CLEAR)) {
> +		perror("cap_set_flag clear admin");
> +		goto out;
> +	}
> +	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_admin,
> 				admin ? CAP_SET : CAP_CLEAR)) {
> -		perror("cap_set_flag");
> +		perror("cap_set_flag set_or_clear net");
> 		goto out;
> 	}
> +	/* libcap is likely old and simply ignores CAP_BPF and CAP_TRACING,
> +	 * so update effective bits manually
> +	 */
> +	if (admin) {
> +		cap->data[1].effective |= 1 << (38 /* CAP_BPF */ - 32);
> +		cap->data[1].effective |= 1 << (39 /* CAP_TRACING */ - 32);
> +	} else {
> +		cap->data[1].effective &= ~(1 << (38 - 32));
> +		cap->data[1].effective &= ~(1 << (39 - 32));
> +	}

And why we do not need cap->data[0]?

Thanks,
Song


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 3/3] perf: implement CAP_TRACING
  2019-08-29  5:12 ` [PATCH v2 bpf-next 3/3] perf: implement CAP_TRACING Alexei Starovoitov
@ 2019-08-29  6:06   ` Song Liu
  0 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2019-08-29  6:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, David S . Miller, peterz, rostedt, netdev, bpf,
	Kernel Team, linux-api



> On Aug 28, 2019, at 10:12 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> 
> Implement permissions as stated in uapi/linux/capability.h
> 
> Similar to CAP_BPF it's highly unlikely that s/CAP_SYS_ADMIN/CAP_TRACING/
> replacement will cause user breakage.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Song Liu <songliubraving@fb.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
  2019-08-29  5:12 [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2019-08-29  6:00 ` [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING Song Liu
@ 2019-08-29  7:44 ` Toke Høiland-Jørgensen
  2019-08-29 17:24   ` Alexei Starovoitov
  2019-08-29 13:36 ` Nicolas Dichtel
  2019-08-29 15:47 ` Daniel Borkmann
  5 siblings, 1 reply; 25+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-08-29  7:44 UTC (permalink / raw)
  To: Alexei Starovoitov, luto
  Cc: davem, peterz, rostedt, netdev, bpf, kernel-team, linux-api

Alexei Starovoitov <ast@kernel.org> writes:

> CAP_BPF allows the following BPF operations:
> - Loading all types of BPF programs
> - Creating all types of BPF maps except:
>    - stackmap that needs CAP_TRACING
>    - devmap that needs CAP_NET_ADMIN
>    - cpumap that needs CAP_SYS_ADMIN

Why CAP_SYS_ADMIN instead of CAP_NET_ADMIN for cpumap?

-Toke

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
  2019-08-29  5:12 [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2019-08-29  7:44 ` Toke Høiland-Jørgensen
@ 2019-08-29 13:36 ` Nicolas Dichtel
  2019-08-29 17:25   ` Alexei Starovoitov
  2019-08-29 15:47 ` Daniel Borkmann
  5 siblings, 1 reply; 25+ messages in thread
From: Nicolas Dichtel @ 2019-08-29 13:36 UTC (permalink / raw)
  To: Alexei Starovoitov, luto
  Cc: davem, peterz, rostedt, netdev, bpf, kernel-team, linux-api

Le 29/08/2019 à 07:12, Alexei Starovoitov a écrit :
[snip]
> CAP_BPF and CAP_NET_ADMIN together allow the following:
> - Attach to cgroup-bpf hooks and query
> - skb, xdp, flow_dissector test_run command
> 
> CAP_NET_ADMIN allows:
> - Attach networking bpf programs to xdp, tc, lwt, flow dissector
I'm not sure to understand the difference between these last two points.
But, with the current kernel, CAP_NET_ADMIN is not enough to attach bpf prog
with tc and it's still not enough after your patch.
The following command is rejected:
$ tc filter add dev eth0 ingress matchall action bpf obj ./tc_test_kern.o sec test

Prog section 'test' rejected: Operation not permitted (1)!
 - Type:         4
 - Instructions: 22 (0 over limit)
 - License:      GPL

Verifier analysis:

Error fetching program/map!
bad action parsing
parse_action: bad value (5:bpf)!
Illegal "action"
$


Like Andy, I'm also wondering about the backward compatibility. With my current
docker, I'm able to play with tc bpf with CAP_SYS_ADMIN. But if I update my
kernel with your patches, CAP_SYS_ADMIN doesn't allow anymore that and CAP_BPF
is not implemented in my current docker, thus I cannot give the correct
capabilities. In other words, an old docker cannot run on a new kernel.

Regards,
Nicolas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
  2019-08-29  5:12 ` [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF Alexei Starovoitov
  2019-08-29  6:04   ` Song Liu
@ 2019-08-29 15:32   ` Daniel Borkmann
  2019-08-29 17:30     ` Alexei Starovoitov
  1 sibling, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2019-08-29 15:32 UTC (permalink / raw)
  To: Alexei Starovoitov, luto
  Cc: davem, peterz, rostedt, netdev, bpf, kernel-team, linux-api

On 8/29/19 7:12 AM, Alexei Starovoitov wrote:
> Implement permissions as stated in uapi/linux/capability.h
> 
> Note that CAP_SYS_ADMIN is replaced with CAP_BPF.
> All existing applications that use BPF do not drop all caps
> and keep only CAP_SYS_ADMIN before doing bpf() syscall.
> Hence it's highly unlikely that existing code will break.
> If there will be reports of breakage then CAP_SYS_ADMIN
> would be allowed as well with "it's usage is deprecated" message
> similar to commit ee24aebffb75 ("cap_syslog: accept CAP_SYS_ADMIN for now")
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
[...]
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 22066a62c8c9..f459315625ac 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -244,9 +244,9 @@ static int htab_map_alloc_check(union bpf_attr *attr)
>   	BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
>   		     offsetof(struct htab_elem, hash_node.pprev));
>   
> -	if (lru && !capable(CAP_SYS_ADMIN))
> +	if (lru && !capable(CAP_BPF))
>   		/* LRU implementation is much complicated than other
> -		 * maps.  Hence, limit to CAP_SYS_ADMIN for now.
> +		 * maps.  Hence, limit to CAP_BPF.
>   		 */
>   		return -EPERM;
>   
I don't think this works, this is pretty much going to break use cases where
orchestration daemons are deployed as containers that are explicitly granted
specified cap set and right now this is CAP_SYS_ADMIN and not CAP_BPF for bpf().
The former needs to be a superset of the latter in order for this to work and
not break compatibility between kernel upgrades.

- https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-capabilities-for-a-container
- https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
  2019-08-29  5:12 [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2019-08-29 13:36 ` Nicolas Dichtel
@ 2019-08-29 15:47 ` Daniel Borkmann
  2019-08-29 16:28   ` Andy Lutomirski
  2019-08-30  4:16   ` Alexei Starovoitov
  5 siblings, 2 replies; 25+ messages in thread
From: Daniel Borkmann @ 2019-08-29 15:47 UTC (permalink / raw)
  To: Alexei Starovoitov, luto
  Cc: davem, peterz, rostedt, netdev, bpf, kernel-team, linux-api

On 8/29/19 7:12 AM, Alexei Starovoitov wrote:
[...]
>   
> +/*
> + * CAP_BPF allows the following BPF operations:
> + * - Loading all types of BPF programs
> + * - Creating all types of BPF maps except:
> + *    - stackmap that needs CAP_TRACING
> + *    - devmap that needs CAP_NET_ADMIN
> + *    - cpumap that needs CAP_SYS_ADMIN
> + * - Advanced verifier features
> + *   - Indirect variable access
> + *   - Bounded loops
> + *   - BPF to BPF function calls
> + *   - Scalar precision tracking
> + *   - Larger complexity limits
> + *   - Dead code elimination
> + *   - And potentially other features
> + * - Use of pointer-to-integer conversions in BPF programs
> + * - Bypassing of speculation attack hardening measures
> + * - Loading BPF Type Format (BTF) data
> + * - Iterate system wide loaded programs, maps, BTF objects
> + * - Retrieve xlated and JITed code of BPF programs
> + * - Access maps and programs via id
> + * - Use bpf_spin_lock() helper

This is still very wide. Consider following example: app has CAP_BPF +
CAP_NET_ADMIN. Why can't we in this case *only* allow loading networking
related [plus generic] maps and programs? If it doesn't have CAP_TRACING,
what would be a reason to allow loading it? Same vice versa. There are
some misc program types like the infraread stuff, but they could continue
to live under [CAP_BPF +] CAP_SYS_ADMIN as fallback. I think categorizing
a specific list of prog and map types might be more clear than disallowing
some helpers like below (e.g. why choice of bpf_probe_read() but not
bpf_probe_write_user() etc).

> + * CAP_BPF and CAP_TRACING together allow the following:
> + * - bpf_probe_read to read arbitrary kernel memory
> + * - bpf_trace_printk to print data to ftrace ring buffer
> + * - Attach to raw_tracepoint
> + * - Query association between kprobe/tracepoint and bpf program
> + *
> + * CAP_BPF and CAP_NET_ADMIN together allow the following:
> + * - Attach to cgroup-bpf hooks and query
> + * - skb, xdp, flow_dissector test_run command
> + *
> + * CAP_NET_ADMIN allows:
> + * - Attach networking bpf programs to xdp, tc, lwt, flow dissector
> + */
> +#define CAP_BPF			38
> +
> +/*
> + * CAP_TRACING allows:
> + * - Full use of perf_event_open(), similarly to the effect of
> + *   kernel.perf_event_paranoid == -1
> + * - Full use of tracefs
> + * - Creation of [ku][ret]probe
> + * - Accessing arbitrary kernel memory via kprobe + probe_kernel_read
> + * - Attach tracing bpf programs to perf events
> + * - Access to kallsyms
> + */
> +#define CAP_TRACING		39
>   
> -#define CAP_LAST_CAP         CAP_AUDIT_READ
> +#define CAP_LAST_CAP         CAP_TRACING
>   
>   #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
>   
> diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
> index 201f7e588a29..0b364e245163 100644
> --- a/security/selinux/include/classmap.h
> +++ b/security/selinux/include/classmap.h
> @@ -26,9 +26,9 @@
>   	    "audit_control", "setfcap"
>   
>   #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
> -		"wake_alarm", "block_suspend", "audit_read"
> +		"wake_alarm", "block_suspend", "audit_read", "bpf", "tracing"
>   
> -#if CAP_LAST_CAP > CAP_AUDIT_READ
> +#if CAP_LAST_CAP > CAP_TRACING
>   #error New capability defined, please update COMMON_CAP2_PERMS.
>   #endif
>   
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
  2019-08-29 15:47 ` Daniel Borkmann
@ 2019-08-29 16:28   ` Andy Lutomirski
  2019-08-30  4:16   ` Alexei Starovoitov
  1 sibling, 0 replies; 25+ messages in thread
From: Andy Lutomirski @ 2019-08-29 16:28 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, David S. Miller, Peter Zijlstra,
	Steven Rostedt, Network Development, bpf, kernel-team, Linux API

> On Aug 29, 2019, at 8:47 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
>> On 8/29/19 7:12 AM, Alexei Starovoitov wrote:
>> [...]
>>  +/*
>> + * CAP_BPF allows the following BPF operations:
>> + * - Loading all types of BPF programs
>> + * - Creating all types of BPF maps except:
>> + *    - stackmap that needs CAP_TRACING
>> + *    - devmap that needs CAP_NET_ADMIN
>> + *    - cpumap that needs CAP_SYS_ADMIN
>> + * - Advanced verifier features
>> + *   - Indirect variable access
>> + *   - Bounded loops
>> + *   - BPF to BPF function calls
>> + *   - Scalar precision tracking
>> + *   - Larger complexity limits
>> + *   - Dead code elimination
>> + *   - And potentially other features
>> + * - Use of pointer-to-integer conversions in BPF programs
>> + * - Bypassing of speculation attack hardening measures
>> + * - Loading BPF Type Format (BTF) data
>> + * - Iterate system wide loaded programs, maps, BTF objects
>> + * - Retrieve xlated and JITed code of BPF programs
>> + * - Access maps and programs via id
>> + * - Use bpf_spin_lock() helper
>
> This is still very wide. Consider following example: app has CAP_BPF +
> CAP_NET_ADMIN. Why can't we in this case *only* allow loading networking
> related [plus generic] maps and programs? If it doesn't have CAP_TRACING,
> what would be a reason to allow loading it? Same vice versa. There are
> some misc program types like the infraread stuff, but they could continue
> to live under [CAP_BPF +] CAP_SYS_ADMIN as fallback. I think categorizing
> a specific list of prog and map types might be more clear than disallowing
> some helpers like below (e.g. why choice of bpf_probe_read() but not
> bpf_probe_write_user() etc).

Wow, I didn’t notice that bpf_probe_write_user() existed. That should
need something like CAP_PTRACE or CAP_SYS_ADMIN.

I'm starting to think that something like this:

https://lore.kernel.org/bpf/968f3551247a43e1104b198f2e58fb0595d425e7.1565040372.git.luto@kernel.org/

should maybe be finished before CAP_BPF happens at all.  It really
looks like the bpf operations that need privilege need to get fully
catalogued and dealt with rather than just coming up with a new
capability that covers a huge swath.

(bpf_probe_write_user() is also terminally broken on architectures
like s390x, but that's not really relevant right now.  I'm a bit
surprised it works on x86 with SMAP, though.)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
  2019-08-29  7:44 ` Toke Høiland-Jørgensen
@ 2019-08-29 17:24   ` Alexei Starovoitov
  2019-08-29 18:05     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2019-08-29 17:24 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	kernel-team, linux-api

On Thu, Aug 29, 2019 at 09:44:18AM +0200, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <ast@kernel.org> writes:
> 
> > CAP_BPF allows the following BPF operations:
> > - Loading all types of BPF programs
> > - Creating all types of BPF maps except:
> >    - stackmap that needs CAP_TRACING
> >    - devmap that needs CAP_NET_ADMIN
> >    - cpumap that needs CAP_SYS_ADMIN
> 
> Why CAP_SYS_ADMIN instead of CAP_NET_ADMIN for cpumap?

Currently it's cap_sys_admin and I think it should stay this way
because it creates kthreads.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
  2019-08-29 13:36 ` Nicolas Dichtel
@ 2019-08-29 17:25   ` Alexei Starovoitov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2019-08-29 17:25 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	kernel-team, linux-api

On Thu, Aug 29, 2019 at 03:36:42PM +0200, Nicolas Dichtel wrote:
> Le 29/08/2019 à 07:12, Alexei Starovoitov a écrit :
> [snip]
> > CAP_BPF and CAP_NET_ADMIN together allow the following:
> > - Attach to cgroup-bpf hooks and query
> > - skb, xdp, flow_dissector test_run command
> > 
> > CAP_NET_ADMIN allows:
> > - Attach networking bpf programs to xdp, tc, lwt, flow dissector
> I'm not sure to understand the difference between these last two points.
> But, with the current kernel, CAP_NET_ADMIN is not enough to attach bpf prog
> with tc and it's still not enough after your patch.
> The following command is rejected:
> $ tc filter add dev eth0 ingress matchall action bpf obj ./tc_test_kern.o sec test
> 
> Prog section 'test' rejected: Operation not permitted (1)!
>  - Type:         4
>  - Instructions: 22 (0 over limit)
>  - License:      GPL
> 
> Verifier analysis:
> 
> Error fetching program/map!
> bad action parsing
> parse_action: bad value (5:bpf)!
> Illegal "action"

because tc/iproute2 is doing load and attach.
Currently load needs cap_sys_admin and
attach needs cap_net_admin.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
  2019-08-29  6:04   ` Song Liu
@ 2019-08-29 17:28     ` Alexei Starovoitov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2019-08-29 17:28 UTC (permalink / raw)
  To: Song Liu
  Cc: Alexei Starovoitov, Andy Lutomirski, davem, peterz, rostedt,
	netdev, bpf, Kernel Team, linux-api

On Thu, Aug 29, 2019 at 06:04:42AM +0000, Song Liu wrote:
> 
> 
> > On Aug 28, 2019, at 10:12 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> > 
> 
> [...]
> 
> > diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
> > index 44e2d640b088..91a7f25512ca 100644
> > --- a/tools/testing/selftests/bpf/test_verifier.c
> > +++ b/tools/testing/selftests/bpf/test_verifier.c
> > @@ -805,10 +805,20 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
> > 	}
> > }
> > 
> > +struct libcap {
> > +	struct __user_cap_header_struct hdr;
> > +	struct __user_cap_data_struct data[2];
> > +};
> > +
> 
> I am confused by struct libcap. Why do we need it? 

because libcap is not compatible with new kernel.
It needs to be recompiled with new capability.h
Otherwise it limits max to CAP_AUDIT_READ
Any value higher it will error during cap_get_flag.
And will silently ignore it during cap_set_flag.
Not a great library decision.

Thankfully this struct above is exactly the kernel api.
One doesn't really need libcap. It's imo easier to do without it.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
  2019-08-29 15:32   ` Daniel Borkmann
@ 2019-08-29 17:30     ` Alexei Starovoitov
  2019-08-30 15:19       ` Nicolas Dichtel
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2019-08-29 17:30 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	kernel-team, linux-api

On Thu, Aug 29, 2019 at 05:32:27PM +0200, Daniel Borkmann wrote:
> On 8/29/19 7:12 AM, Alexei Starovoitov wrote:
> > Implement permissions as stated in uapi/linux/capability.h
> > 
> > Note that CAP_SYS_ADMIN is replaced with CAP_BPF.
> > All existing applications that use BPF do not drop all caps
> > and keep only CAP_SYS_ADMIN before doing bpf() syscall.
> > Hence it's highly unlikely that existing code will break.
> > If there will be reports of breakage then CAP_SYS_ADMIN
> > would be allowed as well with "it's usage is deprecated" message
> > similar to commit ee24aebffb75 ("cap_syslog: accept CAP_SYS_ADMIN for now")
> > 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> [...]
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index 22066a62c8c9..f459315625ac 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -244,9 +244,9 @@ static int htab_map_alloc_check(union bpf_attr *attr)
> >   	BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
> >   		     offsetof(struct htab_elem, hash_node.pprev));
> > -	if (lru && !capable(CAP_SYS_ADMIN))
> > +	if (lru && !capable(CAP_BPF))
> >   		/* LRU implementation is much complicated than other
> > -		 * maps.  Hence, limit to CAP_SYS_ADMIN for now.
> > +		 * maps.  Hence, limit to CAP_BPF.
> >   		 */
> >   		return -EPERM;
> I don't think this works, this is pretty much going to break use cases where
> orchestration daemons are deployed as containers that are explicitly granted
> specified cap set and right now this is CAP_SYS_ADMIN and not CAP_BPF for bpf().
> The former needs to be a superset of the latter in order for this to work and
> not break compatibility between kernel upgrades.
> 
> - https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-capabilities-for-a-container
> - https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities

These are the links that showing that k8 can delegates caps.
Are you saying that you know of folks who specifically
delegate cap_sys_admin and cap_net_admin _only_ to a container to run bpf in there?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
  2019-08-29 17:24   ` Alexei Starovoitov
@ 2019-08-29 18:05     ` Toke Høiland-Jørgensen
  2019-08-29 20:25       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 25+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-08-29 18:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	kernel-team, linux-api, brouer

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Aug 29, 2019 at 09:44:18AM +0200, Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov <ast@kernel.org> writes:
>> 
>> > CAP_BPF allows the following BPF operations:
>> > - Loading all types of BPF programs
>> > - Creating all types of BPF maps except:
>> >    - stackmap that needs CAP_TRACING
>> >    - devmap that needs CAP_NET_ADMIN
>> >    - cpumap that needs CAP_SYS_ADMIN
>> 
>> Why CAP_SYS_ADMIN instead of CAP_NET_ADMIN for cpumap?
>
> Currently it's cap_sys_admin and I think it should stay this way
> because it creates kthreads.

Ah, right. I can sorta see that makes sense because of the kthreads, but
it also means that you can use all of XDP *except* cpumap with
CAP_NET_ADMIN+CAP_BPF. That is bound to create confusion, isn't it?

-Toke

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
  2019-08-29 18:05     ` Toke Høiland-Jørgensen
@ 2019-08-29 20:25       ` Jesper Dangaard Brouer
  2019-08-29 21:10         ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Jesper Dangaard Brouer @ 2019-08-29 20:25 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Alexei Starovoitov, luto, davem, peterz,
	rostedt, netdev, bpf, kernel-team, linux-api, brouer

On Thu, 29 Aug 2019 20:05:49 +0200
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Thu, Aug 29, 2019 at 09:44:18AM +0200, Toke Høiland-Jørgensen wrote:  
> >> Alexei Starovoitov <ast@kernel.org> writes:
> >>   
> >> > CAP_BPF allows the following BPF operations:
> >> > - Loading all types of BPF programs
> >> > - Creating all types of BPF maps except:
> >> >    - stackmap that needs CAP_TRACING
> >> >    - devmap that needs CAP_NET_ADMIN
> >> >    - cpumap that needs CAP_SYS_ADMIN  
> >> 
> >> Why CAP_SYS_ADMIN instead of CAP_NET_ADMIN for cpumap?  
> >
> > Currently it's cap_sys_admin and I think it should stay this way
> > because it creates kthreads.  
> 
> Ah, right. I can sorta see that makes sense because of the kthreads, but
> it also means that you can use all of XDP *except* cpumap with
> CAP_NET_ADMIN+CAP_BPF. That is bound to create confusion, isn't it?
 
Hmm... I see 'cpumap' primarily as a network stack feature.  It is about
starting the network stack on a specific CPU, allocating and building
SKBs on that remote CPU.  It can only be used together with XDP_REDIRECT.
I would prefer CAP_NET_ADMIN like the devmap, to keep the XDP
capabilities consistent.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
  2019-08-29 20:25       ` Jesper Dangaard Brouer
@ 2019-08-29 21:10         ` Alexei Starovoitov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2019-08-29 21:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov, luto,
	davem, peterz, rostedt, netdev, bpf, kernel-team, linux-api

On Thu, Aug 29, 2019 at 10:25:30PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 29 Aug 2019 20:05:49 +0200
> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> 
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> > 
> > > On Thu, Aug 29, 2019 at 09:44:18AM +0200, Toke Høiland-Jørgensen wrote:  
> > >> Alexei Starovoitov <ast@kernel.org> writes:
> > >>   
> > >> > CAP_BPF allows the following BPF operations:
> > >> > - Loading all types of BPF programs
> > >> > - Creating all types of BPF maps except:
> > >> >    - stackmap that needs CAP_TRACING
> > >> >    - devmap that needs CAP_NET_ADMIN
> > >> >    - cpumap that needs CAP_SYS_ADMIN  
> > >> 
> > >> Why CAP_SYS_ADMIN instead of CAP_NET_ADMIN for cpumap?  
> > >
> > > Currently it's cap_sys_admin and I think it should stay this way
> > > because it creates kthreads.  
> > 
> > Ah, right. I can sorta see that makes sense because of the kthreads, but
> > it also means that you can use all of XDP *except* cpumap with
> > CAP_NET_ADMIN+CAP_BPF. That is bound to create confusion, isn't it?
>  
> Hmm... I see 'cpumap' primarily as a network stack feature.  It is about
> starting the network stack on a specific CPU, allocating and building
> SKBs on that remote CPU.  It can only be used together with XDP_REDIRECT.
> I would prefer CAP_NET_ADMIN like the devmap, to keep the XDP
> capabilities consistent.

I don't mind relaxing cpumap to cap_net_admin.
Looking at the reaction to the rest of the set. I'd rather discuss it
and do it later after basic cap_bpf is in.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
  2019-08-29 15:47 ` Daniel Borkmann
  2019-08-29 16:28   ` Andy Lutomirski
@ 2019-08-30  4:16   ` Alexei Starovoitov
  1 sibling, 0 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2019-08-30  4:16 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andy Lutomirski, David S. Miller,
	Peter Zijlstra, Steven Rostedt, Network Development, bpf,
	Kernel Team, Linux API

On Thu, Aug 29, 2019 at 8:47 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 8/29/19 7:12 AM, Alexei Starovoitov wrote:
> [...]
> >
> > +/*
> > + * CAP_BPF allows the following BPF operations:
> > + * - Loading all types of BPF programs
> > + * - Creating all types of BPF maps except:
> > + *    - stackmap that needs CAP_TRACING
> > + *    - devmap that needs CAP_NET_ADMIN
> > + *    - cpumap that needs CAP_SYS_ADMIN
> > + * - Advanced verifier features
> > + *   - Indirect variable access
> > + *   - Bounded loops
> > + *   - BPF to BPF function calls
> > + *   - Scalar precision tracking
> > + *   - Larger complexity limits
> > + *   - Dead code elimination
> > + *   - And potentially other features
> > + * - Use of pointer-to-integer conversions in BPF programs
> > + * - Bypassing of speculation attack hardening measures
> > + * - Loading BPF Type Format (BTF) data
> > + * - Iterate system wide loaded programs, maps, BTF objects
> > + * - Retrieve xlated and JITed code of BPF programs
> > + * - Access maps and programs via id
> > + * - Use bpf_spin_lock() helper
>
> This is still very wide.

'still very wide' ? you make it sound like it's a bad thing.
Covering all of bpf with single CAP_BPF is #1 goal of this set.

> Consider following example: app has CAP_BPF +> CAP_NET_ADMIN. Why can't we in this case *only* allow loading networking
> related [plus generic] maps and programs? If it doesn't have CAP_TRACING,
> what would be a reason to allow loading it? Same vice versa. There are
> some misc program types like the infraread stuff, but they could continue
> to live under [CAP_BPF +] CAP_SYS_ADMIN as fallback. I think categorizing
> a specific list of prog and map types might be more clear than disallowing
> some helpers like below (e.g. why choice of bpf_probe_read() but not
> bpf_probe_write_user() etc).

It kinda makes sense:
cap_bpf+cap_net_admin allows networking progs.
cap_bpf+cap_tracing allows tracing progs.
But what to do with cg_sysctl, cg_device, lirc ?
They are clearly neither.
Invent yet another cap_foo for them?
or let them under cap_bpf alone?
If cap_bpf alone is enough then why bother with bpf+net_admin for networking?
It's not making anything cleaner. Only confuses users.

Also bpf_trace_printk() is using ftrace and can print arbitrary memory.
In that sense it's no different than bpf_probe_read.
Both should be under CAP_TRACING.
But bpf_trace_printk() is available to all progs.
Even to socket filters under cap_sys_admin today.
With this patch set I'm allowing bpf_trace_printk() under CAP_TRACING.
Same goes to bpf_probe_read.

High level description:
cap_bpf alone allows loading of all progs except when
later cap_net_admin or cap_tracing will _not_ be able to
filter out the helper at attach time that shouldn't be there.

Example of how this patch set works:
- to load and attach networking progs
both cap_bpf and cap_net_admin are necessary.
- to load and attach tracing progs
both cap_bpf and cap_tracing are necessary.

when networking prog is using bpf_trace_printk
then cap_tracing is needed too.
And it's checked at load time.
If we do what you're proposing:
"lets allow load of all networking with bpf+net_admin"
then this won't work for bpf_trace_printk.
Per helper function capability check is still needed.
And since it's needed I see no reason to limit
networking progs to bpf+net_admin at load time.
Load time is still cap_bpf only.
And helpers will be filtered out at attach by net_admin.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
  2019-08-29 17:30     ` Alexei Starovoitov
@ 2019-08-30 15:19       ` Nicolas Dichtel
  2019-09-04  1:39         ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Nicolas Dichtel @ 2019-08-30 15:19 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	kernel-team, linux-api

Le 29/08/2019 à 19:30, Alexei Starovoitov a écrit :
[snip]
> These are the links that showing that k8 can delegates caps.
> Are you saying that you know of folks who specifically
> delegate cap_sys_admin and cap_net_admin _only_ to a container to run bpf in there?
> 
Yes, we need cap_sys_admin only to load bpf:
tc filter add dev eth0 ingress matchall action bpf obj ./tc_test_kern.o sec test

I'm not sure to understand why cap_net_admin is not enough to run the previous
command (ie why load is forbidden).

I want to avoid sys_admin, thus cap_bpf will be ok. But we need to manage the
backward compatibility.

Regards,
Nicolas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
  2019-08-30 15:19       ` Nicolas Dichtel
@ 2019-09-04  1:39         ` Alexei Starovoitov
  2019-09-04 15:16           ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2019-09-04  1:39 UTC (permalink / raw)
  To: nicolas.dichtel, Alexei Starovoitov, Daniel Borkmann
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	Kernel Team, linux-api

On 8/30/19 8:19 AM, Nicolas Dichtel wrote:
> Le 29/08/2019 à 19:30, Alexei Starovoitov a écrit :
> [snip]
>> These are the links that showing that k8 can delegates caps.
>> Are you saying that you know of folks who specifically
>> delegate cap_sys_admin and cap_net_admin _only_ to a container to run bpf in there?
>>
> Yes, we need cap_sys_admin only to load bpf:
> tc filter add dev eth0 ingress matchall action bpf obj ./tc_test_kern.o sec test
> 
> I'm not sure to understand why cap_net_admin is not enough to run the previous
> command (ie why load is forbidden).

because bpf syscall prog_load command requires cap_sys_admin in
the current implementation.

> I want to avoid sys_admin, thus cap_bpf will be ok. But we need to manage the
> backward compatibility.

re: backward compatibility...
do you know of any case where task is running under userid=nobody
with cap_sys_admin and cap_net_admin in order to do bpf ?

If not then what is the concern about compatibility?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
  2019-09-04  1:39         ` Alexei Starovoitov
@ 2019-09-04 15:16           ` Daniel Borkmann
  2019-09-04 15:21             ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2019-09-04 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov, nicolas.dichtel, Alexei Starovoitov
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	Kernel Team, linux-api

On 9/4/19 3:39 AM, Alexei Starovoitov wrote:
> On 8/30/19 8:19 AM, Nicolas Dichtel wrote:
>> Le 29/08/2019 à 19:30, Alexei Starovoitov a écrit :
>> [snip]
>>> These are the links that showing that k8 can delegates caps.
>>> Are you saying that you know of folks who specifically
>>> delegate cap_sys_admin and cap_net_admin _only_ to a container to run bpf in there?
>>>
>> Yes, we need cap_sys_admin only to load bpf:
>> tc filter add dev eth0 ingress matchall action bpf obj ./tc_test_kern.o sec test
>>
>> I'm not sure to understand why cap_net_admin is not enough to run the previous
>> command (ie why load is forbidden).
> 
> because bpf syscall prog_load command requires cap_sys_admin in
> the current implementation.
> 
>> I want to avoid sys_admin, thus cap_bpf will be ok. But we need to manage the
>> backward compatibility.
> 
> re: backward compatibility...
> do you know of any case where task is running under userid=nobody
> with cap_sys_admin and cap_net_admin in order to do bpf ?
> 
> If not then what is the concern about compatibility?

Finally managed to find some cycles to pull up a k8s cluster. Looks like it would
break deployments with the patches as-is right away; meaning, any constellation
where BPF is used inside the pod.

With CAP_BPF patches applied on bpf-next:

# kubectl apply -f ./cilium.yaml
[...]
# kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                               READY   STATUS              RESTARTS   AGE     IP              NODE     NOMINATED NODE   READINESS GATES
kube-system   cilium-cz9qs                       0/1     CrashLoopBackOff    4          2m36s   192.168.1.125   apoc     <none>           <none>
kube-system   cilium-operator-6c7c6c788b-xcm9d   0/1     Pending             0          2m36s   <none>          <none>   <none>           <none>
kube-system   coredns-5c98db65d4-6nhpg           0/1     ContainerCreating   0          4m12s   <none>          apoc     <none>           <none>
kube-system   coredns-5c98db65d4-l5b94           0/1     ContainerCreating   0          4m12s   <none>          apoc     <none>           <none>
kube-system   etcd-apoc                          1/1     Running             0          3m26s   192.168.1.125   apoc     <none>           <none>
kube-system   kube-apiserver-apoc                1/1     Running             0          3m32s   192.168.1.125   apoc     <none>           <none>
kube-system   kube-controller-manager-apoc       1/1     Running             0          3m18s   192.168.1.125   apoc     <none>           <none>
kube-system   kube-proxy-jj9kz                   1/1     Running             0          4m12s   192.168.1.125   apoc     <none>           <none>
kube-system   kube-scheduler-apoc                1/1     Running             0          3m26s   192.168.1.125   apoc     <none>           <none>
# kubectl -n kube-system logs --timestamps cilium-cz9qs
[...]
2019-09-04T14:11:46.399478585Z level=info msg="Cilium 1.6.90 ba0ed147b 2019-09-03T21:20:30+02:00 go version go1.12.8 linux/amd64" subsys=daemon
2019-09-04T14:11:46.410564471Z level=info msg="cilium-envoy  version: b7a919ebdca3d3bbc6aae51357e78e9c603450ae/1.11.1/Modified/RELEASE/BoringSSL" subsys=daemon
2019-09-04T14:11:46.446983926Z level=info msg="clang (7.0.0) and kernel (5.3.0) versions: OK!" subsys=daemon
[...]
2019-09-04T14:11:47.27988188Z level=info msg="Mounting BPF filesystem at /run/cilium/bpffs" subsys=bpf
2019-09-04T14:11:47.279904256Z level=info msg="Detected mounted BPF filesystem at /run/cilium/bpffs" subsys=bpf
2019-09-04T14:11:47.280205098Z level=info msg="Valid label prefix configuration:" subsys=labels-filter
2019-09-04T14:11:47.280214528Z level=info msg=" - :io.kubernetes.pod.namespace" subsys=labels-filter
2019-09-04T14:11:47.28021738Z level=info msg=" - :io.cilium.k8s.namespace.labels" subsys=labels-filter
2019-09-04T14:11:47.280220836Z level=info msg=" - :app.kubernetes.io" subsys=labels-filter
2019-09-04T14:11:47.280223355Z level=info msg=" - !:io.kubernetes" subsys=labels-filter
2019-09-04T14:11:47.280225723Z level=info msg=" - !:kubernetes.io" subsys=labels-filter
2019-09-04T14:11:47.280228095Z level=info msg=" - !:.*beta.kubernetes.io" subsys=labels-filter
2019-09-04T14:11:47.280230409Z level=info msg=" - !:k8s.io" subsys=labels-filter
2019-09-04T14:11:47.280232699Z level=info msg=" - !:pod-template-generation" subsys=labels-filter
2019-09-04T14:11:47.280235569Z level=info msg=" - !:pod-template-hash" subsys=labels-filter
2019-09-04T14:11:47.28023792Z level=info msg=" - !:controller-revision-hash" subsys=labels-filter
2019-09-04T14:11:47.280240253Z level=info msg=" - !:annotation.*" subsys=labels-filter
2019-09-04T14:11:47.280242566Z level=info msg=" - !:etcd_node" subsys=labels-filter
2019-09-04T14:11:47.28026585Z level=info msg="Initializing daemon" subsys=daemon
2019-09-04T14:11:47.281344002Z level=info msg="Detected MTU 1500" subsys=mtu
2019-09-04T14:11:47.281771889Z level=error msg="Error while opening/creating BPF maps" error="Unable to create map /run/cilium/bpffs/tc/globals/cilium_lxc: operation not permitted" subsys=daemon
2019-09-04T14:11:47.28178666Z level=fatal msg="Error while creating daemon" error="Unable to create map /run/cilium/bpffs/tc/globals/cilium_lxc: operation not permitted" subsys=daemon

And /same/ deployment with reverted patches, hence no CAP_BPF gets it up and running again:

# kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE   IP              NODE     NOMINATED NODE   READINESS GATES
kube-system   cilium-cz9qs                       1/1     Running   13         50m   192.168.1.125   apoc     <none>           <none>
kube-system   cilium-operator-6c7c6c788b-xcm9d   0/1     Pending   0          50m   <none>          <none>   <none>           <none>
kube-system   coredns-5c98db65d4-6nhpg           1/1     Running   0          52m   10.217.0.91     apoc     <none>           <none>
kube-system   coredns-5c98db65d4-l5b94           1/1     Running   0          52m   10.217.0.225    apoc     <none>           <none>
kube-system   etcd-apoc                          1/1     Running   1          51m   192.168.1.125   apoc     <none>           <none>
kube-system   kube-apiserver-apoc                1/1     Running   1          51m   192.168.1.125   apoc     <none>           <none>
kube-system   kube-controller-manager-apoc       1/1     Running   1          51m   192.168.1.125   apoc     <none>           <none>
kube-system   kube-proxy-jj9kz                   1/1     Running   1          52m   192.168.1.125   apoc     <none>           <none>
kube-system   kube-scheduler-apoc                1/1     Running   1          51m   192.168.1.125   apoc     <none>           <none>

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
  2019-09-04 15:16           ` Daniel Borkmann
@ 2019-09-04 15:21             ` Alexei Starovoitov
  2019-09-05  8:37               ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2019-09-04 15:21 UTC (permalink / raw)
  To: Daniel Borkmann, nicolas.dichtel, Alexei Starovoitov
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	Kernel Team, linux-api

On 9/4/19 8:16 AM, Daniel Borkmann wrote:
> opening/creating BPF maps" error="Unable to create map 
> /run/cilium/bpffs/tc/globals/cilium_lxc: operation not permitted" 
> subsys=daemon
> 2019-09-04T14:11:47.28178666Z level=fatal msg="Error while creating 
> daemon" error="Unable to create map 
> /run/cilium/bpffs/tc/globals/cilium_lxc: operation not permitted" 
> subsys=daemon

Ok. We have to include caps in both cap_sys_admin and cap_bpf then.

> And /same/ deployment with reverted patches, hence no CAP_BPF gets it up 
> and running again:
> 
> # kubectl get pods --all-namespaces -o wide

Can you share what this magic commands do underneath?

What user do they pick to start under? and what caps are granted?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
  2019-09-04 15:21             ` Alexei Starovoitov
@ 2019-09-05  8:37               ` Daniel Borkmann
  2019-09-05 22:00                 ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2019-09-05  8:37 UTC (permalink / raw)
  To: Alexei Starovoitov, nicolas.dichtel, Alexei Starovoitov
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	Kernel Team, linux-api

On 9/4/19 5:21 PM, Alexei Starovoitov wrote:
> On 9/4/19 8:16 AM, Daniel Borkmann wrote:
>> opening/creating BPF maps" error="Unable to create map
>> /run/cilium/bpffs/tc/globals/cilium_lxc: operation not permitted"
>> subsys=daemon
>> 2019-09-04T14:11:47.28178666Z level=fatal msg="Error while creating
>> daemon" error="Unable to create map
>> /run/cilium/bpffs/tc/globals/cilium_lxc: operation not permitted"
>> subsys=daemon
> 
> Ok. We have to include caps in both cap_sys_admin and cap_bpf then.
> 
>> And /same/ deployment with reverted patches, hence no CAP_BPF gets it up
>> and running again:
>>
>> # kubectl get pods --all-namespaces -o wide
> 
> Can you share what this magic commands do underneath?

What do you mean by magic commands? Latter is showing all pods in the cluster:

https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-finding-resources

I've only been using the normal kubeadm guide for setup, it's pretty straight
forward, just the kubeadm init to bootstrap and then the kubectl create for
deploying if you need to give it a spin for testing:

https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#tabs-pod-install-4

> What user do they pick to start under? and what caps are granted?

The deployment is using a 'securityContext' with 'privileged: true' for the
the container spec as majority of CNIs do. My understanding is that this is
passed down to the underlying container runtime e.g. docker as one option.

Checking moby go code, it seems to exec with GetAllCapabilities which returns
all of the capabilities it is aware of, and afaics, they seem to be using
the below go library to get the hard-coded list from where obviously CAP_BPF
is unknown which might also explain the breakage I've been seeing:

https://github.com/syndtr/gocapability/blob/33e07d32887e1e06b7c025f27ce52f62c7990bc0/capability/enum_gen.go

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
  2019-09-05  8:37               ` Daniel Borkmann
@ 2019-09-05 22:00                 ` Alexei Starovoitov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2019-09-05 22:00 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, nicolas.dichtel, Alexei Starovoitov, luto,
	davem, peterz, rostedt, netdev, bpf, Kernel Team, linux-api

On Thu, Sep 05, 2019 at 10:37:03AM +0200, Daniel Borkmann wrote:
> On 9/4/19 5:21 PM, Alexei Starovoitov wrote:
> > On 9/4/19 8:16 AM, Daniel Borkmann wrote:
> > > opening/creating BPF maps" error="Unable to create map
> > > /run/cilium/bpffs/tc/globals/cilium_lxc: operation not permitted"
> > > subsys=daemon
> > > 2019-09-04T14:11:47.28178666Z level=fatal msg="Error while creating
> > > daemon" error="Unable to create map
> > > /run/cilium/bpffs/tc/globals/cilium_lxc: operation not permitted"
> > > subsys=daemon
> > 
> > Ok. We have to include caps in both cap_sys_admin and cap_bpf then.
> > 
> > > And /same/ deployment with reverted patches, hence no CAP_BPF gets it up
> > > and running again:
> > > 
> > > # kubectl get pods --all-namespaces -o wide
> > 
> > Can you share what this magic commands do underneath?
> 
> What do you mean by magic commands? Latter is showing all pods in the cluster:
> 
> https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-finding-resources

"magic" in a sense that I have no idea how k8s "services" and "pods" map
to kernel namespaces.

> Checking moby go code, it seems to exec with GetAllCapabilities which returns
> all of the capabilities it is aware of, and afaics, they seem to be using
> the below go library to get the hard-coded list from where obviously CAP_BPF
> is unknown which might also explain the breakage I've been seeing:
> 
> https://github.com/syndtr/gocapability/blob/33e07d32887e1e06b7c025f27ce52f62c7990bc0/capability/enum_gen.go

thanks for the link.
That library is much better written than libcap.
capability_linux.go is reading cap_last_cap dynamically and it can understand
proposed CAP_BPF, CAP_TRACING without need o be recompiled (unlike libcap).
So passing new caps to k8s should be trivial. The library won't know
their names, but passing by number looks to be already supported.
I'm still not sure which part of k8s setup clears the caps and
why this v2 set doesn't work, but that doesn't matter any more.
I believe I addressed this compat issue in v3 set.
Could you please give a try just repeating your previous commands?


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, back to index

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-29  5:12 [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING Alexei Starovoitov
2019-08-29  5:12 ` [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF Alexei Starovoitov
2019-08-29  6:04   ` Song Liu
2019-08-29 17:28     ` Alexei Starovoitov
2019-08-29 15:32   ` Daniel Borkmann
2019-08-29 17:30     ` Alexei Starovoitov
2019-08-30 15:19       ` Nicolas Dichtel
2019-09-04  1:39         ` Alexei Starovoitov
2019-09-04 15:16           ` Daniel Borkmann
2019-09-04 15:21             ` Alexei Starovoitov
2019-09-05  8:37               ` Daniel Borkmann
2019-09-05 22:00                 ` Alexei Starovoitov
2019-08-29  5:12 ` [PATCH v2 bpf-next 3/3] perf: implement CAP_TRACING Alexei Starovoitov
2019-08-29  6:06   ` Song Liu
2019-08-29  6:00 ` [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING Song Liu
2019-08-29  7:44 ` Toke Høiland-Jørgensen
2019-08-29 17:24   ` Alexei Starovoitov
2019-08-29 18:05     ` Toke Høiland-Jørgensen
2019-08-29 20:25       ` Jesper Dangaard Brouer
2019-08-29 21:10         ` Alexei Starovoitov
2019-08-29 13:36 ` Nicolas Dichtel
2019-08-29 17:25   ` Alexei Starovoitov
2019-08-29 15:47 ` Daniel Borkmann
2019-08-29 16:28   ` Andy Lutomirski
2019-08-30  4:16   ` Alexei Starovoitov

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org bpf@archiver.kernel.org
	public-inbox-index bpf


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/ public-inbox