* [PATCH v3 0/6] Add eBPF hooks for cgroups
@ 2016-08-26 19:58 Daniel Mack
  2016-08-26 19:58 ` [PATCH v3 1/6] bpf: add new prog type for cgroup socket filtering Daniel Mack
                   ` (6 more replies)
  0 siblings, 7 replies; 35+ messages in thread
From: Daniel Mack @ 2016-08-26 19:58 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, Daniel Mack

This is v3 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The logic can
also be extended for other cgroup-based eBPF use cases.

I am posting this now with only a few changes from v2, because I'll
be travelling for a couple of days and won't have access to my mail.


Changes from v2:

* Fixed the RCU locking details Tejun pointed out.

* Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.


Changes from v1:

* Moved all bpf-specific cgroup code into its own file, and stubbed
  out related functions for !CONFIG_CGROUP_BPF as static inline nops.
  This way, the call sites are not cluttered with #ifdef guards while
  the feature remains compile-time configurable.

* Implemented the new scheme proposed by Tejun. Per cgroup, store one
  set of pointers for programs that are pinned to the cgroup, and one
  for programs that are effective for it. When a program is attached
  or detached, the change is propagated to all the cgroup's
  descendants. If a subcgroup has its own pinned program, the whole
  subbranch is skipped in order to allow delegation models.

* The hookup for egress packets is now done from __dev_queue_xmit().

* A static key is now used in both the ingress and egress fast paths
  to keep performance penalties close to zero if the feature is
  not in use.

* Overall cleanup to make the accessors use the program arrays.
  This should make it much easier to add new program types, which
  will then automatically follow the pinned vs. effective logic.

* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with
  xchg() and are protected by cgroup_mutex.

* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.

* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. Support
  for other socket families can be added in the future.

* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.
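The verdict convention in the last two points can be sketched in plain C
(a userspace model for illustration, not the kernel implementation; the
function name is made up):

```c
#include <errno.h>

/* Userspace model of the verdict convention: the attached eBPF
 * program returns 1 to let the packet pass; any other value makes
 * the hook report -EPERM so the packet is dropped. */
int hook_verdict(int prog_ret)
{
	return prog_ret == 1 ? 0 : -EPERM;
}
```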


As always, feedback is much appreciated.

Thanks,
Daniel

Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: core: run cgroup eBPF egress programs
  samples: bpf: add userspace example for attaching eBPF programs to
    cgroups

 include/linux/bpf-cgroup.h      |  70 +++++++++++++++++
 include/linux/cgroup-defs.h     |   4 +
 include/uapi/linux/bpf.h        |  16 ++++
 init/Kconfig                    |  12 +++
 kernel/bpf/Makefile             |   1 +
 kernel/bpf/cgroup.c             | 165 ++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c            |  83 ++++++++++++++++++++
 kernel/bpf/verifier.c           |   1 +
 kernel/cgroup.c                 |  18 +++++
 net/core/dev.c                  |   6 ++
 net/core/filter.c               |  11 +++
 samples/bpf/Makefile            |   2 +
 samples/bpf/libbpf.c            |  23 ++++++
 samples/bpf/libbpf.h            |   3 +
 samples/bpf/test_cgrp2_attach.c | 147 +++++++++++++++++++++++++++++++++++
 15 files changed, 562 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c
 create mode 100644 samples/bpf/test_cgrp2_attach.c

-- 
2.5.5


* [PATCH v3 1/6] bpf: add new prog type for cgroup socket filtering
  2016-08-26 19:58 [PATCH v3 0/6] Add eBPF hooks for cgroups Daniel Mack
@ 2016-08-26 19:58 ` Daniel Mack
  2016-08-29 22:14   ` Daniel Borkmann
  2016-08-26 19:58 ` [PATCH v3 2/6] cgroup: add support for eBPF programs Daniel Mack
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Daniel Mack @ 2016-08-26 19:58 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, Daniel Mack

For now, this program type is equivalent to BPF_PROG_TYPE_SOCKET_FILTER in
terms of checks during the verification process. It may access the skb as
well.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/uapi/linux/bpf.h | 7 +++++++
 kernel/bpf/verifier.c    | 1 +
 net/core/filter.c        | 6 ++++++
 3 files changed, 14 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e4c5a1b..1d5db42 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -95,6 +95,13 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SCHED_ACT,
 	BPF_PROG_TYPE_TRACEPOINT,
 	BPF_PROG_TYPE_XDP,
+	BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
+};
+
+enum bpf_attach_type {
+	BPF_ATTACH_TYPE_CGROUP_INET_INGRESS,
+	BPF_ATTACH_TYPE_CGROUP_INET_EGRESS,
+	__MAX_BPF_ATTACH_TYPE
 };
 
 #define BPF_PSEUDO_MAP_FD	1
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index abb61f3..12ca880 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1805,6 +1805,7 @@ static bool may_access_skb(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_SOCKET_FILTER:
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_CGROUP_SOCKET_FILTER:
 		return true;
 	default:
 		return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index a83766b..bc04e5c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2848,12 +2848,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list cg_sk_filter_type __read_mostly = {
+	.ops	= &sk_filter_ops,
+	.type	= BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
 	bpf_register_prog_type(&sched_cls_type);
 	bpf_register_prog_type(&sched_act_type);
 	bpf_register_prog_type(&xdp_type);
+	bpf_register_prog_type(&cg_sk_filter_type);
 
 	return 0;
 }
-- 
2.5.5


* [PATCH v3 2/6] cgroup: add support for eBPF programs
  2016-08-26 19:58 [PATCH v3 0/6] Add eBPF hooks for cgroups Daniel Mack
  2016-08-26 19:58 ` [PATCH v3 1/6] bpf: add new prog type for cgroup socket filtering Daniel Mack
@ 2016-08-26 19:58 ` Daniel Mack
  2016-08-27  0:03   ` Alexei Starovoitov
                     ` (2 more replies)
  2016-08-26 19:58 ` [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 35+ messages in thread
From: Daniel Mack @ 2016-08-26 19:58 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, Daniel Mack

This patch adds two sets of eBPF program pointers to struct cgroup:
one for programs that are pinned directly to the cgroup, and one for
programs that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.
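That resolution can be modeled in userspace C (a hypothetical sketch;
the kernel precomputes the effective pointer per cgroup at attach time
instead of walking up, but the resulting program is the same): the
effective program of a cgroup is the one pinned to the nearest
ancestor, including the cgroup itself, that has a program attached.

```c
#include <stddef.h>

/* Hypothetical userspace model of the pinned/effective logic. */
struct cg {
	int parent;          /* index of the parent cgroup, -1 for the root */
	const char *pinned;  /* program pinned to this cgroup, or NULL */
};

const char *effective(const struct cg *tree, int idx)
{
	for (; idx >= 0; idx = tree[idx].parent)
		if (tree[idx].pinned)
			return tree[idx].pinned;
	return NULL;
}
```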

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/linux/bpf-cgroup.h  |  70 +++++++++++++++++++
 include/linux/cgroup-defs.h |   4 ++
 init/Kconfig                |  12 ++++
 kernel/bpf/Makefile         |   1 +
 kernel/bpf/cgroup.c         | 165 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup.c             |  18 +++++
 6 files changed, 270 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 0000000..a5a25c1
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,70 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include <linux/bpf.h>
+#include <uapi/linux/bpf.h>
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+	/*
+	 * Store two sets of bpf_prog pointers, one for programs that are
+	 * pinned directly to this cgroup, and one for those that are effective
+	 * when this cgroup is accessed.
+	 */
+	struct bpf_prog *prog[__MAX_BPF_ATTACH_TYPE];
+	struct bpf_prog *effective[__MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+			 struct cgroup *parent,
+			 struct bpf_prog *prog,
+			 enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+		       struct bpf_prog *prog,
+		       enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+			    struct sk_buff *skb,
+			    enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled */
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+					struct sk_buff *skb,
+					enum bpf_attach_type type)
+{
+	if (cgroup_bpf_enabled)
+		return __cgroup_bpf_run_filter(sk, skb, type);
+
+	return 0;
+}
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+				      struct cgroup *parent) {}
+
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+					struct sk_buff *skb,
+					enum bpf_attach_type type)
+{
+	return 0;
+}
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include <linux/percpu-refcount.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/workqueue.h>
+#include <linux/bpf-cgroup.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -300,6 +301,9 @@ struct cgroup {
 	/* used to schedule release agent */
 	struct work_struct release_agent_work;
 
+	/* used to store eBPF programs */
+	struct cgroup_bpf bpf;
+
 	/* ids of the ancestors at each level including self */
 	int ancestor_ids[];
 };
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..5a89c83 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,18 @@ config CGROUP_PERF
 
 	  Say N if unsure.
 
+config CGROUP_BPF
+	bool "Support for eBPF programs attached to cgroups"
+	depends on BPF_SYSCALL && SOCK_CGROUP_DATA
+	help
+	  Allow attaching eBPF programs to a cgroup using the bpf(2)
+	  syscall command BPF_PROG_ATTACH.
+
+	  The context in which these programs are executed depends on the
+	  type of attachment. For instance, programs that are attached
+	  using BPF_ATTACH_TYPE_CGROUP_INET_INGRESS will be executed on
+	  the ingress path of inet sockets.
+
 config CGROUP_DEBUG
 	bool "Example controller"
 	default n
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index eed911d..b22256b 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
+obj-$(CONFIG_CGROUP_BPF) += cgroup.o
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
new file mode 100644
index 0000000..4d64923
--- /dev/null
+++ b/kernel/bpf/cgroup.c
@@ -0,0 +1,165 @@
+/*
+ * Functions to manage eBPF programs attached to cgroups
+ *
+ * Copyright (c) 2016 Daniel Mack
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/bpf-cgroup.h>
+#include <net/sock.h>
+
+DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
+EXPORT_SYMBOL(cgroup_bpf_enabled_key);
+
+/**
+ * cgroup_bpf_put() - put references of all bpf programs
+ * @cgrp: the cgroup to modify
+ */
+void cgroup_bpf_put(struct cgroup *cgrp)
+{
+	unsigned int type;
+
+	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.prog); type++) {
+		struct bpf_prog *prog = cgrp->bpf.prog[type];
+
+		if (prog) {
+			bpf_prog_put(prog);
+			static_branch_dec(&cgroup_bpf_enabled_key);
+		}
+	}
+}
+
+/**
+ * cgroup_bpf_inherit() - inherit effective programs from parent
+ * @cgrp: the cgroup to modify
+ * @parent: the parent to inherit from
+ */
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
+{
+	unsigned int type;
+
+	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.effective); type++) {
+		struct bpf_prog *e;
+
+		e = rcu_dereference_protected(parent->bpf.effective[type],
+					      lockdep_is_held(&cgroup_mutex));
+		rcu_assign_pointer(cgrp->bpf.effective[type], e);
+	}
+}
+
+/**
+ * __cgroup_bpf_update() - Update the pinned program of a cgroup, and
+ *                         propagate the change to descendants
+ * @cgrp: The cgroup whose descendants to traverse
+ * @prog: A new program to pin, or %NULL to detach the current one
+ * @type: Type of pinning operation (ingress/egress)
+ *
+ * Each cgroup has two sets of bpf_prog pointers: one for programs that
+ * are pinned to it, and one for programs that are effective for it.
+ *
+ * If @prog is not %NULL, this function attaches the new program to the
+ * cgroup and releases the one that is currently attached, if any. @prog
+ * is then made the effective program of type @type in that cgroup.
+ *
+ * If @prog is %NULL, the currently attached program of type @type is
+ * released, and the effective program of the parent cgroup (if any) is
+ * inherited by @cgrp.
+ *
+ * Then, the descendants of @cgrp are walked and the effective program for
+ * each of them is set to the effective program of @cgrp unless the
+ * descendant has its own program attached, in which case the subbranch is
+ * skipped. This ensures that delegated subcgroups with own programs are left
+ * untouched.
+ *
+ * Must be called with cgroup_mutex held.
+ */
+void __cgroup_bpf_update(struct cgroup *cgrp,
+			 struct cgroup *parent,
+			 struct bpf_prog *prog,
+			 enum bpf_attach_type type)
+{
+	struct bpf_prog *old_prog, *effective;
+	struct cgroup_subsys_state *pos;
+
+	old_prog = xchg(cgrp->bpf.prog + type, prog);
+
+	if (prog)
+		static_branch_inc(&cgroup_bpf_enabled_key);
+
+	if (old_prog) {
+		bpf_prog_put(old_prog);
+		static_branch_dec(&cgroup_bpf_enabled_key);
+	}
+
+	effective = (!prog && parent) ?
+		rcu_dereference_protected(parent->bpf.effective[type],
+					  lockdep_is_held(&cgroup_mutex)) :
+		prog;
+
+	css_for_each_descendant_pre(pos, &cgrp->self) {
+		struct cgroup *desc = container_of(pos, struct cgroup, self);
+
+		/* skip the subtree if the descendant has its own program */
+		if (desc->bpf.prog[type] && desc != cgrp)
+			pos = css_rightmost_descendant(pos);
+		else
+			rcu_assign_pointer(desc->bpf.effective[type],
+					   effective);
+	}
+}
+
+/**
+ * __cgroup_bpf_run_filter() - Run a program for packet filtering
+ * @sk: The socket sending or receiving traffic
+ * @skb: The skb that is being sent or received
+ * @type: The type of program to be executed
+ *
+ * If no socket is passed, or the socket is not of type INET or INET6,
+ * this function does nothing and returns 0.
+ *
+ * The program type passed in via @type must be suitable for network
+ * filtering. No further check is performed to assert that.
+ *
+ * This function returns %-EPERM if an attached program was found and it
+ * returned a value != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter(struct sock *sk,
+			    struct sk_buff *skb,
+			    enum bpf_attach_type type)
+{
+	struct bpf_prog *prog;
+	struct cgroup *cgrp;
+	int ret = 0;
+
+	if (!sk)
+		return 0;
+
+	if (sk->sk_family != AF_INET &&
+	    sk->sk_family != AF_INET6)
+		return 0;
+
+	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+
+	rcu_read_lock();
+
+	prog = rcu_dereference(cgrp->bpf.effective[type]);
+	if (prog) {
+		unsigned int offset = skb->data - skb_mac_header(skb);
+
+		__skb_push(skb, offset);
+		ret = bpf_prog_run_clear_cb(prog, skb) == 1 ? 0 : -EPERM;
+		__skb_pull(skb, offset);
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d1c51b7..57ade89 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5038,6 +5038,8 @@ static void css_release_work_fn(struct work_struct *work)
 		if (cgrp->kn)
 			RCU_INIT_POINTER(*(void __rcu __force **)&cgrp->kn->priv,
 					 NULL);
+
+		cgroup_bpf_put(cgrp);
 	}
 
 	mutex_unlock(&cgroup_mutex);
@@ -5245,6 +5247,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	if (!cgroup_on_dfl(cgrp))
 		cgrp->subtree_control = cgroup_control(cgrp);
 
+	if (parent)
+		cgroup_bpf_inherit(cgrp, parent);
+
 	cgroup_propagate_control(cgrp);
 
 	/* @cgrp doesn't have dir yet so the following will only create csses */
@@ -6417,6 +6422,19 @@ static __init int cgroup_namespaces_init(void)
 }
 subsys_initcall(cgroup_namespaces_init);
 
+#ifdef CONFIG_CGROUP_BPF
+void cgroup_bpf_update(struct cgroup *cgrp,
+		       struct bpf_prog *prog,
+		       enum bpf_attach_type type)
+{
+	struct cgroup *parent = cgroup_parent(cgrp);
+
+	mutex_lock(&cgroup_mutex);
+	__cgroup_bpf_update(cgrp, parent, prog, type);
+	mutex_unlock(&cgroup_mutex);
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 #ifdef CONFIG_CGROUP_DEBUG
 static struct cgroup_subsys_state *
 debug_css_alloc(struct cgroup_subsys_state *parent_css)
-- 
2.5.5


* [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-08-26 19:58 [PATCH v3 0/6] Add eBPF hooks for cgroups Daniel Mack
  2016-08-26 19:58 ` [PATCH v3 1/6] bpf: add new prog type for cgroup socket filtering Daniel Mack
  2016-08-26 19:58 ` [PATCH v3 2/6] cgroup: add support for eBPF programs Daniel Mack
@ 2016-08-26 19:58 ` Daniel Mack
  2016-08-27  0:08   ` Alexei Starovoitov
  2016-08-29 23:00   ` Daniel Borkmann
  2016-08-26 19:58 ` [PATCH v3 4/6] net: filter: run cgroup eBPF ingress programs Daniel Mack
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 35+ messages in thread
From: Daniel Mack @ 2016-08-26 19:58 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, Daniel Mack

Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.
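The swap semantics can be modeled in userspace C (an illustration, not
the kernel code, which does the exchange with xchg() under cgroup_mutex;
the names and refcount bookkeeping here are made up): attaching
exchanges in the new program and only then releases the old one, and
detaching is attaching NULL, so the slot is never left empty mid-swap.

```c
#include <stddef.h>

/* Userspace model of the attach/detach slot handling. */
struct prog { int refcnt; };

static struct prog *slot;  /* stands in for cgrp->bpf.prog[type] */
static int released;       /* number of programs released so far */

static void prog_put(struct prog *p)
{
	if (p && --p->refcnt == 0)
		released++;
}

void slot_update(struct prog *new)
{
	struct prog *old = slot;

	slot = new;    /* kernel: xchg(cgrp->bpf.prog + type, prog) */
	prog_put(old); /* old program dropped only after the new one is in */
}
```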

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/uapi/linux/bpf.h |  9 ++++++
 kernel/bpf/syscall.c     | 83 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1d5db42..4cc2dcf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
 	BPF_PROG_LOAD,
 	BPF_OBJ_PIN,
 	BPF_OBJ_GET,
+	BPF_PROG_ATTACH,
+	BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -147,6 +149,13 @@ union bpf_attr {
 		__aligned_u64	pathname;
 		__u32		bpf_fd;
 	};
+
+	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+		__u32		target_fd;	/* container object to attach to */
+		__u32		attach_bpf_fd;	/* eBPF program to attach */
+		__u32		attach_type;	/* BPF_ATTACH_TYPE_* */
+		__u64		attach_flags;
+	};
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..cc4d603 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,79 @@ static int bpf_obj_get(const union bpf_attr *attr)
 	return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+#ifdef CONFIG_CGROUP_BPF
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+	struct bpf_prog *prog;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	/* Flags are unused for now */
+	if (attr->attach_flags != 0)
+		return -EINVAL;
+
+	switch (attr->attach_type) {
+	case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
+	case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
+		struct cgroup *cgrp;
+
+		prog = bpf_prog_get_type(attr->attach_bpf_fd,
+					 BPF_PROG_TYPE_CGROUP_SOCKET_FILTER);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+
+		cgrp = cgroup_get_from_fd(attr->target_fd);
+		if (IS_ERR(cgrp)) {
+			bpf_prog_put(prog);
+			return PTR_ERR(cgrp);
+		}
+
+		cgroup_bpf_update(cgrp, prog, attr->attach_type);
+		cgroup_put(cgrp);
+
+		break;
+	}
+
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	/* Flags are unused for now */
+	if (attr->attach_flags != 0)
+		return -EINVAL;
+
+	switch (attr->attach_type) {
+	case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
+	case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
+		struct cgroup *cgrp;
+
+		cgrp = cgroup_get_from_fd(attr->target_fd);
+		if (IS_ERR(cgrp))
+			return PTR_ERR(cgrp);
+
+		cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+		cgroup_put(cgrp);
+
+		break;
+	}
+
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -888,6 +961,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET:
 		err = bpf_obj_get(&attr);
 		break;
+
+#ifdef CONFIG_CGROUP_BPF
+	case BPF_PROG_ATTACH:
+		err = bpf_prog_attach(&attr);
+		break;
+	case BPF_PROG_DETACH:
+		err = bpf_prog_detach(&attr);
+		break;
+#endif
+
 	default:
 		err = -EINVAL;
 		break;
-- 
2.5.5


* [PATCH v3 4/6] net: filter: run cgroup eBPF ingress programs
  2016-08-26 19:58 [PATCH v3 0/6] Add eBPF hooks for cgroups Daniel Mack
                   ` (2 preceding siblings ...)
  2016-08-26 19:58 ` [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
@ 2016-08-26 19:58 ` Daniel Mack
  2016-08-29 23:15   ` Daniel Borkmann
  2016-08-26 19:58 ` [PATCH v3 5/6] net: core: run cgroup eBPF egress programs Daniel Mack
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Daniel Mack @ 2016-08-26 19:58 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, Daniel Mack

If the cgroup associated with the receiving socket has eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or any other value to drop it. The programs have
access to the full skb, including the MAC headers.

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 net/core/filter.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index bc04e5c..163f75b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,11 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
 	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
 		return -ENOMEM;
 
+	err = cgroup_bpf_run_filter(sk, skb,
+				    BPF_ATTACH_TYPE_CGROUP_INET_INGRESS);
+	if (err)
+		return err;
+
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)
 		return err;
-- 
2.5.5


* [PATCH v3 5/6] net: core: run cgroup eBPF egress programs
  2016-08-26 19:58 [PATCH v3 0/6] Add eBPF hooks for cgroups Daniel Mack
                   ` (3 preceding siblings ...)
  2016-08-26 19:58 ` [PATCH v3 4/6] net: filter: run cgroup eBPF ingress programs Daniel Mack
@ 2016-08-26 19:58 ` Daniel Mack
  2016-08-29 22:03   ` Daniel Borkmann
  2016-08-26 19:58 ` [PATCH v3 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups Daniel Mack
  2016-08-27 13:00 ` [PATCH v3 0/6] Add eBPF hooks for cgroups Rami Rosen
  6 siblings, 1 reply; 35+ messages in thread
From: Daniel Mack @ 2016-08-26 19:58 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, Daniel Mack

If the cgroup associated with the sending socket has eBPF
programs installed, run them from __dev_queue_xmit().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or any other value to drop it. The programs have
access to the full skb, including the MAC headers.

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 net/core/dev.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index a75df86..17484e6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -141,6 +141,7 @@
 #include <linux/netfilter_ingress.h>
 #include <linux/sctp.h>
 #include <linux/crash_dump.h>
+#include <linux/bpf-cgroup.h>
 
 #include "net-sysfs.h"
 
@@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
 		__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
 
+	rc = cgroup_bpf_run_filter(skb->sk, skb,
+				   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
+	if (rc)
+		return rc;
+
 	/* Disable soft irqs for various locks below. Also
 	 * stops preemption for RCU.
 	 */
-- 
2.5.5


* [PATCH v3 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups
  2016-08-26 19:58 [PATCH v3 0/6] Add eBPF hooks for cgroups Daniel Mack
                   ` (4 preceding siblings ...)
  2016-08-26 19:58 ` [PATCH v3 5/6] net: core: run cgroup eBPF egress programs Daniel Mack
@ 2016-08-26 19:58 ` Daniel Mack
  2016-08-27 13:00 ` [PATCH v3 0/6] Add eBPF hooks for cgroups Rami Rosen
  6 siblings, 0 replies; 35+ messages in thread
From: Daniel Mack @ 2016-08-26 19:58 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, Daniel Mack

Add a simple userspace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:

 * Create an arraymap in the kernel with 4-byte keys and 8-byte values

 * Load eBPF program

   The eBPF program accesses the map passed in to store two pieces of
   information. The number of invocations of the program, which maps
   to the number of packets received, is stored to key 0. Key 1 is
   incremented on each iteration by the number of bytes stored in
   the skb.

 * Detach any eBPF program previously attached to the cgroup

 * Attach the new program to the cgroup using BPF_PROG_ATTACH

 * Once a second, read map[0] and map[1] to see how many bytes and
   packets were seen on any socket of tasks in the given cgroup.
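The per-packet accounting described above can be sketched in userspace C
(illustrative only; the real eBPF program does this with
bpf_map_lookup_elem() and an atomic add on the arraymap, and the
function name here is made up):

```c
#include <stdint.h>

enum {
	MAP_KEY_PACKETS,  /* key 0: number of packets seen */
	MAP_KEY_BYTES,    /* key 1: total bytes seen */
	MAP_NKEYS,
};

static uint64_t cg_map[MAP_NKEYS];  /* stands in for the kernel arraymap */

/* Userspace model of the per-packet work: count the packet, add its
 * length to the byte total, and return the configured verdict
 * (1 = pass, 0 = drop). */
int run_prog(uint32_t skb_len, int verdict)
{
	cg_map[MAP_KEY_PACKETS] += 1;
	cg_map[MAP_KEY_BYTES] += skb_len;
	return verdict;
}
```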

The program takes a cgroup path as 1st argument, and either "ingress"
or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
which will make the generated eBPF program return 0 instead of 1, so
the kernel will drop the packet.

libbpf gained two new wrappers for the new syscall commands.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 samples/bpf/Makefile            |   2 +
 samples/bpf/libbpf.c            |  23 +++++++
 samples/bpf/libbpf.h            |   3 +
 samples/bpf/test_cgrp2_attach.c | 147 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 175 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index db3cb06..5c752f5 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -47,6 +48,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..95e196e 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,29 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+	union bpf_attr attr = {
+		.target_fd = target_fd,
+		.attach_bpf_fd = prog_fd,
+		.attach_type = type,
+		.attach_flags = 0,
+	};
+
+	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+	union bpf_attr attr = {
+		.target_fd = target_fd,
+		.attach_type = type,
+		.attach_flags = 0,
+	};
+
+	return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 int bpf_obj_pin(int fd, const char *pathname)
 {
 	union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 364582b..f973241 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 		  const struct bpf_insn *insns, int insn_len,
 		  const char *license, int kern_version);
 
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 0000000..0a44c3d
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,147 @@
+/* eBPF example program:
+ *
+ * - Creates arraymap in kernel with 4 bytes keys and 8 byte values
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program accesses the map passed in to store two pieces of
+ *   information. The number of invocations of the program, which maps
+ *   to the number of packets received, is stored to key 0. Key 1 is
+ *   incremented on each iteration by the number of bytes stored in
+ *   the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ *   packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+
+#include <linux/bpf.h>
+
+#include "libbpf.h"
+
+enum {
+	MAP_KEY_PACKETS,
+	MAP_KEY_BYTES,
+};
+
+static int prog_load(int map_fd, int verdict)
+{
+	struct bpf_insn prog[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* save r6 so it's not clobbered by BPF_CALL */
+
+		/* Count packets */
+		BPF_MOV64_IMM(BPF_REG_0, MAP_KEY_PACKETS), /* r0 = 0 */
+		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+		BPF_LD_MAP_FD(BPF_REG_1, map_fd), /* load map fd to r1 */
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+
+		/* Count bytes */
+		BPF_MOV64_IMM(BPF_REG_0, MAP_KEY_BYTES), /* r0 = 1 */
+		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_6, offsetof(struct __sk_buff, len)), /* r1 = skb->len */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+
+		BPF_MOV64_IMM(BPF_REG_0, verdict), /* r0 = verdict */
+		BPF_EXIT_INSN(),
+	};
+
+	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
+			     prog, sizeof(prog), "GPL", 0);
+}
+
+static int usage(const char *argv0)
+{
+	printf("Usage: %s <cg-path> <egress|ingress> [drop]\n", argv0);
+	return EXIT_FAILURE;
+}
+
+int main(int argc, char **argv)
+{
+	int cg_fd, map_fd, prog_fd, key, ret;
+	long long pkt_cnt, byte_cnt;
+	enum bpf_attach_type type;
+	int verdict = 1;
+
+	if (argc < 3)
+		return usage(argv[0]);
+
+	if (strcmp(argv[2], "ingress") == 0)
+		type = BPF_ATTACH_TYPE_CGROUP_INET_INGRESS;
+	else if (strcmp(argv[2], "egress") == 0)
+		type = BPF_ATTACH_TYPE_CGROUP_INET_EGRESS;
+	else
+		return usage(argv[0]);
+
+	if (argc > 3 && strcmp(argv[3], "drop") == 0)
+		verdict = 0;
+
+	cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
+	if (cg_fd < 0) {
+		printf("Failed to open cgroup path: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY,
+				sizeof(key), sizeof(byte_cnt),
+				256, 0);
+	if (map_fd < 0) {
+		printf("Failed to create map: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	prog_fd = prog_load(map_fd, verdict);
+	printf("Output from kernel verifier:\n%s\n-------\n", bpf_log_buf);
+
+	if (prog_fd < 0) {
+		printf("Failed to load prog: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	ret = bpf_prog_detach(cg_fd, type);
+	printf("bpf_prog_detach() returned '%s' (%d)\n", strerror(errno), errno);
+
+	ret = bpf_prog_attach(prog_fd, cg_fd, type);
+	if (ret < 0) {
+		printf("Failed to attach prog to cgroup: '%s'\n",
+		       strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	while (1) {
+		key = MAP_KEY_PACKETS;
+		assert(bpf_lookup_elem(map_fd, &key, &pkt_cnt) == 0);
+
+		key = MAP_KEY_BYTES;
+		assert(bpf_lookup_elem(map_fd, &key, &byte_cnt) == 0);
+
+		printf("cgroup received %lld packets, %lld bytes\n",
+		       pkt_cnt, byte_cnt);
+		sleep(1);
+	}
+
+	return EXIT_SUCCESS;
+}
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
  2016-08-26 19:58 ` [PATCH v3 2/6] cgroup: add support for eBPF programs Daniel Mack
@ 2016-08-27  0:03   ` Alexei Starovoitov
  2016-09-05 12:47     ` Daniel Mack
  2016-08-29 22:42   ` Daniel Borkmann
  2016-08-29 23:04   ` Sargun Dhillon
  2 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2016-08-27  0:03 UTC (permalink / raw)
  To: Daniel Mack
  Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev, sargun

On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:
> This patch adds two sets of eBPF program pointers to struct cgroup.
> One set for programs that are directly pinned to the cgroup, and one
> for those that are effective for it.
> 
> To illustrate the logic behind that, assume the following example
> cgroup hierarchy.
> 
>   A - B - C
>         \ D - E
> 
> If only B has a program attached, it will be effective for B, C, D
> and E. If D then attaches a program itself, that will be effective for
> both D and E, and the program in B will only affect B and C. Only one
> program of a given type is effective for a cgroup.
> 
> Attaching and detaching programs will be done through the bpf(2)
> syscall. For now, ingress and egress inet socket filtering are the
> only supported use-cases.
> 
> Signed-off-by: Daniel Mack <daniel@zonque.org>
...
> +	css_for_each_descendant_pre(pos, &cgrp->self) {
> +		struct cgroup *desc = container_of(pos, struct cgroup, self);
> +
> +		/* skip the subtree if the descendant has its own program */
> +		if (desc->bpf.prog[type] && desc != cgrp)

is desc != cgrp really needed?
I thought css_for_each_descendant_pre() shouldn't walk itself
or I'm missing how it works.

> +			pos = css_rightmost_descendant(pos);
> +		else
> +			rcu_assign_pointer(desc->bpf.effective[type],
> +					   effective);
> +	}
> +}
> +


* Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-08-26 19:58 ` [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
@ 2016-08-27  0:08   ` Alexei Starovoitov
  2016-09-05 12:56     ` Daniel Mack
  2016-08-29 23:00   ` Daniel Borkmann
  1 sibling, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2016-08-27  0:08 UTC (permalink / raw)
  To: Daniel Mack
  Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev, sargun

On Fri, Aug 26, 2016 at 09:58:49PM +0200, Daniel Mack wrote:
> Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
> BPF_PROG_DETACH which allow attaching and detaching eBPF programs
> to a target.
> 
> On the API level, the target could be anything that has an fd in
> userspace, hence the name of the field in union bpf_attr is called
> 'target_fd'.
> 
> When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
> expected to be a valid file descriptor of a cgroup v2 directory which
> has the bpf controller enabled. These are the only use-cases
> implemented by this patch at this point, but more can be added.
> 
> If a program of the given type already exists in the given cgroup,
> the program is swapped atomically, so userspace does not have to drop
> an existing program first before installing a new one, which would
> otherwise leave a gap in which no program is attached.
> 
> For more information on the propagation logic to subcgroups, please
> refer to the bpf cgroup controller implementation.
> 
> The API is guarded by CAP_NET_ADMIN.
> 
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> syscall
> ---
>  include/uapi/linux/bpf.h |  9 ++++++
>  kernel/bpf/syscall.c     | 83 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 92 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 1d5db42..4cc2dcf 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -73,6 +73,8 @@ enum bpf_cmd {
>  	BPF_PROG_LOAD,
>  	BPF_OBJ_PIN,
>  	BPF_OBJ_GET,
> +	BPF_PROG_ATTACH,
> +	BPF_PROG_DETACH,
>  };
>  
>  enum bpf_map_type {
> @@ -147,6 +149,13 @@ union bpf_attr {
>  		__aligned_u64	pathname;
>  		__u32		bpf_fd;
>  	};
> +
> +	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> +		__u32		target_fd;	/* container object to attach to */
> +		__u32		attach_bpf_fd;	/* eBPF program to attach */
> +		__u32		attach_type;	/* BPF_ATTACH_TYPE_* */
> +		__u64		attach_flags;
> +	};

there is a 4 byte hole in this struct. Can we pack it differently?

>  } __attribute__((aligned(8)));
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 228f962..cc4d603 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -822,6 +822,79 @@ static int bpf_obj_get(const union bpf_attr *attr)
>  	return bpf_obj_get_user(u64_to_ptr(attr->pathname));
>  }
>  
> +#ifdef CONFIG_CGROUP_BPF
> +static int bpf_prog_attach(const union bpf_attr *attr)
> +{
> +	struct bpf_prog *prog;
> +
> +	if (!capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
> +	/* Flags are unused for now */
> +	if (attr->attach_flags != 0)
> +		return -EINVAL;
> +
> +	switch (attr->attach_type) {
> +	case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
> +	case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
> +		struct cgroup *cgrp;
> +
> +		prog = bpf_prog_get_type(attr->attach_bpf_fd,
> +					 BPF_PROG_TYPE_CGROUP_SOCKET_FILTER);
> +		if (IS_ERR(prog))
> +			return PTR_ERR(prog);
> +
> +		cgrp = cgroup_get_from_fd(attr->target_fd);
> +		if (IS_ERR(cgrp)) {
> +			bpf_prog_put(prog);
> +			return PTR_ERR(cgrp);
> +		}
> +
> +		cgroup_bpf_update(cgrp, prog, attr->attach_type);
> +		cgroup_put(cgrp);
> +
> +		break;
> +	}

this } formatting style is confusing. The above } looks
like it matches 'switch () {'.
Maybe move 'struct cgroup *cgrp' to the top to avoid that?


* Re: [PATCH v3 0/6] Add eBPF hooks for cgroups
  2016-08-26 19:58 [PATCH v3 0/6] Add eBPF hooks for cgroups Daniel Mack
                   ` (5 preceding siblings ...)
  2016-08-26 19:58 ` [PATCH v3 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups Daniel Mack
@ 2016-08-27 13:00 ` Rami Rosen
  6 siblings, 0 replies; 35+ messages in thread
From: Rami Rosen @ 2016-08-27 13:00 UTC (permalink / raw)
  To: Daniel Mack
  Cc: htejun, daniel, ast, David Miller, kafai, fw, pablo, harald,
	Netdev, sargun

Hi Daniel,
I don't see the cgroups mailing list address in the Cc list. Since
this patch set is also related to the cgroups subsystem, I would
suggest that, going forward, you also Cc cgroups@vger.kernel.org on
future patches related to cgroups. (I hope this won't exceed the
maximum Cc list length for patches.)

Regards,
Rami Rosen

On 26 August 2016 at 22:58, Daniel Mack <daniel@zonque.org> wrote:
> This is v3 of the patch set to allow eBPF programs for network
> filtering and accounting to be attached to cgroups, so that they apply
> to all sockets of all tasks placed in that cgroup. The logic can
> also be extended to other cgroup-based eBPF use-cases.
>
> I am posting this now with only very few changes from v2 because
> I'll be travelling for a couple of days and won't have access to my
> mails.
>
>
> Changes from v2:
>
> * Fixed the RCU locking details Tejun pointed out.
>
> * Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.
>
>
> Changes from v1:
>
> * Moved all bpf specific cgroup code into its own file, and stub
>   out related functions for !CONFIG_CGROUP_BPF as static inline nops.
>   This way, the call sites are not cluttered with #ifdef guards while
>   the feature remains compile-time configurable.
>
> * Implemented the new scheme proposed by Tejun. Per cgroup, store one
>   set of pointers that are pinned to the cgroup, and one for the
>   programs that are effective. When a program is attached or detached,
>   the change is propagated to all the cgroup's descendants. If a
>   subcgroup has its own pinned program, skip the whole subbranch in
>   order to allow delegation models.
>
> * The hookup for egress packets is now done from __dev_queue_xmit().
>
> * A static key is now used in both the ingress and egress fast paths
>   to keep performance penalties close to zero if the feature is
>   not in use.
>
> * Overall cleanup to make the accessors use the program arrays.
>   This should make it much easier to add new program types, which
>   will then automatically follow the pinned vs. effective logic.
>
> * Fixed locking issues, as pointed out by Eric Dumazet and Alexei
>   Starovoitov. Changes to the program array are now done with
>   xchg() and are protected by cgroup_mutex.
>
> * eBPF programs are now expected to return 1 to let the packet pass,
>   not >= 0. Pointed out by Alexei.
>
> * Operation is now limited to INET sockets, so local AF_UNIX sockets
>   are not affected. The enum members are renamed accordingly. In case
>   other socket families should be supported, this can be extended in
>   the future.
>
> * The sample program learned to support both ingress and egress, and
>   can now optionally make the eBPF program drop packets by making it
>   return 0.
>
>
> As always, feedback is much appreciated.
>
> Thanks,
> Daniel
>
> Daniel Mack (6):
>   bpf: add new prog type for cgroup socket filtering
>   cgroup: add support for eBPF programs
>   bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
>   net: filter: run cgroup eBPF ingress programs
>   net: core: run cgroup eBPF egress programs
>   samples: bpf: add userspace example for attaching eBPF programs to
>     cgroups
>
>  include/linux/bpf-cgroup.h      |  70 +++++++++++++++++
>  include/linux/cgroup-defs.h     |   4 +
>  include/uapi/linux/bpf.h        |  16 ++++
>  init/Kconfig                    |  12 +++
>  kernel/bpf/Makefile             |   1 +
>  kernel/bpf/cgroup.c             | 165 ++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c            |  83 ++++++++++++++++++++
>  kernel/bpf/verifier.c           |   1 +
>  kernel/cgroup.c                 |  18 +++++
>  net/core/dev.c                  |   6 ++
>  net/core/filter.c               |  11 +++
>  samples/bpf/Makefile            |   2 +
>  samples/bpf/libbpf.c            |  23 ++++++
>  samples/bpf/libbpf.h            |   3 +
>  samples/bpf/test_cgrp2_attach.c | 147 +++++++++++++++++++++++++++++++++++
>  15 files changed, 562 insertions(+)
>  create mode 100644 include/linux/bpf-cgroup.h
>  create mode 100644 kernel/bpf/cgroup.c
>  create mode 100644 samples/bpf/test_cgrp2_attach.c
>
> --
> 2.5.5
>


* Re: [PATCH v3 5/6] net: core: run cgroup eBPF egress programs
  2016-08-26 19:58 ` [PATCH v3 5/6] net: core: run cgroup eBPF egress programs Daniel Mack
@ 2016-08-29 22:03   ` Daniel Borkmann
  2016-08-29 22:23     ` Sargun Dhillon
  2016-09-05 14:22     ` Daniel Mack
  0 siblings, 2 replies; 35+ messages in thread
From: Daniel Borkmann @ 2016-08-29 22:03 UTC (permalink / raw)
  To: Daniel Mack, htejun, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 08/26/2016 09:58 PM, Daniel Mack wrote:
> If the cgroup associated with the receiving socket has an eBPF
> programs installed, run them from __dev_queue_xmit().
>
> eBPF programs used in this context are expected to either return 1 to
> let the packet pass, or != 1 to drop them. The programs have access to
> the full skb, including the MAC headers.
>
> Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
> for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
> the feature is unused.
>
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> ---
>   net/core/dev.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a75df86..17484e6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -141,6 +141,7 @@
>   #include <linux/netfilter_ingress.h>
>   #include <linux/sctp.h>
>   #include <linux/crash_dump.h>
> +#include <linux/bpf-cgroup.h>
>
>   #include "net-sysfs.h"
>
> @@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
>   	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
>   		__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
>
> +	rc = cgroup_bpf_run_filter(skb->sk, skb,
> +				   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
> +	if (rc)
> +		return rc;

This would leak the whole skb by the way.

Apart from that, could this be modeled w/o affecting the forwarding path (at some
local output point where we know to have a valid socket)? Then you could also drop
the !sk and sk->sk_family tests, and we wouldn't need to replicate parts of what
clsact is doing as well. Hmm, maybe access to src/dst mac could be handled to be
just zeroes since not available at that point?

>   	/* Disable soft irqs for various locks below. Also
>   	 * stops preemption for RCU.
>   	 */
>


* Re: [PATCH v3 1/6] bpf: add new prog type for cgroup socket filtering
  2016-08-26 19:58 ` [PATCH v3 1/6] bpf: add new prog type for cgroup socket filtering Daniel Mack
@ 2016-08-29 22:14   ` Daniel Borkmann
  2016-09-05 12:48     ` Daniel Mack
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Borkmann @ 2016-08-29 22:14 UTC (permalink / raw)
  To: Daniel Mack, htejun, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 08/26/2016 09:58 PM, Daniel Mack wrote:
> For now, this program type is equivalent to BPF_PROG_TYPE_SOCKET_FILTER in
> terms of checks during the verification process. It may access the skb as
> well.
>
> Programs of this type will be attached to cgroups for network filtering
> and accounting.
>
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> ---
>   include/uapi/linux/bpf.h | 7 +++++++
>   kernel/bpf/verifier.c    | 1 +
>   net/core/filter.c        | 6 ++++++
>   3 files changed, 14 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e4c5a1b..1d5db42 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -95,6 +95,13 @@ enum bpf_prog_type {
>   	BPF_PROG_TYPE_SCHED_ACT,
>   	BPF_PROG_TYPE_TRACEPOINT,
>   	BPF_PROG_TYPE_XDP,
> +	BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
> +};

Nit: can we drop the _FILTER suffix? So just leaving it
at BPF_PROG_TYPE_CGROUP_SOCKET. Some of these use cases
might not always strictly be related to filtering, so
seems cleaner to just leave it out everywhere.

> +
> +enum bpf_attach_type {
> +	BPF_ATTACH_TYPE_CGROUP_INET_INGRESS,
> +	BPF_ATTACH_TYPE_CGROUP_INET_EGRESS,
> +	__MAX_BPF_ATTACH_TYPE
>   };

#define BPF_MAX_ATTACH_TYPE	__MAX_BPF_ATTACH_TYPE

And then use that in your follow-up patches for declaring
arrays, etc?

>
>   #define BPF_PSEUDO_MAP_FD	1
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index abb61f3..12ca880 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1805,6 +1805,7 @@ static bool may_access_skb(enum bpf_prog_type type)
>   	case BPF_PROG_TYPE_SOCKET_FILTER:
>   	case BPF_PROG_TYPE_SCHED_CLS:
>   	case BPF_PROG_TYPE_SCHED_ACT:
> +	case BPF_PROG_TYPE_CGROUP_SOCKET_FILTER:
>   		return true;
>   	default:
>   		return false;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index a83766b..bc04e5c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2848,12 +2848,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly = {
>   	.type	= BPF_PROG_TYPE_XDP,
>   };
>
> +static struct bpf_prog_type_list cg_sk_filter_type __read_mostly = {
> +	.ops	= &sk_filter_ops,
> +	.type	= BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
> +};
> +
>   static int __init register_sk_filter_ops(void)
>   {
>   	bpf_register_prog_type(&sk_filter_type);
>   	bpf_register_prog_type(&sched_cls_type);
>   	bpf_register_prog_type(&sched_act_type);
>   	bpf_register_prog_type(&xdp_type);
> +	bpf_register_prog_type(&cg_sk_filter_type);
>
>   	return 0;
>   }
>


* Re: [PATCH v3 5/6] net: core: run cgroup eBPF egress programs
  2016-08-29 22:03   ` Daniel Borkmann
@ 2016-08-29 22:23     ` Sargun Dhillon
  2016-09-05 14:22     ` Daniel Mack
  1 sibling, 0 replies; 35+ messages in thread
From: Sargun Dhillon @ 2016-08-29 22:23 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Daniel Mack, htejun, ast, davem, kafai, fw, pablo, harald, netdev

On Tue, Aug 30, 2016 at 12:03:23AM +0200, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:
> >If the cgroup associated with the receiving socket has an eBPF
> >programs installed, run them from __dev_queue_xmit().
> >
> >eBPF programs used in this context are expected to either return 1 to
> >let the packet pass, or != 1 to drop them. The programs have access to
> >the full skb, including the MAC headers.
> >
> >Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
> >for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
> >the feature is unused.
> >
> >Signed-off-by: Daniel Mack <daniel@zonque.org>
> >---
> >  net/core/dev.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> >diff --git a/net/core/dev.c b/net/core/dev.c
> >index a75df86..17484e6 100644
> >--- a/net/core/dev.c
> >+++ b/net/core/dev.c
> >@@ -141,6 +141,7 @@
> >  #include <linux/netfilter_ingress.h>
> >  #include <linux/sctp.h>
> >  #include <linux/crash_dump.h>
> >+#include <linux/bpf-cgroup.h>
> >
> >  #include "net-sysfs.h"
> >
> >@@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> >  	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
> >  		__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
> >
> >+	rc = cgroup_bpf_run_filter(skb->sk, skb,
> >+				   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
> >+	if (rc)
> >+		return rc;
> 
> This would leak the whole skb by the way.
> 
> Apart from that, could this be modeled w/o affecting the forwarding path (at some
> local output point where we know to have a valid socket)? Then you could also drop
> the !sk and sk->sk_family tests, and we wouldn't need to replicate parts of what
> clsact is doing as well. Hmm, maybe access to src/dst mac could be handled to be
> just zeroes since not available at that point?
> 
> >  	/* Disable soft irqs for various locks below. Also
> >  	 * stops preemption for RCU.
> >  	 */
> >
Given this patch set only affects AF_INET and AF_INET6, why not put the
hooks in ip_output() and ip6_output()?


* Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
  2016-08-26 19:58 ` [PATCH v3 2/6] cgroup: add support for eBPF programs Daniel Mack
  2016-08-27  0:03   ` Alexei Starovoitov
@ 2016-08-29 22:42   ` Daniel Borkmann
  2016-09-05 12:50     ` Daniel Mack
  2016-08-29 23:04   ` Sargun Dhillon
  2 siblings, 1 reply; 35+ messages in thread
From: Daniel Borkmann @ 2016-08-29 22:42 UTC (permalink / raw)
  To: Daniel Mack, htejun, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 08/26/2016 09:58 PM, Daniel Mack wrote:
> This patch adds two sets of eBPF program pointers to struct cgroup.
> One set for programs that are directly pinned to the cgroup, and one
> for those that are effective for it.
>
> To illustrate the logic behind that, assume the following example
> cgroup hierarchy.
>
>    A - B - C
>          \ D - E
>
> If only B has a program attached, it will be effective for B, C, D
> and E. If D then attaches a program itself, that will be effective for
> both D and E, and the program in B will only affect B and C. Only one
> program of a given type is effective for a cgroup.
>
> Attaching and detaching programs will be done through the bpf(2)
> syscall. For now, ingress and egress inet socket filtering are the
> only supported use-cases.
>
> Signed-off-by: Daniel Mack <daniel@zonque.org>
[...]
> +void __cgroup_bpf_update(struct cgroup *cgrp,
> +			 struct cgroup *parent,
> +			 struct bpf_prog *prog,
> +			 enum bpf_attach_type type)
> +{
> +	struct bpf_prog *old_prog, *effective;
> +	struct cgroup_subsys_state *pos;
> +
> +	old_prog = xchg(cgrp->bpf.prog + type, prog);
> +
> +	if (prog)
> +		static_branch_inc(&cgroup_bpf_enabled_key);
> +
> +	if (old_prog) {
> +		bpf_prog_put(old_prog);
> +		static_branch_dec(&cgroup_bpf_enabled_key);
> +	}
> +
> +	effective = (!prog && parent) ?
> +		rcu_dereference_protected(parent->bpf.effective[type],
> +					  lockdep_is_held(&cgroup_mutex)) :
> +		prog;
> +
> +	css_for_each_descendant_pre(pos, &cgrp->self) {
> +		struct cgroup *desc = container_of(pos, struct cgroup, self);
> +
> +		/* skip the subtree if the descendant has its own program */
> +		if (desc->bpf.prog[type] && desc != cgrp)
> +			pos = css_rightmost_descendant(pos);
> +		else
> +			rcu_assign_pointer(desc->bpf.effective[type],
> +					   effective);
> +	}

Shouldn't the old_prog reference only be released right here at the end
instead of above (otherwise this could race)?

> +}


* Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-08-26 19:58 ` [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
  2016-08-27  0:08   ` Alexei Starovoitov
@ 2016-08-29 23:00   ` Daniel Borkmann
  2016-09-05 12:54     ` Daniel Mack
  1 sibling, 1 reply; 35+ messages in thread
From: Daniel Borkmann @ 2016-08-29 23:00 UTC (permalink / raw)
  To: Daniel Mack, htejun, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 08/26/2016 09:58 PM, Daniel Mack wrote:
> Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
> BPF_PROG_DETACH which allow attaching and detaching eBPF programs
> to a target.
>
> On the API level, the target could be anything that has an fd in
> userspace, hence the name of the field in union bpf_attr is called
> 'target_fd'.
>
> When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
> expected to be a valid file descriptor of a cgroup v2 directory which
> has the bpf controller enabled. These are the only use-cases
> implemented by this patch at this point, but more can be added.
>
> If a program of the given type already exists in the given cgroup,
> the program is swapped atomically, so userspace does not have to drop
> an existing program first before installing a new one, which would
> otherwise leave a gap in which no program is attached.
>
> For more information on the propagation logic to subcgroups, please
> refer to the bpf cgroup controller implementation.
>
> The API is guarded by CAP_NET_ADMIN.
>
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> syscall

^^^ slipped in?

> ---
>   include/uapi/linux/bpf.h |  9 ++++++
>   kernel/bpf/syscall.c     | 83 ++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 92 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 1d5db42..4cc2dcf 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -73,6 +73,8 @@ enum bpf_cmd {
>   	BPF_PROG_LOAD,
>   	BPF_OBJ_PIN,
>   	BPF_OBJ_GET,
> +	BPF_PROG_ATTACH,
> +	BPF_PROG_DETACH,
>   };
>
>   enum bpf_map_type {
> @@ -147,6 +149,13 @@ union bpf_attr {
>   		__aligned_u64	pathname;
>   		__u32		bpf_fd;
>   	};
> +
> +	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> +		__u32		target_fd;	/* container object to attach to */
> +		__u32		attach_bpf_fd;	/* eBPF program to attach */
> +		__u32		attach_type;	/* BPF_ATTACH_TYPE_* */
> +		__u64		attach_flags;

Could we just do ...

__u32 dst_fd;
__u32 src_fd;
__u32 attach_type;

... and leave flags out, since unused anyway? Also see below.

> +	};
>   } __attribute__((aligned(8)));
>
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 228f962..cc4d603 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -822,6 +822,79 @@ static int bpf_obj_get(const union bpf_attr *attr)
>   	return bpf_obj_get_user(u64_to_ptr(attr->pathname));
>   }
>
> +#ifdef CONFIG_CGROUP_BPF

#define BPF_PROG_ATTACH_LAST_FIELD attach_type
#define BPF_PROG_DETACH_LAST_FIELD BPF_PROG_ATTACH_LAST_FIELD

> +static int bpf_prog_attach(const union bpf_attr *attr)
> +{
> +	struct bpf_prog *prog;
> +
> +	if (!capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
> +	/* Flags are unused for now */
> +	if (attr->attach_flags != 0)
> +		return -EINVAL;

Correct way would be to:

	if (CHECK_ATTR(BPF_PROG_ATTACH))
		return -EINVAL;

> +
> +	switch (attr->attach_type) {
> +	case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
> +	case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
> +		struct cgroup *cgrp;
> +
> +		prog = bpf_prog_get_type(attr->attach_bpf_fd,
> +					 BPF_PROG_TYPE_CGROUP_SOCKET_FILTER);
> +		if (IS_ERR(prog))
> +			return PTR_ERR(prog);
> +
> +		cgrp = cgroup_get_from_fd(attr->target_fd);
> +		if (IS_ERR(cgrp)) {
> +			bpf_prog_put(prog);
> +			return PTR_ERR(cgrp);
> +		}
> +
> +		cgroup_bpf_update(cgrp, prog, attr->attach_type);
> +		cgroup_put(cgrp);
> +
> +		break;
> +	}
> +
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int bpf_prog_detach(const union bpf_attr *attr)
> +{
> +	if (!capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
> +	/* Flags are unused for now */
> +	if (attr->attach_flags != 0)
> +		return -EINVAL;

	if (CHECK_ATTR(BPF_PROG_DETACH))
		return -EINVAL;

> +	switch (attr->attach_type) {
> +	case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
> +	case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
> +		struct cgroup *cgrp;
> +
> +		cgrp = cgroup_get_from_fd(attr->target_fd);
> +		if (IS_ERR(cgrp))
> +			return PTR_ERR(cgrp);
> +
> +		cgroup_bpf_update(cgrp, NULL, attr->attach_type);
> +		cgroup_put(cgrp);
> +
> +		break;
> +	}
> +
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +#endif /* CONFIG_CGROUP_BPF */
> +
>   SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
>   {
>   	union bpf_attr attr = {};
> @@ -888,6 +961,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>   	case BPF_OBJ_GET:
>   		err = bpf_obj_get(&attr);
>   		break;
> +
> +#ifdef CONFIG_CGROUP_BPF
> +	case BPF_PROG_ATTACH:
> +		err = bpf_prog_attach(&attr);
> +		break;
> +	case BPF_PROG_DETACH:
> +		err = bpf_prog_detach(&attr);
> +		break;
> +#endif
> +
>   	default:
>   		err = -EINVAL;
>   		break;
>


* Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
  2016-08-26 19:58 ` [PATCH v3 2/6] cgroup: add support for eBPF programs Daniel Mack
  2016-08-27  0:03   ` Alexei Starovoitov
  2016-08-29 22:42   ` Daniel Borkmann
@ 2016-08-29 23:04   ` Sargun Dhillon
  2016-09-05 14:49     ` Daniel Mack
  2 siblings, 1 reply; 35+ messages in thread
From: Sargun Dhillon @ 2016-08-29 23:04 UTC (permalink / raw)
  To: Daniel Mack; +Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev

On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:
> This patch adds two sets of eBPF program pointers to struct cgroup.
> One set for programs that are directly pinned to the cgroup, and one
> for those that are effective for it.
> 
> To illustrate the logic behind that, assume the following example
> cgroup hierarchy.
> 
>   A - B - C
>         \ D - E
> 
> If only B has a program attached, it will be effective for B, C, D
> and E. If D then attaches a program itself, that will be effective for
> both D and E, and the program in B will only affect B and C. Only one
> program of a given type is effective for a cgroup.
> 
How does this work when running an orchestrator within an orchestrator? The 
Docker in Docker / Mesos in Mesos use case, where the top-level orchestrator is 
observing the traffic, and there is an orchestrator within that also needs to 
run it.

In this case, I'd like to run E's filter, then if it returns 0, D's, and B's, 
and so on. Is it possible to allow this, either by flattening out the 
data structure (copying a ref to the bpf programs to C and E) or something similar?


> Attaching and detaching programs will be done through the bpf(2)
> syscall. For now, ingress and egress inet socket filtering are the
> only supported use-cases.
> 
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> ---
>  include/linux/bpf-cgroup.h  |  70 +++++++++++++++++++
>  include/linux/cgroup-defs.h |   4 ++
>  init/Kconfig                |  12 ++++
>  kernel/bpf/Makefile         |   1 +
>  kernel/bpf/cgroup.c         | 165 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/cgroup.c             |  18 +++++
>  6 files changed, 270 insertions(+)
>  create mode 100644 include/linux/bpf-cgroup.h
>  create mode 100644 kernel/bpf/cgroup.c
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> new file mode 100644
> index 0000000..a5a25c1
> --- /dev/null
> +++ b/include/linux/bpf-cgroup.h
> @@ -0,0 +1,70 @@
> +#ifndef _BPF_CGROUP_H
> +#define _BPF_CGROUP_H
> +
> +#include <linux/bpf.h>
> +#include <uapi/linux/bpf.h>
> +
> +struct sock;
> +struct cgroup;
> +struct sk_buff;
> +
> +#ifdef CONFIG_CGROUP_BPF
> +
> +extern struct static_key_false cgroup_bpf_enabled_key;
> +#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
> +
> +struct cgroup_bpf {
> +	/*
> +	 * Store two sets of bpf_prog pointers, one for programs that are
> +	 * pinned directly to this cgroup, and one for those that are effective
> +	 * when this cgroup is accessed.
> +	 */
> +	struct bpf_prog *prog[__MAX_BPF_ATTACH_TYPE];
> +	struct bpf_prog *effective[__MAX_BPF_ATTACH_TYPE];
> +};
> +
> +void cgroup_bpf_put(struct cgroup *cgrp);
> +void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
> +
> +void __cgroup_bpf_update(struct cgroup *cgrp,
> +			 struct cgroup *parent,
> +			 struct bpf_prog *prog,
> +			 enum bpf_attach_type type);
> +
> +/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
> +void cgroup_bpf_update(struct cgroup *cgrp,
> +		       struct bpf_prog *prog,
> +		       enum bpf_attach_type type);
> +
> +int __cgroup_bpf_run_filter(struct sock *sk,
> +			    struct sk_buff *skb,
> +			    enum bpf_attach_type type);
> +
> +/* Wrapper for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled */
> +static inline int cgroup_bpf_run_filter(struct sock *sk,
> +					struct sk_buff *skb,
> +					enum bpf_attach_type type)
> +{
> +	if (cgroup_bpf_enabled)
> +		return __cgroup_bpf_run_filter(sk, skb, type);
> +
> +	return 0;
> +}
> +
> +#else
> +
> +struct cgroup_bpf {};
> +static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
> +static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
> +				      struct cgroup *parent) {}
> +
> +static inline int cgroup_bpf_run_filter(struct sock *sk,
> +					struct sk_buff *skb,
> +					enum bpf_attach_type type)
> +{
> +	return 0;
> +}
> +
> +#endif /* CONFIG_CGROUP_BPF */
> +
> +#endif /* _BPF_CGROUP_H */
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 5b17de6..861b467 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -16,6 +16,7 @@
>  #include <linux/percpu-refcount.h>
>  #include <linux/percpu-rwsem.h>
>  #include <linux/workqueue.h>
> +#include <linux/bpf-cgroup.h>
>  
>  #ifdef CONFIG_CGROUPS
>  
> @@ -300,6 +301,9 @@ struct cgroup {
>  	/* used to schedule release agent */
>  	struct work_struct release_agent_work;
>  
> +	/* used to store eBPF programs */
> +	struct cgroup_bpf bpf;
> +
>  	/* ids of the ancestors at each level including self */
>  	int ancestor_ids[];
>  };
> diff --git a/init/Kconfig b/init/Kconfig
> index cac3f09..5a89c83 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1144,6 +1144,18 @@ config CGROUP_PERF
>  
>  	  Say N if unsure.
>  
> +config CGROUP_BPF
> +	bool "Support for eBPF programs attached to cgroups"
> +	depends on BPF_SYSCALL && SOCK_CGROUP_DATA
> +	help
> +	  Allow attaching eBPF programs to a cgroup using the bpf(2)
> +	  syscall command BPF_PROG_ATTACH.
> +
> +	  The context in which these programs are invoked depends on the
> +	  type of attachment. For instance, programs that are attached using
> +	  BPF_ATTACH_TYPE_CGROUP_INET_INGRESS will be executed on the
> +	  ingress path of inet sockets.
> +
>  config CGROUP_DEBUG
>  	bool "Example controller"
>  	default n
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index eed911d..b22256b 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -5,3 +5,4 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o
>  ifeq ($(CONFIG_PERF_EVENTS),y)
>  obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
>  endif
> +obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> new file mode 100644
> index 0000000..4d64923
> --- /dev/null
> +++ b/kernel/bpf/cgroup.c
> @@ -0,0 +1,165 @@
> +/*
> + * Functions to manage eBPF programs attached to cgroups
> + *
> + * Copyright (c) 2016 Daniel Mack
> + *
> + * This file is subject to the terms and conditions of version 2 of the GNU
> + * General Public License.  See the file COPYING in the main directory of the
> + * Linux distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/atomic.h>
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +#include <linux/bpf.h>
> +#include <linux/bpf-cgroup.h>
> +#include <net/sock.h>
> +
> +DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
> +EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> +
> +/**
> + * cgroup_bpf_put() - put references of all bpf programs
> + * @cgrp: the cgroup to modify
> + */
> +void cgroup_bpf_put(struct cgroup *cgrp)
> +{
> +	unsigned int type;
> +
> +	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.prog); type++) {
> +		struct bpf_prog *prog = cgrp->bpf.prog[type];
> +
> +		if (prog) {
> +			bpf_prog_put(prog);
> +			static_branch_dec(&cgroup_bpf_enabled_key);
> +		}
> +	}
> +}
> +
> +/**
> + * cgroup_bpf_inherit() - inherit effective programs from parent
> + * @cgrp: the cgroup to modify
> + * @parent: the parent to inherit from
> + */
> +void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
> +{
> +	unsigned int type;
> +
> +	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.effective); type++) {
> +		struct bpf_prog *e;
> +
> +		e = rcu_dereference_protected(parent->bpf.effective[type],
> +					      lockdep_is_held(&cgroup_mutex));
> +		rcu_assign_pointer(cgrp->bpf.effective[type], e);
> +	}
> +}
> +
> +/**
> + * __cgroup_bpf_update() - Update the pinned program of a cgroup, and
> + *                         propagate the change to descendants
> + * @cgrp: The cgroup whose descendants to traverse
> + * @parent: The parent of @cgrp, or %NULL if @cgrp is the root
> + * @prog: A new program to pin, or %NULL to release the current one
> + * @type: Type of pinning operation (ingress/egress)
> + *
> + * Each cgroup keeps two sets of bpf_prog pointers, one for programs that
> + * are pinned to it directly, and one for programs that are effective for it.
> + *
> + * If @prog is non-%NULL, this function attaches the new program to the
> + * cgroup and releases the one that is currently attached, if any. @prog
> + * then becomes the effective program of type @type in that cgroup.
> + *
> + * If @prog is %NULL, the currently attached program of type @type is
> + * released, and the effective program of the parent cgroup (if any) is
> + * inherited by @cgrp.
> + *
> + * Then, the descendants of @cgrp are walked and the effective program for
> + * each of them is set to the effective program of @cgrp unless the
> + * descendant has its own program attached, in which case the subbranch is
> + * skipped. This ensures that delegated subcgroups with own programs are left
> + * untouched.
> + *
> + * Must be called with cgroup_mutex held.
> + */
> +void __cgroup_bpf_update(struct cgroup *cgrp,
> +			 struct cgroup *parent,
> +			 struct bpf_prog *prog,
> +			 enum bpf_attach_type type)
> +{
> +	struct bpf_prog *old_prog, *effective;
> +	struct cgroup_subsys_state *pos;
> +
> +	old_prog = xchg(cgrp->bpf.prog + type, prog);
> +
> +	if (prog)
> +		static_branch_inc(&cgroup_bpf_enabled_key);
> +
> +	if (old_prog) {
> +		bpf_prog_put(old_prog);
> +		static_branch_dec(&cgroup_bpf_enabled_key);
> +	}
> +
> +	effective = (!prog && parent) ?
> +		rcu_dereference_protected(parent->bpf.effective[type],
> +					  lockdep_is_held(&cgroup_mutex)) :
> +		prog;
> +
> +	css_for_each_descendant_pre(pos, &cgrp->self) {
> +		struct cgroup *desc = container_of(pos, struct cgroup, self);
> +
> +		/* skip the subtree if the descendant has its own program */
> +		if (desc->bpf.prog[type] && desc != cgrp)
> +			pos = css_rightmost_descendant(pos);
> +		else
> +			rcu_assign_pointer(desc->bpf.effective[type],
> +					   effective);
> +	}
> +}
> +
> +/**
> + * __cgroup_bpf_run_filter() - Run a program for packet filtering
> + * @sk: The socket sending or receiving traffic
> + * @skb: The skb that is being sent or received
> + * @type: The type of program to be executed
> + *
> + * If no socket is passed, or the socket is not of type INET or INET6,
> + * this function does nothing and returns 0.
> + *
> + * The program type passed in via @type must be suitable for network
> + * filtering. No further check is performed to assert that.
> + *
> + * This function will return %-EPERM if an attached program was found
> + * and it returned != 1 during execution. In all other cases, 0 is returned.
> + */
> +int __cgroup_bpf_run_filter(struct sock *sk,
> +			    struct sk_buff *skb,
> +			    enum bpf_attach_type type)
> +{
> +	struct bpf_prog *prog;
> +	struct cgroup *cgrp;
> +	int ret = 0;
> +
> +	if (!sk)
> +		return 0;
> +
> +	if (sk->sk_family != AF_INET &&
> +	    sk->sk_family != AF_INET6)
> +		return 0;
> +
> +	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> +
> +	rcu_read_lock();
> +
> +	prog = rcu_dereference(cgrp->bpf.effective[type]);
> +	if (prog) {
> +		unsigned int offset = skb->data - skb_mac_header(skb);
> +
> +		__skb_push(skb, offset);
> +		ret = bpf_prog_run_clear_cb(prog, skb) == 1 ? 0 : -EPERM;
> +		__skb_pull(skb, offset);
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index d1c51b7..57ade89 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -5038,6 +5038,8 @@ static void css_release_work_fn(struct work_struct *work)
>  		if (cgrp->kn)
>  			RCU_INIT_POINTER(*(void __rcu __force **)&cgrp->kn->priv,
>  					 NULL);
> +
> +		cgroup_bpf_put(cgrp);
>  	}
>  
>  	mutex_unlock(&cgroup_mutex);
> @@ -5245,6 +5247,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
>  	if (!cgroup_on_dfl(cgrp))
>  		cgrp->subtree_control = cgroup_control(cgrp);
>  
> +	if (parent)
> +		cgroup_bpf_inherit(cgrp, parent);
> +
>  	cgroup_propagate_control(cgrp);
>  
>  	/* @cgrp doesn't have dir yet so the following will only create csses */
> @@ -6417,6 +6422,19 @@ static __init int cgroup_namespaces_init(void)
>  }
>  subsys_initcall(cgroup_namespaces_init);
>  
> +#ifdef CONFIG_CGROUP_BPF
> +void cgroup_bpf_update(struct cgroup *cgrp,
> +		       struct bpf_prog *prog,
> +		       enum bpf_attach_type type)
> +{
> +	struct cgroup *parent = cgroup_parent(cgrp);
> +
> +	mutex_lock(&cgroup_mutex);
> +	__cgroup_bpf_update(cgrp, parent, prog, type);
> +	mutex_unlock(&cgroup_mutex);
> +}
> +#endif /* CONFIG_CGROUP_BPF */
> +
>  #ifdef CONFIG_CGROUP_DEBUG
>  static struct cgroup_subsys_state *
>  debug_css_alloc(struct cgroup_subsys_state *parent_css)
> -- 
> 2.5.5
> 


* Re: [PATCH v3 4/6] net: filter: run cgroup eBPF ingress programs
  2016-08-26 19:58 ` [PATCH v3 4/6] net: filter: run cgroup eBPF ingress programs Daniel Mack
@ 2016-08-29 23:15   ` Daniel Borkmann
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Borkmann @ 2016-08-29 23:15 UTC (permalink / raw)
  To: Daniel Mack, htejun, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 08/26/2016 09:58 PM, Daniel Mack wrote:
> If the cgroup associated with the receiving socket has an eBPF
> programs installed, run them from sk_filter_trim_cap().
>
> eBPF programs used in this context are expected to either return 1 to
> let the packet pass, or != 1 to drop them. The programs have access to
> the full skb, including the MAC headers.
>
> Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
> for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
> the feature is unused.
>
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> ---
>   net/core/filter.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index bc04e5c..163f75b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -78,6 +78,11 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
>   	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
>   		return -ENOMEM;
>
> +	err = cgroup_bpf_run_filter(sk, skb,
> +				    BPF_ATTACH_TYPE_CGROUP_INET_INGRESS);

Maybe just BPF_CGROUP_INET_{IN,E}GRESS (seems less cluttered, and we know
these were set via bpf(2) as attach_type anyway)?

> +	if (err)
> +		return err;
> +
>   	err = security_sock_rcv_skb(sk, skb);
>   	if (err)
>   		return err;
>


* Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
  2016-08-27  0:03   ` Alexei Starovoitov
@ 2016-09-05 12:47     ` Daniel Mack
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Mack @ 2016-09-05 12:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev, sargun

Hi Alexei,

On 08/27/2016 02:03 AM, Alexei Starovoitov wrote:
> On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:
>> This patch adds two sets of eBPF program pointers to struct cgroup.
>> One set for programs that are directly pinned to a cgroup, and one for
>> those that are effective for it.
>>
>> To illustrate the logic behind that, assume the following example
>> cgroup hierarchy.
>>
>>   A - B - C
>>         \ D - E
>>
>> If only B has a program attached, it will be effective for B, C, D
>> and E. If D then attaches a program itself, that will be effective for
>> both D and E, and the program in B will only affect B and C. Only one
>> program of a given type is effective for a cgroup.
>>
>> Attaching and detaching programs will be done through the bpf(2)
>> syscall. For now, ingress and egress inet socket filtering are the
>> only supported use-cases.
>>
>> Signed-off-by: Daniel Mack <daniel@zonque.org>
> ...
>> +	css_for_each_descendant_pre(pos, &cgrp->self) {
>> +		struct cgroup *desc = container_of(pos, struct cgroup, self);
>> +
>> +		/* skip the subtree if the descendant has its own program */
>> +		if (desc->bpf.prog[type] && desc != cgrp)
> 
> is desc != cgrp really needed?
> I thought css_for_each_descendant_pre() shouldn't walk itself
> or I'm missing how it works.

Hmm, no - that check is necessary in my tests, and also according to the
documentation:

/**
 * css_for_each_descendant_pre - pre-order walk of a css's descendants
 * @pos: the css * to use as the loop cursor
 * @root: css whose descendants to walk
 *
 * Walk @root's descendants.  @root is included in the iteration and the
 * first node to be visited.  Must be called under rcu_read_lock().
 *


Daniel


* Re: [PATCH v3 1/6] bpf: add new prog type for cgroup socket filtering
  2016-08-29 22:14   ` Daniel Borkmann
@ 2016-09-05 12:48     ` Daniel Mack
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Mack @ 2016-09-05 12:48 UTC (permalink / raw)
  To: Daniel Borkmann, htejun, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 08/30/2016 12:14 AM, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:
>> For now, this program type is equivalent to BPF_PROG_TYPE_SOCKET_FILTER in
>> terms of checks during the verification process. It may access the skb as
>> well.
>>
>> Programs of this type will be attached to cgroups for network filtering
>> and accounting.
>>
>> Signed-off-by: Daniel Mack <daniel@zonque.org>
>> ---
>>   include/uapi/linux/bpf.h | 7 +++++++
>>   kernel/bpf/verifier.c    | 1 +
>>   net/core/filter.c        | 6 ++++++
>>   3 files changed, 14 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index e4c5a1b..1d5db42 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -95,6 +95,13 @@ enum bpf_prog_type {
>>   	BPF_PROG_TYPE_SCHED_ACT,
>>   	BPF_PROG_TYPE_TRACEPOINT,
>>   	BPF_PROG_TYPE_XDP,
>> +	BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
>> +};
> 
> Nit: can we drop the _FILTER suffix? So just leaving it
> at BPF_PROG_TYPE_CGROUP_SOCKET. Some of these use cases
> might not always strictly be related to filtering, so
> seems cleaner to just leave it out everywhere.
> 
>> +
>> +enum bpf_attach_type {
>> +	BPF_ATTACH_TYPE_CGROUP_INET_INGRESS,
>> +	BPF_ATTACH_TYPE_CGROUP_INET_EGRESS,
>> +	__MAX_BPF_ATTACH_TYPE
>>   };
> 
> #define BPF_MAX_ATTACH_TYPE	__BPF_MAX_ATTACH_TYPE
> 
> And then use that in your follow-up patches for declaring
> arrays, etc?

Agreed, will change.


Thanks,
Daniel


* Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
  2016-08-29 22:42   ` Daniel Borkmann
@ 2016-09-05 12:50     ` Daniel Mack
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Mack @ 2016-09-05 12:50 UTC (permalink / raw)
  To: Daniel Borkmann, htejun, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 08/30/2016 12:42 AM, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:
>> This patch adds two sets of eBPF program pointers to struct cgroup.
>> One set for programs that are directly pinned to a cgroup, and one for
>> those that are effective for it.
>>
>> To illustrate the logic behind that, assume the following example
>> cgroup hierarchy.
>>
>>    A - B - C
>>          \ D - E
>>
>> If only B has a program attached, it will be effective for B, C, D
>> and E. If D then attaches a program itself, that will be effective for
>> both D and E, and the program in B will only affect B and C. Only one
>> program of a given type is effective for a cgroup.
>>
>> Attaching and detaching programs will be done through the bpf(2)
>> syscall. For now, ingress and egress inet socket filtering are the
>> only supported use-cases.
>>
>> Signed-off-by: Daniel Mack <daniel@zonque.org>
> [...]
>> +void __cgroup_bpf_update(struct cgroup *cgrp,
>> +			 struct cgroup *parent,
>> +			 struct bpf_prog *prog,
>> +			 enum bpf_attach_type type)
>> +{
>> +	struct bpf_prog *old_prog, *effective;
>> +	struct cgroup_subsys_state *pos;
>> +
>> +	old_prog = xchg(cgrp->bpf.prog + type, prog);
>> +
>> +	if (prog)
>> +		static_branch_inc(&cgroup_bpf_enabled_key);
>> +
>> +	if (old_prog) {
>> +		bpf_prog_put(old_prog);
>> +		static_branch_dec(&cgroup_bpf_enabled_key);
>> +	}
>> +
>> +	effective = (!prog && parent) ?
>> +		rcu_dereference_protected(parent->bpf.effective[type],
>> +					  lockdep_is_held(&cgroup_mutex)) :
>> +		prog;
>> +
>> +	css_for_each_descendant_pre(pos, &cgrp->self) {
>> +		struct cgroup *desc = container_of(pos, struct cgroup, self);
>> +
>> +		/* skip the subtree if the descendant has its own program */
>> +		if (desc->bpf.prog[type] && desc != cgrp)
>> +			pos = css_rightmost_descendant(pos);
>> +		else
>> +			rcu_assign_pointer(desc->bpf.effective[type],
>> +					   effective);
>> +	}
> 
> Shouldn't the old_prog reference only be released right here at the end
> instead of above (otherwise this could race)?

Yes, that's right. Will change as well. Thanks for spotting!


Daniel


* Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-08-29 23:00   ` Daniel Borkmann
@ 2016-09-05 12:54     ` Daniel Mack
  2016-09-05 13:56       ` Daniel Borkmann
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Mack @ 2016-09-05 12:54 UTC (permalink / raw)
  To: Daniel Borkmann, htejun, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 08/30/2016 01:00 AM, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:

>>   enum bpf_map_type {
>> @@ -147,6 +149,13 @@ union bpf_attr {
>>   		__aligned_u64	pathname;
>>   		__u32		bpf_fd;
>>   	};
>> +
>> +	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>> +		__u32		target_fd;	/* container object to attach to */
>> +		__u32		attach_bpf_fd;	/* eBPF program to attach */
>> +		__u32		attach_type;	/* BPF_ATTACH_TYPE_* */
>> +		__u64		attach_flags;
> 
> Could we just do ...
> 
> __u32 dst_fd;
> __u32 src_fd;
> __u32 attach_type;
> 
> ... and leave flags out, since unused anyway? Also see below.

I'd really like to keep the flags, even if they're unused right now.
This only adds 8 bytes during the syscall operation, so it doesn't harm.
However, we cannot change the userspace API after the fact, and who
knows what this (rather generic) interface will be used for later on.

> #define BPF_PROG_ATTACH_LAST_FIELD attach_type
> #define BPF_PROG_DETACH_LAST_FIELD BPF_PROG_ATTACH_LAST_FIELD

...

> 
> Correct way would be to:
> 
> 	if (CHECK_ATTR(BPF_PROG_ATTACH))
> 		return -EINVAL;

Will add - thanks!


Daniel


* Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-08-27  0:08   ` Alexei Starovoitov
@ 2016-09-05 12:56     ` Daniel Mack
  2016-09-05 15:30       ` David Laight
  2016-09-05 17:29       ` Joe Perches
  0 siblings, 2 replies; 35+ messages in thread
From: Daniel Mack @ 2016-09-05 12:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev, sargun

On 08/27/2016 02:08 AM, Alexei Starovoitov wrote:
> On Fri, Aug 26, 2016 at 09:58:49PM +0200, Daniel Mack wrote:

>> +
>> +	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>> +		__u32		target_fd;	/* container object to attach to */
>> +		__u32		attach_bpf_fd;	/* eBPF program to attach */
>> +		__u32		attach_type;	/* BPF_ATTACH_TYPE_* */
>> +		__u64		attach_flags;
>> +	};
> 
> there is a 4 byte hole in this struct. Can we pack it differently?

Okay - I swapped "type" and "flags" to repair this.

>> +	switch (attr->attach_type) {
>> +	case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
>> +	case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
>> +		struct cgroup *cgrp;
>> +
>> +		prog = bpf_prog_get_type(attr->attach_bpf_fd,
>> +					 BPF_PROG_TYPE_CGROUP_SOCKET_FILTER);
>> +		if (IS_ERR(prog))
>> +			return PTR_ERR(prog);
>> +
>> +		cgrp = cgroup_get_from_fd(attr->target_fd);
>> +		if (IS_ERR(cgrp)) {
>> +			bpf_prog_put(prog);
>> +			return PTR_ERR(cgrp);
>> +		}
>> +
>> +		cgroup_bpf_update(cgrp, prog, attr->attach_type);
>> +		cgroup_put(cgrp);
>> +
>> +		break;
>> +	}
> 
> this } formatting style is confusing. The above } looks
> like it matches 'switch () {'.
> May be move 'struct cgroup *cgrp' to the top to avoid that?

I kept it local to its users, but you're right, it's not worth it. Will
change.


Thanks,
Daniel


* Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-09-05 12:54     ` Daniel Mack
@ 2016-09-05 13:56       ` Daniel Borkmann
  2016-09-05 14:09         ` Daniel Mack
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Borkmann @ 2016-09-05 13:56 UTC (permalink / raw)
  To: Daniel Mack, htejun, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 09/05/2016 02:54 PM, Daniel Mack wrote:
> On 08/30/2016 01:00 AM, Daniel Borkmann wrote:
>> On 08/26/2016 09:58 PM, Daniel Mack wrote:
>
>>>    enum bpf_map_type {
>>> @@ -147,6 +149,13 @@ union bpf_attr {
>>>    		__aligned_u64	pathname;
>>>    		__u32		bpf_fd;
>>>    	};
>>> +
>>> +	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>>> +		__u32		target_fd;	/* container object to attach to */
>>> +		__u32		attach_bpf_fd;	/* eBPF program to attach */
>>> +		__u32		attach_type;	/* BPF_ATTACH_TYPE_* */
>>> +		__u64		attach_flags;
>>
>> Could we just do ...
>>
>> __u32 dst_fd;
>> __u32 src_fd;
>> __u32 attach_type;
>>
>> ... and leave flags out, since unused anyway? Also see below.
>
> I'd really like to keep the flags, even if they're unused right now.
> This only adds 8 bytes during the syscall operation, so it doesn't harm.
> However, we cannot change the userspace API after the fact, and who
> knows what this (rather generic) interface will be used for later on.

With the below suggestion added, flags don't need to be added
currently, as they can be added safely at a later point in time
while respecting old binaries. See also the syscall handling code
in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The
underlying idea of this was taken from perf_event_open() syscall
back then, see [1] for a summary.

   [1] https://lkml.org/lkml/2014/8/26/116

>> #define BPF_PROG_ATTACH_LAST_FIELD attach_type
>> #define BPF_PROG_DETACH_LAST_FIELD BPF_PROG_ATTACH_LAST_FIELD
>
> ...
>
>>
>> Correct way would be to:
>>
>> 	if (CHECK_ATTR(BPF_PROG_ATTACH))
>> 		return -EINVAL;
>
> Will add - thanks!
>
>
> Daniel
>


* Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-09-05 13:56       ` Daniel Borkmann
@ 2016-09-05 14:09         ` Daniel Mack
  2016-09-05 17:09           ` Daniel Borkmann
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Mack @ 2016-09-05 14:09 UTC (permalink / raw)
  To: Daniel Borkmann, htejun, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 09/05/2016 03:56 PM, Daniel Borkmann wrote:
> On 09/05/2016 02:54 PM, Daniel Mack wrote:
>> On 08/30/2016 01:00 AM, Daniel Borkmann wrote:
>>> On 08/26/2016 09:58 PM, Daniel Mack wrote:
>>
>>>>    enum bpf_map_type {
>>>> @@ -147,6 +149,13 @@ union bpf_attr {
>>>>    		__aligned_u64	pathname;
>>>>    		__u32		bpf_fd;
>>>>    	};
>>>> +
>>>> +	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>>>> +		__u32		target_fd;	/* container object to attach to */
>>>> +		__u32		attach_bpf_fd;	/* eBPF program to attach */
>>>> +		__u32		attach_type;	/* BPF_ATTACH_TYPE_* */
>>>> +		__u64		attach_flags;
>>>
>>> Could we just do ...
>>>
>>> __u32 dst_fd;
>>> __u32 src_fd;
>>> __u32 attach_type;
>>>
>>> ... and leave flags out, since unused anyway? Also see below.
>>
>> I'd really like to keep the flags, even if they're unused right now.
>> This only adds 8 bytes during the syscall operation, so it doesn't harm.
>> However, we cannot change the userspace API after the fact, and who
>> knows what this (rather generic) interface will be used for later on.
> 
> With the below suggestion added, then flags doesn't need to be
> added currently as it can be done safely at a later point in time
> with respecting old binaries. See also the syscall handling code
> in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The
> underlying idea of this was taken from perf_event_open() syscall
> back then, see [1] for a summary.
> 
>    [1] https://lkml.org/lkml/2014/8/26/116

Yes, I know that's possible, and I like the idea, but I don't think any
new interface should come without flags really, as flags are something
that will most certainly be needed at some point anyway. I didn't have
them in my first shot, but Alexei pointed out that they should be added,
and I agree.

Also, this optimization wouldn't make the transported struct payload any
smaller anyway, because the member of that union used by BPF_PROG_LOAD
is still by far the biggest.

I really don't think it's worth sparing 8 bytes here and then doing the
binary compat dance after flags are added, for no real gain.



Thanks,
Daniel


* Re: [PATCH v3 5/6] net: core: run cgroup eBPF egress programs
  2016-08-29 22:03   ` Daniel Borkmann
  2016-08-29 22:23     ` Sargun Dhillon
@ 2016-09-05 14:22     ` Daniel Mack
  2016-09-06 17:14       ` Daniel Borkmann
  1 sibling, 1 reply; 35+ messages in thread
From: Daniel Mack @ 2016-09-05 14:22 UTC (permalink / raw)
  To: Daniel Borkmann, htejun, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 08/30/2016 12:03 AM, Daniel Borkmann wrote:
> On 08/26/2016 09:58 PM, Daniel Mack wrote:

>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index a75df86..17484e6 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -141,6 +141,7 @@
>>   #include <linux/netfilter_ingress.h>
>>   #include <linux/sctp.h>
>>   #include <linux/crash_dump.h>
>> +#include <linux/bpf-cgroup.h>
>>
>>   #include "net-sysfs.h"
>>
>> @@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
>>   	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
>>   		__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
>>
>> +	rc = cgroup_bpf_run_filter(skb->sk, skb,
>> +				   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
>> +	if (rc)
>> +		return rc;
> 
> This would leak the whole skb by the way.

Ah, right.

> Apart from that, could this be modeled w/o affecting the forwarding path (at some
> local output point where we know to have a valid socket)? Then you could also drop
> the !sk and sk->sk_family tests, and we wouldn't need to replicate parts of what
> clsact is doing as well. Hmm, maybe access to src/dst mac could be handled to be
> just zeroes since not available at that point?

Hmm, I wonder where this hook could be put instead then. When placed in
ip_output() and ip6_output(), the mac headers cannot be pushed before
running the program, resulting in bogus skb data from the eBPF program.

Also, if I read the code correctly, ip[6]_output is not called for
multicast packets.

Any other ideas?


Thanks,
Daniel


* Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
  2016-08-29 23:04   ` Sargun Dhillon
@ 2016-09-05 14:49     ` Daniel Mack
  2016-09-05 21:40       ` Sargun Dhillon
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Mack @ 2016-09-05 14:49 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev

Hi,

On 08/30/2016 01:04 AM, Sargun Dhillon wrote:
> On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:
>> This patch adds two sets of eBPF program pointers to struct cgroup.
>> One set for programs that are directly pinned to a cgroup, and one for
>> those that are effective for it.
>>
>> To illustrate the logic behind that, assume the following example
>> cgroup hierarchy.
>>
>>   A - B - C
>>         \ D - E
>>
>> If only B has a program attached, it will be effective for B, C, D
>> and E. If D then attaches a program itself, that will be effective for
>> both D and E, and the program in B will only affect B and C. Only one
>> program of a given type is effective for a cgroup.
>>
> How does this work when running an orchestrator within an orchestrator? The 
> Docker in Docker / Mesos in Mesos use case, where the top-level orchestrator is 
> observing the traffic, and there is an orchestrator within that also needs to 
> run it.
> 
> In this case, I'd like to run E's filter, then if it returns 0, D's, and B's, 
> and so on.

Running multiple programs was an idea I had in one of my earlier drafts,
but after some discussion, I refrained from it again because potentially
walking the cgroup hierarchy on every packet is just too expensive.

> Is it possible to allow this, either by flattening out the
> datastructure (copy a ref to the bpf programs to C and E) or
> something similar?

That would mean we carry a list of eBPF program pointers of dynamic
size. IOW, the deeper inside the cgroup hierarchy, the bigger the list,
so it can store a reference to all programs of all of its ancestors.

While I think that would be possible, even at some later point, I'd
really like to avoid it for the sake of simplicity.

Is there any reason why this can't be done in userspace? Compile a
program X for A, and overload it with Y, with Y doing the same as X
but adding some extra checks? Note that all users of the bpf(2) syscall
API will need CAP_NET_ADMIN anyway, so there is no delegation to
unprivileged sub-orchestrators or anything like that, really.


Thanks,
Daniel


* RE: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-09-05 12:56     ` Daniel Mack
@ 2016-09-05 15:30       ` David Laight
  2016-09-05 15:40         ` Daniel Mack
  2016-09-05 17:29       ` Joe Perches
  1 sibling, 1 reply; 35+ messages in thread
From: David Laight @ 2016-09-05 15:30 UTC (permalink / raw)
  To: 'Daniel Mack', Alexei Starovoitov
  Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev, sargun

From: Daniel Mack
> >> +
> >> +	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> >> +		__u32		target_fd;	/* container object to attach to */
> >> +		__u32		attach_bpf_fd;	/* eBPF program to attach */
> >> +		__u32		attach_type;	/* BPF_ATTACH_TYPE_* */
> >> +		__u64		attach_flags;
> >> +	};
> >
> > there is a 4 byte hole in this struct. Can we pack it differently?
> 
> Okay - I swapped "type" and "flags" to repair this.

That just moves the pad to the end of the structure.
Still likely to cause a problem for 32bit apps on a 64bit kernel.
If you can't think of any flags, why 64 of them?

	David



* Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-09-05 15:30       ` David Laight
@ 2016-09-05 15:40         ` Daniel Mack
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Mack @ 2016-09-05 15:40 UTC (permalink / raw)
  To: David Laight, Alexei Starovoitov
  Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev, sargun

On 09/05/2016 05:30 PM, David Laight wrote:
> From: Daniel Mack
>>>> +
>>>> +	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>>>> +		__u32		target_fd;	/* container object to attach to */
>>>> +		__u32		attach_bpf_fd;	/* eBPF program to attach */
>>>> +		__u32		attach_type;	/* BPF_ATTACH_TYPE_* */
>>>> +		__u64		attach_flags;
>>>> +	};
>>>
>>> there is a 4 byte hole in this struct. Can we pack it differently?
>>
>> Okay - I swapped "type" and "flags" to repair this.
> 
> That just moves the pad to the end of the structure.
> Still likely to cause a problem for 32bit apps on a 64bit kernel.

What kind of problem do you have in mind? Again, this is embedded in a
union of much bigger total size, and the API is not used in any kind of
hot-path.

> If you can't think of any flags, why 64 of them?

I can't think of any right now, but this is about defining an API that
can be used in other contexts as well. Also, it doesn't matter at all;
they do no harm. IMO, it's just better to have them right away than to
do a binary compat dance once someone needs them.


Thanks,
Daniel


* Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-09-05 14:09         ` Daniel Mack
@ 2016-09-05 17:09           ` Daniel Borkmann
  2016-09-05 18:32             ` Alexei Starovoitov
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Borkmann @ 2016-09-05 17:09 UTC (permalink / raw)
  To: Daniel Mack, htejun, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 09/05/2016 04:09 PM, Daniel Mack wrote:
> On 09/05/2016 03:56 PM, Daniel Borkmann wrote:
>> On 09/05/2016 02:54 PM, Daniel Mack wrote:
>>> On 08/30/2016 01:00 AM, Daniel Borkmann wrote:
>>>> On 08/26/2016 09:58 PM, Daniel Mack wrote:
>>>
>>>>>     enum bpf_map_type {
>>>>> @@ -147,6 +149,13 @@ union bpf_attr {
>>>>>     		__aligned_u64	pathname;
>>>>>     		__u32		bpf_fd;
>>>>>     	};
>>>>> +
>>>>> +	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
>>>>> +		__u32		target_fd;	/* container object to attach to */
>>>>> +		__u32		attach_bpf_fd;	/* eBPF program to attach */
>>>>> +		__u32		attach_type;	/* BPF_ATTACH_TYPE_* */
>>>>> +		__u64		attach_flags;
>>>>
>>>> Could we just do ...
>>>>
>>>> __u32 dst_fd;
>>>> __u32 src_fd;
>>>> __u32 attach_type;
>>>>
>>>> ... and leave flags out, since unused anyway? Also see below.
>>>
>>> I'd really like to keep the flags, even if they're unused right now.
>>> This only adds 8 bytes during the syscall operation, so it does no harm.
>>> However, we cannot change the userspace API after the fact, and who
>>> knows what this (rather generic) interface will be used for later on.
>>
>> With the below suggestion added, then flags doesn't need to be
>> added currently as it can be done safely at a later point in time
>> with respecting old binaries. See also the syscall handling code
>> in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The
>> underlying idea of this was taken from perf_event_open() syscall
>> back then, see [1] for a summary.
>>
>>     [1] https://lkml.org/lkml/2014/8/26/116
>
> Yes, I know that's possible, and I like the idea, but I don't think any
> new interface should come without flags really, as flags are something
> that will most certainly be needed at some point anyway. I didn't have
> them in my first shot, but Alexei pointed out that they should be added,
> and I agree.
>
> Also, this optimization wouldn't make the transported struct payload any
> smaller anyway, because the member of that union used by BPF_PROG_LOAD
> is still by far the biggest.
>
> I really don't think it's worth sparing 8 bytes here and then do the
> binary compat dance after flags are added, for no real gain.

Sure, but there's not much of a dance needed, see for example how map_flags
were added some time ago. So, iff there's really no foreseeable use-case in
sight and since we have this flexibility in place already, then I don't quite
follow why it's needed, if there's zero pain to add it later on. I would
understand it of course, if it cannot be handled later on anymore.


* Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-09-05 12:56     ` Daniel Mack
  2016-09-05 15:30       ` David Laight
@ 2016-09-05 17:29       ` Joe Perches
  1 sibling, 0 replies; 35+ messages in thread
From: Joe Perches @ 2016-09-05 17:29 UTC (permalink / raw)
  To: Daniel Mack, Alexei Starovoitov
  Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev, sargun

On Mon, 2016-09-05 at 14:56 +0200, Daniel Mack wrote:
> On 08/27/2016 02:08 AM, Alexei Starovoitov wrote:
[]
> > +	switch (attr->attach_type) {
> > +	case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS:
> > +	case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: {
> > +		struct cgroup *cgrp;
> > +
> > +		prog = bpf_prog_get_type(attr->attach_bpf_fd,
> > +					 BPF_PROG_TYPE_CGROUP_SOCKET_FILTER);
> > +		if (IS_ERR(prog))
> > +			return PTR_ERR(prog);
> > +
> > +		cgrp = cgroup_get_from_fd(attr->target_fd);
> > +		if (IS_ERR(cgrp)) {
> > +			bpf_prog_put(prog);
> > +			return PTR_ERR(cgrp);
> > +		}
> > +
> > +		cgroup_bpf_update(cgrp, prog, attr->attach_type);
> > +		cgroup_put(cgrp);
> > +
> > +		break;
> > +	}
> this } formatting style is confusing. The above } looks
> like it matches 'switch () {'.
> Maybe move 'struct cgroup *cgrp' to the top to avoid that?

This style of case statements that declare local variables
with an open brace after the case statement

	switch (bar) {
	[cases...]
	case foo: {
		local declarations;

		code...
	}
	[cases...]
	}

is used quite frequently in the kernel.
I think it's fine as is.


* Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-09-05 17:09           ` Daniel Borkmann
@ 2016-09-05 18:32             ` Alexei Starovoitov
  2016-09-05 18:43               ` Daniel Mack
  0 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2016-09-05 18:32 UTC (permalink / raw)
  To: Daniel Borkmann, Daniel Mack, htejun
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 9/5/16 10:09 AM, Daniel Borkmann wrote:
> On 09/05/2016 04:09 PM, Daniel Mack wrote:
>> On 09/05/2016 03:56 PM, Daniel Borkmann wrote:
>>> On 09/05/2016 02:54 PM, Daniel Mack wrote:
>>>> On 08/30/2016 01:00 AM, Daniel Borkmann wrote:
>>>>> On 08/26/2016 09:58 PM, Daniel Mack wrote:
>>>>
>>>>>>     enum bpf_map_type {
>>>>>> @@ -147,6 +149,13 @@ union bpf_attr {
>>>>>>             __aligned_u64    pathname;
>>>>>>             __u32        bpf_fd;
>>>>>>         };
>>>>>> +
>>>>>> +    struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH
>>>>>> commands */
>>>>>> +        __u32        target_fd;    /* container object to attach
>>>>>> to */
>>>>>> +        __u32        attach_bpf_fd;    /* eBPF program to attach */
>>>>>> +        __u32        attach_type;    /* BPF_ATTACH_TYPE_* */
>>>>>> +        __u64        attach_flags;
>>>>>
>>>>> Could we just do ...
>>>>>
>>>>> __u32 dst_fd;
>>>>> __u32 src_fd;
>>>>> __u32 attach_type;
>>>>>
>>>>> ... and leave flags out, since unused anyway? Also see below.
>>>>
>>>> I'd really like to keep the flags, even if they're unused right now.
>>>> This only adds 8 bytes during the syscall operation, so it does no
>>>> harm.
>>>> However, we cannot change the userspace API after the fact, and who
>>>> knows what this (rather generic) interface will be used for later on.
>>>
>>> With the below suggestion added, then flags doesn't need to be
>>> added currently as it can be done safely at a later point in time
>>> with respecting old binaries. See also the syscall handling code
>>> in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The
>>> underlying idea of this was taken from perf_event_open() syscall
>>> back then, see [1] for a summary.
>>>
>>>     [1] https://lkml.org/lkml/2014/8/26/116
>>
>> Yes, I know that's possible, and I like the idea, but I don't think any
>> new interface should come without flags really, as flags are something
>> that will most certainly be needed at some point anyway. I didn't have
>> them in my first shot, but Alexei pointed out that they should be added,
>> and I agree.
>>
>> Also, this optimization wouldn't make the transported struct payload any
>> smaller anyway, because the member of that union used by BPF_PROG_LOAD
>> is still by far the biggest.
>>
>> I really don't think it's worth sparing 8 bytes here and then do the
>> binary compat dance after flags are added, for no real gain.
>
> Sure, but there's not much of a dance needed, see for example how map_flags
> were added some time ago. So, iff there's really no foreseeable use-case in
> sight and since we have this flexibility in place already, then I don't
> quite
> follow why it's needed, if there's zero pain to add it later on. I would
> understand it of course, if it cannot be handled later on anymore.

I agree with Daniel B. Flags are completely unused right now, there
is no plan to use them for anything in the coming months, and, even
worse, they make an annoying hole in the struct, so let's not add
them. We can safely do that later; CHECK_ATTR() allows us to do it
easily. It's not like a syscall, where flags are a must-have since
they cannot be added later. Here it's done trivially.


* Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-09-05 18:32             ` Alexei Starovoitov
@ 2016-09-05 18:43               ` Daniel Mack
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Mack @ 2016-09-05 18:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, htejun
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 09/05/2016 08:32 PM, Alexei Starovoitov wrote:
> On 9/5/16 10:09 AM, Daniel Borkmann wrote:
>> On 09/05/2016 04:09 PM, Daniel Mack wrote:

>>> I really don't think it's worth sparing 8 bytes here and then do the
>>> binary compat dance after flags are added, for no real gain.
>>
>> Sure, but there's not much of a dance needed, see for example how map_flags
>> were added some time ago. So, iff there's really no foreseeable use-case in
>> sight and since we have this flexibility in place already, then I don't
>> quite
>> follow why it's needed, if there's zero pain to add it later on. I would
>> understand it of course, if it cannot be handled later on anymore.
> 
> I agree with Daniel B. Flags are completely unused right now, there
> is no plan to use them for anything in the coming months, and, even
> worse, they make an annoying hole in the struct, so let's not add
> them. We can safely do that later; CHECK_ATTR() allows us to do it
> easily. It's not like a syscall, where flags are a must-have since
> they cannot be added later. Here it's done trivially.

Okay then. If you both agree, I won't interfere :)


Daniel


* Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
  2016-09-05 14:49     ` Daniel Mack
@ 2016-09-05 21:40       ` Sargun Dhillon
  2016-09-05 22:39         ` Alexei Starovoitov
  0 siblings, 1 reply; 35+ messages in thread
From: Sargun Dhillon @ 2016-09-05 21:40 UTC (permalink / raw)
  To: Daniel Mack; +Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev

On Mon, Sep 05, 2016 at 04:49:26PM +0200, Daniel Mack wrote:
> Hi,
> 
> On 08/30/2016 01:04 AM, Sargun Dhillon wrote:
> > On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:
> >> This patch adds two sets of eBPF program pointers to struct cgroup.
> >> One for such that are directly pinned to a cgroup, and one for such
> >> that are effective for it.
> >>
> >> To illustrate the logic behind that, assume the following example
> >> cgroup hierarchy.
> >>
> >>   A - B - C
> >>         \ D - E
> >>
> >> If only B has a program attached, it will be effective for B, C, D
> >> and E. If D then attaches a program itself, that will be effective for
> >> both D and E, and the program in B will only affect B and C. Only one
> >> program of a given type is effective for a cgroup.
> >>
> > How does this work when running an orchestrator within an orchestrator? The
> > Docker in Docker / Mesos in Mesos use case, where the top-level orchestrator is
> > observing the traffic, and there is an orchestrator within it that also needs
> > to run one.
> > 
> > In this case, I'd like to run E's filter, then if it returns 0, D's, and B's, 
> > and so on.
> 
> Running multiple programs was an idea I had in one of my earlier drafts,
> but after some discussion, I refrained from it again because potentially
> walking the cgroup hierarchy on every packet is just too expensive.
>
I think you're correct here. Maybe this is something I do with the LSM-attached
filters, and not for skb filters. Do you think there might be a way to opt in to
this behavior?

> > Is it possible to allow this, either by flattening out the
> > datastructure (copy a ref to the bpf programs to C and E) or
> > something similar?
> 
> That would mean we carry a list of eBPF program pointers of dynamic
> size. IOW, the deeper inside the cgroup hierarchy, the bigger the list,
> so it can store a reference to all programs of all of its ancestors.
> 
> While I think that would be possible, even at some later point, I'd
> really like to avoid it for the sake of simplicity.
> 
> Is there any reason why this can't be done in userspace? Compile a
> program X for A, and overload it with Y, with Y doing the same as X
> but adding some extra checks? Note that all users of the bpf(2) syscall
> API will need CAP_NET_ADMIN anyway, so there is no delegation to
> unprivileged sub-orchestrators or anything like that, really.

One of the use-cases that's becoming more and more common is
containers-in-containers. In this, you have a privileged container that's
running something like build orchestration, and you want to do macro-isolation
(say, limit access to only that tenant's infrastructure). Then, when the build
orchestrator runs a build, it may want to monitor, and further isolate, the tasks
that run in the build job. This is a side effect of composing different
container technologies. Typically you use one system for images, then another
for orchestration, and the actual program running inside of it can also leverage
containerization.

Example:
K8s->Docker->Jenkins Agent->Jenkins Build Job

There's also a differentiation of ownership in each of these systems. I would
really rather not require a middleware system that all my software has to talk
to, because sometimes I'm taking off-the-shelf software (Jenkins) and porting it
to containers. I think one of the pieces that's led to the success of cgroups is
the straightforward API and ease of use (and it's getting even easier in v2).

It's perfectly fine to give the lower level tasks CAP_NET_ADMIN, because we use 
something like seccomp-bpf plus some of the work I've been doing with the LSM to 
prevent the sub-orchestrators from accidentally blowing away the system. 
Usually, we trust these orchestrators (internal users), so it's more of a 
precautionary measure as opposed to a true security measure.

Also, rewriting BPF programs, although pretty straightforward, sounds like a pain
to do in userspace, even with a helper. If we were to take people's programs and
chain them together via tail call, or similar, I can imagine rewriting a
program might push you over the instruction limit.
> 
> 
> Thanks,
> Daniel
> 


* Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
  2016-09-05 21:40       ` Sargun Dhillon
@ 2016-09-05 22:39         ` Alexei Starovoitov
  0 siblings, 0 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2016-09-05 22:39 UTC (permalink / raw)
  To: Sargun Dhillon, Daniel Mack
  Cc: htejun, daniel, davem, kafai, fw, pablo, harald, netdev

On 9/5/16 2:40 PM, Sargun Dhillon wrote:
> On Mon, Sep 05, 2016 at 04:49:26PM +0200, Daniel Mack wrote:
>> Hi,
>>
>> On 08/30/2016 01:04 AM, Sargun Dhillon wrote:
>>> On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote:
>>>> This patch adds two sets of eBPF program pointers to struct cgroup.
>>>> One for such that are directly pinned to a cgroup, and one for such
>>>> that are effective for it.
>>>>
>>>> To illustrate the logic behind that, assume the following example
>>>> cgroup hierarchy.
>>>>
>>>>    A - B - C
>>>>          \ D - E
>>>>
>>>> If only B has a program attached, it will be effective for B, C, D
>>>> and E. If D then attaches a program itself, that will be effective for
>>>> both D and E, and the program in B will only affect B and C. Only one
>>>> program of a given type is effective for a cgroup.
>>>>
>>> How does this work when running an orchestrator within an orchestrator? The
>>> Docker in Docker / Mesos in Mesos use case, where the top-level orchestrator is
>>> observing the traffic, and there is an orchestrator within it that also needs
>>> to run one.
>>>
>>> In this case, I'd like to run E's filter, then if it returns 0, D's, and B's,
>>> and so on.
>>
>> Running multiple programs was an idea I had in one of my earlier drafts,
>> but after some discussion, I refrained from it again because potentially
>> walking the cgroup hierarchy on every packet is just too expensive.
>>
> I think you're correct here. Maybe this is something I do with the LSM-attached
> filters, and not for skb filters. Do you think there might be a way to opt in to
> this behavior?
>
>>> Is it possible to allow this, either by flattening out the
>>> datastructure (copy a ref to the bpf programs to C and E) or
>>> something similar?
>>
>> That would mean we carry a list of eBPF program pointers of dynamic
>> size. IOW, the deeper inside the cgroup hierarchy, the bigger the list,
>> so it can store a reference to all programs of all of its ancestors.
>>
>> While I think that would be possible, even at some later point, I'd
>> really like to avoid it for the sake of simplicity.
>>
>> Is there any reason why this can't be done in userspace? Compile a
>> program X for A, and overload it with Y, with Y doing the same as X
>> but adding some extra checks? Note that all users of the bpf(2) syscall
>> API will need CAP_NET_ADMIN anyway, so there is no delegation to
>> unprivileged sub-orchestrators or anything like that, really.
>
> One of the use-cases that's becoming more and more common is
> containers-in-containers. In this, you have a privileged container that's
> running something like build orchestration, and you want to do macro-isolation
> (say, limit access to only that tenant's infrastructure). Then, when the build
> orchestrator runs a build, it may want to monitor, and further isolate, the tasks
> that run in the build job. This is a side effect of composing different
> container technologies. Typically you use one system for images, then another
> for orchestration, and the actual program running inside of it can also leverage
> containerization.
>
> Example:
> K8s->Docker->Jenkins Agent->Jenkins Build Job

frankly I don't buy this argument, since the above
and other 'examples' of container-in-container look
fake to me. There is a ton of work to be done for such
a scheme to be even remotely feasible. The cgroup+bpf
stuff would be the last on my list to 'fix' for such
deployments. I don't think we should worry about it
at present.


* Re: [PATCH v3 5/6] net: core: run cgroup eBPF egress programs
  2016-09-05 14:22     ` Daniel Mack
@ 2016-09-06 17:14       ` Daniel Borkmann
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Borkmann @ 2016-09-06 17:14 UTC (permalink / raw)
  To: Daniel Mack, htejun, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, sargun

On 09/05/2016 04:22 PM, Daniel Mack wrote:
> On 08/30/2016 12:03 AM, Daniel Borkmann wrote:
>> On 08/26/2016 09:58 PM, Daniel Mack wrote:
>
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index a75df86..17484e6 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -141,6 +141,7 @@
>>>    #include <linux/netfilter_ingress.h>
>>>    #include <linux/sctp.h>
>>>    #include <linux/crash_dump.h>
>>> +#include <linux/bpf-cgroup.h>
>>>
>>>    #include "net-sysfs.h"
>>>
>>> @@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
>>>    	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
>>>    		__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
>>>
>>> +	rc = cgroup_bpf_run_filter(skb->sk, skb,
>>> +				   BPF_ATTACH_TYPE_CGROUP_INET_EGRESS);
>>> +	if (rc)
>>> +		return rc;
>>
>> This would leak the whole skb by the way.
>
> Ah, right.
>
>> Apart from that, could this be modeled w/o affecting the forwarding path (at some
>> local output point where we know to have a valid socket)? Then you could also drop
>> the !sk and sk->sk_family tests, and we wouldn't need to replicate parts of what
>> clsact is doing as well. Hmm, maybe access to src/dst mac could be handled to be
>> just zeroes since not available at that point?
>
> Hmm, I wonder where this hook could be put instead then. When placed in
> ip_output() and ip6_output(), the mac headers cannot be pushed before
> running the program, resulting in bogus skb data from the eBPF program.

But as it stands right now, RX will only see a subset of packets at the
sk_filter() layer (depending on where it's called in the proto handler
implementation, that might not even include all control messages, for
example). The TX hook, on the other hand, sits so early that it 'sees'
everything, including forwarded packets, even though we know a priori
that, with the hook enabled, such skbs will go through the
cgroup_bpf_run_filter() handler only to skip the hook eventually anyway.
What about letting such progs see /only/ local skbs for RX and TX, with
skb->data from L3 onwards (iirc, that would be similar to what current
sk_filter() programs see)?


end of thread, other threads:[~2016-09-06 17:14 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-26 19:58 [PATCH v3 0/6] Add eBPF hooks for cgroups Daniel Mack
2016-08-26 19:58 ` [PATCH v3 1/6] bpf: add new prog type for cgroup socket filtering Daniel Mack
2016-08-29 22:14   ` Daniel Borkmann
2016-09-05 12:48     ` Daniel Mack
2016-08-26 19:58 ` [PATCH v3 2/6] cgroup: add support for eBPF programs Daniel Mack
2016-08-27  0:03   ` Alexei Starovoitov
2016-09-05 12:47     ` Daniel Mack
2016-08-29 22:42   ` Daniel Borkmann
2016-09-05 12:50     ` Daniel Mack
2016-08-29 23:04   ` Sargun Dhillon
2016-09-05 14:49     ` Daniel Mack
2016-09-05 21:40       ` Sargun Dhillon
2016-09-05 22:39         ` Alexei Starovoitov
2016-08-26 19:58 ` [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
2016-08-27  0:08   ` Alexei Starovoitov
2016-09-05 12:56     ` Daniel Mack
2016-09-05 15:30       ` David Laight
2016-09-05 15:40         ` Daniel Mack
2016-09-05 17:29       ` Joe Perches
2016-08-29 23:00   ` Daniel Borkmann
2016-09-05 12:54     ` Daniel Mack
2016-09-05 13:56       ` Daniel Borkmann
2016-09-05 14:09         ` Daniel Mack
2016-09-05 17:09           ` Daniel Borkmann
2016-09-05 18:32             ` Alexei Starovoitov
2016-09-05 18:43               ` Daniel Mack
2016-08-26 19:58 ` [PATCH v3 4/6] net: filter: run cgroup eBPF ingress programs Daniel Mack
2016-08-29 23:15   ` Daniel Borkmann
2016-08-26 19:58 ` [PATCH v3 5/6] net: core: run cgroup eBPF egress programs Daniel Mack
2016-08-29 22:03   ` Daniel Borkmann
2016-08-29 22:23     ` Sargun Dhillon
2016-09-05 14:22     ` Daniel Mack
2016-09-06 17:14       ` Daniel Borkmann
2016-08-26 19:58 ` [PATCH v3 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups Daniel Mack
2016-08-27 13:00 ` [PATCH v3 0/6] Add eBPF hooks for cgroups Rami Rosen
