* [PATCH v5 0/6] Add eBPF hooks for cgroups
@ 2016-09-12 16:12 Daniel Mack
  2016-09-12 16:12 ` [PATCH v5 1/6] bpf: add new prog type for cgroup socket filtering Daniel Mack
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: Daniel Mack @ 2016-09-12 16:12 UTC (permalink / raw)
  To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack

This is v5 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The logic also
allows extension to other cgroup-based eBPF use cases.

After chatting with Daniel Borkmann and Alexei off-list, we concluded
that __dev_queue_xmit() is the place where the egress hooks should live
when eBPF programs need access to the L2 bits of the skb.


Changes from v4:

* Plug an skb leak when dropping packets due to eBPF verdicts in
  __dev_queue_xmit(). Spotted by Daniel Borkmann.

* Check for sk_fullsock(sk) in __cgroup_bpf_run_filter() so we don't
  operate on timewait or request sockets. Suggested by Daniel Borkmann.

* Add missing @parent parameter in kerneldoc of __cgroup_bpf_update().
  Spotted by Rami Rosen.

* Include linux/jump_label.h from bpf-cgroup.h to fix a kbuild error.


Changes from v3:

* Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
  renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
  BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
  __BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.

* Dropped the attach_flags member from the anonymous struct for BPF
  attach operations in union bpf_attr. They can be added later on via
  CHECK_ATTR. Requested by Daniel Borkmann and Alexei.

* Release old_prog at the end of __cgroup_bpf_update rather than at
  the beginning to fix a race gap between program updates and their
  users. Spotted by Daniel Borkmann.

* Plugged an skb leak when dropping packets on the egress path.
  Spotted by Daniel Borkmann.

* Add cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org to the loop, as suggested by Rami Rosen.

* Some minor coding style adaptations not worth mentioning individually.


Changes from v2:

* Fixed the RCU locking details Tejun pointed out.

* Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.


Changes from v1:

* Moved all bpf specific cgroup code into its own file, and stubbed
  out related functions for !CONFIG_CGROUP_BPF as static inline nops.
  This way, the call sites are not cluttered with #ifdef guards while
  the feature remains compile-time configurable.

* Implemented the new scheme proposed by Tejun. Per cgroup, store one
  set of pointers that are pinned to the cgroup, and one for the
  programs that are effective. When a program is attached or detached,
  the change is propagated to all the cgroup's descendants. If a
  subcgroup has its own pinned program, skip the whole subbranch in
  order to allow delegation models.

* The hookup for egress packets is now done from __dev_queue_xmit().

* A static key is now used in both the ingress and egress fast paths
  to keep performance penalties close to zero if the feature is
  not in use.

* Overall cleanup to make the accessors use the program arrays.
  This should make it much easier to add new program types, which
  will then automatically follow the pinned vs. effective logic.

* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with
  xchg() and are protected by cgroup_mutex.

* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.

* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. In case
  other socket families should be supported, this can be extended in
  the future.

* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.


As always, feedback is much appreciated.

Thanks,
Daniel


Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: core: run cgroup eBPF egress programs
  samples: bpf: add userspace example for attaching eBPF programs to
    cgroups

 include/linux/bpf-cgroup.h      |  71 +++++++++++++++++
 include/linux/cgroup-defs.h     |   4 +
 include/uapi/linux/bpf.h        |  17 ++++
 init/Kconfig                    |  12 +++
 kernel/bpf/Makefile             |   1 +
 kernel/bpf/cgroup.c             | 166 ++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c            |  81 ++++++++++++++++++++
 kernel/bpf/verifier.c           |   1 +
 kernel/cgroup.c                 |  18 +++++
 net/core/dev.c                  |   6 ++
 net/core/filter.c               |  10 +++
 samples/bpf/Makefile            |   2 +
 samples/bpf/libbpf.c            |  21 +++++
 samples/bpf/libbpf.h            |   3 +
 samples/bpf/test_cgrp2_attach.c | 147 +++++++++++++++++++++++++++++++++++
 15 files changed, 560 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c
 create mode 100644 samples/bpf/test_cgrp2_attach.c

-- 
2.5.5

* [PATCH v5 1/6] bpf: add new prog type for cgroup socket filtering
  2016-09-12 16:12 [PATCH v5 0/6] Add eBPF hooks for cgroups Daniel Mack
@ 2016-09-12 16:12 ` Daniel Mack
  2016-09-12 16:12 ` [PATCH v5 2/6] cgroup: add support for eBPF programs Daniel Mack
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Daniel Mack @ 2016-09-12 16:12 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups, Daniel Mack

For now, this program type is equivalent to BPF_PROG_TYPE_SOCKET_FILTER in
terms of checks during the verification process. It may access the skb as
well.

Programs of this type will be attached to cgroups for network filtering
and accounting.
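
As an illustration only (mirroring how the sample in patch 6/6 loads its
program via the existing samples/bpf wrappers), loading a trivial program
of this type that lets every packet pass could look like this:

	#include <linux/bpf.h>
	#include "libbpf.h"	/* samples/bpf wrapper around bpf(2) */

	static int load_accept_all(void)
	{
		struct bpf_insn prog[] = {
			BPF_MOV64_IMM(BPF_REG_0, 1),	/* r0 = 1: let the packet pass */
			BPF_EXIT_INSN(),
		};

		return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCKET,
				     prog, sizeof(prog), "GPL", 0);
	}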

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/uapi/linux/bpf.h | 9 +++++++++
 kernel/bpf/verifier.c    | 1 +
 net/core/filter.c        | 6 ++++++
 3 files changed, 16 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f896dfa..55f815e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,8 +96,17 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_TRACEPOINT,
 	BPF_PROG_TYPE_XDP,
 	BPF_PROG_TYPE_PERF_EVENT,
+	BPF_PROG_TYPE_CGROUP_SOCKET,
 };
 
+enum bpf_attach_type {
+	BPF_CGROUP_INET_INGRESS,
+	BPF_CGROUP_INET_EGRESS,
+	__MAX_BPF_ATTACH_TYPE
+};
+
+#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
+
 #define BPF_PSEUDO_MAP_FD	1
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 90493a6..d5d2875 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1830,6 +1830,7 @@ static bool may_access_skb(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_SOCKET_FILTER:
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_CGROUP_SOCKET:
 		return true;
 	default:
 		return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index a83766b..176b6f2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2848,12 +2848,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list cg_sk_type __read_mostly = {
+	.ops	= &sk_filter_ops,
+	.type	= BPF_PROG_TYPE_CGROUP_SOCKET,
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
 	bpf_register_prog_type(&sched_cls_type);
 	bpf_register_prog_type(&sched_act_type);
 	bpf_register_prog_type(&xdp_type);
+	bpf_register_prog_type(&cg_sk_type);
 
 	return 0;
 }
-- 
2.5.5

* [PATCH v5 2/6] cgroup: add support for eBPF programs
  2016-09-12 16:12 [PATCH v5 0/6] Add eBPF hooks for cgroups Daniel Mack
  2016-09-12 16:12 ` [PATCH v5 1/6] bpf: add new prog type for cgroup socket filtering Daniel Mack
@ 2016-09-12 16:12 ` Daniel Mack
  2016-09-12 16:12 ` [PATCH v5 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Daniel Mack @ 2016-09-12 16:12 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups, Daniel Mack

This patch adds two sets of eBPF program pointers to struct cgroup:
one for programs that are directly pinned to the cgroup, and one for
programs that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.
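
To make this concrete, with two hypothetical programs progB (attached to
B) and progD (attached to D), the resulting state would be:

            pinned     effective
    A       -          -
    B       progB      progB
    C       -          progB
    D       progD      progD
    E       -          progD

Detaching the program from D afterwards would make progB effective for D
and E again.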

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/linux/bpf-cgroup.h  |  71 +++++++++++++++++++
 include/linux/cgroup-defs.h |   4 ++
 init/Kconfig                |  12 ++++
 kernel/bpf/Makefile         |   1 +
 kernel/bpf/cgroup.c         | 166 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup.c             |  18 +++++
 6 files changed, 272 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 0000000..fc076de
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,71 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include <linux/bpf.h>
+#include <linux/jump_label.h>
+#include <uapi/linux/bpf.h>
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+	/*
+	 * Store two sets of bpf_prog pointers, one for programs that are
+	 * pinned directly to this cgroup, and one for those that are effective
+	 * when this cgroup is accessed.
+	 */
+	struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
+	struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+			 struct cgroup *parent,
+			 struct bpf_prog *prog,
+			 enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+		       struct bpf_prog *prog,
+		       enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+			    struct sk_buff *skb,
+			    enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled */
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+					struct sk_buff *skb,
+					enum bpf_attach_type type)
+{
+	if (cgroup_bpf_enabled)
+		return __cgroup_bpf_run_filter(sk, skb, type);
+
+	return 0;
+}
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+				      struct cgroup *parent) {}
+
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+					struct sk_buff *skb,
+					enum bpf_attach_type type)
+{
+	return 0;
+}
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include <linux/percpu-refcount.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/workqueue.h>
+#include <linux/bpf-cgroup.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -300,6 +301,9 @@ struct cgroup {
 	/* used to schedule release agent */
 	struct work_struct release_agent_work;
 
+	/* used to store eBPF programs */
+	struct cgroup_bpf bpf;
+
 	/* ids of the ancestors at each level including self */
 	int ancestor_ids[];
 };
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..71c71b0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,18 @@ config CGROUP_PERF
 
 	  Say N if unsure.
 
+config CGROUP_BPF
+	bool "Support for eBPF programs attached to cgroups"
+	depends on BPF_SYSCALL && SOCK_CGROUP_DATA
+	help
+	  Allow attaching eBPF programs to a cgroup using the bpf(2)
+	  syscall command BPF_PROG_ATTACH.
+
+	  In which context these programs are accessed depends on the type
+	  of attachment. For instance, programs that are attached using
+	  BPF_CGROUP_INET_INGRESS will be executed on the ingress path of
+	  inet sockets.
+
 config CGROUP_DEBUG
 	bool "Example controller"
 	default n
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index eed911d..b22256b 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
+obj-$(CONFIG_CGROUP_BPF) += cgroup.o
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
new file mode 100644
index 0000000..21d168c
--- /dev/null
+++ b/kernel/bpf/cgroup.c
@@ -0,0 +1,166 @@
+/*
+ * Functions to manage eBPF programs attached to cgroups
+ *
+ * Copyright (c) 2016 Daniel Mack
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/bpf-cgroup.h>
+#include <net/sock.h>
+
+DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
+EXPORT_SYMBOL(cgroup_bpf_enabled_key);
+
+/**
+ * cgroup_bpf_put() - put references of all bpf programs
+ * @cgrp: the cgroup to modify
+ */
+void cgroup_bpf_put(struct cgroup *cgrp)
+{
+	unsigned int type;
+
+	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.prog); type++) {
+		struct bpf_prog *prog = cgrp->bpf.prog[type];
+
+		if (prog) {
+			bpf_prog_put(prog);
+			static_branch_dec(&cgroup_bpf_enabled_key);
+		}
+	}
+}
+
+/**
+ * cgroup_bpf_inherit() - inherit effective programs from parent
+ * @cgrp: the cgroup to modify
+ * @parent: the parent to inherit from
+ */
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
+{
+	unsigned int type;
+
+	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.effective); type++) {
+		struct bpf_prog *e;
+
+		e = rcu_dereference_protected(parent->bpf.effective[type],
+					      lockdep_is_held(&cgroup_mutex));
+		rcu_assign_pointer(cgrp->bpf.effective[type], e);
+	}
+}
+
+/**
+ * __cgroup_bpf_update() - Update the pinned program of a cgroup, and
+ *                         propagate the change to descendants
+ * @cgrp: The cgroup which descendants to traverse
+ * @parent: The parent of @cgrp, or %NULL if @cgrp is the root
+ * @prog: A new program to pin
+ * @type: Type of pinning operation (ingress/egress)
+ *
+ * Each cgroup has two sets of bpf program pointers: one for programs it
+ * owns directly, and one for the programs that are effective for it.
+ *
+ * If @prog is not %NULL, this function attaches a new program to the
+ * cgroup and releases the one that is currently attached, if any. @prog
+ * is then made the effective program of type @type in that cgroup.
+ *
+ * If @prog is %NULL, the currently attached program of type @type is released,
+ * and the effective program of the parent cgroup (if any) is inherited to
+ * @cgrp.
+ *
+ * Then, the descendants of @cgrp are walked and the effective program for
+ * each of them is set to the effective program of @cgrp unless the
+ * descendant has its own program attached, in which case the subbranch is
+ * skipped. This ensures that delegated subcgroups with own programs are left
+ * untouched.
+ *
+ * Must be called with cgroup_mutex held.
+ */
+void __cgroup_bpf_update(struct cgroup *cgrp,
+			 struct cgroup *parent,
+			 struct bpf_prog *prog,
+			 enum bpf_attach_type type)
+{
+	struct bpf_prog *old_prog, *effective;
+	struct cgroup_subsys_state *pos;
+
+	old_prog = xchg(cgrp->bpf.prog + type, prog);
+
+	effective = (!prog && parent) ?
+		rcu_dereference_protected(parent->bpf.effective[type],
+					  lockdep_is_held(&cgroup_mutex)) :
+		prog;
+
+	css_for_each_descendant_pre(pos, &cgrp->self) {
+		struct cgroup *desc = container_of(pos, struct cgroup, self);
+
+		/* skip the subtree if the descendant has its own program */
+		if (desc->bpf.prog[type] && desc != cgrp)
+			pos = css_rightmost_descendant(pos);
+		else
+			rcu_assign_pointer(desc->bpf.effective[type],
+					   effective);
+	}
+
+	if (prog)
+		static_branch_inc(&cgroup_bpf_enabled_key);
+
+	if (old_prog) {
+		bpf_prog_put(old_prog);
+		static_branch_dec(&cgroup_bpf_enabled_key);
+	}
+}
+
+/**
+ * __cgroup_bpf_run_filter() - Run a program for packet filtering
+ * @sk: The socket sending or receiving traffic
+ * @skb: The skb that is being sent or received
+ * @type: The type of program to be executed
+ *
+ * If no socket is passed, or the socket is not of type INET or INET6,
+ * this function does nothing and returns 0.
+ *
+ * The program type passed in via @type must be suitable for network
+ * filtering. No further check is performed to assert that.
+ *
+ * This function will return %-EPERM if an attached program was found and
+ * it returned != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter(struct sock *sk,
+			    struct sk_buff *skb,
+			    enum bpf_attach_type type)
+{
+	struct bpf_prog *prog;
+	struct cgroup *cgrp;
+	int ret = 0;
+
+	if (!sk || !sk_fullsock(sk))
+		return 0;
+
+	if (sk->sk_family != AF_INET &&
+	    sk->sk_family != AF_INET6)
+		return 0;
+
+	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+
+	rcu_read_lock();
+
+	prog = rcu_dereference(cgrp->bpf.effective[type]);
+	if (prog) {
+		unsigned int offset = skb->data - skb_mac_header(skb);
+
+		__skb_push(skb, offset);
+		ret = bpf_prog_run_clear_cb(prog, skb) == 1 ? 0 : -EPERM;
+		__skb_pull(skb, offset);
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d1c51b7..57ade89 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5038,6 +5038,8 @@ static void css_release_work_fn(struct work_struct *work)
 		if (cgrp->kn)
 			RCU_INIT_POINTER(*(void __rcu __force **)&cgrp->kn->priv,
 					 NULL);
+
+		cgroup_bpf_put(cgrp);
 	}
 
 	mutex_unlock(&cgroup_mutex);
@@ -5245,6 +5247,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	if (!cgroup_on_dfl(cgrp))
 		cgrp->subtree_control = cgroup_control(cgrp);
 
+	if (parent)
+		cgroup_bpf_inherit(cgrp, parent);
+
 	cgroup_propagate_control(cgrp);
 
 	/* @cgrp doesn't have dir yet so the following will only create csses */
@@ -6417,6 +6422,19 @@ static __init int cgroup_namespaces_init(void)
 }
 subsys_initcall(cgroup_namespaces_init);
 
+#ifdef CONFIG_CGROUP_BPF
+void cgroup_bpf_update(struct cgroup *cgrp,
+		       struct bpf_prog *prog,
+		       enum bpf_attach_type type)
+{
+	struct cgroup *parent = cgroup_parent(cgrp);
+
+	mutex_lock(&cgroup_mutex);
+	__cgroup_bpf_update(cgrp, parent, prog, type);
+	mutex_unlock(&cgroup_mutex);
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 #ifdef CONFIG_CGROUP_DEBUG
 static struct cgroup_subsys_state *
 debug_css_alloc(struct cgroup_subsys_state *parent_css)
-- 
2.5.5

* [PATCH v5 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  2016-09-12 16:12 [PATCH v5 0/6] Add eBPF hooks for cgroups Daniel Mack
  2016-09-12 16:12 ` [PATCH v5 1/6] bpf: add new prog type for cgroup socket filtering Daniel Mack
  2016-09-12 16:12 ` [PATCH v5 2/6] cgroup: add support for eBPF programs Daniel Mack
@ 2016-09-12 16:12 ` Daniel Mack
       [not found] ` <1473696735-11269-1-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Daniel Mack @ 2016-09-12 16:12 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups, Daniel Mack

Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the field in union bpf_attr is called 'target_fd'.

When called with BPF_CGROUP_INET_{IN,E}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.
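
As an illustration only (mirroring the libbpf wrapper added in patch 6/6),
a minimal userspace sketch of the attach operation could look like this;
the program and cgroup file descriptors are assumed to come from
BPF_PROG_LOAD and open(2) on a cgroup v2 directory, respectively:

	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/bpf.h>

	static int attach_to_cgroup(int prog_fd, int cgroup_fd,
				    enum bpf_attach_type type)
	{
		union bpf_attr attr = {
			.target_fd	= cgroup_fd,	/* cgroup v2 directory fd */
			.attach_bpf_fd	= prog_fd,	/* fd from BPF_PROG_LOAD */
			.attach_type	= type,		/* e.g. BPF_CGROUP_INET_EGRESS */
		};

		/* returns 0 on success, -1 with errno set otherwise */
		return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
	}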

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/uapi/linux/bpf.h |  8 +++++
 kernel/bpf/syscall.c     | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 55f815e..7cd3616 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
 	BPF_PROG_LOAD,
 	BPF_OBJ_PIN,
 	BPF_OBJ_GET,
+	BPF_PROG_ATTACH,
+	BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -150,6 +152,12 @@ union bpf_attr {
 		__aligned_u64	pathname;
 		__u32		bpf_fd;
 	};
+
+	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+		__u32		target_fd;	/* container object to attach to */
+		__u32		attach_bpf_fd;	/* eBPF program to attach */
+		__u32		attach_type;
+	};
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..1a8592a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,77 @@ static int bpf_obj_get(const union bpf_attr *attr)
 	return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+#ifdef CONFIG_CGROUP_BPF
+
+#define BPF_PROG_ATTACH_LAST_FIELD attach_type
+
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+	struct bpf_prog *prog;
+	struct cgroup *cgrp;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (CHECK_ATTR(BPF_PROG_ATTACH))
+		return -EINVAL;
+
+	switch (attr->attach_type) {
+	case BPF_CGROUP_INET_INGRESS:
+	case BPF_CGROUP_INET_EGRESS:
+		prog = bpf_prog_get_type(attr->attach_bpf_fd,
+					 BPF_PROG_TYPE_CGROUP_SOCKET);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+
+		cgrp = cgroup_get_from_fd(attr->target_fd);
+		if (IS_ERR(cgrp)) {
+			bpf_prog_put(prog);
+			return PTR_ERR(cgrp);
+		}
+
+		cgroup_bpf_update(cgrp, prog, attr->attach_type);
+		cgroup_put(cgrp);
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+#define BPF_PROG_DETACH_LAST_FIELD attach_type
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+	struct cgroup *cgrp;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (CHECK_ATTR(BPF_PROG_DETACH))
+		return -EINVAL;
+
+	switch (attr->attach_type) {
+	case BPF_CGROUP_INET_INGRESS:
+	case BPF_CGROUP_INET_EGRESS:
+		cgrp = cgroup_get_from_fd(attr->target_fd);
+		if (IS_ERR(cgrp))
+			return PTR_ERR(cgrp);
+
+		cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+		cgroup_put(cgrp);
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -888,6 +959,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET:
 		err = bpf_obj_get(&attr);
 		break;
+
+#ifdef CONFIG_CGROUP_BPF
+	case BPF_PROG_ATTACH:
+		err = bpf_prog_attach(&attr);
+		break;
+	case BPF_PROG_DETACH:
+		err = bpf_prog_detach(&attr);
+		break;
+#endif
+
 	default:
 		err = -EINVAL;
 		break;
-- 
2.5.5

* [PATCH v5 4/6] net: filter: run cgroup eBPF ingress programs
       [not found] ` <1473696735-11269-1-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
@ 2016-09-12 16:12   ` Daniel Mack
  2016-09-12 16:12   ` [PATCH v5 5/6] net: core: run cgroup eBPF egress programs Daniel Mack
  2016-09-12 16:12   ` [PATCH v5 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups Daniel Mack
  2 siblings, 0 replies; 27+ messages in thread
From: Daniel Mack @ 2016-09-12 16:12 UTC (permalink / raw)
  To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack

If the cgroup associated with the receiving socket has eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the full skb, including the MAC headers.
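
As a purely illustrative sketch (not part of this patch), a restricted-C
program of type BPF_PROG_TYPE_CGROUP_SOCKET honouring these semantics
could look like the following; the section name is only a loader
convention and the 1400 byte cut-off an arbitrary assumption:

	#include <linux/bpf.h>

	__attribute__((section("cgroup_filter"), used))
	int cg_filter(struct __sk_buff *skb)
	{
		/* return 1 to let the packet pass, anything else drops it */
		return skb->len > 1400 ? 0 : 1;
	}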

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
---
 net/core/filter.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 176b6f2..3662c1a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,10 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
 	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
 		return -ENOMEM;
 
+	err = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_INGRESS);
+	if (err)
+		return err;
+
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)
 		return err;
-- 
2.5.5

* [PATCH v5 5/6] net: core: run cgroup eBPF egress programs
       [not found] ` <1473696735-11269-1-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
  2016-09-12 16:12   ` [PATCH v5 4/6] net: filter: run cgroup eBPF ingress programs Daniel Mack
@ 2016-09-12 16:12   ` Daniel Mack
  2016-09-12 16:12   ` [PATCH v5 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups Daniel Mack
  2 siblings, 0 replies; 27+ messages in thread
From: Daniel Mack @ 2016-09-12 16:12 UTC (permalink / raw)
  To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack

If the cgroup associated with the sending socket has eBPF
programs installed, run them from __dev_queue_xmit().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the full skb, including the MAC headers.

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
---
 net/core/dev.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 34b5322..f951db2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -141,6 +141,7 @@
 #include <linux/netfilter_ingress.h>
 #include <linux/sctp.h>
 #include <linux/crash_dump.h>
+#include <linux/bpf-cgroup.h>
 
 #include "net-sysfs.h"
 
@@ -3329,6 +3330,10 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
 		__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);
 
+	rc = cgroup_bpf_run_filter(skb->sk, skb, BPF_CGROUP_INET_EGRESS);
+	if (rc)
+		goto free_skb_list;
+
 	/* Disable soft irqs for various locks below. Also
 	 * stops preemption for RCU.
 	 */
@@ -3416,6 +3421,7 @@ recursion_alert:
 	rcu_read_unlock_bh();
 
 	atomic_long_inc(&dev->tx_dropped);
+free_skb_list:
 	kfree_skb_list(skb);
 	return rc;
 out:
-- 
2.5.5

* [PATCH v5 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups
       [not found] ` <1473696735-11269-1-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
  2016-09-12 16:12   ` [PATCH v5 4/6] net: filter: run cgroup eBPF ingress programs Daniel Mack
  2016-09-12 16:12   ` [PATCH v5 5/6] net: core: run cgroup eBPF egress programs Daniel Mack
@ 2016-09-12 16:12   ` Daniel Mack
  2 siblings, 0 replies; 27+ messages in thread
From: Daniel Mack @ 2016-09-12 16:12 UTC (permalink / raw)
  To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack

Add a simple userspace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:

 * Create arraymap in kernel with 4 byte keys and 8 byte values

 * Load eBPF program

   The eBPF program accesses the map passed in to store two pieces of
   information. The number of invocations of the program, which maps
   to the number of packets received, is stored to key 0. Key 1 is
   incremented on each iteration by the number of bytes stored in
   the skb.

 * Detach any eBPF program previously attached to the cgroup

 * Attach the new program to the cgroup using BPF_PROG_ATTACH

 * Once a second, read map[0] and map[1] to see how many bytes and
   packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress"
or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
which will make the generated eBPF program return 0 instead of 1, so
the kernel will drop the packet.
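
For example, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup and
an existing cgroup called "test" (both paths are just assumptions for
illustration):

	# count packets and bytes sent by tasks in the cgroup
	./test_cgrp2_attach /sys/fs/cgroup/test egress

	# drop all incoming traffic for tasks in the cgroup instead
	./test_cgrp2_attach /sys/fs/cgroup/test ingress drop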

libbpf gained two new wrappers for the new syscall commands.

Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
---
 samples/bpf/Makefile            |   2 +
 samples/bpf/libbpf.c            |  21 ++++++
 samples/bpf/libbpf.h            |   3 +
 samples/bpf/test_cgrp2_attach.c | 147 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 173 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 12b7304..e4cdc74 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -49,6 +50,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..9ce707b 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+	union bpf_attr attr = {
+		.target_fd = target_fd,
+		.attach_bpf_fd = prog_fd,
+		.attach_type = type,
+	};
+
+	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+	union bpf_attr attr = {
+		.target_fd = target_fd,
+		.attach_type = type,
+	};
+
+	return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 int bpf_obj_pin(int fd, const char *pathname)
 {
 	union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 364582b..f973241 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 		  const struct bpf_insn *insns, int insn_len,
 		  const char *license, int kern_version);
 
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 0000000..19e4ec0
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,147 @@
+/* eBPF example program:
+ *
+ * - Creates arraymap in kernel with 4 byte keys and 8 byte values
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program accesses the map passed in to store two pieces of
+ *   information. The number of invocations of the program, which maps
+ *   to the number of packets received, is stored to key 0. Key 1 is
+ *   incremented on each iteration by the number of bytes stored in
+ *   the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ *   packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+
+#include <linux/bpf.h>
+
+#include "libbpf.h"
+
+enum {
+	MAP_KEY_PACKETS,
+	MAP_KEY_BYTES,
+};
+
+static int prog_load(int map_fd, int verdict)
+{
+	struct bpf_insn prog[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* save r6 so it's not clobbered by BPF_CALL */
+
+		/* Count packets */
+		BPF_MOV64_IMM(BPF_REG_0, MAP_KEY_PACKETS), /* r0 = 0 */
+		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+		BPF_LD_MAP_FD(BPF_REG_1, map_fd), /* load map fd to r1 */
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+
+		/* Count bytes */
+		BPF_MOV64_IMM(BPF_REG_0, MAP_KEY_BYTES), /* r0 = 1 */
+		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_6, offsetof(struct __sk_buff, len)), /* r1 = skb->len */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+
+		BPF_MOV64_IMM(BPF_REG_0, verdict), /* r0 = verdict */
+		BPF_EXIT_INSN(),
+	};
+
+	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCKET,
+			     prog, sizeof(prog), "GPL", 0);
+}
+
+static int usage(const char *argv0)
+{
+	printf("Usage: %s <cg-path> <egress|ingress> [drop]\n", argv0);
+	return EXIT_FAILURE;
+}
+
+int main(int argc, char **argv)
+{
+	int cg_fd, map_fd, prog_fd, key, ret;
+	long long pkt_cnt, byte_cnt;
+	enum bpf_attach_type type;
+	int verdict = 1;
+
+	if (argc < 3)
+		return usage(argv[0]);
+
+	if (strcmp(argv[2], "ingress") == 0)
+		type = BPF_CGROUP_INET_INGRESS;
+	else if (strcmp(argv[2], "egress") == 0)
+		type = BPF_CGROUP_INET_EGRESS;
+	else
+		return usage(argv[0]);
+
+	if (argc > 3 && strcmp(argv[3], "drop") == 0)
+		verdict = 0;
+
+	cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
+	if (cg_fd < 0) {
+		printf("Failed to open cgroup path: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY,
+				sizeof(key), sizeof(byte_cnt),
+				256, 0);
+	if (map_fd < 0) {
+		printf("Failed to create map: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	prog_fd = prog_load(map_fd, verdict);
+	printf("Output from kernel verifier:\n%s\n-------\n", bpf_log_buf);
+
+	if (prog_fd < 0) {
+		printf("Failed to load prog: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	ret = bpf_prog_detach(cg_fd, type);
+	printf("bpf_prog_detach() returned '%s' (%d)\n", strerror(errno), errno);
+
+	ret = bpf_prog_attach(prog_fd, cg_fd, type);
+	if (ret < 0) {
+		printf("Failed to attach prog to cgroup: '%s'\n",
+		       strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	while (1) {
+		key = MAP_KEY_PACKETS;
+		assert(bpf_lookup_elem(map_fd, &key, &pkt_cnt) == 0);
+
+		key = MAP_KEY_BYTES;
+		assert(bpf_lookup_elem(map_fd, &key, &byte_cnt) == 0);
+
+		printf("cgroup received %lld packets, %lld bytes\n",
+		       pkt_cnt, byte_cnt);
+		sleep(1);
+	}
+
+	return EXIT_SUCCESS;
+}
-- 
2.5.5

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
  2016-09-12 16:12 [PATCH v5 0/6] Add eBPF hooks for cgroups Daniel Mack
                   ` (3 preceding siblings ...)
       [not found] ` <1473696735-11269-1-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
@ 2016-09-13 11:56 ` Pablo Neira Ayuso
  2016-09-13 13:31   ` Daniel Mack
  2016-09-15  6:36 ` Vincent Bernat
  5 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2016-09-13 11:56 UTC (permalink / raw)
  To: Daniel Mack
  Cc: htejun, daniel, ast, davem, kafai, fw, harald, netdev, sargun, cgroups

Hi,

On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
> This is v5 of the patch set to allow eBPF programs for network
> filtering and accounting to be attached to cgroups, so that they apply
> to all sockets of all tasks placed in that cgroup. The logic also
> allows extension to other cgroup-based eBPF use cases.

1) This infrastructure can only be useful to systemd, or any similar
   orchestration daemon. Look, you can only apply filtering policies
   to processes that are launched by systemd, so this only works
   for server processes. For client processes this infrastructure is
   *racy*, you have to add new processes in runtime to the cgroup,
   thus there will be some little time where no filtering policy
   will be applied. For quality of service, this may be an acceptable
   race, but this is aiming to deploy a filtering policy.

2) This approach looks uninfrastructured to me. This provides a hook
   to push a bpf blob at a place in the stack that deploys a filtering
   policy that is not visible to others. We have interfaces that allow
   us to dump the filtering policy that is being applied, report events
   to enable cooperation between several processes with similar
   capabilities and so on.  For the XDP thing, this ability to push
   blobs may be fine as long as it will not interfere with the stack so
   we can provide an alternative to DPDK in Linux. For tracing, that's
   fine too since it is innocuous. And likely for other applications is
   a good fit. But I don't think this is the case.

> After chatting with Daniel Borkmann and Alexei off-list, we concluded
> that __dev_queue_xmit() is the place where the egress hooks should live
> when eBPF programs need access to the L2 bits of the skb.

3) This egress hook is coming very late, the only reason I find to
   place it at __dev_queue_xmit() is that bpf naturally works with
   layer 2 information in place. But this new hook is placed in
   _everyone's output path_ that only works for the very specific
   usecase I exposed above.

The main concern during the workshop was that a hook only for cgroups
is too specific, but this is actually even more specific than this.

I have nothing against systemd or the needs for more
programmability/flexibility in the stack, but I think this needs to
fulfill some requirements to fit into the infrastructure that we have
in the right way.

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
  2016-09-13 11:56 ` [PATCH v5 0/6] Add eBPF hooks for cgroups Pablo Neira Ayuso
@ 2016-09-13 13:31   ` Daniel Mack
       [not found]     ` <da300784-284c-0d1f-a82e-aa0a0f8ae116-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Mack @ 2016-09-13 13:31 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hi,

On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
>> This is v5 of the patch set to allow eBPF programs for network
>> filtering and accounting to be attached to cgroups, so that they apply
>> to all sockets of all tasks placed in that cgroup. The logic also
>> allows extension to other cgroup-based eBPF use cases.
> 
> 1) This infrastructure can only be useful to systemd, or any similar
>    orchestration daemon. Look, you can only apply filtering policies
>    to processes that are launched by systemd, so this only works
>    for server processes.

Sorry, but both statements aren't true. The eBPF policies apply to every
process that is placed in a cgroup, and my example program in 6/6 shows
how that can be done from the command line. Also, systemd is able to
control userspace processes just fine, and it not limited to 'server
processes'.

> For client processes this infrastructure is
>    *racy*, you have to add new processes in runtime to the cgroup,
>    thus there will be some little time where no filtering policy
>    will be applied. For quality of service, this may be an acceptable
>    race, but this is aiming to deploy a filtering policy.

That's a limitation that applies to many more control mechanisms in the
kernel, and it's something that can easily be solved with fork+exec.

> 2) This approach looks uninfrastructured to me. This provides a hook
>    to push a bpf blob at a place in the stack that deploys a filtering
>    policy that is not visible to others.

That's just as transparent as SO_ATTACH_FILTER. What kind of
introspection mechanism do you have in mind?

> We have interfaces that allow
>    us to dump the filtering policy that is being applied, report events
>    to enable cooperation between several processes with similar
>    capabilities and so on.

Well, in practice, for netfilter, there can only be one instance in the
system that acts as the central authority, otherwise you'll end up with
orphaned entries or with situations where some client deletes rules
behind the back of the one that originally installed it. So I really
think there is nothing wrong with demanding a single, privileged
controller to manage things.

>> After chatting with Daniel Borkmann and Alexei off-list, we concluded
>> that __dev_queue_xmit() is the place where the egress hooks should live
>> when eBPF programs need access to the L2 bits of the skb.
> 
> 3) This egress hook is coming very late, the only reason I find to
>    place it at __dev_queue_xmit() is that bpf naturally works with
>    layer 2 information in place. But this new hook is placed in
>    _everyone's output path_ that only works for the very specific
>    usecase I exposed above.

It's about filtering outgoing network packets of applications, and
providing them with L2 information for filtering purposes. I don't think
that's a very specific use-case.

When the feature is not used at all, the added costs on the output path
are close to zero, due to the use of static branches. If used somewhere
in the system but not for the packet in flight, costs are slightly
higher but acceptable. In fact, it's not even measurable in my tests
here. How is that different from the netfilter OUTPUT hook, btw?

That said, limiting it to L3 is still an option. It's just that we need
ingress and egress to be in sync, so both would be L3 then. So far, the
possible advantages for future use-cases having access to L2 outweighed
the concerns of putting the hook to dev_queue_xmit(), but I'm open to
discussing that.

> The main concern during the workshop was that a hook only for cgroups
> is too specific, but this is actually even more specific than this.

This patch set merely implements an infrastructure that can accommodate
many more things as well in the future. We could, in theory, even add
hooks for forwarded packets specifically, or other eBPF programs, not
even for network filtering etc.

> I have nothing against systemd or the needs for more
> programmability/flexibility in the stack, but I think this needs to
> fulfill some requirements to fit into the infrastructure that we have
> in the right way.

Well, as I explained already, this patch set results from endless
discussions that went nowhere, about how such a thing can be achieved
with netfilter.


Thanks,
Daniel

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
       [not found]     ` <da300784-284c-0d1f-a82e-aa0a0f8ae116-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
@ 2016-09-13 14:14       ` Daniel Borkmann
  2016-09-13 17:24       ` Pablo Neira Ayuso
  1 sibling, 0 replies; 27+ messages in thread
From: Daniel Borkmann @ 2016-09-13 14:14 UTC (permalink / raw)
  To: Daniel Mack, Pablo Neira Ayuso
  Cc: htejun-b10kYP2dOMg, ast-b10kYP2dOMg,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On 09/13/2016 03:31 PM, Daniel Mack wrote:
> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
>> On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
>>> This is v5 of the patch set to allow eBPF programs for network
>>> filtering and accounting to be attached to cgroups, so that they apply
>>> to all sockets of all tasks placed in that cgroup. The logic also
>>> allows extension to other cgroup-based eBPF use cases.
>>
>> 1) This infrastructure can only be useful to systemd, or any similar
>>     orchestration daemon. Look, you can only apply filtering policies
>>     to processes that are launched by systemd, so this only works
>>     for server processes.
>
> Sorry, but both statements aren't true. The eBPF policies apply to every
> process that is placed in a cgroup, and my example program in 6/6 shows
> how that can be done from the command line. Also, systemd is able to
> control userspace processes just fine, and it not limited to 'server
> processes'.
>
>> For client processes this infrastructure is
>>     *racy*, you have to add new processes in runtime to the cgroup,
>>     thus there will be some little time where no filtering policy
>>     will be applied. For quality of service, this may be an acceptable
>>     race, but this is aiming to deploy a filtering policy.
>
> That's a limitation that applies to many more control mechanisms in the
> kernel, and it's something that can easily be solved with fork+exec.
>
>> 2) This approach looks uninfrastructured to me. This provides a hook
>>     to push a bpf blob at a place in the stack that deploys a filtering
>>     policy that is not visible to others.
>
> That's just as transparent as SO_ATTACH_FILTER. What kind of
> introspection mechanism do you have in mind?
>
>> We have interfaces that allow
>>     us to dump the filtering policy that is being applied, report events
>>     to enable cooperation between several processes with similar
>>     capabilities and so on.
>
> Well, in practice, for netfilter, there can only be one instance in the
> system that acts as the central authority, otherwise you'll end up with
> orphaned entries or with situations where some client deletes rules
> behind the back of the one that originally installed it. So I really
> think there is nothing wrong with demanding a single, privileged
> controller to manage things.
>
>>> After chatting with Daniel Borkmann and Alexei off-list, we concluded
>>> that __dev_queue_xmit() is the place where the egress hooks should live
>>> when eBPF programs need access to the L2 bits of the skb.
>>
>> 3) This egress hook is coming very late, the only reason I find to
>>     place it at __dev_queue_xmit() is that bpf naturally works with
>>     layer 2 information in place. But this new hook is placed in
>>     _everyone's output path_ that only works for the very specific
>>     usecase I exposed above.
>
> It's about filtering outgoing network packets of applications, and
> providing them with L2 information for filtering purposes. I don't think
> that's a very specific use-case.
>
> When the feature is not used at all, the added costs on the output path
> are close to zero, due to the use of static branches. If used somewhere
> in the system but not for the packet in flight, costs are slightly
> higher but acceptable. In fact, it's not even measurable in my tests
> here. How is that different from the netfilter OUTPUT hook, btw?
>
> That said, limiting it to L3 is still an option. It's just that we need
> ingress and egress to be in sync, so both would be L3 then. So far, the
> possible advantages for future use-cases having access to L2 outweighed
> the concerns of putting the hook to dev_queue_xmit(), but I'm open to
> discussing that.

While I fully disagree with Pablo's point 1) and 2), in the last set I
raised a similar concern as in point 3) wrt __dev_queue_xmit(). The set
as-is would indeed need the L2 info, since a filter could do a load via
LLVM built-ins such as asm("llvm.bpf.load.byte") et al, with BPF_LL_OFF,
where we're forced to do a load relative to skb_mac_header(). As stated
by Daniel already, it would be nice to see the full frame, so it comes
down to a trade-off, but the option of L3 onwards also exists and BPF can
work just fine with it, too. This just means it's placed in the local
output path and the verifier would need to disallow these built-ins during
bpf(2) load time. They are a rather cumbersome legacy anyway, so
bpf_skb_load_bytes() helper can be used instead, which is also easier
to use.
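
Just to make the two options concrete (a sketch only; the declarations
below follow the conventions used in samples/bpf):

	/* legacy LLVM built-in; with BPF_LL_OFF the load is done relative
	 * to skb_mac_header(), i.e. it needs the L2 info to be present */
	unsigned long long load_byte(void *skb,
				     unsigned long long off) asm("llvm.bpf.load.byte");

	/* helper alternative; takes an offset relative to skb->data and
	 * does not rely on the MAC header */
	static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
		(void *) BPF_FUNC_skb_load_bytes;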

>> The main concern during the workshop was that a hook only for cgroups
>> is too specific, but this is actually even more specific than this.
>
> This patch set merely implements an infrastructure that can accommodate
> many more things as well in the future. We could, in theory, even add
> hooks for forwarded packets specifically, or other eBPF programs, not
> even for network filtering etc.
>
>> I have nothing against systemd or the needs for more
>> programmability/flexibility in the stack, but I think this needs to
>> fulfill some requirements to fit into the infrastructure that we have
>> in the right way.
>
> Well, as I explained already, this patch set results from endless
> discussions that went nowhere, about how such a thing can be achieved
> with netfilter.
>
>
> Thanks,
> Daniel
>

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
       [not found]     ` <da300784-284c-0d1f-a82e-aa0a0f8ae116-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
  2016-09-13 14:14       ` Daniel Borkmann
@ 2016-09-13 17:24       ` Pablo Neira Ayuso
  2016-09-14  4:42         ` Alexei Starovoitov
  2016-09-14 11:13         ` Daniel Mack
  1 sibling, 2 replies; 27+ messages in thread
From: Pablo Neira Ayuso @ 2016-09-13 17:24 UTC (permalink / raw)
  To: Daniel Mack
  Cc: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
> Hi,
> 
> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> > On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
> >> This is v5 of the patch set to allow eBPF programs for network
> >> filtering and accounting to be attached to cgroups, so that they apply
> >> to all sockets of all tasks placed in that cgroup. The logic also
> >> allows extension to other cgroup-based eBPF use cases.
> > 
> > 1) This infrastructure can only be useful to systemd, or any similar
> >    orchestration daemon. Look, you can only apply filtering policies
> >    to processes that are launched by systemd, so this only works
> >    for server processes.
> 
> Sorry, but both statements aren't true. The eBPF policies apply to every
> process that is placed in a cgroup, and my example program in 6/6 shows
> how that can be done from the command line.

Then you have to explain to me how anyone other than systemd can use this
infrastructure?

> Also, systemd is able to control userspace processes just fine, and
> it is not limited to 'server processes'.

My main point is that those processes *need* to be launched by the
orchestrator, which is what I was referring to as 'server processes'.

> > For client processes this infrastructure is
> >    *racy*, you have to add new processes in runtime to the cgroup,
> >    thus there will be some little time where no filtering policy
> >    will be applied. For quality of service, this may be an acceptable
> >    race, but this is aiming to deploy a filtering policy.
> 
> That's a limitation that applies to many more control mechanisms in the
> kernel, and it's something that can easily be solved with fork+exec.

As long as you have control over launching the processes, yes, but this
will not work in other scenarios. Just like cgroup net_cls and friends
are broken for filtering things that you have no control to
fork+exec.

To use this infrastructure from a non-launcher process, you'll have to
rely on the proc connector to subscribe to new process events, then
echo that pid to the cgroup, and that interface is asynchronous so
*adding new processes to the cgroup is subject to races*.

> > 2) This approach looks uninfrastructured to me. This provides a hook
> >    to push a bpf blob at a place in the stack that deploys a filtering
> >    policy that is not visible to others.
> 
> That's just as transparent as SO_ATTACH_FILTER. What kind of
> introspection mechanism do you have in mind?

SO_ATTACH_FILTER is called from the process itself, so this is a local
filtering policy that you apply to your own process.

In this case, this filtering policy is *global*, other processes with
similar capabilities can get just a bpf blob at best...

[...]
> >> After chatting with Daniel Borkmann and Alexei off-list, we concluded
> >> that __dev_queue_xmit() is the place where the egress hooks should live
> >> when eBPF programs need access to the L2 bits of the skb.
> > 
> > 3) This egress hook is coming very late, the only reason I find to
> >    place it at __dev_queue_xmit() is that bpf naturally works with
> >    layer 2 information in place. But this new hook is placed in
> >    _everyone's output path_ that only works for the very specific
> >    usecase I exposed above.
> 
> It's about filtering outgoing network packets of applications, and
> providing them with L2 information for filtering purposes. I don't think
> that's a very specific use-case.
> 
> When the feature is not used at all, the added costs on the output path
> are close to zero, due to the use of static branches.

*You're proposing a socket filtering facility that hooks the layer 2
output path*!

[...]
> > I have nothing against systemd or the needs for more
> > programmability/flexibility in the stack, but I think this needs to
> > fulfill some requirements to fit into the infrastructure that we have
> > in the right way.
> 
> Well, as I explained already, this patch set results from endless
> discussions that went nowhere, about how such a thing can be achieved
> with netfilter.

Supporting this in netfilter would only take a rough ~30-line kernel
patchset and a single extra input hook, with potential access to
conntrack and better integration with other existing subsystems.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
  2016-09-13 17:24       ` Pablo Neira Ayuso
@ 2016-09-14  4:42         ` Alexei Starovoitov
  2016-09-14  9:03           ` Thomas Graf
       [not found]           ` <20160914044217.GA44742-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
  2016-09-14 11:13         ` Daniel Mack
  1 sibling, 2 replies; 27+ messages in thread
From: Alexei Starovoitov @ 2016-09-14  4:42 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Daniel Mack, htejun, daniel, ast, davem, kafai, fw, harald,
	netdev, sargun, cgroups

On Tue, Sep 13, 2016 at 07:24:08PM +0200, Pablo Neira Ayuso wrote:
> On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
> > Hi,
> > 
> > On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> > > On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
> > >> This is v5 of the patch set to allow eBPF programs for network
> > >> filtering and accounting to be attached to cgroups, so that they apply
> > >> to all sockets of all tasks placed in that cgroup. The logic also
> > >> allows to be extendeded for other cgroup based eBPF logic.
> > > 
> > > 1) This infrastructure can only be useful to systemd, or any similar
> > >    orchestration daemon. Look, you can only apply filtering policies
> > >    to processes that are launched by systemd, so this only works
> > >    for server processes.
> > 
> > Sorry, but both statements aren't true. The eBPF policies apply to every
> > process that is placed in a cgroup, and my example program in 6/6 shows
> > how that can be done from the command line.
> 
> Then you have to explain me how can anyone else than systemd use this
> infrastructure?

Sounds like systemd and bpf phobia combined :)
Jokes aside, I'm puzzled why systemd is even being mentioned here.
Here we use Tupperware (our internal container management system), which
makes heavy use of cgroups and has nothing to do with systemd.
We're working as part of the Open Container Initiative, so hopefully soon
all container management systems will benefit from what we're building.
cgroups and bpf are a crucial part of this process.

> > Also, systemd is able to control userspace processes just fine, and
> > it not limited to 'server processes'.
> 
> My main point is that those processes *need* to be launched by the
> orchestrator, which is was refering as 'server processes'.

I have no experience with systemd, so I cannot comment on it,
but that statement is not true for our stuff.

> > > For client processes this infrastructure is
> > >    *racy*, you have to add new processes in runtime to the cgroup,
> > >    thus there will be time some little time where no filtering policy
> > >    will be applied. For quality of service, this may be an acceptable
> > >    race, but this is aiming to deploy a filtering policy.
> > 
> > That's a limitation that applies to many more control mechanisms in the
> > kernel, and it's something that can easily be solved with fork+exec.
> 
> As long as you have control to launch the processes yes, but this
> will not work in other scenarios. Just like cgroup net_cls and friends
> are broken for filtering for things that you have no control to
> fork+exec.

not true

> To use this infrastructure from a non-launcher process, you'll have to
> rely on the proc connection to subscribe to new process events, then
> echo that pid to the cgroup, and that interface is asynchronous so
> *adding new processes to the cgroup is subject to races*.

In general, that's not true either. Have you worked with cgroups, or are you just speculating?
 
> *You're proposing a socket filtering facility that hooks layer 2
> output path*!

Flashback: not too long ago you were beating the drum about the netfilter
ingress hook operating at layer 2... It sounds like nobody used it
and that was a bad call? Should we remove that netfilter hook, then?

Our use case is different from Daniel's.
For us this cgroup+bpf is _not_ for filtering and _not_ for security.
We run a ton of tasks in cgroups that launch all sorts of
things on their own. We need to monitor what they do from a networking
point of view. Therefore bpf programs need to monitor the traffic in a
particular part of the cgroup hierarchy. Not globally, and no pass/drop
decisions. The monitoring itself is complicated. For example, we need to
group and aggregate within the bpf program based on certain bits of the
ipv6 address, and so on. bpf is the only programmable engine that can do
this job; nft is simply not flexible enough for that.
I'd really love to have an alternative to bpf for such tasks,
but you seem to spend all your energy arguing against bpf while
nft still leaves a lot to be desired.
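
As an aside, that kind of aggregation could be sketched as a restricted-C eBPF
program along the following lines. This is a minimal illustration only: it
assumes a /64 grouping key, an L3 attachment point, and the pass/drop return
convention of this series (nonzero = pass), with the map and section
conventions borrowed from samples/bpf rather than taken from these patches:

#include <uapi/linux/bpf.h>
#include <uapi/linux/ipv6.h>
#include "bpf_helpers.h"

/* bytes seen per IPv6 /64 source prefix */
struct bpf_map_def SEC("maps") bytes_per_prefix = {
	.type        = BPF_MAP_TYPE_HASH,
	.key_size    = sizeof(__u64),
	.value_size  = sizeof(__u64),
	.max_entries = 1024,
};

SEC("cgroup/egress")
int count_by_prefix(struct __sk_buff *skb)
{
	struct ipv6hdr ip6;
	__u64 key, zero = 0, *val;

	/* assumes an L3 hook, i.e. no Ethernet header in front */
	if (bpf_skb_load_bytes(skb, 0, &ip6, sizeof(ip6)) < 0)
		return 1;
	if (ip6.version != 6)
		return 1;

	key = *(__u64 *)&ip6.saddr;	/* upper 64 bits = /64 prefix */

	val = bpf_map_lookup_elem(&bytes_per_prefix, &key);
	if (!val) {
		bpf_map_update_elem(&bytes_per_prefix, &key, &zero, BPF_ANY);
		val = bpf_map_lookup_elem(&bytes_per_prefix, &key);
	}
	if (val)
		__sync_fetch_and_add(val, skb->len);

	return 1;	/* monitoring only, never drop */
}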

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
  2016-09-14  4:42         ` Alexei Starovoitov
@ 2016-09-14  9:03           ` Thomas Graf
       [not found]           ` <20160914044217.GA44742-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
  1 sibling, 0 replies; 27+ messages in thread
From: Thomas Graf @ 2016-09-14  9:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Pablo Neira Ayuso, Daniel Mack, htejun, daniel, ast, davem,
	kafai, fw, harald, netdev, sargun, cgroups

[Sorry for the repost, gmail decided to start sending HTML crap along
 overnight for some reason]

On 09/13/16 at 09:42pm, Alexei Starovoitov wrote:
> On Tue, Sep 13, 2016 at 07:24:08PM +0200, Pablo Neira Ayuso wrote:
> > Then you have to explain me how can anyone else than systemd use this
> > infrastructure?
> 
> Jokes aside. I'm puzzled why systemd is even being mentioned here.
> Here we use tupperware (our internal container management system) that
> is heavily using cgroups and has nothing to do with systemd.

Just confirming that we are planning to use this decoupled from
systemd as well.  I fail to see how this is at all systemd specific.

> For us this cgroup+bpf is _not_ for filterting and _not_ for security.
> We run a ton of tasks in cgroups that launch all sorts of
> things on their own. We need to monitor what they do from networking
> point of view. Therefore bpf programs need to monitor the traffic in
> particular part of cgroup hierarchy. Not globally and no pass/drop decisions.

+10. Although filtering/drop is a valid use case, the really strong
use case is definitely introspection at the networking level: statistics,
monitoring, verification of application correctness, etc.

I don't see why this is at all an either or discussion. If nft wants
cgroups integration similar to this effort, I see no reason why that
should stop this effort.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
       [not found]           ` <20160914044217.GA44742-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
@ 2016-09-14 10:30             ` Pablo Neira Ayuso
  2016-09-14 11:06               ` Thomas Graf
  2016-09-14 11:36               ` Daniel Borkmann
  0 siblings, 2 replies; 27+ messages in thread
From: Pablo Neira Ayuso @ 2016-09-14 10:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Mack, htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Tue, Sep 13, 2016 at 09:42:19PM -0700, Alexei Starovoitov wrote:
[...]
> For us this cgroup+bpf is _not_ for filterting and _not_ for security.

If your goal is monitoring, then convert these hooks so that they cannot
issue a verdict on the packet; that way this becomes innocuous, in the
same fashion as the tracing infrastructure.

[...]
> I'd really love to have an alternative to bpf for such tasks,
> but you seem to spend all the energy arguing against bpf whereas
> nft still has a lot to be desired.

Please, Alexei, stop that FUD. Anyone who has spent just one day using
the bpf tooling and infrastructure knows you have problems to
resolve...

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
  2016-09-14 10:30             ` Pablo Neira Ayuso
@ 2016-09-14 11:06               ` Thomas Graf
  2016-09-14 11:36               ` Daniel Borkmann
  1 sibling, 0 replies; 27+ messages in thread
From: Thomas Graf @ 2016-09-14 11:06 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Alexei Starovoitov, Daniel Mack, htejun-b10kYP2dOMg,
	daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On 09/14/16 at 12:30pm, Pablo Neira Ayuso wrote:
> On Tue, Sep 13, 2016 at 09:42:19PM -0700, Alexei Starovoitov wrote:
> [...]
> > For us this cgroup+bpf is _not_ for filterting and _not_ for security.
> 
> If your goal is monitoring, then convert these hooks not to allow to
> issue a verdict on the packet, so this becomes inoquous in the same
> fashion as the tracing infrastructure.

Why? How is this at all offensive? We have three parties voicing
interest in this work for both monitoring and security. At least
two specific use cases have been described.  It builds on top of
existing infrastructure and nicely complements other ongoing work.
Why not both?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
  2016-09-13 17:24       ` Pablo Neira Ayuso
  2016-09-14  4:42         ` Alexei Starovoitov
@ 2016-09-14 11:13         ` Daniel Mack
       [not found]           ` <6de6809a-13f5-4000-5639-c760dde30223-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
  2016-09-16 19:57           ` Sargun Dhillon
  1 sibling, 2 replies; 27+ messages in thread
From: Daniel Mack @ 2016-09-14 11:13 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: htejun, daniel, ast, davem, kafai, fw, harald, netdev, sargun, cgroups

Hi Pablo,

On 09/13/2016 07:24 PM, Pablo Neira Ayuso wrote:
> On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
>> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
>>> On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
>>>> This is v5 of the patch set to allow eBPF programs for network
>>>> filtering and accounting to be attached to cgroups, so that they apply
>>>> to all sockets of all tasks placed in that cgroup. The logic also
>>>> allows to be extendeded for other cgroup based eBPF logic.
>>>
>>> 1) This infrastructure can only be useful to systemd, or any similar
>>>    orchestration daemon. Look, you can only apply filtering policies
>>>    to processes that are launched by systemd, so this only works
>>>    for server processes.
>>
>> Sorry, but both statements aren't true. The eBPF policies apply to every
>> process that is placed in a cgroup, and my example program in 6/6 shows
>> how that can be done from the command line.
> 
> Then you have to explain me how can anyone else than systemd use this
> infrastructure?

I have no idea what makes you think this is limited to systemd. As I
said, I provided an example for userspace that works from the command
line. The same limitations apply as for all other users of cgroups.

> My main point is that those processes *need* to be launched by the
> orchestrator, which is was refering as 'server processes'.

Yes, that's right. But as I said, this rule applies to many other kernel
concepts, so I don't see any real issue.

>> That's a limitation that applies to many more control mechanisms in the
>> kernel, and it's something that can easily be solved with fork+exec.
> 
> As long as you have control to launch the processes yes, but this
> will not work in other scenarios. Just like cgroup net_cls and friends
> are broken for filtering for things that you have no control to
> fork+exec.

Probably, but that's only solvable with rules that store the full cgroup
path and then do a string comparison (!) for each packet flying by.

>> That's just as transparent as SO_ATTACH_FILTER. What kind of
>> introspection mechanism do you have in mind?
> 
> SO_ATTACH_FILTER is called from the process itself, so this is a local
> filtering policy that you apply to your own process.

Not necessarily. You can just as well do it the inetd way and pass the
socket to a process that is launched on demand, but do SO_ATTACH_FILTER
+ SO_LOCK_FILTER in the middle. What happens with the payload on the socket
is not transparent to the launched binary at all. The proposed cgroup
eBPF solution implements very similar behavior in that regard.
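
For illustration, that inetd-style sequence could look roughly like this;
the filter below is a trivial accept-all classic BPF program and
/usr/local/bin/handler is a placeholder for the on-demand binary:

#include <errno.h>
#include <linux/filter.h>
#include <sys/socket.h>
#include <unistd.h>

/* Launch the handler with a socket that already carries a locked filter. */
static int spawn_with_locked_filter(int sock)
{
	/* Trivial classic BPF program: accept every packet in full.
	 * A real policy would obviously be more interesting. */
	struct sock_filter insns[] = {
		BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
	};
	struct sock_fprog prog = {
		.len    = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};
	int one = 1;

	if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)))
		return -errno;
	/* Prevent the launched binary from detaching or replacing it. */
	if (setsockopt(sock, SOL_SOCKET, SO_LOCK_FILTER, &one, sizeof(one)))
		return -errno;

	if (fork() == 0) {
		/* The handler inherits the socket as fd 0, unaware that a
		 * filter already decides which payload it will ever see. */
		dup2(sock, 0);
		execl("/usr/local/bin/handler", "handler", (char *)NULL);
		_exit(127);
	}
	return 0;
}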

>> It's about filtering outgoing network packets of applications, and
>> providing them with L2 information for filtering purposes. I don't think
>> that's a very specific use-case.
>>
>> When the feature is not used at all, the added costs on the output path
>> are close to zero, due to the use of static branches.
> 
> *You're proposing a socket filtering facility that hooks layer 2
> output path*!

As I said, I'm open to discussing that. In order to make it work for L3,
the LL_OFF issues need to be solved, as Daniel explained. Daniel,
Alexei, any idea how much work that would be?

> That is only a rough ~30 lines kernel patchset to support this in
> netfilter and only one extra input hook, with potential access to
> conntrack and better integration with other existing subsystems.

Care to share the patches for that? I'd really like to have a look.

And FWIW, I agree with Thomas - there is nothing wrong with having
multiple options to use for such use-cases.


Thanks,
Daniel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
  2016-09-14 10:30             ` Pablo Neira Ayuso
  2016-09-14 11:06               ` Thomas Graf
@ 2016-09-14 11:36               ` Daniel Borkmann
  1 sibling, 0 replies; 27+ messages in thread
From: Daniel Borkmann @ 2016-09-14 11:36 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Alexei Starovoitov
  Cc: Daniel Mack, htejun, ast, davem, kafai, fw, harald, netdev,
	sargun, cgroups

On 09/14/2016 12:30 PM, Pablo Neira Ayuso wrote:
> On Tue, Sep 13, 2016 at 09:42:19PM -0700, Alexei Starovoitov wrote:
> [...]
>> For us this cgroup+bpf is _not_ for filterting and _not_ for security.
>
> If your goal is monitoring, then convert these hooks not to allow to
> issue a verdict on the packet, so this becomes inoquous in the same
> fashion as the tracing infrastructure.
>
> [...]
>> I'd really love to have an alternative to bpf for such tasks,
>> but you seem to spend all the energy arguing against bpf whereas
>> nft still has a lot to be desired.
>
> Please Alexei, stop that FUD. Anyone that has spent just one day using
> the bpf tooling and infrastructure knows you have problems to
> resolve...

Not quite sure about the spreading of FUD, but it sounds like we should all
get back to resolving technical matters. ;)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
       [not found]           ` <6de6809a-13f5-4000-5639-c760dde30223-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
@ 2016-09-14 11:42             ` Daniel Borkmann
       [not found]               ` <57D937B9.2090100-FeC+5ew28dpmcu3hnIyYJQ@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Borkmann @ 2016-09-14 11:42 UTC (permalink / raw)
  To: Daniel Mack, Pablo Neira Ayuso
  Cc: htejun-b10kYP2dOMg, ast-b10kYP2dOMg,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On 09/14/2016 01:13 PM, Daniel Mack wrote:
> On 09/13/2016 07:24 PM, Pablo Neira Ayuso wrote:
>> On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
>>> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
>>>> On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
>>>>> This is v5 of the patch set to allow eBPF programs for network
>>>>> filtering and accounting to be attached to cgroups, so that they apply
>>>>> to all sockets of all tasks placed in that cgroup. The logic also
>>>>> allows to be extendeded for other cgroup based eBPF logic.
>>>>
>>>> 1) This infrastructure can only be useful to systemd, or any similar
>>>>     orchestration daemon. Look, you can only apply filtering policies
>>>>     to processes that are launched by systemd, so this only works
>>>>     for server processes.
>>>
>>> Sorry, but both statements aren't true. The eBPF policies apply to every
>>> process that is placed in a cgroup, and my example program in 6/6 shows
>>> how that can be done from the command line.
>>
>> Then you have to explain me how can anyone else than systemd use this
>> infrastructure?
>
> I have no idea what makes you think this is limited to systemd. As I
> said, I provided an example for userspace that works from the command
> line. The same limitation apply as for all other users of cgroups.
>
>> My main point is that those processes *need* to be launched by the
>> orchestrator, which is was refering as 'server processes'.
>
> Yes, that's right. But as I said, this rule applies to many other kernel
> concepts, so I don't see any real issue.
>
>>> That's a limitation that applies to many more control mechanisms in the
>>> kernel, and it's something that can easily be solved with fork+exec.
>>
>> As long as you have control to launch the processes yes, but this
>> will not work in other scenarios. Just like cgroup net_cls and friends
>> are broken for filtering for things that you have no control to
>> fork+exec.
>
> Probably, but that's only solvable with rules that store the full cgroup
> path then, and do a string comparison (!) for each packet flying by.
>
>>> That's just as transparent as SO_ATTACH_FILTER. What kind of
>>> introspection mechanism do you have in mind?
>>
>> SO_ATTACH_FILTER is called from the process itself, so this is a local
>> filtering policy that you apply to your own process.
>
> Not necessarily. You can as well do it the inetd way, and pass the
> socket to a process that is launched on demand, but do SO_ATTACH_FILTER
> + SO_LOCK_FILTER  in the middle. What happens with payload on the socket
> is not transparent to the launched binary at all. The proposed cgroup
> eBPF solution implements a very similar behavior in that regard.
>
>>> It's about filtering outgoing network packets of applications, and
>>> providing them with L2 information for filtering purposes. I don't think
>>> that's a very specific use-case.
>>>
>>> When the feature is not used at all, the added costs on the output path
>>> are close to zero, due to the use of static branches.
>>
>> *You're proposing a socket filtering facility that hooks layer 2
>> output path*!
>
> As I said, I'm open to discussing that. In order to make it work for L3,
> the LL_OFF issues need to be solved, as Daniel explained. Daniel,
> Alexei, any idea how much work that would be?

Not much. You simply need to declare your own struct bpf_verifier_ops
with a get_func_proto() handler that handles BPF_FUNC_skb_load_bytes,
and the verifier's do_check() loop would need to reject ld_abs/ld_ind
instructions for BPF_PROG_TYPE_CGROUP_SOCKET.
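
Sketched out, that could look roughly like the following; the sk_filter_*
callbacks reused here are assumptions about how the final wiring would look,
not code from this series:

/* Sketch only: allow bpf_skb_load_bytes() for the cgroup socket program
 * type instead of ld_abs/ld_ind, reusing the socket filter callbacks. */
static const struct bpf_func_proto *
cg_skb_func_proto(enum bpf_func_id func_id)
{
	switch (func_id) {
	case BPF_FUNC_skb_load_bytes:
		return &bpf_skb_load_bytes_proto;
	default:
		return sk_filter_func_proto(func_id);
	}
}

static const struct bpf_verifier_ops cg_skb_ops = {
	.get_func_proto		= cg_skb_func_proto,
	.is_valid_access	= sk_filter_is_valid_access,
	.convert_ctx_access	= sk_filter_convert_ctx_access,
};

/* In addition, the verifier's do_check() loop would reject
 * BPF_LD | BPF_ABS and BPF_LD | BPF_IND instructions whenever
 * prog->type == BPF_PROG_TYPE_CGROUP_SOCKET. */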

>> That is only a rough ~30 lines kernel patchset to support this in
>> netfilter and only one extra input hook, with potential access to
>> conntrack and better integration with other existing subsystems.
>
> Care to share the patches for that? I'd really like to have a look.
>
> And FWIW, I agree with Thomas - there is nothing wrong with having
> multiple options to use for such use-cases.
>
>
> Thanks,
> Daniel
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
       [not found]               ` <57D937B9.2090100-FeC+5ew28dpmcu3hnIyYJQ@public.gmane.org>
@ 2016-09-14 15:55                 ` Alexei Starovoitov
  0 siblings, 0 replies; 27+ messages in thread
From: Alexei Starovoitov @ 2016-09-14 15:55 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Daniel Mack, Pablo Neira Ayuso, htejun-b10kYP2dOMg,
	ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Sep 14, 2016 at 01:42:49PM +0200, Daniel Borkmann wrote:
> >As I said, I'm open to discussing that. In order to make it work for L3,
> >the LL_OFF issues need to be solved, as Daniel explained. Daniel,
> >Alexei, any idea how much work that would be?
> 
> Not much. You simply need to declare your own struct bpf_verifier_ops
> with a get_func_proto() handler that handles BPF_FUNC_skb_load_bytes,
> and verifier in do_check() loop would need to handle that these ld_abs/
> ld_ind are rejected for BPF_PROG_TYPE_CGROUP_SOCKET.

Yep, that part is solvable.
I'm still torn between l2 and l3.
On one side, it sucks to lose the l2 information. Yet we don't have a use
case that needs to look into l2 for our container monitoring, so the only
thing the lack of l2 will do is confuse byte accounting, since instead of
skb->len we'd need to do skb->len + ETH_HLEN...
but I guess vlan handling messes that up as well.
On the other side, doing it at the socket level means we can drop these
checks:
+       if (!sk || !sk_fullsock(sk))
+               return 0;
+
+       if (sk->sk_family != AF_INET &&
+           sk->sk_family != AF_INET6)
+               return 0;
which will make it even faster when it's on.
So I don't mind either l2 or l3. I guess if the l3 approach proves to be
limiting, we can add l2 later?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
  2016-09-12 16:12 [PATCH v5 0/6] Add eBPF hooks for cgroups Daniel Mack
                   ` (4 preceding siblings ...)
  2016-09-13 11:56 ` [PATCH v5 0/6] Add eBPF hooks for cgroups Pablo Neira Ayuso
@ 2016-09-15  6:36 ` Vincent Bernat
       [not found]   ` <m3y42tlldz.fsf-PiWSfznZvZU/eRriIvX0kg@public.gmane.org>
  5 siblings, 1 reply; 27+ messages in thread
From: Vincent Bernat @ 2016-09-15  6:36 UTC (permalink / raw)
  To: Daniel Mack
  Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev,
	sargun, cgroups

 ❦ 12 September 2016 18:12 CEST, Daniel Mack <daniel@zonque.org>:

> * The sample program learned to support both ingress and egress, and
>   can now optionally make the eBPF program drop packets by making it
>   return 0.

The ability to lock the eBPF program to prevent modification by a later
program or in a subcgroup would be pretty interesting from a security
perspective.
-- 
Use recursive procedures for recursively-defined data structures.
            - The Elements of Programming Style (Kernighan & Plauger)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
       [not found]   ` <m3y42tlldz.fsf-PiWSfznZvZU/eRriIvX0kg@public.gmane.org>
@ 2016-09-15  8:11       ` Daniel Mack
  0 siblings, 0 replies; 27+ messages in thread
From: Daniel Mack @ 2016-09-15  8:11 UTC (permalink / raw)
  To: Vincent Bernat
  Cc: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA

On 09/15/2016 08:36 AM, Vincent Bernat wrote:
>  ❦ 12 septembre 2016 18:12 CEST, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> :
> 
>> * The sample program learned to support both ingress and egress, and
>>   can now optionally make the eBPF program drop packets by making it
>>   return 0.
> 
> Ability to lock the eBPF program to avoid modification from a later
> program or in a subcgroup would be pretty interesting from a security
> perspective.

For now, you can achieve that by dropping CAP_NET_ADMIN after installing
a program, between fork and exec. I think that should suffice for a first
version. Flags to further limit that could be added later.
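
A rough sketch of that fork/attach/drop/exec sequence is shown below; the
bpf_attr field names and BPF_CGROUP_INET_EGRESS follow this series as
proposed, the cgroup path and binary are placeholders, and a complete
version would also clear the permitted/effective capability sets via
capset():

#include <fcntl.h>
#include <linux/bpf.h>
#include <linux/capability.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* prog_fd: an already loaded program of the new cgroup socket type.
 * Assumes the child will run inside the cgroup at cgroup_path. */
static pid_t launch_confined(int prog_fd, const char *cgroup_path,
			     const char *binary)
{
	pid_t pid = fork();

	if (pid != 0)
		return pid;			/* parent */

	/* child: install the policy on the cgroup ... */
	union bpf_attr attr;
	int cg_fd = open(cgroup_path, O_DIRECTORY | O_RDONLY);

	memset(&attr, 0, sizeof(attr));
	attr.target_fd     = cg_fd;		/* field names as */
	attr.attach_bpf_fd = prog_fd;		/* proposed in v5 */
	attr.attach_type   = BPF_CGROUP_INET_EGRESS;
	syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));

	/* ... make sure the exec'ed program cannot replace it ... */
	prctl(PR_CAPBSET_DROP, CAP_NET_ADMIN, 0, 0, 0);

	/* ... and hand over to the payload. */
	execl(binary, binary, (char *)NULL);
	_exit(127);
}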


Thanks,
Daniel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
  2016-09-14 11:13         ` Daniel Mack
       [not found]           ` <6de6809a-13f5-4000-5639-c760dde30223-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
@ 2016-09-16 19:57           ` Sargun Dhillon
       [not found]             ` <20160916195728.GA14736-I4sfFR6g6EicJoAdRrHjTrzMkBWIpU9tytq7g7fCXyjEk0E+pv7Png@public.gmane.org>
  1 sibling, 1 reply; 27+ messages in thread
From: Sargun Dhillon @ 2016-09-16 19:57 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Pablo Neira Ayuso, htejun, daniel, ast, davem, kafai, fw, harald,
	netdev, cgroups

On Wed, Sep 14, 2016 at 01:13:16PM +0200, Daniel Mack wrote:
> Hi Pablo,
> 
> On 09/13/2016 07:24 PM, Pablo Neira Ayuso wrote:
> > On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
> >> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> >>> On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
> >>>> This is v5 of the patch set to allow eBPF programs for network
> >>>> filtering and accounting to be attached to cgroups, so that they apply
> >>>> to all sockets of all tasks placed in that cgroup. The logic also
> >>>> allows to be extendeded for other cgroup based eBPF logic.
> >>>
> >>> 1) This infrastructure can only be useful to systemd, or any similar
> >>>    orchestration daemon. Look, you can only apply filtering policies
> >>>    to processes that are launched by systemd, so this only works
> >>>    for server processes.
> >>
> >> Sorry, but both statements aren't true. The eBPF policies apply to every
> >> process that is placed in a cgroup, and my example program in 6/6 shows
> >> how that can be done from the command line.
> > 
> > Then you have to explain me how can anyone else than systemd use this
> > infrastructure?
> 
> I have no idea what makes you think this is limited to systemd. As I
> said, I provided an example for userspace that works from the command
> line. The same limitation apply as for all other users of cgroups.
> 
So, at least in my work, we have Mesos, but on nearly every machine that Mesos
runs, people also have systemd. There has recently been a bit of a battle over
ownership of things like cgroups on these machines. We can usually solve it
by nesting under systemd's cgroups, and thus so far we've avoided making too
many systemd-specific concessions.

The reason this works (mostly) is that everything we touch has a sense of
nesting, where we can apply policy at a place lower in the hierarchy while
systemd's monitoring and policy still stay in place.

Now, with this patch, we don't have that, but I think we can reasonably add some
flag like "no override" when applying policies, or alternatively something like
"no new privileges", to prevent children from applying policies that override
the top-level policy. I realize there is a speed concern as well, but I think
for people who want nested policy, we're willing to make the tradeoff. For many
of us, the cost of traversing a few extra pointers is still outweighed by the
overhead of network namespaces, iptables, etc.

What do you think, Daniel?

> > My main point is that those processes *need* to be launched by the
> > orchestrator, which is was refering as 'server processes'.
> 
> Yes, that's right. But as I said, this rule applies to many other kernel
> concepts, so I don't see any real issue.
>
Also, cgroups have become such a big part of how applications are managed
that many of us have solved this problem.

> >> That's a limitation that applies to many more control mechanisms in the
> >> kernel, and it's something that can easily be solved with fork+exec.
> > 
> > As long as you have control to launch the processes yes, but this
> > will not work in other scenarios. Just like cgroup net_cls and friends
> > are broken for filtering for things that you have no control to
> > fork+exec.
> 
> Probably, but that's only solvable with rules that store the full cgroup
> path then, and do a string comparison (!) for each packet flying by.
>
> >> That's just as transparent as SO_ATTACH_FILTER. What kind of
> >> introspection mechanism do you have in mind?
> > 
> > SO_ATTACH_FILTER is called from the process itself, so this is a local
> > filtering policy that you apply to your own process.
> 
> Not necessarily. You can as well do it the inetd way, and pass the
> socket to a process that is launched on demand, but do SO_ATTACH_FILTER
> + SO_LOCK_FILTER  in the middle. What happens with payload on the socket
> is not transparent to the launched binary at all. The proposed cgroup
> eBPF solution implements a very similar behavior in that regard.
> 
It would be nice to be able to see whether or not a filter is attached to a
cgroup, but given that this goes through syscalls, at least introspection
is possible, as opposed to something like netlink.

> >> It's about filtering outgoing network packets of applications, and
> >> providing them with L2 information for filtering purposes. I don't think
> >> that's a very specific use-case.
> >>
> >> When the feature is not used at all, the added costs on the output path
> >> are close to zero, due to the use of static branches.
> > 
> > *You're proposing a socket filtering facility that hooks layer 2
> > output path*!
> 
> As I said, I'm open to discussing that. In order to make it work for L3,
> the LL_OFF issues need to be solved, as Daniel explained. Daniel,
> Alexei, any idea how much work that would be?
> 
> > That is only a rough ~30 lines kernel patchset to support this in
> > netfilter and only one extra input hook, with potential access to
> > conntrack and better integration with other existing subsystems.
> 
> Care to share the patches for that? I'd really like to have a look.
> 
> And FWIW, I agree with Thomas - there is nothing wrong with having
> multiple options to use for such use-cases.
Right now, for containers, we have netfilter and network namespaces.
There's a lot of performance overhead that comes with this. Not only
that, but iptables doesn't really lend itself to simple use by
automated infrastructure. We (firewalld, systemd, dockerd, mesos)
end up fighting with one another for ownership over firewall rules.

Although I have problems with this approach, I think it's
a good baseline where we can have the top level owned by systemd,
Docker underneath that, and Mesos underneath that. We can add
additional hooks for things like Checmate and Landlock, and
with a little more work we can do composition, solving
all of our problems.

> 
> 
> Thanks,
> Daniel
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
       [not found]             ` <20160916195728.GA14736-I4sfFR6g6EicJoAdRrHjTrzMkBWIpU9tytq7g7fCXyjEk0E+pv7Png@public.gmane.org>
@ 2016-09-18 23:34               ` Sargun Dhillon
  2016-09-19 16:34               ` Daniel Mack
  1 sibling, 0 replies; 27+ messages in thread
From: Sargun Dhillon @ 2016-09-18 23:34 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Pablo Neira Ayuso, htejun-b10kYP2dOMg,
	daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Sep 16, 2016 at 12:57:29PM -0700, Sargun Dhillon wrote:
> On Wed, Sep 14, 2016 at 01:13:16PM +0200, Daniel Mack wrote:
> > Hi Pablo,
> > 
> > On 09/13/2016 07:24 PM, Pablo Neira Ayuso wrote:
> > > On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
> > >> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> > >>> On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
> > >>>> This is v5 of the patch set to allow eBPF programs for network
> > >>>> filtering and accounting to be attached to cgroups, so that they apply
> > >>>> to all sockets of all tasks placed in that cgroup. The logic also
> > >>>> allows to be extendeded for other cgroup based eBPF logic.
> > >>>
> > >>> 1) This infrastructure can only be useful to systemd, or any similar
> > >>>    orchestration daemon. Look, you can only apply filtering policies
> > >>>    to processes that are launched by systemd, so this only works
> > >>>    for server processes.
> > >>
> > >> Sorry, but both statements aren't true. The eBPF policies apply to every
> > >> process that is placed in a cgroup, and my example program in 6/6 shows
> > >> how that can be done from the command line.
> > > 
> > > Then you have to explain me how can anyone else than systemd use this
> > > infrastructure?
> > 
> > I have no idea what makes you think this is limited to systemd. As I
> > said, I provided an example for userspace that works from the command
> > line. The same limitation apply as for all other users of cgroups.
> > 
> So, at least in my work, we have Mesos, but on nearly every machine that Mesos 
> runs, people also have systemd. Now, there's recently become a bit of a battle 
> of ownership of things like cgroups on these machines. We can usually solve it 
> by nesting under systemd cgroups, and thus so far we've avoided making too many 
> systemd-specific concessions.
> 
> The reason this works (mostly), is because everything we touch has a sense of 
> nesting, where we can apply policy at a place lower in the hierarchy, and yet 
> systemd's monitoring and policy still stays in place. 
> 
> Now, with this patch, we don't have that, but I think we can reasonably add some 
> flag like "no override" when applying policies, or alternatively something like 
> "no new privileges", to prevent children from applying policies that override 
> top-level policy. I realize there is a speed concern as well, but I think for 
> people who want nested policy, we're willing to make the tradeoff. The cost
> of traversing a few extra pointers still outweighs the overhead of network
> namespaces, iptables, etc.. for many of us. 
> 
> What do you think Daniel?
> 
> > > My main point is that those processes *need* to be launched by the
> > > orchestrator, which is was refering as 'server processes'.
> > 
> > Yes, that's right. But as I said, this rule applies to many other kernel
> > concepts, so I don't see any real issue.
> >
> Also, cgroups have become such a big part of how applications are managed
> that many of us have solved this problem.
> 
> > >> That's a limitation that applies to many more control mechanisms in the
> > >> kernel, and it's something that can easily be solved with fork+exec.
> > > 
> > > As long as you have control to launch the processes yes, but this
> > > will not work in other scenarios. Just like cgroup net_cls and friends
> > > are broken for filtering for things that you have no control to
> > > fork+exec.
> > 
> > Probably, but that's only solvable with rules that store the full cgroup
> > path then, and do a string comparison (!) for each packet flying by.
> >
> > >> That's just as transparent as SO_ATTACH_FILTER. What kind of
> > >> introspection mechanism do you have in mind?
> > > 
> > > SO_ATTACH_FILTER is called from the process itself, so this is a local
> > > filtering policy that you apply to your own process.
> > 
> > Not necessarily. You can as well do it the inetd way, and pass the
> > socket to a process that is launched on demand, but do SO_ATTACH_FILTER
> > + SO_LOCK_FILTER  in the middle. What happens with payload on the socket
> > is not transparent to the launched binary at all. The proposed cgroup
> > eBPF solution implements a very similar behavior in that regard.
> > 
> It would be nice to be able to see whether or not a filter is attached to a 
> cgroup, but given this is going through syscalls, at least introspection
> is possible as opposed to something like netlink.
> 
> > >> It's about filtering outgoing network packets of applications, and
> > >> providing them with L2 information for filtering purposes. I don't think
> > >> that's a very specific use-case.
> > >>
> > >> When the feature is not used at all, the added costs on the output path
> > >> are close to zero, due to the use of static branches.
> > > 
> > > *You're proposing a socket filtering facility that hooks layer 2
> > > output path*!
> > 
> > As I said, I'm open to discussing that. In order to make it work for L3,
> > the LL_OFF issues need to be solved, as Daniel explained. Daniel,
> > Alexei, any idea how much work that would be?
> > 
> > > That is only a rough ~30 lines kernel patchset to support this in
> > > netfilter and only one extra input hook, with potential access to
> > > conntrack and better integration with other existing subsystems.
> > 
> > Care to share the patches for that? I'd really like to have a look.
> > 
> > And FWIW, I agree with Thomas - there is nothing wrong with having
> > multiple options to use for such use-cases.
> Right now, for containers, we have netfilter and network namespaces.
> There's a lot of performance overhead that comes with this. Not only
> that, but iptables doesn't really have a simple way of usage by
> automated infrastructure. We (firewalld, systemd, dockerd, mesos)
> end up fighting with one another for ownership over firewall rules.
> 
> Although, I have problems with this approach, I think that it's
> a good baseline where we can have top level owned by systemd,
> docker underneath that, and Mesos underneath that. We can add
> additional hooks for things like Checmate and Landlock, and
> with a little more work, we can do compositition, solving
> all of our problems.
> 
> > 
> > 
> > Thanks,
> > Daniel
> > 
Another thing --

It probably makes sense to make the warning in cgroup.c highlight the fact that
it disables these filters as well. Perhaps it makes sense to make it so you
can't disable it (via a boot flag, say?). Alternatively, maybe it makes sense to
introduce some exclusivity, so that when you load a filter, it disables
net_cls, and when you load net_cls, it throws a warning.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
       [not found]             ` <20160916195728.GA14736-I4sfFR6g6EicJoAdRrHjTrzMkBWIpU9tytq7g7fCXyjEk0E+pv7Png@public.gmane.org>
  2016-09-18 23:34               ` Sargun Dhillon
@ 2016-09-19 16:34               ` Daniel Mack
  2016-09-19 21:53                 ` Sargun Dhillon
  1 sibling, 1 reply; 27+ messages in thread
From: Daniel Mack @ 2016-09-19 16:34 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Pablo Neira Ayuso, htejun-b10kYP2dOMg,
	daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA

Hi,

On 09/16/2016 09:57 PM, Sargun Dhillon wrote:
> On Wed, Sep 14, 2016 at 01:13:16PM +0200, Daniel Mack wrote:

>> I have no idea what makes you think this is limited to systemd. As I
>> said, I provided an example for userspace that works from the command
>> line. The same limitation apply as for all other users of cgroups.
>>
> So, at least in my work, we have Mesos, but on nearly every machine that Mesos 
> runs, people also have systemd. Now, there's recently become a bit of a battle 
> of ownership of things like cgroups on these machines. We can usually solve it 
> by nesting under systemd cgroups, and thus so far we've avoided making too many 
> systemd-specific concessions.
> 
> The reason this works (mostly), is because everything we touch has a sense of 
> nesting, where we can apply policy at a place lower in the hierarchy, and yet 
> systemd's monitoring and policy still stays in place. 
> 
> Now, with this patch, we don't have that, but I think we can reasonably add some 
> flag like "no override" when applying policies, or alternatively something like 
> "no new privileges", to prevent children from applying policies that override 
> top-level policy.

Yes, but the API is already guarded by CAP_NET_ADMIN. Take that
capability away from your children, and they can't tamper with the
policy. Does that work for you?

> I realize there is a speed concern as well, but I think for 
> people who want nested policy, we're willing to make the tradeoff. The cost
> of traversing a few extra pointers still outweighs the overhead of network
> namespaces, iptables, etc.. for many of us. 

Not sure. Have you tried it?

> What do you think Daniel?

I think we should look at an implementation once we really need it, and
then revisit the performance impact. In any case, this can be changed
under the hood, without touching the userspace API (except for adding
flags if we need them).

>> Not necessarily. You can as well do it the inetd way, and pass the
>> socket to a process that is launched on demand, but do SO_ATTACH_FILTER
>> + SO_LOCK_FILTER  in the middle. What happens with payload on the socket
>> is not transparent to the launched binary at all. The proposed cgroup
>> eBPF solution implements a very similar behavior in that regard.
>
> It would be nice to be able to see whether or not a filter is attached to a 
> cgroup, but given this is going through syscalls, at least introspection
> is possible as opposed to something like netlink.

Sure, there are many ways. I once implemented the bpf cgroup logic using
its own cgroup controller, which made it possible to read out the
status. But as we agreed on attaching programs through the bpf(2) system
call, I moved back to the implementation that directly stores the
pointers in the cgroup.

First enabling the controller through the fs-backed cgroup interface,
then coming back through the bpf(2) syscall, and then going back to the fs
interface to read out status values is a bit weird.

>> And FWIW, I agree with Thomas - there is nothing wrong with having
>> multiple options to use for such use-cases.
>
> Right now, for containers, we have netfilter and network namespaces.
> There's a lot of performance overhead that comes with this.

Out of curiosity: Could you express that in numbers? And how exactly are
you testing?

> Not only
> that, but iptables doesn't really have a simple way of usage by
> automated infrastructure. We (firewalld, systemd, dockerd, mesos)
> end up fighting with one another for ownership over firewall rules.

Yes, that's a common problem.

> Although, I have problems with this approach, I think that it's
> a good baseline where we can have top level owned by systemd,
> docker underneath that, and Mesos underneath that. We can add
> additional hooks for things like Checmate and Landlock, and
> with a little more work, we can do compositition, solving
> all of our problems.

It is supposed to be just a baseline, yes.


Thanks for your feedback,
Daniel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
  2016-09-19 16:34               ` Daniel Mack
@ 2016-09-19 21:53                 ` Sargun Dhillon
       [not found]                   ` <20160919215311.GA9723-I4sfFR6g6EicJoAdRrHjTrzMkBWIpU9tytq7g7fCXyjEk0E+pv7Png@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Sargun Dhillon @ 2016-09-19 21:53 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Pablo Neira Ayuso, htejun, daniel, ast, davem, kafai, fw, harald,
	netdev, cgroups

On Mon, Sep 19, 2016 at 06:34:28PM +0200, Daniel Mack wrote:
> Hi,
> 
> On 09/16/2016 09:57 PM, Sargun Dhillon wrote:
> > On Wed, Sep 14, 2016 at 01:13:16PM +0200, Daniel Mack wrote:
> 
> >> I have no idea what makes you think this is limited to systemd. As I
> >> said, I provided an example for userspace that works from the command
> >> line. The same limitation apply as for all other users of cgroups.
> >>
> > So, at least in my work, we have Mesos, but on nearly every machine that Mesos 
> > runs, people also have systemd. Now, there's recently become a bit of a battle 
> > of ownership of things like cgroups on these machines. We can usually solve it 
> > by nesting under systemd cgroups, and thus so far we've avoided making too many 
> > systemd-specific concessions.
> > 
> > The reason this works (mostly), is because everything we touch has a sense of 
> > nesting, where we can apply policy at a place lower in the hierarchy, and yet 
> > systemd's monitoring and policy still stays in place. 
> > 
> > Now, with this patch, we don't have that, but I think we can reasonably add some 
> > flag like "no override" when applying policies, or alternatively something like 
> > "no new privileges", to prevent children from applying policies that override 
> > top-level policy.
> 
> Yes, but the API is already guarded by CAP_NET_ADMIN. Take that
> capability away from your children, and they can't tamper with the
> policy. Does that work for you?
> 
No. This can be addressed in a follow-on patch, but the use case is that I have
a container orchestrator (Docker or Mesos) and systemd. The sysadmin controls
systemd, and Docker is controlled by devs. Typically, the system owner wants
some system-level statistics and filtering, and then we want to do
per-container filtering.

We really want to be able to do nesting with userspace tools that are oblivious, 
and we want to delegate a level of the cgroup hierarchy to the tool that created 
it. I do not see Docker integrating with systemd any time soon, and that's 
really the only other alternative.

> > I realize there is a speed concern as well, but I think for 
> > people who want nested policy, we're willing to make the tradeoff. The cost
> > of traversing a few extra pointers still outweighs the overhead of network
> > namespaces, iptables, etc.. for many of us. 
> 
> Not sure. Have you tried it?
> 
Tried nested policies? Yes. I tried nested policy execution with syscalls, and I
tested with bind and connect. The performance overhead was pretty minimal, but
latency increased by 100+ microseconds once the number of BPF hooks increased
beyond 30. The BPF programs were trivial; they essentially did a map lookup and
returned 0.

I don't think that it's just raw cycles / execution time, but I didn't spend 
enough time digging into it to determine the performance hit. I'm waiting
for your patchset to land, and then I plan to work off of it.

> > What do you think Daniel?
> 
> I think we should look at an implementation once we really need it, and
> then revisit the performance impact. In any case, this can be changed
> under the hood, without touching the userspace API (except for adding
> flags if we need them).
> 
+1
> >> Not necessarily. You can as well do it the inetd way, and pass the
> >> socket to a process that is launched on demand, but do SO_ATTACH_FILTER
> >> + SO_LOCK_FILTER  in the middle. What happens with payload on the socket
> >> is not transparent to the launched binary at all. The proposed cgroup
> >> eBPF solution implements a very similar behavior in that regard.
> >
> > It would be nice to be able to see whether or not a filter is attached to a 
> > cgroup, but given this is going through syscalls, at least introspection
> > is possible as opposed to something like netlink.
> 
> Sure, there are many ways. I implemented the bpf cgroup logic using an
> own cgroup controller once, which made it possible to read out the
> status. But as we agreed on attaching programs through the bpf(2) system
> call, I moved back to the implementation that directly stores the
> pointers in the cgroup.
> 
> First enabling the controller through the fs-backed cgroup interface,
> then come back through the bpf(2) syscall and then go back to the fs
> interface to read out status values is a bit weird.
> 
Hrm, that makes sense. With the BPF syscall, would there be a way to get a
file descriptor for the currently attached BPF program?

> >> And FWIW, I agree with Thomas - there is nothing wrong with having
> >> multiple options to use for such use-cases.
> >
> > Right now, for containers, we have netfilter and network namespaces.
> > There's a lot of performance overhead that comes with this.
> 
> Out of curiosity: Could you express that in numbers? And how exactly are
> you testing?
> 
Sure. The workload we use as a baseline is Redis with redis-benchmark. We
reconnect after every request, and we're running "isolation" between two
containers on the same machine to try to rule out any physical infrastructure
overhead.

So, we ran two tests with network namespaces. The first one was putting Redis 
into its own network namespace, and using tc to do some basic shaping:
Client--Veth---Host Namespace---Veth---Redis

The second was:
Client--Veth--Host Namespace+Iptables filtering--Veth--Redis. 

The second test required us to use conntrack, as we wanted stateful filtering.

Ops/sec:
Original: 4275
Situation 1: 3823
Situation 2: 1489

Latency (milliseconds):
Original: 0.69
Situation 1: 0.82
Situation 2: 2.11

This was on a (KVM) machine with 16 GB of RAM and 8 cores, where the machine
was supposed to be dedicated to me. Given that it's not bare metal, take these
numbers with a grain of salt.

> > Not only
> > that, but iptables doesn't really have a simple way of usage by
> > automated infrastructure. We (firewalld, systemd, dockerd, mesos)
> > end up fighting with one another for ownership over firewall rules.
> 
> Yes, that's a common problem.
> 
> > Although, I have problems with this approach, I think that it's
> > a good baseline where we can have top level owned by systemd,
> > docker underneath that, and Mesos underneath that. We can add
> > additional hooks for things like Checmate and Landlock, and
> > with a little more work, we can do compositition, solving
> > all of our problems.
> 
> It is supposed to be just a baseline, yes.
> 
> 
> Thanks for your feedback,
> Daniel
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
       [not found]                   ` <20160919215311.GA9723-I4sfFR6g6EicJoAdRrHjTrzMkBWIpU9tytq7g7fCXyjEk0E+pv7Png@public.gmane.org>
@ 2016-09-20 14:25                     ` Daniel Mack
  0 siblings, 0 replies; 27+ messages in thread
From: Daniel Mack @ 2016-09-20 14:25 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Pablo Neira Ayuso, htejun-b10kYP2dOMg,
	daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA

On 09/19/2016 11:53 PM, Sargun Dhillon wrote:
> On Mon, Sep 19, 2016 at 06:34:28PM +0200, Daniel Mack wrote:
>> On 09/16/2016 09:57 PM, Sargun Dhillon wrote:

>>> Now, with this patch, we don't have that, but I think we can reasonably add some 
>>> flag like "no override" when applying policies, or alternatively something like 
>>> "no new privileges", to prevent children from applying policies that override 
>>> top-level policy.
>>
>> Yes, but the API is already guarded by CAP_NET_ADMIN. Take that
>> capability away from your children, and they can't tamper with the
>> policy. Does that work for you?
>
> No. This can be addressed in a follow-on patch, but the use-case is that I have 
> a container orchestrator (Docker, or Mesos), and systemd. The sysadmin controls 
> systemd, and Docker is controlled by devs. Typically, the system owner wants 
> some system level statistics, and filtering, and then we want to do 
> per-container filtering.
> 
> We really want to be able to do nesting with userspace tools that are oblivious, 
> and we want to delegate a level of the cgroup hierarchy to the tool that created 
> it. I do not see Docker integrating with systemd any time soon, and that's 
> really the only other alternative.

Then we'd need to find out whether you want to block other users from
installing (thus overriding) an existing eBPF program, or if you want to
allow that but execute them all. Both are possible.

[...]

>>> It would be nice to be able to see whether or not a filter is attached to a 
>>> cgroup, but given this is going through syscalls, at least introspection
>>> is possible as opposed to something like netlink.
>>
>> Sure, there are many ways. I implemented the bpf cgroup logic using an
>> own cgroup controller once, which made it possible to read out the
>> status. But as we agreed on attaching programs through the bpf(2) system
>> call, I moved back to the implementation that directly stores the
>> pointers in the cgroup.
>>
>> First enabling the controller through the fs-backed cgroup interface,
>> then come back through the bpf(2) syscall and then go back to the fs
>> interface to read out status values is a bit weird.
>>
> Hrm, that makes sense. with the BPF syscall, would there be a way to get
> file descriptor of the currently attached BPF program?

A file descriptor is local to a task, so we would need to install a new
fd and return its number. But I'm not sure what we'd gain from that.


Thanks,
Daniel

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2016-09-20 14:25 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-12 16:12 [PATCH v5 0/6] Add eBPF hooks for cgroups Daniel Mack
2016-09-12 16:12 ` [PATCH v5 1/6] bpf: add new prog type for cgroup socket filtering Daniel Mack
2016-09-12 16:12 ` [PATCH v5 2/6] cgroup: add support for eBPF programs Daniel Mack
2016-09-12 16:12 ` [PATCH v5 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
     [not found] ` <1473696735-11269-1-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
2016-09-12 16:12   ` [PATCH v5 4/6] net: filter: run cgroup eBPF ingress programs Daniel Mack
2016-09-12 16:12   ` [PATCH v5 5/6] net: core: run cgroup eBPF egress programs Daniel Mack
2016-09-12 16:12   ` [PATCH v5 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups Daniel Mack
2016-09-13 11:56 ` [PATCH v5 0/6] Add eBPF hooks for cgroups Pablo Neira Ayuso
2016-09-13 13:31   ` Daniel Mack
     [not found]     ` <da300784-284c-0d1f-a82e-aa0a0f8ae116-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
2016-09-13 14:14       ` Daniel Borkmann
2016-09-13 17:24       ` Pablo Neira Ayuso
2016-09-14  4:42         ` Alexei Starovoitov
2016-09-14  9:03           ` Thomas Graf
     [not found]           ` <20160914044217.GA44742-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
2016-09-14 10:30             ` Pablo Neira Ayuso
2016-09-14 11:06               ` Thomas Graf
2016-09-14 11:36               ` Daniel Borkmann
2016-09-14 11:13         ` Daniel Mack
     [not found]           ` <6de6809a-13f5-4000-5639-c760dde30223-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
2016-09-14 11:42             ` Daniel Borkmann
     [not found]               ` <57D937B9.2090100-FeC+5ew28dpmcu3hnIyYJQ@public.gmane.org>
2016-09-14 15:55                 ` Alexei Starovoitov
2016-09-16 19:57           ` Sargun Dhillon
     [not found]             ` <20160916195728.GA14736-I4sfFR6g6EicJoAdRrHjTrzMkBWIpU9tytq7g7fCXyjEk0E+pv7Png@public.gmane.org>
2016-09-18 23:34               ` Sargun Dhillon
2016-09-19 16:34               ` Daniel Mack
2016-09-19 21:53                 ` Sargun Dhillon
     [not found]                   ` <20160919215311.GA9723-I4sfFR6g6EicJoAdRrHjTrzMkBWIpU9tytq7g7fCXyjEk0E+pv7Png@public.gmane.org>
2016-09-20 14:25                     ` Daniel Mack
2016-09-15  6:36 ` Vincent Bernat
     [not found]   ` <m3y42tlldz.fsf-PiWSfznZvZU/eRriIvX0kg@public.gmane.org>
2016-09-15  8:11     ` Daniel Mack
2016-09-15  8:11       ` Daniel Mack
