* [PATCH v7 0/6] Add eBPF hooks for cgroups
From: Daniel Mack @ 2016-10-25 10:14 UTC (permalink / raw)
  To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack

This is v7 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The logic can also
be extended to other cgroup-based eBPF use cases.

Although there are some minor updates as listed below, the overall
concept remains the same. Alexei, Daniel Borkmann and I have been
discussing some details off-list for some time, which I'd like to
summarize quickly.

* Regarding introspection of installed eBPF programs, possible
  solutions are a) to add trace points to the bpf(2) syscall to dump
  programs when they are installed (which could be done by some
  sort of daemon), or b) to use audit logging.

  Dumping programs once they are installed is problematic because of
  the internal optimizations done to the eBPF program during its
  lifetime. Also, the references to maps etc. would need to be
  restored during the dump.

  Just exposing whether or not a program is attached would be
  trivial to do, however, most easily through another bpf(2)
  command. That can be added later on though.

* Moving the filter hooks to some other sub-system such as tc or
  xdp is problematic because they are centered around net devices,
  which the current implementation is agnostic of, for good reasons.
  Also, we're back to square one then, because the local receiver,
  along with its cgroup membership, is not guaranteed to be known
  at that point in time; and that is the problem class that brought
  us here in the first place.


So, here we go again. Thanks for having another look.


Thanks,
Daniel


Changes from v6:

* Rebased to 4.9-rc2

* Add EXPORT_SYMBOL(__cgroup_bpf_run_filter). The kbuild test robot
  now succeeds in building this version of the patch set.

* Switch from bpf_prog_run_save_cb() to bpf_prog_run_clear_cb() to not
  tamper with the contents of skb->cb[]. Pointed out by Daniel
  Borkmann.

* Use sk_to_full_sk() in the egress path, as suggested by Daniel
  Borkmann.

* Renamed BPF_PROG_TYPE_CGROUP_SOCKET to BPF_PROG_TYPE_CGROUP_SKB, as
  requested by David Ahern.

* Added Alexei's Acked-by tags.


Changes from v5:

* The eBPF programs now operate on L3 rather than on L2 of the packets,
  and the egress hooks were moved from __dev_queue_xmit() to
  ip*_output().

* For BPF_PROG_TYPE_CGROUP_SOCKET, disallow direct access to the skb
  through BPF_LD_[ABS|IND] instructions, but hook up the
  bpf_skb_load_bytes() access helper instead. Thanks to Daniel Borkmann
  for the help.


Changes from v4:

* Plug an skb leak when dropping packets due to eBPF verdicts in
  __dev_queue_xmit(). Spotted by Daniel Borkmann.

* Check for sk_fullsock(sk) in __cgroup_bpf_run_filter() so we don't
  operate on timewait or request sockets. Suggested by Daniel Borkmann.

* Add missing @parent parameter in kerneldoc of __cgroup_bpf_update().
  Spotted by Rami Rosen.

* Include linux/jump_label.h from bpf-cgroup.h to fix a kbuild error.


Changes from v3:

* Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
  renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
  BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
  __BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.

* Dropped the attach_flags member from the anonymous struct for BPF
  attach operations in union bpf_attr. They can be added later on via
  CHECK_ATTR. Requested by Daniel Borkmann and Alexei.

* Release old_prog at the end of __cgroup_bpf_update() rather than at
  the beginning to fix a race gap between program updates and their
  users. Spotted by Daniel Borkmann.

* Plugged an skb leak when dropping packets on the egress path.
  Spotted by Daniel Borkmann.

* Add cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org to the loop, as suggested by Rami Rosen.

* Some minor coding style adjustments not worth mentioning in particular.


Changes from v2:

* Fixed the RCU locking details Tejun pointed out.

* Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.


Changes from v1:

* Moved all bpf specific cgroup code into its own file, and stub
  out related functions for !CONFIG_CGROUP_BPF as static inline nops.
  This way, the call sites are not cluttered with #ifdef guards while
  the feature remains compile-time configurable.

* Implemented the new scheme proposed by Tejun. Per cgroup, store one
  set of pointers for programs that are pinned to the cgroup, and one
  for programs that are effective for it. When a program is attached
  or detached,
  the change is propagated to all the cgroup's descendants. If a
  subcgroup has its own pinned program, skip the whole subbranch in
  order to allow delegation models.

* The hookup for egress packets is now done from __dev_queue_xmit().

* A static key is now used in both the ingress and egress fast paths
  to keep performance penalties close to zero if the feature is
  not in use.

* Overall cleanup to make the accessors use the program arrays.
  This should make it much easier to add new program types, which
  will then automatically follow the pinned vs. effective logic.

* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with
  xchg() and are protected by cgroup_mutex.

* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.

* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. In case
  other socket families should be supported, this can be extended in
  the future.

* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.


Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: ipv4, ipv6: run cgroup eBPF egress programs
  samples: bpf: add userspace example for attaching eBPF programs to
    cgroups

 include/linux/bpf-cgroup.h      |  71 +++++++++++++++++
 include/linux/cgroup-defs.h     |   4 +
 include/uapi/linux/bpf.h        |  17 ++++
 init/Kconfig                    |  12 +++
 kernel/bpf/Makefile             |   1 +
 kernel/bpf/cgroup.c             | 167 ++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c            |  81 +++++++++++++++++++
 kernel/cgroup.c                 |  18 +++++
 net/core/filter.c               |  27 +++++++
 net/ipv4/ip_output.c            |  17 ++++
 net/ipv6/ip6_output.c           |   9 +++
 samples/bpf/Makefile            |   2 +
 samples/bpf/libbpf.c            |  21 +++++
 samples/bpf/libbpf.h            |   3 +
 samples/bpf/test_cgrp2_attach.c | 147 +++++++++++++++++++++++++++++++++++
 15 files changed, 597 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c
 create mode 100644 samples/bpf/test_cgrp2_attach.c

-- 
2.7.4


* [PATCH v7 1/6] bpf: add new prog type for cgroup socket filtering
From: Daniel Mack @ 2016-10-25 10:14 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups, Daniel Mack

This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
bpf_skb_load_bytes() helper.

Programs of this type will be attached to cgroups for network filtering
and accounting.
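
As an illustration (this example is not part of the patch), a minimal
program of this type could look as follows when written in restricted C
and compiled with clang -target bpf. The manual helper declaration
follows the usual samples/bpf convention and is an assumption here; the
L3 payload start and the pass/drop verdict semantics are described in
the later patches of this series:

  #include <linux/bpf.h>
  #include <linux/types.h>

  /* Manual helper declaration, as done in samples/bpf */
  static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
          (void *) BPF_FUNC_skb_load_bytes;

  int drop_udp(struct __sk_buff *skb)
  {
          __u8 proto;

          /* The payload starts at the network header (L3). For IPv4,
           * the protocol field sits at offset 9; IPv6 is omitted here
           * for brevity.
           */
          if (bpf_skb_load_bytes(skb, 9, &proto, sizeof(proto)) < 0)
                  return 1;  /* on error, let the packet pass */

          return proto == 17 ? 0 : 1;  /* 17 == IPPROTO_UDP; 0 drops, 1 passes */
  }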

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 include/uapi/linux/bpf.h |  9 +++++++++
 net/core/filter.c        | 23 +++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f09c70b..1f3e6f1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,8 +96,17 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_TRACEPOINT,
 	BPF_PROG_TYPE_XDP,
 	BPF_PROG_TYPE_PERF_EVENT,
+	BPF_PROG_TYPE_CGROUP_SKB,
 };
 
+enum bpf_attach_type {
+	BPF_CGROUP_INET_INGRESS,
+	BPF_CGROUP_INET_EGRESS,
+	__MAX_BPF_ATTACH_TYPE
+};
+
+#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
+
 #define BPF_PSEUDO_MAP_FD	1
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/net/core/filter.c b/net/core/filter.c
index 00351cd..e3813d6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2576,6 +2576,17 @@ xdp_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+cg_skb_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_skb_load_bytes_proto;
+	default:
+		return sk_filter_func_proto(func_id);
+	}
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2938,6 +2949,12 @@ static const struct bpf_verifier_ops xdp_ops = {
 	.convert_ctx_access	= xdp_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops cg_skb_ops = {
+	.get_func_proto		= cg_skb_func_proto,
+	.is_valid_access	= sk_filter_is_valid_access,
+	.convert_ctx_access	= sk_filter_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops	= &sk_filter_ops,
 	.type	= BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2958,12 +2975,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list cg_skb_type __read_mostly = {
+	.ops	= &cg_skb_ops,
+	.type	= BPF_PROG_TYPE_CGROUP_SKB,
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
 	bpf_register_prog_type(&sched_cls_type);
 	bpf_register_prog_type(&sched_act_type);
 	bpf_register_prog_type(&xdp_type);
+	bpf_register_prog_type(&cg_skb_type);
 
 	return 0;
 }
-- 
2.7.4


* [PATCH v7 2/6] cgroup: add support for eBPF programs
From: Daniel Mack @ 2016-10-25 10:14 UTC (permalink / raw)
  To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack

This patch adds two sets of eBPF program pointers to struct cgroup:
one for programs that are directly pinned to a cgroup, and one for
programs that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.
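
Put differently, the effective program of a cgroup is its own pinned
program if it has one, or otherwise that of its nearest ancestor. A
hedged userspace C sketch of that resolution (struct and function
names are illustrative, not kernel API):

  struct cg {
          struct cg *parent;  /* NULL for the root */
          void *prog;         /* program pinned to this cgroup, or NULL */
  };

  /* Walk towards the root until a cgroup with its own pinned
   * program is found.
   */
  static void *effective_prog(struct cg *cg)
  {
          for (; cg; cg = cg->parent)
                  if (cg->prog)
                          return cg->prog;
          return NULL;
  }

The kernel avoids this walk at runtime by caching the result in each
cgroup's effective[] pointer and re-propagating it on every attach and
detach; see __cgroup_bpf_update() below.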

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
Acked-by: Alexei Starovoitov <ast-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/bpf-cgroup.h  |  71 +++++++++++++++++++
 include/linux/cgroup-defs.h |   4 ++
 init/Kconfig                |  12 ++++
 kernel/bpf/Makefile         |   1 +
 kernel/bpf/cgroup.c         | 167 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup.c             |  18 +++++
 6 files changed, 273 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 0000000..fc076de
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,71 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include <linux/bpf.h>
+#include <linux/jump_label.h>
+#include <uapi/linux/bpf.h>
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+	/*
+	 * Store two sets of bpf_prog pointers, one for programs that are
+	 * pinned directly to this cgroup, and one for those that are effective
+	 * when this cgroup is accessed.
+	 */
+	struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
+	struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+			 struct cgroup *parent,
+			 struct bpf_prog *prog,
+			 enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+		       struct bpf_prog *prog,
+		       enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+			    struct sk_buff *skb,
+			    enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled */
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+					struct sk_buff *skb,
+					enum bpf_attach_type type)
+{
+	if (cgroup_bpf_enabled)
+		return __cgroup_bpf_run_filter(sk, skb, type);
+
+	return 0;
+}
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+				      struct cgroup *parent) {}
+
+static inline int cgroup_bpf_run_filter(struct sock *sk,
+					struct sk_buff *skb,
+					enum bpf_attach_type type)
+{
+	return 0;
+}
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include <linux/percpu-refcount.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/workqueue.h>
+#include <linux/bpf-cgroup.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -300,6 +301,9 @@ struct cgroup {
 	/* used to schedule release agent */
 	struct work_struct release_agent_work;
 
+	/* used to store eBPF programs */
+	struct cgroup_bpf bpf;
+
 	/* ids of the ancestors at each level including self */
 	int ancestor_ids[];
 };
diff --git a/init/Kconfig b/init/Kconfig
index 34407f1..405120b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1154,6 +1154,18 @@ config CGROUP_PERF
 
 	  Say N if unsure.
 
+config CGROUP_BPF
+	bool "Support for eBPF programs attached to cgroups"
+	depends on BPF_SYSCALL && SOCK_CGROUP_DATA
+	help
+	  Allow attaching eBPF programs to a cgroup using the bpf(2)
+	  syscall command BPF_PROG_ATTACH.
+
+	  In which context these programs are accessed depends on the type
+	  of attachment. For instance, programs that are attached using
+	  BPF_CGROUP_INET_INGRESS will be executed on the ingress path of
+	  inet sockets.
+
 config CGROUP_DEBUG
 	bool "Example controller"
 	default n
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index eed911d..b22256b 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
+obj-$(CONFIG_CGROUP_BPF) += cgroup.o
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
new file mode 100644
index 0000000..a0ab43f
--- /dev/null
+++ b/kernel/bpf/cgroup.c
@@ -0,0 +1,167 @@
+/*
+ * Functions to manage eBPF programs attached to cgroups
+ *
+ * Copyright (c) 2016 Daniel Mack
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/bpf-cgroup.h>
+#include <net/sock.h>
+
+DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
+EXPORT_SYMBOL(cgroup_bpf_enabled_key);
+
+/**
+ * cgroup_bpf_put() - put references of all bpf programs
+ * @cgrp: the cgroup to modify
+ */
+void cgroup_bpf_put(struct cgroup *cgrp)
+{
+	unsigned int type;
+
+	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.prog); type++) {
+		struct bpf_prog *prog = cgrp->bpf.prog[type];
+
+		if (prog) {
+			bpf_prog_put(prog);
+			static_branch_dec(&cgroup_bpf_enabled_key);
+		}
+	}
+}
+
+/**
+ * cgroup_bpf_inherit() - inherit effective programs from parent
+ * @cgrp: the cgroup to modify
+ * @parent: the parent to inherit from
+ */
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
+{
+	unsigned int type;
+
+	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.effective); type++) {
+		struct bpf_prog *e;
+
+		e = rcu_dereference_protected(parent->bpf.effective[type],
+					      lockdep_is_held(&cgroup_mutex));
+		rcu_assign_pointer(cgrp->bpf.effective[type], e);
+	}
+}
+
+/**
+ * __cgroup_bpf_update() - Update the pinned program of a cgroup, and
+ *                         propagate the change to descendants
+ * @cgrp: The cgroup which descendants to traverse
+ * @parent: The parent of @cgrp, or %NULL if @cgrp is the root
+ * @prog: A new program to pin
+ * @type: Type of pinning operation (ingress/egress)
+ *
+ * Each cgroup has a set of two pointers for bpf programs; one for eBPF
+ * programs it owns, and one for those that are effective for it.
+ *
+ * If @prog is not %NULL, this function attaches a new program to the cgroup and
+ * releases the one that is currently attached, if any. @prog is then made
+ * the effective program of type @type in that cgroup.
+ *
+ * If @prog is %NULL, the currently attached program of type @type is released,
+ * and the effective program of the parent cgroup (if any) is inherited to
+ * @cgrp.
+ *
+ * Then, the descendants of @cgrp are walked and the effective program for
+ * each of them is set to the effective program of @cgrp unless the
+ * descendant has its own program attached, in which case the subbranch is
+ * skipped. This ensures that delegated subcgroups with own programs are left
+ * untouched.
+ *
+ * Must be called with cgroup_mutex held.
+ */
+void __cgroup_bpf_update(struct cgroup *cgrp,
+			 struct cgroup *parent,
+			 struct bpf_prog *prog,
+			 enum bpf_attach_type type)
+{
+	struct bpf_prog *old_prog, *effective;
+	struct cgroup_subsys_state *pos;
+
+	old_prog = xchg(cgrp->bpf.prog + type, prog);
+
+	effective = (!prog && parent) ?
+		rcu_dereference_protected(parent->bpf.effective[type],
+					  lockdep_is_held(&cgroup_mutex)) :
+		prog;
+
+	css_for_each_descendant_pre(pos, &cgrp->self) {
+		struct cgroup *desc = container_of(pos, struct cgroup, self);
+
+		/* skip the subtree if the descendant has its own program */
+		if (desc->bpf.prog[type] && desc != cgrp)
+			pos = css_rightmost_descendant(pos);
+		else
+			rcu_assign_pointer(desc->bpf.effective[type],
+					   effective);
+	}
+
+	if (prog)
+		static_branch_inc(&cgroup_bpf_enabled_key);
+
+	if (old_prog) {
+		bpf_prog_put(old_prog);
+		static_branch_dec(&cgroup_bpf_enabled_key);
+	}
+}
+
+/**
+ * __cgroup_bpf_run_filter() - Run a program for packet filtering
+ * @sk: The socket sending or receiving traffic
+ * @skb: The skb that is being sent or received
+ * @type: The type of program to be executed
+ *
+ * If no socket is passed, or the socket is not of type INET or INET6,
+ * this function does nothing and returns 0.
+ *
+ * The program type passed in via @type must be suitable for network
+ * filtering. No further check is performed to assert that.
+ *
+ * This function will return %-EPERM if an attached program was found
+ * and if it returned != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter(struct sock *sk,
+			    struct sk_buff *skb,
+			    enum bpf_attach_type type)
+{
+	struct bpf_prog *prog;
+	struct cgroup *cgrp;
+	int ret = 0;
+
+	if (!sk || !sk_fullsock(sk))
+		return 0;
+
+	if (sk->sk_family != AF_INET &&
+	    sk->sk_family != AF_INET6)
+		return 0;
+
+	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+
+	rcu_read_lock();
+
+	prog = rcu_dereference(cgrp->bpf.effective[type]);
+	if (prog) {
+		unsigned int offset = skb->data - skb_network_header(skb);
+
+		__skb_push(skb, offset);
+		ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
+		__skb_pull(skb, offset);
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 85bc9be..2ee9ec3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5074,6 +5074,8 @@ static void css_release_work_fn(struct work_struct *work)
 		if (cgrp->kn)
 			RCU_INIT_POINTER(*(void __rcu __force **)&cgrp->kn->priv,
 					 NULL);
+
+		cgroup_bpf_put(cgrp);
 	}
 
 	mutex_unlock(&cgroup_mutex);
@@ -5281,6 +5283,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	if (!cgroup_on_dfl(cgrp))
 		cgrp->subtree_control = cgroup_control(cgrp);
 
+	if (parent)
+		cgroup_bpf_inherit(cgrp, parent);
+
 	cgroup_propagate_control(cgrp);
 
 	/* @cgrp doesn't have dir yet so the following will only create csses */
@@ -6495,6 +6500,19 @@ static __init int cgroup_namespaces_init(void)
 }
 subsys_initcall(cgroup_namespaces_init);
 
+#ifdef CONFIG_CGROUP_BPF
+void cgroup_bpf_update(struct cgroup *cgrp,
+		       struct bpf_prog *prog,
+		       enum bpf_attach_type type)
+{
+	struct cgroup *parent = cgroup_parent(cgrp);
+
+	mutex_lock(&cgroup_mutex);
+	__cgroup_bpf_update(cgrp, parent, prog, type);
+	mutex_unlock(&cgroup_mutex);
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 #ifdef CONFIG_CGROUP_DEBUG
 static struct cgroup_subsys_state *
 debug_css_alloc(struct cgroup_subsys_state *parent_css)
-- 
2.7.4


* [PATCH v7 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
From: Daniel Mack @ 2016-10-25 10:14 UTC (permalink / raw)
  To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack

Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target can be anything that has an fd in
userspace; hence, the field in union bpf_attr is called 'target_fd'.

When called with BPF_CGROUP_INET_{IN,E}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.
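
For illustration, a hedged userspace sketch of the attach operation via
the raw syscall, mirroring the libbpf wrappers added in patch 6/6. It
assumes the uapi headers from this series; error handling is trimmed,
and the cgroup path is whatever the caller picked:

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static int attach_to_cgroup(int prog_fd, const char *cg_path)
  {
          union bpf_attr attr;
          int cg_fd, ret;

          cg_fd = open(cg_path, O_DIRECTORY | O_RDONLY);
          if (cg_fd < 0)
                  return -1;

          memset(&attr, 0, sizeof(attr));
          attr.target_fd = cg_fd;        /* cgroup v2 directory fd */
          attr.attach_bpf_fd = prog_fd;  /* fd from BPF_PROG_LOAD */
          attr.attach_type = BPF_CGROUP_INET_INGRESS;

          ret = syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
          close(cg_fd);
          return ret;
  }

Detaching works the same way, with BPF_PROG_DETACH and attach_bpf_fd
left at zero.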

Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
Acked-by: Alexei Starovoitov <ast-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/uapi/linux/bpf.h |  8 +++++
 kernel/bpf/syscall.c     | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1f3e6f1..f31b655 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
 	BPF_PROG_LOAD,
 	BPF_OBJ_PIN,
 	BPF_OBJ_GET,
+	BPF_PROG_ATTACH,
+	BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -150,6 +152,12 @@ union bpf_attr {
 		__aligned_u64	pathname;
 		__u32		bpf_fd;
 	};
+
+	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+		__u32		target_fd;	/* container object to attach to */
+		__u32		attach_bpf_fd;	/* eBPF program to attach */
+		__u32		attach_type;
+	};
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..1814c01 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,77 @@ static int bpf_obj_get(const union bpf_attr *attr)
 	return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+#ifdef CONFIG_CGROUP_BPF
+
+#define BPF_PROG_ATTACH_LAST_FIELD attach_type
+
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+	struct bpf_prog *prog;
+	struct cgroup *cgrp;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (CHECK_ATTR(BPF_PROG_ATTACH))
+		return -EINVAL;
+
+	switch (attr->attach_type) {
+	case BPF_CGROUP_INET_INGRESS:
+	case BPF_CGROUP_INET_EGRESS:
+		prog = bpf_prog_get_type(attr->attach_bpf_fd,
+					 BPF_PROG_TYPE_CGROUP_SKB);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+
+		cgrp = cgroup_get_from_fd(attr->target_fd);
+		if (IS_ERR(cgrp)) {
+			bpf_prog_put(prog);
+			return PTR_ERR(cgrp);
+		}
+
+		cgroup_bpf_update(cgrp, prog, attr->attach_type);
+		cgroup_put(cgrp);
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+#define BPF_PROG_DETACH_LAST_FIELD attach_type
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+	struct cgroup *cgrp;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (CHECK_ATTR(BPF_PROG_DETACH))
+		return -EINVAL;
+
+	switch (attr->attach_type) {
+	case BPF_CGROUP_INET_INGRESS:
+	case BPF_CGROUP_INET_EGRESS:
+		cgrp = cgroup_get_from_fd(attr->target_fd);
+		if (IS_ERR(cgrp))
+			return PTR_ERR(cgrp);
+
+		cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+		cgroup_put(cgrp);
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -888,6 +959,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET:
 		err = bpf_obj_get(&attr);
 		break;
+
+#ifdef CONFIG_CGROUP_BPF
+	case BPF_PROG_ATTACH:
+		err = bpf_prog_attach(&attr);
+		break;
+	case BPF_PROG_DETACH:
+		err = bpf_prog_detach(&attr);
+		break;
+#endif
+
 	default:
 		err = -EINVAL;
 		break;
-- 
2.7.4


* [PATCH v7 4/6] net: filter: run cgroup eBPF ingress programs
From: Daniel Mack @ 2016-10-25 10:14 UTC (permalink / raw)
  To: htejun, daniel, ast
  Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups, Daniel Mack

If the cgroup associated with the receiving socket has eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop it. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 net/core/filter.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index e3813d6..bd6eebe 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,10 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
 	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
 		return -ENOMEM;
 
+	err = cgroup_bpf_run_filter(sk, skb, BPF_CGROUP_INET_INGRESS);
+	if (err)
+		return err;
+
 	err = security_sock_rcv_skb(sk, skb);
 	if (err)
 		return err;
-- 
2.7.4


* [PATCH v7 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
From: Daniel Mack @ 2016-10-25 10:14 UTC (permalink / raw)
  To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack

If the cgroup associated with the sending socket has eBPF
programs installed, run them from ip_output(), ip6_output() and
ip_mc_output().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop it. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
Acked-by: Alexei Starovoitov <ast-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 net/ipv4/ip_output.c  | 17 +++++++++++++++++
 net/ipv6/ip6_output.c |  9 +++++++++
 2 files changed, 26 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 05d1058..ee4b249 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -74,6 +74,7 @@
 #include <net/checksum.h>
 #include <net/inetpeer.h>
 #include <net/lwtunnel.h>
+#include <linux/bpf-cgroup.h>
 #include <linux/igmp.h>
 #include <linux/netfilter_ipv4.h>
 #include <linux/netfilter_bridge.h>
@@ -303,6 +304,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct rtable *rt = skb_rtable(skb);
 	struct net_device *dev = rt->dst.dev;
+	int ret;
 
 	/*
 	 *	If the indicated interface is up and running, send the packet.
@@ -312,6 +314,13 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
 
+	ret = cgroup_bpf_run_filter(sk_to_full_sk(sk), skb,
+				    BPF_CGROUP_INET_EGRESS);
+	if (ret) {
+		kfree_skb(skb);
+		return ret;
+	}
+
 	/*
 	 *	Multicasts are looped back for other local users
 	 */
@@ -364,12 +373,20 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
+	int ret;
 
 	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
 
+	ret = cgroup_bpf_run_filter(sk_to_full_sk(sk), skb,
+				    BPF_CGROUP_INET_EGRESS);
+	if (ret) {
+		kfree_skb(skb);
+		return ret;
+	}
+
 	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
 			    net, sk, skb, NULL, dev,
 			    ip_finish_output,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6001e78..1947026 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -39,6 +39,7 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 
+#include <linux/bpf-cgroup.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter_ipv6.h>
 
@@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 	struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
+	int ret;
 
 	if (unlikely(idev->cnf.disable_ipv6)) {
 		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
@@ -150,6 +152,13 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 		return 0;
 	}
 
+	ret = cgroup_bpf_run_filter(sk_to_full_sk(sk), skb,
+				    BPF_CGROUP_INET_EGRESS);
+	if (ret) {
+		kfree_skb(skb);
+		return ret;
+	}
+
 	return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
 			    net, sk, skb, NULL, dev,
 			    ip6_finish_output,
-- 
2.7.4


* [PATCH v7 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups
From: Daniel Mack @ 2016-10-25 10:14 UTC (permalink / raw)
  To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack

Add a simple userspace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:

 * Create arraymap in kernel with 4 byte keys and 8 byte values

 * Load eBPF program

   The eBPF program accesses the map passed in to store two pieces of
   information. The number of invocations of the program, which maps
   to the number of packets received, is stored to key 0. Key 1 is
   incremented on each iteration by the number of bytes stored in
   the skb.

 * Detach any eBPF program previously attached to the cgroup

 * Attach the new program to the cgroup using BPF_PROG_ATTACH

 * Once a second, read map[0] and map[1] to see how many bytes and
   packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress"
or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
which will make the generated eBPF program return 0 instead of 1, so
the kernel will drop the packet.

libbpf gained two new wrappers for the new syscall commands.
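
For example, with a cgroup v2 hierarchy mounted at /sys/fs/cgroup (the
mount point is purely illustrative), counting the ingress traffic of
all tasks in a cgroup 'foo' would look like:

  ./test_cgrp2_attach /sys/fs/cgroup/foo ingress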

Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
Acked-by: Alexei Starovoitov <ast-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 samples/bpf/Makefile            |   2 +
 samples/bpf/libbpf.c            |  21 ++++++
 samples/bpf/libbpf.h            |   3 +
 samples/bpf/test_cgrp2_attach.c | 147 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 173 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 12b7304..e4cdc74 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -49,6 +50,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..9ce707b 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+	union bpf_attr attr = {
+		.target_fd = target_fd,
+		.attach_bpf_fd = prog_fd,
+		.attach_type = type,
+	};
+
+	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+	union bpf_attr attr = {
+		.target_fd = target_fd,
+		.attach_type = type,
+	};
+
+	return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 int bpf_obj_pin(int fd, const char *pathname)
 {
 	union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index ac6edb6..d0a799a 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 		  const struct bpf_insn *insns, int insn_len,
 		  const char *license, int kern_version);
 
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 0000000..63ef208
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,147 @@
+/* eBPF example program:
+ *
+ * - Creates arraymap in kernel with 4 byte keys and 8 byte values
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program accesses the map passed in to store two pieces of
+ *   information. The number of invocations of the program, which maps
+ *   to the number of packets received, is stored to key 0. Key 1 is
+ *   incremented on each iteration by the number of bytes stored in
+ *   the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ *   packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+
+#include <linux/bpf.h>
+
+#include "libbpf.h"
+
+enum {
+	MAP_KEY_PACKETS,
+	MAP_KEY_BYTES,
+};
+
+static int prog_load(int map_fd, int verdict)
+{
+	struct bpf_insn prog[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* save ctx (r1) in r6; r1-r5 are clobbered by BPF_CALL */
+
+		/* Count packets */
+		BPF_MOV64_IMM(BPF_REG_0, MAP_KEY_PACKETS), /* r0 = 0 */
+		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+		BPF_LD_MAP_FD(BPF_REG_1, map_fd), /* load map fd to r1 */
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+
+		/* Count bytes */
+		BPF_MOV64_IMM(BPF_REG_0, MAP_KEY_BYTES), /* r0 = 1 */
+		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_6, offsetof(struct __sk_buff, len)), /* r1 = skb->len */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+
+		BPF_MOV64_IMM(BPF_REG_0, verdict), /* r0 = verdict */
+		BPF_EXIT_INSN(),
+	};
+
+	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SKB,
+			     prog, sizeof(prog), "GPL", 0);
+}
+
+static int usage(const char *argv0)
+{
+	printf("Usage: %s <cg-path> <egress|ingress> [drop]\n", argv0);
+	return EXIT_FAILURE;
+}
+
+int main(int argc, char **argv)
+{
+	int cg_fd, map_fd, prog_fd, key, ret;
+	long long pkt_cnt, byte_cnt;
+	enum bpf_attach_type type;
+	int verdict = 1;
+
+	if (argc < 3)
+		return usage(argv[0]);
+
+	if (strcmp(argv[2], "ingress") == 0)
+		type = BPF_CGROUP_INET_INGRESS;
+	else if (strcmp(argv[2], "egress") == 0)
+		type = BPF_CGROUP_INET_EGRESS;
+	else
+		return usage(argv[0]);
+
+	if (argc > 3 && strcmp(argv[3], "drop") == 0)
+		verdict = 0;
+
+	cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
+	if (cg_fd < 0) {
+		printf("Failed to open cgroup path: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY,
+				sizeof(key), sizeof(byte_cnt),
+				256, 0);
+	if (map_fd < 0) {
+		printf("Failed to create map: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	prog_fd = prog_load(map_fd, verdict);
+	printf("Output from kernel verifier:\n%s\n-------\n", bpf_log_buf);
+
+	if (prog_fd < 0) {
+		printf("Failed to load prog: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	ret = bpf_prog_detach(cg_fd, type);
+	printf("bpf_prog_detach() returned '%s' (%d)\n", strerror(errno), errno);
+
+	ret = bpf_prog_attach(prog_fd, cg_fd, type);
+	if (ret < 0) {
+		printf("Failed to attach prog to cgroup: '%s'\n",
+		       strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	while (1) {
+		key = MAP_KEY_PACKETS;
+		assert(bpf_lookup_elem(map_fd, &key, &pkt_cnt) == 0);
+
+		key = MAP_KEY_BYTES;
+		assert(bpf_lookup_elem(map_fd, &key, &byte_cnt) == 0);
+
+		printf("cgroup received %lld packets, %lld bytes\n",
+		       pkt_cnt, byte_cnt);
+		sleep(1);
+	}
+
+	return EXIT_SUCCESS;
+}
-- 
2.7.4


* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
From: Pablo Neira Ayuso @ 2016-10-26 19:59 UTC (permalink / raw)
  To: Daniel Mack
  Cc: htejun, daniel, ast, davem, kafai, fw, harald, netdev, sargun, cgroups

On Tue, Oct 25, 2016 at 12:14:08PM +0200, Daniel Mack wrote:
[...]
>   Dumping programs once they are installed is problematic because of
>   the internal optimizations done to the eBPF program during its
>   lifetime. Also, the references to maps etc. would need to be
>   restored during the dump.
> 
>   Just exposing whether or not a program is attached would be
>   trivial to do, however, most easily through another bpf(2)
>   command. That can be added later on though.

I don't know if anyone told you, but during last netconf, this topic
took a bit of time of discussion and it was controversial, I would say
1/3 of netdev hackers there showed their concerns, and that's
something that should not be skipped IMO.

While xdp is pushing bpf programs at the very early packet path, not
interfering with the stack, before even entering the generic ingress
path. But this is adding hooks to push bpf programs in the middle of
our generic stack, this is way different domain.

I would really like to explore way earlier filtering, by extending
socket lookup facilities. So far the problem seems to be that we need
to lookup for broadcast/multicast UDP sockets and those cannot be
attach via the usual skb->sk. I think it would be possible to wrap
around this socket code in functions so we can invoke it. I guess
filtering of UDP and TCP should be good for you at this stage. This
would require more work though, but this would come with no hooks in
the stack and packets will not have to consume *lots of cycles* just
to be dropped before entering the socket queue.

How useful can be to drop lots of unwanted traffic at such a late
stage? How would the performance numbers to drop packets would look
like? Extremely bad, I predict.


* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
From: Alexei Starovoitov @ 2016-10-27  3:35 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Daniel Mack, htejun, daniel, ast, davem, kafai, fw, harald,
	netdev, sargun, cgroups

On Wed, Oct 26, 2016 at 09:59:33PM +0200, Pablo Neira Ayuso wrote:
> On Tue, Oct 25, 2016 at 12:14:08PM +0200, Daniel Mack wrote:
> [...]
> >   Dumping programs once they are installed is problematic because of
> >   the internal optimizations done to the eBPF program during its
> >   lifetime. Also, the references to maps etc. would need to be
> >   restored during the dump.
> > 
> >   Just exposing whether or not a program is attached would be
> >   trivial to do, however, most easily through another bpf(2)
> >   command. That can be added later on though.
> 
> I don't know if anyone told you, but during last netconf, this topic
> took a bit of time of discussion and it was controversial, I would say
> 1/3 of netdev hackers there showed their concerns, and that's
> something that should not be skipped IMO.

Though I attended netconf over hangouts, I think it was pretty
clear that bpf needs 'introspection' of loaded bpf programs and it
was a universal desire of everyone. Not 1/3 of hackers.
As commit log says it's an orthogonal work and over the last
month we've been discussing pros and cons of different approaches.
The audit infra, tracepoints and other ideas.
We kept the discussion in private because, unfortunately, public
discussions are not fruitful due to threads like this one.

The further points below were disputed many times in the past.
Let's address them one more time:

> path. But this is adding hooks to push bpf programs in the middle of
> our generic stack, this is way different domain.

incorrect. look at socket filters, cls_bpf.
bpf was running in the middle of the stack for years.
Even unix pipe and netfilter can be filtered with bpf.

> around this socket code in functions so we can invoke it. I guess
> filtering of UDP and TCP should be good for you at this stage.

DanielM mentioned few times that it's not only about UDP and TCP.

> This
> would require more work though, but this would come with no hooks in
> the stack and packets will not have to consume *lots of cycles* just
> to be dropped before entering the socket queue.

packets don't consume 'lots of cycles'. This is not a typical
n-tuple firewall framework. Not a DoS mitigation either. Please read
the cover letter and earlier submissions.
It's a framework centered around cgroups.
_Nothing_ in the current stack provides cgroup based monitoring
and application protection. Earlier cgroupv1 controllers don't
scale and we really cannot have more of v1 net controllers.
At the same time we've been brainstorming how this patch set
can work with v1. It's not easy. We're not giving up though.
For now it's v2 only.
Note that another two patchsets depend on this core cgroup+bpf framework.
It's not about hooks (socket/skb inspection points).
Both rx and tx hooks _can_ be moved in the future.
Unlike netfilter hooks, the hooks in patches 4 and 5 can be moved
if we find better place for them. It won't affect userspace.
The only assumption of BPF_CGROUP_INET_[E|IN]GRESS is that
there will be some place in packet delivery from nic to userspace
where bpf program can look at L3+ packet only if application is
under particular cgroup. Patches 4 and 5 are the best places so far.
It took months to converge on these points. Please see earlier
discussions on pro/con of every other place we've tried.

> How useful can be to drop lots of unwanted traffic at such a late
> stage? How would the performance numbers to drop packets would look
> like? Extremely bad, I predict.

facts check please. Daniel provided numbers before that show
excellent performance.


* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
From: Daniel Mack @ 2016-10-27  8:40 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On 10/26/2016 09:59 PM, Pablo Neira Ayuso wrote:
> On Tue, Oct 25, 2016 at 12:14:08PM +0200, Daniel Mack wrote:
> [...]
>>   Dumping programs once they are installed is problematic because of
>>   the internal optimizations done to the eBPF program during its
>>   lifetime. Also, the references to maps etc. would need to be
>>   restored during the dump.
>>
>>   Just exposing whether or not a program is attached would be
>>   trivial to do, however, most easily through another bpf(2)
>>   command. That can be added later on though.
> 
> I don't know if anyone told you, but during last netconf, this topic
> took a bit of time of discussion and it was controversial, I would say
> 1/3 of netdev hackers there showed their concerns, and that's
> something that should not be skipped IMO.
> 
> While xdp is pushing bpf programs at the very early packet path, not
> interfering with the stack, before even entering the generic ingress
> path. But this is adding hooks to push bpf programs in the middle of
> our generic stack, this is way different domain.

It's not anything new. These hooks live on the very same level as
SO_ATTACH_FILTER. The only differences are that the BPF programs are
stored in the cgroup, and not in the socket, and that they exist for
egress as well.

> I would really like to explore way earlier filtering, by extending
> socket lookup facilities. So far the problem seems to be that we need
> to lookup for broadcast/multicast UDP sockets and those cannot be
> attach via the usual skb->sk.

We've been there. We've discussed all that. And we concluded that doing
early demux in the input filter path is not the right approach. That was
my very first take on that issue back in June 2015 (!), and it was
rightfully turned down for good reasons.

Adding it there would mean we need to early-demux *every* packet as
soon as there is *any* such rule installed, and that renders useless
many kernel optimizations that drop traffic which has no local
receiver.

> I think it would be possible to wrap
> around this socket code in functions so we can invoke it. I guess
> filtering of UDP and TCP should be good for you at this stage. This
> would require more work though, but this would come with no hooks in
> the stack and packets will not have to consume *lots of cycles* just
> to be dropped before entering the socket queue.
>
> How useful can be to drop lots of unwanted traffic at such a late
> stage? How would the performance numbers to drop packets would look
> like? Extremely bad, I predict.

I fear I'm repeating myself here, but this is unfounded. I'm not sure
why you keep bringing it up. As I said weeks ago - just loading the
netfilter modules without any rules deployed has more impact than
running the example program in 6/6 on every packet in the test traffic.
Please give it a shot yourself.

Also, the eBPF programs can well be used in combination with existing
netfilter setups. There is no reason to not combine the two levels of
filtering. Both have their right to exist, and nobody is taking anything
away.


Thanks,
Daniel


* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
From: Pablo Neira Ayuso @ 2016-10-28 11:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Mack, htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hi Alexei,

On Wed, Oct 26, 2016 at 08:35:04PM -0700, Alexei Starovoitov wrote:
> On Wed, Oct 26, 2016 at 09:59:33PM +0200, Pablo Neira Ayuso wrote:
> > On Tue, Oct 25, 2016 at 12:14:08PM +0200, Daniel Mack wrote:
> > [...]
> > >   Dumping programs once they are installed is problematic because of
> > >   the internal optimizations done to the eBPF program during its
> > >   lifetime. Also, the references to maps etc. would need to be
> > >   restored during the dump.
> > > 
> > >   Just exposing whether or not a program is attached would be
> > >   trivial to do, however, most easily through another bpf(2)
> > >   command. That can be added later on though.
> > 
> > I don't know if anyone told you, but during last netconf, this topic
> > took a bit of time of discussion and it was controversial, I would say
> > 1/3 of netdev hackers there showed their concerns, and that's
> > something that should not be skipped IMO.
> 
> Though I attended netconf over hangouts, I think it was pretty
> clear that bpf needs 'introspection' of loaded bpf programs and it
> was a universal desire of everyone. Not 1/3 of hackers.

Introspection is a different thing, very useful, no doubt. But this
infrastructure is allowing way more than simple innocuous introspection.

> As commit log says it's an orthogonal work and over the last
> month we've been discussing pros and cons of different approaches.
> The audit infra, tracepoints and other ideas.
> We kept the discussion in private because, unfortunately, public
> discussions are not fruitful due to threads like this one.

We need to understand what people are trying to solve and in what way.
That's why we have all those conferences and places to meet and
discuss too. Please, don't think like that, this is sending the wrong
message to everyone here, that is kind of: bypass public discussions
and don't take time to describe what you're doing since it is a waste
of time. That's not good.

> The further points below were disputed many times in the past.
> Let's address them one more time:
> 
> > path. But this is adding hooks to push bpf programs into the middle of
> > our generic stack; this is a very different domain.
> 
> incorrect. look at socket filters, cls_bpf.

Classic socket filters don't allow you to deploy a global policy in
such a fine-grained way as this is doing. Then, cls_bpf is fine since it
is visible via the tc command, so sysadmins can use tools they are
familiar with to inspect policies and say "oh look, some of the
processes I'm deploying have installed filters via cls_bpf". This
approach, however, is not visible at all.

[...]
> > around this socket code in functions so we can invoke it. I guess
> > filtering of UDP and TCP should be good for you at this stage.
> 
> DanielM mentioned a few times that it's not only about UDP and TCP.

OK, since this is limited to the scope of inet sockets, let's revisit
what we have around: DCCP is hopeless, who cares. We also have SCTP,
which is deployed by telcos in datacenters; it cannot reach beyond that
domain because many Internet gateways are broken for it, so you may not
get too far with it. Arguably it would be good to have SCTP support at
some point, but I guess this is not a priority now. Then, UDPlite
almost comes for free since it relies on the existing UDP
infrastructure; it's basically UDP with a few more features. What
else?

> > This would require more work, but this would come with no
> > hooks in the stack and packets will not have to consume *lots of
> > cycles* just to be dropped before entering the socket queue.
> 
> packets don't consume 'lots of cycles'. This is not a typical
> n-tuple firewall framework. Not a DoS mitigation either. Please read
> the cover letter and earlier submissions.
> It's a framework centered around cgroups.
> _Nothing_ in the current stack provides cgroup-based monitoring
> and application protection. Earlier cgroupv1 controllers don't
> scale and we really cannot have more of v1 net controllers.
> At the same time we've been brainstorming how this patch set
> can work with v1. It's not easy. We're not giving up though.
> For now it's v2 only.
> Note that another two patchsets depend on this core cgroup+bpf framework.

I saw those. I would really like to have a closer look at David
Ahern's use case, since that skb iif mangling looks kludgy to me, and
given this is exposing a new helper for general use, not only vrf, it
would be good to make sure helpers provide something useful for
everyone. So that new helper is questionable at this stage IMO. I'm
concerned that people may start using bpf as the adhesive tape to glue
things together around what are probably design problems.

The other patchset, I guess you refer to the new lsm module: I would
suggest we address one thing at a time; I guess he can start without
relying on this chunk, as it can follow up later anyway.

In general, I think bpf is very useful, but I think we have to
accommodate new things in a way that makes sense with what we have. We
have traditionally followed an "add in-kernel infrastructure + provide
a userspace interface as control plane" approach for a long time. I
have concerns (and my impression is that others were concerned at
netconf too) about whether we can go from the existing approach to a
fully uninfrastructured use-bpf-everywhere, from one day to the next,
without even taking the time to discuss the consequences of this
decision.

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
       [not found]     ` <c9683122-d770-355b-e275-7c446e6d1d0f-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
@ 2016-10-28 11:53       ` Pablo Neira Ayuso
  2016-10-28 12:07         ` Daniel Mack
  2016-10-29  3:51       ` Lorenzo Colitti
  1 sibling, 1 reply; 25+ messages in thread
From: Pablo Neira Ayuso @ 2016-10-28 11:53 UTC (permalink / raw)
  To: Daniel Mack
  Cc: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, Oct 27, 2016 at 10:40:14AM +0200, Daniel Mack wrote:
> On 10/26/2016 09:59 PM, Pablo Neira Ayuso wrote:
> > On Tue, Oct 25, 2016 at 12:14:08PM +0200, Daniel Mack wrote:
> > [...]
> >>   Dumping programs once they are installed is problematic because of
> >>   the internal optimizations done to the eBPF program during its
> >>   lifetime. Also, the references to maps etc. would need to be
> >>   restored during the dump.
> >>
> >>   Just exposing whether or not a program is attached would be
> >>   trivial to do, however, most easily through another bpf(2)
> >>   command. That can be added later on though.
> > 
> > I don't know if anyone told you, but during the last netconf this topic
> > took up a fair amount of discussion and was controversial; I would say
> > 1/3 of the netdev hackers there voiced their concerns, and that's
> > something that should not be skipped IMO.
> > 
> > xdp is pushing bpf programs at the very early packet path, not
> > interfering with the stack, before even entering the generic ingress
> > path. But this is adding hooks to push bpf programs into the middle of
> > our generic stack; this is a very different domain.
> 
> It's not anything new. These hooks live on the very same level as
> SO_ATTACH_FILTER. The only differences are that the BPF programs are
> stored in the cgroup, and not in the socket, and that they exist for
> egress as well.

Can we agree this is going further than SO_ATTACH_FILTER?

> > I would really like to explore way earlier filtering, by extending
> > socket lookup facilities. So far the problem seems to be that we need
> > to look up broadcast/multicast UDP sockets, and those cannot be
> > attached via the usual skb->sk.
> 
> We've been there. We've discussed all that. And we concluded that doing
> early demux in the input filter path is not the right approach. That was
> my very first take on that issue back in June 2015 (!), and it was
> rightfully turned down for good reasons.
> 
> Adding it there would mean we need to early-demux *every* packet as soon
> as there is *any* such rule installed, and that renders many
> optimizations in the kernel to drop traffic that has no local receiver
> useless.

I think such a concern applies to doing early demux unconditionally in
all possible scenarios (such as UDP broadcast/multicast), which implies
wasted cycles for people not requiring this.

If we can do that demuxing in an optional way, i.e. only when socket
filtering is required, then only those that need it would pay that
price. Actually, if we can do this demux very early, from ingress, the
performance numbers would also be good for any socket-based
filtering.

[...]
> > I think it would be possible to wrap
> > around this socket code in functions so we can invoke it. I guess
> > filtering of UDP and TCP should be good for you at this stage. This
> > would require more work, but this would come with no hooks in
> > the stack and packets will not have to consume *lots of cycles* just
> > to be dropped before entering the socket queue.
> >
> > How useful can it be to drop lots of unwanted traffic at such a late
> > stage? What would the performance numbers for dropping packets look
> > like? Extremely bad, I predict.
> 
> I fear I'm repeating myself here, but this is unfounded. I'm not sure
> why you keep bringing it up. As I said weeks ago - just loading the
> netfilter modules without any rules deployed has more impact than
> running the example program in 6/6 on every packet in the test traffic.

I guess you're using an old kernel and referring to iptables; this has
not been true for some time, so we don't have any impact now from
loaded iptables modules.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
  2016-10-28 11:53       ` Pablo Neira Ayuso
@ 2016-10-28 12:07         ` Daniel Mack
  0 siblings, 0 replies; 25+ messages in thread
From: Daniel Mack @ 2016-10-28 12:07 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: htejun, daniel, ast, davem, kafai, fw, harald, netdev, sargun, cgroups

On 10/28/2016 01:53 PM, Pablo Neira Ayuso wrote:
> On Thu, Oct 27, 2016 at 10:40:14AM +0200, Daniel Mack wrote:

>> It's not anything new. These hooks live on the very same level as
>> SO_ATTACH_FILTER. The only differences are that the BPF programs are
>> stored in the cgroup, and not in the socket, and that they exist for
>> egress as well.
> 
> Can we agree this is going further than SO_ATTACH_FILTER?

It's the same level. Only the way of setting the program(s) is different.

>> Adding it there would mean we need to early-demux *every* packet as soon
>> as there is *any* such rule installed, and that renders many
>> optimizations in the kernel to drop traffic that has no local receiver
>> useless.
> 
> I think such a concern applies to doing early demux unconditionally in
> all possible scenarios (such as UDP broadcast/multicast), which implies
> wasted cycles for people not requiring this.

If you have a rule that acts on a condition based on a local receiver
detail such as a cgroup membership, then the INPUT filter *must* know
the local receiver for *all* packets passing by, otherwise it cannot act
upon it. And that means that you have to early-demux in any case as long
as at least one such rule exists.

> If we can do that demuxing in an optional way, i.e. only when socket
> filtering is required, then only those that need it would pay that
> price. Actually, if we can do this demux very early, from ingress, the
> performance numbers would also be good for any socket-based
> filtering.

For multicast, rules have to be executed for each receiver, which is
another reason why the INPUT path is the wrong place to solve the problem.

You actually convinced me yourself about these details, but you seem to
constantly change your opinion about all this. Why is this such a
whack-a-mole game?

> I guess you're using an old kernel and referring to iptables; this has
> not been true for some time, so we don't have any impact now from
> loaded iptables modules.

My point is that the performance decrease introduced by my patch set is
not really measurable, even if you pipe all the wire-saturating test
traffic through the example program. At least not with my setup here. If
a local receiver has no applicable bpf program in its cgroup, the logic
bails out way earlier, leading to even less overhead. And if no cgroup
has any program attached, the code is basically a no-op thanks to the
static branch. I really see no reason to block this patch set due to
unfounded claims of bad performance.
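
The static branch pattern is the usual one; conceptually something like
this (a sketch with made-up symbol names, not necessarily those of the
patches):

  /* With no program attached anywhere, the branch below is patched
   * to a nop, so the per-packet cost effectively vanishes. */
  #include <linux/jump_label.h>
  #include <linux/skbuff.h>
  #include <linux/bpf.h>

  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);

  static inline int cgroup_bpf_filter(struct sock *sk,
                                      struct sk_buff *skb,
                                      enum bpf_attach_type type)
  {
          if (static_branch_unlikely(&cgroup_bpf_enabled_key))
                  return __cgroup_bpf_run_filter(sk, skb, type);
          return 0;       /* nothing attached: accept */
  }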


Thanks,
Daniel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
  2016-10-28 11:28       ` Pablo Neira Ayuso
@ 2016-10-28 15:00         ` David Ahern
  2016-10-29  1:42         ` Alexei Starovoitov
  1 sibling, 0 replies; 25+ messages in thread
From: David Ahern @ 2016-10-28 15:00 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Alexei Starovoitov
  Cc: Daniel Mack, htejun, daniel, ast, davem, kafai, fw, harald,
	netdev, sargun, cgroups

On 10/28/16 5:28 AM, Pablo Neira Ayuso wrote:
> I saw those. I would really like to have a closer look at David
> Ahern's use case, since that skb iif mangling looks kludgy to me, and
> given this is exposing a new helper for general use, not only vrf, it
> would be good to make sure helpers provide something useful for
> everyone. So that new helper is questionable at this stage IMO. I'm
> concerned that people may start using bpf as the adhesive tape to glue
> things together around what are probably design problems.

It's changing sk_bound_dev_if on a socket when the socket is created,
not skb iif. I explain the need in the cover letter. I have also
explained the use case in the other attempts to solve this problem --
a standalone l3mdev cgroup and adding a default setting to the task
struct.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
  2016-10-28 11:28       ` Pablo Neira Ayuso
  2016-10-28 15:00         ` David Ahern
@ 2016-10-29  1:42         ` Alexei Starovoitov
  1 sibling, 0 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2016-10-29  1:42 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Daniel Mack, htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Oct 28, 2016 at 01:28:39PM +0200, Pablo Neira Ayuso wrote:
> Hi Alexei,
> 
> On Wed, Oct 26, 2016 at 08:35:04PM -0700, Alexei Starovoitov wrote:
> > On Wed, Oct 26, 2016 at 09:59:33PM +0200, Pablo Neira Ayuso wrote:
> > > On Tue, Oct 25, 2016 at 12:14:08PM +0200, Daniel Mack wrote:
> > > [...]
> > > >   Dumping programs once they are installed is problematic because of
> > > >   the internal optimizations done to the eBPF program during its
> > > >   lifetime. Also, the references to maps etc. would need to be
> > > >   restored during the dump.
> > > > 
> > > >   Just exposing whether or not a program is attached would be
> > > >   trivial to do, however, most easily through another bpf(2)
> > > >   command. That can be added later on though.
> > > 
> > > I don't know if anyone told you, but during the last netconf this topic
> > > took up a fair amount of discussion and was controversial; I would say
> > > 1/3 of the netdev hackers there voiced their concerns, and that's
> > > something that should not be skipped IMO.
> > 
> > Though I attended netconf over hangouts, I think it was pretty
> > clear that bpf needs 'introspection' of loaded bpf programs and it
> > was a universal desire of everyone. Not 1/3 of hackers.
> 
> Introspection is a different thing, very useful, no doubt. But this
> infrastructure allows way more than simple, innocuous introspection.

During netconf the discussion around bpf introspection was about
making bpf more visible to user space.
This patch set adds the cgroup+bpf infrastructure, which is orthogonal.

> > As the commit log says, it's orthogonal work, and over the last
> > month we've been discussing the pros and cons of different approaches:
> > the audit infra, tracepoints and other ideas.
> > We kept the discussion private because, unfortunately, public
> > discussions are not fruitful due to threads like this one.
> 
> We need to understand what people are trying to solve and in what way.
> That's why we have all those conferences and places to meet and
> discuss, too. Please don't think like that; it sends the wrong message
> to everyone here, which is, in effect: bypass public discussions and
> don't take the time to describe what you're doing since it is a waste
> of time. That's not good.

In general I agree that it's best to discuss technical topics
on the mailing list and resolve personal conflicts in person
during the conferences.
Unfortunately the obstructionism seen in this thread forces people
to discuss in private to have a constructive conversation.
Anyway, neither your reply nor mine has anything to do with this patch
set, and both are a typical example of why I prefer to discuss
bpf-related ideas in private.

> > The further points below were disputed many times in the past.
> > Let's address them one more time:
> > 
> > > path. But this is adding hooks to push bpf programs into the middle of
> > > our generic stack; this is a very different domain.
> > 
> > incorrect. look at socket filters, cls_bpf.
> 
> Classic socket filters don't allow you to deploy a global policy in
> such a fine-grained way as this is doing. Then, cls_bpf is fine since it
> is visible via the tc command, so sysadmins can use tools they are
> familiar with to inspect policies and say "oh look, some of the
> processes I'm deploying have installed filters via cls_bpf". This
> approach, however, is not visible at all.

the audit and/or tracepoint infra that is being worked on in parallel
will answer this question in full detail.
It's no different from an application using seccomp to control its
children, or cls_bpf doing policy enforcement for containers, or xdp
doing low-level filtering.
xdp has a boolean "is a program loaded" flag, and it turned out
to be not very useful for debuggability.
It should be solved in a universal way for all bpf programs.
We're not going to add something specific to this cgroup+bpf infra,
since it will be just as clumsy as xdp's bool flag.
Once again, audit/tracepoint for bpf prog introspection is separate
work, orthogonal to this set.

> [...]
> > > around this socket code in functions so we can invoke it. I guess
> > > filtering of UDP and TCP should be good for you at this stage.
> > 
> > DanielM mentioned a few times that it's not only about UDP and TCP.
> 
> OK, since this is limited to the scope of inet sockets, let's revisit
> what we have around: DCCP is hopeless, who cares. We also have SCTP,
> which is deployed by telcos in datacenters; it cannot reach beyond that
> domain because many Internet gateways are broken for it, so you may not
> get too far with it. Arguably it would be good to have SCTP support at
> some point, but I guess this is not a priority now. Then, UDPlite
> almost comes for free since it relies on the existing UDP
> infrastructure; it's basically UDP with a few more features. What
> else?

thank you for the nice description of the current state of sctp,
but it has nothing to do with this patch set. The goal is to
monitor application networking activity at the L3 level.

> > > This would require more work, but this would come with no
> > > hooks in the stack and packets will not have to consume *lots of
> > > cycles* just to be dropped before entering the socket queue.
> > 
> > packets don't consume 'lots of cycles'. This is not a typical
> > n-tuple firewall framework. Not a DoS mitigation either. Please read
> > the cover letter and earlier submissions.
> > It's a framework centered around cgroups.
> > _Nothing_ in the current stack provides cgroup-based monitoring
> > and application protection. Earlier cgroupv1 controllers don't
> > scale and we really cannot have more of v1 net controllers.
> > At the same time we've been brainstorming how this patch set
> > can work with v1. It's not easy. We're not giving up though.
> > For now it's v2 only.
> > Note that another two patchsets depend on this core cgroup+bpf framework.
> 
> I saw those. I would really like to have a closer look at David
> Ahern's use case, since that skb iif mangling looks kludgy to me, and
> given this is exposing a new helper for general use, not only vrf, it
> would be good to make sure helpers provide something useful for
> everyone. So that new helper is questionable at this stage IMO. I'm
> concerned that people may start using bpf as the adhesive tape to glue
> things together around what are probably design problems.

the whole point of bpf is to be the fastest glue, so that kernel people
(like you and me) move out of the way for others to make their
own architectural choices for networking, DoS, firewalls, policy,
tracing, security, sandboxing and so on.
The job of the kernel is to be the best tool and not dictate
how firewalls should be done. People who work on DoS prevention
today have to use things like DPDK not only for performance reasons,
but also for the flexible design choices and the ability to change
them quickly.
This cgroup+bpf set is a fundamental building block for all sorts of
other things like lsm, vrf, port binding policy and whatever
comes next. bpf is really nothing but the glue.

> The other patchset, I guess you refer to the new lsm module: I would
> suggest we address one thing at a time; I guess he can start without
> relying on this chunk, as it can follow up later anyway.

lsm+bpf infra is only usable either via seccomp or via cgroup.
For our use case the seccomp way is a non-starter, since we manage
containers. Hence cgroup support must be there from the beginning,
and therefore the strong dependency on this set.

> In general, I think bpf is very useful, but I think we have to
> accommodate new things in a way that makes sense with what we have. We
> have traditionally followed an "add in-kernel infrastructure + provide
> a userspace interface as control plane" approach for a long time. I
> have concerns (and my impression is that others were concerned at
> netconf too) about whether we can go from the existing approach to a
> fully uninfrastructured use-bpf-everywhere, from one day to the next,
> without even taking the time to discuss the consequences of this
> decision.

bpf is already used everywhere and the work on bpf visibility is
orthogonal. Arguing that cgroup+bpf is no good without it is only
going to stall progress of everything that needs to be cgroup aware.
Note we already have experience dealing with cgroups from cls_bpf
via the bpf_skb_under_cgroup() helper. And it didn't scale.
Hence this patch set.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
       [not found]     ` <c9683122-d770-355b-e275-7c446e6d1d0f-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
  2016-10-28 11:53       ` Pablo Neira Ayuso
@ 2016-10-29  3:51       ` Lorenzo Colitti
       [not found]         ` <CAKD1Yr2aRDNUxX8onReZyURufphxGoSTek=Fjk3Wswq9WOVp4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 25+ messages in thread
From: Lorenzo Colitti @ 2016-10-29  3:51 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Pablo Neira Ayuso, htejun-b10kYP2dOMg, Daniel Borkmann,
	ast-b10kYP2dOMg, David Miller, kafai-b10kYP2dOMg,
	Florian Westphal, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, Oct 27, 2016 at 5:40 PM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
> It's not anything new. These hooks live on the very same level as
> SO_ATTACH_FILTER. The only differences are that the BPF programs are
> stored in the cgroup, and not in the socket, and that they exist for
> egress as well.

What's the use case for egress?

We (android networking) are currently looking at implementing network
accounting via eBPF in order to replace the out-of-tree xt_qtaguid
code. A per-cgroup eBPF program run on all traffic would be great. But
when we looked at this patchset we realized it would not be useful for
accounting purposes because even if a packet is counted here, it might
still be dropped by netfilter hooks.

It seems like it would be much more useful to be able to do this in an
iptables rule.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
       [not found]         ` <CAKD1Yr2aRDNUxX8onReZyURufphxGoSTek=Fjk3Wswq9WOVp4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-10-29  4:51           ` Alexei Starovoitov
       [not found]             ` <20161029045107.GA61294-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2016-10-29  4:51 UTC (permalink / raw)
  To: Lorenzo Colitti
  Cc: Daniel Mack, Pablo Neira Ayuso, htejun-b10kYP2dOMg,
	Daniel Borkmann, ast-b10kYP2dOMg, David Miller,
	kafai-b10kYP2dOMg, Florian Westphal,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA

On Sat, Oct 29, 2016 at 12:51:37PM +0900, Lorenzo Colitti wrote:
> On Thu, Oct 27, 2016 at 5:40 PM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
> > It's not anything new. These hooks live on the very same level as
> > SO_ATTACH_FILTER. The only differences are that the BPF programs are
> > stored in the cgroup, and not in the socket, and that they exist for
> > egress as well.
> 
> What's the use case for egress?
> 
> We (android networking) are currently looking at implementing network
> accounting via eBPF in order to replace the out-of-tree xt_qtaguid
> code. A per-cgroup eBPF program run on all traffic would be great. But
> when we looked at this patchset we realized it would not be useful for
> accounting purposes because even if a packet is counted here, it might
> still be dropped by netfilter hooks.

don't use the out-of-tree code and instead do the dropping with this
mechanism or any other in-kernel method? ;)
We (facebook infrastructure) have been using iptables and bpf networking
together with great success. They nicely co-exist and complement each other.
There is no need to reinvent the wheel if the existing solution works.
iptables are great for their purpose.

> It seems like it would be much more useful to be able to do this in an
> iptables rule.

there is iptables+cBPF support. It's being used in some cases already.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
       [not found]             ` <20161029045107.GA61294-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
@ 2016-10-29  4:59               ` Lorenzo Colitti
       [not found]                 ` <CAKD1Yr2pMk52h7BdRwTvGwnP5+ONmr4ac6cyUBoZ9P+Kt-B8jw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Colitti @ 2016-10-29  4:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Mack, Pablo Neira Ayuso, htejun-b10kYP2dOMg,
	Daniel Borkmann, ast-b10kYP2dOMg, David Miller,
	kafai-b10kYP2dOMg, Florian Westphal,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Sargun Dhillon, cgroups-u79uwXL29TY76Z2rM5mHXA

On Sat, Oct 29, 2016 at 1:51 PM, Alexei Starovoitov
<alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> What's the use case for egress?
>>
>> We (android networking) are currently looking at implementing network
>> accounting via eBPF in order to replace the out-of-tree xt_qtaguid
>> code. A per-cgroup eBPF program run on all traffic would be great. But
>> when we looked at this patchset we realized it would not be useful for
>> accounting purposes because even if a packet is counted here, it might
>> still be dropped by netfilter hooks.
>
> don't use the out-of-tree code and instead do the dropping with this
> mechanism or any other in-kernel method? ;)

Getting rid of out-of-tree code is the goal, yes. We do have a
requirement that things continue to work, though. Accounting for a
packet in ip{,6}_output is not correct if that packet ends up being
dropped by iptables later on.

> We (facebook infrastructure) have been using iptables and bpf networking
> together with great success. They nicely co-exist and complement each other.
> There is no need to reinvent the wheel if the existing solution works.
> iptables are great for their purpose.

That doesn't really answer my "what is the use case for egress"
question though, right? Or are you saying "we use this, but we can't
talk about how we use it"?

> there is iptables+cBPF support. It's being used in some cases already.

Adding eBPF support to the xt_bpf iptables code would be an option for
what we want to do, yes. I think this requires the eBPF map to be
an fd that is available to the process that exec()s iptables, but we
could do that.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
       [not found]                 ` <CAKD1Yr2pMk52h7BdRwTvGwnP5+ONmr4ac6cyUBoZ9P+Kt-B8jw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-10-29  6:24                   ` Alexei Starovoitov
  2016-10-29 15:34                     ` Lorenzo Colitti
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2016-10-29  6:24 UTC (permalink / raw)
  To: Lorenzo Colitti
  Cc: Daniel Mack, Pablo Neira Ayuso, htejun-b10kYP2dOMg,
	Daniel Borkmann, ast-b10kYP2dOMg, David Miller,
	kafai-b10kYP2dOMg, Florian Westphal,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Sargun Dhillon, cgroups-u79uwXL29TY76Z2rM5mHXA

On Sat, Oct 29, 2016 at 01:59:23PM +0900, Lorenzo Colitti wrote:
> On Sat, Oct 29, 2016 at 1:51 PM, Alexei Starovoitov
> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> What's the use case for egress?
> >>
> >> We (android networking) are currently looking at implementing network
> >> accounting via eBPF in order to replace the out-of-tree xt_qtaguid
> >> code. A per-cgroup eBPF program run on all traffic would be great. But
> >> when we looked at this patchset we realized it would not be useful for
> >> accounting purposes because even if a packet is counted here, it might
> >> still be dropped by netfilter hooks.
> >
> > don't use the out-of-tree code and instead do the dropping with this
> > mechanism or any other in-kernel method? ;)
> 
> Getting rid of out-of-tree code is the goal, yes. We do have a
> requirement that things continue to work, though. Accounting for a
> packet in ip{,6}_output is not correct if that packet ends up being
> dropped by iptables later on.

understood.
it could be solved by swapping the order of cgroup_bpf_run_filter()
and NF_INET_POST_ROUTING in patch 5. It was proposed some time back, but
the current patch, I think, is more symmetrical.
cgroup+bpf runs after nf hook on rx and runs before it on tx.
imo it's more consistent.
Regardless of this choice... are you going to backport cgroupv2 to
android? Because this set is v2 only.

> > We (facebook infrastructure) have been using iptables and bpf networking
> > together with great success. They nicely co-exist and complement each other.
> > There is no need to reinvent the wheel if the existing solution works.
> > iptables are great for their purpose.
> 
> That doesn't really answer my "what is the use case for egress"
> question though, right? Or are you saying "we use this, but we can't
> talk about how we use it"?

If the question is "why is patch 4 alone not enough, and why is patch 5
needed?", then the answer is symmetrical access. Accounting for RX only
is a half-done job.

> > there is iptables+cBPF support. It's being used in some cases already.
> 
> Adding eBPF support to the xt_bpf iptables code would be an option for
> what we want to do, yes. I think this requires the eBPF map to be
> an fd that is available to the process that exec()s iptables, but we
> could do that.

yes. that's certainly doable, but sooner or later such an approach will
hit a scalability issue when the number of cgroups is large. Same issue
we saw with cls_bpf and bpf_skb_under_cgroup(). Hence this patch set,
which is centered around cgroups instead of hooks. Note that, unlike tc
and nf, there is no way to attach to a hook; the bpf program is attached
to a cgroup. It's an important distinction vs everything that currently
exists in the stack.
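
In userspace the attach step looks roughly like this (a sketch along
the lines of the example in patch 6/6; the union bpf_attr field names
follow this series' uapi proposal and may differ in the final version):

  /* Attach an already-loaded eBPF program (prog_fd) to a cgroup
   * using the BPF_PROG_ATTACH command added in patch 3/6. */
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static int attach_to_cgroup(int prog_fd, const char *cgroup_path)
  {
          union bpf_attr attr;
          int cg_fd, ret;

          cg_fd = open(cgroup_path, O_DIRECTORY | O_RDONLY);
          if (cg_fd < 0)
                  return -1;

          memset(&attr, 0, sizeof(attr));
          attr.target_fd     = cg_fd;     /* the cgroup to attach to */
          attr.attach_bpf_fd = prog_fd;   /* the program to run there */
          attr.attach_type   = BPF_CGROUP_INET_EGRESS;

          ret = syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
          close(cg_fd);
          return ret;
  }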

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
  2016-10-29  6:24                   ` Alexei Starovoitov
@ 2016-10-29 15:34                     ` Lorenzo Colitti
  2016-10-29 20:29                       ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Colitti @ 2016-10-29 15:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Mack, Pablo Neira Ayuso, htejun, Daniel Borkmann, ast,
	David Miller, kafai, Florian Westphal, harald, netdev,
	Sargun Dhillon, cgroups

On Sat, Oct 29, 2016 at 3:24 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> it could be solved by swapping the order of cgroup_bpf_run_filter()
> and NF_INET_POST_ROUTING in patch 5. It was proposed some time back, but
> the current patch, I think, is more symmetrical.
> cgroup+bpf runs after nf hook on rx and runs before it on tx.
> imo it's more consistent.

I guess what I was trying to say was: what does doing this filtering
in ip_output give you over running this from the netfilter hooks?
Doing this filtering in netfilter is much more general because there
can be complex rules both before and after the filtering is applied. I
hadn't thought of the scalability issue you note below though.

For accounting you probably want to run after the hooks, both for
ingress and for egress, because the hooks can do all sorts of stuff
like drop packets, change packet sizes, reroute them to different
interfaces, etc. Do you see use cases where you want to run before the
hooks?

> Regardless of this choice... are you going to backport cgroupv2 to
> android? Because this set is v2 only.

Certainly anything that can't easily be backported to, say,
android-4.4 is not really feasible in the short term. I don't think we
use network cgroups at all, so if v2 network cgroups can coexist with
v1 cgroups of other types (which what little I've read seems to
indicate) then that should be possible.

> yes. that's certainly doable, but sooner or later such an approach will
> hit a scalability issue when the number of cgroups is large. Same issue
> we saw with cls_bpf and bpf_skb_under_cgroup(). Hence this patch set,
> which is centered around cgroups instead of hooks. Note that, unlike tc
> and nf, there is no way to attach to a hook; the bpf program is attached
> to a cgroup. It's an important distinction vs everything that currently
> exists in the stack.

Ah, I see. Out of curiosity, what was the first scaling limitation you
hit? eBPF program length? eBPF map size?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
  2016-10-29 15:34                     ` Lorenzo Colitti
@ 2016-10-29 20:29                       ` Daniel Borkmann
  2016-11-01 15:25                         ` Lorenzo Colitti
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2016-10-29 20:29 UTC (permalink / raw)
  To: Lorenzo Colitti, Alexei Starovoitov
  Cc: Daniel Mack, Pablo Neira Ayuso, htejun, ast, David Miller, kafai,
	Florian Westphal, harald, netdev, Sargun Dhillon, cgroups

On 10/29/2016 05:34 PM, Lorenzo Colitti wrote:
> On Sat, Oct 29, 2016 at 3:24 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> it could be solved by swapping the order of cgroup_bpf_run_filter()
>> and NF_INET_POST_ROUTING in patch 5. It was proposed some time back, but
>> the current patch, I think, is more symmetrical.
>> cgroup+bpf runs after nf hook on rx and runs before it on tx.
>> imo it's more consistent.
>
> I guess what I was trying to say was: what does doing this filtering
> in ip_output give you over running this from the netfilter hooks?
> Doing this filtering in netfilter is much more general because there
> can be complex rules both before and after the filtering is applied. I
> hadn't thought of the scalability issue you note below though.
>
> For accounting you probably want to run after the hooks, both for
> ingress and for egress, because the hooks can do all sorts of stuff
> like drop packets, change packet sizes, reroute them to different
> interfaces, etc. Do you see use cases where you want to run before the
> hooks?

Fwiw, not sure if swapping brings much; even after netfilter there could
be complex processing that would potentially drop, mangle, redirect, etc.
from the tc layer (egress or from the qdisc itself). But also at even lower layers
(rather unlikely, but not impossible), for example in drivers or
shortly before passing the skb to them during segmentation (GSO), etc.
Eventually, to cover that you'd need to monitor various things, and the
cgroup one just sits at higher layers with different semantics.

>> Regardless of this choice... are you going to backport cgroupv2 to
>> android? Because this set is v2 only.
>
> Certainly anything that can't easily be backported to, say,
> android-4.4 is not really feasible in the short term. I don't think we
> use network cgroups at all, so if v2 network cgroups can coexist with
> v1 cgroups of other types (which what little I've read seems to
> indicate) then that should be possible.
>
>> yes. that's certainly doable, but sooner or later such an approach will
>> hit a scalability issue when the number of cgroups is large. Same issue
>> we saw with cls_bpf and bpf_skb_under_cgroup(). Hence this patch set,
>> which is centered around cgroups instead of hooks. Note that, unlike tc
>> and nf, there is no way to attach to a hook; the bpf program is attached
>> to a cgroup. It's an important distinction vs everything that currently
>> exists in the stack.
>
> Ah, I see. Out of curiosity, what was the first scaling limitation you
> hit? eBPF program length? eBPF map size?

The scalability issue is not really program length or map size from the
eBPF side in this context. For v1, you have the bpf_get_cgroup_classid()
helper available on egress (not ingress though), which can scale with a
larger number of cgroups since it works on the user-defined net_cls
tagging. For v2, bpf_skb_under_cgroup() was initially introduced, which
can only test whether the sk's v2 cgroup related to the skb is in the
sub-hierarchy of a specific cgroup that is provided via maps.
Effectively, when you have a larger number of v2 cgroups, that boolean
test will not scale and you need to test linearly through the various
cgroups. It's good enough when you need to special-case only a few
cgroups in the v2 hierarchy on egress. The idea was that attaching to
the cgroup itself would resolve this from a different angle for egress
and also ingress in a complementary way, but it also seems to open up
various other use cases at the same time, as seen from various patches
on the list.
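
In other words, with bpf_skb_under_cgroup() a program that has to match
against N cgroups degenerates into N probes per packet, roughly like
this (a sketch; map layout and verdict policy are illustrative):

  /* The pattern that does not scale: one boolean test per
   * CGROUP_ARRAY slot, so N cgroups of interest mean N probes
   * for every single packet. */
  #include <uapi/linux/bpf.h>
  #include <uapi/linux/pkt_cls.h>
  #include "bpf_helpers.h"

  #define MAX_CGROUPS 64

  struct bpf_map_def SEC("maps") cgroup_map = {
          .type        = BPF_MAP_TYPE_CGROUP_ARRAY,
          .key_size    = sizeof(__u32),
          .value_size  = sizeof(__u32),
          .max_entries = MAX_CGROUPS, /* filled with cgroup fds */
  };

  SEC("classifier")
  int match_cgroups(struct __sk_buff *skb)
  {
          int i;

  #pragma unroll
          for (i = 0; i < MAX_CGROUPS; i++)
                  if (bpf_skb_under_cgroup(skb, &cgroup_map, i) == 1)
                          return TC_ACT_OK;       /* found: accept */

          return TC_ACT_SHOT;     /* not under any listed cgroup */
  }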

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
  2016-10-25 10:14   ` [PATCH v7 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs Daniel Mack
@ 2016-10-31 16:40     ` David Miller
       [not found]       ` <20161031.124003.1361406552151798940.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2016-10-31 16:40 UTC (permalink / raw)
  To: daniel
  Cc: htejun, daniel, ast, kafai, fw, pablo, harald, netdev, sargun, cgroups

From: Daniel Mack <daniel@zonque.org>
Date: Tue, 25 Oct 2016 12:14:13 +0200

> @@ -312,6 +314,13 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>  	skb->dev = dev;
>  	skb->protocol = htons(ETH_P_IP);
>  
> +	ret = cgroup_bpf_run_filter(sk_to_full_sk(sk), skb,
> +				    BPF_CGROUP_INET_EGRESS);
> +	if (ret) {
> +		kfree_skb(skb);
> +		return ret;
> +	}
> +
>  	/*
>  	 *	Multicasts are looped back for other local users
>  	 */
> @@ -364,12 +373,20 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>  int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
>  	struct net_device *dev = skb_dst(skb)->dev;
> +	int ret;
>  
>  	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
>  
>  	skb->dev = dev;
>  	skb->protocol = htons(ETH_P_IP);
>  
> +	ret = cgroup_bpf_run_filter(sk_to_full_sk(sk), skb,
> +				    BPF_CGROUP_INET_EGRESS);
> +	if (ret) {
> +		kfree_skb(skb);
> +		return ret;
> +	}
> +
>  	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
>  			    net, sk, skb, NULL, dev,
>  			    ip_finish_output,

The "sk" here is not necessarily the application socket.  It could be
a UDP tunnel socket or similar encapsulation object.

"skb->sk" is always the application socket, so is probably what you
need to pass down into the cgroup bpf run filter hook.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
  2016-10-29 20:29                       ` Daniel Borkmann
@ 2016-11-01 15:25                         ` Lorenzo Colitti
       [not found]                           ` <CAKD1Yr02SCHvd-xZJL14d_Ta8Dk4evHZ60zytpU0h4r80FucwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Colitti @ 2016-11-01 15:25 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Daniel Mack, Pablo Neira Ayuso, htejun, ast,
	David Miller, kafai, Florian Westphal, harald, netdev,
	Sargun Dhillon, cgroups

On Sun, Oct 30, 2016 at 5:29 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> Fwiw, not sure if swapping brings much; even after netfilter there could
> be complex processing that would potentially drop, mangle, redirect, etc.
> from the tc layer (egress or from the qdisc itself). But also at even lower layers

I agree lots of stuff can still happen after the netfilter hooks have
finished. But it seems to me that netfilter hooks are more flexible
and have access to more data (L2 headers, output devices,
conntrack state entries, etc. etc.) than this hook. So it seems to me
that this hook would be more useful if it ran after netfilter on
egress.

That way, if you want to modify the packet or do something
sophisticated in netfilter, you can still use the eBPF hook on the
results of that operation, and if you don't want to run netfilter, you
can write netfilter rules to skip the packet (and maybe still fix it
up later, perhaps in another netfilter chain).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
       [not found]                           ` <CAKD1Yr02SCHvd-xZJL14d_Ta8Dk4evHZ60zytpU0h4r80FucwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-11-01 15:38                             ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2016-11-01 15:38 UTC (permalink / raw)
  To: lorenzo-hpIqsD4AKlfQT0dZR+AlfA
  Cc: daniel-FeC+5ew28dpmcu3hnIyYJQ,
	alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w,
	daniel-cYrQPVfZoowdnm+yROfE0A, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	htejun-b10kYP2dOMg, ast-b10kYP2dOMg, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA

From: Lorenzo Colitti <lorenzo-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Date: Wed, 2 Nov 2016 00:25:15 +0900

> That way, if you want to modify the packet or do something
> sophisticated in netfilter, you can still use the eBPF hook on the
> results of that operation, and if you don't want to run netfilter, you
> can write netfilter rules to skip the packet (and maybe still fix it
> up later, perhaps in another netfilter chain).

The downside is that we classify the packet twice.  This transactional
cost adds up rather quickly.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
       [not found]       ` <20161031.124003.1361406552151798940.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
@ 2016-11-02  1:17         ` Daniel Borkmann
  0 siblings, 0 replies; 25+ messages in thread
From: Daniel Borkmann @ 2016-11-02  1:17 UTC (permalink / raw)
  To: David Miller, daniel-cYrQPVfZoowdnm+yROfE0A
  Cc: htejun-b10kYP2dOMg, ast-b10kYP2dOMg, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA

On 10/31/2016 05:40 PM, David Miller wrote:
> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> Date: Tue, 25 Oct 2016 12:14:13 +0200
>
>> @@ -312,6 +314,13 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>>   	skb->dev = dev;
>>   	skb->protocol = htons(ETH_P_IP);
>>
>> +	ret = cgroup_bpf_run_filter(sk_to_full_sk(sk), skb,
>> +				    BPF_CGROUP_INET_EGRESS);
>> +	if (ret) {
>> +		kfree_skb(skb);
>> +		return ret;
>> +	}
>> +
>>   	/*
>>   	 *	Multicasts are looped back for other local users
>>   	 */
>> @@ -364,12 +373,20 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>>   int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>>   {
>>   	struct net_device *dev = skb_dst(skb)->dev;
>> +	int ret;
>>
>>   	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
>>
>>   	skb->dev = dev;
>>   	skb->protocol = htons(ETH_P_IP);
>>
>> +	ret = cgroup_bpf_run_filter(sk_to_full_sk(sk), skb,
>> +				    BPF_CGROUP_INET_EGRESS);
>> +	if (ret) {
>> +		kfree_skb(skb);
>> +		return ret;
>> +	}
>> +
>>   	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
>>   			    net, sk, skb, NULL, dev,
>>   			    ip_finish_output,
>
> The "sk" here is not necessarily the application socket.  It could be
> a UDP tunnel socket or similar encapsulation object.
>
> "skb->sk" is always the application socket, so is probably what you
> need to pass down into the cgroup bpf run filter hook.

Wouldn't that mean, however, that when you go through stacked devices
you'd run the same eBPF cgroup program for skb->sk multiple times?
(Except for things like crossing netns due to the skb_orphan() there.)

For example, when you're doing accounting through this facility, then
the TX part is counted X times instead of ideally just once, and for RX,
afaik, we seem to short-cut sk_filter_trim_cap() on the encaps anyway.

I guess one could work around this by marking the skb with a verdict
once it was accounted for the first time, either from within the eBPF
prog or from the stack side. For eBPF we'd need to implement our own
.is_valid_access() callback for the verifier to allow writing into skb
metadata like skb->mark (an sk filter can only write into cb[] for
passing data between tail calls), plus making sure in either case that
no other user is mangling that flag further on; hmm, probably rather
unintuitive and fragile.

Unless I'm missing something obvious, I would currently think that just
passing "sk" might be more correct.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2016-11-02  1:17 UTC | newest]

Thread overview: 25+ messages
2016-10-25 10:14 [PATCH v7 0/6] Add eBPF hooks for cgroups Daniel Mack
2016-10-25 10:14 ` [PATCH v7 1/6] bpf: add new prog type for cgroup socket filtering Daniel Mack
2016-10-25 10:14 ` [PATCH v7 4/6] net: filter: run cgroup eBPF ingress programs Daniel Mack
     [not found] ` <1477390454-12553-1-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
2016-10-25 10:14   ` [PATCH v7 2/6] cgroup: add support for eBPF programs Daniel Mack
2016-10-25 10:14   ` [PATCH v7 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
2016-10-25 10:14   ` [PATCH v7 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs Daniel Mack
2016-10-31 16:40     ` David Miller
     [not found]       ` <20161031.124003.1361406552151798940.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2016-11-02  1:17         ` Daniel Borkmann
2016-10-25 10:14   ` [PATCH v7 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups Daniel Mack
2016-10-26 19:59 ` [PATCH v7 0/6] Add eBPF hooks for cgroups Pablo Neira Ayuso
2016-10-27  3:35   ` Alexei Starovoitov
     [not found]     ` <20161027033502.GA43960-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
2016-10-28 11:28       ` Pablo Neira Ayuso
2016-10-28 15:00         ` David Ahern
2016-10-29  1:42         ` Alexei Starovoitov
2016-10-27  8:40   ` Daniel Mack
     [not found]     ` <c9683122-d770-355b-e275-7c446e6d1d0f-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
2016-10-28 11:53       ` Pablo Neira Ayuso
2016-10-28 12:07         ` Daniel Mack
2016-10-29  3:51       ` Lorenzo Colitti
     [not found]         ` <CAKD1Yr2aRDNUxX8onReZyURufphxGoSTek=Fjk3Wswq9WOVp4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-10-29  4:51           ` Alexei Starovoitov
     [not found]             ` <20161029045107.GA61294-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
2016-10-29  4:59               ` Lorenzo Colitti
     [not found]                 ` <CAKD1Yr2pMk52h7BdRwTvGwnP5+ONmr4ac6cyUBoZ9P+Kt-B8jw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-10-29  6:24                   ` Alexei Starovoitov
2016-10-29 15:34                     ` Lorenzo Colitti
2016-10-29 20:29                       ` Daniel Borkmann
2016-11-01 15:25                         ` Lorenzo Colitti
     [not found]                           ` <CAKD1Yr02SCHvd-xZJL14d_Ta8Dk4evHZ60zytpU0h4r80FucwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-11-01 15:38                             ` David Miller
