linux-kernel.vger.kernel.org archive mirror
* [PATCH net-next 0/4] BPF updates
@ 2015-10-16  1:09 Daniel Borkmann
  2015-10-16  1:09 ` [PATCH net-next 1/4] bpf: abstract anon_inode_getfd invocations Daniel Borkmann
                   ` (4 more replies)
  0 siblings, 5 replies; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-16  1:09 UTC (permalink / raw)
  To: davem
  Cc: ast, viro, ebiederm, tgraf, hannes, netdev, linux-kernel,
	Daniel Borkmann

This set adds support for persistent maps/progs. Please see
individual patches for further details.

A man-page update to bpf(2) will be sent afterwards, as will an
iproute2 patch adding support in tc.

Thanks!

Daniel Borkmann (4):
  bpf: abstract anon_inode_getfd invocations
  bpf: align and clean bpf_{map,prog}_get helpers
  bpf: add support for persistent maps/progs
  bpf: add sample usages for persistent maps/progs

 include/linux/bpf.h        |  23 +-
 include/uapi/linux/bpf.h   |  45 +---
 include/uapi/linux/magic.h |   1 +
 include/uapi/linux/xattr.h |   3 +
 kernel/bpf/Makefile        |   4 +-
 kernel/bpf/inode.c         | 614 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c       | 164 ++++++++++--
 kernel/bpf/verifier.c      |   3 +-
 samples/bpf/Makefile       |   2 +
 samples/bpf/fds_example.c  | 224 +++++++++++++++++
 samples/bpf/libbpf.c       |  20 ++
 samples/bpf/libbpf.h       |   3 +
 12 files changed, 1045 insertions(+), 61 deletions(-)
 create mode 100644 kernel/bpf/inode.c
 create mode 100644 samples/bpf/fds_example.c

-- 
1.9.3


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH net-next 1/4] bpf: abstract anon_inode_getfd invocations
  2015-10-16  1:09 [PATCH net-next 0/4] BPF updates Daniel Borkmann
@ 2015-10-16  1:09 ` Daniel Borkmann
  2015-10-16  1:09 ` [PATCH net-next 2/4] bpf: align and clean bpf_{map,prog}_get helpers Daniel Borkmann
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-16  1:09 UTC (permalink / raw)
  To: davem
  Cc: ast, viro, ebiederm, tgraf, hannes, netdev, linux-kernel,
	Daniel Borkmann

Since we're going to use anon_inode_getfd() invocations in more than just
the current places, make a helper function for both, so that we only need
to pass a map/prog pointer to the helper itself in order to get a fd. The
new helpers are called bpf_map_new_fd() and bpf_prog_new_fd().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/syscall.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 687dd6c..2b89ef0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -111,6 +111,12 @@ static const struct file_operations bpf_map_fops = {
 	.release = bpf_map_release,
 };
 
+static int bpf_map_new_fd(struct bpf_map *map)
+{
+	return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
+				O_RDWR | O_CLOEXEC);
+}
+
 /* helper macro to check that unused fields 'union bpf_attr' are zero */
 #define CHECK_ATTR(CMD) \
 	memchr_inv((void *) &attr->CMD##_LAST_FIELD + \
@@ -141,8 +147,7 @@ static int map_create(union bpf_attr *attr)
 	if (err)
 		goto free_map;
 
-	err = anon_inode_getfd("bpf-map", &bpf_map_fops, map, O_RDWR | O_CLOEXEC);
-
+	err = bpf_map_new_fd(map);
 	if (err < 0)
 		/* failed to allocate fd */
 		goto free_map;
@@ -538,6 +543,12 @@ static const struct file_operations bpf_prog_fops = {
         .release = bpf_prog_release,
 };
 
+static int bpf_prog_new_fd(struct bpf_prog *prog)
+{
+	return anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog,
+				O_RDWR | O_CLOEXEC);
+}
+
 static struct bpf_prog *get_prog(struct fd f)
 {
 	struct bpf_prog *prog;
@@ -647,7 +658,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 	if (err < 0)
 		goto free_used_maps;
 
-	err = anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog, O_RDWR | O_CLOEXEC);
+	err = bpf_prog_new_fd(prog);
 	if (err < 0)
 		/* failed to allocate fd */
 		goto free_used_maps;
-- 
1.9.3



* [PATCH net-next 2/4] bpf: align and clean bpf_{map,prog}_get helpers
  2015-10-16  1:09 [PATCH net-next 0/4] BPF updates Daniel Borkmann
  2015-10-16  1:09 ` [PATCH net-next 1/4] bpf: abstract anon_inode_getfd invocations Daniel Borkmann
@ 2015-10-16  1:09 ` Daniel Borkmann
  2015-10-16  1:09 ` [PATCH net-next 3/4] bpf: add support for persistent maps/progs Daniel Borkmann
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-16  1:09 UTC (permalink / raw)
  To: davem
  Cc: ast, viro, ebiederm, tgraf, hannes, netdev, linux-kernel,
	Daniel Borkmann

Add a bpf_map_get() function that we're going to use later on and
align/clean the remaining helpers a bit so that they are more
consistent:

  - __bpf_map_get() and __bpf_prog_get() that both work on the fd
    struct, check whether the descriptor is eBPF and return the
    pointer to the map/prog stored in the private data.

    Also, we can return f.file->private_data directly, the function
    signature is enough of a documentation already.

  - bpf_map_get() and bpf_prog_get() that both work on u32 user fd,
    call their respective __bpf_map_get()/__bpf_prog_get() variants,
    and take a reference.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h   |  2 +-
 kernel/bpf/syscall.c  | 41 +++++++++++++++++++++++------------------
 kernel/bpf/verifier.c |  3 +--
 3 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e3a51b7..0ae6f77 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -167,7 +167,7 @@ struct bpf_prog *bpf_prog_get(u32 ufd);
 void bpf_prog_put(struct bpf_prog *prog);
 void bpf_prog_put_rcu(struct bpf_prog *prog);
 
-struct bpf_map *bpf_map_get(struct fd f);
+struct bpf_map *__bpf_map_get(struct fd f);
 void bpf_map_put(struct bpf_map *map);
 
 extern int sysctl_unprivileged_bpf_disabled;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2b89ef0..3fff82c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -162,19 +162,29 @@ free_map:
 /* if error is returned, fd is released.
  * On success caller should complete fd access with matching fdput()
  */
-struct bpf_map *bpf_map_get(struct fd f)
+struct bpf_map *__bpf_map_get(struct fd f)
 {
-	struct bpf_map *map;
-
 	if (!f.file)
 		return ERR_PTR(-EBADF);
-
 	if (f.file->f_op != &bpf_map_fops) {
 		fdput(f);
 		return ERR_PTR(-EINVAL);
 	}
 
-	map = f.file->private_data;
+	return f.file->private_data;
+}
+
+static struct bpf_map *bpf_map_get(u32 ufd)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_map *map;
+
+	map = __bpf_map_get(f);
+	if (IS_ERR(map))
+		return map;
+
+	atomic_inc(&map->refcnt);
+	fdput(f);
 
 	return map;
 }
@@ -202,7 +212,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 		return -EINVAL;
 
 	f = fdget(ufd);
-	map = bpf_map_get(f);
+	map = __bpf_map_get(f);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
 
@@ -261,7 +271,7 @@ static int map_update_elem(union bpf_attr *attr)
 		return -EINVAL;
 
 	f = fdget(ufd);
-	map = bpf_map_get(f);
+	map = __bpf_map_get(f);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
 
@@ -314,7 +324,7 @@ static int map_delete_elem(union bpf_attr *attr)
 		return -EINVAL;
 
 	f = fdget(ufd);
-	map = bpf_map_get(f);
+	map = __bpf_map_get(f);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
 
@@ -355,7 +365,7 @@ static int map_get_next_key(union bpf_attr *attr)
 		return -EINVAL;
 
 	f = fdget(ufd);
-	map = bpf_map_get(f);
+	map = __bpf_map_get(f);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
 
@@ -549,21 +559,16 @@ static int bpf_prog_new_fd(struct bpf_prog *prog)
 				O_RDWR | O_CLOEXEC);
 }
 
-static struct bpf_prog *get_prog(struct fd f)
+static struct bpf_prog *__bpf_prog_get(struct fd f)
 {
-	struct bpf_prog *prog;
-
 	if (!f.file)
 		return ERR_PTR(-EBADF);
-
 	if (f.file->f_op != &bpf_prog_fops) {
 		fdput(f);
 		return ERR_PTR(-EINVAL);
 	}
 
-	prog = f.file->private_data;
-
-	return prog;
+	return f.file->private_data;
 }
 
 /* called by sockets/tracing/seccomp before attaching program to an event
@@ -574,13 +579,13 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
 	struct fd f = fdget(ufd);
 	struct bpf_prog *prog;
 
-	prog = get_prog(f);
-
+	prog = __bpf_prog_get(f);
 	if (IS_ERR(prog))
 		return prog;
 
 	atomic_inc(&prog->aux->refcnt);
 	fdput(f);
+
 	return prog;
 }
 EXPORT_SYMBOL_GPL(bpf_prog_get);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1d6b97b..735999a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1988,8 +1988,7 @@ static int replace_map_fd_with_map_ptr(struct verifier_env *env)
 			}
 
 			f = fdget(insn->imm);
-
-			map = bpf_map_get(f);
+			map = __bpf_map_get(f);
 			if (IS_ERR(map)) {
 				verbose("fd %d is not pointing to valid bpf_map\n",
 					insn->imm);
-- 
1.9.3



* [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16  1:09 [PATCH net-next 0/4] BPF updates Daniel Borkmann
  2015-10-16  1:09 ` [PATCH net-next 1/4] bpf: abstract anon_inode_getfd invocations Daniel Borkmann
  2015-10-16  1:09 ` [PATCH net-next 2/4] bpf: align and clean bpf_{map,prog}_get helpers Daniel Borkmann
@ 2015-10-16  1:09 ` Daniel Borkmann
  2015-10-16 10:25   ` Hannes Frederic Sowa
  2015-10-16 17:21   ` Hannes Frederic Sowa
  2015-10-16  1:09 ` [PATCH net-next 4/4] bpf: add sample usages " Daniel Borkmann
  2015-10-19  2:53 ` [PATCH net-next 0/4] BPF updates David Miller
  4 siblings, 2 replies; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-16  1:09 UTC (permalink / raw)
  To: davem
  Cc: ast, viro, ebiederm, tgraf, hannes, netdev, linux-kernel,
	Daniel Borkmann, Alexei Starovoitov

This work adds support for "persistent" eBPF maps/programs. "Persistent"
here means that maps/programs have a facility that lets them survive
process termination. This is desired by various
eBPF subsystem users.

Just to name one example: tc classifier/action. Whenever tc parses
the ELF object, extracts and loads maps/progs into the kernel, these
file descriptors will be out of reach after the tc instance exits.
So a subsequent tc invocation won't be able to access/relocate on this
resource, and therefore maps cannot easily be shared, f.e. between the
ingress and egress networking data path.

The current workaround is that Unix domain sockets (UDS) need to be
instrumented in order to pass the created eBPF map/program file
descriptors to a third party management daemon through UDS' socket
passing facility. This makes it a bit complicated to deploy shared
eBPF maps or programs (programs f.e. for tail calls) among various
processes.

We've been brainstorming on how we could tackle this issue and various
approaches have been tried out so far:

The first idea was to implement a fuse backend that proxies bpf(2)
syscalls that create/load maps and programs, where the fuse
implementation would hold these fds internally and pass them to
things like tc via UDS socket passing. There could be various fuse
implementations tailored to a particular eBPF subsystem's needs. The
advantage is that this would shift the complexity entirely into user
space, but with a couple of drawbacks along the way. One being that
fuse needs extra library dependencies and it also doesn't resolve the
issue that an extra daemon needs to run in the background. At Linux
Plumbers 2015, we all eventually concluded that using fuse is not
an option and that an in-kernel solution is needed.

The next idea I've tried out was an extension to the bpf(2) syscall
that works roughly the way we bind(2)/connect(2) to paths backed
by special inodes in the case of UDS. This works on top of any file
system that allows creating special files, where the newly created inode
operations are similar to those of S_IFSOCK inodes. The inode would
serve as a lookup key in an rhashtable backend, with the
prog/map stored as value. We found that there are a couple of
disadvantages on this approach. Since the underlying implementation
of the inode can differ among file systems, we need a periodic garbage
collection for dropping the rhashtable entry and the references to
the maps/progs held with it (it could be done by piggybacking into
the rhashtable's rehashing, though). The other issue is that this
requires adding something like S_IFBPF (to clearly identify this inode
as special to eBPF), where the available space in S_IFMT is already
severely limited and could possibly clash with future POSIX values being
allocated there. Moreover, this approach might not be flexible enough
from a functionality point of view, f.e. things like future debugging
facilities, etc could be added that really wouldn't fit into bpf(2)
syscall (however, the bpf(2) syscall strictly stays the central place
to manage eBPF things).

This eventually leads us to this patch, which implements a minimal
eBPF file system. The idea is similar, but here the special inodes
reside at one or multiple mount points. A directory
hierarchy can be tailored to a specific application use-case from the
various subsystem users and maps/progs pinned inside it. Two new eBPF
commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
order to create one or multiple special inodes from an existing file
descriptor that points to a map/program (we call it eBPF fd pinning),
or to create a new file descriptor from an existing special inode.
BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
can also be used unprivileged, given appropriate permissions on
the path.

The next step I'm working on is to add commands to bpf(2) for dumping
an eBPF map/prog, so that a specification can be retrieved from a given
file descriptor. This can be used by things like CRIU, but applications
can also inspect the meta data after calling BPF_NEW_FD.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h        |  21 ++
 include/uapi/linux/bpf.h   |  45 +---
 include/uapi/linux/magic.h |   1 +
 include/uapi/linux/xattr.h |   3 +
 kernel/bpf/Makefile        |   4 +-
 kernel/bpf/inode.c         | 614 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c       | 108 ++++++++
 7 files changed, 758 insertions(+), 38 deletions(-)
 create mode 100644 kernel/bpf/inode.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0ae6f77..bb82764 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -8,8 +8,10 @@
 #define _LINUX_BPF_H 1
 
 #include <uapi/linux/bpf.h>
+
 #include <linux/workqueue.h>
 #include <linux/file.h>
+#include <linux/fs.h>
 
 struct bpf_map;
 
@@ -137,6 +139,17 @@ struct bpf_prog_aux {
 	};
 };
 
+enum bpf_fd_type {
+	BPF_FD_TYPE_PROG,
+	BPF_FD_TYPE_MAP,
+};
+
+union bpf_any {
+	struct bpf_map *map;
+	struct bpf_prog *prog;
+	void *raw_ptr;
+};
+
 struct bpf_array {
 	struct bpf_map map;
 	u32 elem_size;
@@ -172,6 +185,14 @@ void bpf_map_put(struct bpf_map *map);
 
 extern int sysctl_unprivileged_bpf_disabled;
 
+void bpf_any_get(union bpf_any raw, enum bpf_fd_type type);
+void bpf_any_put(union bpf_any raw, enum bpf_fd_type type);
+
+int bpf_fd_inode_add(const struct filename *pathname,
+		     union bpf_any raw, enum bpf_fd_type type);
+union bpf_any bpf_fd_inode_get(const struct filename *pathname,
+			       enum bpf_fd_type *type);
+
 /* verify correctness of eBPF program */
 int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
 #else
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 564f1f0..f9b412c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,50 +63,16 @@ struct bpf_insn {
 	__s32	imm;		/* signed immediate constant */
 };
 
-/* BPF syscall commands */
+/* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
-	/* create a map with given type and attributes
-	 * fd = bpf(BPF_MAP_CREATE, union bpf_attr *, u32 size)
-	 * returns fd or negative error
-	 * map is deleted when fd is closed
-	 */
 	BPF_MAP_CREATE,
-
-	/* lookup key in a given map
-	 * err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value
-	 * returns zero and stores found elem into value
-	 * or negative error
-	 */
 	BPF_MAP_LOOKUP_ELEM,
-
-	/* create or update key/value pair in a given map
-	 * err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value, attr->flags
-	 * returns zero or negative error
-	 */
 	BPF_MAP_UPDATE_ELEM,
-
-	/* find and delete elem by key in a given map
-	 * err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key
-	 * returns zero or negative error
-	 */
 	BPF_MAP_DELETE_ELEM,
-
-	/* lookup key in a given map and return next key
-	 * err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->next_key
-	 * returns zero and stores next key or negative error
-	 */
 	BPF_MAP_GET_NEXT_KEY,
-
-	/* verify and load eBPF program
-	 * prog_fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size)
-	 * Using attr->prog_type, attr->insns, attr->license
-	 * returns fd or negative error
-	 */
 	BPF_PROG_LOAD,
+	BPF_PIN_FD,
+	BPF_NEW_FD,
 };
 
 enum bpf_map_type {
@@ -160,6 +126,11 @@ union bpf_attr {
 		__aligned_u64	log_buf;	/* user supplied buffer */
 		__u32		kern_version;	/* checked when prog_type=kprobe */
 	};
+
+	struct { /* anonymous struct used by BPF_{PIN,NEW}_FD command */
+		__u32		fd;
+		__aligned_u64	pathname;
+	};
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 7b1425a..c1c5cf7 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -75,5 +75,6 @@
 #define ANON_INODE_FS_MAGIC	0x09041934
 #define BTRFS_TEST_MAGIC	0x73727279
 #define NSFS_MAGIC		0x6e736673
+#define BPFFS_MAGIC		0xcafe4a11
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
index 1590c49..3586d28 100644
--- a/include/uapi/linux/xattr.h
+++ b/include/uapi/linux/xattr.h
@@ -42,6 +42,9 @@
 #define XATTR_USER_PREFIX "user."
 #define XATTR_USER_PREFIX_LEN (sizeof(XATTR_USER_PREFIX) - 1)
 
+#define XATTR_BPF_PREFIX "bpf."
+#define XATTR_BPF_PREFIX_LEN (sizeof(XATTR_BPF_PREFIX) - 1)
+
 /* Security namespace */
 #define XATTR_EVM_SUFFIX "evm"
 #define XATTR_NAME_EVM XATTR_SECURITY_PREFIX XATTR_EVM_SUFFIX
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e6983be..1327258 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,2 +1,4 @@
 obj-y := core.o
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o arraymap.o helpers.o
+
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
new file mode 100644
index 0000000..5cef673
--- /dev/null
+++ b/kernel/bpf/inode.c
@@ -0,0 +1,614 @@
+/*
+ * Minimal file system backend for special inodes holding eBPF maps and
+ * programs, used by eBPF fd pinning.
+ *
+ * (C) 2015 Daniel Borkmann <daniel@iogearbox.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/cred.h>
+#include <linux/parser.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/seq_file.h>
+#include <linux/fsnotify.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+#include <linux/security.h>
+#include <linux/xattr.h>
+
+#define BPFFS_DEFAULT_MODE 0700
+
+enum {
+	BPF_OPT_UID,
+	BPF_OPT_GID,
+	BPF_OPT_MODE,
+	BPF_OPT_ERR,
+};
+
+struct bpf_mnt_opts {
+	kuid_t uid;
+	kgid_t gid;
+	umode_t mode;
+};
+
+struct bpf_fs_info {
+	struct bpf_mnt_opts mnt_opts;
+};
+
+struct bpf_dir_state {
+	unsigned long flags;
+};
+
+static const match_table_t bpf_tokens = {
+	{ BPF_OPT_UID, "uid=%u" },
+	{ BPF_OPT_GID, "gid=%u" },
+	{ BPF_OPT_MODE, "mode=%o" },
+	{ BPF_OPT_ERR, NULL },
+};
+
+static const struct inode_operations bpf_dir_iops;
+static const struct inode_operations bpf_prog_iops;
+static const struct inode_operations bpf_map_iops;
+
+static struct inode *bpf_get_inode(struct super_block *sb,
+				   const struct inode *dir,
+				   umode_t mode)
+{
+	struct inode *inode = new_inode(sb);
+
+	if (!inode)
+		return ERR_PTR(-ENOSPC);
+
+	inode->i_ino = get_next_ino();
+	inode->i_atime = CURRENT_TIME;
+	inode->i_mtime = inode->i_atime;
+	inode->i_ctime = inode->i_atime;
+	inode_init_owner(inode, dir, mode);
+
+	return inode;
+}
+
+static struct inode *bpf_mknod(struct inode *dir, umode_t mode)
+{
+	return bpf_get_inode(dir->i_sb, dir, mode);
+}
+
+static bool bpf_dentry_name_reserved(const struct dentry *dentry)
+{
+	return strchr(dentry->d_name.name, '.');
+}
+
+enum {
+	/* Directory state is 'terminating', so no subdirectories
+	 * are allowed anymore in this directory. This is being
+	 * reserved so that in future, auto-generated directories
+	 * could be added along with the special map/prog inodes.
+	 */
+	BPF_DSTATE_TERM_BIT,
+};
+
+static bool bpf_inode_is_term(struct inode *dir)
+{
+	struct bpf_dir_state *state = dir->i_private;
+
+	return test_bit(BPF_DSTATE_TERM_BIT, &state->flags);
+}
+
+static bool bpf_inode_make_term(struct inode *dir)
+{
+	struct bpf_dir_state *state = dir->i_private;
+
+	return dir->i_nlink != 2 ||
+	       test_and_set_bit(BPF_DSTATE_TERM_BIT, &state->flags);
+}
+
+static void bpf_inode_undo_term(struct inode *dir)
+{
+	struct bpf_dir_state *state = dir->i_private;
+
+	clear_bit(BPF_DSTATE_TERM_BIT, &state->flags);
+}
+
+static int bpf_inode_type(const struct inode *inode, enum bpf_fd_type *i_type)
+{
+	if (inode->i_op == &bpf_prog_iops)
+		*i_type = BPF_FD_TYPE_PROG;
+	else if (inode->i_op == &bpf_map_iops)
+		*i_type = BPF_FD_TYPE_MAP;
+	else
+		return -EACCES;
+
+	return 0;
+}
+
+static int bpf_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = d_inode(dentry);
+	void *i_private = inode->i_private;
+	enum bpf_fd_type type;
+	bool is_fd, drop_ref;
+	int ret;
+
+	is_fd = bpf_inode_type(inode, &type) == 0;
+	drop_ref = inode->i_nlink == 1;
+
+	ret = simple_unlink(dir, dentry);
+	if (!ret && is_fd && drop_ref) {
+		union bpf_any raw;
+
+		raw.raw_ptr = i_private;
+		bpf_any_put(raw, type);
+		bpf_inode_undo_term(dir);
+	}
+
+	return ret;
+}
+
+static int bpf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct bpf_dir_state *state;
+	struct inode *inode;
+
+	if (bpf_inode_is_term(dir))
+		return -EPERM;
+	if (bpf_dentry_name_reserved(dentry))
+		return -EPERM;
+
+	state = kzalloc(sizeof(*state), GFP_KERNEL);
+	if (!state)
+		return -ENOSPC;
+
+	inode = bpf_mknod(dir, dir->i_mode);
+	if (IS_ERR(inode)) {
+		kfree(state);
+		return PTR_ERR(inode);
+	}
+
+	inode->i_private = state;
+	inode->i_op = &bpf_dir_iops;
+	inode->i_fop = &simple_dir_operations;
+
+	inc_nlink(inode);
+	inc_nlink(dir);
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+
+	return 0;
+}
+
+static int bpf_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = d_inode(dentry);
+	void *i_private = inode->i_private;
+	int ret;
+
+	ret = simple_rmdir(dir, dentry);
+	if (!ret)
+		kfree(i_private);
+
+	return ret;
+}
+
+static const struct inode_operations bpf_dir_iops = {
+	.lookup		= simple_lookup,
+	.mkdir		= bpf_mkdir,
+	.rmdir		= bpf_rmdir,
+	.unlink		= bpf_unlink,
+};
+
+#define XATTR_TYPE_SUFFIX "type"
+
+#define XATTR_NAME_BPF_TYPE (XATTR_BPF_PREFIX XATTR_TYPE_SUFFIX)
+#define XATTR_NAME_BPF_TYPE_LEN (sizeof(XATTR_NAME_BPF_TYPE) - 1)
+
+#define XATTR_VALUE_MAP "map"
+#define XATTR_VALUE_PROG "prog"
+
+static ssize_t bpf_getxattr(struct dentry *dentry, const char *name,
+			    void *buffer, size_t size)
+{
+	enum bpf_fd_type type;
+	ssize_t ret;
+
+	if (strncmp(name, XATTR_NAME_BPF_TYPE, XATTR_NAME_BPF_TYPE_LEN))
+		return -ENODATA;
+
+	if (bpf_inode_type(d_inode(dentry), &type))
+		return -ENODATA;
+
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		ret = sizeof(XATTR_VALUE_PROG);
+		break;
+	case BPF_FD_TYPE_MAP:
+		ret = sizeof(XATTR_VALUE_MAP);
+		break;
+	}
+
+	if (buffer) {
+		if (size < ret)
+			return -ERANGE;
+
+		switch (type) {
+		case BPF_FD_TYPE_PROG:
+			strncpy(buffer, XATTR_VALUE_PROG, ret);
+			break;
+		case BPF_FD_TYPE_MAP:
+			strncpy(buffer, XATTR_VALUE_MAP, ret);
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static ssize_t bpf_listxattr(struct dentry *dentry, char *buffer, size_t size)
+{
+	ssize_t len, used = 0;
+
+	len = security_inode_listsecurity(d_inode(dentry), buffer, size);
+	if (len < 0)
+		return len;
+
+	used += len;
+	if (buffer) {
+		if (size < used)
+			return -ERANGE;
+
+		buffer += len;
+	}
+
+	len = XATTR_NAME_BPF_TYPE_LEN + 1;
+	used += len;
+	if (buffer) {
+		if (size < used)
+			return -ERANGE;
+
+		memcpy(buffer, XATTR_NAME_BPF_TYPE, len);
+		buffer += len;
+	}
+
+	return used;
+}
+
+/* Special inodes handling map/progs currently don't allow for syscalls
+ * such as open/read/write/etc. We use the same bpf_{map,prog}_new_fd()
+ * facility for installing an fd to the user as we do on BPF_MAP_CREATE
+ * and BPF_PROG_LOAD, so applications using bpf(2) don't see any
+ * change in behaviour. In future, the set of open/read/write/etc could
+ * be used f.e. for implementing things like debugging facilities on the
+ * underlying map/prog that would work with non-bpf(2) aware tooling.
+ */
+static const struct inode_operations bpf_prog_iops = {
+	.getxattr	= bpf_getxattr,
+	.listxattr	= bpf_listxattr,
+};
+
+static const struct inode_operations bpf_map_iops = {
+	.getxattr	= bpf_getxattr,
+	.listxattr	= bpf_listxattr,
+};
+
+static int bpf_mkmap(struct inode *dir, struct dentry *dentry,
+		     struct bpf_map *map, umode_t i_mode)
+{
+	struct inode *inode;
+
+	if (bpf_dentry_name_reserved(dentry))
+		return -EPERM;
+	if (bpf_inode_make_term(dir))
+		return -EBUSY;
+
+	inode = bpf_mknod(dir, i_mode);
+	if (IS_ERR(inode)) {
+		bpf_inode_undo_term(dir);
+		return PTR_ERR(inode);
+	}
+
+	inode->i_private = map;
+	inode->i_op = &bpf_map_iops;
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+
+	return 0;
+}
+
+static int bpf_mkprog(struct inode *dir, struct dentry *dentry,
+		      struct bpf_prog *prog, umode_t i_mode)
+{
+	struct inode *inode;
+
+	if (bpf_dentry_name_reserved(dentry))
+		return -EPERM;
+	if (bpf_inode_make_term(dir))
+		return -EBUSY;
+
+	inode = bpf_mknod(dir, i_mode);
+	if (IS_ERR(inode)) {
+		bpf_inode_undo_term(dir);
+		return PTR_ERR(inode);
+	}
+
+	inode->i_private = prog;
+	inode->i_op = &bpf_prog_iops;
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+
+	return 0;
+}
+
+static const struct bpf_mnt_opts *bpf_sb_mnt_opts(const struct super_block *sb)
+{
+	const struct bpf_fs_info *bfi = sb->s_fs_info;
+
+	return &bfi->mnt_opts;
+}
+
+static int bpf_parse_options(char *opt_data, struct bpf_mnt_opts *opts)
+{
+	substring_t args[MAX_OPT_ARGS];
+	unsigned int opt_val, token;
+	char *opt_ptr;
+	kuid_t uid;
+	kgid_t gid;
+
+	opts->mode = BPFFS_DEFAULT_MODE;
+
+	while ((opt_ptr = strsep(&opt_data, ",")) != NULL) {
+		if (!*opt_ptr)
+			continue;
+
+		token = match_token(opt_ptr, bpf_tokens, args);
+		switch (token) {
+		case BPF_OPT_UID:
+			if (match_int(&args[0], &opt_val))
+				return -EINVAL;
+
+			uid = make_kuid(current_user_ns(), opt_val);
+			if (!uid_valid(uid))
+				return -EINVAL;
+
+			opts->uid = uid;
+			break;
+		case BPF_OPT_GID:
+			if (match_int(&args[0], &opt_val))
+				return -EINVAL;
+
+			gid = make_kgid(current_user_ns(), opt_val);
+			if (!gid_valid(gid))
+				return -EINVAL;
+
+			opts->gid = gid;
+			break;
+		case BPF_OPT_MODE:
+			if (match_octal(&args[0], &opt_val))
+				return -EINVAL;
+
+			opts->mode = opt_val & S_IALLUGO;
+			break;
+		default:
+			return -EINVAL;
+		};
+	}
+
+	return 0;
+}
+
+static int bpf_apply_options(struct super_block *sb)
+{
+	const struct bpf_mnt_opts *opts = bpf_sb_mnt_opts(sb);
+	struct inode *inode = sb->s_root->d_inode;
+
+	inode->i_mode &= ~S_IALLUGO;
+	inode->i_mode |= opts->mode;
+
+	inode->i_uid = opts->uid;
+	inode->i_gid = opts->gid;
+
+	return 0;
+}
+
+static int bpf_show_options(struct seq_file *m, struct dentry *root)
+{
+	const struct bpf_mnt_opts *opts = bpf_sb_mnt_opts(root->d_sb);
+
+	if (!uid_eq(opts->uid, GLOBAL_ROOT_UID))
+		seq_printf(m, ",uid=%u",
+			   from_kuid_munged(&init_user_ns, opts->uid));
+
+	if (!gid_eq(opts->gid, GLOBAL_ROOT_GID))
+		seq_printf(m, ",gid=%u",
+			   from_kgid_munged(&init_user_ns, opts->gid));
+
+	if (opts->mode != BPFFS_DEFAULT_MODE)
+		seq_printf(m, ",mode=%o", opts->mode);
+
+	return 0;
+}
+
+static int bpf_remount(struct super_block *sb, int *flags, char *opt_data)
+{
+	struct bpf_fs_info *bfi = sb->s_fs_info;
+	int ret;
+
+	sync_filesystem(sb);
+
+	ret = bpf_parse_options(opt_data, &bfi->mnt_opts);
+	if (ret)
+		return ret;
+
+	bpf_apply_options(sb);
+	return 0;
+}
+
+static const struct super_operations bpf_super_ops = {
+	.statfs		= simple_statfs,
+	.remount_fs	= bpf_remount,
+	.show_options	= bpf_show_options,
+};
+
+static int bpf_fill_super(struct super_block *sb, void *opt_data, int silent)
+{
+	static struct tree_descr bpf_files[] = { { "" } };
+	struct bpf_dir_state *state;
+	struct bpf_fs_info *bfi;
+	struct inode *inode;
+	int ret = -ENOMEM;
+
+	bfi = kzalloc(sizeof(*bfi), GFP_KERNEL);
+	if (!bfi)
+		return ret;
+
+	state = kzalloc(sizeof(*state), GFP_KERNEL);
+	if (!state)
+		goto err_bfi;
+
+	save_mount_options(sb, opt_data);
+	sb->s_fs_info = bfi;
+
+	ret = bpf_parse_options(opt_data, &bfi->mnt_opts);
+	if (ret)
+		goto err_state;
+
+	ret = simple_fill_super(sb, BPFFS_MAGIC, bpf_files);
+	if (ret)
+		goto err_state;
+
+	sb->s_op = &bpf_super_ops;
+
+	inode = sb->s_root->d_inode;
+	inode->i_op = &bpf_dir_iops;
+	inode->i_private = state;
+
+	bpf_apply_options(sb);
+
+	return 0;
+err_state:
+	kfree(state);
+err_bfi:
+	kfree(bfi);
+	return ret;
+}
+
+static void bpf_kill_super(struct super_block *sb)
+{
+	kfree(sb->s_root->d_inode->i_private);
+	kfree(sb->s_fs_info);
+	kill_litter_super(sb);
+}
+
+static struct dentry *bpf_mount(struct file_system_type *type,
+				int flags, const char *dev_name,
+				void *opt_data)
+{
+	return mount_nodev(type, flags, opt_data, bpf_fill_super);
+}
+
+static struct file_system_type bpf_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "bpf",
+	.mount		= bpf_mount,
+	.kill_sb	= bpf_kill_super,
+	.fs_flags	= FS_USERNS_MOUNT,
+};
+
+MODULE_ALIAS_FS("bpf");
+MODULE_ALIAS_FS("bpffs");
+
+static int __init bpf_init(void)
+{
+	return register_filesystem(&bpf_fs_type);
+}
+fs_initcall(bpf_init);
+
+int bpf_fd_inode_add(const struct filename *pathname,
+		     union bpf_any raw, enum bpf_fd_type type)
+{
+	umode_t i_mode = S_IFREG | S_IRUSR | S_IWUSR;
+	struct inode *dir_inode;
+	struct dentry *dentry;
+	struct path path;
+	int ret;
+
+	dentry = kern_path_create(AT_FDCWD, pathname->name, &path, 0);
+	if (IS_ERR(dentry)) {
+		ret = PTR_ERR(dentry);
+		return ret;
+	}
+
+	ret = security_path_mknod(&path, dentry, i_mode, 0);
+	if (ret)
+		goto out;
+
+	dir_inode = d_inode(path.dentry);
+	if (dir_inode->i_op != &bpf_dir_iops) {
+		ret = -EACCES;
+		goto out;
+	}
+
+	ret = security_inode_mknod(dir_inode, dentry, i_mode, 0);
+	if (ret)
+		goto out;
+
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		ret = bpf_mkprog(dir_inode, dentry, raw.prog, i_mode);
+		break;
+	case BPF_FD_TYPE_MAP:
+		ret = bpf_mkmap(dir_inode, dentry, raw.map, i_mode);
+		break;
+	}
+out:
+	done_path_create(&path, dentry);
+	return ret;
+}
+
+union bpf_any bpf_fd_inode_get(const struct filename *pathname,
+			       enum bpf_fd_type *type)
+{
+	struct inode *inode;
+	union bpf_any raw;
+	struct path path;
+	int ret;
+
+	ret = kern_path(pathname->name, LOOKUP_FOLLOW, &path);
+	if (ret)
+		goto out;
+
+	inode = d_backing_inode(path.dentry);
+	ret = inode_permission(inode, MAY_WRITE);
+	if (ret)
+		goto out_path;
+
+	ret = bpf_inode_type(inode, type);
+	if (ret)
+		goto out_path;
+
+	raw.raw_ptr = inode->i_private;
+	if (!raw.raw_ptr) {
+		ret = -EACCES;
+		goto out_path;
+	}
+
+	bpf_any_get(raw, *type);
+	touch_atime(&path);
+	path_put(&path);
+
+	return raw;
+out_path:
+	path_put(&path);
+out:
+	raw.raw_ptr = ERR_PTR(ret);
+	return raw;
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3fff82c..b4a93b8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -679,6 +679,108 @@ free_prog_nouncharge:
 	return err;
 }
 
+void bpf_any_get(union bpf_any raw, enum bpf_fd_type type)
+{
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		atomic_inc(&raw.prog->aux->refcnt);
+		break;
+	case BPF_FD_TYPE_MAP:
+		atomic_inc(&raw.map->refcnt);
+		break;
+	}
+}
+
+void bpf_any_put(union bpf_any raw, enum bpf_fd_type type)
+{
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		bpf_prog_put(raw.prog);
+		break;
+	case BPF_FD_TYPE_MAP:
+		bpf_map_put(raw.map);
+		break;
+	}
+}
+
+#define BPF_PIN_FD_LAST_FIELD	pathname
+#define BPF_NEW_FD_LAST_FIELD	BPF_PIN_FD_LAST_FIELD
+
+static int bpf_pin_fd(const union bpf_attr *attr)
+{
+	struct filename *pathname;
+	enum bpf_fd_type type;
+	union bpf_any raw;
+	int ret;
+
+	if (CHECK_ATTR(BPF_PIN_FD))
+		return -EINVAL;
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	pathname = getname(u64_to_ptr(attr->pathname));
+	if (IS_ERR(pathname))
+		return PTR_ERR(pathname);
+
+	type = BPF_FD_TYPE_MAP;
+	raw.map = bpf_map_get(attr->fd);
+	if (IS_ERR(raw.map)) {
+		type = BPF_FD_TYPE_PROG;
+		raw.prog = bpf_prog_get(attr->fd);
+		if (IS_ERR(raw.prog)) {
+			ret = PTR_ERR(raw.raw_ptr);
+			goto out;
+		}
+	}
+
+	ret = bpf_fd_inode_add(pathname, raw, type);
+	if (ret != 0)
+		bpf_any_put(raw, type);
+out:
+	putname(pathname);
+	return ret;
+}
+
+static int bpf_new_fd(const union bpf_attr *attr)
+{
+	struct filename *pathname;
+	enum bpf_fd_type type;
+	union bpf_any raw;
+	int ret;
+
+	if ((CHECK_ATTR(BPF_NEW_FD)) || attr->fd != 0)
+		return -EINVAL;
+
+	pathname = getname(u64_to_ptr(attr->pathname));
+	if (IS_ERR(pathname))
+		return PTR_ERR(pathname);
+
+	raw = bpf_fd_inode_get(pathname, &type);
+	if (IS_ERR(raw.raw_ptr)) {
+		ret = PTR_ERR(raw.raw_ptr);
+		goto out;
+	}
+
+	switch (type) {
+	case BPF_FD_TYPE_PROG:
+		ret = bpf_prog_new_fd(raw.prog);
+		break;
+	case BPF_FD_TYPE_MAP:
+		ret = bpf_map_new_fd(raw.map);
+		break;
+	default:
+		/* Shut up gcc. */
+		ret = -ENOENT;
+		break;
+	}
+
+	if (ret < 0)
+		bpf_any_put(raw, type);
+out:
+	putname(pathname);
+	return ret;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -739,6 +841,12 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_PROG_LOAD:
 		err = bpf_prog_load(&attr);
 		break;
+	case BPF_PIN_FD:
+		err = bpf_pin_fd(&attr);
+		break;
+	case BPF_NEW_FD:
+		err = bpf_new_fd(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH net-next 4/4] bpf: add sample usages for persistent maps/progs
  2015-10-16  1:09 [PATCH net-next 0/4] BPF updates Daniel Borkmann
                   ` (2 preceding siblings ...)
  2015-10-16  1:09 ` [PATCH net-next 3/4] bpf: add support for persistent maps/progs Daniel Borkmann
@ 2015-10-16  1:09 ` Daniel Borkmann
  2015-10-19  2:53 ` [PATCH net-next 0/4] BPF updates David Miller
  4 siblings, 0 replies; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-16  1:09 UTC (permalink / raw)
  To: davem
  Cc: ast, viro, ebiederm, tgraf, hannes, netdev, linux-kernel,
	Daniel Borkmann

This patch adds a couple of stand-alone examples on how BPF_PIN_FD
and BPF_NEW_FD commands can be used.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/Makefile      |   2 +
 samples/bpf/fds_example.c | 224 ++++++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/libbpf.c      |  20 +++++
 samples/bpf/libbpf.h      |   3 +
 4 files changed, 249 insertions(+)
 create mode 100644 samples/bpf/fds_example.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 63e7d50..f08a3cf 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -4,6 +4,7 @@ obj- := dummy.o
 # List of programs to build
 hostprogs-y := test_verifier test_maps
 hostprogs-y += sock_example
+hostprogs-y += fds_example
 hostprogs-y += sockex1
 hostprogs-y += sockex2
 hostprogs-y += sockex3
@@ -18,6 +19,7 @@ hostprogs-y += lathist
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
+fds_example-objs := fds_example.o libbpf.o
 sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
 sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
 sockex3-objs := bpf_load.o libbpf.o sockex3_user.o
diff --git a/samples/bpf/fds_example.c b/samples/bpf/fds_example.c
new file mode 100644
index 0000000..6f912be
--- /dev/null
+++ b/samples/bpf/fds_example.c
@@ -0,0 +1,224 @@
+#include <linux/unistd.h>
+#include <linux/bpf.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <string.h>
+#include <assert.h>
+#include <errno.h>
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/xattr.h>
+
+#include <net/ethernet.h>
+#include <arpa/inet.h>
+
+#include "libbpf.h"
+
+static char *fname;
+
+static void test_bpf_pin_fd(int fd)
+{
+	char buff[64];
+	int ret;
+
+	memset(buff, 0, sizeof(buff));
+	ret = bpf_pin_fd(fd, fname);
+	getxattr(fname, "bpf.type", buff, sizeof(buff));
+
+	printf("fd:%d type:%s pinned (%s)\n", fd, buff, strerror(errno));
+	assert(ret == 0);
+}
+
+static int test_bpf_new_fd(void)
+{
+	char buff[64];
+	int fd;
+
+	memset(buff, 0, sizeof(buff));
+	getxattr(fname, "bpf.type", buff, sizeof(buff));
+	fd = bpf_new_fd(fname);
+
+	printf("fd:%d type:%s fetched (%s)\n", fd, buff, strerror(errno));
+	assert(fd > 0);
+
+	return fd;
+}
+
+static int test_bpf_map_create(void)
+{
+	int fd;
+
+	fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(uint32_t),
+			    sizeof(uint32_t), 1024);
+
+	printf("fd:%d created (%s)\n", fd, strerror(errno));
+	assert(fd > 0);
+
+	return fd;
+}
+
+static int test_bpf_map_insert(int fd, uint32_t val)
+{
+	uint32_t key = 123;
+	int ret;
+
+	ret = bpf_update_elem(fd, &key, &val, 0);
+
+	printf("fd:%d wrote (%u, %u)\n", fd, key, val);
+	assert(ret == 0);
+
+	return ret;
+}
+
+static int test_bpf_map_lookup(int fd)
+{
+	uint32_t key = 123, val;
+	int ret;
+
+	ret = bpf_lookup_elem(fd, &key, &val);
+
+	printf("fd:%d read (%u, %u)\n", fd, key, val);
+	assert(ret == 0);
+
+	return ret;
+}
+
+static int bpf_map_test_case_1(void)
+{
+	int fd;
+
+	fd = test_bpf_map_create();
+	test_bpf_pin_fd(fd);
+	test_bpf_map_insert(fd, 456);
+	test_bpf_map_lookup(fd);
+	close(fd);
+
+	return 0;
+}
+
+static int bpf_map_test_case_2(void)
+{
+	int fd;
+
+	fd = test_bpf_new_fd();
+	test_bpf_map_lookup(fd);
+	close(fd);
+
+	return 0;
+}
+
+static int bpf_map_test_case_3(void)
+{
+	int fd1, fd2;
+
+	unlink(fname);
+	fd1 = test_bpf_map_create();
+	test_bpf_pin_fd(fd1);
+	fd2 = test_bpf_new_fd();
+	test_bpf_map_lookup(fd1);
+	test_bpf_map_insert(fd2, 456);
+	test_bpf_map_lookup(fd1);
+	test_bpf_map_lookup(fd2);
+	test_bpf_map_insert(fd1, 789);
+	test_bpf_map_lookup(fd2);
+	close(fd1);
+	close(fd2);
+
+	return 0;
+}
+
+static int test_bpf_prog_create(void)
+{
+	struct bpf_insn insns[] = {
+		BPF_MOV64_IMM(BPF_REG_0, 1),
+		BPF_EXIT_INSN(),
+	};
+	int ret;
+
+	ret = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, insns,
+			    sizeof(insns), "GPL", 0);
+
+	assert(ret > 0);
+	printf("fd:%d created\n", ret);
+
+	return ret;
+}
+
+static int test_bpf_prog_attach(int fd)
+{
+	int sock, ret;
+
+	sock = open_raw_sock("lo");
+	ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &fd, sizeof(fd));
+
+	printf("sock:%d got fd:%d attached\n", sock, fd);
+	assert(ret == 0);
+
+	return ret;
+}
+
+static int bpf_prog_test_case_1(void)
+{
+	int fd;
+
+	fd = test_bpf_prog_create();
+	test_bpf_pin_fd(fd);
+	close(fd);
+
+	return 0;
+}
+
+static int bpf_prog_test_case_2(void)
+{
+	int fd;
+
+	fd = test_bpf_new_fd();
+	test_bpf_prog_attach(fd);
+	close(fd);
+
+	return 0;
+}
+
+static int bpf_prog_test_case_3(void)
+{
+	int fd1, fd2;
+
+	unlink(fname);
+	fd1 = test_bpf_prog_create();
+	test_bpf_pin_fd(fd1);
+	fd2 = test_bpf_new_fd();
+	test_bpf_prog_attach(fd1);
+	test_bpf_prog_attach(fd2);
+	close(fd1);
+	close(fd2);
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	if (argc != 3)
+		return -1;
+
+	fname = argv[2];
+
+	if (!strcmp("map-pin", argv[1]))
+		return bpf_map_test_case_1();
+	if (!strcmp("map-new", argv[1]))
+		return bpf_map_test_case_2();
+	if (!strcmp("map-all", argv[1]))
+		return bpf_map_test_case_3();
+
+	if (!strcmp("prog-pin", argv[1]))
+		return bpf_prog_test_case_1();
+	if (!strcmp("prog-new", argv[1]))
+		return bpf_prog_test_case_2();
+	if (!strcmp("prog-all", argv[1]))
+		return bpf_prog_test_case_3();
+
+	return 0;
+}
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 7e1efa7..dcb7c49 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -103,6 +103,26 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_pin_fd(int fd, const char *pathname)
+{
+	union bpf_attr attr = {
+		.fd = fd,
+		.pathname = ptr_to_u64((void *)pathname),
+	};
+
+	return syscall(__NR_bpf, BPF_PIN_FD, &attr, sizeof(attr));
+}
+
+int bpf_new_fd(const char *pathname)
+{
+	union bpf_attr attr = {
+		.fd = 0,
+		.pathname = ptr_to_u64((void *)pathname),
+	};
+
+	return syscall(__NR_bpf, BPF_NEW_FD, &attr, sizeof(attr));
+}
+
 int open_raw_sock(const char *name)
 {
 	struct sockaddr_ll sll;
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index b7f63c70..2832fc6 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 		  const struct bpf_insn *insns, int insn_len,
 		  const char *license, int kern_version);
 
+int bpf_pin_fd(int fd, const char *pathname);
+int bpf_new_fd(const char *pathname);
+
 #define LOG_BUF_SIZE 65536
 extern char bpf_log_buf[LOG_BUF_SIZE];
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16  1:09 ` [PATCH net-next 3/4] bpf: add support for persistent maps/progs Daniel Borkmann
@ 2015-10-16 10:25   ` Hannes Frederic Sowa
  2015-10-16 13:36     ` Daniel Borkmann
  2015-10-16 16:18     ` Alexei Starovoitov
  2015-10-16 17:21   ` Hannes Frederic Sowa
  1 sibling, 2 replies; 56+ messages in thread
From: Hannes Frederic Sowa @ 2015-10-16 10:25 UTC (permalink / raw)
  To: Daniel Borkmann, davem
  Cc: ast, viro, ebiederm, tgraf, netdev, linux-kernel, Alexei Starovoitov

On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
> This eventually leads us to this patch, which implements a minimal
> eBPF file system. The idea is a bit similar, but to the point that
> these inodes reside at one or multiple mount points. A directory
> hierarchy can be tailored to a specific application use-case from the
> various subsystem users and maps/progs pinned inside it. Two new eBPF
> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
> order to create one or multiple special inodes from an existing file
> descriptor that points to a map/program (we call it eBPF fd pinning),
> or to create a new file descriptor from an existing special inode.
> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
> can also be done unpriviledged when having appropriate permissions
> to the path.

In my opinion this is very un-unixy, I have to say at least.

Namespaces at some point dealt with the same problem, they nowadays use
bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
the namespace alive. This at least allows someone to build up its own
hierarchy with normal unix tools and not hidden inside a C-program. For
filedescriptors we already have /proc/$$/fd/* but it seems that doesn't
work out of the box nowadays.
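Just to illustrate what that trick builds on (a minimal sketch, nothing bpf-specific): every open fd of a process is reachable as a magic symlink under /proc, and that path is exactly what the namespace-style bind mount would pin:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Resolve what an open fd of the current process points to via
 * /proc/self/fd/N -- the same magic symlink a bind mount would pin
 * in place (mount --bind /proc/$$/fd/N /some/path, CAP_SYS_ADMIN
 * required for the mount itself). */
static ssize_t resolve_fd(int fd, char *buf, size_t len)
{
	char path[64];
	ssize_t n;

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	n = readlink(path, buf, len - 1);
	if (n >= 0)
		buf[n] = '\0';
	return n;
}
```

The bind-mount step is left as a comment since it needs privileges; the point is only that the fd is already addressable by path.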

I don't know in terms of how many objects bpf should be able to handle
and if such a bind-mount based solution would work, I guess not.

In my opinion I still favor a user space approach. Subsystems which use
ebpf in a way that no user space program needs to be running to control
them would need to export the fds by itself. E.g. something like
sysfs/kobject for tc? The hierarchy would then be in control of the
subsystem which could also create a proper naming hierarchy or maybe
even use an already given one. Do most other eBPF users really need to
persist file descriptors somewhere without user space control and pick
them up later? 

Sorry for the rant and thanks for posting this patchset,
Hannes

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 10:25   ` Hannes Frederic Sowa
@ 2015-10-16 13:36     ` Daniel Borkmann
  2015-10-16 16:36       ` Hannes Frederic Sowa
  2015-10-16 16:18     ` Alexei Starovoitov
  1 sibling, 1 reply; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-16 13:36 UTC (permalink / raw)
  To: Hannes Frederic Sowa, davem
  Cc: ast, viro, ebiederm, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote:
> On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
>> This eventually leads us to this patch, which implements a minimal
>> eBPF file system. The idea is a bit similar, but to the point that
>> these inodes reside at one or multiple mount points. A directory
>> hierarchy can be tailored to a specific application use-case from the
>> various subsystem users and maps/progs pinned inside it. Two new eBPF
>> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
>> order to create one or multiple special inodes from an existing file
>> descriptor that points to a map/program (we call it eBPF fd pinning),
>> or to create a new file descriptor from an existing special inode.
>> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
>> can also be done unprivileged when having appropriate permissions
>> to the path.
>
> In my opinion this is very un-unixy, I have to say at least.
>
> Namespaces at some point dealt with the same problem, they nowadays use
> bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
> the namespace alive. This at least allows someone to build up its own
> hierarchy with normal unix tools and not hidden inside a C-program. For
> filedescriptors we already have /proc/$$/fd/* but it seems that doesn't
> work out of the box nowadays.

Yes, that doesn't work out of the box, but I also don't know how usable
that would really be. The idea is roughly rather similar to the paths
passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You
have a map/prog resource that you stick to a special inode so that you
can retrieve it at a later point in time from the same or different
processes through a new fd pointing to the resource from user side, so
that the bpf(2) syscall can be performed upon it.
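For illustration, the user-side half of that is tiny; the sketch below mimics the helpers from patch 4, with pin_attr only standing in for the relevant members of union bpf_attr (not the real uapi layout):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The pathname crosses the syscall boundary as a u64, like in the
 * helpers patch 4 adds to samples/bpf/libbpf.c.  pin_attr is a
 * stand-in for the relevant anonymous members of union bpf_attr,
 * not the real uapi layout. */
struct pin_attr {
	uint32_t fd;
	uint64_t pathname;
};

static uint64_t ptr_to_u64(const void *ptr)
{
	return (uint64_t)(unsigned long)ptr;
}

static struct pin_attr make_pin_attr(int fd, const char *pathname)
{
	struct pin_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.fd = fd;
	attr.pathname = ptr_to_u64(pathname);
	/* followed by: syscall(__NR_bpf, BPF_PIN_FD, &attr, sizeof(attr)); */
	return attr;
}
```

So pinning and retrieving really reduce to filling in one fd/pathname pair and issuing the respective bpf(2) command.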

With Unix tools, you could still create/remove a hierarchy or unlink
those that have maps/progs. You are correct that tools that don't
implement bpf(2) currently cannot access the content behind it, since
bpf(2) manages access to the data itself. I did like the 2nd idea though,
mentioned in the commit log, but don't know how flexible we are in
terms of adding S_IFBPF to the UAPI.

> I don't know in terms of how many objects bpf should be able to handle
> and if such a bind-mount based solution would work, I guess not.
>
> In my opinion I still favor a user space approach. Subsystems which use
> ebpf in a way that no user space program needs to be running to control
> them would need to export the fds by itself. E.g. something like
> sysfs/kobject for tc? The hierarchy would then be in control of the
> subsystem which could also create a proper naming hierarchy or maybe
> even use an already given one. Do most other eBPF users really need to
> persist file descriptors somewhere without user space control and pick
> them up later?

I was thinking about a strict predefined hierarchy dictated by the kernel
as well, but was then considering a more flexible approach that could be
tailored freely to various use cases. A predefined hierarchy would most
likely need to be resolved per subsystem and it's not really easy to map
this properly. F.e. if the kernel would try to provide unique ids (as
opposed to have a name or annotation member through the syscall), it
could end up being quite cryptic. If we let the users choose names, I'm
not sure if a single hierarchy level would be enough. Then, additionally
you have facilities like tail calls that eBPF programs could do.

In such cases, one could even craft relationships where a (strict auto
generated) tree representation would not be sufficient (f.e. recirculation
up to a certain depth). The tail called programs could be changed
atomically during runtime, etc. The other issue related to a per subsystem
representation is that bpf(2) is the central management interface for
creating/accessing maps/progs, and each subsystem then has its own little
interface to "install" them internally (f.e. via netlink, setsockopt(2),
etc). That means, with tail calls, only the 'root' programs are installed
there and further transactions would be needed in order to make individual
subsystems aware, so they could potentially generate some hierarchy; don't
know, it seems rather complex.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 10:25   ` Hannes Frederic Sowa
  2015-10-16 13:36     ` Daniel Borkmann
@ 2015-10-16 16:18     ` Alexei Starovoitov
  2015-10-16 16:43       ` Hannes Frederic Sowa
  1 sibling, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-16 16:18 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Daniel Borkmann, davem
  Cc: viro, ebiederm, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote:
> Namespaces at some point dealt with the same problem, they nowadays use
> bind mounts of/proc/$$/ns/* to some place in the file hierarchy to keep
> the namespace alive. This at least allows someone to build up its own
> hierarchy with normal unix tools and not hidden inside a C-program. For
> filedescriptors we already have/proc/$$/fd/* but it seems that doesn't
> work out of the box nowadays.

bind mounting of /proc/../fd was initially proposed by Andy and we've
looked at it thoroughly, but after discussion with Eric it became
apparent that it doesn't fit here. At the end we need shell tools
to access maps.
Also, I think you missed that the hierarchy in this patch set _is_ built
with normal 'mkdir' and files are removed with 'rm'.
The only thing that C does is BPF_PIN_FD of fd that was received from
bpf syscall. That's way cleaner api than doing bind mount from C
program.
We've considered letting open() of the file return bpf specific
anon-inode, but decided to reserve that for other more natural file
operations. Therefore BPF_NEW_FD is needed.

> I don't know in terms of how many objects bpf should be able to handle
> and if such a bind-mount based solution would work, I guess not.

We definitely missed you at the last plumbers where it was discussed :)

> In my opinion I still favor a user space approach.

that's not acceptable for tracing use cases. No daemons allowed.

> Subsystems which use
> ebpf in a way that no user space program needs to be running to control
> them would need to export the fds by itself. E.g. something like
> sysfs/kobject for tc? The hierarchy would then be in control of the
> subsystem which could also create a proper naming hierarchy or maybe
> even use an already given one. Do most other eBPF users really need to
> persist file descriptors somewhere without user space control and pick
> them up later?

I think it's way cleaner to have one way of solving it (like this patch
does) instead of asking every subsystem to solve it differently.
We've also looked at sysfs and it's ugly when it comes to removing,
since the user cannot use normal 'rm'.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 13:36     ` Daniel Borkmann
@ 2015-10-16 16:36       ` Hannes Frederic Sowa
  2015-10-16 17:27         ` Daniel Borkmann
  0 siblings, 1 reply; 56+ messages in thread
From: Hannes Frederic Sowa @ 2015-10-16 16:36 UTC (permalink / raw)
  To: Daniel Borkmann, davem
  Cc: ast, viro, ebiederm, tgraf, netdev, linux-kernel, Alexei Starovoitov

On Fri, Oct 16, 2015, at 15:36, Daniel Borkmann wrote:
> On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote:
> > On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
> >> This eventually leads us to this patch, which implements a minimal
> >> eBPF file system. The idea is a bit similar, but to the point that
> >> these inodes reside at one or multiple mount points. A directory
> >> hierarchy can be tailored to a specific application use-case from the
> >> various subsystem users and maps/progs pinned inside it. Two new eBPF
> >> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
> >> order to create one or multiple special inodes from an existing file
> >> descriptor that points to a map/program (we call it eBPF fd pinning),
> >> or to create a new file descriptor from an existing special inode.
> >> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
> >> can also be done unprivileged when having appropriate permissions
> >> to the path.
> >
> > In my opinion this is very un-unixy, I have to say at least.
> >
> > Namespaces at some point dealt with the same problem, they nowadays use
> > bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
> > the namespace alive. This at least allows someone to build up its own
> > hierarchy with normal unix tools and not hidden inside a C-program. For
> > filedescriptors we already have /proc/$$/fd/* but it seems that doesn't
> > work out of the box nowadays.
> 
> Yes, that doesn't work out of the box, but I also don't know how usable
> that would really be. The idea is roughly rather similar to the paths
> passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You
> have a map/prog resource that you stick to a special inode so that you
> can retrieve it at a later point in time from the same or different
> processes through a new fd pointing to the resource from user side, so
> that the bpf(2) syscall can be performed upon it.
> 
> With Unix tools, you could still create/remove a hierarchy or unlink
> those that have maps/progs. You are correct that tools that don't
> implement bpf(2) currently cannot access the content behind it, since
> bpf(2) manages access to the data itself. I did like the 2nd idea though,
> mentioned in the commit log, but don't know how flexible we are in
> terms of adding S_IFBPF to the UAPI.

I don't think it should be a problem. You referred to the POSIX standard
in your other mail, but I can't see any reason not to establish a new
file mode. Anyway, FreeBSD (e.g. whiteouts) and Solaris (e.g. Doors,
Event Ports) are just examples of new modes being added.

mknod /bpf/map/1 m 1 1

:)

Yes, maybe this is the better solution architecturally, instead of
constructing a new filesystem.

> > I don't know in terms of how many objects bpf should be able to handle
> > and if such a bind-mount based solution would work, I guess not.
> >
> > In my opinion I still favor a user space approach. Subsystems which use
> > ebpf in a way that no user space program needs to be running to control
> > them would need to export the fds by itself. E.g. something like
> > sysfs/kobject for tc? The hierarchy would then be in control of the
> > subsystem which could also create a proper naming hierarchy or maybe
> > even use an already given one. Do most other eBPF users really need to
> > persist file descriptors somewhere without user space control and pick
> > them up later?
> 
> I was thinking about a strict predefined hierarchy dictated by the kernel
> as well, but was then considering a more flexible approach that could be
> tailored freely to various use cases. A predefined hierarchy would most
> likely need to be resolved per subsystem and it's not really easy to map
> this properly. F.e. if the kernel would try to provide unique ids (as
> opposed to have a name or annotation member through the syscall), it
> could end up being quite cryptic. If we let the users choose names, I'm
> not sure if a single hierarchy level would be enough. Then, additionally
> you have facilities like tail calls that eBPF programs could do.

I don't think that most subsystems need to expose those file
descriptors. Seccomp will probably have a supervisor process running,
and perf aggregation will also have a user space process running that
keeps the fd alive. So it is all about tc/sched.

And I am not sure if tc really needs a filesystem to handle all
this. The simplest approach is to just keep a name <-> fd mapping
somewhere in the net/sched/ subsystem and use this for all tc users.
Otherwise, can we somehow incorporate this into a sysfs directory where
we maybe create a kobject per installed filter, something along those
lines?

I see that tail calls make it very difficult to show which entity
uses which ebpf entity, as it looks like n:m relationships.

> In such cases, one could even craft relationships where a (strict auto
> generated) tree representation would not be sufficient (f.e.
> recirculation
> up to a certain depth). The tail called programs could be changed
> atomically during runtime, etc. The other issue related to a per
> subsystem
> representation is that bpf(2) is the central management interface for
> creating/accessing maps/progs, and each subsystem then has its own little
> interface to "install" them internally (f.e. via netlink, setsockopt(2),
> etc). That means, with tail calls, only the 'root' programs are installed
> there and further transactions would be needed in order to make
> individual
> subsystems aware, so they could potentially generate some hierarchy;
> don't
> know, it seems rather complex.

I understand, this is really not suitable to represent in its entirety
in sysfs or any kind of hierarchical structure right now. Either we
limit it somewhat (Alexei will certainly intervene here) or one of your
filesystem approaches will win.

But I still wonder why people are so against user space dependencies?

Another idea that I discussed with Daniel just to have it publicly
available: a userspace helper would be called for every ebpf entity
change so it could mirror or keep track of ebpf handles in user space. You
can think along the lines of kernel/core_pattern. This would probably
also depend on non-anon-inode usage of ebpf fds.

Will have to think about that a bit more,
Hannes

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 16:18     ` Alexei Starovoitov
@ 2015-10-16 16:43       ` Hannes Frederic Sowa
  2015-10-16 17:32         ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Hannes Frederic Sowa @ 2015-10-16 16:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, davem
  Cc: viro, ebiederm, tgraf, netdev, linux-kernel, Alexei Starovoitov

Hi Alexei,

On Fri, Oct 16, 2015, at 18:18, Alexei Starovoitov wrote:
> On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote:
> > Namespaces at some point dealt with the same problem, they nowadays use
> > bind mounts of/proc/$$/ns/* to some place in the file hierarchy to keep
> > the namespace alive. This at least allows someone to build up its own
> > hierarchy with normal unix tools and not hidden inside a C-program. For
> > filedescriptors we already have/proc/$$/fd/* but it seems that doesn't
> > work out of the box nowadays.
> 
> bind mounting of /proc/../fd was initially proposed by Andy and we've
> looked at it thoroughly, but after discussion with Eric it became
> apparent that it doesn't fit here. At the end we need shell tools
> to access maps.

Oh yes, I want shell tools for this very much! Maybe even that things
like strings, grep etc. work. :)

> Also, I think you missed that the hierarchy in this patch set _is_ built
> with normal 'mkdir' and files are removed with 'rm'.

I did not miss that, I am just concerned that if the kernel does not
enforce such a hierarchy automatically it won't really happen.

> The only thing that C does is BPF_PIN_FD of fd that was received from
> bpf syscall. That's way cleaner api than doing bind mount from C
> program.

I am with you there. Unfortunately we don't have a "give this fd a name"
syscall so far, so I totally understand the decision here.

> We've considered letting open() of the file return bpf specific
> anon-inode, but decided to reserve that for other more natural file
> operations. Therefore BPF_NEW_FD is needed.

Can't this be overloaded somehow? You could use mknod for creation and
open for regular file use. mknod is its own syscall.

> > I don't know in terms of how many objects bpf should be able to handle
> > and if such a bind-mount based solution would work, I guess not.
> 
> We definitely missed you at the last plumbers where it was discussed :)

Yes. :(

> > In my opinion I still favor a user space approach.
> 
> that's not acceptable for tracing use cases. No daemons allowed.

Oh, tracing does not allow daemons. Why? I can only imagine embedded
users, no?

> > Subsystems which use
> > ebpf in a way that no user space program needs to be running to control
> > them would need to export the fds by itself. E.g. something like
> > sysfs/kobject for tc? The hierarchy would then be in control of the
> > subsystem which could also create a proper naming hierarchy or maybe
> > even use an already given one. Do most other eBPF users really need to
> > persist file descriptors somewhere without user space control and pick
> > them up later?
> 
> I think it's way cleaner to have one way of solving it (like this patch
> does) instead of asking every subsystem to solve it differently.
> We've also looked at sysfs and it's ugly when it comes to removing,
> since the user cannot use normal 'rm'.

Ah, okay. Probably it would depend on some tc node always referencing
the bpf entity. But I see that sysfs might become too problematic.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16  1:09 ` [PATCH net-next 3/4] bpf: add support for persistent maps/progs Daniel Borkmann
  2015-10-16 10:25   ` Hannes Frederic Sowa
@ 2015-10-16 17:21   ` Hannes Frederic Sowa
  2015-10-16 17:42     ` Alexei Starovoitov
  1 sibling, 1 reply; 56+ messages in thread
From: Hannes Frederic Sowa @ 2015-10-16 17:21 UTC (permalink / raw)
  To: Daniel Borkmann, davem
  Cc: ast, viro, ebiederm, tgraf, netdev, linux-kernel, Alexei Starovoitov

On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
> This eventually leads us to this patch, which implements a minimal
> eBPF file system. The idea is a bit similar, but to the point that
> these inodes reside at one or multiple mount points. A directory
> hierarchy can be tailored to a specific application use-case from the
> various subsystem users and maps/progs pinned inside it. Two new eBPF
> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
> order to create one or multiple special inodes from an existing file
> descriptor that points to a map/program (we call it eBPF fd pinning),
> or to create a new file descriptor from an existing special inode.
> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
> can also be done unprivileged when having appropriate permissions
> to the path.
> 

Another question:
> Should multiple mounts of the filesystem result in an empty fs (a new
> instance) or in one where one can see other ebpf-fs entities? I think
Daniel wanted to already use the mountpoint as some kind of hierarchy
delimiter. I would have used directories for that and multiple mounts
would then have resulted in the same content of the filesystem. IMHO
this would remove some ambiguity but then the question arises how this
is handled in a namespaced environment. Was there some specific reason
to do so?

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 16:36       ` Hannes Frederic Sowa
@ 2015-10-16 17:27         ` Daniel Borkmann
  2015-10-16 17:37           ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-16 17:27 UTC (permalink / raw)
  To: Hannes Frederic Sowa, davem
  Cc: ast, viro, ebiederm, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/16/2015 06:36 PM, Hannes Frederic Sowa wrote:
> On Fri, Oct 16, 2015, at 15:36, Daniel Borkmann wrote:
>> On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote:
>>> On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
>>>> This eventually leads us to this patch, which implements a minimal
>>>> eBPF file system. The idea is a bit similar, but to the point that
>>>> these inodes reside at one or multiple mount points. A directory
>>>> hierarchy can be tailored to a specific application use-case from the
>>>> various subsystem users and maps/progs pinned inside it. Two new eBPF
>>>> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
>>>> order to create one or multiple special inodes from an existing file
>>>> descriptor that points to a map/program (we call it eBPF fd pinning),
>>>> or to create a new file descriptor from an existing special inode.
>>>> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
>>>> can also be done unprivileged when having appropriate permissions
>>>> to the path.
>>>
>>> In my opinion this is very un-unixy, I have to say at least.
>>>
>>> Namespaces at some point dealt with the same problem, they nowadays use
>>> bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
>>> the namespace alive. This at least allows someone to build up its own
>>> hierarchy with normal unix tools and not hidden inside a C-program. For
>>> filedescriptors we already have /proc/$$/fd/* but it seems that doesn't
>>> work out of the box nowadays.
>>
>> Yes, that doesn't work out of the box, but I also don't know how usable
>> that would really be. The idea is roughly rather similar to the paths
>> passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You
>> have a map/prog resource that you stick to a special inode so that you
>> can retrieve it at a later point in time from the same or different
>> processes through a new fd pointing to the resource from user side, so
>> that the bpf(2) syscall can be performed upon it.
>>
>> With Unix tools, you could still create/remove a hierarchy or unlink
>> those that have maps/progs. You are correct that tools that don't
>> implement bpf(2) currently cannot access the content behind it, since
>> bpf(2) manages access to the data itself. I did like the 2nd idea though,
>> mentioned in the commit log, but don't know how flexible we are in
>> terms of adding S_IFBPF to the UAPI.
>
> I don't think it should be a problem. You referred to POSIX Standard in
> your other mail but I can't see any reason why not to establish a new
> file mode. Anyway, FreeBSD (e.g. whiteouts) and Solaris (e.g. Doors,
> Event Ports) are just examples of new modes being added.
>
> mknod /bpf/map/1 m 1 1
>
> :)
>
> Yes, I think maybe this is a better solution architecturally, instead of
> constructing a new filesystem.

Yeah, also 'man 2 stat' lists a couple of others used by various systems.

The pros of this approach would be that no new file system would be needed
and the special inode could be placed on top of any 'regular' file system
that would support special files. I do like that as well.

I'm wondering whether this would prevent us in future from opening access
to shell tools etc on that special file, but probably one could provide a
default set of file ops via init_special_inode() that could be overloaded
by the underlying fs if required.

>>> I don't know in terms of how many objects bpf should be able to handle
>>> and if such a bind-mount based solution would work, I guess not.
>>>
>>> In my opinion I still favor a user space approach. Subsystems which use
>>> ebpf in a way that no user space program needs to be running to control
>>> them would need to export the fds by itself. E.g. something like
>>> sysfs/kobject for tc? The hierarchy would then be in control of the
>>> subsystem which could also create a proper naming hierarchy or maybe
>>> even use an already given one. Do most other eBPF users really need to
>>> persist file descriptors somewhere without user space control and pick
>>> them up later?
>>
>> I was thinking about a strict predefined hierarchy dictated by the kernel
>> as well, but was then considering a more flexible approach that could be
>> tailored freely to various use cases. A predefined hierarchy would most
>> likely need to be resolved per subsystem and it's not really easy to map
>> this properly. F.e. if the kernel would try to provide unique ids (as
>> opposed to have a name or annotation member through the syscall), it
>> could end up being quite cryptic. If we let the users choose names, I'm
>> not sure if a single hierarchy level would be enough. Then, additionally
>> you have facilities like tail calls that eBPF programs could do.
>
> I don't think that most subsystems need to expose those file
> descriptors. Seccomp probably will have a supervisor process running and
> per aggregation will also have a user space process running keeping the
> fd alive. So it is all about tc/sched.
>
> And I am not sure if tc will really needs a filesystem to handle all
> this. The simplest approach is to just keep a name <-> fd mapping
> somewhere in the net/sched/ subsystem and use this for all tc users.

Solving this on a generic level eventually felt cleaner, where a subsystem
would have the choice of whether making use of this or not. tc/sched has
currently two types BPF_PROG_TYPE_SCHED_{CLS,ACT}, so a common facility
would be needed for both subsystems. It's a bit hard to see what other
subsystems would come in future, and we could end up with multiple
subsystem-specific APIs essentially doing the same thing.

At the very beginning, there was also the idea to just reference such an
object by name, but it would need to be made available somewhere (procfs?)
to get a picture and manage them from an admin pov. Having some object
exposed as a file like other ipc building blocks seems better, imho.
Whether as special file or file system, yeah, that's a different question.

[...]
> I see that tail calls makes this all very difficult to show which entity
> uses which ebpf entity in some way, as it looks like n:m relationships.

Yes, this is indeed the case.

>> In such cases, one could even craft relationships where a (strict auto
>> generated) tree representation would not be sufficient (f.e.
>> recirculation
>> up to a certain depth). The tail called programs could be changed
>> atomically during runtime, etc. The other issue related to a per
>> subsystem
>> representation is that bpf(2) is the central management interface for
>> creating/accessing maps/progs, and each subsystem then has its own little
>> interface to "install" them internally (f.e. via netlink, setsockopt(2),
>> etc). That means, with tail calls, only the 'root' programs are installed
>> there and further transactions would be needed in order to make
>> individual
>> subsystems aware, so they could potentially generate some hierarchy;
>> don't
>> know, it seems rather complex.
>
> I understand, this is really not suitable to represent in its entirety
> in sysfs or any kind of hierarchical structure right now. Either we
> limit it somewhat (Alexei will certainly intervene here) or one of your
> filesystem approaches will win.
>
> But I still wonder why people are so against user space dependencies?
>
> Another idea that I discussed with Daniel just to have it publicly
> available: a userspace helper would be called for every ebpf entity
> change so it could mirror or keep track of ebpf handles in user space. You
> can think along the lines of kernel/core_pattern. This would probably
> also depend on non-anon-inode usage of ebpf fds.

Yes, it seems to me, but other than that, it would also require a user
space daemon managing all these, right? At least from the consensus at
Plumbers, running an extra daemon was considered rather impractical wrt
deployment (same with fuse).

Best,
Daniel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 16:43       ` Hannes Frederic Sowa
@ 2015-10-16 17:32         ` Alexei Starovoitov
  2015-10-16 17:37           ` Thomas Graf
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-16 17:32 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Daniel Borkmann, davem
  Cc: viro, ebiederm, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/16/15 9:43 AM, Hannes Frederic Sowa wrote:
> Hi Alexei,
>
> On Fri, Oct 16, 2015, at 18:18, Alexei Starovoitov wrote:
>> On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote:
>>> Namespaces at some point dealt with the same problem, they nowadays use
>>> bind mounts of/proc/$$/ns/* to some place in the file hierarchy to keep
>>> the namespace alive. This at least allows someone to build up its own
>>> hierarchy with normal unix tools and not hidden inside a C-program. For
>>> filedescriptors we already have/proc/$$/fd/* but it seems that doesn't
>>> work out of the box nowadays.
>>
>> bind mounting of /proc/../fd was initially proposed by Andy and we've
>> looked at it thoroughly, but after discussion with Eric it became
>> apparent that it doesn't fit here. At the end we need shell tools
>> to access maps.
>
> Oh yes, I want shell tools for this very much! Maybe even that things
> like strings, grep etc. work. :)

yes and the only way to get there is to have it done via fs.

>> Also I think you missed the hierarchy in this patch set _is_ built with
>> normal 'mkdir' and files are removed with 'rm'.
>
> I did not miss that, I am just concerned that if the kernel does not
> enforce such a hierarchy automatically it won't really happen.

if it's easier for a user to work with a single level of directories,
they should be able to do so. It's not the job of the kernel to enforce
how user space apps should be designed.

> Oh, tracing does not allow daemons. Why? I can only imagine embedded
> users, no?

yes and for networking: restartability and HA.
cannot really do that with fuse/daemons.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 17:32         ` Alexei Starovoitov
@ 2015-10-16 17:37           ` Thomas Graf
  0 siblings, 0 replies; 56+ messages in thread
From: Thomas Graf @ 2015-10-16 17:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hannes Frederic Sowa, Daniel Borkmann, davem, viro, ebiederm,
	netdev, linux-kernel, Alexei Starovoitov

On 10/16/15 at 10:32am, Alexei Starovoitov wrote:
> On 10/16/15 9:43 AM, Hannes Frederic Sowa wrote:
> >Oh, tracing does not allow daemons. Why? I can only imagine embedded
> >users, no?
> 
> yes and for networking: restartability and HA.
> cannot really do that with fuse/daemons.

Right, the smaller the footprint, the better.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 17:27         ` Daniel Borkmann
@ 2015-10-16 17:37           ` Alexei Starovoitov
  0 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-16 17:37 UTC (permalink / raw)
  To: Daniel Borkmann, Hannes Frederic Sowa, davem
  Cc: viro, ebiederm, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/16/15 10:27 AM, Daniel Borkmann wrote:
>>> but don't know how flexible we are in
>>> terms of adding S_IFBPF to the UAPI.
>>
>> I don't think it should be a problem. You referred to POSIX Standard in
>> your other mail but I can't see any reason why not to establish a new
>> file mode. Anyway, FreeBSD (e.g. whiteouts) and Solaris (e.g. Doors,
>> Event Ports) are just examples of new modes being added.
>>
>> mknod /bpf/map/1 m 1 1
>>
>> :)
>>
>> Yes, I think maybe this is a better solution architecturally, instead of
>> constructing a new filesystem.
>
> Yeah, also 'man 2 stat' lists a couple of others used by various systems.
>
> The pros of this approach would be that no new file system would be needed
> and the special inode could be placed on top of any 'regular' file system
> that would support special files. I do like that as well.

I don't like it at all for the reasons you've just stated:
'it will prevent us doing shell style access to such files'

> I'm wondering whether this would prevent us in future from opening access
> to shell tools etc on that special file, but probably one could provide a
> default set of file ops via init_special_inode() that could be overloaded
> by the underlying fs if required.

and also because adding a new S_ISSOCK-like bit for bpf feels like the
beginning of a nightmare, since sooner or later all filesystems would need
to have a check for it like they have for the sock type.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 17:21   ` Hannes Frederic Sowa
@ 2015-10-16 17:42     ` Alexei Starovoitov
  2015-10-16 17:56       ` Daniel Borkmann
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-16 17:42 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Daniel Borkmann, davem
  Cc: viro, ebiederm, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
> Another question:
> Should multiple mounts of the filesystem result in an empty fs (a new
> instance) or in one where one can see other ebpf-fs entities? I think
> Daniel wanted to already use the mountpoint as some kind of hierarchy
> delimiter. I would have used directories for that and multiple mounts
> would then have resulted in the same content of the filesystem. IMHO
> this would remove some ambiguity but then the question arises how this
> is handled in a namespaced environment. Was there some specific reason
> to do so?

That's an interesting question!
I think all mounts should be independent.
I can see tracing using one and networking using another one
with different hierarchies suitable for their own use cases.
What's an advantage to have the same content everywhere?
Feels harder to manage, since different users would need to
coordinate.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 17:42     ` Alexei Starovoitov
@ 2015-10-16 17:56       ` Daniel Borkmann
  2015-10-16 18:41         ` Eric W. Biederman
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-16 17:56 UTC (permalink / raw)
  To: Alexei Starovoitov, Hannes Frederic Sowa, davem
  Cc: viro, ebiederm, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>> Another question:
>> Should multiple mounts of the filesystem result in an empty fs (a new
>> instance) or in one where one can see other ebpf-fs entities? I think
>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>> delimiter. I would have used directories for that and multiple mounts
>> would then have resulted in the same content of the filesystem. IMHO
>> this would remove some ambiguity but then the question arises how this
>> is handled in a namespaced environment. Was there some specific reason
>> to do so?
>
> That's an interesting question!
> I think all mounts should be independent.
> I can see tracing using one and networking using another one
> with different hierarchies suitable for their own use cases.
> What's an advantage to have the same content everywhere?
> Feels harder to manage, since different users would need to
> coordinate.

I initially had it as a mount_single() file system, where I was thinking
to have an entry under /sys/fs/bpf/, so all subsystems would work on top
of that mount point, but for the same reasons above I lifted that restriction.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 17:56       ` Daniel Borkmann
@ 2015-10-16 18:41         ` Eric W. Biederman
  2015-10-16 19:27           ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Eric W. Biederman @ 2015-10-16 18:41 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Hannes Frederic Sowa, davem, viro, tgraf,
	netdev, linux-kernel, Alexei Starovoitov

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
>> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>>> Another question:
>>> Should multiple mounts of the filesystem result in an empty fs (a new
>>> instance) or in one where one can see other ebpf-fs entities? I think
>>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>>> delimiter. I would have used directories for that and multiple mounts
>>> would then have resulted in the same content of the filesystem. IMHO
>>> this would remove some ambiguity but then the question arises how this
>>> is handled in a namespaced environment. Was there some specific reason
>>> to do so?
>>
>> That's an interesting question!
>> I think all mounts should be independent.
>> I can see tracing using one and networking using another one
>> with different hierarchies suitable for their own use cases.
>> What's an advantage to have the same content everywhere?
>> Feels harder to manage, since different users would need to
>> coordinate.
>
> I initially had it as a mount_single() file system, where I was thinking
> to have an entry under /sys/fs/bpf/, so all subsystems would work on top
> of that mount point, but for the same reasons above I lifted that restriction.

I am missing something.

When I suggested using a filesystem it was my thought there would be
exactly one superblock per map, and the map would be specified at mount
time.  You clearly are not implementing that.

A filesystem per map makes sense as you have a key-value store with one
file per key.

The idea is that something resembling your bpf_pin_fd function would be
the mount system call for the filesystem.

The keys in the map could be read by "ls /mountpoint/".
Key values could be inspected with "cat /mountpoint/key".

That allows all hierarchy etc to be handled in userspace, just as with
my files for namespaces.

I do not understand why you have presented to userspace a magic
filesystem that you allow binding to.  That is not what I intended to
suggest and I do not know how that makes any sense.

Eric



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 18:41         ` Eric W. Biederman
@ 2015-10-16 19:27           ` Alexei Starovoitov
  2015-10-16 19:53             ` Eric W. Biederman
  2015-10-16 19:54             ` Daniel Borkmann
  0 siblings, 2 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-16 19:27 UTC (permalink / raw)
  To: Eric W. Biederman, Daniel Borkmann
  Cc: Hannes Frederic Sowa, davem, viro, tgraf, netdev, linux-kernel,
	Alexei Starovoitov

On 10/16/15 11:41 AM, Eric W. Biederman wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>
>> On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
>>> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>>>> Another question:
>>>> Should multiple mounts of the filesystem result in an empty fs (a new
>>>> instance) or in one where one can see other ebpf-fs entities? I think
>>>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>>>> delimiter. I would have used directories for that and multiple mounts
>>>> would then have resulted in the same content of the filesystem. IMHO
>>>> this would remove some ambiguity but then the question arises how this
>>>> is handled in a namespaced environment. Was there some specific reason
>>>> to do so?
>>>
>>> That's an interesting question!
>>> I think all mounts should be independent.
>>> I can see tracing using one and networking using another one
>>> with different hierarchies suitable for their own use cases.
>>> What's an advantage to have the same content everywhere?
>>> Feels harder to manage, since different users would need to
>>> coordinate.
>>
>> I initially had it as a mount_single() file system, where I was thinking
>> to have an entry under /sys/fs/bpf/, so all subsystems would work on top
>> of that mount point, but for the same reasons above I lifted that restriction.
>
> I am missing something.
>
> When I suggested using a filesystem it was my thought there would be
> exactly one superblock per map, and the map would be specified at mount
> time.  You clearly are not implementing that.

I don't think it's practical to have sb per map, since that would mean
sb per prog and that won't scale.
Also map today is an fd that belongs to a process. I cannot see
an api from C program to do 'mount of FD' that wouldn't look like
ugly hack.

> A filesystem per map makes sense as you have a key-value store with one
> file per key.
>
> The idea is that something resembling your bpf_pin_fd function would be
> the mount system call for the filesystem.
>
> The keys in the map could be read by "ls /mountpoint/".
> Key values could be inspected with "cat /mountpoint/key".

yes. that is still the goal for follow up patches, but contained
within given bpffs. Something bpf_pin_fd-like command for bpf syscall
would create files for keys in a map and allow 'cat' via open/read.
Such api would be much cleaner from C app point of view.
Potentially we can allow mount of a file created via BPF_PIN_FD
that will expand into keys/values.
All of that are our future plans.
There, actually, the main contention point is 'how to represent keys
and values'. whether key is hex representation or we need some
pretty-printers via format string or via schema? etc, etc.
We tried few ideas of representing keys in our fuse implementations,
but don't have an agreement yet.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 19:27           ` Alexei Starovoitov
@ 2015-10-16 19:53             ` Eric W. Biederman
  2015-10-16 20:56               ` Alexei Starovoitov
  2015-10-16 19:54             ` Daniel Borkmann
  1 sibling, 1 reply; 56+ messages in thread
From: Eric W. Biederman @ 2015-10-16 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Hannes Frederic Sowa, davem, viro, tgraf,
	netdev, linux-kernel, Alexei Starovoitov

Alexei Starovoitov <ast@plumgrid.com> writes:

> On 10/16/15 11:41 AM, Eric W. Biederman wrote:
[...]
>> I am missing something.
>>
>> When I suggested using a filesystem it was my thought there would be
>> exactly one superblock per map, and the map would be specified at mount
>> time.  You clearly are not implementing that.
>
> I don't think it's practical to have sb per map, since that would mean
> sb per prog and that won't scale.

What do you mean won't scale?  You want to have a name per map/prog so the
basic complexity appears the same.  Is there some crucial interaction
between the persistent doodads you are placing on a filesystem that I am
missing?

Given the fact you don't normally need any persistence without a program
I am puzzled why "scaling" is an issue of any kind.  This is for a
comparatively rare case if I am not mistaken.

> Also map today is an fd that belongs to a process. I cannot see
> an api from C program to do 'mount of FD' that wouldn't look like
> ugly hack.

mount -t bpffs ... -o fd=1234 

That is not at all convoluted or hacky.  Especially compared to some of the
alternatives I am seeing.

It is no problem at all to wrap something like that in a nice function
call that has the exact same complexity of use as any of the other
options that are being explored to give something that starts out
as a filedescriptor a name.

>> A filesystem per map makes sense as you have a key-value store with one
>> file per key.
>>
>> The idea is that something resembling your bpf_pin_fd function would be
>> the mount system call for the filesystem.
>>
>>> The keys in the map could be read by "ls /mountpoint/".
>> Key values could be inspected with "cat /mountpoint/key".
>
> yes. that is still the goal for follow up patches, but contained
> within given bpffs. Something bpf_pin_fd-like command for bpf syscall
> would create files for keys in a map and allow 'cat' via open/read.
> Such api would be much cleaner from C app point of view.
> Potentially we can allow mount of a file created via BPF_PIN_FD
> that will expand into keys/values.
> All of that are our future plans.
> There, actually, the main contention point is 'how to represent keys
> and values'. whether key is hex representation or we need some
> pretty-printers via format string or via schema? etc, etc.
> We tried few ideas of representing keys in our fuse implementations,
> but don't have an agreement yet.

My gut feel would be to keep it simple and use the same representation
you use in your existing system calls.  Certainly ordinary filenames are
keys of arbitrary binary data that can include everything except
a '\0' byte.  That they are human readable is a nice convention, but not
at all fundamental to what they are.

Eric


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 19:27           ` Alexei Starovoitov
  2015-10-16 19:53             ` Eric W. Biederman
@ 2015-10-16 19:54             ` Daniel Borkmann
  1 sibling, 0 replies; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-16 19:54 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric W. Biederman
  Cc: Hannes Frederic Sowa, davem, viro, tgraf, netdev, linux-kernel,
	Alexei Starovoitov

On 10/16/2015 09:27 PM, Alexei Starovoitov wrote:
> On 10/16/15 11:41 AM, Eric W. Biederman wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>> On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
>>>> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>>>>> Another question:
>>>>> Should multiple mounts of the filesystem result in an empty fs (a new
>>>>> instance) or in one where one can see other ebpf-fs entities? I think
>>>>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>>>>> delimiter. I would have used directories for that and multiple mounts
>>>>> would then have resulted in the same content of the filesystem. IMHO
>>>>> this would remove some ambiguity but then the question arises how this
>>>>> is handled in a namespaced environment. Was there some specific reason
>>>>> to do so?
>>>>
>>>> That's an interesting question!
>>>> I think all mounts should be independent.
>>>> I can see tracing using one and networking using another one
>>>> with different hierarchies suitable for their own use cases.
>>>> What's an advantage to have the same content everywhere?
>>>> Feels harder to manage, since different users would need to
>>>> coordinate.
>>>
>>> I initially had it as a mount_single() file system, where I was thinking
>>> to have an entry under /sys/fs/bpf/, so all subsystems would work on top
>>> of that mount point, but for the same reasons above I lifted that restriction.
>>
>> I am missing something.
>>
>> When I suggested using a filesystem it was my thought there would be
>> exactly one superblock per map, and the map would be specified at mount
>> time.  You clearly are not implementing that.
>
> I don't think it's practical to have sb per map, since that would mean
> sb per prog and that won't scale.
> Also map today is an fd that belongs to a process. I cannot see
> an api from C program to do 'mount of FD' that wouldn't look like
> ugly hack.
>
>> A filesystem per map makes sense as you have a key-value store with one
>> file per key.
>>
>> The idea is that something resembling your bpf_pin_fd function would be
>> the mount system call for the filesystem.
>>
>> The keys in the map could be read by "ls /mountpoint/".
>> Key values could be inspected with "cat /mountpoint/key".
>
> yes. that is still the goal for follow up patches, but contained
> within given bpffs. Something bpf_pin_fd-like command for bpf syscall
> would create files for keys in a map and allow 'cat' via open/read.
> Such api would be much cleaner from C app point of view.
> Potentially we can allow mount of a file created via BPF_PIN_FD
> that will expand into keys/values.

Yeah, sort of making this an optional debugging facility if anything (maybe
to just get a read-only snapshot view). Having maps with a very large number
of entries might end up being problematic on its own, or mapping potential
future map candidates such as rhashtable.

> There, actually, the main contention point is 'how to represent keys
> and values'. whether key is hex representation or we need some
> pretty-printers via format string or via schema? etc, etc.
> We tried few ideas of representing keys in our fuse implementations,
> but don't have an agreement yet.

That is unclear as well to make it useful.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 19:53             ` Eric W. Biederman
@ 2015-10-16 20:56               ` Alexei Starovoitov
  2015-10-16 23:44                 ` Eric W. Biederman
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-16 20:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Borkmann, Hannes Frederic Sowa, davem, viro, tgraf,
	netdev, linux-kernel, Alexei Starovoitov

On 10/16/15 12:53 PM, Eric W. Biederman wrote:
> Alexei Starovoitov <ast@plumgrid.com> writes:
>
>> On 10/16/15 11:41 AM, Eric W. Biederman wrote:
> [...]
>>> I am missing something.
>>>
>>> When I suggested using a filesystem it was my thought there would be
>>> exactly one superblock per map, and the map would be specified at mount
>>> time.  You clearly are not implementing that.
>>
>> I don't think it's practical to have sb per map, since that would mean
>> sb per prog and that won't scale.
>
> What do you mean won't scale?  You want to have a name per map/prog so the
> basic complexity appears the same.  Is there some crucial interaction
> between the persistent dodads you are placing on a filesystem that I am
> missing?
>
> Given the fact you don't normally need any persistence without a program
> I am puzzled why "scaling" is an issue of any kind.  This is for a
> comparatively rare case if I am not mistaken.

Representing a map as a directory tree with files as keys is indeed 'rare',
since it's mainly for debugging and slow accesses, but 'pin_fd' functionality
is now popping up everywhere. Mainly because in things like openstack there
are tons of disjoint libraries written in different languages and the only
common thing is the kernel. So pin_fd/new_fd is a mandatory feature.

>> Also map today is an fd that belongs to a process. I cannot see
>> an api from C program to do 'mount of FD' that wouldn't look like
>> ugly hack.
>
> mount -t bpffs ... -o fd=1234
>
> That is not at all convoluted or hacky.  Especially compared to some of the
> alternatives I am seeing.
>
> It is no problem at all to wrap something like that in a nice function
> call that has the exact same complexity of use as any of the other
> options that are being explored to give something that starts out
> as a filedescriptor a name.

Frankly, I don't think parsing an 'fd=1234' string is a clean api, but
before we argue about fs philosophy of passing options, let's
get on the same page with the requirements.
The first goal that this patch is solving is providing the ability
to 'pin' an FD, so that a map/prog won't disappear when the user app exits.
The second goal, for future patches, is to expose map internals as a
directory structure.
These two goals are independent.
We can argue about the api for the 2nd, whether it's a mount with an
fd=1234 string or something else, but for the first, mount style doesn't
make sense.
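For reference, the mount-option variant being debated would look roughly like
the sketch below from C. To be clear, this is only an illustration of the
proposal in this thread: the "bpffs" type name and the fd= option string are
assumptions, not a merged kernel interface.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Build the hypothetical "fd=<n>" mount option string. */
static void bpf_fd_opt(int fd, char *opts, size_t len)
{
	snprintf(opts, len, "fd=%d", fd);
}

/* Hypothetical wrapper: pin an existing map/prog fd by mounting it,
 * as suggested above. Neither "bpffs" nor the fd= option exist in
 * mainline; this only illustrates the shape of the api. */
static int bpf_mount_fd(const char *mountpoint, int fd)
{
	char opts[32];

	bpf_fd_opt(fd, opts, sizeof(opts));
	return mount("bpffs", mountpoint, "bpffs", 0, opts);
}
```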

>>> A filesystem per map makes sense as you have a key-value store with one
>>> file per key.
>>>
>>> The idea is that something resembling your bpf_pin_fd function would be
>>> the mount system call for the filesystem.
>>>
>>> The the keys in the map could be read by "ls /mountpoint/".
>>> Key values could be inspected with "cat /mountpoint/key".
>>
>> yes. that is still the goal for follow up patches, but contained
>> within given bpffs. Something bpf_pin_fd-like command for bpf syscall
>> would create files for keys in a map and allow 'cat' via open/read.
>> Such api would be much cleaner from C app point of view.
>> Potentially we can allow mount of a file created via BPF_PIN_FD
>> that will expand into keys/values.
>> All of that are our future plans.
>> There, actually, the main contention point is 'how to represent keys
>> and values'. whether key is hex representation or we need some
>> pretty-printers via format string or via schema? etc, etc.
>> We tried few ideas of representing keys in our fuse implementations,
>> but don't have an agreement yet.
>
> My gut feel would be to keep it simple and use the same representation
> you use in your existing system calls.  Certainly ordinary filenames are
> keys of arbitrary binary data that can include everything except
> a '\0' byte.  That they are human readable is a nice convention, but not
> at all fundamental to what they are.

That doesn't work. Map keys are never human readable; they're arbitrary
binary data. That's why representing them as file names is not trivial.
Some pretty-printer is needed.
Again, that is the 2nd goal of bpffs in general. We cannot really solve it
now, because we cannot say 'let's represent keys like X and work
from there', since that will become kernel ABI and we won't be able to
change it.
It's equally not clear that thousands of keys can even work as files.
So there is quite a bit of brainstorming still to do for this 2nd goal.
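For illustration only, the simplest candidate representation is plain hex
encoding of the raw key bytes; a minimal sketch (not something anyone agreed
on in this thread):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Render an arbitrary binary map key as a filename-safe hex string.
 * Purely illustrative; the actual representation is still undecided. */
static void key_to_hex(const unsigned char *key, size_t len,
		       char *out, size_t out_len)
{
	size_t i;

	for (i = 0; i < len && 2 * i + 2 < out_len; i++)
		snprintf(out + 2 * i, 3, "%02x", key[i]);
	out[2 * i] = '\0';
}
```

A 16-byte key then becomes a 32-character name, which also hints at why large
or structured keys quickly get unwieldy as filenames.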



* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 20:56               ` Alexei Starovoitov
@ 2015-10-16 23:44                 ` Eric W. Biederman
  2015-10-17  2:43                   ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Eric W. Biederman @ 2015-10-16 23:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Hannes Frederic Sowa, davem, viro, tgraf,
	netdev, linux-kernel, Alexei Starovoitov

Alexei Starovoitov <ast@plumgrid.com> writes:

> We can argue about api for 2nd, whether it's mount with fd=1234 string
> or else, but for the first mount style doesn't make sense.

Why does mount not make sense?  It is exactly what you are looking for
so why does it not make sense?

Eric


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-16 23:44                 ` Eric W. Biederman
@ 2015-10-17  2:43                   ` Alexei Starovoitov
  2015-10-17 12:28                     ` Daniel Borkmann
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-17  2:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Borkmann, Hannes Frederic Sowa, davem, viro, tgraf,
	netdev, linux-kernel, Alexei Starovoitov

On 10/16/15 4:44 PM, Eric W. Biederman wrote:
> Alexei Starovoitov <ast@plumgrid.com> writes:
>
>> We can argue about api for 2nd, whether it's mount with fd=1234 string
>> or else, but for the first mount style doesn't make sense.
>
> Why does mount not make sense?  It is exactly what you are looking for
> so why does it not make sense?

hmm, how do you get a new fd back after mounting it?
Note, open cannot be overloaded, so we end up with BPF_NEW_FD anyway,
but now it's more convoluted and empty mounts are everywhere.



* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-17  2:43                   ` Alexei Starovoitov
@ 2015-10-17 12:28                     ` Daniel Borkmann
  2015-10-18  2:20                       ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-17 12:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric W. Biederman
  Cc: Hannes Frederic Sowa, davem, viro, tgraf, netdev, linux-kernel,
	Alexei Starovoitov

On 10/17/2015 04:43 AM, Alexei Starovoitov wrote:
> On 10/16/15 4:44 PM, Eric W. Biederman wrote:
>> Alexei Starovoitov <ast@plumgrid.com> writes:
>>
>>> We can argue about api for 2nd, whether it's mount with fd=1234 string
>>> or else, but for the first mount style doesn't make sense.
>>
>> Why does mount not make sense?  It is exactly what you are looking for
>> so why does it not make sense?
>
> hmm, how do you get a new fd back after mounting it?
> Note, open cannot be overloaded, so we end up with BPF_NEW_FD anyway,
> but now it's more convoluted and empty mounts are everywhere.

That would be my understanding as well but, as Alexei already said,
these are two different issues; it would be step 2 (let me get back
to that further below). In any case, I don't really like dumping
key/value somewhere as files. You have binary blobs as both, and
if, say, your application has a lookup key (for whatever reason) of
several cachelines, it all ends up rather messy instead of being
really useful for non-bpf(2)-aware cmdline tools.

Anyway, another idea I've been brainstorming with Hannes today a
bit is about the following:

We register two major numbers, one for eBPF maps (X), one for eBPF
progs (Y). A user can either via cmdline call something like ...
mknod /dev/bpf/maps/map_pkts c X Z to create a special character
device, or alternatively out of an application through mknod(2)
syscall (f.e. tc when setting up maps/progs internally from the obj
file for a classifier).

Then, we still have 2 eBPF commands for bpf(2) syscall to add, say
(for example) BPF_BIND_DEV and BPF_FETCH_DEV. The application that
created a map (or prog) already has the map fd and after mknod(2) it
can open(2) the special file to get the special file fd. Then it can
call something like bpf(BPF_BIND_DEV, &attr, sizeof(attr)) where
attr looks like:

   union bpf_attr attr = {
     .bpf_fd    = bpf_fd,
     .dev_fd    = dev_fd,
   };

The bpf(2) syscall can check whether dev_fd belongs to an eBPF special
file and it can then copy over file->private_data from the bpf_fd
to the dev_fd's underlying file, where the private_data, as we know,
from the bpf_fd already points to a proper bpf_map/bpf_prog structure.
The map/prog would then get ref'ed and lives onwards in the char device's
lifetime. No special hashtable, gc, etc needed. The char device has fops
that we can define ourselves, and unlinking would drop the ref from
its private_data.

Now to the other part: BPF_FETCH_DEV would work similarly. The application
opens the device, and fills bpf_attr as follows again:

   union bpf_attr attr = {
     .bpf_fd    = 0,
     .dev_fd    = dev_fd,
   };

This would allow us to look up the map/prog from the dev_fd's file->
private_data, and install a new fd via bpf_{map,prog}_new_fd() that
is returned from bpf(2) for bpf-related access. The remaining fops
from the char device could still be reserved for possibilities like
debugging in future.
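Filling the attr from userspace would then be as simple as the sketch below.
To be clear, union bpf_dev_attr, make_bind_attr and the BPF_BIND_DEV command
are hypothetical names from this proposal, not a merged uapi:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the proposed anonymous attr struct: both commands
 * (BPF_BIND_DEV / BPF_FETCH_DEV) only need the two fds; the union
 * mirrors bpf_attr's 8-byte alignment. */
union bpf_dev_attr {
	struct {
		uint32_t bpf_fd;	/* existing map/prog fd (0 for FETCH) */
		uint32_t dev_fd;	/* open fd of the special file */
	};
} __attribute__((aligned(8)));

/* The actual call would then be
 * syscall(__NR_bpf, BPF_BIND_DEV, &attr, sizeof(attr));
 * with a BPF_BIND_DEV command value that is purely hypothetical here. */
static union bpf_dev_attr make_bind_attr(int bpf_fd, int dev_fd)
{
	union bpf_dev_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.bpf_fd = bpf_fd;
	attr.dev_fd = dev_fd;
	return attr;
}
```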

A future (2nd) step could either be to use Eric's idea and then do
something like mount -t bpffs ... -o /dev/bpf/maps/map_pkts to dump
attributes or other properties to some location for inspection from such
a special file, or we could use kobjects for that attached to the device
if the fops from the cdev should not be sufficient.

So closing the loop to the special files where there were concerns:

This won't forbid to have a future shell-style access possibility, and
it would also not end up as a nightmare on what you mentioned with the
S_ISSOCK-like bit in the other email.

The pinning mechanism would not require an extra file system to be mounted
somewhere, and yet the user can define himself an arbitrary hierarchy
where he puts the special files as this facility already exists. An
approach like this looks overall cleaner to me, and would most likely be
realizable in fewer lines of code as well.

Thoughts?

Cheers,
Daniel


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-17 12:28                     ` Daniel Borkmann
@ 2015-10-18  2:20                       ` Alexei Starovoitov
  2015-10-18 15:03                         ` Daniel Borkmann
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-18  2:20 UTC (permalink / raw)
  To: Daniel Borkmann, Eric W. Biederman
  Cc: Hannes Frederic Sowa, davem, viro, tgraf, netdev, linux-kernel,
	Alexei Starovoitov

On 10/17/15 5:28 AM, Daniel Borkmann wrote:
>
> Anyway, another idea I've been brainstorming with Hannes today a
> bit is about the following:
>
> We register two major numbers, one for eBPF maps (X), one for eBPF
> progs (Y). A user can either via cmdline call something like ...
> mknod /dev/bpf/maps/map_pkts c X Z to create a special character
> device, or alternatively out of an application through mknod(2)
> syscall (f.e. tc when setting up maps/progs internally from the obj
> file for a classifier).
>
> Then, we still have 2 eBPF commands for bpf(2) syscall to add, say
> (for example) BPF_BIND_DEV and BPF_FETCH_DEV. The application that
> created a map (or prog) already has the map fd and after mknod(2) it
> can open(2) the special file to get the special file fd. Then it can
> call something like bpf(BPF_BIND_DEV, &attr, sizeof(attr)) where
> attr looks like:
>
>    union bpf_attr attr = {
>      .bpf_fd    = bpf_fd,
>      .dev_fd    = dev_fd,
>    };
>
> The bpf(2) syscall can check whether dev_fd belongs to an eBPF special
> file and it can then copy over file->private_data from the bpf_fd
> to the dev_fd's underlying file, where the private_data, as we know,
> from the bpf_fd already points to a proper bpf_map/bpf_prog structure.
> The map/prog would then get ref'ed and lives onwards in the char device's
> lifetime. No special hashtable, gc, etc needed. The char device has fops
> that we can define ourselves, and unlinking would drop the ref from
> its private_data.
>
> Now to the other part: BPF_FETCH_DEV would work similarly. The application
> opens the device, and fills bpf_attr as follows again:
>
>    union bpf_attr attr = {
>      .bpf_fd    = 0,
>      .dev_fd    = dev_fd,
>    };
>
> This would allow us to look up the map/prog from the dev_fd's file->
> private_data, and install a new fd via bpf_{map,prog}_new_fd() that
> is returned from bpf(2) for bpf-related access. The remaining fops
> from the char device could still be reserved for possibilities like
> debugging in future.
>
> A future (2nd) step could either be to use Eric's idea and then do
> something like mount -t bpffs ... -o /dev/bpf/maps/map_pkts to dump
> attributes or other properties to some location for inspection from such
> a special file, or we could use kobjects for that attached to the device
> if the fops from the cdev should not be sufficient.
>
> So closing the loop to the special files where there were concerns:
>
> This won't forbid to have a future shell-style access possibility, and
> it would also not end up as a nightmare on what you mentioned with the
> S_ISSOCK-like bit in the other email.
>
> The pinning mechanism would not require an extra file system to be mounted
> somewhere, and yet the user can define himself an arbitrary hierarchy
> where he puts the special files as this facility already exists. An
> approach like this looks overall cleaner to me, and would most likely be
> realizable in fewer lines of code as well.
>
> Thoughts?

that indeed sounds cleaner, less lines of code, no fs, etc, but
I don't see how it will work yet.
For a chardev with our own ops we can be triggered on open and close
of that chardev, so will the replaced private_data be cleared when the
user process does close(dev_fd)? There is no fop for unlink either;
that's an fs-only property?



* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-18  2:20                       ` Alexei Starovoitov
@ 2015-10-18 15:03                         ` Daniel Borkmann
  2015-10-18 16:49                           ` Daniel Borkmann
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-18 15:03 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric W. Biederman
  Cc: Hannes Frederic Sowa, davem, viro, tgraf, netdev, linux-kernel,
	Alexei Starovoitov

On 10/18/2015 04:20 AM, Alexei Starovoitov wrote:
...
> that indeed sounds cleaner, less lines of code, no fs, etc, but
> I don't see how it will work yet.

I'll have some code ready very soon to show the concept. Will post it here
tonight, stay tuned. ;)

Cheers,
Daniel


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-18 15:03                         ` Daniel Borkmann
@ 2015-10-18 16:49                           ` Daniel Borkmann
  2015-10-18 20:59                             ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-18 16:49 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric W. Biederman
  Cc: Hannes Frederic Sowa, davem, viro, tgraf, netdev, linux-kernel,
	Alexei Starovoitov

On 10/18/2015 05:03 PM, Daniel Borkmann wrote:
> On 10/18/2015 04:20 AM, Alexei Starovoitov wrote:
> ...
>> that indeed sounds cleaner, less lines of code, no fs, etc, but
>> I don't see how it will work yet.
>
> I'll have some code ready very soon to show the concept. Will post it here
> tonight, stay tuned. ;)

Okay, I have pushed some rough working proof of concept here:

   https://git.breakpoint.cc/cgit/dborkman/net-next.git/log/?h=ebpf-fds-final5

So the idea eventually had to be slightly modified after giving this further
thought, and is the following:

We have 3 commands (BPF_DEV_CREATE, BPF_DEV_DESTROY, BPF_DEV_CONNECT), and
related to that a bpf_attr extension with only a single __u32 fd member in it.

Now, when we have an existing map/prog fd, we can do bpf_dev_create(fd) from
the application, and the kernel will automatically create a device, assigning
major/minor, etc.

You'll automatically have a sysfs entry under a new "bpf" class, for example:

   # ls -la /sys/class/bpf/
   lrwxrwxrwx.  1 root root 0 Oct 18 18:24 bpf_map0 -> ../../devices/virtual/bpf/bpf_map0
   lrwxrwxrwx.  1 root root 0 Oct 18 18:24 bpf_prog0 -> ../../devices/virtual/bpf/bpf_prog0

   # cat /sys/class/bpf/bpf_map0/dev
   249:0
   # cat /sys/class/bpf/bpf_prog0/dev
   248:0

And they also appear automatically under:

   # ls -la /dev/bpf/
   crw-------.  1 root root 249, 0 Oct 18 17:38 bpf_map0
   crw-------.  1 root root 248, 0 Oct 18 18:23 bpf_prog0

This means you can create your own hierarchy somewhere and then symlink to it,
or add further mknod's, f.e.:

   # mknod ./foomap c 249 0
   # ./samples/bpf/devicex map-connect ./foomap
   dev, fd:3 (Success)
   map, fd:4 (Success)
   map, fd:4 read pair:(123,0) (Success)

The nice thing about it is that you can create/unlink as many as you want, but
when you remove the real device from an application via bpf_dev_destroy(fd),
then all links disappear with it. Just like in the case of a normal device driver.

On device creation, the kernel will return the minor number via bpf(2), so you
can access the file easily, f.e. /dev/bpf/bpf_map<minor> resp. /dev/bpf/bpf_prog<minor>,
and then move on with mknod(2) or symlink(2) from there if wished.
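As an aside, the 249:0 / 248:0 pairs above are just the kernel-internal dev_t
encoding; restated in userspace (values from include/linux/kdev_t.h, shown
purely for illustration):

```c
#include <assert.h>

/* Userspace restatement of the kernel's dev_t macros: 12 major bits,
 * 20 minor bits, hence BPF_MAX_DEVS = 1UL << MINORBITS in the code. */
#define MINORBITS	20
#define MINORMASK	((1U << MINORBITS) - 1)
#define MAJOR(dev)	((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev)	((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi)	(((ma) << MINORBITS) | (mi))
```

So /dev/bpf/bpf_map0 with 249:0 is simply MKDEV(249, 0), and the minor
returned by bpf(2) is the per-class idr index.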

Last but not least, we can open the device driver, and then bpf_dev_connect(fd)
will return a new fd you can operate with the bpf(2) syscall to access maps and
stuff. Same here: the remaining device driver ops we can still use for
something useful in future.

The example code (top commit) does, to show the concept:

** Map:

  * Create map and place map into special device:
   # ./samples/bpf/devicex map-create /tmp/map-test
   map, fd:3 (Success)
   map, dev minor:2 (Success)
   map, /dev/bpf/bpf_map2 linked to /tmp/map-test (Success)
   map, fd:3 wrote pair:(123,456) (Success)
   map, fd:3 read pair:(123,456) (Success)

  * Retrieve map from special device:
   # ./samples/bpf/devicex map-connect /tmp/map-test
   dev, fd:3 (Success)
   map, fd:4 (Success)
   map, fd:4 read pair:(123,456) (Success)

  * Destroy special device (map is still locally available):
   # ./samples/bpf/devicex map-destroy /tmp/map-test2
   dev, fd:3 (Success)
   map, fd:4 (Success)
   map, dev destroyed:2 (Success)
   map, fd:4 read pair:(123,456) (Success)

** Prog:

  * Create prog and place prog into special device:
   # ./samples/bpf/devicex prog-create /tmp/prog-test
   prog, fd:3 (Success)
   prog, dev minor:0 (Success)
   prog, /dev/bpf/bpf_prog0 linked to /tmp/prog-test (Success)
   sock, fd:4 (Success)
   sock, prog attached:0 (Success)

  * Retrieve prog from special device, attach to sock:
   # ./samples/bpf/devicex prog-connect /tmp/prog-test
   dev, fd:3 (Success)
   prog, fd:4 (Success)
   sock, fd:3 (Success)
   sock, prog attached:0 (Success)

  * Destroy special device (prog is still locally available):
   # ./samples/bpf/devicex prog-destroy /tmp/prog-test
   dev, fd:3 (Success)
   prog, fd:4 (Success)
   prog, dev destroyed:0 (Success)

The actual code needed (2nd commit from above link), would be roughly along the
lines of what is shown below ... the code is overall a bit smaller than the fs.

This model seems much cleaner and more flexible to me than the file system. So,
I could polish this stuff up a bit further and do more tests/reviews on Monday
for a real submission. Does that sound like a plan?

Thanks,
Daniel

Code:

  include/linux/bpf.h      |  20 +++
  include/uapi/linux/bpf.h |  45 +-----
  kernel/bpf/Makefile      |   4 +-
  kernel/bpf/core.c        |   2 +-
  kernel/bpf/device.c      | 407 +++++++++++++++++++++++++++++++++++++++++++++++
  kernel/bpf/syscall.c     |  52 ++++--
  6 files changed, 482 insertions(+), 48 deletions(-)
  create mode 100644 kernel/bpf/device.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0ae6f77..52d57ed 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -8,8 +8,12 @@
  #define _LINUX_BPF_H 1

  #include <uapi/linux/bpf.h>
+
  #include <linux/workqueue.h>
  #include <linux/file.h>
+#include <linux/cdev.h>
+
+#define BPF_F_HAS_DEV	(1 << 0)

  struct bpf_map;

@@ -37,7 +41,11 @@ struct bpf_map {
  	u32 value_size;
  	u32 max_entries;
  	u32 pages;
+	int minor;
+	unsigned long flags;
+	struct mutex m_lock;
  	struct user_struct *user;
+	struct cdev cdev;
  	const struct bpf_map_ops *ops;
  	struct work_struct work;
  };
@@ -127,10 +135,14 @@ struct bpf_prog_type_list {
  struct bpf_prog_aux {
  	atomic_t refcnt;
  	u32 used_map_cnt;
+	int minor;
+	unsigned long flags;
+	struct mutex p_lock;
  	const struct bpf_verifier_ops *ops;
  	struct bpf_map **used_maps;
  	struct bpf_prog *prog;
  	struct user_struct *user;
+	struct cdev cdev;
  	union {
  		struct work_struct work;
  		struct rcu_head	rcu;
@@ -167,11 +179,19 @@ struct bpf_prog *bpf_prog_get(u32 ufd);
  void bpf_prog_put(struct bpf_prog *prog);
  void bpf_prog_put_rcu(struct bpf_prog *prog);

+struct bpf_map *bpf_map_get(u32 ufd);
  struct bpf_map *__bpf_map_get(struct fd f);
  void bpf_map_put(struct bpf_map *map);

  extern int sysctl_unprivileged_bpf_disabled;

+int __bpf_dev_create(__u32 ufd);
+int __bpf_dev_destroy(__u32 ufd);
+int __bpf_dev_connect(__u32 ufd);
+
+int bpf_map_new_fd(struct bpf_map *map);
+int bpf_prog_new_fd(struct bpf_prog *prog);
+
  /* verify correctness of eBPF program */
  int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
  #else
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 564f1f0..55e5aad 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,50 +63,17 @@ struct bpf_insn {
  	__s32	imm;		/* signed immediate constant */
  };

-/* BPF syscall commands */
+/* BPF syscall commands, see bpf(2) man-page for details. */
  enum bpf_cmd {
-	/* create a map with given type and attributes
-	 * fd = bpf(BPF_MAP_CREATE, union bpf_attr *, u32 size)
-	 * returns fd or negative error
-	 * map is deleted when fd is closed
-	 */
  	BPF_MAP_CREATE,
-
-	/* lookup key in a given map
-	 * err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value
-	 * returns zero and stores found elem into value
-	 * or negative error
-	 */
  	BPF_MAP_LOOKUP_ELEM,
-
-	/* create or update key/value pair in a given map
-	 * err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value, attr->flags
-	 * returns zero or negative error
-	 */
  	BPF_MAP_UPDATE_ELEM,
-
-	/* find and delete elem by key in a given map
-	 * err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key
-	 * returns zero or negative error
-	 */
  	BPF_MAP_DELETE_ELEM,
-
-	/* lookup key in a given map and return next key
-	 * err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->next_key
-	 * returns zero and stores next key or negative error
-	 */
  	BPF_MAP_GET_NEXT_KEY,
-
-	/* verify and load eBPF program
-	 * prog_fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size)
-	 * Using attr->prog_type, attr->insns, attr->license
-	 * returns fd or negative error
-	 */
  	BPF_PROG_LOAD,
+	BPF_DEV_CREATE,
+	BPF_DEV_DESTROY,
+	BPF_DEV_CONNECT,
  };

  enum bpf_map_type {
@@ -160,6 +127,10 @@ union bpf_attr {
  		__aligned_u64	log_buf;	/* user supplied buffer */
  		__u32		kern_version;	/* checked when prog_type=kprobe */
  	};
+
+	struct { /* anonymous struct used by BPF_DEV_* commands */
+		__u32		fd;
+	};
  } __attribute__((aligned(8)));

  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e6983be..f871ca6 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,2 +1,4 @@
  obj-y := core.o
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o arraymap.o helpers.o
+
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o device.o helpers.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8086471..260058d 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -92,6 +92,7 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)

  	fp->pages = size / PAGE_SIZE;
  	fp->aux = aux;
+	aux->prog = fp;

  	return fp;
  }
@@ -726,7 +727,6 @@ void bpf_prog_free(struct bpf_prog *fp)
  	struct bpf_prog_aux *aux = fp->aux;

  	INIT_WORK(&aux->work, bpf_prog_free_deferred);
-	aux->prog = fp;
  	schedule_work(&aux->work);
  }
  EXPORT_SYMBOL_GPL(bpf_prog_free);
diff --git a/kernel/bpf/device.c b/kernel/bpf/device.c
new file mode 100644
index 0000000..e99fc82
--- /dev/null
+++ b/kernel/bpf/device.c
@@ -0,0 +1,407 @@
+/*
+ * Special file backend for persistent eBPF maps and programs, used by
+ * bpf() system call.
+ *
+ * (C) 2015 Daniel Borkmann <daniel@iogearbox.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+#include <linux/idr.h>
+#include <linux/mutex.h>
+#include <linux/cdev.h>
+
+#define BPF_MAX_DEVS	(1UL << MINORBITS)
+
+enum bpf_type {
+	BPF_TYPE_PROG,
+	BPF_TYPE_MAP,
+};
+
+static struct class *bpf_class;
+
+static dev_t bpf_map_devt;
+static DEFINE_IDR(bpf_map_idr);
+static DEFINE_MUTEX(bpf_map_idr_lock);
+
+static dev_t bpf_prog_devt;
+static DEFINE_IDR(bpf_prog_idr);
+static DEFINE_MUTEX(bpf_prog_idr_lock);
+
+static int bpf_map_get_minor(struct bpf_map *map)
+{
+	int minor;
+
+	mutex_lock(&bpf_map_idr_lock);
+	minor = idr_alloc(&bpf_map_idr, map, 0, BPF_MAX_DEVS, GFP_KERNEL);
+	mutex_unlock(&bpf_map_idr_lock);
+
+	return minor;
+}
+
+static void bpf_map_put_minor(const struct bpf_map *map)
+{
+	mutex_lock(&bpf_map_idr_lock);
+	idr_remove(&bpf_map_idr, map->minor);
+	mutex_unlock(&bpf_map_idr_lock);
+}
+
+static int bpf_prog_get_minor(struct bpf_prog *prog)
+{
+	int minor;
+
+	mutex_lock(&bpf_prog_idr_lock);
+	minor = idr_alloc(&bpf_prog_idr, prog, 0, BPF_MAX_DEVS, GFP_KERNEL);
+	mutex_unlock(&bpf_prog_idr_lock);
+
+	return minor;
+}
+
+static void bpf_prog_put_minor(const struct bpf_prog *prog)
+{
+	mutex_lock(&bpf_prog_idr_lock);
+	idr_remove(&bpf_prog_idr, prog->aux->minor);
+	mutex_unlock(&bpf_prog_idr_lock);
+}
+
+static int bpf_map_open(struct inode *inode, struct file *filep)
+{
+	filep->private_data = container_of(inode->i_cdev,
+					   struct bpf_map, cdev);
+	return 0;
+}
+
+static const struct file_operations bpf_dev_map_fops = {
+	.owner		= THIS_MODULE,
+	.open		= bpf_map_open,
+	.llseek		= noop_llseek,
+};
+
+static int bpf_prog_open(struct inode *inode, struct file *filep)
+{
+	filep->private_data = container_of(inode->i_cdev,
+					   struct bpf_prog_aux, cdev)->prog;
+	return 0;
+}
+
+static const struct file_operations bpf_dev_prog_fops = {
+	.owner		= THIS_MODULE,
+	.open		= bpf_prog_open,
+	.llseek		= noop_llseek,
+};
+
+static char *bpf_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "bpf/%s", dev_name(dev));
+}
+
+static int bpf_map_make_dev(struct bpf_map *map)
+{
+	struct device *dev;
+	dev_t devt;
+	int ret;
+
+	mutex_lock(&map->m_lock);
+	if (map->flags & BPF_F_HAS_DEV) {
+		ret = map->minor;
+		goto out;
+	}
+
+	cdev_init(&map->cdev, &bpf_dev_map_fops);
+	map->cdev.owner = map->cdev.ops->owner;
+	map->minor = bpf_map_get_minor(map);
+
+	devt = MKDEV(MAJOR(bpf_map_devt), map->minor);
+	ret = cdev_add(&map->cdev, devt, 1);
+	if (ret)
+		goto unwind;
+
+	dev = device_create(bpf_class, NULL, devt, NULL, "bpf_map%d",
+			    map->minor);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto unwind_cdev;
+	}
+
+	map->flags |= BPF_F_HAS_DEV;
+	ret = map->minor;
+out:
+	mutex_unlock(&map->m_lock);
+	return ret;
+unwind_cdev:
+	cdev_del(&map->cdev);
+unwind:
+	bpf_map_put_minor(map);
+	goto out;
+}
+
+static int bpf_map_destroy_dev(struct bpf_map *map)
+{
+	bool drop_ref = false;
+	dev_t devt;
+	int ret;
+
+	mutex_lock(&map->m_lock);
+	if (!(map->flags & BPF_F_HAS_DEV)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	devt = MKDEV(MAJOR(bpf_map_devt), map->minor);
+	ret = map->minor;
+
+	cdev_del(&map->cdev);
+	device_destroy(bpf_class, devt);
+	bpf_map_put_minor(map);
+
+	map->flags &= ~BPF_F_HAS_DEV;
+	drop_ref = true;
+out:
+	mutex_unlock(&map->m_lock);
+
+	if (drop_ref)
+		bpf_map_put(map);
+	return ret;
+}
+
+static int bpf_prog_make_dev(struct bpf_prog *prog)
+{
+	struct bpf_prog_aux *aux = prog->aux;
+	struct device *dev;
+	dev_t devt;
+	int ret;
+
+	mutex_lock(&aux->p_lock);
+	if (aux->flags & BPF_F_HAS_DEV) {
+		ret = aux->minor;
+		goto out;
+	}
+
+	cdev_init(&aux->cdev, &bpf_dev_prog_fops);
+	aux->cdev.owner = aux->cdev.ops->owner;
+	aux->minor = bpf_prog_get_minor(prog);
+
+	devt = MKDEV(MAJOR(bpf_prog_devt), aux->minor);
+	ret = cdev_add(&aux->cdev, devt, 1);
+	if (ret)
+		goto unwind;
+
+	dev = device_create(bpf_class, NULL, devt, NULL, "bpf_prog%d",
+			    aux->minor);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto unwind_cdev;
+	}
+
+	aux->flags |= BPF_F_HAS_DEV;
+	ret = aux->minor;
+out:
+	mutex_unlock(&aux->p_lock);
+	return ret;
+unwind_cdev:
+	cdev_del(&aux->cdev);
+unwind:
+	bpf_prog_put_minor(prog);
+	goto out;
+}
+
+static int bpf_prog_destroy_dev(struct bpf_prog *prog)
+{
+	struct bpf_prog_aux *aux = prog->aux;
+	bool drop_ref = false;
+	dev_t devt;
+	int ret;
+
+	mutex_lock(&aux->p_lock);
+	if (!(aux->flags & BPF_F_HAS_DEV)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	devt = MKDEV(MAJOR(bpf_prog_devt), aux->minor);
+	ret = aux->minor;
+
+	cdev_del(&aux->cdev);
+	device_destroy(bpf_class, devt);
+	bpf_prog_put_minor(prog);
+
+	aux->flags &= ~BPF_F_HAS_DEV;
+	drop_ref = true;
+out:
+	mutex_unlock(&aux->p_lock);
+
+	if (drop_ref)
+		bpf_prog_put(prog);
+	return ret;
+}
+
+static void bpf_any_get(void *raw, enum bpf_type type)
+{
+	switch (type) {
+	case BPF_TYPE_PROG:
+		atomic_inc(&((struct bpf_prog *)raw)->aux->refcnt);
+		break;
+	case BPF_TYPE_MAP:
+		atomic_inc(&((struct bpf_map *)raw)->refcnt);
+		break;
+	}
+}
+
+void bpf_any_put(void *raw, enum bpf_type type)
+{
+	switch (type) {
+	case BPF_TYPE_PROG:
+		bpf_prog_put(raw);
+		break;
+	case BPF_TYPE_MAP:
+		bpf_map_put(raw);
+		break;
+	}
+}
+
+static void *__bpf_dev_get(struct fd f, enum bpf_type *type)
+{
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+	if (f.file->f_op != &bpf_dev_map_fops &&
+	    f.file->f_op != &bpf_dev_prog_fops) {
+		fdput(f);
+		return ERR_PTR(-EINVAL);
+	}
+
+	*type = f.file->f_op == &bpf_dev_map_fops ?
+		BPF_TYPE_MAP : BPF_TYPE_PROG;
+	return f.file->private_data;
+}
+
+static void *bpf_dev_get(u32 ufd, enum bpf_type *type)
+{
+	struct fd f = fdget(ufd);
+	void *raw;
+
+	raw = __bpf_dev_get(f, type);
+	if (IS_ERR(raw))
+		return raw;
+
+	bpf_any_get(raw, *type);
+	fdput(f);
+
+	return raw;
+}
+
+int __bpf_dev_create(__u32 ufd)
+{
+	enum bpf_type type;
+	void *raw;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	type = BPF_TYPE_MAP;
+	raw = bpf_map_get(ufd);
+	if (IS_ERR(raw)) {
+		type = BPF_TYPE_PROG;
+		raw = bpf_prog_get(ufd);
+		if (IS_ERR(raw))
+			return PTR_ERR(raw);
+	}
+
+	switch (type) {
+	case BPF_TYPE_MAP:
+		ret = bpf_map_make_dev(raw);
+		break;
+	case BPF_TYPE_PROG:
+		ret = bpf_prog_make_dev(raw);
+		break;
+	}
+
+	if (ret < 0)
+		bpf_any_put(raw, type);
+
+	return ret;
+}
+
+int __bpf_dev_destroy(__u32 ufd)
+{
+	enum bpf_type type;
+	void *raw;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	type = BPF_TYPE_MAP;
+	raw = bpf_map_get(ufd);
+	if (IS_ERR(raw)) {
+		type = BPF_TYPE_PROG;
+		raw = bpf_prog_get(ufd);
+		if (IS_ERR(raw))
+			return PTR_ERR(raw);
+	}
+
+	switch (type) {
+	case BPF_TYPE_MAP:
+		ret = bpf_map_destroy_dev(raw);
+		break;
+	case BPF_TYPE_PROG:
+		ret = bpf_prog_destroy_dev(raw);
+		break;
+	}
+
+	bpf_any_put(raw, type);
+	return ret;
+}
+
+int __bpf_dev_connect(__u32 ufd)
+{
+	enum bpf_type type;
+	void *raw;
+	int ret;
+
+	raw = bpf_dev_get(ufd, &type);
+	if (IS_ERR(raw))
+		return PTR_ERR(raw);
+
+	switch (type) {
+	case BPF_TYPE_MAP:
+		ret = bpf_map_new_fd(raw);
+		break;
+	case BPF_TYPE_PROG:
+		ret = bpf_prog_new_fd(raw);
+		break;
+	}
+	if (ret < 0)
+		bpf_any_put(raw, type);
+
+	return ret;
+}
+
+static int __init bpf_dev_init(void)
+{
+	int ret;
+
+	ret = alloc_chrdev_region(&bpf_map_devt, 0, BPF_MAX_DEVS,
+				  "bpf_map");
+	if (ret)
+		return ret;
+
+	ret = alloc_chrdev_region(&bpf_prog_devt, 0, BPF_MAX_DEVS,
+				  "bpf_prog");
+	if (ret) {
+		unregister_chrdev_region(bpf_map_devt, BPF_MAX_DEVS);
+		return ret;
+	}
+
+	bpf_class = class_create(THIS_MODULE, "bpf");
+	if (IS_ERR(bpf_class)) {
+		unregister_chrdev_region(bpf_prog_devt, BPF_MAX_DEVS);
+		unregister_chrdev_region(bpf_map_devt, BPF_MAX_DEVS);
+		return PTR_ERR(bpf_class);
+	}
+	bpf_class->devnode = bpf_devnode;
+
+	return 0;
+}
+late_initcall(bpf_dev_init);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c629fe6..458b2f9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -14,6 +14,7 @@
  #include <linux/slab.h>
  #include <linux/anon_inodes.h>
  #include <linux/file.h>
+#include <linux/mutex.h>
  #include <linux/license.h>
  #include <linux/filter.h>
  #include <linux/version.h>
@@ -111,7 +112,7 @@ static const struct file_operations bpf_map_fops = {
  	.release = bpf_map_release,
  };

-static int bpf_map_new_fd(struct bpf_map *map)
+int bpf_map_new_fd(struct bpf_map *map)
  {
  	return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
  				O_RDWR | O_CLOEXEC);
@@ -141,6 +142,7 @@ static int map_create(union bpf_attr *attr)
  	if (IS_ERR(map))
  		return PTR_ERR(map);

+	mutex_init(&map->m_lock);
  	atomic_set(&map->refcnt, 1);

  	err = bpf_map_charge_memlock(map);
@@ -174,7 +176,7 @@ struct bpf_map *__bpf_map_get(struct fd f)
  	return f.file->private_data;
  }

-static struct bpf_map *bpf_map_get(u32 ufd)
+struct bpf_map *bpf_map_get(u32 ufd)
  {
  	struct fd f = fdget(ufd);
  	struct bpf_map *map;
@@ -525,18 +527,14 @@ static void __prog_put_common(struct rcu_head *rcu)
  /* version of bpf_prog_put() that is called after a grace period */
  void bpf_prog_put_rcu(struct bpf_prog *prog)
  {
-	if (atomic_dec_and_test(&prog->aux->refcnt)) {
-		prog->aux->prog = prog;
+	if (atomic_dec_and_test(&prog->aux->refcnt))
  		call_rcu(&prog->aux->rcu, __prog_put_common);
-	}
  }

  void bpf_prog_put(struct bpf_prog *prog)
  {
-	if (atomic_dec_and_test(&prog->aux->refcnt)) {
-		prog->aux->prog = prog;
+	if (atomic_dec_and_test(&prog->aux->refcnt))
  		__prog_put_common(&prog->aux->rcu);
-	}
  }
  EXPORT_SYMBOL_GPL(bpf_prog_put);

@@ -552,7 +550,7 @@ static const struct file_operations bpf_prog_fops = {
          .release = bpf_prog_release,
  };

-static int bpf_prog_new_fd(struct bpf_prog *prog)
+int bpf_prog_new_fd(struct bpf_prog *prog)
  {
  	return anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog,
  				O_RDWR | O_CLOEXEC);
@@ -641,6 +639,7 @@ static int bpf_prog_load(union bpf_attr *attr)
  	prog->orig_prog = NULL;
  	prog->jited = 0;

+	mutex_init(&prog->aux->p_lock);
  	atomic_set(&prog->aux->refcnt, 1);
  	prog->gpl_compatible = is_gpl ? 1 : 0;

@@ -678,6 +677,32 @@ free_prog_nouncharge:
  	return err;
  }

+#define BPF_DEV_LAST_FIELD fd
+
+static int bpf_dev_create(const union bpf_attr *attr)
+{
+	if (CHECK_ATTR(BPF_DEV))
+		return -EINVAL;
+
+	return __bpf_dev_create(attr->fd);
+}
+
+static int bpf_dev_destroy(const union bpf_attr *attr)
+{
+	if (CHECK_ATTR(BPF_DEV))
+		return -EINVAL;
+
+	return __bpf_dev_destroy(attr->fd);
+}
+
+static int bpf_dev_connect(const union bpf_attr *attr)
+{
+	if (CHECK_ATTR(BPF_DEV))
+		return -EINVAL;
+
+	return __bpf_dev_connect(attr->fd);
+}
+
  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
  {
  	union bpf_attr attr = {};
@@ -738,6 +763,15 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
  	case BPF_PROG_LOAD:
  		err = bpf_prog_load(&attr);
  		break;
+	case BPF_DEV_CREATE:
+		err = bpf_dev_create(&attr);
+		break;
+	case BPF_DEV_DESTROY:
+		err = bpf_dev_destroy(&attr);
+		break;
+	case BPF_DEV_CONNECT:
+		err = bpf_dev_connect(&attr);
+		break;
  	default:
  		err = -EINVAL;
  		break;
-- 
cgit


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-18 16:49                           ` Daniel Borkmann
@ 2015-10-18 20:59                             ` Alexei Starovoitov
  2015-10-19  7:36                               ` Hannes Frederic Sowa
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-18 20:59 UTC (permalink / raw)
  To: Daniel Borkmann, Eric W. Biederman
  Cc: Hannes Frederic Sowa, davem, viro, tgraf, netdev, linux-kernel,
	Alexei Starovoitov

On 10/18/15 9:49 AM, Daniel Borkmann wrote:
> Okay, I have pushed some rough working proof of concept here:
>
> https://git.breakpoint.cc/cgit/dborkman/net-next.git/log/?h=ebpf-fds-final5
>
> So the idea eventually had to be slightly modified after giving this
> further
> thought, and is the following:
>
> We have 3 commands (BPF_DEV_CREATE, BPF_DEV_DESTROY, BPF_DEV_CONNECT), and
> related to that a bpf_attr extension with only a single __u32 fd member
> in it.
...
> The nice thing about it is that you can create/unlink as many as you
> want, but
> when you remove the real device from an application via
> bpf_dev_destroy(fd),
> then all links disappear with it. Just like in the case of a normal
> device driver.

Interesting idea!
What happens if a user app creates a dev via bpf_dev_create(), exits, and
then the admin does rm of that dev?
Looks like the map/prog will leak?
So the only proper way to delete such cdevs is via bpf_dev_destroy?

> On device creation, the kernel will return the minor number via bpf(2),
> so you
> can access the file easily, f.e. /dev/bpf/bpf_map<minor> resp.
> /dev/bpf/bpf_prog<minor>,
> and then move on with mknod(2) or symlink(2) from there if wished.

What if the admin does a mknod in that dir with some arbitrary minor?
The mknod will succeed, but it won't hold anything?
Looks like bpf_dev_connect will handle it gracefully.
So these cdevs should only be created and destroyed via the bpf syscall,
and the only sensible operations on them are open() to get an fd to pass
to bpf_dev_connect, and symlink. Anything else the admin should be
careful not to do. Right?



* Re: [PATCH net-next 0/4] BPF updates
  2015-10-16  1:09 [PATCH net-next 0/4] BPF updates Daniel Borkmann
                   ` (3 preceding siblings ...)
  2015-10-16  1:09 ` [PATCH net-next 4/4] bpf: add sample usages " Daniel Borkmann
@ 2015-10-19  2:53 ` David Miller
  4 siblings, 0 replies; 56+ messages in thread
From: David Miller @ 2015-10-19  2:53 UTC (permalink / raw)
  To: daniel; +Cc: ast, viro, ebiederm, tgraf, hannes, netdev, linux-kernel

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Fri, 16 Oct 2015 03:09:21 +0200

> This set adds support for persistent maps/progs. Please see
> individual patches for further details.
> 
> A man-page update to bpf(2) will be sent afterwards, also a
> iproute2 patch for support in tc.
> 
> Thanks!

It seems like the design here is still a little bit in flux, so I'm
marking this as "changes requested" in patchwork.

Thanks.


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-18 20:59                             ` Alexei Starovoitov
@ 2015-10-19  7:36                               ` Hannes Frederic Sowa
  2015-10-19  9:51                                 ` Daniel Borkmann
  0 siblings, 1 reply; 56+ messages in thread
From: Hannes Frederic Sowa @ 2015-10-19  7:36 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

Hi,

On Sun, Oct 18, 2015, at 22:59, Alexei Starovoitov wrote:
> On 10/18/15 9:49 AM, Daniel Borkmann wrote:
> > Okay, I have pushed some rough working proof of concept here:
> >
> > https://git.breakpoint.cc/cgit/dborkman/net-next.git/log/?h=ebpf-fds-final5
> >
> > So the idea eventually had to be slightly modified after giving this
> > further
> > thought, and is the following:
> >
> > We have 3 commands (BPF_DEV_CREATE, BPF_DEV_DESTROY, BPF_DEV_CONNECT), and
> > related to that a bpf_attr extension with only a single __u32 fd member
> > in it.
> ...
> > The nice thing about it is that you can create/unlink as many as you
> > want, but
> > when you remove the real device from an application via
> > bpf_dev_destroy(fd),
> > then all links disappear with it. Just like in the case of a normal
> > device driver.
> 
> interesting idea!
> What happens if user app creates a dev via bpf_dev_create(), exits and
> then admin does rm of that dev ?
> Looks like map/prog will leak ?
> So the only proper way to delete such cdevs is via bpf_dev_destroy ?

The mknod is not the holder; rather, the kobject, which should be
represented in sysfs, will be. So you can still get the map major:minor
by looking up the /dev file in the corresponding sysfs directory, or I
think we should provide an 'unbind' file, which will drop the kobject if
the user writes a '1' to it.

> 
> > On device creation, the kernel will return the minor number via bpf(2),
> > so you
> > can access the file easily, f.e. /dev/bpf/bpf_map<minor> resp.
> > /dev/bpf/bpf_prog<minor>,
> > and then move on with mknod(2) or symlink(2) from there if wished.
> 
> what if admin mknod in that dir with some arbitrary minor ?

Basically, -EIO. :)

> mknod will succeed, but it won't hold anything?

That is right now true for basically all device nodes which udev
creates.

> looks like bpf_dev_connect will handle it gracefully.
> So these cdevs should only be created and destroyed via bpf syscall
> and only sensible operations on them is open() to get fd and pass
> to bpf_dev_connect and symlink. Anything else admin should be
> careful not to do. Right?

Besides maybe some statistics and other stuff in the sysfs directory, no,
that is all.

Bye,
Hannes



* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19  7:36                               ` Hannes Frederic Sowa
@ 2015-10-19  9:51                                 ` Daniel Borkmann
  2015-10-19 14:23                                   ` Daniel Borkmann
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-19  9:51 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Alexei Starovoitov, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/19/2015 09:36 AM, Hannes Frederic Sowa wrote:
> Hi,
>
> On Sun, Oct 18, 2015, at 22:59, Alexei Starovoitov wrote:
>> On 10/18/15 9:49 AM, Daniel Borkmann wrote:
>>> Okay, I have pushed some rough working proof of concept here:
>>>
>>> https://git.breakpoint.cc/cgit/dborkman/net-next.git/log/?h=ebpf-fds-final5
>>>
>>> So the idea eventually had to be slightly modified after giving this
>>> further
>>> thought, and is the following:
>>>
>>> We have 3 commands (BPF_DEV_CREATE, BPF_DEV_DESTROY, BPF_DEV_CONNECT), and
>>> related to that a bpf_attr extension with only a single __u32 fd member
>>> in it.
>> ...
>>> The nice thing about it is that you can create/unlink as many as you
>>> want, but
>>> when you remove the real device from an application via
>>> bpf_dev_destroy(fd),
>>> then all links disappear with it. Just like in the case of a normal
>>> device driver.
>>
>> interesting idea!
>> What happens if user app creates a dev via bpf_dev_create(), exits and
>> then admin does rm of that dev ?
>> Looks like map/prog will leak ?
>> So the only proper way to delete such cdevs is via bpf_dev_destroy ?
>
> The mknod is not the holder; rather, the kobject, which should be
> represented in sysfs, will be. So you can still get the map major:minor
> by looking up the /dev file in the corresponding sysfs directory, or I
> think we should provide an 'unbind' file, which will drop the kobject if
> the user writes a '1' to it.

I agree, this could still be done.

>>> On device creation, the kernel will return the minor number via bpf(2),
>>> so you
>>> can access the file easily, f.e. /dev/bpf/bpf_map<minor> resp.
>>> /dev/bpf/bpf_prog<minor>,
>>> and then move on with mknod(2) or symlink(2) from there if wished.
>>
>> what if admin mknod in that dir with some arbitrary minor ?
>
> Basically, -EIO. :)
>
>> mknod will succeed, but it won't hold anything?
>
> That is right now true for basically all mknod operations, which udev
> creates.
>
>> looks like bpf_dev_connect will handle it gracefully.
>> So these cdevs should only be created and destroyed via bpf syscall
>> and only sensible operations on them is open() to get fd and pass
>> to bpf_dev_connect and symlink. Anything else admin should be
>> careful not to do. Right?
>
> Besides maybe some statistics and other stuff in sysfs directory, no,
> that is all.
>
> Bye,
> Hannes


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19  9:51                                 ` Daniel Borkmann
@ 2015-10-19 14:23                                   ` Daniel Borkmann
  2015-10-19 16:22                                     ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-19 14:23 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Alexei Starovoitov, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/19/2015 11:51 AM, Daniel Borkmann wrote:
> On 10/19/2015 09:36 AM, Hannes Frederic Sowa wrote:
>> On Sun, Oct 18, 2015, at 22:59, Alexei Starovoitov wrote:
>>> On 10/18/15 9:49 AM, Daniel Borkmann wrote:
>>>> Okay, I have pushed some rough working proof of concept here:
>>>>
>>>> https://git.breakpoint.cc/cgit/dborkman/net-next.git/log/?h=ebpf-fds-final5
>>>>
>>>> So the idea eventually had to be slightly modified after giving this
>>>> further
>>>> thought, and is the following:
>>>>
>>>> We have 3 commands (BPF_DEV_CREATE, BPF_DEV_DESTROY, BPF_DEV_CONNECT), and
>>>> related to that a bpf_attr extension with only a single __u32 fd member
>>>> in it.
>>> ...
>>>> The nice thing about it is that you can create/unlink as many as you
>>>> want, but
>>>> when you remove the real device from an application via
>>>> bpf_dev_destroy(fd),
>>>> then all links disappear with it. Just like in the case of a normal
>>>> device driver.
>>>
>>> interesting idea!
>>> What happens if user app creates a dev via bpf_dev_create(), exits and
>>> then admin does rm of that dev ?
>>> Looks like map/prog will leak ?
>>> So the only proper way to delete such cdevs is via bpf_dev_destroy ?
>>
>> The mknod is not the holder; rather, the kobject, which should be
>> represented in sysfs, will be. So you can still get the map major:minor
>> by looking up the /dev file in the corresponding sysfs directory, or I
>> think we should provide an 'unbind' file, which will drop the kobject if
>> the user writes a '1' to it.
>
> I agree, this could still be done.
>
>>>> On device creation, the kernel will return the minor number via bpf(2),
>>>> so you
>>>> can access the file easily, f.e. /dev/bpf/bpf_map<minor> resp.
>>>> /dev/bpf/bpf_prog<minor>,
>>>> and then move on with mknod(2) or symlink(2) from there if wished.
>>>
>>> what if admin mknod in that dir with some arbitrary minor ?
>>
>> Basically, -EIO. :)

If an admin does a mknod with the major of a map or prog cdev but a
not-yet-used minor, then connecting to it fails. Once a real device has
been created with that assigned minor, connecting to it succeeds.

It's nothing different than with other devices in the system, f.e. ...

   # ls -la /dev/urandom
   crw-rw-rw-. 1 root root 1, 9 Oct 19 15:18 /dev/urandom
   # mknod ./foobar c 1 9

... will make random driver available under ./foobar as well.

If your question is rather about what happens when an admin does an ``mknod
/dev/bpf/bpf_map9 c 249 11'' while the device created has a minor of 9 and
/dev/bpf/bpf_map9 already exists in the system, then udev won't auto-create
or overwrite the node pointing to the major:minor there. The device itself
is created nevertheless and is visible under /sys/class/bpf/, but I think
this is a non-issue and nothing different from any other device driver.

As Hannes said, under /sys/class/bpf/ an admin can see all held nodes, so
visibility is there for free at all times. The device management (creation/
deletion) itself and the mknods pointing to it are simply decoupled.

This whole approach looks sound to me, also integrates nicely into the
existing Linux facilities, and works on top of every fs supporting special
files. Much cleaner than an extra file-system that would be required by a
syscall in order to make the syscall work.

>>> mknod will succeed, but it won't hold anything?
>>
>> That is right now true for basically all mknod operations, which udev
>> creates.
>>
>>> looks like bpf_dev_connect will handle it gracefully.
>>> So these cdevs should only be created and destroyed via bpf syscall
>>> and only sensible operations on them is open() to get fd and pass
>>> to bpf_dev_connect and symlink. Anything else admin should be
>>> careful not to do. Right?
>>
>> Besides maybe some statistics and other stuff in sysfs directory, no,
>> that is all.
>>
>> Bye,
>> Hannes


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19 14:23                                   ` Daniel Borkmann
@ 2015-10-19 16:22                                     ` Alexei Starovoitov
  2015-10-19 17:37                                       ` Daniel Borkmann
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-19 16:22 UTC (permalink / raw)
  To: Daniel Borkmann, Hannes Frederic Sowa, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/19/15 7:23 AM, Daniel Borkmann wrote:
>>> The mknod is not the holder; rather, the kobject, which should be
>>> represented in sysfs, will be. So you can still get the map major:minor
>>> by looking up the /dev file in the corresponding sysfs directory, or I
>>> think we should provide an 'unbind' file, which will drop the kobject if
>>> the user writes a '1' to it.
>>
>> I agree, this could still be done.

imo doing 'rm' is way cleaner than dealing with an 'unbind' file.

> As Hannes said, under /sys/class/bpf/ an admin can see all held nodes, so
> visibility is there for free at all times. The device management (creation/
> deletion) itself and the mknod's pointing to it are simply decoupled.
>
> This whole approach looks sound to me, also integrates nicely into the
> existing Linux facilities, and works on top of every fs supporting special
> files. Much cleaner than an extra file-system that would be required by a
> syscall in order to make the syscall work.

thanks for the explanations. I think I got a complete picture now of
how such cdevs will be used, and I don't like it.
There is nothing in Linux or any Unix that creates thousands of cdevs
on the fly, but here user apps will create/destroy them back and forth,
and they would need to do it quickly. The whole sysfs/kobj baggage is
completely unnecessary here. The kernel will consume more memory for
no real reason other than that cdevs are used to keep progs/maps around.
imo fs is cleaner, and we can tailor it to be similar to cdev style.
For example, we can make bpffs automount in /sys/kernel/bpf/ as the
standard location and have one directory structure for all mounts (like
tracefs). Then within it, have an idr mechanism to create bpf_progX and
bpf_mapY special files via a BPF_PIN_FD bpf syscall with a single FD
argument. At this point the fs and cdev approaches look exactly the same
from the user's point of view, but the overhead of fs is significantly
lower, and normal 'rm' works just fine and much faster.



* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19 16:22                                     ` Alexei Starovoitov
@ 2015-10-19 17:37                                       ` Daniel Borkmann
  2015-10-19 18:15                                         ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-19 17:37 UTC (permalink / raw)
  To: Alexei Starovoitov, Hannes Frederic Sowa, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/19/2015 06:22 PM, Alexei Starovoitov wrote:
> On 10/19/15 7:23 AM, Daniel Borkmann wrote:
>>>> The mknod is not the holder; rather, the kobject, which should be
>>>> represented in sysfs, will be. So you can still get the map major:minor
>>>> by looking up the /dev file in the corresponding sysfs directory, or I
>>>> think we should provide an 'unbind' file, which will drop the kobject if
>>>> the user writes a '1' to it.
>>>
>>> I agree, this could still be done.
>
> imo doing 'rm' is way cleaner than dealing with an 'unbind' file.

Hmm, not sure, maybe this was misunderstood. It's not about files, but
rather devices. Devices are decoupled.

This unbind file is optional and could live under /sys/class/bpf/bpf_{map,
prog}<X>/unbind for a device release. It's not strictly necessary for this
to work, though; the management is, as explained, done via the bpf() syscall.

>> As Hannes said, under /sys/class/bpf/ an admin can see all held nodes, so
>> visibility is there for free at all times. The device management (creation/
>> deletion) itself and the mknod's pointing to it are simply decoupled.
>>
>> This whole approach looks sound to me, also integrates nicely into the
>> existing Linux facilities, and works on top of every fs supporting special
>> files. Much cleaner than an extra file-system that would be required by a
>> syscall in order to make the syscall work.
>
> thanks for the explanations. I think I got a complete picture now on
> how such cdev will be used and I don't like it.
> There is nothing in linux or any unix that creates thousands of cdevs
> on the fly, but here user apps will create/destroy them back and forth
> and they would need to do it quickly. Whole sysfs/kobj baggage is

Well, you are talking about a thousand maps, while even root can create
about 5 maps and will then get an -EPERM. ;) Until an admin figures out,
after a couple of detours, that ulimit -l needs to be adjusted ... ;)

But more seriously, can you elaborate on what you mean?

An eBPF program or map loading/destruction is *not* by any means to be
considered fast-path. We currently hold a global mutex during loading.
So, how can that be considered fast-path? Similarly, socket creation/
destruction is also not fast-path, etc. Do you expect that applications
would create/destroy these devices within milliseconds? I'd argue that
something would be seriously wrong with that application, then. Such
persistent maps are to be considered rather mid-long living objects in
the system. The fast-path surely is the data-path of them.

> completely unnecessary here. The kernel will consume more memory for
> no real reason other than cdev are used to keep prog/maps around.

I don't consider this a big issue, and it's well worth the trade-off. With
the proposed patch, you'll have an infrastructure that integrates *nicely*
into the *existing* kernel model *and* tooling. This is a HUGE plus. The
UAPI of this is simple and minimal. And to me, these are in fact special
files, not regular ones.

> imo fs is cleaner and we can tailor it to be similar to cdev style.

Really, IMHO this is over-designed, and much, much more hacky. We would
design a whole new file system that works *exactly* like cdevs and likely
takes more than twice the code and complexity to realize, just to save a
few bytes ...? I don't understand that.

Cheers,
Daniel


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19 17:37                                       ` Daniel Borkmann
@ 2015-10-19 18:15                                         ` Alexei Starovoitov
  2015-10-19 18:46                                           ` Hannes Frederic Sowa
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-19 18:15 UTC (permalink / raw)
  To: Daniel Borkmann, Hannes Frederic Sowa, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/19/15 10:37 AM, Daniel Borkmann wrote:
> An eBPF program or map loading/destruction is *not* by any means to be
> considered fast-path. We currently hold a global mutex during loading.
> So, how can that be considered fast-path? Similarly, socket creation/
> destruction is also not fast-path, etc. Do you expect that applications
> would create/destroy these devices within milliseconds? I'd argue that
> something would be seriously wrong with that application, then. Such
> persistent maps are to be considered rather mid-long living objects in
> the system. The fast-path surely is the data-path of them.

Consider seccomp, which loads several programs for every tab, then the
container use case, where we're loading two for each as well.
Who knows what use cases will come up in the future.
It's obviously not a fast path that is hit a million times a second,
but why add overhead when it's unnecessary?

>> completely unnecessary here. The kernel will consume more memory for
>> no real reason other than cdev are used to keep prog/maps around.
>
> I don't consider this a big issue, and well worth the trade-off. You'll
> have an infrastructure that integrates *nicely* into the *existing* kernel
> model *and* tooling with the proposed patch. This is a HUGE plus. The
> UAPI of this is simple and minimal. And to me, these are in-fact special
> files, not regular ones.

Seriously? A syscall to create/destroy cdevs is a nice API?
Not by any means. We can argue in circles, but it doesn't fit.
Using a cdev to hold maps is a hack.
Telling the kernel that a fake device is created only to hold a map?
This fake device doesn't have any of the device properties.
Look at fops->open. Surely it's a smart trick, but it's not behaving
like a device.

The uapi for fs adds only two commands, whereas cdev needs three.

>> imo fs is cleaner and we can tailor it to be similar to cdev style.
>
> Really, IMHO I think this is over-designed, and much much more hacky. We
> design a whole new file system that works *exactly* like cdevs, takes
> likely more than twice the code and complexity to realize but just to
> save a few bytes ...? I don't understand that.

Let's argue with facts. Your fs patch is 758 lines vs. 483 for cdev.
In fs the cost is a single alloc_inode(), whereas in cdev the struct cdev
and mutex are added to all maps and progs (even when they're not going
to be pinned), plus the kobj allocation, plus the gigantic struct device,
plus sysfs inodes, etc. Way more than a few bytes of difference.

Where do you see 'over-design' in fs? It's straightforward code, and
most of it is boilerplate like in all other filesystems.
Kill the rmdir/mkdir ops, the term bit, and the options, and the fs diff
will shrink by another 200+ lines.



* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19 18:15                                         ` Alexei Starovoitov
@ 2015-10-19 18:46                                           ` Hannes Frederic Sowa
  2015-10-19 19:34                                             ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Hannes Frederic Sowa @ 2015-10-19 18:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

Hi,

On Mon, Oct 19, 2015, at 20:15, Alexei Starovoitov wrote:
> On 10/19/15 10:37 AM, Daniel Borkmann wrote:
> > An eBPF program or map loading/destruction is *not* by any means to be
> > considered fast-path. We currently hold a global mutex during loading.
> > So, how can that be considered fast-path? Similarly, socket creation/
> > destruction is also not fast-path, etc. Do you expect that applications
> > would create/destroy these devices within milliseconds? I'd argue that
> > something would be seriously wrong with that application, then. Such
> > persistent maps are to be considered rather mid-long living objects in
> > the system. The fast-path surely is the data-path of them.
> 
> consider seccomp that loads several programs for every tab, then
> container use case where we're loading two for each as well.
> Who knows what use cases will come up in the future.
> It's obviously not a fast path that is being hit million times a second,
> but why add overhead when it's unnecessary?

But the browser using seccomp does not need persistent maps at all. It
would merely create temporary maps, without the overhead, which are
automatically released when the program finishes. In no way should they
be persistent; they should be automatically garbage collected as soon as
the supervisor process and its children die.

> >> completely unnecessary here. The kernel will consume more memory for
> >> no real reason other than cdev are used to keep prog/maps around.
> >
> > I don't consider this a big issue, and well worth the trade-off. You'll
> > have an infrastructure that integrates *nicely* into the *existing* kernel
> > model *and* tooling with the proposed patch. This is a HUGE plus. The
> > UAPI of this is simple and minimal. And to me, these are in-fact special
> > files, not regular ones.
> 
> Seriously? Syscall to create/destroy cdevs is a nice api?

It is pretty normal to do. They don't use a syscall but an ioctl on a
special pseudo device node (lvm2/device-mapper, tun, etc.). Same thing: a
very overloaded syscall. Also, I don't see a reason not to make creation
possible via sysfs, which would be even nicer but not orthogonal to the
current creation of maps, which is nowadays set in stone by uapi.

> Not by any means. We can argue in circles, but it doesn't fit.
> Using cdev to hold maps is a hack.
> Telling kernel that fake device is created only to hold a map?
> This fake device doesn't have any of the device properties.
> Look at fops->open. Surely it's a smart trick, but it's not behaving
> like device.
> 
> uapi for fs adds only two commands, whereas cdev needs three.
> 
> >> imo fs is cleaner and we can tailor it to be similar to cdev style.
> >
> > Really, IMHO I think this is over-designed, and much much more hacky. We
> > design a whole new file system that works *exactly* like cdevs, takes
> > likely more than twice the code and complexity to realize but just to
> > save a few bytes ...? I don't understand that.
> 
> Let's argue with facts. Your fs patch 758 lines vs cdev 483.
> In fs the cost is single alloc_inode(), whereas in cdev the struct cdev
> and mutex will add to all maps and progs (even when they're not going
> to be pinned), plus kobj allocation, plus gigantic struct device, plus
> sysfs inodes, etc. Way more than few bytes of difference.

sysfs already has infrastructure to ensure supportability and
debuggability. Dependency graphs can be created to see which programs
depend on which bpf_progs. kobjects are singletons in the sysfs
representation. If you have multiple ebpf filesystems, maybe even
referencing the same hashtable with gigabytes of data multiple times,
there needs to be some way to help administrators check resource usage
and statistics, and tweak and tune the rhashtable. All this needs to be
handled in the future as well. It doesn't really fit the filesystem
model, but representing a kobject seems to be a good fit to me.

Policy can be done by user space with the help of udev. SELinux policies
can easily be extended to allow specific domains access. Namespace
"passthrough" is already defined for devices.

> where do you see 'over-design' in fs? It's a straightforward code and
> most of it is boilerplate like all other fs.
> Kill rmdir/mkdir ops, term bit and options and the fs diff will shrink
> by another 200+ lines.

I don't think that a filesystem alone will do it in the foreseeable
future. You want tools like lsof to report which map and which
program hold references to each other. Those file nodes will need more
metadata attached in the future anyway, so just comparing lines of
code is currently not a good way of arguing.

With filesystems, a lot of other questions suddenly arise: SELinux
support, ACLs, etc.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19 18:46                                           ` Hannes Frederic Sowa
@ 2015-10-19 19:34                                             ` Alexei Starovoitov
  2015-10-19 20:03                                               ` Hannes Frederic Sowa
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-19 19:34 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Daniel Borkmann, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/19/15 11:46 AM, Hannes Frederic Sowa wrote:
> Hi,
>
> On Mon, Oct 19, 2015, at 20:15, Alexei Starovoitov wrote:
>> On 10/19/15 10:37 AM, Daniel Borkmann wrote:
>>> An eBPF program or map loading/destruction is *not* by any means to be
>>> considered fast-path. We currently hold a global mutex during loading.
>>> So, how can that be considered fast-path? Similarly, socket creation/
>>> destruction is also not fast-path, etc. Do you expect that applications
>>> would create/destroy these devices within milliseconds? I'd argue that
>>> something would be seriously wrong with that application, then. Such
>>> persistent maps are to be considered rather mid-long living objects in
>>> the system. The fast-path surely is the data-path of them.
>>
>> consider seccomp that loads several programs for every tab, then
>> container use case where we're loading two for each as well.
>> Who knows what use cases will come up in the future.
>> It's obviously not a fast path that is being hit million times a second,
>> but why add overhead when it's unnecessary?
>
> But the browser using seccomp does not need persistent maps at all. It

today it doesn't, but if it were a lightweight feature, it could
have been used much more broadly.
Currently we're using a user agent for networking and it's a pain.

>>>> completely unnecessary here. The kernel will consume more memory for
>>>> no real reason other than cdev are used to keep prog/maps around.
>>>
>>> I don't consider this a big issue, and well worth the trade-off. You'll
>>> have an infrastructure that integrates *nicely* into the *existing* kernel
>>> model *and* tooling with the proposed patch. This is a HUGE plus. The
>>> UAPI of this is simple and minimal. And to me, these are in-fact special
>>> files, not regular ones.
>>
>> Seriously? Syscall to create/destroy cdevs is a nice api?
>
> It is pretty normal to do. They don't use syscall but ioctl on a special
> pseudo device node (lvm2/device-mapper, tun, etc.). Same thing, very
> overloaded syscall. Also I don't see a reason to not make the creation
> possible via sysfs, which would be even nicer but not orthogonal to the
> current creation of maps, which is nowadays set in stone by uapi.

Isn't your point going against cdev? ioctls on devs are overloaded,
whereas here we have a clean bpf syscall with no extra baggage.
Going old-school with the device model goes against that clean syscall
approach.

> sysfs already has infrastructure to ensure supportability and
> debuggability. Dependency graphs can be created to see which programs
> depend on which bpf_progs. kobjects are singletons in sysfs
> representation. If you have multiple ebpf filesystems, even maybe
> referencing the same hashtable with gigabytes of data multiple times,
> there needs to be some way to help administrators to check resource
> usage, statistics, tweak and tune the rhashtable. All this needs to be
> handled as well in the future. It doesn't really fit the filesystem
> model, but representing a kobject seems to be a good fit to me.

Quite the opposite. The way the cdev patch is done now, there is no way
to extend it to create a hierarchy without breaking users, whereas the
fs style of Daniel's initial patch can be extended.
Doing 'resource stats' via sysfs requires bpf to plug into sysfs, which
is not what this cdev approach does.

> Policy can be done by user space with help of udev. Selinux policies can
> easily be extended to allow specific domains access. Namespace
> "passthrough" is defined for devices already.

That's an interesting point, but isn't it better done with fs?
You can go much more fine-grained with directory permissions in fs,
with different users/namespaces having their own hierarchies and mounts,
whereas cdev dumps everything into one spot.

>> where do you see 'over-design' in fs? It's a straightforward code and
>> most of it is boilerplate like all other fs.
>> Kill rmdir/mkdir ops, term bit and options and the fs diff will shrink
>> by another 200+ lines.
>
> I don't think that only a filesystem will do it in the foreseeable
> future. You want to have tools like lsof reporting which map and which
> program has references to each other. Those file nodes will need more
> metadata attached in future anyway, so currently just comparing lines of
> codes seems not to be a good way for arguing.

My point was that both have roughly the same number, but the lines of
code will grow with either approach, and when cdev starts as a hack I can
only see more hacks added in the future, whereas fs gives us
full flexibility to do any type of file access and user-visible
representation.
Also, I don't buy the point about reinventing sysfs. bpffs is not doing
sysfs. I don't want to see _every_ bpf object in sysfs. It's way too
much overhead. Classic doesn't have sysfs and everyone has been
using it just fine.
bpffs is solving the need to 'pin_fd' when the user process exits.
That's it. Let's solve that and not go into any of this sysfs stuff.



* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19 19:34                                             ` Alexei Starovoitov
@ 2015-10-19 20:03                                               ` Hannes Frederic Sowa
  2015-10-19 20:48                                                 ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Hannes Frederic Sowa @ 2015-10-19 20:03 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

Hi Alexei,

On Mon, Oct 19, 2015, at 21:34, Alexei Starovoitov wrote:
> On 10/19/15 11:46 AM, Hannes Frederic Sowa wrote:
> > Hi,
> >
> > On Mon, Oct 19, 2015, at 20:15, Alexei Starovoitov wrote:
> >> On 10/19/15 10:37 AM, Daniel Borkmann wrote:
> >>> An eBPF program or map loading/destruction is *not* by any means to be
> >>> considered fast-path. We currently hold a global mutex during loading.
> >>> So, how can that be considered fast-path? Similarly, socket creation/
> >>> destruction is also not fast-path, etc. Do you expect that applications
> >>> would create/destroy these devices within milliseconds? I'd argue that
> >>> something would be seriously wrong with that application, then. Such
> >>> persistent maps are to be considered rather mid-long living objects in
> >>> the system. The fast-path surely is the data-path of them.
> >>
> >> consider seccomp that loads several programs for every tab, then
> >> container use case where we're loading two for each as well.
> >> Who knows what use cases will come up in the future.
> >> It's obviously not a fast path that is being hit million times a second,
> >> but why add overhead when it's unnecessary?
> >
> > But the browser using seccomp does not need persistent maps at all. It
> 
> today it doesn't, but if it was a light weight feature, it could
> have been used much more broadly.
> Currently we're using user agent for networking and it's a pain.

I doubt it will stay a lightweight feature, as it should not be the
responsibility of user space to provide those debug facilities.

> >>>> completely unnecessary here. The kernel will consume more memory for
> >>>> no real reason other than cdev are used to keep prog/maps around.
> >>>
> >>> I don't consider this a big issue, and well worth the trade-off. You'll
> >>> have an infrastructure that integrates *nicely* into the *existing* kernel
> >>> model *and* tooling with the proposed patch. This is a HUGE plus. The
> >>> UAPI of this is simple and minimal. And to me, these are in-fact special
> >>> files, not regular ones.
> >>
> >> Seriously? Syscall to create/destroy cdevs is a nice api?
> >
> > It is pretty normal to do. They don't use syscall but ioctl on a special
> > pseudo device node (lvm2/device-mapper, tun, etc.). Same thing, very
> > overloaded syscall. Also I don't see a reason to not make the creation
> > possible via sysfs, which would be even nicer but not orthogonal to the
> > current creation of maps, which is nowadays set in stone by uapi.
> 
> Isn't it your point going against cdev? ioctls on devs are overloaded,
> whereas here we have clean bpf syscall with no extra baggage.
> Going old-school to device model goes against that clean syscall
> approach.

The bpf syscall is still used to create the pseudo nodes. If they should
be persistent, they just get registered in the sysfs class hierarchy.

> > sysfs already has infrastructure to ensure supportability and
> > debuggability. Dependency graphs can be created to see which programs
> > depend on which bpf_progs. kobjects are singletons in sysfs
> > representation. If you have multiple ebpf filesystems, even maybe
> > referencing the same hashtable with gigabytes of data multiple times,
> > there needs to be some way to help administrators to check resource
> > usage, statistics, tweak and tune the rhashtable. All this needs to be
> > handled as well in the future. It doesn't really fit the filesystem
> > model, but representing a kobject seems to be a good fit to me.
> 
> quite the opposite. The way cdev patch is done now, there is no way
> to extend it to create a hierarchy without breaking users,
> whereas fs style with initial Daniel's patch can be extended.
> Doing 'resource stats' via sysfs requires bpf to add to sysfs, which
> is not this cdev approach.

This is not yet part of the patch, but I think this would be added.
Daniel?

> > Policy can be done by user space with help of udev. Selinux policies can
> > easily be extended to allow specific domains access. Namespace
> > "passthrough" is defined for devices already.
> 
> that's an interesting point, but isn't it better done with fs?
> you can go much more fine-grained with directory permissions in fs.
> With different users/namespaces having their own hierarchies and mounts
> whereas cdev dumps everything into one spot.

Policy can already be defined in terms of cgroups:
<http://lxr.free-electrons.com/source/Documentation/cgroups/devices.txt>

I don't think there are broad differences. But in case a namespace uses
a huge number of maps with tons of data, the admin in the initial
namespace might want to debug that without searching all mountpoints and
finding dependencies between processes, etc. IMHO the sysfs approach can
be better extended here.
 
> >> where do you see 'over-design' in fs? It's a straightforward code and
> >> most of it is boilerplate like all other fs.
> >> Kill rmdir/mkdir ops, term bit and options and the fs diff will shrink
> >> by another 200+ lines.
> >
> > I don't think that only a filesystem will do it in the foreseeable
> > future. You want to have tools like lsof reporting which map and which
> > program has references to each other. Those file nodes will need more
> > metadata attached in future anyway, so currently just comparing lines of
> > codes seems not to be a good way for arguing.
> 
> my point was that both have roughly the same number, but the lines of
> code will grow in either approach and when cdev starts as a hack I can
> only see more hacks added in the future, whereas fs gives us
> full flexibility to do any type of file access and user visible
> representation.
> Also I don't buy the point of reinventing sysfs. bpffs is not doing
> sysfs. I don't want to see _every_ bpf object in sysfs. It's way too
> much overhead. Classic doesn't have sysfs and everyone has been
> using it just fine.

But classic bpf does not have persistence for maps and data. ;) There is
a 1:1 relationship between a socket and a bpf_prog, for example.

> bpffs is solving the need to 'pin_fd' when user process exits.
> That's it. Let's solve that and not go into any of this sysfs stuff.

But how can the filesystem be extended in terms of tunables and
information? File attributes? Wouldn't it otherwise need the same
infrastructure as sysfs? Some third-party lookup filesystem or ioctl? The
char dev approach also pins maps and progs while putting more policy in
the hands of the central user space programs we are currently using
(udev, systemd, whatever, etc.).

Bye,
Hannes


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19 20:03                                               ` Hannes Frederic Sowa
@ 2015-10-19 20:48                                                 ` Alexei Starovoitov
  2015-10-19 22:17                                                   ` Daniel Borkmann
  2015-10-19 23:02                                                   ` Hannes Frederic Sowa
  0 siblings, 2 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-19 20:48 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Daniel Borkmann, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/19/15 1:03 PM, Hannes Frederic Sowa wrote:
>
> I doubt it will stay a lightweight feature as it should not be in the
> responsibility of user space to provide those debug facilities.

It feels like we're talking past each other.
I want to solve the 'persistent map' problem.
Debugging of maps/progs, hierarchy, etc. are all nice to have,
but they are different issues.
In the case of persistent maps I imagine an unprivileged process would
want to use them eventually as well, so this requirement already kills
the cdev approach for me, since I don't think we would ever let
unprivileged apps create a cdev with a syscall.

> The bpf syscall is still used to create the pseudo nodes. If they should
> be persistent they just get registered in the sysfs class hierarchy.

Nope, they should not. sysfs is a debugging/tuning facility.
There is absolutely no need for bpf to plug into sysfs.

>> Doing 'resource stats' via sysfs requires bpf to add to sysfs, which
>> is not this cdev approach.
>
> This is not yet part of the patch, but I think this would be added.
> Daniel?

please don't. I'm strongly against adding unnecessary bloat.

> I don't think there are broad differences. But in case a namespace uses
> a huge number of maps with tons of data, the admin in the initial
> namespace might want to debug that without searching all mountpoints and
> find dependencies between processes etc. IMHO sysfs approach can be
> better extended here.

Sure, then we can force all bpffs instances to have the same hierarchy
and be mounted at /sys/kernel/bpf. That would be the same.

It feels like you're pushing for cdev only because of that potential
debugging need. Did you actually face that need? I didn't, and I don't
like adding a 'nice to have' feature until a real need comes.

>> Also I don't buy the point of reinventing sysfs. bpffs is not doing
>> sysfs. I don't want to see _every_ bpf object in sysfs. It's way too
>> much overhead. Classic doesn't have sysfs and everyone has been
>> using it just fine.
>
> But classic bpf does not have persistence for maps and data. ;) There is
> a 1:1 relationship between socket and bpf_prog for example.

A single task in seccomp can have a chain of bpf progs, so the hierarchy
is already there.

> But how can the filesystem be extended in terms of tunables and
> information? File attributes? Wouldn't it need the same infrastructure
> otherwise as sysfs? Some third-party lookup filesystem or ioctl? This
> char dev approach also pins maps and progs while giving more policy in
> hand of central user space programs we are currently using (udev,
> systemd, whatever, etc.).

Tunables for bpf maps? There are no such things today.
I think you're implying that we can add an rhashtable type of map, so an
admin can tune thresholds? Ouch. I think if we add it, its parameters
will be specified only by the user that is creating the map. There will
be no tunables exposed to sysfs, and there should be no way of creating
maps via sysfs.



* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19 20:48                                                 ` Alexei Starovoitov
@ 2015-10-19 22:17                                                   ` Daniel Borkmann
  2015-10-20  0:30                                                     ` Alexei Starovoitov
  2015-10-19 23:02                                                   ` Hannes Frederic Sowa
  1 sibling, 1 reply; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-19 22:17 UTC (permalink / raw)
  To: Alexei Starovoitov, Hannes Frederic Sowa, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/19/2015 10:48 PM, Alexei Starovoitov wrote:
> On 10/19/15 1:03 PM, Hannes Frederic Sowa wrote:
>>
>> I doubt it will stay a lightweight feature as it should not be in the
>> responsibility of user space to provide those debug facilities.
>
> It feels we're talking past each other.
> I want to solve 'persistent map' problem.
> debugging of maps/progs, hierarchy, etc are all nice to have,
> but different issues.

Ok, so you are saying that this file system will have *only* regular
files that are to be considered nodes for eBPF maps and progs. Nothing
else will ever be added to it? Yet, eBPF map and prog nodes are *not*
*regular files* to me; this seems odd.

As soon as you start adding additional folders that contain
files dumping additional metadata, etc., you basically end up with a
concept quite similar to sysfs, no?

> In case of persistent maps I imagine unprivileged process would want
> to use it eventually as well, so this requirement already kills cdev
> approach for me, since I don't think we ever let unprivileged apps
> create cdev with syscall.

Hmm, I see. So far the discussion was only about having this for
privileged users (also in this fs patch). F.e. privileged system daemons
could set up and distribute progs/maps to consumers, etc. (f.e. the
seccomp and tc cases).

When we start lifting this, eBPF maps on their own will become a real
kernel IPC facility for unprivileged Linux applications (independently of
whether they are connected to an actual eBPF program). Those kernel IPC
facilities that are anchored in the file system, like named pipes and
Unix domain sockets, are indicated as such by special files, no?

>> The bpf syscall is still used to create the pseudo nodes. If they should
>> be persistent they just get registered in the sysfs class hierarchy.
>
> nope. they should not. sysfs is a debugging/tuning facility.
> There is absolutely no need for bpf to plug into sysfs.
>
>>> Doing 'resource stats' via sysfs requires bpf to add to sysfs, which
>>> is not this cdev approach.
>>
>> This is not yet part of the patch, but I think this would be added.
>> Daniel?
>
> please don't. I'm strongly against adding unnecessary bloat.
>
>> I don't think there are broad differences. But in case a namespaces uses
>> huge number of maps with tons of data, the admin in the initial
>> namespace might want to debug that without searching all mountpoints and
>> find dependencies between processes etc. IMHO sysfs approach can be
>> better extended here.
>
> sure, then we can force all bpffs to have the same hierarchy and mounted
> in /sys/kernel/bpf location. That would be the same.

That would imply having a mount_single() file system (like f.e. tracefs
and securityfs), right?

So you'd lose having various mounts in different namespaces. And if you
allow various mount points, how would that /facilitate/ an admin
identifying all eBPF objects/resources currently present in the system?

Or an application developer finding possible mount points for his own
application so that the bpf(2) syscall to create these nodes succeeds?
Would you make mounting also unprivileged? What if various distros have
these mount points at different locations? What should unprivileged
applications do to know that they can use these locations for themselves?

Also, since they are only regular files, one can only try to find these
objects based on their naming schemes, which seems to get a bit odd in
case this file system also carries (perhaps in the future?) other regular
files that are not eBPF map- and program-special.

> It feels you're pushing for cdev only because of that potential
> debugging need. Did you actually face that need? I didn't and
> don't like to add 'nice to have' feature until real need comes.

I think this discussion arose because of the question of how flexible we
are to extend this facility in the future. Nothing more.

I still don't consider the cdev approach a hack. Despite being a bit
"old school", as you say, the semantics seem better defined, imho. It
integrates better into existing facilities.

Cheers,
Daniel


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19 20:48                                                 ` Alexei Starovoitov
  2015-10-19 22:17                                                   ` Daniel Borkmann
@ 2015-10-19 23:02                                                   ` Hannes Frederic Sowa
  2015-10-20  1:09                                                     ` Alexei Starovoitov
  1 sibling, 1 reply; 56+ messages in thread
From: Hannes Frederic Sowa @ 2015-10-19 23:02 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

Hello Alexei,

On Mon, Oct 19, 2015, at 22:48, Alexei Starovoitov wrote:
> On 10/19/15 1:03 PM, Hannes Frederic Sowa wrote:
> >
> > I doubt it will stay a lightweight feature as it should not be in the
> > responsibility of user space to provide those debug facilities.
> 
> It feels we're talking past each other.
> I want to solve 'persistent map' problem.
> debugging of maps/progs, hierarchy, etc are all nice to have,
> but different issues.

I understand that. The main problem is persisting fds with attached maps,
sure.

I am not a big fan of the kernel referencing or persisting resources on
its own behalf. IMHO they should be attached to user space programs.
So I am still in favor of user space, but I see that most people are
not comfortable with that. So a way to make this kind of persistence in
the kernel as introspectable as possible is the main goal of the idea we
are sharing here.

I bet commercial software will make use of this ebpf framework, too. And
the kernel always helped me and gave me a way to see what is going on,
debug which part of my operating system universe interacts with which
other part. Merely dropping file descriptors with data attached to them
into a filesystem does not seem to fulfill my need at all. I would love
to see where resources are referenced and why, as I can nowadays.

Btw: has anybody had a look at whether kdbus allows user space to handle
those file descriptors much more easily (as an alternative to af_unix)?
I haven't, yet.

> In case of persistent maps I imagine unprivileged process would want
> to use it eventually as well, so this requirement already kills cdev
> approach for me, since I don't think we ever let unprivileged apps
> create cdev with syscall.

The TTY code is creating nodes on behalf of users; I'll check if this
could work for bpf cdevs as well.

> > The bpf syscall is still used to create the pseudo nodes. If they should
> > be persistent they just get registered in the sysfs class hierarchy.
> 
> > nope. they should not. sysfs is a debugging/tuning facility.
> There is absolutely no need for bpf to plug into sysfs.
> 
> >> Doing 'resource stats' via sysfs requires bpf to add to sysfs, which
> >> is not this cdev approach.
> >
> > This is not yet part of the patch, but I think this would be added.
> > Daniel?
> 
> please don't. I'm strongly against adding unnecessary bloat.
> 
> > I don't think there are broad differences. But in case a namespace uses
> > a huge number of maps with tons of data, the admin in the initial
> > namespace might want to debug that without searching all mountpoints and
> > find dependencies between processes etc. IMHO sysfs approach can be
> > better extended here.
> 
> sure, then we can force all bpffs to have the same hierarchy and mounted
> in /sys/kernel/bpf location. That would be the same.
> 
> It feels you're pushing for cdev only because of that potential
> debugging need. Did you actually face that need? I didn't and
> don't like to add 'nice to have' feature until real need comes.

Say we want to monitor the load of a hashmap for graphing
purposes. Or liberate some hashmaps from their restriction on the number
of keys and make upper bounds configurable by admins who know the
dimensions of their systems, rather than by some software buried deep
down in the bpf syscall whose source code I might not have access to.
Or, in tc, force e.g. hashmaps to do garbage collection, because we
cannot be sure that under DoS attacks user space cleanup gets scheduled
early enough if ebpf adds flows to hashtables. I do see a need to expand
and implement some kind of policy in the future.

> >> Also I don't buy the point of reinventing sysfs. bpffs is not doing
> >> sysfs. I don't want to see _every_ bpf object in sysfs. It's way too
> >> much overhead. Classic doesn't have sysfs and everyone has been
> >> using it just fine.
> >
> > But classic bpf does not have persistence for maps and data. ;) There is
> > a 1:1 relationship between socket and bpf_prog for example.
> 
> single task in seccomp can have a chain of bpf progs, so hierarchy
> is already there.

And it would be great to inspect them.

> > But how can the filesystem be extended in terms of tunables and
> > information? File attributes? Wouldn't it need the same infrastructure
> > otherwise as sysfs? Some third-party lookup filesystem or ioctl? This
> > char dev approach also pins maps and progs while giving more policy in
> > hand of central user space programs we are currently using (udev,
> > systemd, whatever, etc.).
> 
> tunables for bpf maps? There are no such things today.
> I think you're implying that we can add rhashtable type of map, so
> admin can tune thresholds ? Ouch. I think if we add it, its parameters
> will be specified by the user that is creating the map only. There will
> be no tunables exposed to sysfs and there should be no way of creating
> maps via sysfs.

I am fine with creating maps only via the bpf syscall. But hiding
configuration details, or at least not really being able to query them
easily, seems odd to me. If we go with the ebpffs, how could those
attributes be added?

Thanks,
Hannes


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19 22:17                                                   ` Daniel Borkmann
@ 2015-10-20  0:30                                                     ` Alexei Starovoitov
  2015-10-20  8:46                                                       ` Daniel Borkmann
  2015-10-20  9:43                                                       ` Hannes Frederic Sowa
  0 siblings, 2 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-20  0:30 UTC (permalink / raw)
  To: Daniel Borkmann, Hannes Frederic Sowa, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/19/15 3:17 PM, Daniel Borkmann wrote:
> On 10/19/2015 10:48 PM, Alexei Starovoitov wrote:
>> On 10/19/15 1:03 PM, Hannes Frederic Sowa wrote:
>>>
>>> I doubt it will stay a lightweight feature as it should not be in the
>>> responsibility of user space to provide those debug facilities.
>>
>> It feels we're talking past each other.
>> I want to solve 'persistent map' problem.
>> debugging of maps/progs, hierarchy, etc are all nice to have,
>> but different issues.
>
> Ok, so you are saying that this file system will have *only* regular
> files that are to be considered nodes for eBPF maps and progs. Nothing
> else will ever be added to it? Yet, eBPF map and prog nodes are *not*
> *regular files* to me, this seems odd.
>
> As soon as you are starting to add additional folders that contain
> files dumping additional meta data, etc, you basically end up on a
> bit of similar concept to sysfs, no?

As we discussed in this thread and earlier during Plumbers, I think
it would be good to expose key/values somehow in this fs.
'How' is a big question.
But regardless of which path we take, sysfs is too rigid.
For the sake of argument, say we expose every key as a new file in bpffs.
It's not very scalable, but compared to sysfs it's better
(resource-wise).
If we decide to add a bpf syscall command to expose map details, we
can provide a pretty printer for it in the form of a printk-like string,
or via some schema passed to the syscall, so that keys (file names) can
look proper in bpffs.
The above and other ideas are all possible in bpffs, but not possible
in sysfs.

>> In case of persistent maps I imagine unprivileged process would want
>> to use it eventually as well, so this requirement already kills cdev
>> approach for me, since I don't think we ever let unprivileged apps
>> create cdev with syscall.
>
> Hmm, I see. So far the discussion was only about having this for privileged
> users (also in this fs patch). F.e. privileged system daemons could setup
> and distribute progs/maps to consumers, etc (f.e. seccomp and tc case).

It completely makes sense to restrict it to admin today, but the design
should not prevent relaxing that in the future.

> When we start lifting this, eBPF maps by its own will become a real kernel
> IPC facility for unprivileged Linux applications (independently whether
> they are connected to an actual eBPF program). Those kernel IPC facilities
> that are anchored in the file system like named pipes and Unix domain
> sockets
> are indicated as such as special files, no?

Not everything in unix is a model that should be followed.
af_unix with name[0] != 0 is a bad api that wasn't thought through.
Thankfully Linux improved on it with abstract names that don't use
special files.
bpf maps are obviously not an IPC (pinned or not).

>> sure, then we can force all bpffs to have the same hierarchy and mounted
>> in /sys/kernel/bpf location. That would be the same.
>
> That would imply to have a mount_single() file system (like f.e. tracefs
> and
> securityfs), right?

Probably. I'm not sure whether it should be a single fs or whether we
allow multiple mount points. There are pros and cons to both.

> So you'd loose having various mounts in different namespaces. And if you
> allow various mount points, how would that /facilitate/ to an admin to
> identify all eBPF objects/resources currently present in the system?

If it's a single mount point there are no issues, but it would be nice
to separate users and namespaces somehow.

> Or to an application developer finding possible mount points for his own
> application so that bpf(2) syscall to create these nodes succeeds? Would
> you make mounting also unprivileged? What if various distros have these
> mount points at different locations? What should unprivileged applications
> do to know that they can use these locations for themselves?

mounting is root-only of course, and having a standard location answers
all of these questions.

> Also, since they are only regular files, one can only try and find these
> objects based on their naming schemes, which seems to get a bit odd in case
> this file system also carries (perhaps future?) other regular files that
> are not eBPF map and program-special.

not sure what you meant. If names are given by the kernel, there is no
problem finding them. If by the user, we'd use file attributes the way
you did with xattr.

>> It feels you're pushing for cdev only because of that potential
>> debugging need. Did you actually face that need? I didn't and
>> don't like to add 'nice to have' feature until real need comes.
>
> I think this discussion arose, because the question of how flexible we are
> in future to extend this facility. Nothing more.

Exactly and cdev style pushes us into the corner of traditional
cdev with ioctl which I don't think is flexible enough.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-19 23:02                                                   ` Hannes Frederic Sowa
@ 2015-10-20  1:09                                                     ` Alexei Starovoitov
  2015-10-20 10:07                                                       ` Hannes Frederic Sowa
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-20  1:09 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Daniel Borkmann, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/19/15 4:02 PM, Hannes Frederic Sowa wrote:
> I bet commercial software will make use of this ebpf framework, too. And
> the kernel always helped me and gave me a way to see what is going on,
> debug which part of my operating system universe interacts with which
> other part. Merely dropping file descriptors with data attached to them
> in an filesystem seems not to fulfill my need at all. I would love to
> see where resources are referenced and why, like I am nowadays.

agree. common fs with hierarchy will give this visibility in
one place.

>> >It feels you're pushing for cdev only because of that potential
>> >debugging need. Did you actually face that need? I didn't and
>> >don't like to add 'nice to have' feature until real need comes.
> Given that we want to monitor the load of a hashmap for graphing
> purposes. Or liberate some hashmaps from their restriction on the number
> of keys and make upper bounds configurable by admins who know the
> dimensions of their systems, and not by some software buried deep down in
> the bpf syscall where I might not have access to the source code. In tc,
> force e.g. hashmaps to do garbage collection, because we cannot be sure
> that under DoS attacks user-space cleanup gets scheduled early enough
> if ebpf adds flows to hashtables. I do see a need to expand and implement
> some kind of policy in the future.

disagree here. admin should not interfere with map parameters.
What you're proposing above sounds very, very dangerous.
Admins to configure GC of maps? What do you think the programs will do
with such sophisticated maps? What kind of networking app you have
in mind? Anyway that's a bit off-topic. I'm very curious though.

>> >single task in seccomp can have a chain of bpf progs, so hierarchy
>> >is already there.
> And it would be great to inspect them.

again let's not mix criu and lsof-like requirements with 'pin fd'.
For visibility of normal maps we can add fdinfo and lsof
can pick it up without any fs or any cdevs.

> I am fine with creating maps only by bpf syscall. But to hide
> configuration details or at least not be really able to query them
> easily seems odd to me. If we go with the ebpffs how could those
> attributes be added?

I'm not advocating to hide details. Most of the time maps will not be
pinned, so fdinfo seems the easiest way to show things like key_size,
value_size, max_entries, type.
Even if we decide to do it some other way, it's not related to 'pin fd'
discussion, since debugging/visibility is nice to have for all bpf 
objects. Note that walking of key/value without pretty-printers
provided by the app is meaningless for admin, so only things
like 'how much memory this map is using' are useful.

May be we should try to draft the hierarchy of this common fs.
How about:
/sys/kernel/bpf/username/optional_dirs_mkdir_by_user/progX
and 'cat' of it will print the same as fdinfo for normal maps,
so the admin can see which maps were pinned by a user and their cost.

Inside 'fdinfo' output we can provide pointers to which progs
are using which maps as
# cat /sys/kernel/bpf/.../mapX
key_size: 4
used_by: /proc/xxx/fd/5
# cat /sys/kernel/bpf/.../progY
type: socket
using: /proc/xxx/fd/6
using: /sys/kernel/bpf/.../mapZ
and similar for cat /proc/xxx/fdinfo/6
but showing the hierarchy as directories is a non-starter, since
it's not a tree.

All of these would be nice, but they don't have to be implemented
along with the 'pin fd' feature.



* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-20  0:30                                                     ` Alexei Starovoitov
@ 2015-10-20  8:46                                                       ` Daniel Borkmann
  2015-10-20 17:53                                                         ` Alexei Starovoitov
  2015-10-20  9:43                                                       ` Hannes Frederic Sowa
  1 sibling, 1 reply; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-20  8:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Hannes Frederic Sowa, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/20/2015 02:30 AM, Alexei Starovoitov wrote:
> On 10/19/15 3:17 PM, Daniel Borkmann wrote:
>> On 10/19/2015 10:48 PM, Alexei Starovoitov wrote:
>>> On 10/19/15 1:03 PM, Hannes Frederic Sowa wrote:
>>>>
>>>> I doubt it will stay a lightweight feature as it should not be in the
>>>> responsibility of user space to provide those debug facilities.
>>>
>>> It feels we're talking past each other.
>>> I want to solve 'persistent map' problem.
>>> debugging of maps/progs, hierarchy, etc are all nice to have,
>>> but different issues.
>>
>> Ok, so you are saying that this file system will have *only* regular
>> files that are to be considered nodes for eBPF maps and progs. Nothing
>> else will ever be added to it? Yet, eBPF map and prog nodes are *not*
>> *regular files* to me, this seems odd.
>>
>> As soon as you are starting to add additional folders that contain
>> files dumping additional meta data, etc, you basically end up on a
>> bit of similar concept to sysfs, no?
>
> as we discussed in this thread and earlier during plumbers I think
> it would be good to expose key/values somehow in this fs.
> 'how' is a big question.

Yes, it is a big question, and probably best left to the domain-specific
application itself, which can already dump the map nowadays via bpf(2)
syscall. You can add bindings to various languages to make it available
elsewhere as well.

Or, you have a user space 'bpf' tool that can connect to any map that is
being exposed with whatever model, and have modular pretty printers in
user space somewhere located as shared objects, they could get auto-loaded
in the background. Maps could get an annotation attached as an attribute
during creation that is being exposed somewhere, so it can be mapped to
a pretty printer shared object. This would better be solved in user space
entirely, in my opinion, why should the kernel add complexity for this
when this is so much user-space application specific anyway?

As we all agreed, looking into key/values via shell is a rare event and
not needed most of the time. It comes with its own problems (f.e. think
of dumping a possible rhashtable map with key/values as files). But even
if we'd want to stick this into files by all means, fusefs can do this
specific job entirely in user space _plus_ fetching these shared objects
for pretty printers etc, all we need for this is to add this annotation/
mapping attribute somewhere to bpf_maps and that's all it takes.

This question is no doubt independent of the fd pinning mechanism, but as
I said, I don't think sticking this into the kernel is a good idea. Why
would that be the kernel's job?

In the other email, you are mentioning fdinfo. fdinfo can be done for any
map/prog already today by just adding the right .show_fdinfo() callback to
bpf_map_fops and bpf_prog_fops, so we let the anon-inodes that we already
use today to do this job for free and such debugging info can be inspected
through procfs already. This is common practice, f.e. look at timerfd,
signalfd and others.

> But regardless which path we take, sysfs is too rigid.
> For the sake of argument say we do every key as a new file in bpffs.
> It's not very scalable, but comparing to sysfs it's better
> (resource wise).

I doubt this is scalable at all, no matter if it's sysfs or a custom
fs of our own. How would that work? You have a map with possibly thousands or millions
of entries. Are these files to be generated on the fly like in procfs as
soon as you enter that directory? Or as a one-time snapshot (but then
the user might want to create various snapshots)? There might be new
map elements as building blocks in the future such as pipes, ring buffers
etc. How are they being dumped as files?

> If we decide to add bpf syscall command to expose map details we
> can provide pretty printer to it in a form of printk-like string
> or via some schema passed to syscall, so that keys(file names) can
> look properly in bpffs.
> The above and other ideas all possible in bpffs, but not possible
> in sysfs.

So, you'd need schemes for keys /and/ values. They could both be complicated
structures and even within these structures, you might need further domain
specific knowledge to dump stuff properly. Perhaps to make sense of it, a
member of that structure needs further processing, etc. That's why I think,
as mentioned above, it's better done in user space through modular shared
objects helpers. A default set of these .so files for pretty printing common
stuff could be shipped already and for more complex applications, they can
ship through distros their own .so files along with the application. And
developers can hack their own modules together locally, too.

So, neither sysfs nor bpffs would need anything here. Why force this into
the kernel? Even more so, if it's just rarely used anyway?

>>> In case of persistent maps I imagine unprivileged process would want
>>> to use it eventually as well, so this requirement already kills cdev
>>> approach for me, since I don't think we ever let unprivileged apps
>>> create cdev with syscall.
>>
>> Hmm, I see. So far the discussion was only about having this for privileged
>> users (also in this fs patch). F.e. privileged system daemons could set up
>> and distribute progs/maps to consumers, etc. (f.e. seccomp and tc case).
>
> It completely makes sense to restrict it to admin today, but design
> should not prevent relaxing it in the future.
>
>> When we start lifting this, eBPF maps on their own will become a real kernel
>> IPC facility for unprivileged Linux applications (independently of whether
>> they are connected to an actual eBPF program). Those kernel IPC facilities
>> that are anchored in the file system, like named pipes and Unix domain
>> sockets,
>> are indicated as such by special files, no?
>
> not everything in unix is a model that should be followed.
> af_unix with name[0]!=0 is a bad api that wasn't thought through.
> Thankfully Linux improved it with abstract names that don't use
> special files.
> bpf maps obviously are not an IPC mechanism (either pinned or not).

So, if this pinning facility is unprivileged and available for *all*
applications, then applications can in fact use eBPF maps (w/o any
other aids such as Unix domain sockets to transfer fds) among themselves
to exchange state via bpf(2) syscall. It doesn't need a corresponding
program.

>>> sure, then we can force all bpffs to have the same hierarchy and mounted
>>> in /sys/kernel/bpf location. That would be the same.
>>
>> That would imply to have a mount_single() file system (like f.e. tracefs
>> and
>> securityfs), right?
>
> Probably. I'm not sure whether it should be a single fs or whether we allow
> multiple mount points. There are pros and cons for both.
>
>> So you'd lose having various mounts in different namespaces. And if you
>> allow various mount points, how would that /facilitate/ to an admin to
>> identify all eBPF objects/resources currently present in the system?
>
> if it's a single mount point there are no issues, but it would be nice
> to separate users and namespaces somehow.
>
>> Or to an application developer finding possible mount points for his own
>> application so that bpf(2) syscall to create these nodes succeeds? Would
>> you make mounting also unprivileged? What if various distros have these
>> mount points at different locations? What should unprivileged applications
>> do to know that they can use these locations for themselves?
>
> mounting is root-only of course, and having a standard location answers
> all of these questions.

Okay, sure, but then having a mount_single() fs while separating users and
namespaces is still not resolved, as you've noticed.

>> Also, since they are only regular files, one can only try and find these
>> objects based on their naming schemes, which seems to get a bit odd in case
>> this file system also carries (perhaps future?) other regular files that
>> are not eBPF map and program-special.
>
> not sure what you meant. If names are given by the kernel, there is no
> problem finding them. If by the user, we'd use file attributes the way
> you did with xattr.

So, if you distribute the names through the kernel and dictate a strict
hierarchy, then we'll end up with a model similar to what cdevs already provide.

>>> It feels you're pushing for cdev only because of that potential
>>> debugging need. Did you actually face that need? I didn't and
>>> don't like to add 'nice to have' feature until real need comes.
>>
>> I think this discussion arose, because the question of how flexible we are
>> in future to extend this facility. Nothing more.
>
> Exactly and cdev style pushes us into the corner of traditional
> cdev with ioctl which I don't think is flexible enough.

So, as elaborated above in terms of pretty printers and fdinfo examples,
they can all be resolved much more elegantly. So far I still fail to see
the reason why a custom dedicated fs would be better to make the pinning
syscall work, when this can be solved with a better integration into
Linux facilities by using cdevs.

Thanks !


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-20  0:30                                                     ` Alexei Starovoitov
  2015-10-20  8:46                                                       ` Daniel Borkmann
@ 2015-10-20  9:43                                                       ` Hannes Frederic Sowa
  1 sibling, 0 replies; 56+ messages in thread
From: Hannes Frederic Sowa @ 2015-10-20  9:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

Hey Alexei,

On Tue, Oct 20, 2015, at 02:30, Alexei Starovoitov wrote:
> On 10/19/15 3:17 PM, Daniel Borkmann wrote:
> > On 10/19/2015 10:48 PM, Alexei Starovoitov wrote:
> >> On 10/19/15 1:03 PM, Hannes Frederic Sowa wrote:
> >>>
> >>> I doubt it will stay a lightweight feature as it should not be in the
> >>> responsibility of user space to provide those debug facilities.
> >>
> >> It feels we're talking past each other.
> >> I want to solve 'persistent map' problem.
> >> debugging of maps/progs, hierarchy, etc are all nice to have,
> >> but different issues.
> >
> > Ok, so you are saying that this file system will have *only* regular
> > files that are to be considered nodes for eBPF maps and progs. Nothing
> > else will ever be added to it? Yet, eBPF map and prog nodes are *not*
> > *regular files* to me, this seems odd.
> >
> > As soon as you are starting to add additional folders that contain
> > files dumping additional meta data, etc, you basically end up on a
> > bit of similar concept to sysfs, no?
> 
> as we discussed in this thread and earlier during plumbers I think
> it would be good to expose key/values somehow in this fs.
> 'how' is a big question.
> But regardless which path we take, sysfs is too rigid.
> For the sake of argument say we do every key as a new file in bpffs.
> It's not very scalable, but comparing to sysfs it's better
> (resource wise).
> If we decide to add bpf syscall command to expose map details we
> can provide pretty printer to it in a form of printk-like string
> or via some schema passed to syscall, so that keys(file names) can
> look properly in bpffs.
> The above and other ideas all possible in bpffs, but not possible
> in sysfs.

If you come up with a layout (albeit I don't understand how to enforce
it; more on this later), it is kind of a uapi. It is not possible to
just change the layout of the filesystem in every kernel version.

Take f.e. bpffs as Daniel proposed it: how would you extend it to a
key/value store view of a map?

Really, my IPv6 addresses very often have '\0' inside them; if they are
part of a key, I don't see how to represent them in a filename.
That is forbidden by all means.

Adding pretty printer for such a filesystem is something user space
should do. How do you want to handle that? Make printk available from
ebpf programs and ebpf programs bring their own pretty-printer for their
key and values? This really looks like a security hazard. User space is
so much easier and fuse, if you want to have a key/value representation
of your map, why not in user space?

In any case, if a key/value filesystem is needed, I certainly want to
go with Eric Biederman and have one mount point per map. Otherwise I
also can't see how this would work in terms of permissions.

> >> In case of persistent maps I imagine unprivileged process would want
> >> to use it eventually as well, so this requirement already kills cdev
> >> approach for me, since I don't think we ever let unprivileged apps
> >> create cdev with syscall.
> >
> > Hmm, I see. So far the discussion was only about having this for privileged
> > users (also in this fs patch). F.e. privileged system daemons could set up
> > and distribute progs/maps to consumers, etc. (f.e. seccomp and tc case).
> 
> It completely makes sense to restrict it to admin today, but design
> should not prevent relaxing it in the future.

Even today, lots of unprivileged devices are used by users day to
day: sound cards, terminals, etc. This is also possible with cdevs.

> > When we start lifting this, eBPF maps on their own will become a real kernel
> > IPC facility for unprivileged Linux applications (independently of whether
> > they are connected to an actual eBPF program). Those kernel IPC facilities
> > that are anchored in the file system, like named pipes and Unix domain
> > sockets,
> > are indicated as such by special files, no?
> 
> not everything in unix is a model that should be followed.
> af_unix with name[0]!=0 is a bad api that wasn't thought through.
> Thankfully Linux improved it with abstract names that don't use
> special files.
> bpf maps obviously are not an IPC mechanism (either pinned or not).

Sure, it is, IPC between ebpf programs and some kind of user space
controlling application.

> >> sure, then we can force all bpffs to have the same hierarchy and mounted
> >> in /sys/kernel/bpf location. That would be the same.
> >
> > That would imply to have a mount_single() file system (like f.e. tracefs
> > and
> > securityfs), right?
> 
> Probably. I'm not sure whether it should be a single fs or whether we allow
> multiple mount points. There are pros and cons for both.

Multiple mount points actually seem dangerous to me. Especially if fds
get pinned into multiple of those filesystems.

> > So you'd lose having various mounts in different namespaces. And if you
> > allow various mount points, how would that /facilitate/ to an admin to
> > identify all eBPF objects/resources currently present in the system?
> 
> if it's a single mount point there are no issues, but it would be nice
> to separate users and namespaces somehow.
> 
> > Or to an application developer finding possible mount points for his own
> > application so that bpf(2) syscall to create these nodes succeeds? Would
> > you make mounting also unprivileged? What if various distros have these
> > mount points at different locations? What should unprivileged applications
> > do to know that they can use these locations for themselves?
> 
> mounting is root-only of course, and having a standard location answers
> all of these questions.

Then we miss the separation of users and namespaces we were talking
about above. This also needs some way of governing user-space
applications; I don't believe users will conform to the "you should first
add a directory for your user, then for your program" convention. The
cgroups layout kind of first got "standardized" by systemd.

> > Also, since they are only regular files, one can only try and find these
> > objects based on their naming schemes, which seems to get a bit odd in case
> > this file system also carries (perhaps future?) other regular files that
> > are not eBPF map and program-special.
> 
> not sure what you meant. If names are given by the kernel, there is no
> problem finding them. If by the user, we'd use file attributes the way
> you did with xattr.
> 
> >> It feels you're pushing for cdev only because of that potential
> >> debugging need. Did you actually face that need? I didn't and
> >> don't like to add 'nice to have' feature until real need comes.
> >
> > I think this discussion arose, because the question of how flexible we are
> > in future to extend this facility. Nothing more.
> 
> Exactly and cdev style pushes us into the corner of traditional
> cdev with ioctl which I don't think is flexible enough.

But we don't need ioctls! ;)

Bye,
Hannes


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-20  1:09                                                     ` Alexei Starovoitov
@ 2015-10-20 10:07                                                       ` Hannes Frederic Sowa
  2015-10-20 18:44                                                         ` Alexei Starovoitov
  0 siblings, 1 reply; 56+ messages in thread
From: Hannes Frederic Sowa @ 2015-10-20 10:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

Hello Alexei,

On Tue, Oct 20, 2015, at 03:09, Alexei Starovoitov wrote:
> On 10/19/15 4:02 PM, Hannes Frederic Sowa wrote:
> > I bet commercial software will make use of this ebpf framework, too. And
> > the kernel always helped me and gave me a way to see what is going on,
> > debug which part of my operating system universe interacts with which
> > other part. Merely dropping file descriptors with data attached to them
> > in an filesystem seems not to fulfill my need at all. I would love to
> > see where resources are referenced and why, like I am nowadays.
> 
> agree. common fs with hierarchy will give this visibility in
> one place.
> 
> >> >It feels you're pushing for cdev only because of that potential
> >> >debugging need. Did you actually face that need? I didn't and
> >> >don't like to add 'nice to have' feature until real need comes.
> > Given that we want to monitor the load of a hashmap for graphing
> > purposes. Or liberate some hashmaps from their restriction on the number
> > of keys and make upper bounds configurable by admins who know the
> > dimensions of their systems, and not by some software buried deep down in
> > the bpf syscall where I might not have access to the source code. In tc,
> > force e.g. hashmaps to do garbage collection, because we cannot be sure
> > that under DoS attacks user-space cleanup gets scheduled early enough
> > if ebpf adds flows to hashtables. I do see a need to expand and implement
> > some kind of policy in the future.
> 
> disagree here. admin should not interfere with map parameters.
> What you're proposing above sounds very, very dangerous.
> Admins to configure GC of maps? What do you think the programs will do
> with such sophisticated maps? What kind of networking app you have
> in mind? Anyway that's a bit off-topic. I'm very curious though.

<off-topic>
Just a pretty obvious idea is accurate sampling of flows.
</off-topic>

> >> >single task in seccomp can have a chain of bpf progs, so hierarchy
> >> >is already there.
> > And it would be great to inspect them.
> 
> again let's not mix criu and lsof-like requirements with 'pin fd'.
> For visibility of normal maps we can add fdinfo and lsof
> can pick it up without any fs or any cdevs.

fdinfo tells me where my position in a file is and which locks the file
has? Nothing like that is supposed to work on bpf file descriptors,
because they are kind of special. A new hierarchy has to be installed
alongside fdinfo/.

> > I am fine with creating maps only by bpf syscall. But to hide
> > configuration details or at least not be really able to query them
> > easily seems odd to me. If we go with the ebpffs how could those
> > attributes be added?
> 
> I'm not advocating to hide details. Most of the time maps will not be
> pinned, so fdinfo seems the easiest way to show things like key_size,
> value_size, max_entries, type.

This is an argument in favor of the "fdinfo-like" approach.

So far, if someone wants to delve into the details of a map, my approach
would be to take the file descriptor and make it persistent. I have to
think about that some more.

> Even if we decide to do it some other way, it's not related to 'pin fd'
> discussion, since debugging/visibility is nice to have for all bpf 
> objects. Note that walking of key/value without pretty-printers
> provided by the app is meaningless for admin, so only things
> like 'how much memory this map is using' are useful.

Yes, absolutely, and I am against pretty-printing key/values
in the kernel domain.

> May be we should try to draft the hierarchy of this common fs.
> How about:
> /sys/kernel/bpf/username/optional_dirs_mkdir_by_user/progX
> and 'cat' of it will print the same as fdinfo for normal maps,
> so the admin can see which maps were pinned by a user and their cost.

So cat-ing them will produce text output with some details about the
map? This is what I wanted to avoid. The concept with symlinks and small
files seems much cleaner and nicer to me. Also, you cannot add writable
attributes to this filesystem without overloading stuff heavily, can you?

> Inside 'fdinfo' output we can provide pointers to which progs
> are using which maps as
> # cat /sys/kernel/bpf/.../mapX
> key_size: 4
> used_by: /proc/xxx/fd/5
> # cat /sys/kernel/bpf/.../progY
> type: socket
> using: /proc/xxx/fd/6
> using: /sys/kernel/bpf/.../mapZ
> and similar for cat /proc/xxx/fdinfo/6
> but showing the hierarchy as directories is a non-starter, since
> it's not a tree.

It is not a tree but a graph, sure; that's why sysfs allows breaking the
cyclic dependencies and creating symlinks (see holders/ directories). ;)

And if you implement the same set of features, IMHO you basically
re-implement sysfs. In the beginning we just expose the basic maps and
there won't be any features in sysfs, but it will be cheap to have
read/write flags on maps etc. (I don't know yet what people will come
up with.) In my opinion those are clearly attributes of a map and
should be defined and managed alongside their holders.

> All of these would be nice, but they don't have to be implemented
> along with the 'pin fd' feature.

The pin-fd feature will provide the future infrastructure to
make this usable, so I think it is worth spending time thinking about
it.

Thanks,
Hannes


* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-20  8:46                                                       ` Daniel Borkmann
@ 2015-10-20 17:53                                                         ` Alexei Starovoitov
  2015-10-20 18:56                                                           ` Eric W. Biederman
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-20 17:53 UTC (permalink / raw)
  To: Daniel Borkmann, Hannes Frederic Sowa, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/20/15 1:46 AM, Daniel Borkmann wrote:
>> as we discussed in this thread and earlier during plumbers I think
>> it would be good to expose key/values somehow in this fs.
>> 'how' is a big question.
>
> Yes, it is a big question, and probably best left to the domain-specific
> application itself, which can already dump the map nowadays via bpf(2)
> syscall. You can add bindings to various languages to make it available
> elsewhere as well.
>
> Or, you have a user space 'bpf' tool that can connect to any map that is
> being exposed with whatever model, and have modular pretty printers in
> user space somewhere located as shared objects, they could get auto-loaded
> in the background. Maps could get an annotation attached as an attribute
> during creation that is being exposed somewhere, so it can be mapped to
> a pretty printer shared object. This would better be solved in user space
> entirely, in my opinion, why should the kernel add complexity for this
> when this is so much user-space application specific anyway?
>
> As we all agreed, looking into key/values via shell is a rare event and
> not needed most of the time. It comes with its own problems (f.e. think
> of dumping a possible rhashtable map with key/values as files). But even
> if we'd want to stick this into files by all means, fusefs can do this
> specific job entirely in user space _plus_ fetching these shared objects
> for pretty printers etc, all we need for this is to add this annotation/
> mapping attribute somewhere to bpf_maps and that's all it takes.
>
> This question is no doubt independent of the fd pinning mechanism, but as
> I said, I don't think sticking this into the kernel is a good idea. Why
> would that be the kernel's job?

agree with all of the concerns above. I said it would be good for
the kernel to expose key/values and I still think it would be a useful
feature. Regardless of whether the kernel does it or not in the future,
the point was 'IF we want the kernel to do it, then bpf FS is the right way'.

> In the other email, you are mentioning fdinfo. fdinfo can be done for any
> map/prog already today by just adding the right .show_fdinfo() callback to
> bpf_map_fops and bpf_prog_fops, so we let the anon-inodes that we already
> use today to do this job for free and such debugging info can be inspected
> through procfs already. This is common practice, f.e. look at timerfd,
> signalfd and others.

I know. That's exactly what I proposed, but again the point was
that fdinfo of regular FDs should match the style of pinned FDs:
'cat /sys/kernel/bpf/.../map5' should be similar to
'cat /proc/.../fdinfo/5',
and 'cat /sys/kernel/bpf...' you can only do cleanly with bpffs.

>> But regardless which path we take, sysfs is too rigid.
>> For the sake of argument say we do every key as a new file in bpffs.
>> It's not very scalable, but comparing to sysfs it's better
>> (resource wise).
>
> I doubt this is scalable at all, no matter if it's sysfs or our own custom
> fs. How would that work? You have a map with possibly thousands or
> millions of entries. Are these files to be generated on the fly like in
> procfs as soon as you enter that directory? Or as a one-time snapshot
> (but then the user might want to create various snapshots)? There might
> be new map elements as building blocks in the future such as pipes, ring
> buffers etc. How are they being dumped as files?

you're arguing that keys as files are not scalable. sure.
See what I said above: "it's not very scalable".
The point is that the fs approach is more flexible compared to cdev.

>> not everything in unix is a model that should be followed.
>> af_unix with name[0]!=0 is a bad api that wasn't thought through.
>> Thankfully Linux improved it with abstract names that don't use
>> special files.
>> bpf maps obviously is not an IPC (either pinned or not).
>
> So, if this pinning facility is unprivileged and available for *all*
> applications, then applications can in-fact use eBPF maps (w/o any
> other aids such as Unix domain sockets to transfer fds) among themselves
> to exchange state via bpf(2) syscall. It doesn't need a corresponding
> program.

Obviously I know that, but it doesn't make it an IPC.
Just because two processes can talk to each other via normal TCP/IP it
doesn't make TCP/IP an IPC mechanism.
The point is "just because two processes can communicate with each
other via X (bpf maps) we are not going to optimize (or make
architectural decisions in X) just for this use case". It's a job of
generic IPC and we have enough of them already.

> Okay, sure, but then having a mount_single() and separating users and
> namespaces is still not being resolved, as you've noticed.

yes and that's what I proposed to do:
Tweaking this FS patch to do mount_single() and define a directory
structure is the best way forward.

> So, if you distribute the names through the kernel and dictate a strict
> hierarchy, then we'll end up with a similar model that cdevs resolve.

yes. exactly.
but compared to cdev, it will be:
- cheaper for kernel to keep (memory wise)
- faster to pin FDs
- do normal 'rm' to destroy
- possible to extend to unprivileged users
- possible to add fdinfo (same output for pinned and normal fd)
- possible to expose key/value

I'm puzzled how you can keep arguing in favor of cdev when it's
obviously deficient compared to fs and fs has no disadvantages.
Looks like we can only resolve it over beer.
How about we set up a public hangout? Today or tomorrow?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-20 10:07                                                       ` Hannes Frederic Sowa
@ 2015-10-20 18:44                                                         ` Alexei Starovoitov
  0 siblings, 0 replies; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-20 18:44 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Daniel Borkmann, Eric W. Biederman
  Cc: davem, viro, tgraf, netdev, linux-kernel, Alexei Starovoitov

On 10/20/15 3:07 AM, Hannes Frederic Sowa wrote:
> <off-topic>
> Just a pretty obvious idea is accurate sampling of flows.
> </off-topic>

ok, so you want to time out flows. Makes sense, but it should be
done by user space with little or no help from the kernel.

> fdinfo tells me where my position in a file is and which locks the file
> have?

obviously not. see the example fdinfo from the other email.

> So far, if someone wants to delve into the details of a map my approach
> would be to take the file descriptor and make it persistent. I have to
> think about that some more.

nope. you cannot do that. admin should never interfere with running
process this way.

> Yes, absolutely and I am absolutely against pretty printing key values
> in kernel domain.

let's table that part. I think it can be useful, but it's
irrelevant for this discussion.

> So cat-ing them will produce text output with some details about the
> map? This is what I wanted to avoid. The concept with symlinks and small
> files seems much cleaner and nicer to me. Also you cannot add writable
> attributes to this filesystem or you overload stuff heavily?

nope. no writable stuff. fdinfo is read-only.

> It is not a tree but a graph, sure, that's why sysfs allows to break the
> cyclic dependencies and create symlinks (see holders/ directories). ;)

that's an obvious example of another resource waste.
You can do that for real devices, but for thousands of maps and programs
it is really a waste.

> And if you implement the same set of features IMHO you basically
> re-implement sysfs. In the beginning we just expose the basic maps and
> there won't be any features in sysfs, but it will be cheap to have
> read/write flags on maps etc. etc. (I don't know what people will come
> up with, yet.). In my opinion those are clearly attributes of a map and
> should be defined and managed alongside with their holders.

nope. bpf syscall is the only interface to access maps.
if we expose them in bpffs it will be read-only for debugging only.

> The pinfd feature will provide the future infrastructure alongside to
> make this usable, so I think it is worth spending time to think about
> it.

yes. but since we're going in circles, let's have a 'beer call' to
resolve it :)


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-20 17:53                                                         ` Alexei Starovoitov
@ 2015-10-20 18:56                                                           ` Eric W. Biederman
  2015-10-21 15:17                                                             ` Daniel Borkmann
  0 siblings, 1 reply; 56+ messages in thread
From: Eric W. Biederman @ 2015-10-20 18:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Hannes Frederic Sowa, davem, viro, tgraf,
	netdev, linux-kernel, Alexei Starovoitov

Alexei Starovoitov <ast@plumgrid.com> writes:

> On 10/20/15 1:46 AM, Daniel Borkmann wrote:
>>> as we discussed in this thread and earlier during plumbers I think
>>> it would be good to expose key/values somehow in this fs.
>>> 'how' is a big question.
>>
>> Yes, it is a big question, and probably best left to the domain-specific
>> application itself, which can already dump the map nowadays via bpf(2)
>> syscall. You can add bindings to various languages to make it available
>> elsewhere as well.
>>
>> Or, you have a user space 'bpf' tool that can connect to any map that is
>> being exposed with whatever model, and have modular pretty printers in
>> user space somewhere located as shared objects, they could get auto-loaded
>> in the background. Maps could get an annotation attached as an attribute
>> during creation that is being exposed somewhere, so it can be mapped to
>> a pretty printer shared object. This would better be solved in user space
>> entirely, in my opinion, why should the kernel add complexity for this
>> when this is so much user-space application specific anyway?
>>
>> As we all agreed, looking into key/values via shell is a rare event and
>> not needed most of the time. It comes with its own problems (f.e. think
>> of dumping a possible rhashtable map with key/values as files). But even
>> if we'd want to stick this into files by all means, fusefs can do this
>> specific job entirely in user space _plus_ fetching these shared objects
>> for pretty printers etc, all we need for this is to add this annotation/
>> mapping attribute somewhere to bpf_maps and that's all it takes.
>>
>> This question is no doubt independent of the fd pinning mechanism, but as
>> I said, I don't think sticking this into the kernel is a good idea. Why
>> would that be the kernel's job?
>
> agree with all of the concerns above. I said it would be good for
> kernel to expose key/values and I still think it would be a useful
> feature. Regardless whether kernel does it or not in the future,
> the point was 'IF we want kernel to do it then bpf FS is the right way'.
>
>> In the other email, you are mentioning fdinfo. fdinfo can be done for any
>> map/prog already today by just adding the right .show_fdinfo() callback to
>> bpf_map_fops and bpf_prog_fops, so we let the anon-inodes that we already
>> use today to do this job for free and such debugging info can be inspected
>> through procfs already. This is common practice, f.e. look at timerfd,
>> signalfd and others.
>
> I know. That's exactly what I proposed, but again the point was
> that fdinfo of regular FDs should match in style to pinned FDs,
> 'cat /sys/kernel/bpf/.../map5' should be similar to
> 'cat /proc/.../fdinfo/5'
> and 'cat /sys/kernel/bpf...' you can only cleanly do with bpffs.
>
>>> But regardless which path we take, sysfs is too rigid.
>>> For the sake of argument say we do every key as a new file in bpffs.
>>> It's not very scalable, but compared to sysfs it's better
>>> (resource wise).
>>
>> I doubt this is scalable at all, no matter if it's sysfs or our own custom
>> fs. How would that work? You have a map with possibly thousands or
>> millions of entries. Are these files to be generated on the fly like in
>> procfs as soon as you enter that directory? Or as a one-time snapshot
>> (but then the user might want to create various snapshots)? There might
>> be new map elements as building blocks in the future such as pipes, ring
>> buffers etc. How are they being dumped as files?
>
> you're arguing that keys as files are not scalable. sure.
> See what I said above: "it's not very scalable".
> The point is that the fs approach is more flexible compared to cdev.
>
>>> not everything in unix is a model that should be followed.
>>> af_unix with name[0]!=0 is a bad api that wasn't thought through.
>>> Thankfully Linux improved it with abstract names that don't use
>>> special files.
>>> bpf maps obviously is not an IPC (either pinned or not).
>>
>> So, if this pinning facility is unprivileged and available for *all*
>> applications, then applications can in-fact use eBPF maps (w/o any
>> other aids such as Unix domain sockets to transfer fds) among themselves
>> to exchange state via bpf(2) syscall. It doesn't need a corresponding
>> program.
>
> Obviously I know that, but it doesn't make it an IPC.
> Just because two processes can talk to each other via normal TCP/IP it
> doesn't make TCP/IP an IPC mechanism.
> The point is "just because two processes can communicate with each
> other via X (bpf maps) we are not going to optimize (or make
> architectural decisions in X) just for this use case". It's a job of
> generic IPC and we have enough of them already.
>
>> Okay, sure, but then having a mount_single() and separating users and
>> namespaces is still not being resolved, as you've noticed.
>
> yes and that's what I proposed to do:
> Tweaking this FS patch to do mount_single() and define a directory
> structure is the best way forward.
>
>> So, if you distribute the names through the kernel and dictate a strict
>> hierarchy, then we'll end up with a similar model that cdevs resolve.
>
> yes. exactly.
> but compared to cdev, it will be:
> - cheaper for kernel to keep (memory wise)
> - faster to pin FDs
> - do normal 'rm' to destroy
> - possible to extend to unprivileged users
> - possible to add fdinfo (same output for pinned and normal fd)
> - possible to expose key/value
>
> I'm puzzled how you can keep arguing in favor of cdev when it's
> obviously deficient compared to fs and fs has no disadvantages.
> Looks like we can only resolve it over beer.
> How about we set up a public hangout? Today or tomorrow?

Just FYI:  Using a device for this kind of interface is pretty
much a non-starter as that quickly gets you into situations where
things do not work in containers.  If someone gets a version of device
namespaces past GregKH it might be up for discussion to use character
devices.

But really device nodes are a technology that is slowly being changed to
support hotplug.  Nothing you are doing seems to match up well with
devices.  So for an interface that you want ordinary applications to
use, character devices are a bad, bad fit.

Eric


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-20 18:56                                                           ` Eric W. Biederman
@ 2015-10-21 15:17                                                             ` Daniel Borkmann
  2015-10-21 18:34                                                               ` Thomas Graf
  2015-10-22 19:35                                                               ` Eric W. Biederman
  0 siblings, 2 replies; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-21 15:17 UTC (permalink / raw)
  To: Eric W. Biederman, Alexei Starovoitov
  Cc: Hannes Frederic Sowa, davem, viro, tgraf, netdev, linux-kernel,
	Alexei Starovoitov

On 10/20/2015 08:56 PM, Eric W. Biederman wrote:
...
> Just FYI:  Using a device for this kind of interface is pretty
> much a non-starter as that quickly gets you into situations where
> things do not work in containers.  If someone gets a version of device
> namespaces past GregKH it might be up for discussion to use character
> devices.

Okay, you are referring to this discussion here:

   http://thread.gmane.org/gmane.linux.kernel.containers/26760

What had been mentioned earlier in this thread was a namespace
pass-through facility enforced through the device cgroups we have in
the kernel, which is one of various means deployment systems such as
Docker use to enforce policy today. But more below.

I think this all depends on the kind of expectations we have and where
all this is going. In the original proposal, it was agreed to have the
operation that creates a node as 'capable(CAP_SYS_ADMIN)'-only (in the
way most of the rest of eBPF is restricted), and based on the use
case we distribute such objects to unprivileged applications. But I
understand that the trend lately seems to be to lift eBPF restrictions
at some point anyway, and thus the CAP_SYS_ADMIN restriction is
suddenly irrelevant again. Fair enough.

Don't get me wrong, I really don't mind if it will be some version of
this fs patch or whatever other architecture we find consensus on, I
think this discussion is merely trying to evaluate what seems to be a
good fit, also in terms of future requirements and integration.

So far, during this discussion, it was proposed to modify the file system
to a single-mount one and to stick this under /sys/kernel/bpf/. This
will not have "real" namespace support either, but it was proposed to
have the following structure:

   /sys/kernel/bpf/username/<optional_dirs_mkdir_by_user>/progX

So, the file system will have a kind of per-user home directory to
isolate users through permissions, if I understood correctly.

If we really want to go this route, then I think there are no big
obstacles in the way for the other model either. A rough draft would
look like the below.

Together with device cgroups for containers, it would allow scenarios where
you can have:

   * eBPF (map/prog) device pass-through, so a map/prog could even be shared
     out from the initial namespace into individual ones or all of them (such
     maps could possibly be exposed read-only to these consumers).
   * eBPF device creation for unprivileged users with permissions being set
     accordingly (as in fs case).
   * Since cgroup controller can also do wildcards on major/minors, we could
     make that further fine-grained.
   * eBPF device creation can also be enforced by the cgroup controller to be
     entirely disallowed for a specific container.

(An admin can determine the dynamically created major f.e. under /proc/devices.)
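Such a policy would be expressed with the usual device cgroup whitelist entries, e.g. (editor's sketch of a config fragment; the majors 248/249 below are purely hypothetical, since the real ones are allocated dynamically):

```shell
# Allow a container read-only access to all bpf map devices
# (char major 248 assumed here):
echo 'c 248:* r' > /sys/fs/cgroup/devices/container1/devices.allow
# Deny any access to bpf prog devices (major 249 assumed):
echo 'c 249:* rwm' > /sys/fs/cgroup/devices/container1/devices.deny
```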

FWIW, here's a drafted diff on the idea:

  (https://git.breakpoint.cc/cgit/dborkman/net-next.git/log/?h=ebpf-fds-final6)

  drivers/base/core.c      |  38 +++-
  include/linux/bpf.h      |  39 ++++-
  include/linux/device.h   |  10 +-
  include/uapi/linux/bpf.h |  45 +----
  kernel/bpf/Makefile      |   4 +-
  kernel/bpf/core.c        |   3 +-
  kernel/bpf/device.c      | 441 +++++++++++++++++++++++++++++++++++++++++++++++
  kernel/bpf/syscall.c     |  52 +++++-
  mm/backing-dev.c         |   3 +-
  9 files changed, 567 insertions(+), 68 deletions(-)
  create mode 100644 kernel/bpf/device.c

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 334ec7e..11721c8 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1678,7 +1678,8 @@ static void device_create_release(struct device *dev)

  static struct device *
  device_create_groups_vargs(struct class *class, struct device *parent,
-			   dev_t devt, void *drvdata,
+			   dev_t devt, const struct device_type *type,
+			   void *drvdata,
  			   const struct attribute_group **groups,
  			   const char *fmt, va_list args)
  {
@@ -1697,6 +1698,7 @@ device_create_groups_vargs(struct class *class, struct device *parent,
  	device_initialize(dev);
  	dev->devt = devt;
  	dev->class = class;
+	dev->type = type;
  	dev->parent = parent;
  	dev->groups = groups;
  	dev->release = device_create_release;
@@ -1743,11 +1745,11 @@ error:
   * been created with a call to class_create().
   */
  struct device *device_create_vargs(struct class *class, struct device *parent,
-				   dev_t devt, void *drvdata, const char *fmt,
-				   va_list args)
+				   dev_t devt, const struct device_type *type,
+				   void *drvdata, const char *fmt, va_list args)
  {
-	return device_create_groups_vargs(class, parent, devt, drvdata, NULL,
-					  fmt, args);
+	return device_create_groups_vargs(class, parent, devt, type, drvdata,
+					  NULL, fmt, args);
  }
  EXPORT_SYMBOL_GPL(device_create_vargs);

@@ -1782,12 +1784,31 @@ struct device *device_create(struct class *class, struct device *parent,
  	struct device *dev;

  	va_start(vargs, fmt);
-	dev = device_create_vargs(class, parent, devt, drvdata, fmt, vargs);
+	dev = device_create_vargs(class, parent, devt, NULL, drvdata,
+				  fmt, vargs);
  	va_end(vargs);
+
  	return dev;
  }
  EXPORT_SYMBOL_GPL(device_create);

+/* XXX document */
+struct device *device_create_type(struct class *class, struct device *parent,
+				  dev_t devt, const struct device_type *type,
+				  void *drvdata, const char *fmt, ...)
+{
+	va_list vargs;
+	struct device *dev;
+
+	va_start(vargs, fmt);
+	dev = device_create_vargs(class, parent, devt, type, drvdata,
+				  fmt, vargs);
+	va_end(vargs);
+
+	return dev;
+}
+EXPORT_SYMBOL_GPL(device_create_type);
+
  /**
   * device_create_with_groups - creates a device and registers it with sysfs
   * @class: pointer to the struct class that this device should be registered to
@@ -1825,9 +1846,10 @@ struct device *device_create_with_groups(struct class *class,
  	struct device *dev;

  	va_start(vargs, fmt);
-	dev = device_create_groups_vargs(class, parent, devt, drvdata, groups,
-					 fmt, vargs);
+	dev = device_create_groups_vargs(class, parent, devt, NULL, drvdata,
+					 groups, fmt, vargs);
  	va_end(vargs);
+
  	return dev;
  }
  EXPORT_SYMBOL_GPL(device_create_with_groups);
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0ae6f77..8476911 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -8,8 +8,13 @@
  #define _LINUX_BPF_H 1

  #include <uapi/linux/bpf.h>
+
  #include <linux/workqueue.h>
  #include <linux/file.h>
+#include <linux/cdev.h>
+
+/* BPF flags. */
+#define BPF_F_HAS_DEV	(1 << 0)

  struct bpf_map;

@@ -31,15 +36,21 @@ struct bpf_map_ops {
  };

  struct bpf_map {
-	atomic_t refcnt;
+	const struct bpf_map_ops *ops;
+	struct user_struct *user;
  	enum bpf_map_type map_type;
  	u32 key_size;
  	u32 value_size;
  	u32 max_entries;
  	u32 pages;
-	struct user_struct *user;
-	const struct bpf_map_ops *ops;
-	struct work_struct work;
+	int minor;
+	atomic_t refcnt;
+	u32 flags;
+	union {
+		struct work_struct work;
+		struct mutex m_lock;
+	};
+	struct cdev cdev;
  };

  struct bpf_map_type_list {
@@ -125,16 +136,20 @@ struct bpf_prog_type_list {
  };

  struct bpf_prog_aux {
-	atomic_t refcnt;
-	u32 used_map_cnt;
  	const struct bpf_verifier_ops *ops;
-	struct bpf_map **used_maps;
-	struct bpf_prog *prog;
  	struct user_struct *user;
+	struct bpf_prog *prog;
+	struct bpf_map **used_maps;
+	u32 used_map_cnt;
+	int minor;
+	atomic_t refcnt;
+	u32 flags;
  	union {
+		struct mutex p_lock;
  		struct work_struct work;
  		struct rcu_head	rcu;
  	};
+	struct cdev cdev;
  };

  struct bpf_array {
@@ -167,11 +182,19 @@ struct bpf_prog *bpf_prog_get(u32 ufd);
  void bpf_prog_put(struct bpf_prog *prog);
  void bpf_prog_put_rcu(struct bpf_prog *prog);

+struct bpf_map *bpf_map_get(u32 ufd);
  struct bpf_map *__bpf_map_get(struct fd f);
  void bpf_map_put(struct bpf_map *map);

  extern int sysctl_unprivileged_bpf_disabled;

+int __bpf_dev_create(__u32 ufd);
+int __bpf_dev_destroy(__u32 ufd);
+int __bpf_dev_connect(__u32 ufd);
+
+int bpf_map_new_fd(struct bpf_map *map);
+int bpf_prog_new_fd(struct bpf_prog *prog);
+
  /* verify correctness of eBPF program */
  int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
  #else
diff --git a/include/linux/device.h b/include/linux/device.h
index 5d7bc63..a9a3360 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1038,15 +1038,19 @@ extern int __must_check device_reprobe(struct device *dev);
  /*
   * Easy functions for dynamically creating devices on the fly
   */
-extern __printf(5, 0)
+extern __printf(6, 0)
  struct device *device_create_vargs(struct class *cls, struct device *parent,
-				   dev_t devt, void *drvdata,
-				   const char *fmt, va_list vargs);
+				   dev_t devt, const struct device_type *type,
+				   void *drvdata, const char *fmt, va_list vargs);
  extern __printf(5, 6)
  struct device *device_create(struct class *cls, struct device *parent,
  			     dev_t devt, void *drvdata,
  			     const char *fmt, ...);
  extern __printf(6, 7)
+struct device *device_create_type(struct class *class, struct device *parent,
+				  dev_t devt, const struct device_type *type,
+				  void *drvdata, const char *fmt, ...);
+extern __printf(6, 7)
  struct device *device_create_with_groups(struct class *cls,
  			     struct device *parent, dev_t devt, void *drvdata,
  			     const struct attribute_group **groups,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 564f1f0..55e5aad 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,50 +63,17 @@ struct bpf_insn {
  	__s32	imm;		/* signed immediate constant */
  };

-/* BPF syscall commands */
+/* BPF syscall commands, see bpf(2) man-page for details. */
  enum bpf_cmd {
-	/* create a map with given type and attributes
-	 * fd = bpf(BPF_MAP_CREATE, union bpf_attr *, u32 size)
-	 * returns fd or negative error
-	 * map is deleted when fd is closed
-	 */
  	BPF_MAP_CREATE,
-
-	/* lookup key in a given map
-	 * err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value
-	 * returns zero and stores found elem into value
-	 * or negative error
-	 */
  	BPF_MAP_LOOKUP_ELEM,
-
-	/* create or update key/value pair in a given map
-	 * err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value, attr->flags
-	 * returns zero or negative error
-	 */
  	BPF_MAP_UPDATE_ELEM,
-
-	/* find and delete elem by key in a given map
-	 * err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key
-	 * returns zero or negative error
-	 */
  	BPF_MAP_DELETE_ELEM,
-
-	/* lookup key in a given map and return next key
-	 * err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->next_key
-	 * returns zero and stores next key or negative error
-	 */
  	BPF_MAP_GET_NEXT_KEY,
-
-	/* verify and load eBPF program
-	 * prog_fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size)
-	 * Using attr->prog_type, attr->insns, attr->license
-	 * returns fd or negative error
-	 */
  	BPF_PROG_LOAD,
+	BPF_DEV_CREATE,
+	BPF_DEV_DESTROY,
+	BPF_DEV_CONNECT,
  };

  enum bpf_map_type {
@@ -160,6 +127,10 @@ union bpf_attr {
  		__aligned_u64	log_buf;	/* user supplied buffer */
  		__u32		kern_version;	/* checked when prog_type=kprobe */
  	};
+
+	struct { /* anonymous struct used by BPF_DEV_* commands */
+		__u32		fd;
+	};
  } __attribute__((aligned(8)));

  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e6983be..f871ca6 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,2 +1,4 @@
  obj-y := core.o
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o arraymap.o helpers.o
+
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o device.o helpers.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8086471..334b1bd 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -92,6 +92,7 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)

  	fp->pages = size / PAGE_SIZE;
  	fp->aux = aux;
+	fp->aux->prog = fp;

  	return fp;
  }
@@ -116,6 +117,7 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,

  		memcpy(fp, fp_old, fp_old->pages * PAGE_SIZE);
  		fp->pages = size / PAGE_SIZE;
+		fp->aux->prog = fp;

  		/* We keep fp->aux from fp_old around in the new
  		 * reallocated structure.
@@ -726,7 +728,6 @@ void bpf_prog_free(struct bpf_prog *fp)
  	struct bpf_prog_aux *aux = fp->aux;

  	INIT_WORK(&aux->work, bpf_prog_free_deferred);
-	aux->prog = fp;
  	schedule_work(&aux->work);
  }
  EXPORT_SYMBOL_GPL(bpf_prog_free);
diff --git a/kernel/bpf/device.c b/kernel/bpf/device.c
new file mode 100644
index 0000000..711c9a4
--- /dev/null
+++ b/kernel/bpf/device.c
@@ -0,0 +1,441 @@
+/*
+ * Special file backend for persistent eBPF maps and programs, used by
+ * bpf() system call.
+ *
+ * (C) 2015 Daniel Borkmann <daniel@iogearbox.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+#include <linux/idr.h>
+#include <linux/mutex.h>
+#include <linux/device_cgroup.h>
+#include <linux/cdev.h>
+
+#define BPF_MAX_DEVS	(1UL << MINORBITS)
+#define BPF_MODE_DEF	(S_IRUSR | S_IWUSR)
+
+enum bpf_type {
+	BPF_TYPE_PROG,
+	BPF_TYPE_MAP,
+};
+
+static struct class *bpf_class;
+
+static dev_t bpf_map_devt;
+static DEFINE_IDR(bpf_map_idr);
+static DEFINE_MUTEX(bpf_map_idr_lock);
+
+static dev_t bpf_prog_devt;
+static DEFINE_IDR(bpf_prog_idr);
+static DEFINE_MUTEX(bpf_prog_idr_lock);
+
+static int bpf_map_get_minor(struct bpf_map *map)
+{
+	int minor;
+
+	mutex_lock(&bpf_map_idr_lock);
+	minor = idr_alloc(&bpf_map_idr, map, 0, BPF_MAX_DEVS, GFP_KERNEL);
+	mutex_unlock(&bpf_map_idr_lock);
+
+	return minor;
+}
+
+static void bpf_map_put_minor(const struct bpf_map *map)
+{
+	mutex_lock(&bpf_map_idr_lock);
+	idr_remove(&bpf_map_idr, map->minor);
+	mutex_unlock(&bpf_map_idr_lock);
+}
+
+static int bpf_prog_get_minor(struct bpf_prog *prog)
+{
+	int minor;
+
+	mutex_lock(&bpf_prog_idr_lock);
+	minor = idr_alloc(&bpf_prog_idr, prog, 0, BPF_MAX_DEVS, GFP_KERNEL);
+	mutex_unlock(&bpf_prog_idr_lock);
+
+	return minor;
+}
+
+static void bpf_prog_put_minor(const struct bpf_prog *prog)
+{
+	mutex_lock(&bpf_prog_idr_lock);
+	idr_remove(&bpf_prog_idr, prog->aux->minor);
+	mutex_unlock(&bpf_prog_idr_lock);
+}
+
+static int bpf_map_open(struct inode *inode, struct file *filep)
+{
+	filep->private_data = container_of(inode->i_cdev,
+					   struct bpf_map, cdev);
+	return 0;
+}
+
+static const struct file_operations bpf_dev_map_fops = {
+	.owner		= THIS_MODULE,
+	.open		= bpf_map_open,
+	.llseek		= noop_llseek,
+};
+
+static int bpf_prog_open(struct inode *inode, struct file *filep)
+{
+	filep->private_data = container_of(inode->i_cdev,
+					   struct bpf_prog_aux, cdev)->prog;
+	return 0;
+}
+
+static const struct file_operations bpf_dev_prog_fops = {
+	.owner		= THIS_MODULE,
+	.open		= bpf_prog_open,
+	.llseek		= noop_llseek,
+};
+
+static char *bpf_type_devnode(struct device *dev, umode_t *mode,
+			      kuid_t *uid, kgid_t *gid)
+{
+	if (mode)
+		*mode = BPF_MODE_DEF;
+	if (uid && gid)
+		current_uid_gid(uid, gid);
+
+	return kasprintf(GFP_KERNEL, "bpf/%s", dev_name(dev));
+}
+
+static const struct device_type bpf_dev_map_type = {
+	.name		= "map",
+	.devnode	= bpf_type_devnode,
+};
+
+static const struct device_type bpf_dev_prog_type = {
+	.name		= "prog",
+	.devnode	= bpf_type_devnode,
+};
+
+static int bpf_map_make_dev(struct bpf_map *map)
+{
+	struct device *dev;
+	dev_t devt;
+	int ret;
+
+	mutex_lock(&map->m_lock);
+	if (map->flags & BPF_F_HAS_DEV) {
+		ret = map->minor;
+		goto out;
+	}
+
+	cdev_init(&map->cdev, &bpf_dev_map_fops);
+	map->cdev.owner = map->cdev.ops->owner;
+
+	map->minor = bpf_map_get_minor(map);
+	if (map->minor < 0) {
+		ret = map->minor;
+		goto out;
+	}
+
+	devt = MKDEV(MAJOR(bpf_map_devt), map->minor);
+
+	ret = devcgroup_inode_mknod(S_IFCHR | BPF_MODE_DEF, devt);
+	if (ret)
+		goto unwind;
+
+	ret = cdev_add(&map->cdev, devt, 1);
+	if (ret)
+		goto unwind;
+
+	dev = device_create_type(bpf_class, NULL, devt, &bpf_dev_map_type,
+				 NULL, "bpf_map%d", map->minor);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto unwind_cdev;
+	}
+
+	map->flags |= BPF_F_HAS_DEV;
+	ret = map->minor;
+out:
+	mutex_unlock(&map->m_lock);
+	return ret;
+unwind_cdev:
+	cdev_del(&map->cdev);
+unwind:
+	bpf_map_put_minor(map);
+	goto out;
+}
+
+static int bpf_map_destroy_dev(struct bpf_map *map)
+{
+	bool drop_ref = false;
+	dev_t devt;
+	int ret;
+
+	mutex_lock(&map->m_lock);
+	if (!(map->flags & BPF_F_HAS_DEV)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	devt = MKDEV(MAJOR(bpf_map_devt), map->minor);
+	ret = map->minor;
+
+	cdev_del(&map->cdev);
+	device_destroy(bpf_class, devt);
+	bpf_map_put_minor(map);
+
+	map->flags &= ~BPF_F_HAS_DEV;
+	drop_ref = true;
+out:
+	mutex_unlock(&map->m_lock);
+
+	if (drop_ref)
+		bpf_map_put(map);
+	return ret;
+}
+
+static int bpf_prog_make_dev(struct bpf_prog *prog)
+{
+	struct bpf_prog_aux *aux = prog->aux;
+	struct device *dev;
+	dev_t devt;
+	int ret;
+
+	mutex_lock(&aux->p_lock);
+	if (aux->flags & BPF_F_HAS_DEV) {
+		ret = aux->minor;
+		goto out;
+	}
+
+	cdev_init(&aux->cdev, &bpf_dev_prog_fops);
+	aux->cdev.owner = aux->cdev.ops->owner;
+
+	aux->minor = bpf_prog_get_minor(prog);
+	if (aux->minor < 0) {
+		ret = aux->minor;
+		goto out;
+	}
+
+	devt = MKDEV(MAJOR(bpf_prog_devt), aux->minor);
+
+	ret = devcgroup_inode_mknod(S_IFCHR | BPF_MODE_DEF, devt);
+	if (ret)
+		goto unwind;
+
+	ret = cdev_add(&aux->cdev, devt, 1);
+	if (ret)
+		goto unwind;
+
+	dev = device_create_type(bpf_class, NULL, devt, &bpf_dev_prog_type,
+				 NULL, "bpf_prog%d", aux->minor);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto unwind_cdev;
+	}
+
+	aux->flags |= BPF_F_HAS_DEV;
+	ret = aux->minor;
+out:
+	mutex_unlock(&aux->p_lock);
+	return ret;
+unwind_cdev:
+	cdev_del(&aux->cdev);
+unwind:
+	bpf_prog_put_minor(prog);
+	goto out;
+}
+
+static int bpf_prog_destroy_dev(struct bpf_prog *prog)
+{
+	struct bpf_prog_aux *aux = prog->aux;
+	bool drop_ref = false;
+	dev_t devt;
+	int ret;
+
+	mutex_lock(&aux->p_lock);
+	if (!(aux->flags & BPF_F_HAS_DEV)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	devt = MKDEV(MAJOR(bpf_prog_devt), aux->minor);
+	ret = aux->minor;
+
+	cdev_del(&aux->cdev);
+	device_destroy(bpf_class, devt);
+	bpf_prog_put_minor(prog);
+
+	aux->flags &= ~BPF_F_HAS_DEV;
+	drop_ref = true;
+out:
+	mutex_unlock(&aux->p_lock);
+
+	if (drop_ref)
+		bpf_prog_put(prog);
+	return ret;
+}
+
+static void bpf_any_get(void *raw, enum bpf_type type)
+{
+	switch (type) {
+	case BPF_TYPE_PROG:
+		atomic_inc(&((struct bpf_prog *)raw)->aux->refcnt);
+		break;
+	case BPF_TYPE_MAP:
+		atomic_inc(&((struct bpf_map *)raw)->refcnt);
+		break;
+	}
+}
+
+void bpf_any_put(void *raw, enum bpf_type type)
+{
+	switch (type) {
+	case BPF_TYPE_PROG:
+		bpf_prog_put(raw);
+		break;
+	case BPF_TYPE_MAP:
+		bpf_map_put(raw);
+		break;
+	}
+}
+
+static void *__bpf_dev_get(struct fd f, enum bpf_type *type)
+{
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+	if (f.file->f_op != &bpf_dev_map_fops &&
+	    f.file->f_op != &bpf_dev_prog_fops) {
+		fdput(f);
+		return ERR_PTR(-EINVAL);
+	}
+
+	*type = f.file->f_op == &bpf_dev_map_fops ?
+		BPF_TYPE_MAP : BPF_TYPE_PROG;
+	return f.file->private_data;
+}
+
+static void *bpf_dev_get(u32 ufd, enum bpf_type *type)
+{
+	struct fd f = fdget(ufd);
+	void *raw;
+
+	raw = __bpf_dev_get(f, type);
+	if (IS_ERR(raw))
+		return raw;
+
+	bpf_any_get(raw, *type);
+	fdput(f);
+
+	return raw;
+}
+
+static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
+{
+	void *raw;
+
+	*type = BPF_TYPE_MAP;
+	raw = bpf_map_get(ufd);
+	if (IS_ERR(raw)) {
+		*type = BPF_TYPE_PROG;
+		raw = bpf_prog_get(ufd);
+	}
+
+	return raw;
+}
+
+int __bpf_dev_create(__u32 ufd)
+{
+	enum bpf_type type;
+	void *raw;
+	int ret;
+
+	raw = bpf_fd_probe_obj(ufd, &type);
+	if (IS_ERR(raw))
+		return PTR_ERR(raw);
+
+	switch (type) {
+	case BPF_TYPE_MAP:
+		ret = bpf_map_make_dev(raw);
+		break;
+	case BPF_TYPE_PROG:
+		ret = bpf_prog_make_dev(raw);
+		break;
+	}
+
+	if (ret < 0)
+		bpf_any_put(raw, type);
+
+	return ret;
+}
+
+int __bpf_dev_destroy(__u32 ufd)
+{
+	enum bpf_type type;
+	void *raw;
+	int ret;
+
+	raw = bpf_fd_probe_obj(ufd, &type);
+	if (IS_ERR(raw))
+		return PTR_ERR(raw);
+
+	switch (type) {
+	case BPF_TYPE_MAP:
+		ret = bpf_map_destroy_dev(raw);
+		break;
+	case BPF_TYPE_PROG:
+		ret = bpf_prog_destroy_dev(raw);
+		break;
+	}
+
+	bpf_any_put(raw, type);
+	return ret;
+}
+
+int __bpf_dev_connect(__u32 ufd)
+{
+	enum bpf_type type;
+	void *raw;
+	int ret;
+
+	raw = bpf_dev_get(ufd, &type);
+	if (IS_ERR(raw))
+		return PTR_ERR(raw);
+
+	switch (type) {
+	case BPF_TYPE_MAP:
+		ret = bpf_map_new_fd(raw);
+		break;
+	case BPF_TYPE_PROG:
+		ret = bpf_prog_new_fd(raw);
+		break;
+	}
+	if (ret < 0)
+		bpf_any_put(raw, type);
+
+	return ret;
+}
+
+static int __init bpf_dev_init(void)
+{
+	int ret;
+
+	ret = alloc_chrdev_region(&bpf_map_devt, 0, BPF_MAX_DEVS,
+				  "bpf_map");
+	if (ret)
+		return ret;
+
+	ret = alloc_chrdev_region(&bpf_prog_devt, 0, BPF_MAX_DEVS,
+				  "bpf_prog");
+	if (ret) {
+		unregister_chrdev_region(bpf_map_devt, BPF_MAX_DEVS);
+		return ret;
+	}
+
+	bpf_class = class_create(THIS_MODULE, "bpf");
+	if (IS_ERR(bpf_class)) {
+		unregister_chrdev_region(bpf_map_devt, BPF_MAX_DEVS);
+		unregister_chrdev_region(bpf_prog_devt, BPF_MAX_DEVS);
+		return PTR_ERR(bpf_class);
+	}
+	return ret;
+}
+late_initcall(bpf_dev_init);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c629fe6..458b2f9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -14,6 +14,7 @@
  #include <linux/slab.h>
  #include <linux/anon_inodes.h>
  #include <linux/file.h>
+#include <linux/mutex.h>
  #include <linux/license.h>
  #include <linux/filter.h>
  #include <linux/version.h>
@@ -111,7 +112,7 @@ static const struct file_operations bpf_map_fops = {
  	.release = bpf_map_release,
  };

-static int bpf_map_new_fd(struct bpf_map *map)
+int bpf_map_new_fd(struct bpf_map *map)
  {
  	return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
  				O_RDWR | O_CLOEXEC);
@@ -141,6 +142,7 @@ static int map_create(union bpf_attr *attr)
  	if (IS_ERR(map))
  		return PTR_ERR(map);

+	mutex_init(&map->m_lock);
  	atomic_set(&map->refcnt, 1);

  	err = bpf_map_charge_memlock(map);
@@ -174,7 +176,7 @@ struct bpf_map *__bpf_map_get(struct fd f)
  	return f.file->private_data;
  }

-static struct bpf_map *bpf_map_get(u32 ufd)
+struct bpf_map *bpf_map_get(u32 ufd)
  {
  	struct fd f = fdget(ufd);
  	struct bpf_map *map;
@@ -525,18 +527,14 @@ static void __prog_put_common(struct rcu_head *rcu)
  /* version of bpf_prog_put() that is called after a grace period */
  void bpf_prog_put_rcu(struct bpf_prog *prog)
  {
-	if (atomic_dec_and_test(&prog->aux->refcnt)) {
-		prog->aux->prog = prog;
+	if (atomic_dec_and_test(&prog->aux->refcnt))
  		call_rcu(&prog->aux->rcu, __prog_put_common);
-	}
  }

  void bpf_prog_put(struct bpf_prog *prog)
  {
-	if (atomic_dec_and_test(&prog->aux->refcnt)) {
-		prog->aux->prog = prog;
+	if (atomic_dec_and_test(&prog->aux->refcnt))
  		__prog_put_common(&prog->aux->rcu);
-	}
  }
  EXPORT_SYMBOL_GPL(bpf_prog_put);

@@ -552,7 +550,7 @@ static const struct file_operations bpf_prog_fops = {
          .release = bpf_prog_release,
  };

-static int bpf_prog_new_fd(struct bpf_prog *prog)
+int bpf_prog_new_fd(struct bpf_prog *prog)
  {
  	return anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog,
  				O_RDWR | O_CLOEXEC);
@@ -641,6 +639,7 @@ static int bpf_prog_load(union bpf_attr *attr)
  	prog->orig_prog = NULL;
  	prog->jited = 0;

+	mutex_init(&prog->aux->p_lock);
  	atomic_set(&prog->aux->refcnt, 1);
  	prog->gpl_compatible = is_gpl ? 1 : 0;

@@ -678,6 +677,32 @@ free_prog_nouncharge:
  	return err;
  }

+#define BPF_DEV_LAST_FIELD fd
+
+static int bpf_dev_create(const union bpf_attr *attr)
+{
+	if (CHECK_ATTR(BPF_DEV))
+		return -EINVAL;
+
+	return __bpf_dev_create(attr->fd);
+}
+
+static int bpf_dev_destroy(const union bpf_attr *attr)
+{
+	if (CHECK_ATTR(BPF_DEV))
+		return -EINVAL;
+
+	return __bpf_dev_destroy(attr->fd);
+}
+
+static int bpf_dev_connect(const union bpf_attr *attr)
+{
+	if (CHECK_ATTR(BPF_DEV))
+		return -EINVAL;
+
+	return __bpf_dev_connect(attr->fd);
+}
+
  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
  {
  	union bpf_attr attr = {};
@@ -738,6 +763,15 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
  	case BPF_PROG_LOAD:
  		err = bpf_prog_load(&attr);
  		break;
+	case BPF_DEV_CREATE:
+		err = bpf_dev_create(&attr);
+		break;
+	case BPF_DEV_DESTROY:
+		err = bpf_dev_destroy(&attr);
+		break;
+	case BPF_DEV_CONNECT:
+		err = bpf_dev_connect(&attr);
+		break;
  	default:
  		err = -EINVAL;
  		break;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 2df8ddc..acf3847 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -786,7 +786,8 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
  		return 0;

  	va_start(args, fmt);
-	dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args);
+	dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), NULL,
+				  bdi, fmt, args);
  	va_end(args);
  	if (IS_ERR(dev))
  		return PTR_ERR(dev);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-21 15:17                                                             ` Daniel Borkmann
@ 2015-10-21 18:34                                                               ` Thomas Graf
  2015-10-21 22:44                                                                 ` Alexei Starovoitov
  2015-10-22 19:35                                                               ` Eric W. Biederman
  1 sibling, 1 reply; 56+ messages in thread
From: Thomas Graf @ 2015-10-21 18:34 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Eric W. Biederman, Alexei Starovoitov, Hannes Frederic Sowa,
	davem, viro, netdev, linux-kernel, Alexei Starovoitov

On 10/21/15 at 05:17pm, Daniel Borkmann wrote:
> On 10/20/2015 08:56 PM, Eric W. Biederman wrote:
> ...
> >Just FYI:  Using a device for this kind of interface is pretty
> >much a non-starter as that quickly gets you into situations where
> >things do not work in containers.  If someone gets a version of device
> >namespaces past GregKH it might be up for discussion to use character
> >devices.
> 
> Okay, you are referring to this discussion here:
> 
>   http://thread.gmane.org/gmane.linux.kernel.containers/26760
> 
> What had been mentioned earlier in this thread was to have a namespace
> pass-through facility enforced by device cgroups we have in the kernel,
> which is one out of various means used to enforce policy today by
> deployment systems such as docker, for example. But more below.
> 
> I think this all depends on the kind of expectations we have, where all
> this is going. In the original proposal, it was agreed to have the
> operation that creates a node as 'capable(CAP_SYS_ADMIN)'-only (in the
> way like most of the rest of eBPF is restricted), and based on the use
> case we distribute such objects to unprivileged applications. But I
> understand that it seems the trend lately to lift eBPF restrictions at
> some point anyway, and thus the CAP_SYS_ADMIN is suddenly irrelevant
> again. Fair enough.
> 
> Don't get me wrong, I really don't mind if it will be some version of
> this fs patch or whatever architecture else we find consensus on, I
> think this discussion is merely trying to evaluate/discuss on what seems
> to be a good fit, also in terms of future requirements and integration.
> 
> So far, during this discussion, it was proposed to modify the file system
> to a single-mount one and to stick this under /sys/kernel/bpf/. This
> will not have "real" namespace support either, but it was proposed to
> have a following structure:
> 
>   /sys/kernel/bpf/username/<optional_dirs_mkdir_by_user>/progX

This would probably work, as you would typically map the eBPF map
using -v like this to give a stable path:

        docker run -v /sys/kernel/bpf/foo/maps/progX:/map proX
 
> So, the file system will have kind of a user home-directory for each user
> to isolate through permissions, if I understood correctly.
> 
> If we really want to go this route, then I think there are no big stones
> in the way for the other model either. It should look roughly drafted like
> the below.
> 
> Together with device cgroups for containers, it would allow scenarios where
> you can have:
> 
>   * eBPF (map/prog) device pass-through so a map/prog could even be shared out
>     from the initial namespace into individual ones/all (one could possibly
>     extend such maps as read-only for these consumers).
>   * eBPF device creation for unprivileged users with permissions being set
>     accordingly (as in fs case).
>   * Since cgroup controller can also do wildcards on major/minors, we could
>     make that further fine-grained.
>   * eBPF device creation can also be enforced by the cgroup controller to be
>     entirely disallowed for a specific container.
> 
> (An admin can determine the dynamically created major, e.g., under /proc/devices.)

I've read the discussion passively and my takeaway is that, frankly,
I think the differences are somewhat minor. Both architectures can
scale to what we need. Both will do the job. I'm slightly worried about
exposing uAPI as a FS; I think that didn't work too well for sysfs. It's
pretty much a define-the-format-once-and-never-touch-it-again kind of
deal.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-21 18:34                                                               ` Thomas Graf
@ 2015-10-21 22:44                                                                 ` Alexei Starovoitov
  2015-10-22 13:22                                                                   ` Daniel Borkmann
  0 siblings, 1 reply; 56+ messages in thread
From: Alexei Starovoitov @ 2015-10-21 22:44 UTC (permalink / raw)
  To: Thomas Graf, Daniel Borkmann
  Cc: Eric W. Biederman, Hannes Frederic Sowa, davem, viro, netdev,
	linux-kernel, Alexei Starovoitov

On 10/21/15 11:34 AM, Thomas Graf wrote:
>> So far, during this discussion, it was proposed to modify the file system
>> >to a single-mount one and to stick this under /sys/kernel/bpf/. This
>> >will not have "real" namespace support either, but it was proposed to
>> >have a following structure:
>> >
>> >   /sys/kernel/bpf/username/<optional_dirs_mkdir_by_user>/progX
> This would probably work as you would typically map the ebpf map
> using -v like this to give a stable path:
>
>          docker run -v /sys/kernel/bpf/foo/maps/progX:/map proX

yep
tracefs works inside docker the same way.
Maybe we should let users pick names similar to this fs patch to make
the above easier to use.
Also, from the bpf syscall point of view, the user shouldn't see the
/sys/kernel/bpf/user/ prefix, only 'optional_dirs_mkdir_by_user/name'
when doing pin/new_fd.
Maybe the prog type should be a fixed part of the path as well.

>> >Together with device cgroups for containers, it would allow scenarios where
>> >you can have:
>> >
>> >   * eBPF (map/prog) device pass-through so a map/prog could even be shared out
>> >     from the initial namespace into individual ones/all (one could possibly
>> >     extend such maps as read-only for these consumers).
>> >   * eBPF device creation for unprivileged users with permissions being set
>> >     accordingly (as in fs case).
>> >   * Since cgroup controller can also do wildcards on major/minors, we could
>> >     make that further fine-grained.
>> >   * eBPF device creation can also be enforced by the cgroup controller to be
>> >     entirely disallowed for a specific container.

none of the above is practical. It can be demoed in a canned
environment, but it's a complete mismatch of APIs. cgroup/dev is a
static config, whereas bpf-cdev is dynamic (with minors allocated out
of an idr for all users). When you have to hack drivers/base/core.c to
get there, it should have been a warning sign that something is wrong
with this cdev approach.

> I've read the discussion passively and my takeaway is that, frankly,
> I think the differences are somewhat minor. Both architectures can
> scale to what we need. Both will do the job. I'm slightly worried about
> exposing uAPI as a FS; I think that didn't work too well for sysfs. It's
> pretty much a define-the-format-once-and-never-touch-it-again kind of
> deal.

It's even worse in cdev style since it piggybacks on sysfs.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-21 22:44                                                                 ` Alexei Starovoitov
@ 2015-10-22 13:22                                                                   ` Daniel Borkmann
  0 siblings, 0 replies; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-22 13:22 UTC (permalink / raw)
  To: Alexei Starovoitov, Thomas Graf
  Cc: Eric W. Biederman, Hannes Frederic Sowa, davem, viro, netdev,
	linux-kernel, Alexei Starovoitov

On 10/22/2015 12:44 AM, Alexei Starovoitov wrote:
...
> all users) When you have to hack drivers/base/core.c to get there it
> should have been a warning sign that something is wrong with
> this cdev approach.

Hmm, you know, this had nothing to do with it; it was merely to save
~20 LoC that I can do just as well inside the BPF framework. No changes
in the driver API needed.

>> I've read the discussion passively and my takeaway is that, frankly,
>> I think the differences are somewhat minor. Both architectures can
>> scale to what we need. Both will do the job. I'm slightly worried about
>> exposing uAPI as a FS; I think that didn't work too well for sysfs. It's
>> pretty much a define-the-format-once-and-never-touch-it-again kind of
>> deal.
>
> It's even worse in cdev style since it piggy backs on sysfs.

I don't mind with what approach we're going in the end, but this kind
of discussion is really tiring, and not going anywhere.

Lets just make a beer call, so we can hash out a way forward that works
for everyone.

On that note: cheers! ;)
Daniel

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-21 15:17                                                             ` Daniel Borkmann
  2015-10-21 18:34                                                               ` Thomas Graf
@ 2015-10-22 19:35                                                               ` Eric W. Biederman
  2015-10-23 13:47                                                                 ` Daniel Borkmann
  1 sibling, 1 reply; 56+ messages in thread
From: Eric W. Biederman @ 2015-10-22 19:35 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Hannes Frederic Sowa, davem, viro, tgraf,
	netdev, linux-kernel, Alexei Starovoitov

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 10/20/2015 08:56 PM, Eric W. Biederman wrote:
> ...
>> Just FYI:  Using a device for this kind of interface is pretty
>> much a non-starter as that quickly gets you into situations where
>> things do not work in containers.  If someone gets a version of device
>> namespaces past GregKH it might be up for discussion to use character
>> devices.
>
> Okay, you are referring to this discussion here:
>
>   http://thread.gmane.org/gmane.linux.kernel.containers/26760

That is a piece of it.  It is an old, old discussion (which generally has
been handled poorly).  For the foreseeable future, device namespaces have
a firm NACK from GregKH.  Which means that dynamic character-device-based
interfaces do not work in containers.  Which means that if you are not
talking about physical hardware, character devices are a poor fit.

That makes a character-device-based interface for eBPF unworkable.

Eric

p.s.  There are plenty of reasons (even if privilege remains a
      requirement) to ask how this functionality can be used in a
      container.  If for no other reason than that sandboxing privileged
      applications is typically a good idea.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
  2015-10-22 19:35                                                               ` Eric W. Biederman
@ 2015-10-23 13:47                                                                 ` Daniel Borkmann
  0 siblings, 0 replies; 56+ messages in thread
From: Daniel Borkmann @ 2015-10-23 13:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexei Starovoitov, Hannes Frederic Sowa, davem, viro, tgraf,
	netdev, linux-kernel, Alexei Starovoitov

On 10/22/2015 09:35 PM, Eric W. Biederman wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>> On 10/20/2015 08:56 PM, Eric W. Biederman wrote:
>> ...
>>> Just FYI:  Using a device for this kind of interface is pretty
>>> much a non-starter as that quickly gets you into situations where
>>> things do not work in containers.  If someone gets a version of device
>>> namespaces past GregKH it might be up for discussion to use character
>>> devices.
>>
>> Okay, you are referring to this discussion here:
>>
>>    http://thread.gmane.org/gmane.linux.kernel.containers/26760
>
> That is a piece of it.  It is an old, old discussion (which generally has
> been handled poorly).  For the foreseeable future, device namespaces have
> a firm NACK from GregKH.  Which means that dynamic character-device-based
> interfaces do not work in containers.  Which means that if you are not
> talking about physical hardware, character devices are a poor fit.

Yes, it breaks down with real namespace support. Reworking the set with
an improved version of the fs code is already in progress.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2015-10-23 13:48 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-16  1:09 [PATCH net-next 0/4] BPF updates Daniel Borkmann
2015-10-16  1:09 ` [PATCH net-next 1/4] bpf: abstract anon_inode_getfd invocations Daniel Borkmann
2015-10-16  1:09 ` [PATCH net-next 2/4] bpf: align and clean bpf_{map,prog}_get helpers Daniel Borkmann
2015-10-16  1:09 ` [PATCH net-next 3/4] bpf: add support for persistent maps/progs Daniel Borkmann
2015-10-16 10:25   ` Hannes Frederic Sowa
2015-10-16 13:36     ` Daniel Borkmann
2015-10-16 16:36       ` Hannes Frederic Sowa
2015-10-16 17:27         ` Daniel Borkmann
2015-10-16 17:37           ` Alexei Starovoitov
2015-10-16 16:18     ` Alexei Starovoitov
2015-10-16 16:43       ` Hannes Frederic Sowa
2015-10-16 17:32         ` Alexei Starovoitov
2015-10-16 17:37           ` Thomas Graf
2015-10-16 17:21   ` Hannes Frederic Sowa
2015-10-16 17:42     ` Alexei Starovoitov
2015-10-16 17:56       ` Daniel Borkmann
2015-10-16 18:41         ` Eric W. Biederman
2015-10-16 19:27           ` Alexei Starovoitov
2015-10-16 19:53             ` Eric W. Biederman
2015-10-16 20:56               ` Alexei Starovoitov
2015-10-16 23:44                 ` Eric W. Biederman
2015-10-17  2:43                   ` Alexei Starovoitov
2015-10-17 12:28                     ` Daniel Borkmann
2015-10-18  2:20                       ` Alexei Starovoitov
2015-10-18 15:03                         ` Daniel Borkmann
2015-10-18 16:49                           ` Daniel Borkmann
2015-10-18 20:59                             ` Alexei Starovoitov
2015-10-19  7:36                               ` Hannes Frederic Sowa
2015-10-19  9:51                                 ` Daniel Borkmann
2015-10-19 14:23                                   ` Daniel Borkmann
2015-10-19 16:22                                     ` Alexei Starovoitov
2015-10-19 17:37                                       ` Daniel Borkmann
2015-10-19 18:15                                         ` Alexei Starovoitov
2015-10-19 18:46                                           ` Hannes Frederic Sowa
2015-10-19 19:34                                             ` Alexei Starovoitov
2015-10-19 20:03                                               ` Hannes Frederic Sowa
2015-10-19 20:48                                                 ` Alexei Starovoitov
2015-10-19 22:17                                                   ` Daniel Borkmann
2015-10-20  0:30                                                     ` Alexei Starovoitov
2015-10-20  8:46                                                       ` Daniel Borkmann
2015-10-20 17:53                                                         ` Alexei Starovoitov
2015-10-20 18:56                                                           ` Eric W. Biederman
2015-10-21 15:17                                                             ` Daniel Borkmann
2015-10-21 18:34                                                               ` Thomas Graf
2015-10-21 22:44                                                                 ` Alexei Starovoitov
2015-10-22 13:22                                                                   ` Daniel Borkmann
2015-10-22 19:35                                                               ` Eric W. Biederman
2015-10-23 13:47                                                                 ` Daniel Borkmann
2015-10-20  9:43                                                       ` Hannes Frederic Sowa
2015-10-19 23:02                                                   ` Hannes Frederic Sowa
2015-10-20  1:09                                                     ` Alexei Starovoitov
2015-10-20 10:07                                                       ` Hannes Frederic Sowa
2015-10-20 18:44                                                         ` Alexei Starovoitov
2015-10-16 19:54             ` Daniel Borkmann
2015-10-16  1:09 ` [PATCH net-next 4/4] bpf: add sample usages " Daniel Borkmann
2015-10-19  2:53 ` [PATCH net-next 0/4] BPF updates David Miller
