All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alexei Starovoitov <ast@plumgrid.com>
To: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@kernel.org>,
	Linus Torvalds <torvalds@linuxfoundation.org>,
	Andy Lutomirski <luto@amacapital.net>,
	Steven Rostedt <rostedt@goodmis.org>,
	Daniel Borkmann <dborkman@redhat.com>,
	Hannes Frederic Sowa <hannes@stressinduktion.org>,
	Chema Gonzalez <chema@google.com>,
	Eric Dumazet <edumazet@google.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Pablo Neira Ayuso <pablo@netfilter.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Andrew Morton <akpm@linuxfoundation.org>,
	Kees Cook <keescook@chromium.org>,
	linux-api@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v11 net-next 01/12] bpf: introduce BPF syscall and maps
Date: Tue,  9 Sep 2014 22:09:57 -0700	[thread overview]
Message-ID: <1410325808-3657-2-git-send-email-ast@plumgrid.com> (raw)
In-Reply-To: <1410325808-3657-1-git-send-email-ast@plumgrid.com>

BPF syscall is a multiplexor for a range of different operations on eBPF.
This patch introduces syscall with single command to create a map.
Next patch adds commands to access maps.

'maps' is a generic storage of different types for sharing data between kernel
and userspace.

Userspace example:
/* this syscall wrapper creates a map with given type and attributes
 * and returns map_fd on success.
 * use close(map_fd) to delete the map
 */
int bpf_create_map(enum bpf_map_type map_type, int key_size,
                   int value_size, int max_entries)
{
    union bpf_attr attr = {
        .map_type = map_type,
        .key_size = key_size,
        .value_size = value_size,
        .max_entries = max_entries
    };

    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}

syscall is using 'union bpf_attr' to be backwards compatible with future
extensions. Different syscall commands will use different attributes.

More details in Documentation/networking/filter.txt and in manpage

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 Documentation/networking/filter.txt |   39 +++++++++
 include/linux/bpf.h                 |   41 ++++++++++
 include/uapi/linux/bpf.h            |   24 ++++++
 kernel/bpf/Makefile                 |    2 +-
 kernel/bpf/syscall.c                |  149 +++++++++++++++++++++++++++++++++++
 5 files changed, 254 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/bpf.h
 create mode 100644 kernel/bpf/syscall.c

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index 81916ab5d96f..1900d29518f1 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -1001,6 +1001,45 @@ instruction that loads 64-bit immediate value into a dst_reg.
 Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
 32-bit immediate value into a register.
 
+eBPF maps
+---------
+'maps' is a generic storage of different types for sharing data between kernel
+and userspace.
+
+The maps are accessed from user space via BPF syscall, which has commands:
+- create a map with given type and attributes
+  map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
+  using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
+  returns process-local file descriptor or negative error
+
+- lookup key in a given map
+  err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
+  using attr->map_fd, attr->key, attr->value
+  returns zero and stores found elem into value or negative error
+
+- create or update key/value pair in a given map
+  err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
+  using attr->map_fd, attr->key, attr->value
+  returns zero or negative error
+
+- find and delete element by key in a given map
+  err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
+  using attr->map_fd, attr->key
+
+- to delete map: close(fd)
+  Exiting process will delete maps automatically
+
+userspace programs use this syscall to create/access maps that eBPF programs
+are concurrently updating.
+
+maps can have different types: hash, array, bloom filter, radix-tree, etc.
+
+The map is defined by:
+  . type
+  . max number of elements
+  . key size in bytes
+  . value size in bytes
+
 Testing
 -------
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
new file mode 100644
index 000000000000..48014a71f0fe
--- /dev/null
+++ b/include/linux/bpf.h
@@ -0,0 +1,41 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_BPF_H
+#define _LINUX_BPF_H 1
+
+#include <uapi/linux/bpf.h>
+#include <linux/workqueue.h>
+
+struct bpf_map;
+
+/* map is generic key/value storage optionally accesible by eBPF programs */
+struct bpf_map_ops {
+	/* funcs callable from userspace (via syscall) */
+	struct bpf_map *(*map_alloc)(union bpf_attr *attr);
+	void (*map_free)(struct bpf_map *);
+};
+
+struct bpf_map {
+	atomic_t refcnt;
+	enum bpf_map_type map_type;
+	u32 key_size;
+	u32 value_size;
+	u32 max_entries;
+	struct bpf_map_ops *ops;
+	struct work_struct work;
+};
+
+struct bpf_map_type_list {
+	struct list_head list_node;
+	struct bpf_map_ops *ops;
+	enum bpf_map_type type;
+};
+
+void bpf_register_map_type(struct bpf_map_type_list *tl);
+void bpf_map_put(struct bpf_map *map);
+
+#endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 479ed0b6be16..7d83ef63849d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -62,4 +62,28 @@ struct bpf_insn {
 	__s32	imm;		/* signed immediate constant */
 };
 
+/* BPF syscall commands */
+enum bpf_cmd {
+	/* create a map with given type and attributes
+	 * fd = bpf(BPF_MAP_CREATE, union bpf_attr *, u32 size)
+	 * returns fd or negative error
+	 * map is deleted when fd is closed
+	 */
+	BPF_MAP_CREATE,
+};
+
+enum bpf_map_type {
+	BPF_MAP_TYPE_UNSPEC,
+};
+
+union bpf_attr {
+	struct { /* anonymous struct used by BPF_MAP_CREATE command */
+		enum bpf_map_type map_type;
+		__u32	key_size;	/* size of key in bytes */
+		__u32	value_size;	/* size of value in bytes */
+		__u32	max_entries;	/* max number of entries in a map */
+#define BPF_MAP_CREATE_LAST_FIELD max_entries
+	};
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 6a71145e2769..e9f7334ed07a 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o
+obj-y := core.o syscall.o
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
new file mode 100644
index 000000000000..e353eaf3ac59
--- /dev/null
+++ b/kernel/bpf/syscall.c
@@ -0,0 +1,149 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <linux/syscalls.h>
+#include <linux/slab.h>
+#include <linux/anon_inodes.h>
+
+static LIST_HEAD(bpf_map_types);
+
+static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
+{
+	struct bpf_map_type_list *tl;
+	struct bpf_map *map;
+
+	list_for_each_entry(tl, &bpf_map_types, list_node) {
+		if (tl->type == attr->map_type) {
+			map = tl->ops->map_alloc(attr);
+			if (IS_ERR(map))
+				return map;
+			map->ops = tl->ops;
+			map->map_type = attr->map_type;
+			return map;
+		}
+	}
+	return ERR_PTR(-EINVAL);
+}
+
+/* boot time registration of different map implementations */
+void bpf_register_map_type(struct bpf_map_type_list *tl)
+{
+	list_add(&tl->list_node, &bpf_map_types);
+}
+
+/* called from workqueue */
+static void bpf_map_free_deferred(struct work_struct *work)
+{
+	struct bpf_map *map = container_of(work, struct bpf_map, work);
+
+	/* implementation dependent freeing */
+	map->ops->map_free(map);
+}
+
+/* decrement map refcnt and schedule it for freeing via workqueue
+ * (unrelying map implementation ops->map_free() might sleep)
+ */
+void bpf_map_put(struct bpf_map *map)
+{
+	if (atomic_dec_and_test(&map->refcnt)) {
+		INIT_WORK(&map->work, bpf_map_free_deferred);
+		schedule_work(&map->work);
+	}
+}
+
+static int bpf_map_release(struct inode *inode, struct file *filp)
+{
+	struct bpf_map *map = filp->private_data;
+
+	bpf_map_put(map);
+	return 0;
+}
+
+static const struct file_operations bpf_map_fops = {
+	.release = bpf_map_release,
+};
+
+/* helper macro to check that unused fields 'union bpf_attr' are zero */
+#define CHECK_ATTR(CMD) \
+	memchr_inv((void *) &attr->CMD##_LAST_FIELD + \
+		   sizeof(attr->CMD##_LAST_FIELD), 0, \
+		   sizeof(*attr) - \
+		   offsetof(union bpf_attr, CMD##_LAST_FIELD)) != NULL
+
+/* called via syscall */
+static int map_create(union bpf_attr *attr)
+{
+	struct bpf_map *map;
+	int err;
+
+	err = CHECK_ATTR(BPF_MAP_CREATE);
+	if (err)
+		return -EINVAL;
+
+	/* find map type and init map: hashtable vs rbtree vs bloom vs ... */
+	map = find_and_alloc_map(attr);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+
+	atomic_set(&map->refcnt, 1);
+
+	err = anon_inode_getfd("bpf-map", &bpf_map_fops, map, O_RDWR | O_CLOEXEC);
+
+	if (err < 0)
+		/* failed to allocate fd */
+		goto free_map;
+
+	return err;
+
+free_map:
+	map->ops->map_free(map);
+	return err;
+}
+
+SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
+{
+	union bpf_attr *attr;
+	int err;
+
+	/* the syscall is limited to root temporarily. This restriction will be
+	 * lifted when security audit is clean. Note that eBPF+tracing must have
+	 * this restriction, since it may pass kernel data to user space
+	 */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* newer userspace cannot run with older kernel */
+	if (size > sizeof(*attr))
+		return -EINVAL;
+
+	attr = kzalloc(sizeof(*attr), GFP_USER);
+	if (!attr)
+		return -ENOMEM;
+
+	/* copy attributes from user space, may be less than sizeof(bpf_attr) */
+	err = -EFAULT;
+	if (copy_from_user(attr, uattr, size) != 0)
+		goto free_attr;
+
+	switch (cmd) {
+	case BPF_MAP_CREATE:
+		err = map_create(attr);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+free_attr:
+	kfree(attr);
+	return err;
+}
-- 
1.7.9.5


  reply	other threads:[~2014-09-10  5:10 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-09-10  5:09 [PATCH v11 net-next 00/12] eBPF syscall, verifier, testsuite Alexei Starovoitov
2014-09-10  5:09 ` Alexei Starovoitov
2014-09-10  5:09 ` Alexei Starovoitov [this message]
2014-09-10  5:09 ` [PATCH v11 net-next 02/12] bpf: enable bpf syscall on x64 and i386 Alexei Starovoitov
2014-09-10  5:09 ` [PATCH v11 net-next 03/12] bpf: add lookup/update/delete/iterate methods to BPF maps Alexei Starovoitov
2014-09-10  5:10 ` [PATCH v11 net-next 04/12] bpf: expand BPF syscall with program load/unload Alexei Starovoitov
2014-09-10  8:04   ` Daniel Borkmann
2014-09-10  8:04     ` Daniel Borkmann
2014-09-10 17:19     ` Alexei Starovoitov
2014-09-10  5:10 ` [PATCH v11 net-next 05/12] bpf: handle pseudo BPF_CALL insn Alexei Starovoitov
2014-09-10  5:10 ` [PATCH v11 net-next 06/12] bpf: verifier (add docs) Alexei Starovoitov
2014-09-10  5:10 ` [PATCH v11 net-next 07/12] bpf: verifier (add ability to receive verification log) Alexei Starovoitov
2014-09-10  5:10 ` [PATCH v11 net-next 08/12] bpf: handle pseudo BPF_LD_IMM64 insn Alexei Starovoitov
2014-09-10  5:10 ` [PATCH v11 net-next 09/12] bpf: verifier (add branch/goto checks) Alexei Starovoitov
2014-09-10  5:10 ` [PATCH v11 net-next 10/12] bpf: verifier (add verifier core) Alexei Starovoitov
2014-09-10  5:10   ` Alexei Starovoitov
2014-09-10  5:10 ` [PATCH v11 net-next 11/12] net: filter: move eBPF instruction macros Alexei Starovoitov
2014-09-10 11:24   ` Daniel Borkmann
2014-09-10 11:24     ` Daniel Borkmann
2014-09-10 18:16     ` Alexei Starovoitov
2014-09-10 18:16       ` Alexei Starovoitov
2014-09-11  6:29       ` Daniel Borkmann
2014-09-11  6:45         ` Alexei Starovoitov
2014-09-10  5:10 ` [PATCH v11 net-next 12/12] bpf: mini eBPF library, test stubs and verifier testsuite Alexei Starovoitov
2014-09-10 11:35   ` Daniel Borkmann
2014-09-10 11:35     ` Daniel Borkmann
2014-09-10 18:08     ` Alexei Starovoitov
2014-09-10 18:08       ` Alexei Starovoitov
2014-09-17  7:16       ` Daniel Borkmann
2014-09-17  7:16         ` Daniel Borkmann
2014-09-17 16:17         ` Alexei Starovoitov
2014-09-17 21:59           ` Daniel Borkmann
2014-09-17 21:59             ` Daniel Borkmann
2014-09-17 22:16             ` Alexei Starovoitov
2014-09-10  8:19 ` [PATCH v11 net-next 00/12] eBPF syscall, verifier, testsuite Daniel Borkmann
2014-09-10  8:19   ` Daniel Borkmann
2014-09-10 17:28   ` Alexei Starovoitov
2014-09-10  9:03 ` Daniel Borkmann
2014-09-10 17:32   ` Alexei Starovoitov
2014-09-10 17:32     ` Alexei Starovoitov
2014-09-11 19:47     ` Daniel Borkmann
2014-09-11 19:47       ` Daniel Borkmann
2014-09-11 20:33       ` Alexei Starovoitov
2014-09-11 20:33         ` Alexei Starovoitov
2014-09-11 21:54         ` Andy Lutomirski
2014-09-11 21:54           ` Andy Lutomirski
2014-09-11 22:29           ` Alexei Starovoitov
2014-09-11 22:29             ` Alexei Starovoitov
2014-09-12  1:17             ` Andy Lutomirski
2014-09-12  1:29               ` Alexei Starovoitov
2014-09-12 22:40               ` Alexei Starovoitov
2014-09-10  9:21 ` Daniel Borkmann
2014-09-10 17:48   ` Alexei Starovoitov
2014-09-10 18:22 ` Andy Lutomirski
2014-09-10 20:21   ` Alexei Starovoitov
2014-09-10 20:21     ` Alexei Starovoitov
2014-09-11 19:54     ` Daniel Borkmann
2014-09-11 20:35       ` Alexei Starovoitov
2014-09-11 20:35         ` Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1410325808-3657-2-git-send-email-ast@plumgrid.com \
    --to=ast@plumgrid.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linuxfoundation.org \
    --cc=chema@google.com \
    --cc=davem@davemloft.net \
    --cc=dborkman@redhat.com \
    --cc=edumazet@google.com \
    --cc=hannes@stressinduktion.org \
    --cc=hpa@zytor.com \
    --cc=keescook@chromium.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=mingo@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pablo@netfilter.org \
    --cc=rostedt@goodmis.org \
    --cc=torvalds@linuxfoundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.