* [PATCH RFC v3 net-next 0/3] eBPF examples in C
@ 2014-07-22  3:45 Alexei Starovoitov
  2014-07-22  3:45 ` [PATCH RFC v3 net-next 1/3] samples: bpf: elf file loader Alexei Starovoitov
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2014-07-22  3:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

Hi,

I've been asked to repost the LLVM eBPF backend and the examples in C, so here they are.
The LLVM backend is 99% the same as it was in February. I'm not resending it to the
list, since I only fixed minor things there; see it in my tree.

ex1 - the same example I showed in February, but now it works through the BPF syscall
  and Ctrl-C does auto cleanup. This is patch 2/3.

ex2 - a new example that demonstrates a key feature of eBPF programs
  for kernel debugging/tracing. This is patch 3/3.

The dtrace/systemtap/ktap approach is to use one script file that provides all
desired functionality. That architectural decision overcomplicated their
implementations.
eBPF follows a split model: everything that needs to process millions of events
per second runs in the kernel and must be short and deterministic;
everything else, like aggregation and nice graphs, runs in user space.

In patch 3/3, kfree_skb events are counted by a program written in C,
compiled into eBPF and attached to the event. That's the ex2_kern.c file.
The corresponding user space part is ex2_user.c, which walks the in-kernel map
every second and prints its contents. So user space and the kernel access
BPF maps in parallel: the kernel counts events and user space prints them.

Patch 1/3 is a parser for the .o files generated by LLVM. It looks for pre-defined
ELF sections like 'license', 'maps' and 'events' and loads BPF maps/programs
via the BPF syscall that I posted earlier.

Alexei Starovoitov (3):
  samples: bpf: elf file loader
  samples: bpf: eBPF example in C
  samples: bpf: eBPF dropmon example in C

 samples/bpf/Makefile      |   17 +++-
 samples/bpf/bpf_helpers.h |   21 +++++
 samples/bpf/bpf_load.c    |  228 +++++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/bpf_load.h    |   18 ++++
 samples/bpf/ex1_kern.c    |   27 ++++++
 samples/bpf/ex1_user.c    |   11 +++
 samples/bpf/ex2_kern.c    |   29 ++++++
 samples/bpf/ex2_user.c    |   28 ++++++
 8 files changed, 377 insertions(+), 2 deletions(-)
 create mode 100644 samples/bpf/bpf_helpers.h
 create mode 100644 samples/bpf/bpf_load.c
 create mode 100644 samples/bpf/bpf_load.h
 create mode 100644 samples/bpf/ex1_kern.c
 create mode 100644 samples/bpf/ex1_user.c
 create mode 100644 samples/bpf/ex2_kern.c
 create mode 100644 samples/bpf/ex2_user.c

----

The following changes since commit 240524089d7a5c0396656574e299beb3a55461e3:

  net: bcmgenet: only update UMAC_CMD if something changed (2014-07-21 19:49:11 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf master

for you to fetch changes up to 27ae0cec12d0aac6f0705b0269ee705a0c599571:

  samples: bpf: eBPF dropmon example in C (2014-07-21 20:01:29 -0700)

----------------------------------------------------------------
Alexei Starovoitov (20):
      net: filter: split filter.c into two files
      bpf: update MAINTAINERS entry
      net: filter: rename struct sock_filter_int into bpf_insn
      net: filter: split filter.h and expose eBPF to user space
      bpf: introduce syscall(BPF, ...) and BPF maps
      bpf: enable bpf syscall on x64
      bpf: add lookup/update/delete/iterate methods to BPF maps
      bpf: add hashtable type of BPF maps
      bpf: expand BPF syscall with program load/unload
      bpf: add eBPF verifier
      bpf: allow eBPF programs to use maps
      net: sock: allow eBPF programs to be attached to sockets
      tracing: allow eBPF programs to be attached to events
      samples: bpf: add mini eBPF library to manipulate maps and programs
      samples: bpf: example of stateful socket filtering
      samples: bpf: example of tracing filters with eBPF
      bpf: llvm backend
      samples: bpf: elf file loader
      samples: bpf: eBPF example in C
      samples: bpf: eBPF dropmon example in C


* [PATCH RFC v3 net-next 1/3] samples: bpf: elf file loader
  2014-07-22  3:45 [PATCH RFC v3 net-next 0/3] eBPF examples in C Alexei Starovoitov
@ 2014-07-22  3:45 ` Alexei Starovoitov
  2014-07-22  3:45 ` [PATCH RFC v3 net-next 2/3] samples: bpf: eBPF example in C Alexei Starovoitov
  2014-07-22  3:45 ` [PATCH RFC v3 net-next 3/3] samples: bpf: eBPF dropmon " Alexei Starovoitov
  2 siblings, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2014-07-22  3:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

Simple .o parser and loader using the BPF syscall.
The .o file is a standard ELF object generated by the LLVM backend.

/* parses elf file compiled by llvm .c->.o
 * creates maps and stores FD into map_fd array
 * loads eBPF programs
 * returns zero on success
 */
int load_bpf_file(char *path);

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/bpf_helpers.h |   21 +++++
 samples/bpf/bpf_load.c    |  228 +++++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/bpf_load.h    |   18 ++++
 3 files changed, 267 insertions(+)
 create mode 100644 samples/bpf/bpf_helpers.h
 create mode 100644 samples/bpf/bpf_load.c
 create mode 100644 samples/bpf/bpf_load.h

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
new file mode 100644
index 000000000000..356e7d34d174
--- /dev/null
+++ b/samples/bpf/bpf_helpers.h
@@ -0,0 +1,21 @@
+#ifndef __BPF_HELPERS_H
+#define __BPF_HELPERS_H
+
+#define SEC(NAME) __attribute__((section(NAME), used))
+
+static void *(*bpf_load_pointer)(void *unsafe_ptr) = (void *) BPF_FUNC_load_pointer;
+static int (*bpf_printk)(const char *fmt, int fmt_size, ...) = (void *) BPF_FUNC_printk;
+static int (*bpf_memcmp)(void *unsafe_ptr, void *safe_ptr, int size) = (void *) BPF_FUNC_memcmp;
+static void *(*bpf_map_lookup_elem)(void *map, void *key) = (void*) BPF_FUNC_map_lookup_elem;
+static int (*bpf_map_update_elem)(void *map, void *key, void *value) = (void*) BPF_FUNC_map_update_elem;
+static int (*bpf_map_delete_elem)(void *map, void *key) = (void *) BPF_FUNC_map_delete_elem;
+static void (*bpf_dump_stack)(void) = (void *) BPF_FUNC_dump_stack;
+
+struct bpf_map_def {
+	int type;
+	int key_size;
+	int value_size;
+	int max_entries;
+};
+
+#endif
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
new file mode 100644
index 000000000000..7614f19d5282
--- /dev/null
+++ b/samples/bpf/bpf_load.c
@@ -0,0 +1,228 @@
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <libelf.h>
+#include <gelf.h>
+#include <errno.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_helpers.h"
+#include "bpf_load.h"
+
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
+static char license[128];
+static bool processed_sec[128];
+int map_fd[MAX_MAPS];
+
+static int load_and_attach(const char *event, char *prog, int size,
+			   const struct bpf_map_fixup *fixups, int fixup_len)
+{
+	int fd, event_fd, err;
+	char fmt[32];
+	char path[256] = DEBUGFS;
+
+	fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER,
+			   (struct bpf_insn *)prog, size, license,
+			   fixups, fixup_len);
+
+	if (fd < 0) {
+		printf("err %d errno %d\n", fd, errno);
+		return fd;
+	}
+
+
+	snprintf(fmt, sizeof(fmt), "bpf-%d", fd);
+
+	strcat(path, event);
+	strcat(path, "/filter");
+
+	printf("writing %s -> %s\n", fmt, path);
+
+	event_fd = open(path, O_WRONLY, 0);
+	if (event_fd < 0) {
+		printf("failed to open event %s\n", event);
+		return event_fd;
+	}
+
+	err = write(event_fd, fmt, strlen(fmt));
+	(void) err;
+
+	return 0;
+}
+
+static int load_maps(struct bpf_map_def *maps, int len)
+{
+	int i;
+
+	for (i = 0; i < len / sizeof(struct bpf_map_def); i++) {
+
+		map_fd[i] = bpf_create_map(maps[i].type,
+					   maps[i].key_size,
+					   maps[i].value_size,
+					   maps[i].max_entries);
+		if (map_fd[i] < 0)
+			return 1;
+	}
+	return 0;
+}
+
+static int get_sec(Elf *elf, int i, GElf_Ehdr *ehdr, char **shname,
+		   GElf_Shdr *shdr, Elf_Data **data)
+{
+	Elf_Scn *scn;
+
+	scn = elf_getscn(elf, i);
+	if (!scn)
+		return 1;
+
+	if (gelf_getshdr(scn, shdr) != shdr)
+		return 2;
+
+	*shname = elf_strptr(elf, ehdr->e_shstrndx, shdr->sh_name);
+	if (!*shname || !shdr->sh_size)
+		return 3;
+
+	*data = elf_getdata(scn, 0);
+	if (!*data || elf_getdata(scn, *data) != NULL)
+		return 4;
+
+	return 0;
+}
+
+static struct bpf_map_fixup fixup[128];
+
+static int parse_relo_into_map_fixup(Elf_Data *data, Elf_Data *symbols,
+				     GElf_Shdr *shdr)
+{
+	GElf_Sym sym;
+	GElf_Rel rel;
+	int i, nrels;
+
+	nrels = shdr->sh_size / shdr->sh_entsize;
+
+	for (i = 0; i < nrels; i++) {
+		gelf_getrel(data, i, &rel);
+
+		fixup[i].insn_idx = rel.r_offset / sizeof(struct bpf_insn);
+
+		gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym);
+
+		fixup[i].fd = map_fd[sym.st_value / sizeof(struct bpf_map_def)];
+	}
+
+	return nrels;
+}
+
+int load_bpf_file(char *path)
+{
+	int fd, i, nrels;
+	Elf *elf;
+	GElf_Ehdr ehdr;
+	GElf_Shdr shdr, shdr_prog;
+	Elf_Data *data, *data_prog, *symbols = NULL;
+	char *shname, *shname_prog;
+
+	if (elf_version(EV_CURRENT) == EV_NONE)
+		return 1;
+
+	fd = open(path, O_RDONLY, 0);
+	if (fd < 0)
+		return 1;
+
+	elf = elf_begin(fd, ELF_C_READ, NULL);
+
+	if (!elf)
+		return 1;
+
+	if (gelf_getehdr(elf, &ehdr) != &ehdr)
+		return 1;
+
+	/* scan over all elf sections to get license and map info */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+
+		if (0)
+			printf("section %d:%s data %p size %zd link %d flags %d\n",
+			       i, shname, data->d_buf, data->d_size,
+			       shdr.sh_link, (int) shdr.sh_flags);
+
+		if (strcmp(shname, "license") == 0) {
+			processed_sec[i] = true;
+			memcpy(license, data->d_buf, data->d_size);
+		} else if (strcmp(shname, "maps") == 0) {
+			processed_sec[i] = true;
+			if (load_maps(data->d_buf, data->d_size))
+				return 1;
+		} else if (shdr.sh_type == SHT_SYMTAB) {
+			symbols = data;
+		}
+	}
+
+	/* load programs that need map fixup (relocations) */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+		if (shdr.sh_type == SHT_REL) {
+
+			if (get_sec(elf, shdr.sh_info, &ehdr, &shname_prog,
+				    &shdr_prog, &data_prog))
+				continue;
+
+			if (0)
+				printf("relo %s into %s\n", shname, shname_prog);
+
+			processed_sec[shdr.sh_info] = true;
+			processed_sec[i] = true;
+
+			nrels = parse_relo_into_map_fixup(data, symbols, &shdr);
+
+			if (memcmp(shname_prog, "events/", sizeof("events/") - 1) == 0)
+				load_and_attach(shname_prog,
+						data_prog->d_buf, data_prog->d_size,
+						fixup, nrels * sizeof(fixup[0]));
+		}
+	}
+
+	/* load programs that don't use maps */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (processed_sec[i])
+			continue;
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+
+		if (memcmp(shname, "events/", sizeof("events/") - 1) == 0)
+			load_and_attach(shname, data->d_buf, data->d_size,
+					NULL, 0);
+	}
+
+	close(fd);
+	return 0;
+}
+
+void read_trace_pipe(void)
+{
+	int trace_fd;
+
+	trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
+	if (trace_fd < 0)
+		return;
+
+	while (1) {
+		static char buf[4096];
+		ssize_t sz;
+
+		sz = read(trace_fd, buf, sizeof(buf) - 1);
+		if (sz > 0) {
+			buf[sz] = 0;
+			puts(buf);
+		}
+	}
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
new file mode 100644
index 000000000000..d34e61a101d2
--- /dev/null
+++ b/samples/bpf/bpf_load.h
@@ -0,0 +1,18 @@
+#ifndef __BPF_LOAD_H
+#define __BPF_LOAD_H
+
+#define MAX_MAPS 64
+
+extern int map_fd[MAX_MAPS];
+
+/* parses elf file compiled by llvm .c->.o
+ * creates maps and stores FD into map_fd array
+ * loads eBPF programs
+ * returns zero on success
+ */
+int load_bpf_file(char *path);
+
+/* forever reads /sys/.../trace_pipe */
+void read_trace_pipe(void);
+
+#endif
-- 
1.7.9.5


* [PATCH RFC v3 net-next 2/3] samples: bpf: eBPF example in C
  2014-07-22  3:45 [PATCH RFC v3 net-next 0/3] eBPF examples in C Alexei Starovoitov
  2014-07-22  3:45 ` [PATCH RFC v3 net-next 1/3] samples: bpf: elf file loader Alexei Starovoitov
@ 2014-07-22  3:45 ` Alexei Starovoitov
  2014-07-22  3:45 ` [PATCH RFC v3 net-next 3/3] samples: bpf: eBPF dropmon " Alexei Starovoitov
  2 siblings, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2014-07-22  3:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

ex1_kern.c - C program which will be compiled into eBPF
to filter netif_receive_skb events on skb->dev->name == "lo"

ex1_user.c - corresponding user space component that
forever reads /sys/.../trace_pipe

Usage:
$ sudo ex1 ex1_kern.o

should see:
writing bpf-4 -> /sys/kernel/debug/tracing/events/net/netif_receive_skb/filter
  ping-25476 [001] ..s3  5639.718218: __netif_receive_skb_core: skb 4d51700 dev b9e6000
  ping-25476 [001] ..s3  5639.718262: __netif_receive_skb_core: skb 4d51400 dev b9e6000

  ping-25476 [002] ..s3  5640.716233: __netif_receive_skb_core: skb 5d06500 dev b9e6000
  ping-25476 [002] ..s3  5640.716272: __netif_receive_skb_core: skb 5d06300 dev b9e6000

Ctrl-C at any time, kernel will auto cleanup

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/Makefile   |   15 +++++++++++++--
 samples/bpf/ex1_kern.c |   27 +++++++++++++++++++++++++++
 samples/bpf/ex1_user.c |   11 +++++++++++
 3 files changed, 51 insertions(+), 2 deletions(-)
 create mode 100644 samples/bpf/ex1_kern.c
 create mode 100644 samples/bpf/ex1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index caf1ab93b37c..c97f408fcd6d 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,12 +2,23 @@
 obj- := dummy.o
 
 # List of programs to build
-hostprogs-y := sock_example dropmon
+hostprogs-y := sock_example dropmon ex1
 
 sock_example-objs := sock_example.o libbpf.o
 dropmon-objs := dropmon.o libbpf.o
+ex1-objs := bpf_load.o libbpf.o ex1_user.o
 
 # Tell kbuild to always build the programs
-always := $(hostprogs-y)
+always := $(hostprogs-y) ex1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
+
+HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
+HOSTLOADLIBES_ex1 += -lelf
+
+LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
+
+%.o: %.c
+	clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
+		-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
+		-O2 -emit-llvm -c $< -o -| $(LLC) -o $@
diff --git a/samples/bpf/ex1_kern.c b/samples/bpf/ex1_kern.c
new file mode 100644
index 000000000000..3fa7e4db0be6
--- /dev/null
+++ b/samples/bpf/ex1_kern.c
@@ -0,0 +1,27 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+SEC("events/net/netif_receive_skb")
+int bpf_prog1(struct bpf_context *ctx)
+{
+	/*
+	 * attaches to /sys/kernel/debug/tracing/events/net/netif_receive_skb
+	 * prints events for loopback device only
+	 */
+	char devname[] = "lo";
+	struct net_device *dev;
+	struct sk_buff *skb = 0;
+
+	skb = (struct sk_buff *) ctx->arg1;
+	dev = bpf_load_pointer(&skb->dev);
+	if (bpf_memcmp(dev->name, devname, 2) == 0) {
+		char fmt[] = "skb %x dev %x\n";
+		bpf_printk(fmt, sizeof(fmt), skb, dev);
+	}
+	return 0;
+}
+
+char license[] SEC("license") = "GPL";
diff --git a/samples/bpf/ex1_user.c b/samples/bpf/ex1_user.c
new file mode 100644
index 000000000000..d90c543ba30c
--- /dev/null
+++ b/samples/bpf/ex1_user.c
@@ -0,0 +1,11 @@
+#include "bpf_load.h"
+
+int main(int ac, char **argv)
+{
+	if (load_bpf_file(argv[1]))
+		return 1;
+
+	read_trace_pipe();
+
+	return 0;
+}
-- 
1.7.9.5


* [PATCH RFC v3 net-next 3/3] samples: bpf: eBPF dropmon example in C
  2014-07-22  3:45 [PATCH RFC v3 net-next 0/3] eBPF examples in C Alexei Starovoitov
  2014-07-22  3:45 ` [PATCH RFC v3 net-next 1/3] samples: bpf: elf file loader Alexei Starovoitov
  2014-07-22  3:45 ` [PATCH RFC v3 net-next 2/3] samples: bpf: eBPF example in C Alexei Starovoitov
@ 2014-07-22  3:45 ` Alexei Starovoitov
       [not found]   ` <1406000723-4872-4-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
  2 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2014-07-22  3:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

Semantically this example is the same as dropmon.c, which has the eBPF program in assembler.

In this example both eBPF and user space are in C:

ex2_kern.c - C code that will compile into eBPF and will
  be attached to kfree_skb event
ex2_user.c - corresponding user space part that iterates the map
  populated by in-kernel eBPF code

Usage:
$ sudo ex2 ex2_kern.o

Should see:
writing bpf-5 -> /sys/kernel/debug/tracing/events/skb/kfree_skb/filter
location 0xffffffff816efc67 count 1

location 0xffffffff815d8030 count 1
location 0xffffffff816efc67 count 3

location 0xffffffff815d8030 count 4
location 0xffffffff816efc67 count 9

location 0xffffffff815d8030 count 7
location 0xffffffff816efc67 count 15

location 0xffffffff815d8030 count 10
location 0xffffffff816efc67 count 23

location 0xffffffff815d8030 count 13
location 0xffffffff816efc67 count 29

Ctrl-C at any time. Kernel will auto cleanup maps and programs

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/Makefile   |    6 ++++--
 samples/bpf/ex2_kern.c |   29 +++++++++++++++++++++++++++++
 samples/bpf/ex2_user.c |   28 ++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 2 deletions(-)
 create mode 100644 samples/bpf/ex2_kern.c
 create mode 100644 samples/bpf/ex2_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index c97f408fcd6d..b865a5df5c60 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,19 +2,21 @@
 obj- := dummy.o
 
 # List of programs to build
-hostprogs-y := sock_example dropmon ex1
+hostprogs-y := sock_example dropmon ex1 ex2
 
 sock_example-objs := sock_example.o libbpf.o
 dropmon-objs := dropmon.o libbpf.o
 ex1-objs := bpf_load.o libbpf.o ex1_user.o
+ex2-objs := bpf_load.o libbpf.o ex2_user.o
 
 # Tell kbuild to always build the programs
-always := $(hostprogs-y) ex1_kern.o
+always := $(hostprogs-y) ex1_kern.o ex2_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
 HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
 HOSTLOADLIBES_ex1 += -lelf
+HOSTLOADLIBES_ex2 += -lelf
 
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
 
diff --git a/samples/bpf/ex2_kern.c b/samples/bpf/ex2_kern.c
new file mode 100644
index 000000000000..b4a8e73a3f9d
--- /dev/null
+++ b/samples/bpf/ex2_kern.c
@@ -0,0 +1,29 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(long),
+	.value_size = sizeof(long),
+	.max_entries = 1024,
+};
+
+SEC("events/skb/kfree_skb")
+int bpf_prog2(struct bpf_context *ctx)
+{
+	long loc = ctx->arg2;
+	long init_val = 1;
+	void *value;
+
+	value = bpf_map_lookup_elem(&my_map, &loc);
+	if (value)
+		(*(long *) value) += 1;
+	else
+		bpf_map_update_elem(&my_map, &loc, &init_val);
+	return 0;
+}
+
+char license[] SEC("license") = "GPL";
diff --git a/samples/bpf/ex2_user.c b/samples/bpf/ex2_user.c
new file mode 100644
index 000000000000..8b0fc91f83ca
--- /dev/null
+++ b/samples/bpf/ex2_user.c
@@ -0,0 +1,28 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int ac, char **argv)
+{
+	long key, next_key, value;
+	int i;
+
+	if (load_bpf_file(argv[1]))
+		return 1;
+
+	for (i = 0; i < 10; i++) {
+		key = 0;
+		while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
+			bpf_lookup_elem(map_fd[0], &next_key, &value);
+			printf("location 0x%lx count %ld\n", next_key, value);
+			key = next_key;
+		}
+		if (key)
+			printf("\n");
+		sleep(1);
+	}
+
+	return 0;
+}
-- 
1.7.9.5


* Re: [PATCH RFC v3 net-next 3/3] samples: bpf: eBPF dropmon example in C
       [not found]   ` <1406000723-4872-4-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
@ 2014-07-30 15:45     ` Frank Ch. Eigler
  2014-07-30 17:17       ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Frank Ch. Eigler @ 2014-07-30 15:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Kees Cook,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA


ast wrote earlier:

> [...]
> The dtrace/systemtap/ktap approach is to use one script file that provides all
> desired functionality. That architectural decision overcomplicated their
> implementations.
>
> eBPF follows a split model: everything that needs to process millions of events
> per second runs in the kernel and must be short and deterministic;
> everything else, like aggregation and nice graphs, runs in user space.
> [...]

For the record, this is not entirely accurate as to dtrace.  dtrace
delegates aggregation and most reporting to userspace.  Also,
systemtap is "short and deterministic" even for aggregations & nice
graphs, but since it limits its storage & cpu consumption, its
arrays/reports cannot get super large.


> [...]
> +SEC("events/skb/kfree_skb")
> +int bpf_prog2(struct bpf_context *ctx)
> +{
> +[...]
> +	value = bpf_map_lookup_elem(&my_map, &loc);
> +	if (value)
> +		(*(long *) value) += 1;
> +	else
> +		bpf_map_update_elem(&my_map, &loc, &init_val);
> +	return 0;
> +}

What kind of locking/serialization is provided by the ebpf runtime
over shared variables such as my_map?


- FChE


* Re: [PATCH RFC v3 net-next 3/3] samples: bpf: eBPF dropmon example in C
  2014-07-30 15:45     ` Frank Ch. Eigler
@ 2014-07-30 17:17       ` Alexei Starovoitov
  2014-07-30 17:36         ` Frank Ch. Eigler
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2014-07-30 17:17 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Kees Cook,
	Linux API, Network Development, LKML

On Wed, Jul 30, 2014 at 8:45 AM, Frank Ch. Eigler <fche@redhat.com> wrote:
> For the record, this is not entirely accurate as to dtrace.  dtrace
> delegates aggregation and most reporting to userspace.  Also,
> systemtap is "short and deterministic" even for aggregations & nice
> graphs, but since it limits its storage & cpu consumption, its
> arrays/reports cannot get super large.

My understanding of systemtap is that the whole .stp script is converted
to C, compiled as .ko and loaded, so all map walking and prints are
happening in the kernel. Similarly for ktap which has special functions
in kernel to print histograms.
I thought dtrace printf are also happening from the kernel. What is the
trick they use to know which pieces of dtrace script should be run in
user space?
In ebpf examples there are two C files: one for kernel with ebpf isa
and one for userspace as native. I thought about combining them,
but couldn't figure out a clean way of doing it.

>> [...]
>> +SEC("events/skb/kfree_skb")
>> +int bpf_prog2(struct bpf_context *ctx)
>> +{
>> +[...]
>> +     value = bpf_map_lookup_elem(&my_map, &loc);
>> +     if (value)
>> +             (*(long *) value) += 1;
>> +     else
>> +             bpf_map_update_elem(&my_map, &loc, &init_val);
>> +     return 0;
>> +}
>
> What kind of locking/serialization is provided by the ebpf runtime
> over shared variables such as my_map?

it's traditional rcu scheme.
Programs are running under rcu_read_lock(), so that
bpf_map_lookup_elem() can return pointer to map value which
won't disappear while program is running.
In-kernel map implementation needs to use rcu style to match
ebpf program assumptions. map implementation is enforcing
the limit to the number of elements.
I didn't post 'array' type of map yet. bpf_map_lookup in this
implementation will just return 'base + index' pointer.
Regardless of type of map the same ebpf program running on
different cpus may lookup the same 'key' and receive the same
map value pointer. In such case concurrent write access to
map value can be done with bpf_xadd instruction, though
using normal read/write is also allowed. In some cases
the speed of racy var++ is preferred over 'lock xadd'.
There are no lock/unlock function helpers available to ebpf
programs, since program may terminate early with div by zero
for example, so in-kernel lock helper implementation would
be complicated and slow. It's possible to do, but for the use
cases so far there is no need.
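
(Purely as an illustrative aside, a minimal sketch of what an atomic variant of the
ex2_kern.c counter could look like, assuming the LLVM backend lowers
__sync_fetch_and_add() to the bpf_xadd instruction mentioned above; the posted
example deliberately uses the racy increment instead:)

SEC("events/skb/kfree_skb")
int bpf_prog2_atomic(struct bpf_context *ctx)
{
	long loc = ctx->arg2;		/* kfree_skb call site, as in ex2_kern.c */
	long init_val = 1;
	long *value;

	value = bpf_map_lookup_elem(&my_map, &loc);
	if (value)
		/* assumed to be emitted as a single bpf_xadd */
		__sync_fetch_and_add(value, 1);
	else
		bpf_map_update_elem(&my_map, &loc, &init_val);
	return 0;
}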


* Re: [PATCH RFC v3 net-next 3/3] samples: bpf: eBPF dropmon example in C
  2014-07-30 17:17       ` Alexei Starovoitov
@ 2014-07-30 17:36         ` Frank Ch. Eigler
       [not found]           ` <20140730173638.GB8745-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Frank Ch. Eigler @ 2014-07-30 17:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Kees Cook,
	Linux API, Network Development, LKML

Hi, Alexei -

> My understanding of systemtap is that the whole .stp script is converted
> to C, compiled as .ko and loaded, so all map walking and prints are
> happening in the kernel. Similarly for ktap which has special functions
> in kernel to print histograms.

That is correct.

> I thought dtrace printf are also happening from the kernel. What is
> the trick they use to know which pieces of dtrace script should be
> run in user space?

It appears as though the bytecode language running in the kernel sends
some action commands back out to userspace, not just plain data.


> In ebpf examples there are two C files: one for kernel with ebpf isa
> and one for userspace as native. I thought about combining them,
> but couldn't figure out a clean way of doing it.

(#if ?)
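
(For what it's worth, a minimal sketch of how such a combined file could be laid out,
assuming a hypothetical __BPF__ define that the build would pass only on the
clang/llc side; nothing in the posted Makefile provides this:)

/* combined_ex.c -- hypothetical single-source layout */
#ifdef __BPF__				/* set only for the clang -> llc pass */
#include <uapi/linux/bpf.h>
#include <trace/bpf_trace.h>
#include "bpf_helpers.h"

SEC("events/net/netif_receive_skb")
int bpf_prog1(struct bpf_context *ctx)
{
	/* ... eBPF half, as in ex1_kern.c ... */
	return 0;
}

char license[] SEC("license") = "GPL";
#else					/* native host pass */
#include "bpf_load.h"

int main(int ac, char **argv)
{
	if (load_bpf_file(argv[1]))
		return 1;
	read_trace_pipe();
	return 0;
}
#endif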


> > What kind of locking/serialization is provided by the ebpf runtime
> > over shared variables such as my_map?
> 
> it's traditional rcu scheme.

OK, that protects the table structure, but:

> [...] In such case concurrent write access to map value can be done
> with bpf_xadd instruction, though using normal read/write is also
> allowed. In some cases the speed of racy var++ is preferred over
> 'lock xadd'.

... so concurrency control over shared values is left up to the
programmer.

> There are no lock/unlock function helpers available to ebpf
> programs, since program may terminate early with div by zero
> for example, so in-kernel lock helper implementation would
> be complicated and slow. It's possible to do, but for the use
> cases so far there is no need.

OK, I hope that works out.  I've been told that dtrace does something
similar (!) by eschewing protection on global variables such as
strings.  In their case it's less bad than it sounds, because they are
used to offloading computation to userspace or to storing only
thread-local state, and accept the corollary limitations on control.

(Systemtap does fully & automatically protect shared variables, even
in the face of run-time script errors.)


- FChE


* Re: [PATCH RFC v3 net-next 3/3] samples: bpf: eBPF dropmon example in C
       [not found]           ` <20140730173638.GB8745-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-07-30 18:53             ` Alexei Starovoitov
  0 siblings, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2014-07-30 18:53 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Kees Cook,
	Linux API, Network Development, LKML

On Wed, Jul 30, 2014 at 10:36 AM, Frank Ch. Eigler <fche-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> > What kind of locking/serialization is provided by the ebpf runtime
>> > over shared variables such as my_map?
>>
>> it's traditional rcu scheme.
>
> OK, that protects the table structure, but:
>
>> [...] In such case concurrent write access to map value can be done
>> with bpf_xadd instruction, though using normal read/write is also
>> allowed. In some cases the speed of racy var++ is preferred over
>> 'lock xadd'.
>
> ... so concurrency control over shared values is left up to the
> programmer.

yes. It has to be flexible and fast.
One of our main use cases is network analytics, where a lot of
packets go through eBPF programs, so every cycle counts.
Mandatory locks in the critical path are not acceptable. If we add
locks, they will be optional.

>> There are no lock/unlock function helpers available to ebpf
>> programs, since program may terminate early with div by zero
>> for example, so in-kernel lock helper implementation would
>> be complicated and slow. It's possible to do, but for the use
>> cases so far there is no need.
>
> OK, I hope that works out.  I've been told that dtrace does something
> similar (!) by eschewing protection on global variables such as
> strings.  In their case it's less bad than it sounds, because they are
> used to offloading computation to userspace or to storing only
> thread-local state, and accept the corollary limitations on control.

interesting.
btw, things like global variables and per-cpu storage are potential eBPF
features. So far they're 'nice to have' rather than 'mandatory'.
The maps are powerful enough to do the same:
global storage is a map of one element;
per-cpu storage is a map of num_cpu elements.
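
(A minimal, hypothetical sketch of the 'map of one element' idea, reusing the helpers
and hash map type from patch 3/3 with a fixed key of 0 -- illustrative only, since the
array map type mentioned above is not part of this series:)

struct bpf_map_def SEC("maps") global_cnt = {
	.type = BPF_MAP_TYPE_HASH,	/* stand-in until an array type exists */
	.key_size = sizeof(int),
	.value_size = sizeof(long),
	.max_entries = 1,
};

SEC("events/skb/kfree_skb")
int bpf_count_all(struct bpf_context *ctx)
{
	int key = 0;			/* the single "global" slot */
	long init_val = 1;
	long *value;

	value = bpf_map_lookup_elem(&global_cnt, &key);
	if (value)
		(*value)++;
	else
		bpf_map_update_elem(&global_cnt, &key, &init_val);
	return 0;
}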

