netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH bpf-next v3 0/7] Implement bpf queue/stack maps
@ 2018-09-18  4:52 Mauricio Vasquez B
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 1/7] bpf: rename stack trace map Mauricio Vasquez B
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Mauricio Vasquez B @ 2018-09-18  4:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev; +Cc: Yonghong Song

In some applications this is needed have a pool of free elements, like for
example the list of free L4 ports in a SNAT.  None of the current maps allow
to do it as it is not possibleto get an any element without having they key
it is associated to.

This patchset implements two new kind of eBPF maps: queue and stack.
Those maps provide to eBPF programs the peek, push and pop operations, and for
userspace applications a new bpf_map_lookup_and_delete_elem() is added.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>

v2 -> v3:
 - Return elements by value instead that by reference
 - Implement queue/stack base on array and head + tail indexes
 - Rename stack trace related files to avoid confusion and conflicts

v1 -> v2:
 - Create two separate maps instead of single one + flags
 - Implement bpf_map_lookup_and_delete syscall
 - Support peek operation
 - Define replacement policy through flags in the update() method
 - Add eBPF side tests

---

Mauricio Vasquez B (7):
      bpf: rename stack trace map
      bpf/syscall: allow key to be null in map functions
      bpf: add lookup_and_delete map operation
      bpf: add bpf queue and stack maps
      bpf: restrict use of peek/push/pop
      Sync uapi/bpf.h to tools/include
      selftests/bpf: add test cases for queue and stack maps


 include/linux/bpf.h                                |    4 
 include/linux/bpf_types.h                          |    4 
 include/uapi/linux/bpf.h                           |   31 +
 kernel/bpf/Makefile                                |    4 
 kernel/bpf/core.c                                  |    3 
 kernel/bpf/helpers.c                               |   98 +++
 kernel/bpf/queue_stack_maps.c                      |  291 +++++++++
 kernel/bpf/stackmap.c                              |  624 --------------------
 kernel/bpf/stacktracemap.c                         |  624 ++++++++++++++++++++
 kernel/bpf/syscall.c                               |  101 +++
 kernel/bpf/verifier.c                              |   19 +
 net/core/filter.c                                  |    6 
 tools/include/uapi/linux/bpf.h                     |   31 +
 tools/lib/bpf/bpf.c                                |   12 
 tools/lib/bpf/bpf.h                                |    1 
 tools/testing/selftests/bpf/Makefile               |    5 
 tools/testing/selftests/bpf/bpf_helpers.h          |    7 
 tools/testing/selftests/bpf/test_maps.c            |  130 ++++
 tools/testing/selftests/bpf/test_progs.c           |   99 +++
 tools/testing/selftests/bpf/test_queue_map.c       |    4 
 tools/testing/selftests/bpf/test_queue_stack_map.h |   59 ++
 tools/testing/selftests/bpf/test_stack_map.c       |    4 
 22 files changed, 1526 insertions(+), 635 deletions(-)
 create mode 100644 kernel/bpf/queue_stack_maps.c
 delete mode 100644 kernel/bpf/stackmap.c
 create mode 100644 kernel/bpf/stacktracemap.c
 create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
 create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
 create mode 100644 tools/testing/selftests/bpf/test_stack_map.c

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH bpf-next v3 1/7] bpf: rename stack trace map
  2018-09-18  4:52 [RFC PATCH bpf-next v3 0/7] Implement bpf queue/stack maps Mauricio Vasquez B
@ 2018-09-18  4:52 ` Mauricio Vasquez B
  2018-09-18 23:05   ` Alexei Starovoitov
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 2/7] bpf/syscall: allow key to be null in map functions Mauricio Vasquez B
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Mauricio Vasquez B @ 2018-09-18  4:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev; +Cc: Yonghong Song

In the following patches queue and stack maps (FIFO and LIFO
datastructures) will be implemented.  In order to avoid confusion and
a possible name clash rename stackmap.c to stacktracemap.c and
stack_map_ops to stack_trace_map_ops

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
 include/linux/bpf_types.h  |    2 
 kernel/bpf/Makefile        |    2 
 kernel/bpf/stackmap.c      |  624 --------------------------------------------
 kernel/bpf/stacktracemap.c |  624 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 626 insertions(+), 626 deletions(-)
 delete mode 100644 kernel/bpf/stackmap.c
 create mode 100644 kernel/bpf/stacktracemap.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c090e7c0..33f7f574b983 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -49,7 +49,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_HASH, htab_lru_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_PERCPU_HASH, htab_lru_percpu_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_LPM_TRIE, trie_map_ops)
 #ifdef CONFIG_PERF_EVENTS
-BPF_MAP_TYPE(BPF_MAP_TYPE_STACK_TRACE, stack_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STACK_TRACE, stack_trace_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY_OF_MAPS, array_of_maps_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops)
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 0488b8258321..e656bce87c8f 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -20,7 +20,7 @@ endif
 endif
 endif
 ifeq ($(CONFIG_PERF_EVENTS),y)
-obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
+obj-$(CONFIG_BPF_SYSCALL) += stacktracemap.o
 endif
 obj-$(CONFIG_CGROUP_BPF) += cgroup.o
 ifeq ($(CONFIG_INET),y)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
deleted file mode 100644
index 8061a439ef18..000000000000
--- a/kernel/bpf/stackmap.c
+++ /dev/null
@@ -1,624 +0,0 @@
-/* Copyright (c) 2016 Facebook
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of version 2 of the GNU General Public
- * License as published by the Free Software Foundation.
- */
-#include <linux/bpf.h>
-#include <linux/jhash.h>
-#include <linux/filter.h>
-#include <linux/stacktrace.h>
-#include <linux/perf_event.h>
-#include <linux/elf.h>
-#include <linux/pagemap.h>
-#include <linux/irq_work.h>
-#include "percpu_freelist.h"
-
-#define STACK_CREATE_FLAG_MASK					\
-	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY |	\
-	 BPF_F_STACK_BUILD_ID)
-
-struct stack_map_bucket {
-	struct pcpu_freelist_node fnode;
-	u32 hash;
-	u32 nr;
-	u64 data[];
-};
-
-struct bpf_stack_map {
-	struct bpf_map map;
-	void *elems;
-	struct pcpu_freelist freelist;
-	u32 n_buckets;
-	struct stack_map_bucket *buckets[];
-};
-
-/* irq_work to run up_read() for build_id lookup in nmi context */
-struct stack_map_irq_work {
-	struct irq_work irq_work;
-	struct rw_semaphore *sem;
-};
-
-static void do_up_read(struct irq_work *entry)
-{
-	struct stack_map_irq_work *work;
-
-	work = container_of(entry, struct stack_map_irq_work, irq_work);
-	up_read(work->sem);
-	work->sem = NULL;
-}
-
-static DEFINE_PER_CPU(struct stack_map_irq_work, up_read_work);
-
-static inline bool stack_map_use_build_id(struct bpf_map *map)
-{
-	return (map->map_flags & BPF_F_STACK_BUILD_ID);
-}
-
-static inline int stack_map_data_size(struct bpf_map *map)
-{
-	return stack_map_use_build_id(map) ?
-		sizeof(struct bpf_stack_build_id) : sizeof(u64);
-}
-
-static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
-{
-	u32 elem_size = sizeof(struct stack_map_bucket) + smap->map.value_size;
-	int err;
-
-	smap->elems = bpf_map_area_alloc(elem_size * smap->map.max_entries,
-					 smap->map.numa_node);
-	if (!smap->elems)
-		return -ENOMEM;
-
-	err = pcpu_freelist_init(&smap->freelist);
-	if (err)
-		goto free_elems;
-
-	pcpu_freelist_populate(&smap->freelist, smap->elems, elem_size,
-			       smap->map.max_entries);
-	return 0;
-
-free_elems:
-	bpf_map_area_free(smap->elems);
-	return err;
-}
-
-/* Called from syscall */
-static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
-{
-	u32 value_size = attr->value_size;
-	struct bpf_stack_map *smap;
-	u64 cost, n_buckets;
-	int err;
-
-	if (!capable(CAP_SYS_ADMIN))
-		return ERR_PTR(-EPERM);
-
-	if (attr->map_flags & ~STACK_CREATE_FLAG_MASK)
-		return ERR_PTR(-EINVAL);
-
-	/* check sanity of attributes */
-	if (attr->max_entries == 0 || attr->key_size != 4 ||
-	    value_size < 8 || value_size % 8)
-		return ERR_PTR(-EINVAL);
-
-	BUILD_BUG_ON(sizeof(struct bpf_stack_build_id) % sizeof(u64));
-	if (attr->map_flags & BPF_F_STACK_BUILD_ID) {
-		if (value_size % sizeof(struct bpf_stack_build_id) ||
-		    value_size / sizeof(struct bpf_stack_build_id)
-		    > sysctl_perf_event_max_stack)
-			return ERR_PTR(-EINVAL);
-	} else if (value_size / 8 > sysctl_perf_event_max_stack)
-		return ERR_PTR(-EINVAL);
-
-	/* hash table size must be power of 2 */
-	n_buckets = roundup_pow_of_two(attr->max_entries);
-
-	cost = n_buckets * sizeof(struct stack_map_bucket *) + sizeof(*smap);
-	if (cost >= U32_MAX - PAGE_SIZE)
-		return ERR_PTR(-E2BIG);
-
-	smap = bpf_map_area_alloc(cost, bpf_map_attr_numa_node(attr));
-	if (!smap)
-		return ERR_PTR(-ENOMEM);
-
-	err = -E2BIG;
-	cost += n_buckets * (value_size + sizeof(struct stack_map_bucket));
-	if (cost >= U32_MAX - PAGE_SIZE)
-		goto free_smap;
-
-	bpf_map_init_from_attr(&smap->map, attr);
-	smap->map.value_size = value_size;
-	smap->n_buckets = n_buckets;
-	smap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
-
-	err = bpf_map_precharge_memlock(smap->map.pages);
-	if (err)
-		goto free_smap;
-
-	err = get_callchain_buffers(sysctl_perf_event_max_stack);
-	if (err)
-		goto free_smap;
-
-	err = prealloc_elems_and_freelist(smap);
-	if (err)
-		goto put_buffers;
-
-	return &smap->map;
-
-put_buffers:
-	put_callchain_buffers();
-free_smap:
-	bpf_map_area_free(smap);
-	return ERR_PTR(err);
-}
-
-#define BPF_BUILD_ID 3
-/*
- * Parse build id from the note segment. This logic can be shared between
- * 32-bit and 64-bit system, because Elf32_Nhdr and Elf64_Nhdr are
- * identical.
- */
-static inline int stack_map_parse_build_id(void *page_addr,
-					   unsigned char *build_id,
-					   void *note_start,
-					   Elf32_Word note_size)
-{
-	Elf32_Word note_offs = 0, new_offs;
-
-	/* check for overflow */
-	if (note_start < page_addr || note_start + note_size < note_start)
-		return -EINVAL;
-
-	/* only supports note that fits in the first page */
-	if (note_start + note_size > page_addr + PAGE_SIZE)
-		return -EINVAL;
-
-	while (note_offs + sizeof(Elf32_Nhdr) < note_size) {
-		Elf32_Nhdr *nhdr = (Elf32_Nhdr *)(note_start + note_offs);
-
-		if (nhdr->n_type == BPF_BUILD_ID &&
-		    nhdr->n_namesz == sizeof("GNU") &&
-		    nhdr->n_descsz == BPF_BUILD_ID_SIZE) {
-			memcpy(build_id,
-			       note_start + note_offs +
-			       ALIGN(sizeof("GNU"), 4) + sizeof(Elf32_Nhdr),
-			       BPF_BUILD_ID_SIZE);
-			return 0;
-		}
-		new_offs = note_offs + sizeof(Elf32_Nhdr) +
-			ALIGN(nhdr->n_namesz, 4) + ALIGN(nhdr->n_descsz, 4);
-		if (new_offs <= note_offs)  /* overflow */
-			break;
-		note_offs = new_offs;
-	}
-	return -EINVAL;
-}
-
-/* Parse build ID from 32-bit ELF */
-static int stack_map_get_build_id_32(void *page_addr,
-				     unsigned char *build_id)
-{
-	Elf32_Ehdr *ehdr = (Elf32_Ehdr *)page_addr;
-	Elf32_Phdr *phdr;
-	int i;
-
-	/* only supports phdr that fits in one page */
-	if (ehdr->e_phnum >
-	    (PAGE_SIZE - sizeof(Elf32_Ehdr)) / sizeof(Elf32_Phdr))
-		return -EINVAL;
-
-	phdr = (Elf32_Phdr *)(page_addr + sizeof(Elf32_Ehdr));
-
-	for (i = 0; i < ehdr->e_phnum; ++i)
-		if (phdr[i].p_type == PT_NOTE)
-			return stack_map_parse_build_id(page_addr, build_id,
-					page_addr + phdr[i].p_offset,
-					phdr[i].p_filesz);
-	return -EINVAL;
-}
-
-/* Parse build ID from 64-bit ELF */
-static int stack_map_get_build_id_64(void *page_addr,
-				     unsigned char *build_id)
-{
-	Elf64_Ehdr *ehdr = (Elf64_Ehdr *)page_addr;
-	Elf64_Phdr *phdr;
-	int i;
-
-	/* only supports phdr that fits in one page */
-	if (ehdr->e_phnum >
-	    (PAGE_SIZE - sizeof(Elf64_Ehdr)) / sizeof(Elf64_Phdr))
-		return -EINVAL;
-
-	phdr = (Elf64_Phdr *)(page_addr + sizeof(Elf64_Ehdr));
-
-	for (i = 0; i < ehdr->e_phnum; ++i)
-		if (phdr[i].p_type == PT_NOTE)
-			return stack_map_parse_build_id(page_addr, build_id,
-					page_addr + phdr[i].p_offset,
-					phdr[i].p_filesz);
-	return -EINVAL;
-}
-
-/* Parse build ID of ELF file mapped to vma */
-static int stack_map_get_build_id(struct vm_area_struct *vma,
-				  unsigned char *build_id)
-{
-	Elf32_Ehdr *ehdr;
-	struct page *page;
-	void *page_addr;
-	int ret;
-
-	/* only works for page backed storage  */
-	if (!vma->vm_file)
-		return -EINVAL;
-
-	page = find_get_page(vma->vm_file->f_mapping, 0);
-	if (!page)
-		return -EFAULT;	/* page not mapped */
-
-	ret = -EINVAL;
-	page_addr = page_address(page);
-	ehdr = (Elf32_Ehdr *)page_addr;
-
-	/* compare magic x7f "ELF" */
-	if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0)
-		goto out;
-
-	/* only support executable file and shared object file */
-	if (ehdr->e_type != ET_EXEC && ehdr->e_type != ET_DYN)
-		goto out;
-
-	if (ehdr->e_ident[EI_CLASS] == ELFCLASS32)
-		ret = stack_map_get_build_id_32(page_addr, build_id);
-	else if (ehdr->e_ident[EI_CLASS] == ELFCLASS64)
-		ret = stack_map_get_build_id_64(page_addr, build_id);
-out:
-	put_page(page);
-	return ret;
-}
-
-static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
-					  u64 *ips, u32 trace_nr, bool user)
-{
-	int i;
-	struct vm_area_struct *vma;
-	bool irq_work_busy = false;
-	struct stack_map_irq_work *work = NULL;
-
-	if (in_nmi()) {
-		work = this_cpu_ptr(&up_read_work);
-		if (work->irq_work.flags & IRQ_WORK_BUSY)
-			/* cannot queue more up_read, fallback */
-			irq_work_busy = true;
-	}
-
-	/*
-	 * We cannot do up_read() in nmi context. To do build_id lookup
-	 * in nmi context, we need to run up_read() in irq_work. We use
-	 * a percpu variable to do the irq_work. If the irq_work is
-	 * already used by another lookup, we fall back to report ips.
-	 *
-	 * Same fallback is used for kernel stack (!user) on a stackmap
-	 * with build_id.
-	 */
-	if (!user || !current || !current->mm || irq_work_busy ||
-	    down_read_trylock(&current->mm->mmap_sem) == 0) {
-		/* cannot access current->mm, fall back to ips */
-		for (i = 0; i < trace_nr; i++) {
-			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
-			id_offs[i].ip = ips[i];
-		}
-		return;
-	}
-
-	for (i = 0; i < trace_nr; i++) {
-		vma = find_vma(current->mm, ips[i]);
-		if (!vma || stack_map_get_build_id(vma, id_offs[i].build_id)) {
-			/* per entry fall back to ips */
-			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
-			id_offs[i].ip = ips[i];
-			continue;
-		}
-		id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i]
-			- vma->vm_start;
-		id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
-	}
-
-	if (!work) {
-		up_read(&current->mm->mmap_sem);
-	} else {
-		work->sem = &current->mm->mmap_sem;
-		irq_work_queue(&work->irq_work);
-	}
-}
-
-BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
-	   u64, flags)
-{
-	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
-	struct perf_callchain_entry *trace;
-	struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
-	u32 max_depth = map->value_size / stack_map_data_size(map);
-	/* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
-	u32 init_nr = sysctl_perf_event_max_stack - max_depth;
-	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
-	u32 hash, id, trace_nr, trace_len;
-	bool user = flags & BPF_F_USER_STACK;
-	bool kernel = !user;
-	u64 *ips;
-	bool hash_matches;
-
-	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
-			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
-		return -EINVAL;
-
-	trace = get_perf_callchain(regs, init_nr, kernel, user,
-				   sysctl_perf_event_max_stack, false, false);
-
-	if (unlikely(!trace))
-		/* couldn't fetch the stack trace */
-		return -EFAULT;
-
-	/* get_perf_callchain() guarantees that trace->nr >= init_nr
-	 * and trace-nr <= sysctl_perf_event_max_stack, so trace_nr <= max_depth
-	 */
-	trace_nr = trace->nr - init_nr;
-
-	if (trace_nr <= skip)
-		/* skipping more than usable stack trace */
-		return -EFAULT;
-
-	trace_nr -= skip;
-	trace_len = trace_nr * sizeof(u64);
-	ips = trace->ip + skip + init_nr;
-	hash = jhash2((u32 *)ips, trace_len / sizeof(u32), 0);
-	id = hash & (smap->n_buckets - 1);
-	bucket = READ_ONCE(smap->buckets[id]);
-
-	hash_matches = bucket && bucket->hash == hash;
-	/* fast cmp */
-	if (hash_matches && flags & BPF_F_FAST_STACK_CMP)
-		return id;
-
-	if (stack_map_use_build_id(map)) {
-		/* for build_id+offset, pop a bucket before slow cmp */
-		new_bucket = (struct stack_map_bucket *)
-			pcpu_freelist_pop(&smap->freelist);
-		if (unlikely(!new_bucket))
-			return -ENOMEM;
-		new_bucket->nr = trace_nr;
-		stack_map_get_build_id_offset(
-			(struct bpf_stack_build_id *)new_bucket->data,
-			ips, trace_nr, user);
-		trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
-		if (hash_matches && bucket->nr == trace_nr &&
-		    memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
-			pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
-			return id;
-		}
-		if (bucket && !(flags & BPF_F_REUSE_STACKID)) {
-			pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
-			return -EEXIST;
-		}
-	} else {
-		if (hash_matches && bucket->nr == trace_nr &&
-		    memcmp(bucket->data, ips, trace_len) == 0)
-			return id;
-		if (bucket && !(flags & BPF_F_REUSE_STACKID))
-			return -EEXIST;
-
-		new_bucket = (struct stack_map_bucket *)
-			pcpu_freelist_pop(&smap->freelist);
-		if (unlikely(!new_bucket))
-			return -ENOMEM;
-		memcpy(new_bucket->data, ips, trace_len);
-	}
-
-	new_bucket->hash = hash;
-	new_bucket->nr = trace_nr;
-
-	old_bucket = xchg(&smap->buckets[id], new_bucket);
-	if (old_bucket)
-		pcpu_freelist_push(&smap->freelist, &old_bucket->fnode);
-	return id;
-}
-
-const struct bpf_func_proto bpf_get_stackid_proto = {
-	.func		= bpf_get_stackid,
-	.gpl_only	= true,
-	.ret_type	= RET_INTEGER,
-	.arg1_type	= ARG_PTR_TO_CTX,
-	.arg2_type	= ARG_CONST_MAP_PTR,
-	.arg3_type	= ARG_ANYTHING,
-};
-
-BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
-	   u64, flags)
-{
-	u32 init_nr, trace_nr, copy_len, elem_size, num_elem;
-	bool user_build_id = flags & BPF_F_USER_BUILD_ID;
-	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
-	bool user = flags & BPF_F_USER_STACK;
-	struct perf_callchain_entry *trace;
-	bool kernel = !user;
-	int err = -EINVAL;
-	u64 *ips;
-
-	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
-			       BPF_F_USER_BUILD_ID)))
-		goto clear;
-	if (kernel && user_build_id)
-		goto clear;
-
-	elem_size = (user && user_build_id) ? sizeof(struct bpf_stack_build_id)
-					    : sizeof(u64);
-	if (unlikely(size % elem_size))
-		goto clear;
-
-	num_elem = size / elem_size;
-	if (sysctl_perf_event_max_stack < num_elem)
-		init_nr = 0;
-	else
-		init_nr = sysctl_perf_event_max_stack - num_elem;
-	trace = get_perf_callchain(regs, init_nr, kernel, user,
-				   sysctl_perf_event_max_stack, false, false);
-	if (unlikely(!trace))
-		goto err_fault;
-
-	trace_nr = trace->nr - init_nr;
-	if (trace_nr < skip)
-		goto err_fault;
-
-	trace_nr -= skip;
-	trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
-	copy_len = trace_nr * elem_size;
-	ips = trace->ip + skip + init_nr;
-	if (user && user_build_id)
-		stack_map_get_build_id_offset(buf, ips, trace_nr, user);
-	else
-		memcpy(buf, ips, copy_len);
-
-	if (size > copy_len)
-		memset(buf + copy_len, 0, size - copy_len);
-	return copy_len;
-
-err_fault:
-	err = -EFAULT;
-clear:
-	memset(buf, 0, size);
-	return err;
-}
-
-const struct bpf_func_proto bpf_get_stack_proto = {
-	.func		= bpf_get_stack,
-	.gpl_only	= true,
-	.ret_type	= RET_INTEGER,
-	.arg1_type	= ARG_PTR_TO_CTX,
-	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
-	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
-	.arg4_type	= ARG_ANYTHING,
-};
-
-/* Called from eBPF program */
-static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
-{
-	return NULL;
-}
-
-/* Called from syscall */
-int bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
-{
-	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
-	struct stack_map_bucket *bucket, *old_bucket;
-	u32 id = *(u32 *)key, trace_len;
-
-	if (unlikely(id >= smap->n_buckets))
-		return -ENOENT;
-
-	bucket = xchg(&smap->buckets[id], NULL);
-	if (!bucket)
-		return -ENOENT;
-
-	trace_len = bucket->nr * stack_map_data_size(map);
-	memcpy(value, bucket->data, trace_len);
-	memset(value + trace_len, 0, map->value_size - trace_len);
-
-	old_bucket = xchg(&smap->buckets[id], bucket);
-	if (old_bucket)
-		pcpu_freelist_push(&smap->freelist, &old_bucket->fnode);
-	return 0;
-}
-
-static int stack_map_get_next_key(struct bpf_map *map, void *key,
-				  void *next_key)
-{
-	struct bpf_stack_map *smap = container_of(map,
-						  struct bpf_stack_map, map);
-	u32 id;
-
-	WARN_ON_ONCE(!rcu_read_lock_held());
-
-	if (!key) {
-		id = 0;
-	} else {
-		id = *(u32 *)key;
-		if (id >= smap->n_buckets || !smap->buckets[id])
-			id = 0;
-		else
-			id++;
-	}
-
-	while (id < smap->n_buckets && !smap->buckets[id])
-		id++;
-
-	if (id >= smap->n_buckets)
-		return -ENOENT;
-
-	*(u32 *)next_key = id;
-	return 0;
-}
-
-static int stack_map_update_elem(struct bpf_map *map, void *key, void *value,
-				 u64 map_flags)
-{
-	return -EINVAL;
-}
-
-/* Called from syscall or from eBPF program */
-static int stack_map_delete_elem(struct bpf_map *map, void *key)
-{
-	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
-	struct stack_map_bucket *old_bucket;
-	u32 id = *(u32 *)key;
-
-	if (unlikely(id >= smap->n_buckets))
-		return -E2BIG;
-
-	old_bucket = xchg(&smap->buckets[id], NULL);
-	if (old_bucket) {
-		pcpu_freelist_push(&smap->freelist, &old_bucket->fnode);
-		return 0;
-	} else {
-		return -ENOENT;
-	}
-}
-
-/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
-static void stack_map_free(struct bpf_map *map)
-{
-	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
-
-	/* wait for bpf programs to complete before freeing stack map */
-	synchronize_rcu();
-
-	bpf_map_area_free(smap->elems);
-	pcpu_freelist_destroy(&smap->freelist);
-	bpf_map_area_free(smap);
-	put_callchain_buffers();
-}
-
-const struct bpf_map_ops stack_map_ops = {
-	.map_alloc = stack_map_alloc,
-	.map_free = stack_map_free,
-	.map_get_next_key = stack_map_get_next_key,
-	.map_lookup_elem = stack_map_lookup_elem,
-	.map_update_elem = stack_map_update_elem,
-	.map_delete_elem = stack_map_delete_elem,
-	.map_check_btf = map_check_no_btf,
-};
-
-static int __init stack_map_init(void)
-{
-	int cpu;
-	struct stack_map_irq_work *work;
-
-	for_each_possible_cpu(cpu) {
-		work = per_cpu_ptr(&up_read_work, cpu);
-		init_irq_work(&work->irq_work, do_up_read);
-	}
-	return 0;
-}
-subsys_initcall(stack_map_init);
diff --git a/kernel/bpf/stacktracemap.c b/kernel/bpf/stacktracemap.c
new file mode 100644
index 000000000000..bb41e293418d
--- /dev/null
+++ b/kernel/bpf/stacktracemap.c
@@ -0,0 +1,624 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/bpf.h>
+#include <linux/jhash.h>
+#include <linux/filter.h>
+#include <linux/stacktrace.h>
+#include <linux/perf_event.h>
+#include <linux/elf.h>
+#include <linux/pagemap.h>
+#include <linux/irq_work.h>
+#include "percpu_freelist.h"
+
+#define STACK_CREATE_FLAG_MASK					\
+	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY |	\
+	 BPF_F_STACK_BUILD_ID)
+
+struct stack_map_bucket {
+	struct pcpu_freelist_node fnode;
+	u32 hash;
+	u32 nr;
+	u64 data[];
+};
+
+struct bpf_stack_map {
+	struct bpf_map map;
+	void *elems;
+	struct pcpu_freelist freelist;
+	u32 n_buckets;
+	struct stack_map_bucket *buckets[];
+};
+
+/* irq_work to run up_read() for build_id lookup in nmi context */
+struct stack_map_irq_work {
+	struct irq_work irq_work;
+	struct rw_semaphore *sem;
+};
+
+static void do_up_read(struct irq_work *entry)
+{
+	struct stack_map_irq_work *work;
+
+	work = container_of(entry, struct stack_map_irq_work, irq_work);
+	up_read(work->sem);
+	work->sem = NULL;
+}
+
+static DEFINE_PER_CPU(struct stack_map_irq_work, up_read_work);
+
+static inline bool stack_map_use_build_id(struct bpf_map *map)
+{
+	return (map->map_flags & BPF_F_STACK_BUILD_ID);
+}
+
+static inline int stack_map_data_size(struct bpf_map *map)
+{
+	return stack_map_use_build_id(map) ?
+		sizeof(struct bpf_stack_build_id) : sizeof(u64);
+}
+
+static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
+{
+	u32 elem_size = sizeof(struct stack_map_bucket) + smap->map.value_size;
+	int err;
+
+	smap->elems = bpf_map_area_alloc(elem_size * smap->map.max_entries,
+					 smap->map.numa_node);
+	if (!smap->elems)
+		return -ENOMEM;
+
+	err = pcpu_freelist_init(&smap->freelist);
+	if (err)
+		goto free_elems;
+
+	pcpu_freelist_populate(&smap->freelist, smap->elems, elem_size,
+			       smap->map.max_entries);
+	return 0;
+
+free_elems:
+	bpf_map_area_free(smap->elems);
+	return err;
+}
+
+/* Called from syscall */
+static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
+{
+	u32 value_size = attr->value_size;
+	struct bpf_stack_map *smap;
+	u64 cost, n_buckets;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if (attr->map_flags & ~STACK_CREATE_FLAG_MASK)
+		return ERR_PTR(-EINVAL);
+
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->key_size != 4 ||
+	    value_size < 8 || value_size % 8)
+		return ERR_PTR(-EINVAL);
+
+	BUILD_BUG_ON(sizeof(struct bpf_stack_build_id) % sizeof(u64));
+	if (attr->map_flags & BPF_F_STACK_BUILD_ID) {
+		if (value_size % sizeof(struct bpf_stack_build_id) ||
+		    value_size / sizeof(struct bpf_stack_build_id)
+		    > sysctl_perf_event_max_stack)
+			return ERR_PTR(-EINVAL);
+	} else if (value_size / 8 > sysctl_perf_event_max_stack)
+		return ERR_PTR(-EINVAL);
+
+	/* hash table size must be power of 2 */
+	n_buckets = roundup_pow_of_two(attr->max_entries);
+
+	cost = n_buckets * sizeof(struct stack_map_bucket *) + sizeof(*smap);
+	if (cost >= U32_MAX - PAGE_SIZE)
+		return ERR_PTR(-E2BIG);
+
+	smap = bpf_map_area_alloc(cost, bpf_map_attr_numa_node(attr));
+	if (!smap)
+		return ERR_PTR(-ENOMEM);
+
+	err = -E2BIG;
+	cost += n_buckets * (value_size + sizeof(struct stack_map_bucket));
+	if (cost >= U32_MAX - PAGE_SIZE)
+		goto free_smap;
+
+	bpf_map_init_from_attr(&smap->map, attr);
+	smap->map.value_size = value_size;
+	smap->n_buckets = n_buckets;
+	smap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+	err = bpf_map_precharge_memlock(smap->map.pages);
+	if (err)
+		goto free_smap;
+
+	err = get_callchain_buffers(sysctl_perf_event_max_stack);
+	if (err)
+		goto free_smap;
+
+	err = prealloc_elems_and_freelist(smap);
+	if (err)
+		goto put_buffers;
+
+	return &smap->map;
+
+put_buffers:
+	put_callchain_buffers();
+free_smap:
+	bpf_map_area_free(smap);
+	return ERR_PTR(err);
+}
+
+#define BPF_BUILD_ID 3
+/*
+ * Parse build id from the note segment. This logic can be shared between
+ * 32-bit and 64-bit system, because Elf32_Nhdr and Elf64_Nhdr are
+ * identical.
+ */
+static inline int stack_map_parse_build_id(void *page_addr,
+					   unsigned char *build_id,
+					   void *note_start,
+					   Elf32_Word note_size)
+{
+	Elf32_Word note_offs = 0, new_offs;
+
+	/* check for overflow */
+	if (note_start < page_addr || note_start + note_size < note_start)
+		return -EINVAL;
+
+	/* only supports note that fits in the first page */
+	if (note_start + note_size > page_addr + PAGE_SIZE)
+		return -EINVAL;
+
+	while (note_offs + sizeof(Elf32_Nhdr) < note_size) {
+		Elf32_Nhdr *nhdr = (Elf32_Nhdr *)(note_start + note_offs);
+
+		if (nhdr->n_type == BPF_BUILD_ID &&
+		    nhdr->n_namesz == sizeof("GNU") &&
+		    nhdr->n_descsz == BPF_BUILD_ID_SIZE) {
+			memcpy(build_id,
+			       note_start + note_offs +
+			       ALIGN(sizeof("GNU"), 4) + sizeof(Elf32_Nhdr),
+			       BPF_BUILD_ID_SIZE);
+			return 0;
+		}
+		new_offs = note_offs + sizeof(Elf32_Nhdr) +
+			ALIGN(nhdr->n_namesz, 4) + ALIGN(nhdr->n_descsz, 4);
+		if (new_offs <= note_offs)  /* overflow */
+			break;
+		note_offs = new_offs;
+	}
+	return -EINVAL;
+}
+
+/* Parse build ID from 32-bit ELF */
+static int stack_map_get_build_id_32(void *page_addr,
+				     unsigned char *build_id)
+{
+	Elf32_Ehdr *ehdr = (Elf32_Ehdr *)page_addr;
+	Elf32_Phdr *phdr;
+	int i;
+
+	/* only supports phdr that fits in one page */
+	if (ehdr->e_phnum >
+	    (PAGE_SIZE - sizeof(Elf32_Ehdr)) / sizeof(Elf32_Phdr))
+		return -EINVAL;
+
+	phdr = (Elf32_Phdr *)(page_addr + sizeof(Elf32_Ehdr));
+
+	for (i = 0; i < ehdr->e_phnum; ++i)
+		if (phdr[i].p_type == PT_NOTE)
+			return stack_map_parse_build_id(page_addr, build_id,
+					page_addr + phdr[i].p_offset,
+					phdr[i].p_filesz);
+	return -EINVAL;
+}
+
+/* Parse build ID from 64-bit ELF */
+static int stack_map_get_build_id_64(void *page_addr,
+				     unsigned char *build_id)
+{
+	Elf64_Ehdr *ehdr = (Elf64_Ehdr *)page_addr;
+	Elf64_Phdr *phdr;
+	int i;
+
+	/* only supports phdr that fits in one page */
+	if (ehdr->e_phnum >
+	    (PAGE_SIZE - sizeof(Elf64_Ehdr)) / sizeof(Elf64_Phdr))
+		return -EINVAL;
+
+	phdr = (Elf64_Phdr *)(page_addr + sizeof(Elf64_Ehdr));
+
+	for (i = 0; i < ehdr->e_phnum; ++i)
+		if (phdr[i].p_type == PT_NOTE)
+			return stack_map_parse_build_id(page_addr, build_id,
+					page_addr + phdr[i].p_offset,
+					phdr[i].p_filesz);
+	return -EINVAL;
+}
+
+/* Parse build ID of ELF file mapped to vma */
+static int stack_map_get_build_id(struct vm_area_struct *vma,
+				  unsigned char *build_id)
+{
+	Elf32_Ehdr *ehdr;
+	struct page *page;
+	void *page_addr;
+	int ret;
+
+	/* only works for page backed storage  */
+	if (!vma->vm_file)
+		return -EINVAL;
+
+	page = find_get_page(vma->vm_file->f_mapping, 0);
+	if (!page)
+		return -EFAULT;	/* page not mapped */
+
+	ret = -EINVAL;
+	page_addr = page_address(page);
+	ehdr = (Elf32_Ehdr *)page_addr;
+
+	/* compare magic x7f "ELF" */
+	if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0)
+		goto out;
+
+	/* only support executable file and shared object file */
+	if (ehdr->e_type != ET_EXEC && ehdr->e_type != ET_DYN)
+		goto out;
+
+	if (ehdr->e_ident[EI_CLASS] == ELFCLASS32)
+		ret = stack_map_get_build_id_32(page_addr, build_id);
+	else if (ehdr->e_ident[EI_CLASS] == ELFCLASS64)
+		ret = stack_map_get_build_id_64(page_addr, build_id);
+out:
+	put_page(page);
+	return ret;
+}
+
+static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
+					  u64 *ips, u32 trace_nr, bool user)
+{
+	int i;
+	struct vm_area_struct *vma;
+	bool irq_work_busy = false;
+	struct stack_map_irq_work *work = NULL;
+
+	if (in_nmi()) {
+		work = this_cpu_ptr(&up_read_work);
+		if (work->irq_work.flags & IRQ_WORK_BUSY)
+			/* cannot queue more up_read, fallback */
+			irq_work_busy = true;
+	}
+
+	/*
+	 * We cannot do up_read() in nmi context. To do build_id lookup
+	 * in nmi context, we need to run up_read() in irq_work. We use
+	 * a percpu variable to do the irq_work. If the irq_work is
+	 * already used by another lookup, we fall back to report ips.
+	 *
+	 * Same fallback is used for kernel stack (!user) on a stackmap
+	 * with build_id.
+	 */
+	if (!user || !current || !current->mm || irq_work_busy ||
+	    down_read_trylock(&current->mm->mmap_sem) == 0) {
+		/* cannot access current->mm, fall back to ips */
+		for (i = 0; i < trace_nr; i++) {
+			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
+			id_offs[i].ip = ips[i];
+		}
+		return;
+	}
+
+	for (i = 0; i < trace_nr; i++) {
+		vma = find_vma(current->mm, ips[i]);
+		if (!vma || stack_map_get_build_id(vma, id_offs[i].build_id)) {
+			/* per entry fall back to ips */
+			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
+			id_offs[i].ip = ips[i];
+			continue;
+		}
+		id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i]
+			- vma->vm_start;
+		id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
+	}
+
+	if (!work) {
+		up_read(&current->mm->mmap_sem);
+	} else {
+		work->sem = &current->mm->mmap_sem;
+		irq_work_queue(&work->irq_work);
+	}
+}
+
+BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
+	   u64, flags)
+{
+	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
+	struct perf_callchain_entry *trace;
+	struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
+	u32 max_depth = map->value_size / stack_map_data_size(map);
+	/* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
+	u32 init_nr = sysctl_perf_event_max_stack - max_depth;
+	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+	u32 hash, id, trace_nr, trace_len;
+	bool user = flags & BPF_F_USER_STACK;
+	bool kernel = !user;
+	u64 *ips;
+	bool hash_matches;
+
+	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
+			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
+		return -EINVAL;
+
+	trace = get_perf_callchain(regs, init_nr, kernel, user,
+				   sysctl_perf_event_max_stack, false, false);
+
+	if (unlikely(!trace))
+		/* couldn't fetch the stack trace */
+		return -EFAULT;
+
+	/* get_perf_callchain() guarantees that trace->nr >= init_nr
+	 * and trace-nr <= sysctl_perf_event_max_stack, so trace_nr <= max_depth
+	 */
+	trace_nr = trace->nr - init_nr;
+
+	if (trace_nr <= skip)
+		/* skipping more than usable stack trace */
+		return -EFAULT;
+
+	trace_nr -= skip;
+	trace_len = trace_nr * sizeof(u64);
+	ips = trace->ip + skip + init_nr;
+	hash = jhash2((u32 *)ips, trace_len / sizeof(u32), 0);
+	id = hash & (smap->n_buckets - 1);
+	bucket = READ_ONCE(smap->buckets[id]);
+
+	hash_matches = bucket && bucket->hash == hash;
+	/* fast cmp */
+	if (hash_matches && flags & BPF_F_FAST_STACK_CMP)
+		return id;
+
+	if (stack_map_use_build_id(map)) {
+		/* for build_id+offset, pop a bucket before slow cmp */
+		new_bucket = (struct stack_map_bucket *)
+			pcpu_freelist_pop(&smap->freelist);
+		if (unlikely(!new_bucket))
+			return -ENOMEM;
+		new_bucket->nr = trace_nr;
+		stack_map_get_build_id_offset(
+			(struct bpf_stack_build_id *)new_bucket->data,
+			ips, trace_nr, user);
+		trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
+		if (hash_matches && bucket->nr == trace_nr &&
+		    memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
+			pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
+			return id;
+		}
+		if (bucket && !(flags & BPF_F_REUSE_STACKID)) {
+			pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
+			return -EEXIST;
+		}
+	} else {
+		if (hash_matches && bucket->nr == trace_nr &&
+		    memcmp(bucket->data, ips, trace_len) == 0)
+			return id;
+		if (bucket && !(flags & BPF_F_REUSE_STACKID))
+			return -EEXIST;
+
+		new_bucket = (struct stack_map_bucket *)
+			pcpu_freelist_pop(&smap->freelist);
+		if (unlikely(!new_bucket))
+			return -ENOMEM;
+		memcpy(new_bucket->data, ips, trace_len);
+	}
+
+	new_bucket->hash = hash;
+	new_bucket->nr = trace_nr;
+
+	old_bucket = xchg(&smap->buckets[id], new_bucket);
+	if (old_bucket)
+		pcpu_freelist_push(&smap->freelist, &old_bucket->fnode);
+	return id;
+}
+
+const struct bpf_func_proto bpf_get_stackid_proto = {
+	.func		= bpf_get_stackid,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
+	   u64, flags)
+{
+	u32 init_nr, trace_nr, copy_len, elem_size, num_elem;
+	bool user_build_id = flags & BPF_F_USER_BUILD_ID;
+	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+	bool user = flags & BPF_F_USER_STACK;
+	struct perf_callchain_entry *trace;
+	bool kernel = !user;
+	int err = -EINVAL;
+	u64 *ips;
+
+	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
+			       BPF_F_USER_BUILD_ID)))
+		goto clear;
+	if (kernel && user_build_id)
+		goto clear;
+
+	elem_size = (user && user_build_id) ? sizeof(struct bpf_stack_build_id)
+					    : sizeof(u64);
+	if (unlikely(size % elem_size))
+		goto clear;
+
+	num_elem = size / elem_size;
+	if (sysctl_perf_event_max_stack < num_elem)
+		init_nr = 0;
+	else
+		init_nr = sysctl_perf_event_max_stack - num_elem;
+	trace = get_perf_callchain(regs, init_nr, kernel, user,
+				   sysctl_perf_event_max_stack, false, false);
+	if (unlikely(!trace))
+		goto err_fault;
+
+	trace_nr = trace->nr - init_nr;
+	if (trace_nr < skip)
+		goto err_fault;
+
+	trace_nr -= skip;
+	trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
+	copy_len = trace_nr * elem_size;
+	ips = trace->ip + skip + init_nr;
+	if (user && user_build_id)
+		stack_map_get_build_id_offset(buf, ips, trace_nr, user);
+	else
+		memcpy(buf, ips, copy_len);
+
+	if (size > copy_len)
+		memset(buf + copy_len, 0, size - copy_len);
+	return copy_len;
+
+err_fault:
+	err = -EFAULT;
+clear:
+	memset(buf, 0, size);
+	return err;
+}
+
+const struct bpf_func_proto bpf_get_stack_proto = {
+	.func		= bpf_get_stack,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+};
+
+/* Called from eBPF program */
+static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return NULL;
+}
+
+/* Called from syscall */
+int bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
+{
+	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
+	struct stack_map_bucket *bucket, *old_bucket;
+	u32 id = *(u32 *)key, trace_len;
+
+	if (unlikely(id >= smap->n_buckets))
+		return -ENOENT;
+
+	bucket = xchg(&smap->buckets[id], NULL);
+	if (!bucket)
+		return -ENOENT;
+
+	trace_len = bucket->nr * stack_map_data_size(map);
+	memcpy(value, bucket->data, trace_len);
+	memset(value + trace_len, 0, map->value_size - trace_len);
+
+	old_bucket = xchg(&smap->buckets[id], bucket);
+	if (old_bucket)
+		pcpu_freelist_push(&smap->freelist, &old_bucket->fnode);
+	return 0;
+}
+
+static int stack_map_get_next_key(struct bpf_map *map, void *key,
+				  void *next_key)
+{
+	struct bpf_stack_map *smap = container_of(map,
+						  struct bpf_stack_map, map);
+	u32 id;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (!key) {
+		id = 0;
+	} else {
+		id = *(u32 *)key;
+		if (id >= smap->n_buckets || !smap->buckets[id])
+			id = 0;
+		else
+			id++;
+	}
+
+	while (id < smap->n_buckets && !smap->buckets[id])
+		id++;
+
+	if (id >= smap->n_buckets)
+		return -ENOENT;
+
+	*(u32 *)next_key = id;
+	return 0;
+}
+
+static int stack_map_update_elem(struct bpf_map *map, void *key, void *value,
+				 u64 map_flags)
+{
+	return -EINVAL;
+}
+
+/* Called from syscall or from eBPF program */
+static int stack_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
+	struct stack_map_bucket *old_bucket;
+	u32 id = *(u32 *)key;
+
+	if (unlikely(id >= smap->n_buckets))
+		return -E2BIG;
+
+	old_bucket = xchg(&smap->buckets[id], NULL);
+	if (old_bucket) {
+		pcpu_freelist_push(&smap->freelist, &old_bucket->fnode);
+		return 0;
+	} else {
+		return -ENOENT;
+	}
+}
+
+/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
+static void stack_map_free(struct bpf_map *map)
+{
+	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
+
+	/* wait for bpf programs to complete before freeing stack map */
+	synchronize_rcu();
+
+	bpf_map_area_free(smap->elems);
+	pcpu_freelist_destroy(&smap->freelist);
+	bpf_map_area_free(smap);
+	put_callchain_buffers();
+}
+
+const struct bpf_map_ops stack_trace_map_ops = {
+	.map_alloc = stack_map_alloc,
+	.map_free = stack_map_free,
+	.map_get_next_key = stack_map_get_next_key,
+	.map_lookup_elem = stack_map_lookup_elem,
+	.map_update_elem = stack_map_update_elem,
+	.map_delete_elem = stack_map_delete_elem,
+	.map_check_btf = map_check_no_btf,
+};
+
+static int __init stack_map_init(void)
+{
+	int cpu;
+	struct stack_map_irq_work *work;
+
+	for_each_possible_cpu(cpu) {
+		work = per_cpu_ptr(&up_read_work, cpu);
+		init_irq_work(&work->irq_work, do_up_read);
+	}
+	return 0;
+}
+subsys_initcall(stack_map_init);

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH bpf-next v3 2/7] bpf/syscall: allow key to be null in map functions
  2018-09-18  4:52 [RFC PATCH bpf-next v3 0/7] Implement bpf queue/stack maps Mauricio Vasquez B
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 1/7] bpf: rename stack trace map Mauricio Vasquez B
@ 2018-09-18  4:52 ` Mauricio Vasquez B
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 3/7] bpf: add lookup_and_delete map operation Mauricio Vasquez B
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Mauricio Vasquez B @ 2018-09-18  4:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev; +Cc: Yonghong Song

This commit adds the required logic to allow key being NULL
in case the key_size of the map is 0.

A new __bpf_copy_key function helper only copies the key from
userpsace when key_size != 0, otherwise it enforces that key must be
null.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
 kernel/bpf/syscall.c |   19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3c9636f03bb2..f2d4e4f280dc 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -651,6 +651,17 @@ int __weak bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
 	return -ENOTSUPP;
 }
 
+static void *__bpf_copy_key(void __user *ukey, u64 key_size)
+{
+	if (key_size)
+		return memdup_user(ukey, key_size);
+
+	if (ukey)
+		return ERR_PTR(-EINVAL);
+
+	return NULL;
+}
+
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_MAP_LOOKUP_ELEM_LAST_FIELD value
 
@@ -678,7 +689,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	key = memdup_user(ukey, map->key_size);
+	key = __bpf_copy_key(ukey, map->key_size);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
@@ -766,7 +777,7 @@ static int map_update_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	key = memdup_user(ukey, map->key_size);
+	key = __bpf_copy_key(ukey, map->key_size);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
@@ -864,7 +875,7 @@ static int map_delete_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	key = memdup_user(ukey, map->key_size);
+	key = __bpf_copy_key(ukey, map->key_size);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
@@ -916,7 +927,7 @@ static int map_get_next_key(union bpf_attr *attr)
 	}
 
 	if (ukey) {
-		key = memdup_user(ukey, map->key_size);
+		key = __bpf_copy_key(ukey, map->key_size);
 		if (IS_ERR(key)) {
 			err = PTR_ERR(key);
 			goto err_put;

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH bpf-next v3 3/7] bpf: add lookup_and_delete map operation
  2018-09-18  4:52 [RFC PATCH bpf-next v3 0/7] Implement bpf queue/stack maps Mauricio Vasquez B
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 1/7] bpf: rename stack trace map Mauricio Vasquez B
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 2/7] bpf/syscall: allow key to be null in map functions Mauricio Vasquez B
@ 2018-09-18  4:52 ` Mauricio Vasquez B
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 4/7] bpf: add bpf queue and stack maps Mauricio Vasquez B
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Mauricio Vasquez B @ 2018-09-18  4:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev; +Cc: Yonghong Song

The following patch implements a bpf queue/stack maps that
provides the peek/pop/push functions.  There is not a direct
relationship between those functions and the current operations
supported by a map, hence a new lookup_and_delete map operation
is added, this operation would be used by the pop helper.

A pop operation is not added because it will too specific to
stack/queue maps, instead this new operation could be useful
for other maps as well.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
 include/linux/bpf.h      |    1 +
 include/uapi/linux/bpf.h |    1 +
 kernel/bpf/syscall.c     |   82 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 84 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 523481a3471b..c63a44381d3f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -39,6 +39,7 @@ struct bpf_map_ops {
 	void *(*map_lookup_elem)(struct bpf_map *map, void *key);
 	int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 flags);
 	int (*map_delete_elem)(struct bpf_map *map, void *key);
+	void *(*map_lookup_and_delete_elem)(struct bpf_map *map, void *key);
 
 	/* funcs called by prog_array and perf_event_array map */
 	void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4eba27..4cda584c6640 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -103,6 +103,7 @@ enum bpf_cmd {
 	BPF_BTF_LOAD,
 	BPF_BTF_GET_FD_BY_ID,
 	BPF_TASK_FD_QUERY,
+	BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 };
 
 enum bpf_map_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f2d4e4f280dc..7d429123a298 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -968,6 +968,85 @@ static int map_get_next_key(union bpf_attr *attr)
 	return err;
 }
 
+#define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
+
+static int map_lookup_and_delete_elem(union bpf_attr *attr)
+{
+	void __user *ukey = u64_to_user_ptr(attr->key);
+	void __user *uvalue = u64_to_user_ptr(attr->value);
+	int ufd = attr->map_fd;
+	struct bpf_map *map;
+	void *key, *value, *ptr;
+	u32 value_size;
+	struct fd f;
+	int err;
+
+	if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM))
+		return -EINVAL;
+
+	f = fdget(ufd);
+	map = __bpf_map_get(f);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+
+	if (!(f.file->f_mode & FMODE_CAN_WRITE)) {
+		err = -EPERM;
+		goto err_put;
+	}
+
+	if (!map->ops->map_lookup_and_delete_elem) {
+		err = -ENOTSUPP;
+		goto err_put;
+	}
+
+	key = __bpf_copy_key(ukey, map->key_size);
+	if (IS_ERR(key)) {
+		err = PTR_ERR(key);
+		goto err_put;
+	}
+
+	value_size = map->value_size;
+
+	err = -ENOMEM;
+	value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
+	if (!value)
+		goto free_key;
+
+	err = -EFAULT;
+	if (copy_from_user(value, uvalue, value_size) != 0)
+		goto free_value;
+
+	/* must increment bpf_prog_active to avoid kprobe+bpf triggering from
+	 * inside bpf map update or delete otherwise deadlocks are possible
+	 */
+	preempt_disable();
+	__this_cpu_inc(bpf_prog_active);
+	rcu_read_lock();
+	ptr = map->ops->map_lookup_and_delete_elem(map, key);
+	if (ptr)
+		memcpy(value, ptr, value_size);
+	rcu_read_unlock();
+		err = ptr ? 0 : -ENOENT;
+	__this_cpu_dec(bpf_prog_active);
+	preempt_enable();
+
+	if (err)
+		goto free_value;
+
+	if (copy_to_user(uvalue, value, value_size) != 0)
+		goto free_value;
+
+	err = 0;
+
+free_value:
+	kfree(value);
+free_key:
+	kfree(key);
+err_put:
+	fdput(f);
+	return err;
+}
+
 static const struct bpf_prog_ops * const bpf_prog_types[] = {
 #define BPF_PROG_TYPE(_id, _name) \
 	[_id] = & _name ## _prog_ops,
@@ -2428,6 +2507,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_TASK_FD_QUERY:
 		err = bpf_task_fd_query(&attr, uattr);
 		break;
+	case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
+		err = map_lookup_and_delete_elem(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH bpf-next v3 4/7] bpf: add bpf queue and stack maps
  2018-09-18  4:52 [RFC PATCH bpf-next v3 0/7] Implement bpf queue/stack maps Mauricio Vasquez B
                   ` (2 preceding siblings ...)
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 3/7] bpf: add lookup_and_delete map operation Mauricio Vasquez B
@ 2018-09-18  4:52 ` Mauricio Vasquez B
  2018-09-18 23:27   ` Alexei Starovoitov
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 5/7] bpf: restrict use of peek/push/pop Mauricio Vasquez B
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Mauricio Vasquez B @ 2018-09-18  4:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev; +Cc: Yonghong Song

Implement two new kind of maps that support the peek, push and pop
operations.

A use case for this is to keep track of a pool of elements, like
network ports in a SNAT.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
 include/linux/bpf.h           |    3 
 include/linux/bpf_types.h     |    2 
 include/uapi/linux/bpf.h      |   30 ++++
 kernel/bpf/Makefile           |    2 
 kernel/bpf/core.c             |    3 
 kernel/bpf/helpers.c          |   98 ++++++++++++++
 kernel/bpf/queue_stack_maps.c |  291 +++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c         |    5 +
 net/core/filter.c             |    6 +
 9 files changed, 437 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/queue_stack_maps.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c63a44381d3f..8e924b5c5a0e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -807,6 +807,9 @@ static inline int bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
 extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern const struct bpf_func_proto bpf_map_update_elem_proto;
 extern const struct bpf_func_proto bpf_map_delete_elem_proto;
+extern const struct bpf_func_proto bpf_map_push_elem_proto;
+extern const struct bpf_func_proto bpf_map_pop_elem_proto;
+extern const struct bpf_func_proto bpf_map_peek_elem_proto;
 
 extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
 extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 33f7f574b983..903a446f14c3 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -67,3 +67,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
 #endif
 #endif
+BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4cda584c6640..c899386dcb2b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -128,6 +128,8 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_SOCKHASH,
 	BPF_MAP_TYPE_CGROUP_STORAGE,
 	BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
+	BPF_MAP_TYPE_QUEUE,
+	BPF_MAP_TYPE_STACK,
 };
 
 enum bpf_prog_type {
@@ -460,6 +462,29 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
+ * int bpf_map_push_elem(struct bpf_map *map, const void *value, u32 len,
+ * 			 u64 flags)
+ * 	Description
+ * 		Push an element *value* in *map*. *flags* is one of:
+ *
+ * 		**BPF_EXIST**
+ * 		If the queue/stack is full, the oldest element is removed to
+ * 		make room for this.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_map_pop_elem(struct bpf_map *map, void *value, u32 len)
+ * 	Description
+ * 		Pop an element from *map*.
+ * Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_map_peek_elem(struct bpf_map *map, void *value, u32 len)
+ * 	Description
+ * 		Get an element from *map* without removing it.
+ * Return
+ * 		0 on success, or a negative error in case of failure.
+ *
  * int bpf_probe_read(void *dst, u32 size, const void *src)
  * 	Description
  * 		For tracing programs, safely attempt to read *size* bytes from
@@ -2227,7 +2252,10 @@ union bpf_attr {
 	FN(get_current_cgroup_id),	\
 	FN(get_local_storage),		\
 	FN(sk_select_reuseport),	\
-	FN(skb_ancestor_cgroup_id),
+	FN(skb_ancestor_cgroup_id),	\
+	FN(map_push_elem),		\
+	FN(map_pop_elem),		\
+	FN(map_peek_elem),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e656bce87c8f..2d77bc5b2aca 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -3,7 +3,7 @@ obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o
 ifeq ($(CONFIG_NET),y)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3f5bf1af0826..8d2db076d123 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1783,6 +1783,9 @@ BPF_CALL_0(bpf_user_rnd_u32)
 const struct bpf_func_proto bpf_map_lookup_elem_proto __weak;
 const struct bpf_func_proto bpf_map_update_elem_proto __weak;
 const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
+const struct bpf_func_proto bpf_map_push_elem_proto __weak;
+const struct bpf_func_proto bpf_map_pop_elem_proto __weak;
+const struct bpf_func_proto bpf_map_peek_elem_proto __weak;
 
 const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
 const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 1991466b8327..5f364e6acaf1 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -76,6 +76,104 @@ const struct bpf_func_proto bpf_map_delete_elem_proto = {
 	.arg2_type	= ARG_PTR_TO_MAP_KEY,
 };
 
+BPF_CALL_4(bpf_map_push_elem, struct bpf_map *, map, void *, value, u32 size,
+	   u64, flags)
+{
+	if (map->value_size != size)
+		return -EINVAL;
+
+	return map->ops->map_update_elem(map, NULL, value, flags);
+}
+
+const struct bpf_func_proto bpf_map_push_elem_proto = {
+	.func		= bpf_map_push_elem,
+	.gpl_only	= false,
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *, value, u32, size)
+{
+	void *ptr;
+
+	if (map->value_size != size)
+		return -EINVAL;
+
+	ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
+	if (!ptr)
+		return -ENOENT;
+
+	switch (size) {
+	case 1:
+		*(u8 *) value = *(u8 *) ptr;
+		break;
+	case 2:
+		*(u16 *) value = *(u16 *) ptr;
+		break;
+	case 4:
+		*(u32 *) value = *(u32 *) ptr;
+		break;
+	case 8:
+		*(u64 *) value = *(u64 *) ptr;
+		break;
+	}
+
+	return 0;
+}
+
+const struct bpf_func_proto bpf_map_pop_elem_proto = {
+	.func		= bpf_map_pop_elem,
+	.gpl_only	= false,
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+};
+
+BPF_CALL_3(bpf_map_peek_elem, struct bpf_map *, map, void *, value, u32, size)
+{
+	void *ptr;
+
+	if (map->value_size != size)
+		return -EINVAL;
+
+	ptr = map->ops->map_lookup_elem(map, NULL);
+	if (!ptr)
+		return -ENOENT;
+
+	switch (size) {
+	case 1:
+		*(u8 *) value = *(u8 *) ptr;
+		break;
+	case 2:
+		*(u16 *) value = *(u16 *) ptr;
+		break;
+	case 4:
+		*(u32 *) value = *(u32 *) ptr;
+		break;
+	case 8:
+		*(u64 *) value = *(u64 *) ptr;
+		break;
+	}
+
+	return 0;
+}
+
+const struct bpf_func_proto bpf_map_peek_elem_proto = {
+	.func		= bpf_map_pop_elem,
+	.gpl_only	= false,
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+};
+
 const struct bpf_func_proto bpf_get_prandom_u32_proto = {
 	.func		= bpf_user_rnd_u32,
 	.gpl_only	= false,
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
new file mode 100644
index 000000000000..10c081f3f02b
--- /dev/null
+++ b/kernel/bpf/queue_stack_maps.c
@@ -0,0 +1,291 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * queue_stack_maps.c: BPF queue and stack maps
+ *
+ * Copyright (c) 2018 Politecnico di Torino
+ */
+#include <linux/bpf.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include "percpu_freelist.h"
+
+#define QUEUE_STACK_CREATE_FLAG_MASK \
+	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
+
+
+struct bpf_queue_stack {
+	struct bpf_map map;
+	raw_spinlock_t lock;
+	u32 head, tail;
+	u32 index_mask;
+	u32 count;
+
+	char elements[0] __aligned(8);
+};
+
+static struct bpf_queue_stack *bpf_queue_stack(struct bpf_map *map)
+{
+	return container_of(map, struct bpf_queue_stack, map);
+}
+
+static bool queue_stack_map_is_empty(struct bpf_queue_stack *qs)
+{
+	return qs->count == 0;
+}
+
+static bool queue_stack_map_is_full(struct bpf_queue_stack *qs)
+{
+	return qs->count == qs->map.max_entries;
+}
+
+/* Called from syscall */
+static int queue_stack_map_alloc_check(union bpf_attr *attr)
+{
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->key_size != 0 ||
+	    (attr->value_size != 1 && attr->value_size != 2 &&
+	    attr->value_size != 4 && attr->value_size != 8) ||
+	    attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK)
+		return -EINVAL;
+
+	if (attr->value_size > KMALLOC_MAX_SIZE)
+		/* if value_size is bigger, the user space won't be able to
+		 * access the elements.
+		 */
+		return -E2BIG;
+
+	return 0;
+}
+
+static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
+{
+	int ret, numa_node = bpf_map_attr_numa_node(attr);
+	u32 max_entries, value_size, index_mask;
+	u64 queue_size, cost, mask64;
+	struct bpf_queue_stack *qs;
+
+	max_entries = attr->max_entries;
+	value_size = attr->value_size;
+
+	/* From arraymap.c:
+	 * On 32 bit archs roundup_pow_of_two() with max_entries that has
+	 * upper most bit set in u32 space is undefined behavior due to
+	 * resulting 1U << 32, so do it manually here in u64 space.
+	 */
+	mask64 = fls_long(max_entries - 1);
+	mask64 = 1ULL << mask64;
+	mask64 -= 1;
+
+	index_mask = mask64;
+
+	/* Round up queue size to nearest power of 2 */
+	max_entries = index_mask + 1;
+	/* Check for overflows. */
+	if (max_entries < attr->max_entries)
+		return ERR_PTR(-E2BIG);
+
+	queue_size = sizeof(*qs) + (u64) value_size * max_entries;
+
+	cost = queue_size;
+	if (cost >= U32_MAX - PAGE_SIZE)
+		return ERR_PTR(-E2BIG);
+
+	cost = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+	ret = bpf_map_precharge_memlock(cost);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	qs = bpf_map_area_alloc(queue_size, numa_node);
+	if (!qs)
+		return ERR_PTR(-ENOMEM);
+
+	memset(qs, 0, sizeof(*qs));
+
+	bpf_map_init_from_attr(&qs->map, attr);
+
+	qs->map.pages = cost;
+	qs->index_mask = index_mask;
+
+	raw_spin_lock_init(&qs->lock);
+
+	return &qs->map;
+}
+
+/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
+static void queue_stack_map_free(struct bpf_map *map)
+{
+	struct bpf_queue_stack *qs = bpf_queue_stack(map);
+
+	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
+	 * so the programs (can be more than one that used this map) were
+	 * disconnected from events. Wait for outstanding critical sections in
+	 * these programs to complete
+	 */
+	synchronize_rcu();
+
+	bpf_map_area_free(qs);
+}
+
+static void *__queue_map_lookup(struct bpf_map *map, bool delete)
+{
+	struct bpf_queue_stack *qs = bpf_queue_stack(map);
+	unsigned long flags;
+	void *ptr = NULL;
+
+	raw_spin_lock_irqsave(&qs->lock, flags);
+
+	if (queue_stack_map_is_empty(qs))
+		goto out;
+
+	ptr = &qs->elements[qs->tail * qs->map.value_size];
+
+	if (delete) {
+		qs->tail = (qs->tail + 1) & qs->index_mask;
+		qs->count--;
+	}
+
+out:
+	raw_spin_unlock_irqrestore(&qs->lock, flags);
+	return ptr;
+}
+
+
+static void *__stack_map_lookup(struct bpf_map *map, bool delete)
+{
+	struct bpf_queue_stack *qs = bpf_queue_stack(map);
+	unsigned long flags;
+	void *ptr = NULL;
+	u32 index;
+
+	raw_spin_lock_irqsave(&qs->lock, flags);
+
+	if (queue_stack_map_is_empty(qs))
+		goto out;
+
+	index = (qs->head - 1) & qs->index_mask;
+	ptr = &qs->elements[index * qs->map.value_size];
+
+	if (delete) {
+		qs->head = (qs->head - 1) & qs->index_mask;
+		qs->count--;
+	}
+
+out:
+	raw_spin_unlock_irqrestore(&qs->lock, flags);
+	return ptr;
+}
+
+/* Called from syscall or from eBPF program */
+static void *queue_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return __queue_map_lookup(map, false);
+}
+
+/* Called from syscall or from eBPF program */
+static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return __stack_map_lookup(map, false);
+}
+
+/* Called from syscall or from eBPF program */
+static void *queue_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
+{
+	return __queue_map_lookup(map, true);
+}
+
+/* Called from syscall or from eBPF program */
+static void *stack_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
+{
+	return __stack_map_lookup(map, true);
+}
+
+/* Called from syscall or from eBPF program */
+static int queue_stack_map_update_elem(struct bpf_map *map, void *key,
+				       void *value, u64 flags)
+{
+	struct bpf_queue_stack *qs = bpf_queue_stack(map);
+	unsigned long irq_flags;
+	int err = 0;
+	void *dst;
+
+	/* BPF_EXIST is used to force making room for a new element in case the
+	 * map is full
+	 */
+	bool replace = (flags & BPF_EXIST);
+
+	/* Check supported flags for queue and stqueue_map_is_preallocack maps */
+	if (flags & BPF_NOEXIST || flags > BPF_EXIST)
+		return -EINVAL;
+
+	raw_spin_lock_irqsave(&qs->lock, irq_flags);
+
+	if (queue_stack_map_is_full(qs)) {
+		if (!replace) {
+			err = -E2BIG;
+			goto out;
+		}
+		/* advance tail pointer to overwrite oldest element */
+		qs->tail = (qs->tail + 1) & qs->index_mask;
+		qs->count--;
+	}
+
+	dst = &qs->elements[qs->head * qs->map.value_size];
+
+	switch (qs->map.value_size) {
+	case 1:
+		*(u8 *) dst = *(u8 *) value;
+		break;
+	case 2:
+		*(u16 *) dst = *(u16 *) value;
+		break;
+	case 4:
+		*(u32 *) dst = *(u32 *) value;
+		break;
+	case 8:
+		*(u64 *) dst = *(u64 *) value;
+		break;
+	}
+
+	qs->head = (qs->head + 1) & qs->index_mask;
+	qs->count++;
+
+out:
+	raw_spin_unlock_irqrestore(&qs->lock, irq_flags);
+	return err;
+}
+
+/* Called from syscall or from eBPF program */
+static int queue_stack_map_delete_elem(struct bpf_map *map, void *key)
+{
+	return -EINVAL;
+}
+
+/* Called from syscall */
+static int queue_stack_map_get_next_key(struct bpf_map *map, void *key,
+					void *next_key)
+{
+	return -EINVAL;
+}
+
+const struct bpf_map_ops queue_map_ops = {
+	.map_alloc_check = queue_stack_map_alloc_check,
+	.map_alloc = queue_stack_map_alloc,
+	.map_free = queue_stack_map_free,
+	.map_lookup_elem = queue_map_lookup_elem,
+	.map_lookup_and_delete_elem = queue_map_lookup_and_delete_elem,
+	.map_update_elem = queue_stack_map_update_elem,
+	.map_delete_elem = queue_stack_map_delete_elem,
+	.map_get_next_key = queue_stack_map_get_next_key,
+};
+
+const struct bpf_map_ops stack_map_ops = {
+	.map_alloc_check = queue_stack_map_alloc_check,
+	.map_alloc = queue_stack_map_alloc,
+	.map_free = queue_stack_map_free,
+	.map_lookup_elem = stack_map_lookup_elem,
+	.map_lookup_and_delete_elem = stack_map_lookup_and_delete_elem,
+	.map_update_elem = queue_stack_map_update_elem,
+	.map_delete_elem = queue_stack_map_delete_elem,
+	.map_get_next_key = queue_stack_map_get_next_key,
+};
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f4ff0c569e54..b9e005188f0e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2369,7 +2369,10 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 	if (func_id != BPF_FUNC_tail_call &&
 	    func_id != BPF_FUNC_map_lookup_elem &&
 	    func_id != BPF_FUNC_map_update_elem &&
-	    func_id != BPF_FUNC_map_delete_elem)
+	    func_id != BPF_FUNC_map_delete_elem &&
+	    func_id != BPF_FUNC_map_push_elem &&
+	    func_id != BPF_FUNC_map_pop_elem &&
+	    func_id != BPF_FUNC_map_peek_elem)
 		return 0;
 
 	if (meta->map_ptr == NULL) {
diff --git a/net/core/filter.c b/net/core/filter.c
index feb578506009..c7b73376c23a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4839,6 +4839,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_map_update_elem_proto;
 	case BPF_FUNC_map_delete_elem:
 		return &bpf_map_delete_elem_proto;
+	case BPF_FUNC_map_push_elem:
+		return &bpf_map_push_elem_proto;
+	case BPF_FUNC_map_pop_elem:
+		return &bpf_map_pop_elem_proto;
+	case BPF_FUNC_map_peek_elem:
+		return &bpf_map_peek_elem_proto;
 	case BPF_FUNC_get_prandom_u32:
 		return &bpf_get_prandom_u32_proto;
 	case BPF_FUNC_get_smp_processor_id:

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH bpf-next v3 5/7] bpf: restrict use of peek/push/pop
  2018-09-18  4:52 [RFC PATCH bpf-next v3 0/7] Implement bpf queue/stack maps Mauricio Vasquez B
                   ` (3 preceding siblings ...)
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 4/7] bpf: add bpf queue and stack maps Mauricio Vasquez B
@ 2018-09-18  4:52 ` Mauricio Vasquez B
  2018-09-18  4:53 ` [RFC PATCH bpf-next v3 6/7] Sync uapi/bpf.h to tools/include Mauricio Vasquez B
  2018-09-18  4:53 ` [RFC PATCH bpf-next v3 7/7] selftests/bpf: add test cases for queue and stack maps Mauricio Vasquez B
  6 siblings, 0 replies; 17+ messages in thread
From: Mauricio Vasquez B @ 2018-09-18  4:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev; +Cc: Yonghong Song

Restrict the use of peek, push and pop helpers only to queue and stack
maps.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
 kernel/bpf/verifier.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b9e005188f0e..1628ffe48e32 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2084,6 +2084,13 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (func_id != BPF_FUNC_sk_select_reuseport)
 			goto error;
 		break;
+	case BPF_MAP_TYPE_QUEUE:
+	case BPF_MAP_TYPE_STACK:
+		if (func_id != BPF_FUNC_map_peek_elem &&
+		    func_id != BPF_FUNC_map_pop_elem &&
+		    func_id != BPF_FUNC_map_push_elem)
+			goto error;
+		break;
 	default:
 		break;
 	}
@@ -2139,6 +2146,13 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
 			goto error;
 		break;
+	case BPF_FUNC_map_peek_elem:
+	case BPF_FUNC_map_pop_elem:
+	case BPF_FUNC_map_push_elem:
+		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
+		    map->map_type != BPF_MAP_TYPE_STACK)
+			goto error;
+		break;
 	default:
 		break;
 	}

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH bpf-next v3 6/7] Sync uapi/bpf.h to tools/include
  2018-09-18  4:52 [RFC PATCH bpf-next v3 0/7] Implement bpf queue/stack maps Mauricio Vasquez B
                   ` (4 preceding siblings ...)
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 5/7] bpf: restrict use of peek/push/pop Mauricio Vasquez B
@ 2018-09-18  4:53 ` Mauricio Vasquez B
  2018-09-18  4:53 ` [RFC PATCH bpf-next v3 7/7] selftests/bpf: add test cases for queue and stack maps Mauricio Vasquez B
  6 siblings, 0 replies; 17+ messages in thread
From: Mauricio Vasquez B @ 2018-09-18  4:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev; +Cc: Yonghong Song

Sync both files.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
 tools/include/uapi/linux/bpf.h |   31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4eba27..c899386dcb2b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -103,6 +103,7 @@ enum bpf_cmd {
 	BPF_BTF_LOAD,
 	BPF_BTF_GET_FD_BY_ID,
 	BPF_TASK_FD_QUERY,
+	BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 };
 
 enum bpf_map_type {
@@ -127,6 +128,8 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_SOCKHASH,
 	BPF_MAP_TYPE_CGROUP_STORAGE,
 	BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
+	BPF_MAP_TYPE_QUEUE,
+	BPF_MAP_TYPE_STACK,
 };
 
 enum bpf_prog_type {
@@ -459,6 +462,29 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
+ * int bpf_map_push_elem(struct bpf_map *map, const void *value, u32 len,
+ * 			 u64 flags)
+ * 	Description
+ * 		Push an element *value* in *map*. *flags* is one of:
+ *
+ * 		**BPF_EXIST**
+ * 		If the queue/stack is full, the oldest element is removed to
+ * 		make room for this.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_map_pop_elem(struct bpf_map *map, void *value, u32 len)
+ * 	Description
+ * 		Pop an element from *map*.
+ * Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_map_peek_elem(struct bpf_map *map, void *value, u32 len)
+ * 	Description
+ * 		Get an element from *map* without removing it.
+ * Return
+ * 		0 on success, or a negative error in case of failure.
+ *
  * int bpf_probe_read(void *dst, u32 size, const void *src)
  * 	Description
  * 		For tracing programs, safely attempt to read *size* bytes from
@@ -2226,7 +2252,10 @@ union bpf_attr {
 	FN(get_current_cgroup_id),	\
 	FN(get_local_storage),		\
 	FN(sk_select_reuseport),	\
-	FN(skb_ancestor_cgroup_id),
+	FN(skb_ancestor_cgroup_id),	\
+	FN(map_push_elem),		\
+	FN(map_pop_elem),		\
+	FN(map_peek_elem),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH bpf-next v3 7/7] selftests/bpf: add test cases for queue and stack maps
  2018-09-18  4:52 [RFC PATCH bpf-next v3 0/7] Implement bpf queue/stack maps Mauricio Vasquez B
                   ` (5 preceding siblings ...)
  2018-09-18  4:53 ` [RFC PATCH bpf-next v3 6/7] Sync uapi/bpf.h to tools/include Mauricio Vasquez B
@ 2018-09-18  4:53 ` Mauricio Vasquez B
  2018-09-18 23:32   ` Alexei Starovoitov
  6 siblings, 1 reply; 17+ messages in thread
From: Mauricio Vasquez B @ 2018-09-18  4:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev; +Cc: Yonghong Song

Two types of tests are done:
- test_maps: only userspace api.
- test_progs: userspace api and ebpf helpers.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
 kernel/bpf/helpers.c                               |    2 
 tools/lib/bpf/bpf.c                                |   12 ++
 tools/lib/bpf/bpf.h                                |    1 
 tools/testing/selftests/bpf/Makefile               |    5 +
 tools/testing/selftests/bpf/bpf_helpers.h          |    7 +
 tools/testing/selftests/bpf/test_maps.c            |  130 ++++++++++++++++++++
 tools/testing/selftests/bpf/test_progs.c           |   99 +++++++++++++++
 tools/testing/selftests/bpf/test_queue_map.c       |    4 +
 tools/testing/selftests/bpf/test_queue_stack_map.h |   59 +++++++++
 tools/testing/selftests/bpf/test_stack_map.c       |    4 +
 10 files changed, 321 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
 create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
 create mode 100644 tools/testing/selftests/bpf/test_stack_map.c

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5f364e6acaf1..1293cd5240e3 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -76,7 +76,7 @@ const struct bpf_func_proto bpf_map_delete_elem_proto = {
 	.arg2_type	= ARG_PTR_TO_MAP_KEY,
 };
 
-BPF_CALL_4(bpf_map_push_elem, struct bpf_map *, map, void *, value, u32 size,
+BPF_CALL_4(bpf_map_push_elem, struct bpf_map *, map, void *, value, u32, size,
 	   u64, flags)
 {
 	if (map->value_size != size)
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 60aa4ca8b2c5..7056b2eb554d 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -286,6 +286,18 @@ int bpf_map_lookup_elem(int fd, const void *key, void *value)
 	return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
 }
 
+int bpf_map_lookup_and_delete_elem(int fd, const void *key, const void *value)
+{
+	union bpf_attr attr;
+
+	bzero(&attr, sizeof(attr));
+	attr.map_fd = fd;
+	attr.key = ptr_to_u64(key);
+	attr.value = ptr_to_u64(value);
+
+	return sys_bpf(BPF_MAP_LOOKUP_AND_DELETE_ELEM, &attr, sizeof(attr));
+}
+
 int bpf_map_delete_elem(int fd, const void *key)
 {
 	union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 6f38164b2618..6134ed9517d3 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -86,6 +86,7 @@ int bpf_map_update_elem(int fd, const void *key, const void *value,
 			__u64 flags);
 
 int bpf_map_lookup_elem(int fd, const void *key, void *value);
+int bpf_map_lookup_and_delete_elem(int fd, const void *key, const void *value);
 int bpf_map_delete_elem(int fd, const void *key);
 int bpf_map_get_next_key(int fd, const void *key, void *next_key);
 int bpf_obj_pin(int fd, const char *pathname);
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index fff7fb1285fc..ad8a2b8fb738 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
 	test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-	test_skb_cgroup_id_kern.o
+	test_skb_cgroup_id_kern.o test_queue_map.o test_stack_map.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -110,6 +110,9 @@ CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \
 $(OUTPUT)/test_l4lb_noinline.o: CLANG_FLAGS += -fno-inline
 $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline
 
+$(OUTPUT)/test_queue_map.o: test_queue_stack_map.h
+$(OUTPUT)/test_stack_map.o: test_queue_stack_map.h
+
 BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
 BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
 BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 'usage.*llvm')
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index e4be7730222d..bdbe8f84023e 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -16,6 +16,13 @@ static int (*bpf_map_update_elem)(void *map, void *key, void *value,
 	(void *) BPF_FUNC_map_update_elem;
 static int (*bpf_map_delete_elem)(void *map, void *key) =
 	(void *) BPF_FUNC_map_delete_elem;
+static int (*bpf_map_push_elem)(void *map, const void *value, int len,
+				unsigned long long flags) =
+	(void *) BPF_FUNC_map_push_elem;
+static int (*bpf_map_pop_elem)(void *map, void *value, int len) =
+	(void *) BPF_FUNC_map_pop_elem;
+static int (*bpf_map_peek_elem)(void *map, void *value, int len) =
+	(void *) BPF_FUNC_map_peek_elem;
 static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
 	(void *) BPF_FUNC_probe_read;
 static unsigned long long (*bpf_ktime_get_ns)(void) =
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 6f54f84144a0..abb21ea731fa 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -15,6 +15,7 @@
 #include <string.h>
 #include <assert.h>
 #include <stdlib.h>
+#include <time.h>
 
 #include <sys/wait.h>
 #include <sys/socket.h>
@@ -471,6 +472,130 @@ static void test_devmap(int task, void *data)
 	close(fd);
 }
 
+static void test_queuemap(int task, void *data)
+{
+	const int MAP_SIZE = 32;
+	__u32 vals[MAP_SIZE + MAP_SIZE/2], val;
+	int fd, i;
+
+	/* Fill test values to be used */
+	for (i = 0; i < MAP_SIZE + MAP_SIZE/2; i++)
+		vals[i] = rand();
+
+	/* Invalid key size */
+	fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 4, sizeof(val), MAP_SIZE,
+			    map_flags);
+	assert(fd < 0 && errno == EINVAL);
+
+	/* Invalid value size */
+	fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, 3, MAP_SIZE,map_flags);
+	assert(fd < 0 && errno == EINVAL);
+
+	fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, sizeof(val), MAP_SIZE,
+			    map_flags);
+	/* Queue map does not support BPF_F_NO_PREALLOC */
+	if (map_flags & BPF_F_NO_PREALLOC) {
+		assert(fd < 0 && errno == EINVAL);
+		return;
+	}
+	if (fd < 0) {
+		printf("Failed to create queuemap '%s'!\n", strerror(errno));
+		exit(1);
+	}
+
+	/* Push MAP_SIZE elements */
+	for (i = 0; i < MAP_SIZE; i++)
+		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
+
+	/* Check that element cannot be pushed due to max_entries limit */
+	assert(bpf_map_update_elem(fd, NULL, &val, 0) == -1 &&
+	       errno == E2BIG);
+
+	/* Peek element */
+	assert(bpf_map_lookup_elem(fd, NULL, &val) == 0 && val == vals[0]);
+
+	/* Replace half elements */
+	for (i = MAP_SIZE; i < MAP_SIZE + MAP_SIZE/2; i++)
+		assert(bpf_map_update_elem(fd, NULL, &vals[i], BPF_EXIST) == 0);
+
+	/* Pop all elements */
+	for (i = MAP_SIZE/2; i < MAP_SIZE + MAP_SIZE/2; i++)
+		assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
+		       val == vals[i]);
+
+	/* Check that there are not elements left */
+	assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == -1 &&
+	       errno == ENOENT);
+
+	/* Check that non supported functions set errno to EINVAL */
+	assert(bpf_map_delete_elem(fd, NULL) == -1 && errno == EINVAL);
+	assert(bpf_map_get_next_key(fd, NULL, NULL) == -1 && errno == EINVAL);
+
+	close(fd);
+}
+
+static void test_stackmap(int task, void *data)
+{
+	const int MAP_SIZE = 32;
+	__u32 vals[MAP_SIZE + MAP_SIZE/2], val;
+	int fd, i;
+
+	/* Fill test values to be used */
+	for (i = 0; i < MAP_SIZE + MAP_SIZE/2; i++)
+		vals[i] = rand();
+
+	/* Invalid key size */
+	fd = bpf_create_map(BPF_MAP_TYPE_STACK, 4, sizeof(val), MAP_SIZE,
+			    map_flags);
+	assert(fd < 0 && errno == EINVAL);
+
+	/* Invalid value size */
+	fd = bpf_create_map(BPF_MAP_TYPE_STACK, 0, 3, MAP_SIZE,map_flags);
+	assert(fd < 0 && errno == EINVAL);
+
+	fd = bpf_create_map(BPF_MAP_TYPE_STACK, 0, sizeof(val), MAP_SIZE,
+			    map_flags);
+	/* Stack map does not support BPF_F_NO_PREALLOC */
+	if (map_flags & BPF_F_NO_PREALLOC) {
+		assert(fd < 0 && errno == EINVAL);
+		return;
+	}
+	if (fd < 0) {
+		printf("Failed to create stackmap '%s'!\n", strerror(errno));
+		exit(1);
+	}
+
+	/* Push MAP_SIZE elements */
+	for (i = 0; i < MAP_SIZE; i++)
+		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
+
+	/* Check that element cannot be pushed due to max_entries limit */
+	assert(bpf_map_update_elem(fd, NULL, &val, 0) == -1 &&
+	       errno == E2BIG);
+
+	/* Peek element */
+	assert(bpf_map_lookup_elem(fd, NULL, &val) == 0 && val == vals[i - 1]);
+
+	/* Replace half elements */
+	for (i = MAP_SIZE; i < MAP_SIZE + MAP_SIZE/2; i++)
+		assert(bpf_map_update_elem(fd, NULL, &vals[i], BPF_EXIST) == 0);
+
+	/* Pop all elements */
+	for (i = MAP_SIZE + MAP_SIZE/2 - 1; i >= MAP_SIZE/2; i--)
+		assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
+		       val == vals[i]);
+
+	/* Check that there are not elements left */
+	assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == -1 &&
+	       errno == ENOENT);
+
+	/* Check that non supported functions set errno to EINVAL */
+	assert(bpf_map_delete_elem(fd, NULL) == -1 && errno == EINVAL);
+	assert(bpf_map_get_next_key(fd, NULL, NULL) == -1 && errno == EINVAL);
+
+	close(fd);
+}
+
 #include <sys/socket.h>
 #include <sys/ioctl.h>
 #include <arpa/inet.h>
@@ -1430,10 +1555,15 @@ static void run_all_tests(void)
 	test_map_wronly();
 
 	test_reuseport_array();
+
+	test_queuemap(0, NULL);
+	test_stackmap(0, NULL);
 }
 
 int main(void)
 {
+	srand(time(NULL));
+
 	map_flags = 0;
 	run_all_tests();
 
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 63a671803ed6..a69b9e1e668d 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1698,8 +1698,105 @@ static void test_task_fd_query_tp(void)
 				   "sys_enter_read");
 }
 
+enum {
+	QUEUE,
+	STACK,
+};
+
+static void test_queue_stack_map(int type)
+{
+	const int MAP_SIZE = 32;
+	__u32 vals[MAP_SIZE], duration, retval, size, val;
+	int i, err, prog_fd, map_in_fd, map_out_fd;
+	char file[32], buf[128];
+	struct bpf_object *obj;
+	struct iphdr *iph = (void *)buf + sizeof(struct ethhdr);
+
+	/* Fill test values to be used */
+	for (i = 0; i < MAP_SIZE; i++)
+		vals[i] = rand();
+
+	if (type == QUEUE)
+		strncpy(file, "./test_queue_map.o", sizeof(file));
+	else if (type == STACK)
+		strncpy(file, "./test_stack_map.o", sizeof(file));
+	else
+		return;
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_SCHED_CLS, &obj, &prog_fd);
+	if (err) {
+		error_cnt++;
+		return;
+	}
+
+	map_in_fd = bpf_find_map(__func__, obj, "map_in");
+	if (map_in_fd < 0)
+		goto out;
+
+	map_out_fd = bpf_find_map(__func__, obj, "map_out");
+	if (map_out_fd < 0)
+		goto out;
+
+	/* Push 32 elements to the input map */
+	for (i = 0; i < MAP_SIZE; i++) {
+		err = bpf_map_update_elem(map_in_fd, NULL, &vals[i], 0);
+		if (err) {
+			error_cnt++;
+			goto out;
+		}
+	}
+
+	/* The eBPF program pushes iph.saddr in the output map,
+	 * pops the input map and saves this value in iph.daddr
+	 */
+	for (i = 0; i < MAP_SIZE; i++) {
+		if (type == QUEUE) {
+			val = vals[i];
+			pkt_v4.iph.saddr = vals[i] * 5;
+		} else if (type == STACK) {
+			val = vals[MAP_SIZE - 1 - i];
+			pkt_v4.iph.saddr = vals[MAP_SIZE - 1 - i] * 5;
+		}
+
+		err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+					buf, &size, &retval, &duration);
+		if (err || retval || size != sizeof(pkt_v4) ||
+		    iph->daddr != val)
+			break;
+	}
+
+	CHECK(err || retval || size != sizeof(pkt_v4) || iph->daddr != val,
+	      "bpf_map_pop_elem",
+	      "err %d errno %d retval %d size %d iph->daddr %u\n",
+	      err, errno, retval, size, iph->daddr);
+
+	/* Queue is empty, program should return TC_ACT_SHOT */
+	err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+				buf, &size, &retval, &duration);
+	CHECK(err || retval != 2 /* TC_ACT_SHOT */|| size != sizeof(pkt_v4),
+	      "check-queue-stack-map-empty",
+	      "err %d errno %d retval %d size %d\n",
+	      err, errno, retval, size);
+
+	/* Check that the program pushed elements correctly */
+	for (i = 0; i < MAP_SIZE; i++) {
+		err = bpf_map_lookup_and_delete_elem(map_out_fd, NULL, &val);
+		if (err || val != vals[i] * 5)
+			break;
+	}
+
+	CHECK(i != MAP_SIZE && (err || val != vals[i] * 5),
+	      "bpf_map_push_elem", "err %d value %u\n", err, val);
+
+out:
+	pkt_v4.iph.saddr = 0;
+	bpf_object__close(obj);
+}
+
 int main(void)
 {
+	srand(time(NULL));
+
 	jit_enabled = is_jit_enabled();
 
 	test_pkt_access();
@@ -1719,6 +1816,8 @@ int main(void)
 	test_get_stack_raw_tp();
 	test_task_fd_query_rawtp();
 	test_task_fd_query_tp();
+	test_queue_stack_map(QUEUE);
+	test_queue_stack_map(STACK);
 
 	printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
 	return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
diff --git a/tools/testing/selftests/bpf/test_queue_map.c b/tools/testing/selftests/bpf/test_queue_map.c
new file mode 100644
index 000000000000..3fdb3b9cb038
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_queue_map.c
@@ -0,0 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2017 Politecnico di Torino
+#define MAP_TYPE BPF_MAP_TYPE_QUEUE
+#include "test_queue_stack_map.h"
diff --git a/tools/testing/selftests/bpf/test_queue_stack_map.h b/tools/testing/selftests/bpf/test_queue_stack_map.h
new file mode 100644
index 000000000000..23a5b5d60069
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_queue_stack_map.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+// Copyright (c) 2017 Politecnico di Torino
+#include <stddef.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/pkt_cls.h>
+#include "bpf_helpers.h"
+
+int _version SEC("version") = 1;
+
+struct bpf_map_def __attribute__ ((section("maps"), used)) map_in = {
+	.type = MAP_TYPE,
+	.key_size = 0,
+	.value_size = sizeof(__u32),
+	.max_entries = 32,
+	.map_flags = 0,
+};
+
+struct bpf_map_def __attribute__ ((section("maps"), used)) map_out = {
+	.type = MAP_TYPE,
+	.key_size = 0,
+	.value_size = sizeof(__u32),
+	.max_entries = 32,
+	.map_flags = 0,
+};
+
+SEC("test")
+int _test(struct __sk_buff *skb)
+{
+	void *data_end = (void *)(long)skb->data_end;
+	void *data = (void *)(long)skb->data;
+	struct ethhdr *eth = (struct ethhdr *)(data);
+	__u32 value;
+	int err;
+
+	if (eth + 1 > data_end)
+		return TC_ACT_SHOT;
+
+	struct iphdr *iph = (struct iphdr *)(eth + 1);
+
+	if (iph + 1 > data_end)
+		return TC_ACT_SHOT;
+
+	err = bpf_map_pop_elem(&map_in, &value, sizeof(value));
+	if (err)
+		return TC_ACT_SHOT;
+
+	iph->daddr = value;
+
+	err = bpf_map_push_elem(&map_out, &iph->saddr, sizeof(iph->saddr), 0);
+	if (err)
+		return TC_ACT_SHOT;
+
+	return TC_ACT_OK;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_stack_map.c b/tools/testing/selftests/bpf/test_stack_map.c
new file mode 100644
index 000000000000..5be9159b2927
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_stack_map.c
@@ -0,0 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2017 Politecnico di Torino
+#define MAP_TYPE BPF_MAP_TYPE_STACK
+#include "test_queue_stack_map.h"

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH bpf-next v3 1/7] bpf: rename stack trace map
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 1/7] bpf: rename stack trace map Mauricio Vasquez B
@ 2018-09-18 23:05   ` Alexei Starovoitov
  2018-09-19  2:53     ` Mauricio Vasquez
  0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2018-09-18 23:05 UTC (permalink / raw)
  To: Mauricio Vasquez B
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Yonghong Song

On Tue, Sep 18, 2018 at 06:52:34AM +0200, Mauricio Vasquez B wrote:
> In the following patches queue and stack maps (FIFO and LIFO
> datastructures) will be implemented.  In order to avoid confusion and
> a possible name clash rename stackmap.c to stacktracemap.c and
> stack_map_ops to stack_trace_map_ops
> 
> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
> ---
>  include/linux/bpf_types.h  |    2 
>  kernel/bpf/Makefile        |    2 
>  kernel/bpf/stackmap.c      |  624 --------------------------------------------
>  kernel/bpf/stacktracemap.c |  624 ++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 626 insertions(+), 626 deletions(-)
>  delete mode 100644 kernel/bpf/stackmap.c
>  create mode 100644 kernel/bpf/stacktracemap.c

Just new file for stack/queue will be enough. I understand that unfortunate name confusion,
but this is too late. It will cause all sorts of headaches in fixes that go to net/stable.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH bpf-next v3 4/7] bpf: add bpf queue and stack maps
  2018-09-18  4:52 ` [RFC PATCH bpf-next v3 4/7] bpf: add bpf queue and stack maps Mauricio Vasquez B
@ 2018-09-18 23:27   ` Alexei Starovoitov
  2018-09-19  4:28     ` Mauricio Vasquez
  0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2018-09-18 23:27 UTC (permalink / raw)
  To: Mauricio Vasquez B
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Yonghong Song

On Tue, Sep 18, 2018 at 06:52:51AM +0200, Mauricio Vasquez B wrote:
> Implement two new kind of maps that support the peek, push and pop
> operations.
> 
> A use case for this is to keep track of a pool of elements, like
> network ports in a SNAT.
> 
> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
> ---
>  include/linux/bpf.h           |    3 
>  include/linux/bpf_types.h     |    2 
>  include/uapi/linux/bpf.h      |   30 ++++
>  kernel/bpf/Makefile           |    2 
>  kernel/bpf/core.c             |    3 
>  kernel/bpf/helpers.c          |   98 ++++++++++++++
>  kernel/bpf/queue_stack_maps.c |  291 +++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c         |    5 +
>  net/core/filter.c             |    6 +
>  9 files changed, 437 insertions(+), 3 deletions(-)
>  create mode 100644 kernel/bpf/queue_stack_maps.c
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index c63a44381d3f..8e924b5c5a0e 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -807,6 +807,9 @@ static inline int bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
>  extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
>  extern const struct bpf_func_proto bpf_map_update_elem_proto;
>  extern const struct bpf_func_proto bpf_map_delete_elem_proto;
> +extern const struct bpf_func_proto bpf_map_push_elem_proto;
> +extern const struct bpf_func_proto bpf_map_pop_elem_proto;
> +extern const struct bpf_func_proto bpf_map_peek_elem_proto;
>  
>  extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
>  extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 33f7f574b983..903a446f14c3 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -67,3 +67,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
>  #endif
>  #endif
> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4cda584c6640..c899386dcb2b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -128,6 +128,8 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_SOCKHASH,
>  	BPF_MAP_TYPE_CGROUP_STORAGE,
>  	BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
> +	BPF_MAP_TYPE_QUEUE,
> +	BPF_MAP_TYPE_STACK,
>  };
>  
>  enum bpf_prog_type {
> @@ -460,6 +462,29 @@ union bpf_attr {
>   * 	Return
>   * 		0 on success, or a negative error in case of failure.
>   *
> + * int bpf_map_push_elem(struct bpf_map *map, const void *value, u32 len,
> + * 			 u64 flags)

since we're passing <=8 byte value there is no need for pointer and len.
Lower bits of 'u64 value' would be faster and easier to use from bpf prog side.

> + * 	Description
> + * 		Push an element *value* in *map*. *flags* is one of:
> + *
> + * 		**BPF_EXIST**
> + * 		If the queue/stack is full, the oldest element is removed to
> + * 		make room for this.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_map_pop_elem(struct bpf_map *map, void *value, u32 len)
> + * 	Description
> + * 		Pop an element from *map*.
> + * Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_map_peek_elem(struct bpf_map *map, void *value, u32 len)

for pop/peak helpers the 'void *value' makes sense,
but 'len' doesn't need to be there and doesn't need to be checked in runtime.
Verifier should be able to do that.
More below.

> + * 	Description
> + * 		Get an element from *map* without removing it.
> + * Return
> + * 		0 on success, or a negative error in case of failure.
> + *
>   * int bpf_probe_read(void *dst, u32 size, const void *src)
>   * 	Description
>   * 		For tracing programs, safely attempt to read *size* bytes from
> @@ -2227,7 +2252,10 @@ union bpf_attr {
>  	FN(get_current_cgroup_id),	\
>  	FN(get_local_storage),		\
>  	FN(sk_select_reuseport),	\
> -	FN(skb_ancestor_cgroup_id),
> +	FN(skb_ancestor_cgroup_id),	\
> +	FN(map_push_elem),		\
> +	FN(map_pop_elem),		\
> +	FN(map_peek_elem),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index e656bce87c8f..2d77bc5b2aca 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -3,7 +3,7 @@ obj-y := core.o
>  
>  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
> -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o
> +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o
>  ifeq ($(CONFIG_NET),y)
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 3f5bf1af0826..8d2db076d123 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -1783,6 +1783,9 @@ BPF_CALL_0(bpf_user_rnd_u32)
>  const struct bpf_func_proto bpf_map_lookup_elem_proto __weak;
>  const struct bpf_func_proto bpf_map_update_elem_proto __weak;
>  const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
> +const struct bpf_func_proto bpf_map_push_elem_proto __weak;
> +const struct bpf_func_proto bpf_map_pop_elem_proto __weak;
> +const struct bpf_func_proto bpf_map_peek_elem_proto __weak;
>  
>  const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
>  const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 1991466b8327..5f364e6acaf1 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -76,6 +76,104 @@ const struct bpf_func_proto bpf_map_delete_elem_proto = {
>  	.arg2_type	= ARG_PTR_TO_MAP_KEY,
>  };
>  
> +BPF_CALL_4(bpf_map_push_elem, struct bpf_map *, map, void *, value, u32 size,
> +	   u64, flags)
> +{
> +	if (map->value_size != size)
> +		return -EINVAL;
> +
> +	return map->ops->map_update_elem(map, NULL, value, flags);
> +}
> +
> +const struct bpf_func_proto bpf_map_push_elem_proto = {
> +	.func		= bpf_map_push_elem,
> +	.gpl_only	= false,
> +	.pkt_access	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_CONST_MAP_PTR,
> +	.arg2_type	= ARG_PTR_TO_MEM,
> +	.arg3_type	= ARG_CONST_SIZE,
> +	.arg4_type	= ARG_ANYTHING,
> +};
> +
> +BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *, value, u32, size)
> +{
> +	void *ptr;
> +
> +	if (map->value_size != size)
> +		return -EINVAL;
> +
> +	ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
> +	if (!ptr)
> +		return -ENOENT;
> +
> +	switch (size) {
> +	case 1:
> +		*(u8 *) value = *(u8 *) ptr;
> +		break;
> +	case 2:
> +		*(u16 *) value = *(u16 *) ptr;
> +		break;
> +	case 4:
> +		*(u32 *) value = *(u32 *) ptr;
> +		break;
> +	case 8:
> +		*(u64 *) value = *(u64 *) ptr;
> +		break;

this is inefficient. can we pass value ptr into ops and let it populate it?

> +	}
> +
> +	return 0;
> +}
> +
> +const struct bpf_func_proto bpf_map_pop_elem_proto = {
> +	.func		= bpf_map_pop_elem,
> +	.gpl_only	= false,
> +	.pkt_access	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_CONST_MAP_PTR,
> +	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
> +	.arg3_type	= ARG_CONST_SIZE,
> +};
> +
> +BPF_CALL_3(bpf_map_peek_elem, struct bpf_map *, map, void *, value, u32, size)
> +{
> +	void *ptr;
> +
> +	if (map->value_size != size)
> +		return -EINVAL;
> +
> +	ptr = map->ops->map_lookup_elem(map, NULL);
> +	if (!ptr)
> +		return -ENOENT;
> +
> +	switch (size) {
> +	case 1:
> +		*(u8 *) value = *(u8 *) ptr;
> +		break;
> +	case 2:
> +		*(u16 *) value = *(u16 *) ptr;
> +		break;
> +	case 4:
> +		*(u32 *) value = *(u32 *) ptr;
> +		break;
> +	case 8:
> +		*(u64 *) value = *(u64 *) ptr;
> +		break;
> +	}
> +
> +	return 0;
> +}
> +
> +const struct bpf_func_proto bpf_map_peek_elem_proto = {
> +	.func		= bpf_map_pop_elem,
> +	.gpl_only	= false,
> +	.pkt_access	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_CONST_MAP_PTR,
> +	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
> +	.arg3_type	= ARG_CONST_SIZE,
> +};
> +
>  const struct bpf_func_proto bpf_get_prandom_u32_proto = {
>  	.func		= bpf_user_rnd_u32,
>  	.gpl_only	= false,
> diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
> new file mode 100644
> index 000000000000..10c081f3f02b
> --- /dev/null
> +++ b/kernel/bpf/queue_stack_maps.c
> @@ -0,0 +1,291 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * queue_stack_maps.c: BPF queue and stack maps
> + *
> + * Copyright (c) 2018 Politecnico di Torino
> + */
> +#include <linux/bpf.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include "percpu_freelist.h"
> +
> +#define QUEUE_STACK_CREATE_FLAG_MASK \
> +	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
> +
> +
> +struct bpf_queue_stack {
> +	struct bpf_map map;
> +	raw_spinlock_t lock;
> +	u32 head, tail;
> +	u32 index_mask;
> +	u32 count;
> +
> +	char elements[0] __aligned(8);
> +};
> +
> +static struct bpf_queue_stack *bpf_queue_stack(struct bpf_map *map)
> +{
> +	return container_of(map, struct bpf_queue_stack, map);
> +}
> +
> +static bool queue_stack_map_is_empty(struct bpf_queue_stack *qs)
> +{
> +	return qs->count == 0;
> +}
> +
> +static bool queue_stack_map_is_full(struct bpf_queue_stack *qs)
> +{
> +	return qs->count == qs->map.max_entries;
> +}
> +
> +/* Called from syscall */
> +static int queue_stack_map_alloc_check(union bpf_attr *attr)
> +{
> +	/* check sanity of attributes */
> +	if (attr->max_entries == 0 || attr->key_size != 0 ||
> +	    (attr->value_size != 1 && attr->value_size != 2 &&
> +	    attr->value_size != 4 && attr->value_size != 8) ||
> +	    attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK)
> +		return -EINVAL;
> +
> +	if (attr->value_size > KMALLOC_MAX_SIZE)
> +		/* if value_size is bigger, the user space won't be able to
> +		 * access the elements.
> +		 */
> +		return -E2BIG;
> +
> +	return 0;
> +}
> +
> +static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
> +{
> +	int ret, numa_node = bpf_map_attr_numa_node(attr);
> +	u32 max_entries, value_size, index_mask;
> +	u64 queue_size, cost, mask64;
> +	struct bpf_queue_stack *qs;
> +
> +	max_entries = attr->max_entries;
> +	value_size = attr->value_size;
> +
> +	/* From arraymap.c:
> +	 * On 32 bit archs roundup_pow_of_two() with max_entries that has
> +	 * upper most bit set in u32 space is undefined behavior due to
> +	 * resulting 1U << 32, so do it manually here in u64 space.
> +	 */
> +	mask64 = fls_long(max_entries - 1);
> +	mask64 = 1ULL << mask64;
> +	mask64 -= 1;
> +
> +	index_mask = mask64;
> +
> +	/* Round up queue size to nearest power of 2 */
> +	max_entries = index_mask + 1;
> +	/* Check for overflows. */
> +	if (max_entries < attr->max_entries)
> +		return ERR_PTR(-E2BIG);
> +
> +	queue_size = sizeof(*qs) + (u64) value_size * max_entries;
> +
> +	cost = queue_size;
> +	if (cost >= U32_MAX - PAGE_SIZE)
> +		return ERR_PTR(-E2BIG);
> +
> +	cost = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	ret = bpf_map_precharge_memlock(cost);
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +
> +	qs = bpf_map_area_alloc(queue_size, numa_node);
> +	if (!qs)
> +		return ERR_PTR(-ENOMEM);
> +
> +	memset(qs, 0, sizeof(*qs));
> +
> +	bpf_map_init_from_attr(&qs->map, attr);
> +
> +	qs->map.pages = cost;
> +	qs->index_mask = index_mask;
> +
> +	raw_spin_lock_init(&qs->lock);
> +
> +	return &qs->map;
> +}
> +
> +/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
> +static void queue_stack_map_free(struct bpf_map *map)
> +{
> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
> +
> +	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
> +	 * so the programs (can be more than one that used this map) were
> +	 * disconnected from events. Wait for outstanding critical sections in
> +	 * these programs to complete
> +	 */
> +	synchronize_rcu();
> +
> +	bpf_map_area_free(qs);
> +}
> +
> +static void *__queue_map_lookup(struct bpf_map *map, bool delete)
> +{
> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
> +	unsigned long flags;
> +	void *ptr = NULL;
> +
> +	raw_spin_lock_irqsave(&qs->lock, flags);
> +
> +	if (queue_stack_map_is_empty(qs))
> +		goto out;
> +
> +	ptr = &qs->elements[qs->tail * qs->map.value_size];
> +
> +	if (delete) {
> +		qs->tail = (qs->tail + 1) & qs->index_mask;
> +		qs->count--;
> +	}
> +
> +out:
> +	raw_spin_unlock_irqrestore(&qs->lock, flags);
> +	return ptr;
> +}
> +
> +
> +static void *__stack_map_lookup(struct bpf_map *map, bool delete)
> +{
> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
> +	unsigned long flags;
> +	void *ptr = NULL;
> +	u32 index;
> +
> +	raw_spin_lock_irqsave(&qs->lock, flags);
> +
> +	if (queue_stack_map_is_empty(qs))
> +		goto out;
> +
> +	index = (qs->head - 1) & qs->index_mask;
> +	ptr = &qs->elements[index * qs->map.value_size];
> +
> +	if (delete) {
> +		qs->head = (qs->head - 1) & qs->index_mask;
> +		qs->count--;
> +	}
> +
> +out:
> +	raw_spin_unlock_irqrestore(&qs->lock, flags);

here it's racing with another cpu accessing the same map.
rcu doesn't protect from this particular qs->element being reused
by another cpu.

> +	return ptr;

say, stack is full. one cpu is doing pop(), getting this ptr,
while another cpu doing push() and overwriting this memory.

> +}
> +
> +/* Called from syscall or from eBPF program */
> +static void *queue_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	return __queue_map_lookup(map, false);
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	return __stack_map_lookup(map, false);
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static void *queue_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
> +{
> +	return __queue_map_lookup(map, true);
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static void *stack_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
> +{
> +	return __stack_map_lookup(map, true);
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int queue_stack_map_update_elem(struct bpf_map *map, void *key,
> +				       void *value, u64 flags)
> +{
> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
> +	unsigned long irq_flags;
> +	int err = 0;
> +	void *dst;
> +
> +	/* BPF_EXIST is used to force making room for a new element in case the
> +	 * map is full
> +	 */
> +	bool replace = (flags & BPF_EXIST);
> +
> +	/* Check supported flags for queue and stqueue_map_is_preallocack maps */
> +	if (flags & BPF_NOEXIST || flags > BPF_EXIST)
> +		return -EINVAL;
> +
> +	raw_spin_lock_irqsave(&qs->lock, irq_flags);
> +
> +	if (queue_stack_map_is_full(qs)) {
> +		if (!replace) {
> +			err = -E2BIG;
> +			goto out;
> +		}
> +		/* advance tail pointer to overwrite oldest element */
> +		qs->tail = (qs->tail + 1) & qs->index_mask;
> +		qs->count--;
> +	}
> +
> +	dst = &qs->elements[qs->head * qs->map.value_size];
> +
> +	switch (qs->map.value_size) {
> +	case 1:
> +		*(u8 *) dst = *(u8 *) value;
> +		break;
> +	case 2:
> +		*(u16 *) dst = *(u16 *) value;
> +		break;
> +	case 4:
> +		*(u32 *) dst = *(u32 *) value;
> +		break;
> +	case 8:
> +		*(u64 *) dst = *(u64 *) value;
> +		break;
> +	}
> +
> +	qs->head = (qs->head + 1) & qs->index_mask;
> +	qs->count++;
> +
> +out:
> +	raw_spin_unlock_irqrestore(&qs->lock, irq_flags);
> +	return err;
> +}
> +
> +/* Called from syscall or from eBPF program */
> +static int queue_stack_map_delete_elem(struct bpf_map *map, void *key)
> +{
> +	return -EINVAL;
> +}
> +
> +/* Called from syscall */
> +static int queue_stack_map_get_next_key(struct bpf_map *map, void *key,
> +					void *next_key)
> +{
> +	return -EINVAL;
> +}
> +
> +const struct bpf_map_ops queue_map_ops = {
> +	.map_alloc_check = queue_stack_map_alloc_check,
> +	.map_alloc = queue_stack_map_alloc,
> +	.map_free = queue_stack_map_free,
> +	.map_lookup_elem = queue_map_lookup_elem,
> +	.map_lookup_and_delete_elem = queue_map_lookup_and_delete_elem,
> +	.map_update_elem = queue_stack_map_update_elem,
> +	.map_delete_elem = queue_stack_map_delete_elem,
> +	.map_get_next_key = queue_stack_map_get_next_key,
> +};
> +
> +const struct bpf_map_ops stack_map_ops = {
> +	.map_alloc_check = queue_stack_map_alloc_check,
> +	.map_alloc = queue_stack_map_alloc,
> +	.map_free = queue_stack_map_free,
> +	.map_lookup_elem = stack_map_lookup_elem,
> +	.map_lookup_and_delete_elem = stack_map_lookup_and_delete_elem,
> +	.map_update_elem = queue_stack_map_update_elem,
> +	.map_delete_elem = queue_stack_map_delete_elem,
> +	.map_get_next_key = queue_stack_map_get_next_key,
> +};
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f4ff0c569e54..b9e005188f0e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2369,7 +2369,10 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>  	if (func_id != BPF_FUNC_tail_call &&
>  	    func_id != BPF_FUNC_map_lookup_elem &&
>  	    func_id != BPF_FUNC_map_update_elem &&
> -	    func_id != BPF_FUNC_map_delete_elem)
> +	    func_id != BPF_FUNC_map_delete_elem &&
> +	    func_id != BPF_FUNC_map_push_elem &&
> +	    func_id != BPF_FUNC_map_pop_elem &&
> +	    func_id != BPF_FUNC_map_peek_elem)
>  		return 0;
>  
>  	if (meta->map_ptr == NULL) {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index feb578506009..c7b73376c23a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4839,6 +4839,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>  		return &bpf_map_update_elem_proto;
>  	case BPF_FUNC_map_delete_elem:
>  		return &bpf_map_delete_elem_proto;
> +	case BPF_FUNC_map_push_elem:
> +		return &bpf_map_push_elem_proto;
> +	case BPF_FUNC_map_pop_elem:
> +		return &bpf_map_pop_elem_proto;
> +	case BPF_FUNC_map_peek_elem:
> +		return &bpf_map_peek_elem_proto;
>  	case BPF_FUNC_get_prandom_u32:
>  		return &bpf_get_prandom_u32_proto;
>  	case BPF_FUNC_get_smp_processor_id:
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH bpf-next v3 7/7] selftests/bpf: add test cases for queue and stack maps
  2018-09-18  4:53 ` [RFC PATCH bpf-next v3 7/7] selftests/bpf: add test cases for queue and stack maps Mauricio Vasquez B
@ 2018-09-18 23:32   ` Alexei Starovoitov
  2018-09-19  4:36     ` Mauricio Vasquez
  0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2018-09-18 23:32 UTC (permalink / raw)
  To: Mauricio Vasquez B
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Yonghong Song

On Tue, Sep 18, 2018 at 06:53:07AM +0200, Mauricio Vasquez B wrote:
> Two types of tests are done:
> - test_maps: only userspace api.
> - test_progs: userspace api and ebpf helpers.
> 
> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
> ---
>  kernel/bpf/helpers.c                               |    2 
>  tools/lib/bpf/bpf.c                                |   12 ++
>  tools/lib/bpf/bpf.h                                |    1 
>  tools/testing/selftests/bpf/Makefile               |    5 +
>  tools/testing/selftests/bpf/bpf_helpers.h          |    7 +
>  tools/testing/selftests/bpf/test_maps.c            |  130 ++++++++++++++++++++
>  tools/testing/selftests/bpf/test_progs.c           |   99 +++++++++++++++
>  tools/testing/selftests/bpf/test_queue_map.c       |    4 +
>  tools/testing/selftests/bpf/test_queue_stack_map.h |   59 +++++++++
>  tools/testing/selftests/bpf/test_stack_map.c       |    4 +
>  10 files changed, 321 insertions(+), 2 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
>  create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
>  create mode 100644 tools/testing/selftests/bpf/test_stack_map.c
> 
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 5f364e6acaf1..1293cd5240e3 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -76,7 +76,7 @@ const struct bpf_func_proto bpf_map_delete_elem_proto = {
>  	.arg2_type	= ARG_PTR_TO_MAP_KEY,
>  };
>  
> -BPF_CALL_4(bpf_map_push_elem, struct bpf_map *, map, void *, value, u32 size,
> +BPF_CALL_4(bpf_map_push_elem, struct bpf_map *, map, void *, value, u32, size,
>  	   u64, flags)

part of earlier patch?

>  {
>  	if (map->value_size != size)
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 60aa4ca8b2c5..7056b2eb554d 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -286,6 +286,18 @@ int bpf_map_lookup_elem(int fd, const void *key, void *value)
>  	return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
>  }
>  
> +int bpf_map_lookup_and_delete_elem(int fd, const void *key, const void *value)
> +{
> +	union bpf_attr attr;
> +
> +	bzero(&attr, sizeof(attr));
> +	attr.map_fd = fd;
> +	attr.key = ptr_to_u64(key);
> +	attr.value = ptr_to_u64(value);
> +
> +	return sys_bpf(BPF_MAP_LOOKUP_AND_DELETE_ELEM, &attr, sizeof(attr));
> +}
> +
>  int bpf_map_delete_elem(int fd, const void *key)
>  {
>  	union bpf_attr attr;
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 6f38164b2618..6134ed9517d3 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -86,6 +86,7 @@ int bpf_map_update_elem(int fd, const void *key, const void *value,
>  			__u64 flags);
>  
>  int bpf_map_lookup_elem(int fd, const void *key, void *value);
> +int bpf_map_lookup_and_delete_elem(int fd, const void *key, const void *value);
>  int bpf_map_delete_elem(int fd, const void *key);
>  int bpf_map_get_next_key(int fd, const void *key, void *next_key);
>  int bpf_obj_pin(int fd, const char *pathname);
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index fff7fb1285fc..ad8a2b8fb738 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
>  	test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
>  	test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
>  	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
> -	test_skb_cgroup_id_kern.o
> +	test_skb_cgroup_id_kern.o test_queue_map.o test_stack_map.o
>  
>  # Order correspond to 'make run_tests' order
>  TEST_PROGS := test_kmod.sh \
> @@ -110,6 +110,9 @@ CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \
>  $(OUTPUT)/test_l4lb_noinline.o: CLANG_FLAGS += -fno-inline
>  $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline
>  
> +$(OUTPUT)/test_queue_map.o: test_queue_stack_map.h
> +$(OUTPUT)/test_stack_map.o: test_queue_stack_map.h
> +
>  BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
>  BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
>  BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 'usage.*llvm')
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index e4be7730222d..bdbe8f84023e 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -16,6 +16,13 @@ static int (*bpf_map_update_elem)(void *map, void *key, void *value,
>  	(void *) BPF_FUNC_map_update_elem;
>  static int (*bpf_map_delete_elem)(void *map, void *key) =
>  	(void *) BPF_FUNC_map_delete_elem;
> +static int (*bpf_map_push_elem)(void *map, const void *value, int len,
> +				unsigned long long flags) =
> +	(void *) BPF_FUNC_map_push_elem;
> +static int (*bpf_map_pop_elem)(void *map, void *value, int len) =
> +	(void *) BPF_FUNC_map_pop_elem;
> +static int (*bpf_map_peek_elem)(void *map, void *value, int len) =
> +	(void *) BPF_FUNC_map_peek_elem;
>  static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
>  	(void *) BPF_FUNC_probe_read;
>  static unsigned long long (*bpf_ktime_get_ns)(void) =
> diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> index 6f54f84144a0..abb21ea731fa 100644
> --- a/tools/testing/selftests/bpf/test_maps.c
> +++ b/tools/testing/selftests/bpf/test_maps.c
> @@ -15,6 +15,7 @@
>  #include <string.h>
>  #include <assert.h>
>  #include <stdlib.h>
> +#include <time.h>
>  
>  #include <sys/wait.h>
>  #include <sys/socket.h>
> @@ -471,6 +472,130 @@ static void test_devmap(int task, void *data)
>  	close(fd);
>  }
>  
> +static void test_queuemap(int task, void *data)
> +{
> +	const int MAP_SIZE = 32;
> +	__u32 vals[MAP_SIZE + MAP_SIZE/2], val;
> +	int fd, i;
> +
> +	/* Fill test values to be used */
> +	for (i = 0; i < MAP_SIZE + MAP_SIZE/2; i++)
> +		vals[i] = rand();
> +
> +	/* Invalid key size */
> +	fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 4, sizeof(val), MAP_SIZE,
> +			    map_flags);
> +	assert(fd < 0 && errno == EINVAL);
> +
> +	/* Invalid value size */
> +	fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, 3, MAP_SIZE,map_flags);
> +	assert(fd < 0 && errno == EINVAL);
> +
> +	fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, sizeof(val), MAP_SIZE,
> +			    map_flags);
> +	/* Queue map does not support BPF_F_NO_PREALLOC */
> +	if (map_flags & BPF_F_NO_PREALLOC) {
> +		assert(fd < 0 && errno == EINVAL);
> +		return;
> +	}
> +	if (fd < 0) {
> +		printf("Failed to create queuemap '%s'!\n", strerror(errno));
> +		exit(1);
> +	}
> +
> +	/* Push MAP_SIZE elements */
> +	for (i = 0; i < MAP_SIZE; i++)
> +		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
> +
> +	/* Check that element cannot be pushed due to max_entries limit */
> +	assert(bpf_map_update_elem(fd, NULL, &val, 0) == -1 &&
> +	       errno == E2BIG);
> +
> +	/* Peek element */
> +	assert(bpf_map_lookup_elem(fd, NULL, &val) == 0 && val == vals[0]);
> +
> +	/* Replace half elements */
> +	for (i = MAP_SIZE; i < MAP_SIZE + MAP_SIZE/2; i++)
> +		assert(bpf_map_update_elem(fd, NULL, &vals[i], BPF_EXIST) == 0);
> +
> +	/* Pop all elements */
> +	for (i = MAP_SIZE/2; i < MAP_SIZE + MAP_SIZE/2; i++)
> +		assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
> +		       val == vals[i]);
> +
> +	/* Check that there are not elements left */
> +	assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == -1 &&
> +	       errno == ENOENT);
> +
> +	/* Check that non supported functions set errno to EINVAL */
> +	assert(bpf_map_delete_elem(fd, NULL) == -1 && errno == EINVAL);
> +	assert(bpf_map_get_next_key(fd, NULL, NULL) == -1 && errno == EINVAL);
> +
> +	close(fd);
> +}
> +
> +static void test_stackmap(int task, void *data)
> +{
> +	const int MAP_SIZE = 32;
> +	__u32 vals[MAP_SIZE + MAP_SIZE/2], val;
> +	int fd, i;
> +
> +	/* Fill test values to be used */
> +	for (i = 0; i < MAP_SIZE + MAP_SIZE/2; i++)
> +		vals[i] = rand();
> +
> +	/* Invalid key size */
> +	fd = bpf_create_map(BPF_MAP_TYPE_STACK, 4, sizeof(val), MAP_SIZE,
> +			    map_flags);
> +	assert(fd < 0 && errno == EINVAL);
> +
> +	/* Invalid value size */
> +	fd = bpf_create_map(BPF_MAP_TYPE_STACK, 0, 3, MAP_SIZE,map_flags);
> +	assert(fd < 0 && errno == EINVAL);
> +
> +	fd = bpf_create_map(BPF_MAP_TYPE_STACK, 0, sizeof(val), MAP_SIZE,
> +			    map_flags);
> +	/* Stack map does not support BPF_F_NO_PREALLOC */
> +	if (map_flags & BPF_F_NO_PREALLOC) {
> +		assert(fd < 0 && errno == EINVAL);
> +		return;
> +	}
> +	if (fd < 0) {
> +		printf("Failed to create stackmap '%s'!\n", strerror(errno));
> +		exit(1);
> +	}
> +
> +	/* Push MAP_SIZE elements */
> +	for (i = 0; i < MAP_SIZE; i++)
> +		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
> +
> +	/* Check that element cannot be pushed due to max_entries limit */
> +	assert(bpf_map_update_elem(fd, NULL, &val, 0) == -1 &&
> +	       errno == E2BIG);
> +
> +	/* Peek element */
> +	assert(bpf_map_lookup_elem(fd, NULL, &val) == 0 && val == vals[i - 1]);
> +
> +	/* Replace half elements */
> +	for (i = MAP_SIZE; i < MAP_SIZE + MAP_SIZE/2; i++)
> +		assert(bpf_map_update_elem(fd, NULL, &vals[i], BPF_EXIST) == 0);
> +
> +	/* Pop all elements */
> +	for (i = MAP_SIZE + MAP_SIZE/2 - 1; i >= MAP_SIZE/2; i--)
> +		assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
> +		       val == vals[i]);
> +
> +	/* Check that there are not elements left */
> +	assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == -1 &&
> +	       errno == ENOENT);
> +
> +	/* Check that non supported functions set errno to EINVAL */
> +	assert(bpf_map_delete_elem(fd, NULL) == -1 && errno == EINVAL);
> +	assert(bpf_map_get_next_key(fd, NULL, NULL) == -1 && errno == EINVAL);
> +
> +	close(fd);
> +}
> +
>  #include <sys/socket.h>
>  #include <sys/ioctl.h>
>  #include <arpa/inet.h>
> @@ -1430,10 +1555,15 @@ static void run_all_tests(void)
>  	test_map_wronly();
>  
>  	test_reuseport_array();
> +
> +	test_queuemap(0, NULL);
> +	test_stackmap(0, NULL);
>  }
>  
>  int main(void)
>  {
> +	srand(time(NULL));
> +
>  	map_flags = 0;
>  	run_all_tests();
>  
> diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> index 63a671803ed6..a69b9e1e668d 100644
> --- a/tools/testing/selftests/bpf/test_progs.c
> +++ b/tools/testing/selftests/bpf/test_progs.c
> @@ -1698,8 +1698,105 @@ static void test_task_fd_query_tp(void)
>  				   "sys_enter_read");
>  }
>  
> +enum {
> +	QUEUE,
> +	STACK,
> +};
> +
> +static void test_queue_stack_map(int type)
> +{
> +	const int MAP_SIZE = 32;
> +	__u32 vals[MAP_SIZE], duration, retval, size, val;
> +	int i, err, prog_fd, map_in_fd, map_out_fd;
> +	char file[32], buf[128];
> +	struct bpf_object *obj;
> +	struct iphdr *iph = (void *)buf + sizeof(struct ethhdr);
> +
> +	/* Fill test values to be used */
> +	for (i = 0; i < MAP_SIZE; i++)
> +		vals[i] = rand();
> +
> +	if (type == QUEUE)
> +		strncpy(file, "./test_queue_map.o", sizeof(file));
> +	else if (type == STACK)
> +		strncpy(file, "./test_stack_map.o", sizeof(file));
> +	else
> +		return;
> +
> +	err = bpf_prog_load(file, BPF_PROG_TYPE_SCHED_CLS, &obj, &prog_fd);
> +	if (err) {
> +		error_cnt++;
> +		return;
> +	}
> +
> +	map_in_fd = bpf_find_map(__func__, obj, "map_in");
> +	if (map_in_fd < 0)
> +		goto out;
> +
> +	map_out_fd = bpf_find_map(__func__, obj, "map_out");
> +	if (map_out_fd < 0)
> +		goto out;
> +
> +	/* Push 32 elements to the input map */
> +	for (i = 0; i < MAP_SIZE; i++) {
> +		err = bpf_map_update_elem(map_in_fd, NULL, &vals[i], 0);
> +		if (err) {
> +			error_cnt++;
> +			goto out;
> +		}
> +	}
> +
> +	/* The eBPF program pushes iph.saddr in the output map,
> +	 * pops the input map and saves this value in iph.daddr
> +	 */
> +	for (i = 0; i < MAP_SIZE; i++) {
> +		if (type == QUEUE) {
> +			val = vals[i];
> +			pkt_v4.iph.saddr = vals[i] * 5;
> +		} else if (type == STACK) {
> +			val = vals[MAP_SIZE - 1 - i];
> +			pkt_v4.iph.saddr = vals[MAP_SIZE - 1 - i] * 5;
> +		}
> +
> +		err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
> +					buf, &size, &retval, &duration);
> +		if (err || retval || size != sizeof(pkt_v4) ||
> +		    iph->daddr != val)
> +			break;
> +	}
> +
> +	CHECK(err || retval || size != sizeof(pkt_v4) || iph->daddr != val,
> +	      "bpf_map_pop_elem",
> +	      "err %d errno %d retval %d size %d iph->daddr %u\n",
> +	      err, errno, retval, size, iph->daddr);
> +
> +	/* Queue is empty, program should return TC_ACT_SHOT */
> +	err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
> +				buf, &size, &retval, &duration);
> +	CHECK(err || retval != 2 /* TC_ACT_SHOT */|| size != sizeof(pkt_v4),
> +	      "check-queue-stack-map-empty",
> +	      "err %d errno %d retval %d size %d\n",
> +	      err, errno, retval, size);
> +
> +	/* Check that the program pushed elements correctly */
> +	for (i = 0; i < MAP_SIZE; i++) {
> +		err = bpf_map_lookup_and_delete_elem(map_out_fd, NULL, &val);
> +		if (err || val != vals[i] * 5)
> +			break;
> +	}
> +
> +	CHECK(i != MAP_SIZE && (err || val != vals[i] * 5),
> +	      "bpf_map_push_elem", "err %d value %u\n", err, val);
> +
> +out:
> +	pkt_v4.iph.saddr = 0;
> +	bpf_object__close(obj);
> +}
> +
>  int main(void)
>  {
> +	srand(time(NULL));
> +
>  	jit_enabled = is_jit_enabled();
>  
>  	test_pkt_access();
> @@ -1719,6 +1816,8 @@ int main(void)
>  	test_get_stack_raw_tp();
>  	test_task_fd_query_rawtp();
>  	test_task_fd_query_tp();
> +	test_queue_stack_map(QUEUE);
> +	test_queue_stack_map(STACK);
>  
>  	printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
>  	return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
> diff --git a/tools/testing/selftests/bpf/test_queue_map.c b/tools/testing/selftests/bpf/test_queue_map.c
> new file mode 100644
> index 000000000000..3fdb3b9cb038
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_queue_map.c
> @@ -0,0 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2017 Politecnico di Torino
> +#define MAP_TYPE BPF_MAP_TYPE_QUEUE
> +#include "test_queue_stack_map.h"
> diff --git a/tools/testing/selftests/bpf/test_queue_stack_map.h b/tools/testing/selftests/bpf/test_queue_stack_map.h
> new file mode 100644
> index 000000000000..23a5b5d60069
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_queue_stack_map.h
> @@ -0,0 +1,59 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +// Copyright (c) 2017 Politecnico di Torino
> +#include <stddef.h>
> +#include <string.h>
> +#include <linux/bpf.h>
> +#include <linux/if_ether.h>
> +#include <linux/ip.h>
> +#include <linux/pkt_cls.h>
> +#include "bpf_helpers.h"
> +
> +int _version SEC("version") = 1;
> +
> +struct bpf_map_def __attribute__ ((section("maps"), used)) map_in = {
> +	.type = MAP_TYPE,
> +	.key_size = 0,
> +	.value_size = sizeof(__u32),
> +	.max_entries = 32,
> +	.map_flags = 0,
> +};
> +
> +struct bpf_map_def __attribute__ ((section("maps"), used)) map_out = {
> +	.type = MAP_TYPE,
> +	.key_size = 0,
> +	.value_size = sizeof(__u32),
> +	.max_entries = 32,
> +	.map_flags = 0,
> +};
> +
> +SEC("test")
> +int _test(struct __sk_buff *skb)
> +{
> +	void *data_end = (void *)(long)skb->data_end;
> +	void *data = (void *)(long)skb->data;
> +	struct ethhdr *eth = (struct ethhdr *)(data);
> +	__u32 value;
> +	int err;
> +
> +	if (eth + 1 > data_end)
> +		return TC_ACT_SHOT;
> +
> +	struct iphdr *iph = (struct iphdr *)(eth + 1);
> +
> +	if (iph + 1 > data_end)
> +		return TC_ACT_SHOT;
> +
> +	err = bpf_map_pop_elem(&map_in, &value, sizeof(value));
> +	if (err)
> +		return TC_ACT_SHOT;
> +
> +	iph->daddr = value;
> +
> +	err = bpf_map_push_elem(&map_out, &iph->saddr, sizeof(iph->saddr), 0);
> +	if (err)
> +		return TC_ACT_SHOT;

is it possible to add a test for 'peek' helper in reliable way?
it seems it will alwasy be racy. may be we shouldn't implement peek at all?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH bpf-next v3 1/7] bpf: rename stack trace map
  2018-09-18 23:05   ` Alexei Starovoitov
@ 2018-09-19  2:53     ` Mauricio Vasquez
  0 siblings, 0 replies; 17+ messages in thread
From: Mauricio Vasquez @ 2018-09-19  2:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Yonghong Song



On 09/18/2018 06:05 PM, Alexei Starovoitov wrote:
> On Tue, Sep 18, 2018 at 06:52:34AM +0200, Mauricio Vasquez B wrote:
>> In the following patches queue and stack maps (FIFO and LIFO
>> datastructures) will be implemented.  In order to avoid confusion and
>> a possible name clash rename stackmap.c to stacktracemap.c and
>> stack_map_ops to stack_trace_map_ops
>>
>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
>> ---
>>   include/linux/bpf_types.h  |    2
>>   kernel/bpf/Makefile        |    2
>>   kernel/bpf/stackmap.c      |  624 --------------------------------------------
>>   kernel/bpf/stacktracemap.c |  624 ++++++++++++++++++++++++++++++++++++++++++++
>>   4 files changed, 626 insertions(+), 626 deletions(-)
>>   delete mode 100644 kernel/bpf/stackmap.c
>>   create mode 100644 kernel/bpf/stacktracemap.c
> Just new file for stack/queue will be enough. I understand that unfortunate name confusion,
> but this is too late. It will cause all sorts of headaches in fixes that go to net/stable.
>
>
Ok, will only rename stack_map_ops to stack_trace_map_ops then.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH bpf-next v3 4/7] bpf: add bpf queue and stack maps
  2018-09-18 23:27   ` Alexei Starovoitov
@ 2018-09-19  4:28     ` Mauricio Vasquez
       [not found]       ` <12a0d7d2-9590-b44f-803a-a00eefe611c1@polito.it>
  0 siblings, 1 reply; 17+ messages in thread
From: Mauricio Vasquez @ 2018-09-19  4:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Yonghong Song



On 09/18/2018 06:27 PM, Alexei Starovoitov wrote:
> On Tue, Sep 18, 2018 at 06:52:51AM +0200, Mauricio Vasquez B wrote:
>> Implement two new kind of maps that support the peek, push and pop
>> operations.
>>
>> A use case for this is to keep track of a pool of elements, like
>> network ports in a SNAT.
>>
>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
>> ---
>>   include/linux/bpf.h           |    3
>>   include/linux/bpf_types.h     |    2
>>   include/uapi/linux/bpf.h      |   30 ++++
>>   kernel/bpf/Makefile           |    2
>>   kernel/bpf/core.c             |    3
>>   kernel/bpf/helpers.c          |   98 ++++++++++++++
>>   kernel/bpf/queue_stack_maps.c |  291 +++++++++++++++++++++++++++++++++++++++++
>>   kernel/bpf/verifier.c         |    5 +
>>   net/core/filter.c             |    6 +
>>   9 files changed, 437 insertions(+), 3 deletions(-)
>>   create mode 100644 kernel/bpf/queue_stack_maps.c
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index c63a44381d3f..8e924b5c5a0e 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -807,6 +807,9 @@ static inline int bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
>>   extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
>>   extern const struct bpf_func_proto bpf_map_update_elem_proto;
>>   extern const struct bpf_func_proto bpf_map_delete_elem_proto;
>> +extern const struct bpf_func_proto bpf_map_push_elem_proto;
>> +extern const struct bpf_func_proto bpf_map_pop_elem_proto;
>> +extern const struct bpf_func_proto bpf_map_peek_elem_proto;
>>   
>>   extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
>>   extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index 33f7f574b983..903a446f14c3 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -67,3 +67,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
>>   BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
>>   #endif
>>   #endif
>> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
>> +BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 4cda584c6640..c899386dcb2b 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -128,6 +128,8 @@ enum bpf_map_type {
>>   	BPF_MAP_TYPE_SOCKHASH,
>>   	BPF_MAP_TYPE_CGROUP_STORAGE,
>>   	BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
>> +	BPF_MAP_TYPE_QUEUE,
>> +	BPF_MAP_TYPE_STACK,
>>   };
>>   
>>   enum bpf_prog_type {
>> @@ -460,6 +462,29 @@ union bpf_attr {
>>    * 	Return
>>    * 		0 on success, or a negative error in case of failure.
>>    *
>> + * int bpf_map_push_elem(struct bpf_map *map, const void *value, u32 len,
>> + * 			 u64 flags)
> since we're passing <=8 byte value there is no need for pointer and len.
> Lower bits of 'u64 value' would be faster and easier to use from bpf prog side.
I have the doubt between both solutions, will change it to only u64 value.

>> + * 	Description
>> + * 		Push an element *value* in *map*. *flags* is one of:
>> + *
>> + * 		**BPF_EXIST**
>> + * 		If the queue/stack is full, the oldest element is removed to
>> + * 		make room for this.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_map_pop_elem(struct bpf_map *map, void *value, u32 len)
>> + * 	Description
>> + * 		Pop an element from *map*.
>> + * Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_map_peek_elem(struct bpf_map *map, void *value, u32 len)
> for pop/peak helpers the 'void *value' makes sense,
> but 'len' doesn't need to be there and doesn't need to be checked in runtime.
> Verifier should be able to do that.

You are right, ARG_PTR_TO_MAP_VALUE should do these tests automatically 
based on map->value_size.

> More below.
>
>> + * 	Description
>> + * 		Get an element from *map* without removing it.
>> + * Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>>    * int bpf_probe_read(void *dst, u32 size, const void *src)
>>    * 	Description
>>    * 		For tracing programs, safely attempt to read *size* bytes from
>> @@ -2227,7 +2252,10 @@ union bpf_attr {
>>   	FN(get_current_cgroup_id),	\
>>   	FN(get_local_storage),		\
>>   	FN(sk_select_reuseport),	\
>> -	FN(skb_ancestor_cgroup_id),
>> +	FN(skb_ancestor_cgroup_id),	\
>> +	FN(map_push_elem),		\
>> +	FN(map_pop_elem),		\
>> +	FN(map_peek_elem),
>>   
>>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>    * function eBPF program intends to call
>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>> index e656bce87c8f..2d77bc5b2aca 100644
>> --- a/kernel/bpf/Makefile
>> +++ b/kernel/bpf/Makefile
>> @@ -3,7 +3,7 @@ obj-y := core.o
>>   
>>   obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
>>   obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>> -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o
>> +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
>>   obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>>   obj-$(CONFIG_BPF_SYSCALL) += btf.o
>>   ifeq ($(CONFIG_NET),y)
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 3f5bf1af0826..8d2db076d123 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -1783,6 +1783,9 @@ BPF_CALL_0(bpf_user_rnd_u32)
>>   const struct bpf_func_proto bpf_map_lookup_elem_proto __weak;
>>   const struct bpf_func_proto bpf_map_update_elem_proto __weak;
>>   const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
>> +const struct bpf_func_proto bpf_map_push_elem_proto __weak;
>> +const struct bpf_func_proto bpf_map_pop_elem_proto __weak;
>> +const struct bpf_func_proto bpf_map_peek_elem_proto __weak;
>>   
>>   const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
>>   const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 1991466b8327..5f364e6acaf1 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -76,6 +76,104 @@ const struct bpf_func_proto bpf_map_delete_elem_proto = {
>>   	.arg2_type	= ARG_PTR_TO_MAP_KEY,
>>   };
>>   
>> +BPF_CALL_4(bpf_map_push_elem, struct bpf_map *, map, void *, value, u32 size,
>> +	   u64, flags)
>> +{
>> +	if (map->value_size != size)
>> +		return -EINVAL;
>> +
>> +	return map->ops->map_update_elem(map, NULL, value, flags);
>> +}
>> +
>> +const struct bpf_func_proto bpf_map_push_elem_proto = {
>> +	.func		= bpf_map_push_elem,
>> +	.gpl_only	= false,
>> +	.pkt_access	= true,
>> +	.ret_type	= RET_INTEGER,
>> +	.arg1_type	= ARG_CONST_MAP_PTR,
>> +	.arg2_type	= ARG_PTR_TO_MEM,
>> +	.arg3_type	= ARG_CONST_SIZE,
>> +	.arg4_type	= ARG_ANYTHING,
>> +};
>> +
>> +BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *, value, u32, size)
>> +{
>> +	void *ptr;
>> +
>> +	if (map->value_size != size)
>> +		return -EINVAL;
>> +
>> +	ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
>> +	if (!ptr)
>> +		return -ENOENT;
>> +
>> +	switch (size) {
>> +	case 1:
>> +		*(u8 *) value = *(u8 *) ptr;
>> +		break;
>> +	case 2:
>> +		*(u16 *) value = *(u16 *) ptr;
>> +		break;
>> +	case 4:
>> +		*(u32 *) value = *(u32 *) ptr;
>> +		break;
>> +	case 8:
>> +		*(u64 *) value = *(u64 *) ptr;
>> +		break;
> this is inefficient. can we pass value ptr into ops and let it populate it?

I don't think so, doing that implies that look_and_delete will be a 
per-value op, while other ops in maps are per-reference.
For instance, how to change it in the case of peek helper that is using 
the lookup operation?, we cannot change the signature of the lookup 
operation.

This is something that worries me a little bit, we are creating new 
per-value helpers based on already existing per-reference operations, 
this is not probably the cleanest way.  Here we are at the beginning of 
the discussion once again, how should we map helpers and syscalls to ops.

What about creating pop/peek/push ops, mapping helpers one to one and 
adding some logic into syscall.c to call the correct operation in case 
the map is stack/queue?
Syscall mapping would be:
bpf_map_lookup_elem() -> peek
bpf_map_lookup_and_delete_elem() -> pop
bpf_map_update_elem() -> push

Does it make sense?

>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +const struct bpf_func_proto bpf_map_pop_elem_proto = {
>> +	.func		= bpf_map_pop_elem,
>> +	.gpl_only	= false,
>> +	.pkt_access	= true,
>> +	.ret_type	= RET_INTEGER,
>> +	.arg1_type	= ARG_CONST_MAP_PTR,
>> +	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
>> +	.arg3_type	= ARG_CONST_SIZE,
>> +};
>> +
>> +BPF_CALL_3(bpf_map_peek_elem, struct bpf_map *, map, void *, value, u32, size)
>> +{
>> +	void *ptr;
>> +
>> +	if (map->value_size != size)
>> +		return -EINVAL;
>> +
>> +	ptr = map->ops->map_lookup_elem(map, NULL);
>> +	if (!ptr)
>> +		return -ENOENT;
>> +
>> +	switch (size) {
>> +	case 1:
>> +		*(u8 *) value = *(u8 *) ptr;
>> +		break;
>> +	case 2:
>> +		*(u16 *) value = *(u16 *) ptr;
>> +		break;
>> +	case 4:
>> +		*(u32 *) value = *(u32 *) ptr;
>> +		break;
>> +	case 8:
>> +		*(u64 *) value = *(u64 *) ptr;
>> +		break;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +const struct bpf_func_proto bpf_map_peek_elem_proto = {
>> +	.func		= bpf_map_pop_elem,
>> +	.gpl_only	= false,
>> +	.pkt_access	= true,
>> +	.ret_type	= RET_INTEGER,
>> +	.arg1_type	= ARG_CONST_MAP_PTR,
>> +	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
>> +	.arg3_type	= ARG_CONST_SIZE,
>> +};
>> +
>>   const struct bpf_func_proto bpf_get_prandom_u32_proto = {
>>   	.func		= bpf_user_rnd_u32,
>>   	.gpl_only	= false,
>> diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
>> new file mode 100644
>> index 000000000000..10c081f3f02b
>> --- /dev/null
>> +++ b/kernel/bpf/queue_stack_maps.c
>> @@ -0,0 +1,291 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * queue_stack_maps.c: BPF queue and stack maps
>> + *
>> + * Copyright (c) 2018 Politecnico di Torino
>> + */
>> +#include <linux/bpf.h>
>> +#include <linux/list.h>
>> +#include <linux/slab.h>
>> +#include "percpu_freelist.h"
>> +
>> +#define QUEUE_STACK_CREATE_FLAG_MASK \
>> +	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
>> +
>> +
>> +struct bpf_queue_stack {
>> +	struct bpf_map map;
>> +	raw_spinlock_t lock;
>> +	u32 head, tail;
>> +	u32 index_mask;
>> +	u32 count;
>> +
>> +	char elements[0] __aligned(8);
>> +};
>> +
>> +static struct bpf_queue_stack *bpf_queue_stack(struct bpf_map *map)
>> +{
>> +	return container_of(map, struct bpf_queue_stack, map);
>> +}
>> +
>> +static bool queue_stack_map_is_empty(struct bpf_queue_stack *qs)
>> +{
>> +	return qs->count == 0;
>> +}
>> +
>> +static bool queue_stack_map_is_full(struct bpf_queue_stack *qs)
>> +{
>> +	return qs->count == qs->map.max_entries;
>> +}
>> +
>> +/* Called from syscall */
>> +static int queue_stack_map_alloc_check(union bpf_attr *attr)
>> +{
>> +	/* check sanity of attributes */
>> +	if (attr->max_entries == 0 || attr->key_size != 0 ||
>> +	    (attr->value_size != 1 && attr->value_size != 2 &&
>> +	    attr->value_size != 4 && attr->value_size != 8) ||
>> +	    attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK)
>> +		return -EINVAL;
>> +
>> +	if (attr->value_size > KMALLOC_MAX_SIZE)
>> +		/* if value_size is bigger, the user space won't be able to
>> +		 * access the elements.
>> +		 */
>> +		return -E2BIG;
>> +
>> +	return 0;
>> +}
>> +
>> +static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
>> +{
>> +	int ret, numa_node = bpf_map_attr_numa_node(attr);
>> +	u32 max_entries, value_size, index_mask;
>> +	u64 queue_size, cost, mask64;
>> +	struct bpf_queue_stack *qs;
>> +
>> +	max_entries = attr->max_entries;
>> +	value_size = attr->value_size;
>> +
>> +	/* From arraymap.c:
>> +	 * On 32 bit archs roundup_pow_of_two() with max_entries that has
>> +	 * upper most bit set in u32 space is undefined behavior due to
>> +	 * resulting 1U << 32, so do it manually here in u64 space.
>> +	 */
>> +	mask64 = fls_long(max_entries - 1);
>> +	mask64 = 1ULL << mask64;
>> +	mask64 -= 1;
>> +
>> +	index_mask = mask64;
>> +
>> +	/* Round up queue size to nearest power of 2 */
>> +	max_entries = index_mask + 1;
>> +	/* Check for overflows. */
>> +	if (max_entries < attr->max_entries)
>> +		return ERR_PTR(-E2BIG);
>> +
>> +	queue_size = sizeof(*qs) + (u64) value_size * max_entries;
>> +
>> +	cost = queue_size;
>> +	if (cost >= U32_MAX - PAGE_SIZE)
>> +		return ERR_PTR(-E2BIG);
>> +
>> +	cost = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
>> +
>> +	ret = bpf_map_precharge_memlock(cost);
>> +	if (ret < 0)
>> +		return ERR_PTR(ret);
>> +
>> +	qs = bpf_map_area_alloc(queue_size, numa_node);
>> +	if (!qs)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	memset(qs, 0, sizeof(*qs));
>> +
>> +	bpf_map_init_from_attr(&qs->map, attr);
>> +
>> +	qs->map.pages = cost;
>> +	qs->index_mask = index_mask;
>> +
>> +	raw_spin_lock_init(&qs->lock);
>> +
>> +	return &qs->map;
>> +}
>> +
>> +/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
>> +static void queue_stack_map_free(struct bpf_map *map)
>> +{
>> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
>> +
>> +	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
>> +	 * so the programs (can be more than one that used this map) were
>> +	 * disconnected from events. Wait for outstanding critical sections in
>> +	 * these programs to complete
>> +	 */
>> +	synchronize_rcu();
>> +
>> +	bpf_map_area_free(qs);
>> +}
>> +
>> +static void *__queue_map_lookup(struct bpf_map *map, bool delete)
>> +{
>> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
>> +	unsigned long flags;
>> +	void *ptr = NULL;
>> +
>> +	raw_spin_lock_irqsave(&qs->lock, flags);
>> +
>> +	if (queue_stack_map_is_empty(qs))
>> +		goto out;
>> +
>> +	ptr = &qs->elements[qs->tail * qs->map.value_size];
>> +
>> +	if (delete) {
>> +		qs->tail = (qs->tail + 1) & qs->index_mask;
>> +		qs->count--;
>> +	}
>> +
>> +out:
>> +	raw_spin_unlock_irqrestore(&qs->lock, flags);
>> +	return ptr;
>> +}
>> +
>> +
>> +static void *__stack_map_lookup(struct bpf_map *map, bool delete)
>> +{
>> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
>> +	unsigned long flags;
>> +	void *ptr = NULL;
>> +	u32 index;
>> +
>> +	raw_spin_lock_irqsave(&qs->lock, flags);
>> +
>> +	if (queue_stack_map_is_empty(qs))
>> +		goto out;
>> +
>> +	index = (qs->head - 1) & qs->index_mask;
>> +	ptr = &qs->elements[index * qs->map.value_size];
>> +
>> +	if (delete) {
>> +		qs->head = (qs->head - 1) & qs->index_mask;
>> +		qs->count--;
>> +	}
>> +
>> +out:
>> +	raw_spin_unlock_irqrestore(&qs->lock, flags);
> here it's racing with another cpu accessing the same map.
> rcu doesn't protect from this particular qs->element being reused
> by another cpu.
>
>> +	return ptr;
> say, stack is full. one cpu is doing pop(), getting this ptr,
> while another cpu doing push() and overwriting this memory.
Yes, are right.  The lock must be held until the copy is done.  The 
solution to it depends on what ops we agree to create.
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static void *queue_map_lookup_elem(struct bpf_map *map, void *key)
>> +{
>> +	return __queue_map_lookup(map, false);
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
>> +{
>> +	return __stack_map_lookup(map, false);
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static void *queue_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
>> +{
>> +	return __queue_map_lookup(map, true);
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static void *stack_map_lookup_and_delete_elem(struct bpf_map *map, void *key)
>> +{
>> +	return __stack_map_lookup(map, true);
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static int queue_stack_map_update_elem(struct bpf_map *map, void *key,
>> +				       void *value, u64 flags)
>> +{
>> +	struct bpf_queue_stack *qs = bpf_queue_stack(map);
>> +	unsigned long irq_flags;
>> +	int err = 0;
>> +	void *dst;
>> +
>> +	/* BPF_EXIST is used to force making room for a new element in case the
>> +	 * map is full
>> +	 */
>> +	bool replace = (flags & BPF_EXIST);
>> +
>> +	/* Check supported flags for queue and stqueue_map_is_preallocack maps */
>> +	if (flags & BPF_NOEXIST || flags > BPF_EXIST)
>> +		return -EINVAL;
>> +
>> +	raw_spin_lock_irqsave(&qs->lock, irq_flags);
>> +
>> +	if (queue_stack_map_is_full(qs)) {
>> +		if (!replace) {
>> +			err = -E2BIG;
>> +			goto out;
>> +		}
>> +		/* advance tail pointer to overwrite oldest element */
>> +		qs->tail = (qs->tail + 1) & qs->index_mask;
>> +		qs->count--;
>> +	}
>> +
>> +	dst = &qs->elements[qs->head * qs->map.value_size];
>> +
>> +	switch (qs->map.value_size) {
>> +	case 1:
>> +		*(u8 *) dst = *(u8 *) value;
>> +		break;
>> +	case 2:
>> +		*(u16 *) dst = *(u16 *) value;
>> +		break;
>> +	case 4:
>> +		*(u32 *) dst = *(u32 *) value;
>> +		break;
>> +	case 8:
>> +		*(u64 *) dst = *(u64 *) value;
>> +		break;
>> +	}
>> +
>> +	qs->head = (qs->head + 1) & qs->index_mask;
>> +	qs->count++;
>> +
>> +out:
>> +	raw_spin_unlock_irqrestore(&qs->lock, irq_flags);
>> +	return err;
>> +}
>> +
>> +/* Called from syscall or from eBPF program */
>> +static int queue_stack_map_delete_elem(struct bpf_map *map, void *key)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +/* Called from syscall */
>> +static int queue_stack_map_get_next_key(struct bpf_map *map, void *key,
>> +					void *next_key)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +const struct bpf_map_ops queue_map_ops = {
>> +	.map_alloc_check = queue_stack_map_alloc_check,
>> +	.map_alloc = queue_stack_map_alloc,
>> +	.map_free = queue_stack_map_free,
>> +	.map_lookup_elem = queue_map_lookup_elem,
>> +	.map_lookup_and_delete_elem = queue_map_lookup_and_delete_elem,
>> +	.map_update_elem = queue_stack_map_update_elem,
>> +	.map_delete_elem = queue_stack_map_delete_elem,
>> +	.map_get_next_key = queue_stack_map_get_next_key,
>> +};
>> +
>> +const struct bpf_map_ops stack_map_ops = {
>> +	.map_alloc_check = queue_stack_map_alloc_check,
>> +	.map_alloc = queue_stack_map_alloc,
>> +	.map_free = queue_stack_map_free,
>> +	.map_lookup_elem = stack_map_lookup_elem,
>> +	.map_lookup_and_delete_elem = stack_map_lookup_and_delete_elem,
>> +	.map_update_elem = queue_stack_map_update_elem,
>> +	.map_delete_elem = queue_stack_map_delete_elem,
>> +	.map_get_next_key = queue_stack_map_get_next_key,
>> +};
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f4ff0c569e54..b9e005188f0e 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -2369,7 +2369,10 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>>   	if (func_id != BPF_FUNC_tail_call &&
>>   	    func_id != BPF_FUNC_map_lookup_elem &&
>>   	    func_id != BPF_FUNC_map_update_elem &&
>> -	    func_id != BPF_FUNC_map_delete_elem)
>> +	    func_id != BPF_FUNC_map_delete_elem &&
>> +	    func_id != BPF_FUNC_map_push_elem &&
>> +	    func_id != BPF_FUNC_map_pop_elem &&
>> +	    func_id != BPF_FUNC_map_peek_elem)
>>   		return 0;
>>   
>>   	if (meta->map_ptr == NULL) {
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index feb578506009..c7b73376c23a 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -4839,6 +4839,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>>   		return &bpf_map_update_elem_proto;
>>   	case BPF_FUNC_map_delete_elem:
>>   		return &bpf_map_delete_elem_proto;
>> +	case BPF_FUNC_map_push_elem:
>> +		return &bpf_map_push_elem_proto;
>> +	case BPF_FUNC_map_pop_elem:
>> +		return &bpf_map_pop_elem_proto;
>> +	case BPF_FUNC_map_peek_elem:
>> +		return &bpf_map_peek_elem_proto;
>>   	case BPF_FUNC_get_prandom_u32:
>>   		return &bpf_get_prandom_u32_proto;
>>   	case BPF_FUNC_get_smp_processor_id:
>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH bpf-next v3 7/7] selftests/bpf: add test cases for queue and stack maps
  2018-09-18 23:32   ` Alexei Starovoitov
@ 2018-09-19  4:36     ` Mauricio Vasquez
  0 siblings, 0 replies; 17+ messages in thread
From: Mauricio Vasquez @ 2018-09-19  4:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Yonghong Song



On 09/18/2018 06:32 PM, Alexei Starovoitov wrote:
> On Tue, Sep 18, 2018 at 06:53:07AM +0200, Mauricio Vasquez B wrote:
>> Two types of tests are done:
>> - test_maps: only userspace api.
>> - test_progs: userspace api and ebpf helpers.
>>
>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
>> ---
>>   kernel/bpf/helpers.c                               |    2
>>   tools/lib/bpf/bpf.c                                |   12 ++
>>   tools/lib/bpf/bpf.h                                |    1
>>   tools/testing/selftests/bpf/Makefile               |    5 +
>>   tools/testing/selftests/bpf/bpf_helpers.h          |    7 +
>>   tools/testing/selftests/bpf/test_maps.c            |  130 ++++++++++++++++++++
>>   tools/testing/selftests/bpf/test_progs.c           |   99 +++++++++++++++
>>   tools/testing/selftests/bpf/test_queue_map.c       |    4 +
>>   tools/testing/selftests/bpf/test_queue_stack_map.h |   59 +++++++++
>>   tools/testing/selftests/bpf/test_stack_map.c       |    4 +
>>   10 files changed, 321 insertions(+), 2 deletions(-)
>>   create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
>>   create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
>>   create mode 100644 tools/testing/selftests/bpf/test_stack_map.c
>>
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 5f364e6acaf1..1293cd5240e3 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -76,7 +76,7 @@ const struct bpf_func_proto bpf_map_delete_elem_proto = {
>>   	.arg2_type	= ARG_PTR_TO_MAP_KEY,
>>   };
>>   
>> -BPF_CALL_4(bpf_map_push_elem, struct bpf_map *, map, void *, value, u32 size,
>> +BPF_CALL_4(bpf_map_push_elem, struct bpf_map *, map, void *, value, u32, size,
>>   	   u64, flags)
> part of earlier patch?

Yes, this is wrong on patch 4 and somehow I managed to include the fix 
into this one. Will move to right patch in next version.

>
>>   {
>>   	if (map->value_size != size)
>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>> index 60aa4ca8b2c5..7056b2eb554d 100644
>> --- a/tools/lib/bpf/bpf.c
>> +++ b/tools/lib/bpf/bpf.c
>> @@ -286,6 +286,18 @@ int bpf_map_lookup_elem(int fd, const void *key, void *value)
>>   	return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
>>   }
>>   
>> +int bpf_map_lookup_and_delete_elem(int fd, const void *key, const void *value)
>> +{
>> +	union bpf_attr attr;
>> +
>> +	bzero(&attr, sizeof(attr));
>> +	attr.map_fd = fd;
>> +	attr.key = ptr_to_u64(key);
>> +	attr.value = ptr_to_u64(value);
>> +
>> +	return sys_bpf(BPF_MAP_LOOKUP_AND_DELETE_ELEM, &attr, sizeof(attr));
>> +}
>> +
>>   int bpf_map_delete_elem(int fd, const void *key)
>>   {
>>   	union bpf_attr attr;
>> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
>> index 6f38164b2618..6134ed9517d3 100644
>> --- a/tools/lib/bpf/bpf.h
>> +++ b/tools/lib/bpf/bpf.h
>> @@ -86,6 +86,7 @@ int bpf_map_update_elem(int fd, const void *key, const void *value,
>>   			__u64 flags);
>>   
>>   int bpf_map_lookup_elem(int fd, const void *key, void *value);
>> +int bpf_map_lookup_and_delete_elem(int fd, const void *key, const void *value);
>>   int bpf_map_delete_elem(int fd, const void *key);
>>   int bpf_map_get_next_key(int fd, const void *key, void *next_key);
>>   int bpf_obj_pin(int fd, const char *pathname);
>> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
>> index fff7fb1285fc..ad8a2b8fb738 100644
>> --- a/tools/testing/selftests/bpf/Makefile
>> +++ b/tools/testing/selftests/bpf/Makefile
>> @@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
>>   	test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
>>   	test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
>>   	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
>> -	test_skb_cgroup_id_kern.o
>> +	test_skb_cgroup_id_kern.o test_queue_map.o test_stack_map.o
>>   
>>   # Order correspond to 'make run_tests' order
>>   TEST_PROGS := test_kmod.sh \
>> @@ -110,6 +110,9 @@ CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \
>>   $(OUTPUT)/test_l4lb_noinline.o: CLANG_FLAGS += -fno-inline
>>   $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline
>>   
>> +$(OUTPUT)/test_queue_map.o: test_queue_stack_map.h
>> +$(OUTPUT)/test_stack_map.o: test_queue_stack_map.h
>> +
>>   BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
>>   BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
>>   BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 'usage.*llvm')
>> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
>> index e4be7730222d..bdbe8f84023e 100644
>> --- a/tools/testing/selftests/bpf/bpf_helpers.h
>> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
>> @@ -16,6 +16,13 @@ static int (*bpf_map_update_elem)(void *map, void *key, void *value,
>>   	(void *) BPF_FUNC_map_update_elem;
>>   static int (*bpf_map_delete_elem)(void *map, void *key) =
>>   	(void *) BPF_FUNC_map_delete_elem;
>> +static int (*bpf_map_push_elem)(void *map, const void *value, int len,
>> +				unsigned long long flags) =
>> +	(void *) BPF_FUNC_map_push_elem;
>> +static int (*bpf_map_pop_elem)(void *map, void *value, int len) =
>> +	(void *) BPF_FUNC_map_pop_elem;
>> +static int (*bpf_map_peek_elem)(void *map, void *value, int len) =
>> +	(void *) BPF_FUNC_map_peek_elem;
>>   static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
>>   	(void *) BPF_FUNC_probe_read;
>>   static unsigned long long (*bpf_ktime_get_ns)(void) =
>> diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
>> index 6f54f84144a0..abb21ea731fa 100644
>> --- a/tools/testing/selftests/bpf/test_maps.c
>> +++ b/tools/testing/selftests/bpf/test_maps.c
>> @@ -15,6 +15,7 @@
>>   #include <string.h>
>>   #include <assert.h>
>>   #include <stdlib.h>
>> +#include <time.h>
>>   
>>   #include <sys/wait.h>
>>   #include <sys/socket.h>
>> @@ -471,6 +472,130 @@ static void test_devmap(int task, void *data)
>>   	close(fd);
>>   }
>>   
>> +static void test_queuemap(int task, void *data)
>> +{
>> +	const int MAP_SIZE = 32;
>> +	__u32 vals[MAP_SIZE + MAP_SIZE/2], val;
>> +	int fd, i;
>> +
>> +	/* Fill test values to be used */
>> +	for (i = 0; i < MAP_SIZE + MAP_SIZE/2; i++)
>> +		vals[i] = rand();
>> +
>> +	/* Invalid key size */
>> +	fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 4, sizeof(val), MAP_SIZE,
>> +			    map_flags);
>> +	assert(fd < 0 && errno == EINVAL);
>> +
>> +	/* Invalid value size */
>> +	fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, 3, MAP_SIZE,map_flags);
>> +	assert(fd < 0 && errno == EINVAL);
>> +
>> +	fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, sizeof(val), MAP_SIZE,
>> +			    map_flags);
>> +	/* Queue map does not support BPF_F_NO_PREALLOC */
>> +	if (map_flags & BPF_F_NO_PREALLOC) {
>> +		assert(fd < 0 && errno == EINVAL);
>> +		return;
>> +	}
>> +	if (fd < 0) {
>> +		printf("Failed to create queuemap '%s'!\n", strerror(errno));
>> +		exit(1);
>> +	}
>> +
>> +	/* Push MAP_SIZE elements */
>> +	for (i = 0; i < MAP_SIZE; i++)
>> +		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
>> +
>> +	/* Check that element cannot be pushed due to max_entries limit */
>> +	assert(bpf_map_update_elem(fd, NULL, &val, 0) == -1 &&
>> +	       errno == E2BIG);
>> +
>> +	/* Peek element */
>> +	assert(bpf_map_lookup_elem(fd, NULL, &val) == 0 && val == vals[0]);
>> +
>> +	/* Replace half elements */
>> +	for (i = MAP_SIZE; i < MAP_SIZE + MAP_SIZE/2; i++)
>> +		assert(bpf_map_update_elem(fd, NULL, &vals[i], BPF_EXIST) == 0);
>> +
>> +	/* Pop all elements */
>> +	for (i = MAP_SIZE/2; i < MAP_SIZE + MAP_SIZE/2; i++)
>> +		assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
>> +		       val == vals[i]);
>> +
>> +	/* Check that there are not elements left */
>> +	assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == -1 &&
>> +	       errno == ENOENT);
>> +
>> +	/* Check that non supported functions set errno to EINVAL */
>> +	assert(bpf_map_delete_elem(fd, NULL) == -1 && errno == EINVAL);
>> +	assert(bpf_map_get_next_key(fd, NULL, NULL) == -1 && errno == EINVAL);
>> +
>> +	close(fd);
>> +}
>> +
>> +static void test_stackmap(int task, void *data)
>> +{
>> +	const int MAP_SIZE = 32;
>> +	__u32 vals[MAP_SIZE + MAP_SIZE/2], val;
>> +	int fd, i;
>> +
>> +	/* Fill test values to be used */
>> +	for (i = 0; i < MAP_SIZE + MAP_SIZE/2; i++)
>> +		vals[i] = rand();
>> +
>> +	/* Invalid key size */
>> +	fd = bpf_create_map(BPF_MAP_TYPE_STACK, 4, sizeof(val), MAP_SIZE,
>> +			    map_flags);
>> +	assert(fd < 0 && errno == EINVAL);
>> +
>> +	/* Invalid value size */
>> +	fd = bpf_create_map(BPF_MAP_TYPE_STACK, 0, 3, MAP_SIZE,map_flags);
>> +	assert(fd < 0 && errno == EINVAL);
>> +
>> +	fd = bpf_create_map(BPF_MAP_TYPE_STACK, 0, sizeof(val), MAP_SIZE,
>> +			    map_flags);
>> +	/* Stack map does not support BPF_F_NO_PREALLOC */
>> +	if (map_flags & BPF_F_NO_PREALLOC) {
>> +		assert(fd < 0 && errno == EINVAL);
>> +		return;
>> +	}
>> +	if (fd < 0) {
>> +		printf("Failed to create stackmap '%s'!\n", strerror(errno));
>> +		exit(1);
>> +	}
>> +
>> +	/* Push MAP_SIZE elements */
>> +	for (i = 0; i < MAP_SIZE; i++)
>> +		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
>> +
>> +	/* Check that element cannot be pushed due to max_entries limit */
>> +	assert(bpf_map_update_elem(fd, NULL, &val, 0) == -1 &&
>> +	       errno == E2BIG);
>> +
>> +	/* Peek element */
>> +	assert(bpf_map_lookup_elem(fd, NULL, &val) == 0 && val == vals[i - 1]);
>> +
>> +	/* Replace half elements */
>> +	for (i = MAP_SIZE; i < MAP_SIZE + MAP_SIZE/2; i++)
>> +		assert(bpf_map_update_elem(fd, NULL, &vals[i], BPF_EXIST) == 0);
>> +
>> +	/* Pop all elements */
>> +	for (i = MAP_SIZE + MAP_SIZE/2 - 1; i >= MAP_SIZE/2; i--)
>> +		assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
>> +		       val == vals[i]);
>> +
>> +	/* Check that there are not elements left */
>> +	assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == -1 &&
>> +	       errno == ENOENT);
>> +
>> +	/* Check that non supported functions set errno to EINVAL */
>> +	assert(bpf_map_delete_elem(fd, NULL) == -1 && errno == EINVAL);
>> +	assert(bpf_map_get_next_key(fd, NULL, NULL) == -1 && errno == EINVAL);
>> +
>> +	close(fd);
>> +}
>> +
>>   #include <sys/socket.h>
>>   #include <sys/ioctl.h>
>>   #include <arpa/inet.h>
>> @@ -1430,10 +1555,15 @@ static void run_all_tests(void)
>>   	test_map_wronly();
>>   
>>   	test_reuseport_array();
>> +
>> +	test_queuemap(0, NULL);
>> +	test_stackmap(0, NULL);
>>   }
>>   
>>   int main(void)
>>   {
>> +	srand(time(NULL));
>> +
>>   	map_flags = 0;
>>   	run_all_tests();
>>   
>> diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
>> index 63a671803ed6..a69b9e1e668d 100644
>> --- a/tools/testing/selftests/bpf/test_progs.c
>> +++ b/tools/testing/selftests/bpf/test_progs.c
>> @@ -1698,8 +1698,105 @@ static void test_task_fd_query_tp(void)
>>   				   "sys_enter_read");
>>   }
>>   
>> +enum {
>> +	QUEUE,
>> +	STACK,
>> +};
>> +
>> +static void test_queue_stack_map(int type)
>> +{
>> +	const int MAP_SIZE = 32;
>> +	__u32 vals[MAP_SIZE], duration, retval, size, val;
>> +	int i, err, prog_fd, map_in_fd, map_out_fd;
>> +	char file[32], buf[128];
>> +	struct bpf_object *obj;
>> +	struct iphdr *iph = (void *)buf + sizeof(struct ethhdr);
>> +
>> +	/* Fill test values to be used */
>> +	for (i = 0; i < MAP_SIZE; i++)
>> +		vals[i] = rand();
>> +
>> +	if (type == QUEUE)
>> +		strncpy(file, "./test_queue_map.o", sizeof(file));
>> +	else if (type == STACK)
>> +		strncpy(file, "./test_stack_map.o", sizeof(file));
>> +	else
>> +		return;
>> +
>> +	err = bpf_prog_load(file, BPF_PROG_TYPE_SCHED_CLS, &obj, &prog_fd);
>> +	if (err) {
>> +		error_cnt++;
>> +		return;
>> +	}
>> +
>> +	map_in_fd = bpf_find_map(__func__, obj, "map_in");
>> +	if (map_in_fd < 0)
>> +		goto out;
>> +
>> +	map_out_fd = bpf_find_map(__func__, obj, "map_out");
>> +	if (map_out_fd < 0)
>> +		goto out;
>> +
>> +	/* Push 32 elements to the input map */
>> +	for (i = 0; i < MAP_SIZE; i++) {
>> +		err = bpf_map_update_elem(map_in_fd, NULL, &vals[i], 0);
>> +		if (err) {
>> +			error_cnt++;
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	/* The eBPF program pushes iph.saddr in the output map,
>> +	 * pops the input map and saves this value in iph.daddr
>> +	 */
>> +	for (i = 0; i < MAP_SIZE; i++) {
>> +		if (type == QUEUE) {
>> +			val = vals[i];
>> +			pkt_v4.iph.saddr = vals[i] * 5;
>> +		} else if (type == STACK) {
>> +			val = vals[MAP_SIZE - 1 - i];
>> +			pkt_v4.iph.saddr = vals[MAP_SIZE - 1 - i] * 5;
>> +		}
>> +
>> +		err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
>> +					buf, &size, &retval, &duration);
>> +		if (err || retval || size != sizeof(pkt_v4) ||
>> +		    iph->daddr != val)
>> +			break;
>> +	}
>> +
>> +	CHECK(err || retval || size != sizeof(pkt_v4) || iph->daddr != val,
>> +	      "bpf_map_pop_elem",
>> +	      "err %d errno %d retval %d size %d iph->daddr %u\n",
>> +	      err, errno, retval, size, iph->daddr);
>> +
>> +	/* Queue is empty, program should return TC_ACT_SHOT */
>> +	err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
>> +				buf, &size, &retval, &duration);
>> +	CHECK(err || retval != 2 /* TC_ACT_SHOT */|| size != sizeof(pkt_v4),
>> +	      "check-queue-stack-map-empty",
>> +	      "err %d errno %d retval %d size %d\n",
>> +	      err, errno, retval, size);
>> +
>> +	/* Check that the program pushed elements correctly */
>> +	for (i = 0; i < MAP_SIZE; i++) {
>> +		err = bpf_map_lookup_and_delete_elem(map_out_fd, NULL, &val);
>> +		if (err || val != vals[i] * 5)
>> +			break;
>> +	}
>> +
>> +	CHECK(i != MAP_SIZE && (err || val != vals[i] * 5),
>> +	      "bpf_map_push_elem", "err %d value %u\n", err, val);
>> +
>> +out:
>> +	pkt_v4.iph.saddr = 0;
>> +	bpf_object__close(obj);
>> +}
>> +
>>   int main(void)
>>   {
>> +	srand(time(NULL));
>> +
>>   	jit_enabled = is_jit_enabled();
>>   
>>   	test_pkt_access();
>> @@ -1719,6 +1816,8 @@ int main(void)
>>   	test_get_stack_raw_tp();
>>   	test_task_fd_query_rawtp();
>>   	test_task_fd_query_tp();
>> +	test_queue_stack_map(QUEUE);
>> +	test_queue_stack_map(STACK);
>>   
>>   	printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
>>   	return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
>> diff --git a/tools/testing/selftests/bpf/test_queue_map.c b/tools/testing/selftests/bpf/test_queue_map.c
>> new file mode 100644
>> index 000000000000..3fdb3b9cb038
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/test_queue_map.c
>> @@ -0,0 +1,4 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright (c) 2017 Politecnico di Torino
>> +#define MAP_TYPE BPF_MAP_TYPE_QUEUE
>> +#include "test_queue_stack_map.h"
>> diff --git a/tools/testing/selftests/bpf/test_queue_stack_map.h b/tools/testing/selftests/bpf/test_queue_stack_map.h
>> new file mode 100644
>> index 000000000000..23a5b5d60069
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/test_queue_stack_map.h
>> @@ -0,0 +1,59 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +// Copyright (c) 2017 Politecnico di Torino
>> +#include <stddef.h>
>> +#include <string.h>
>> +#include <linux/bpf.h>
>> +#include <linux/if_ether.h>
>> +#include <linux/ip.h>
>> +#include <linux/pkt_cls.h>
>> +#include "bpf_helpers.h"
>> +
>> +int _version SEC("version") = 1;
>> +
>> +struct bpf_map_def __attribute__ ((section("maps"), used)) map_in = {
>> +	.type = MAP_TYPE,
>> +	.key_size = 0,
>> +	.value_size = sizeof(__u32),
>> +	.max_entries = 32,
>> +	.map_flags = 0,
>> +};
>> +
>> +struct bpf_map_def __attribute__ ((section("maps"), used)) map_out = {
>> +	.type = MAP_TYPE,
>> +	.key_size = 0,
>> +	.value_size = sizeof(__u32),
>> +	.max_entries = 32,
>> +	.map_flags = 0,
>> +};
>> +
>> +SEC("test")
>> +int _test(struct __sk_buff *skb)
>> +{
>> +	void *data_end = (void *)(long)skb->data_end;
>> +	void *data = (void *)(long)skb->data;
>> +	struct ethhdr *eth = (struct ethhdr *)(data);
>> +	__u32 value;
>> +	int err;
>> +
>> +	if (eth + 1 > data_end)
>> +		return TC_ACT_SHOT;
>> +
>> +	struct iphdr *iph = (struct iphdr *)(eth + 1);
>> +
>> +	if (iph + 1 > data_end)
>> +		return TC_ACT_SHOT;
>> +
>> +	err = bpf_map_pop_elem(&map_in, &value, sizeof(value));
>> +	if (err)
>> +		return TC_ACT_SHOT;
>> +
>> +	iph->daddr = value;
>> +
>> +	err = bpf_map_push_elem(&map_out, &iph->saddr, sizeof(iph->saddr), 0);
>> +	if (err)
>> +		return TC_ACT_SHOT;
> is it possible to add a test for 'peek' helper in reliable way?
> it seems it will alwasy be racy. may be we shouldn't implement peek at all?
>
With the per-value implementation it should not be racy.   Why do you 
say it is seems to be racy?
A very silly test could be to always do peek before pop and check both 
values are the same.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH bpf-next v3 4/7] bpf: add bpf queue and stack maps
       [not found]       ` <12a0d7d2-9590-b44f-803a-a00eefe611c1@polito.it>
@ 2018-10-02  0:26         ` Alexei Starovoitov
  2018-10-03 17:01           ` Mauricio Vasquez
  0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2018-10-02  0:26 UTC (permalink / raw)
  To: Mauricio Vasquez; +Cc: daniel, netdev

On Mon, Oct 01, 2018 at 08:11:43AM -0500, Mauricio Vasquez wrote:
> > > > +BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *,
> > > > value, u32, size)
> > > > +{
> > > > +    void *ptr;
> > > > +
> > > > +    if (map->value_size != size)
> > > > +        return -EINVAL;
> > > > +
> > > > +    ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
> > > > +    if (!ptr)
> > > > +        return -ENOENT;
> > > > +
> > > > +    switch (size) {
> > > > +    case 1:
> > > > +        *(u8 *) value = *(u8 *) ptr;
> > > > +        break;
> > > > +    case 2:
> > > > +        *(u16 *) value = *(u16 *) ptr;
> > > > +        break;
> > > > +    case 4:
> > > > +        *(u32 *) value = *(u32 *) ptr;
> > > > +        break;
> > > > +    case 8:
> > > > +        *(u64 *) value = *(u64 *) ptr;
> > > > +        break;
> > > this is inefficient. can we pass value ptr into ops and let it
> > > populate it?
> > 
> > I don't think so, doing that implies that look_and_delete will be a
> > per-value op, while other ops in maps are per-reference.
> > For instance, how to change it in the case of peek helper that is using
> > the lookup operation?, we cannot change the signature of the lookup
> > operation.
> > 
> > This is something that worries me a little bit, we are creating new
> > per-value helpers based on already existing per-reference operations,
> > this is not probably the cleanest way.  Here we are at the beginning of
> > the discussion once again, how should we map helpers and syscalls to
> > ops.
> > 
> > What about creating pop/peek/push ops, mapping helpers one to one and
> > adding some logic into syscall.c to call the correct operation in case
> > the map is stack/queue?
> > Syscall mapping would be:
> > bpf_map_lookup_elem() -> peek
> > bpf_map_lookup_and_delete_elem() -> pop
> > bpf_map_update_elem() -> push
> > 
> > Does it make sense?
> 
> Hello Alexei,
> 
> Do you have any feedback on this specific part?

Indeed. It seems push/pop ops will be cleaner.
I still think that peek() is useless due to races.
So BPF_MAP_UPDATE_ELEM syscall cmd will map to 'push' ops
and new BPF_MAP_LOOKUP_AND_DELETE_ELEM will map to 'pop' ops.
right?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH bpf-next v3 4/7] bpf: add bpf queue and stack maps
  2018-10-02  0:26         ` Alexei Starovoitov
@ 2018-10-03 17:01           ` Mauricio Vasquez
  2018-10-03 21:10             ` Alexei Starovoitov
  0 siblings, 1 reply; 17+ messages in thread
From: Mauricio Vasquez @ 2018-10-03 17:01 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: daniel, netdev



On 10/01/2018 07:26 PM, Alexei Starovoitov wrote:
> On Mon, Oct 01, 2018 at 08:11:43AM -0500, Mauricio Vasquez wrote:
>>>>> +BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *,
>>>>> value, u32, size)
>>>>> +{
>>>>> +    void *ptr;
>>>>> +
>>>>> +    if (map->value_size != size)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
>>>>> +    if (!ptr)
>>>>> +        return -ENOENT;
>>>>> +
>>>>> +    switch (size) {
>>>>> +    case 1:
>>>>> +        *(u8 *) value = *(u8 *) ptr;
>>>>> +        break;
>>>>> +    case 2:
>>>>> +        *(u16 *) value = *(u16 *) ptr;
>>>>> +        break;
>>>>> +    case 4:
>>>>> +        *(u32 *) value = *(u32 *) ptr;
>>>>> +        break;
>>>>> +    case 8:
>>>>> +        *(u64 *) value = *(u64 *) ptr;
>>>>> +        break;
>>>> this is inefficient. can we pass value ptr into ops and let it
>>>> populate it?
>>> I don't think so, doing that implies that look_and_delete will be a
>>> per-value op, while other ops in maps are per-reference.
>>> For instance, how to change it in the case of peek helper that is using
>>> the lookup operation?, we cannot change the signature of the lookup
>>> operation.
>>>
>>> This is something that worries me a little bit, we are creating new
>>> per-value helpers based on already existing per-reference operations,
>>> this is not probably the cleanest way.  Here we are at the beginning of
>>> the discussion once again, how should we map helpers and syscalls to
>>> ops.
>>>
>>> What about creating pop/peek/push ops, mapping helpers one to one and
>>> adding some logic into syscall.c to call the correct operation in case
>>> the map is stack/queue?
>>> Syscall mapping would be:
>>> bpf_map_lookup_elem() -> peek
>>> bpf_map_lookup_and_delete_elem() -> pop
>>> bpf_map_update_elem() -> push
>>>
>>> Does it make sense?
>> Hello Alexei,
>>
>> Do you have any feedback on this specific part?
> Indeed. It seems push/pop ops will be cleaner.
> I still think that peek() is useless due to races.
> So BPF_MAP_UPDATE_ELEM syscall cmd will map to 'push' ops
> and new BPF_MAP_LOOKUP_AND_DELETE_ELEM will map to 'pop' ops.
> right?
>
>
That's right.

While updating the push api some came to me, do we have any specific 
reason to support only 1, 2, 4, 8 bytes? I think we could do it general 
enough to support any number of bytes, if the user is worried about the 
cost of memcpys he could use a map of 8 bytes pointers as you mentioned 
some time ago.
 From an API point of view, pop/peek helpers already expect a void 
*value (int bpf_map_[pop, peek]_elem(map, void *value)), the only change 
would also to use a pointer in the push instead of a u64.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH bpf-next v3 4/7] bpf: add bpf queue and stack maps
  2018-10-03 17:01           ` Mauricio Vasquez
@ 2018-10-03 21:10             ` Alexei Starovoitov
  0 siblings, 0 replies; 17+ messages in thread
From: Alexei Starovoitov @ 2018-10-03 21:10 UTC (permalink / raw)
  To: Mauricio Vasquez; +Cc: daniel, netdev

On Wed, Oct 03, 2018 at 12:01:37PM -0500, Mauricio Vasquez wrote:
> 
> 
> On 10/01/2018 07:26 PM, Alexei Starovoitov wrote:
> > On Mon, Oct 01, 2018 at 08:11:43AM -0500, Mauricio Vasquez wrote:
> > > > > > +BPF_CALL_3(bpf_map_pop_elem, struct bpf_map *, map, void *,
> > > > > > value, u32, size)
> > > > > > +{
> > > > > > +    void *ptr;
> > > > > > +
> > > > > > +    if (map->value_size != size)
> > > > > > +        return -EINVAL;
> > > > > > +
> > > > > > +    ptr = map->ops->map_lookup_and_delete_elem(map, NULL);
> > > > > > +    if (!ptr)
> > > > > > +        return -ENOENT;
> > > > > > +
> > > > > > +    switch (size) {
> > > > > > +    case 1:
> > > > > > +        *(u8 *) value = *(u8 *) ptr;
> > > > > > +        break;
> > > > > > +    case 2:
> > > > > > +        *(u16 *) value = *(u16 *) ptr;
> > > > > > +        break;
> > > > > > +    case 4:
> > > > > > +        *(u32 *) value = *(u32 *) ptr;
> > > > > > +        break;
> > > > > > +    case 8:
> > > > > > +        *(u64 *) value = *(u64 *) ptr;
> > > > > > +        break;
> > > > > this is inefficient. can we pass value ptr into ops and let it
> > > > > populate it?
> > > > I don't think so, doing that implies that look_and_delete will be a
> > > > per-value op, while other ops in maps are per-reference.
> > > > For instance, how to change it in the case of peek helper that is using
> > > > the lookup operation?, we cannot change the signature of the lookup
> > > > operation.
> > > > 
> > > > This is something that worries me a little bit, we are creating new
> > > > per-value helpers based on already existing per-reference operations,
> > > > this is not probably the cleanest way.  Here we are at the beginning of
> > > > the discussion once again, how should we map helpers and syscalls to
> > > > ops.
> > > > 
> > > > What about creating pop/peek/push ops, mapping helpers one to one and
> > > > adding some logic into syscall.c to call the correct operation in case
> > > > the map is stack/queue?
> > > > Syscall mapping would be:
> > > > bpf_map_lookup_elem() -> peek
> > > > bpf_map_lookup_and_delete_elem() -> pop
> > > > bpf_map_update_elem() -> push
> > > > 
> > > > Does it make sense?
> > > Hello Alexei,
> > > 
> > > Do you have any feedback on this specific part?
> > Indeed. It seems push/pop ops will be cleaner.
> > I still think that peek() is useless due to races.
> > So BPF_MAP_UPDATE_ELEM syscall cmd will map to 'push' ops
> > and new BPF_MAP_LOOKUP_AND_DELETE_ELEM will map to 'pop' ops.
> > right?
> > 
> > 
> That's right.
> 
> While updating the push api some came to me, do we have any specific reason
> to support only 1, 2, 4, 8 bytes? I think we could do it general enough to
> support any number of bytes, if the user is worried about the cost of
> memcpys he could use a map of 8 bytes pointers as you mentioned some time
> ago.
> From an API point of view, pop/peek helpers already expect a void *value
> (int bpf_map_[pop, peek]_elem(map, void *value)), the only change would also
> to use a pointer in the push instead of a u64.

Indeed. Good idea. Full copy of the value for push and pop makes sense.
On the verifier side we probably need ARG_PTR_TO_UNINIT_MAP_VALUE
in addition to normal ARG_PTR_TO_MAP_VALUE for the case of bpf_map_pop().

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-10-04  4:01 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-18  4:52 [RFC PATCH bpf-next v3 0/7] Implement bpf queue/stack maps Mauricio Vasquez B
2018-09-18  4:52 ` [RFC PATCH bpf-next v3 1/7] bpf: rename stack trace map Mauricio Vasquez B
2018-09-18 23:05   ` Alexei Starovoitov
2018-09-19  2:53     ` Mauricio Vasquez
2018-09-18  4:52 ` [RFC PATCH bpf-next v3 2/7] bpf/syscall: allow key to be null in map functions Mauricio Vasquez B
2018-09-18  4:52 ` [RFC PATCH bpf-next v3 3/7] bpf: add lookup_and_delete map operation Mauricio Vasquez B
2018-09-18  4:52 ` [RFC PATCH bpf-next v3 4/7] bpf: add bpf queue and stack maps Mauricio Vasquez B
2018-09-18 23:27   ` Alexei Starovoitov
2018-09-19  4:28     ` Mauricio Vasquez
     [not found]       ` <12a0d7d2-9590-b44f-803a-a00eefe611c1@polito.it>
2018-10-02  0:26         ` Alexei Starovoitov
2018-10-03 17:01           ` Mauricio Vasquez
2018-10-03 21:10             ` Alexei Starovoitov
2018-09-18  4:52 ` [RFC PATCH bpf-next v3 5/7] bpf: restrict use of peek/push/pop Mauricio Vasquez B
2018-09-18  4:53 ` [RFC PATCH bpf-next v3 6/7] Sync uapi/bpf.h to tools/include Mauricio Vasquez B
2018-09-18  4:53 ` [RFC PATCH bpf-next v3 7/7] selftests/bpf: add test cases for queue and stack maps Mauricio Vasquez B
2018-09-18 23:32   ` Alexei Starovoitov
2018-09-19  4:36     ` Mauricio Vasquez

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).