linux-kernel.vger.kernel.org archive mirror
* [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
@ 2019-07-24 16:57 Brian Vazquez
  2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bp_map_copy_value helper functions Brian Vazquez
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:57 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

This introduces a new command to retrieve multiple entries from a bpf
map in a single call.

This new command can be executed from the existing BPF syscall as
follows:

err = bpf(BPF_MAP_DUMP, union bpf_attr *attr, u32 size)
using attr->dump.map_fd, attr->dump.prev_key, attr->dump.buf and
attr->dump.buf_len.
It returns zero or a negative error, and populates buf and buf_len on
success.
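
For illustration, a minimal sketch of issuing this call from userspace
could look as follows (it assumes headers carrying this series'
BPF_MAP_DUMP and bpf_attr.dump definitions; the helper name is made up):

#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/bpf.h>	/* needs this series' BPF_MAP_DUMP and attr.dump */

/* Hypothetical helper: issue one BPF_MAP_DUMP call.
 * prev_key == NULL starts the walk from the beginning of the map;
 * *buf_len is the buffer capacity on input, bytes copied on output.
 */
static int map_dump_once(int map_fd, const void *prev_key,
			 void *buf, uint32_t *buf_len)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.dump.map_fd = map_fd;
	attr.dump.prev_key = (uint64_t)(unsigned long)prev_key;
	attr.dump.buf = (uint64_t)(unsigned long)buf;
	attr.dump.buf_len = (uint64_t)(unsigned long)buf_len;

	return syscall(__NR_bpf, BPF_MAP_DUMP, &attr, sizeof(attr));
}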

This implementation wraps the existing bpf methods:
map_get_next_key and map_lookup_elem.

Note that this implementation can be extended later to do dump and
delete by extending map_lookup_and_delete_elem (currently it only works
for bpf queue/stack maps) and either using a new flag in map_dump or
adding a new command, map_dump_and_delete.

Results show that even with a 1-elem_size buffer, it runs ~40% faster
than the current implementation; improvements of ~85% are reported when
the buffer size is increased, although once the buffer size is around
5% of the total number of entries there's no big gain in increasing it
further.

Tested:
Tried different buffer sizes to handle the cases where the bulk is
bigger than, or smaller than, the number of existing elements; all runs
read a map of 100K entries. Below are the results (in ns) from the
different runs:

buf_len_1:       69038725 entry-by-entry: 112384424 improvement 38.569134
buf_len_2:       40897447 entry-by-entry: 111030546 improvement 63.165590
buf_len_230:     13652714 entry-by-entry: 111694058 improvement 87.776687
buf_len_5000:    13576271 entry-by-entry: 111101169 improvement 87.780263
buf_len_73000:   14694343 entry-by-entry: 111740162 improvement 86.849542
buf_len_100000:  13745969 entry-by-entry: 114151991 improvement 87.958187
buf_len_234567:  14329834 entry-by-entry: 114427589 improvement 87.476941

The series of patches is split as follows:

- First patch moves some map_lookup_elem logic into two functions to
deduplicate code: bpf_map_value_size and bpf_map_copy_value
- Second patch introduces the map_dump function
- Third patch syncs the tools/ linux headers
- Fourth patch adds libbpf support
- Last two patches add tests

RFC Changelog:

- remove wrong usage of attr.flags
- move map_fd to remove hole after it

v3:
- add explanation of the API in the commit message
- fix masked errors and return them to user
- copy last_key from return buf into prev_key if it was provided
- run perf test with kpti and retpoline mitigations

v2:
- use proper bpf-next tag

Brian Vazquez (6):
  bpf: add bpf_map_value_size and bp_map_copy_value helper functions
  bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  bpf: keep bpf.h in sync with tools/
  libbpf: support BPF_MAP_DUMP command
  selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap
  selftests/bpf: add test to measure performance of BPF_MAP_DUMP

 include/uapi/linux/bpf.h                |   9 +
 kernel/bpf/syscall.c                    | 251 ++++++++++++++++++------
 tools/include/uapi/linux/bpf.h          |   9 +
 tools/lib/bpf/bpf.c                     |  28 +++
 tools/lib/bpf/bpf.h                     |   4 +
 tools/lib/bpf/libbpf.map                |   2 +
 tools/testing/selftests/bpf/test_maps.c | 148 +++++++++++++-
 7 files changed, 388 insertions(+), 63 deletions(-)

-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bp_map_copy_value helper functions
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
@ 2019-07-24 16:57 ` Brian Vazquez
  2019-07-24 20:53   ` Song Liu
  2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:57 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

Move reusable code from map_lookup_elem to helper functions to avoid code
duplication in kernel/bpf/syscall.c

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 kernel/bpf/syscall.c | 134 +++++++++++++++++++++++--------------------
 1 file changed, 73 insertions(+), 61 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5d141f16f6fa9..86cdc2f7bb56e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -126,6 +126,76 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
 	return map;
 }
 
+static u32 bpf_map_value_size(struct bpf_map *map)
+{
+	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
+	    map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
+	    map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
+		return round_up(map->value_size, 8) * num_possible_cpus();
+	else if (IS_FD_MAP(map))
+		return sizeof(u32);
+	else
+		return  map->value_size;
+}
+
+static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
+			      __u64 flags)
+{
+	void *ptr;
+	int err;
+
+	if (bpf_map_is_dev_bound(map))
+		return  bpf_map_offload_lookup_elem(map, key, value);
+
+	preempt_disable();
+	this_cpu_inc(bpf_prog_active);
+	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
+		err = bpf_percpu_hash_copy(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
+		err = bpf_percpu_array_copy(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
+		err = bpf_percpu_cgroup_storage_copy(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
+		err = bpf_stackmap_copy(map, key, value);
+	} else if (IS_FD_ARRAY(map)) {
+		err = bpf_fd_array_map_lookup_elem(map, key, value);
+	} else if (IS_FD_HASH(map)) {
+		err = bpf_fd_htab_map_lookup_elem(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
+		err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+		   map->map_type == BPF_MAP_TYPE_STACK) {
+		err = map->ops->map_peek_elem(map, value);
+	} else {
+		rcu_read_lock();
+		if (map->ops->map_lookup_elem_sys_only)
+			ptr = map->ops->map_lookup_elem_sys_only(map, key);
+		else
+			ptr = map->ops->map_lookup_elem(map, key);
+		if (IS_ERR(ptr)) {
+			err = PTR_ERR(ptr);
+		} else if (!ptr) {
+			err = -ENOENT;
+		} else {
+			err = 0;
+			if (flags & BPF_F_LOCK)
+				/* lock 'ptr' and copy everything but lock */
+				copy_map_value_locked(map, value, ptr, true);
+			else
+				copy_map_value(map, value, ptr);
+			/* mask lock, since value wasn't zero inited */
+			check_and_init_map_lock(map, value);
+		}
+		rcu_read_unlock();
+	}
+	this_cpu_dec(bpf_prog_active);
+	preempt_enable();
+
+	return err;
+}
+
 void *bpf_map_area_alloc(size_t size, int numa_node)
 {
 	/* We really just want to fail instead of triggering OOM killer
@@ -729,7 +799,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 	void __user *uvalue = u64_to_user_ptr(attr->value);
 	int ufd = attr->map_fd;
 	struct bpf_map *map;
-	void *key, *value, *ptr;
+	void *key, *value;
 	u32 value_size;
 	struct fd f;
 	int err;
@@ -761,72 +831,14 @@ static int map_lookup_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
-	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
-	    map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
-	    map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
-		value_size = round_up(map->value_size, 8) * num_possible_cpus();
-	else if (IS_FD_MAP(map))
-		value_size = sizeof(u32);
-	else
-		value_size = map->value_size;
+	value_size = bpf_map_value_size(map);
 
 	err = -ENOMEM;
 	value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
 	if (!value)
 		goto free_key;
 
-	if (bpf_map_is_dev_bound(map)) {
-		err = bpf_map_offload_lookup_elem(map, key, value);
-		goto done;
-	}
-
-	preempt_disable();
-	this_cpu_inc(bpf_prog_active);
-	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
-	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
-		err = bpf_percpu_hash_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
-		err = bpf_percpu_array_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
-		err = bpf_percpu_cgroup_storage_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
-		err = bpf_stackmap_copy(map, key, value);
-	} else if (IS_FD_ARRAY(map)) {
-		err = bpf_fd_array_map_lookup_elem(map, key, value);
-	} else if (IS_FD_HASH(map)) {
-		err = bpf_fd_htab_map_lookup_elem(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
-		err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
-		   map->map_type == BPF_MAP_TYPE_STACK) {
-		err = map->ops->map_peek_elem(map, value);
-	} else {
-		rcu_read_lock();
-		if (map->ops->map_lookup_elem_sys_only)
-			ptr = map->ops->map_lookup_elem_sys_only(map, key);
-		else
-			ptr = map->ops->map_lookup_elem(map, key);
-		if (IS_ERR(ptr)) {
-			err = PTR_ERR(ptr);
-		} else if (!ptr) {
-			err = -ENOENT;
-		} else {
-			err = 0;
-			if (attr->flags & BPF_F_LOCK)
-				/* lock 'ptr' and copy everything but lock */
-				copy_map_value_locked(map, value, ptr, true);
-			else
-				copy_map_value(map, value, ptr);
-			/* mask lock, since value wasn't zero inited */
-			check_and_init_map_lock(map, value);
-		}
-		rcu_read_unlock();
-	}
-	this_cpu_dec(bpf_prog_active);
-	preempt_enable();
-
-done:
+	err = bpf_map_copy_value(map, key, value, attr->flags);
 	if (err)
 		goto free_value;
 
-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
  2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bp_map_copy_value helper functions Brian Vazquez
@ 2019-07-24 16:57 ` Brian Vazquez
  2019-07-24 19:54   ` Willem de Bruijn
  2019-07-24 21:40   ` Song Liu
  2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 26+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:57 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

This introduces a new command to retrieve multiple entries from a bpf
map, wrapping the existing bpf methods:
map_get_next_key and map_lookup_elem

To start dumping the map from the beginning you must specify NULL as
the prev_key.

The new API returns 0 when it successfully copied all the elements
requested, or when it copied fewer because there were no more elements
to retrieve (i.e. err == -ENOENT). In the latter scenario err is masked
to 0.

On a successful call buf and buf_len will contain correct data, and if
prev_key was provided (not for the first walk, since prev_key is NULL)
the last key copied will be written back into prev_key, which
simplifies the next call.

Only when it can't find a single element will it return -ENOENT,
meaning that the map has been entirely walked. When an error is
returned, buf, buf_len and prev_key shouldn't be read nor used.

Because maps can be accessed from userspace and kernel code, this
function can hit a scenario where the next_key was found but, by the
time we try to retrieve the value, the element is not there anymore. In
this case the function continues and tries to get a new next_key value,
skipping the deleted key. If at some point the function finds itself
trapped in a loop, it will return -EINTR.

The function will try to fit as much as possible in the buf provided and
will return -EINVAL if buf_len is smaller than elem_size.

QUEUE and STACK maps are not supported.

Note that map_dump doesn't guarantee that reading the entire table is
consistent, since this function is always racing with kernel and user
code, but the same behaviour is found when the entire table is walked
using the current interfaces: map_get_next_key + map_lookup_elem.
It is also important to note that with a locked map, the lock is
grabbed for one entry at a time, meaning that the returned buf might or
might not be consistent.
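
For reference, the returned buf packs each element as the key
immediately followed by its value (as laid out by the implementation
below), so a consumer could walk it roughly like this sketch; the
key/value sizes passed in here are only illustrative assumptions:

#include <stdint.h>

/* Sketch only: iterate the key/value pairs that BPF_MAP_DUMP packed
 * into buf. buf_len is the value written back by the kernel and is a
 * multiple of key_size + value_size.
 */
static void walk_dump_buf(const char *buf, uint32_t buf_len,
			  uint32_t key_size, uint32_t value_size)
{
	uint32_t elem_size = key_size + value_size;
	uint32_t off;

	for (off = 0; off + elem_size <= buf_len; off += elem_size) {
		const void *key = buf + off;
		const void *value = buf + off + key_size;

		/* consume key/value here */
		(void)key;
		(void)value;
	}
}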

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 include/uapi/linux/bpf.h |   9 +++
 kernel/bpf/syscall.c     | 117 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 126 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fa1c753dcdbc7..66dab5385170d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
 	BPF_TASK_FD_QUERY,
 	BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 	BPF_MAP_FREEZE,
+	BPF_MAP_DUMP,
 };
 
 enum bpf_map_type {
@@ -388,6 +389,14 @@ union bpf_attr {
 		__u64		flags;
 	};
 
+	struct { /* struct used by BPF_MAP_DUMP command */
+		__aligned_u64	prev_key;
+		__aligned_u64	buf;
+		__aligned_u64	buf_len; /* input/output: len of buf */
+		__u64		flags;
+		__u32		map_fd;
+	} dump;
+
 	struct { /* anonymous struct used by BPF_PROG_LOAD command */
 		__u32		prog_type;	/* one of enum bpf_prog_type */
 		__u32		insn_cnt;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 86cdc2f7bb56e..0c35505aa219f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
 	return err;
 }
 
+/* last field in 'union bpf_attr' used by this command */
+#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
+
+static int map_dump(union bpf_attr *attr)
+{
+	void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
+	void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
+	u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
+	int ufd = attr->dump.map_fd;
+	struct bpf_map *map;
+	void *buf, *prev_key, *key, *value;
+	u32 value_size, elem_size, buf_len, cp_len;
+	struct fd f;
+	int err;
+	bool first_key = false;
+
+	if (CHECK_ATTR(BPF_MAP_DUMP))
+		return -EINVAL;
+
+	if (attr->dump.flags & ~BPF_F_LOCK)
+		return -EINVAL;
+
+	f = fdget(ufd);
+	map = __bpf_map_get(f);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+	if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
+		err = -EPERM;
+		goto err_put;
+	}
+
+	if ((attr->dump.flags & BPF_F_LOCK) &&
+	    !map_value_has_spin_lock(map)) {
+		err = -EINVAL;
+		goto err_put;
+	}
+
+	if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+	    map->map_type == BPF_MAP_TYPE_STACK) {
+		err = -ENOTSUPP;
+		goto err_put;
+	}
+
+	value_size = bpf_map_value_size(map);
+
+	err = get_user(buf_len, ubuf_len);
+	if (err)
+		goto err_put;
+
+	elem_size = map->key_size + value_size;
+	if (buf_len < elem_size) {
+		err = -EINVAL;
+		goto err_put;
+	}
+
+	if (ukey) {
+		prev_key = __bpf_copy_key(ukey, map->key_size);
+		if (IS_ERR(prev_key)) {
+			err = PTR_ERR(prev_key);
+			goto err_put;
+		}
+	} else {
+		prev_key = NULL;
+		first_key = true;
+	}
+
+	err = -ENOMEM;
+	buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
+	if (!buf)
+		goto err_put;
+
+	key = buf;
+	value = key + map->key_size;
+	for (cp_len = 0; cp_len + elem_size <= buf_len;) {
+		if (signal_pending(current)) {
+			err = -EINTR;
+			break;
+		}
+
+		rcu_read_lock();
+		err = map->ops->map_get_next_key(map, prev_key, key);
+		rcu_read_unlock();
+
+		if (err)
+			break;
+
+		err = bpf_map_copy_value(map, key, value, attr->dump.flags);
+
+		if (err == -ENOENT)
+			continue;
+		if (err)
+			goto free_buf;
+
+		if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
+			err = -EFAULT;
+			goto free_buf;
+		}
+
+		prev_key = key;
+		cp_len += elem_size;
+	}
+
+	if (err == -ENOENT && cp_len)
+		err = 0;
+	if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
+		    (!first_key && copy_to_user(ukey, key, map->key_size))))
+		err = -EFAULT;
+free_buf:
+	kfree(buf);
+err_put:
+	fdput(f);
+	return err;
+}
+
 #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
 
 static int map_lookup_and_delete_elem(union bpf_attr *attr)
@@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
 		err = map_lookup_and_delete_elem(&attr);
 		break;
+	case BPF_MAP_DUMP:
+		err = map_dump(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
  2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bp_map_copy_value helper functions Brian Vazquez
  2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
@ 2019-07-24 16:58 ` Brian Vazquez
  2019-07-24 21:41   ` Song Liu
  2019-07-24 23:10   ` Andrii Nakryiko
  2019-07-24 16:58 ` [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command Brian Vazquez
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 26+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:58 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

Adds bpf_attr.dump structure to libbpf.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 tools/include/uapi/linux/bpf.h | 9 +++++++++
 tools/lib/bpf/libbpf.map       | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4e455018da65f..e127f16e4e932 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
 	BPF_TASK_FD_QUERY,
 	BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 	BPF_MAP_FREEZE,
+	BPF_MAP_DUMP,
 };
 
 enum bpf_map_type {
@@ -388,6 +389,14 @@ union bpf_attr {
 		__u64		flags;
 	};
 
+	struct { /* struct used by BPF_MAP_DUMP command */
+		__aligned_u64	prev_key;
+		__aligned_u64	buf;
+		__aligned_u64	buf_len; /* input/output: len of buf */
+		__u64		flags;
+		__u32		map_fd;
+	} dump;
+
 	struct { /* anonymous struct used by BPF_PROG_LOAD command */
 		__u32		prog_type;	/* one of enum bpf_prog_type */
 		__u32		insn_cnt;
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index f9d316e873d8d..cac3723d5c45c 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -183,4 +183,6 @@ LIBBPF_0.0.4 {
 		perf_buffer__new;
 		perf_buffer__new_raw;
 		perf_buffer__poll;
+		bpf_map_dump;
+		bpf_map_dump_flags;
 } LIBBPF_0.0.3;
-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
                   ` (2 preceding siblings ...)
  2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
@ 2019-07-24 16:58 ` Brian Vazquez
  2019-07-24 19:51   ` Willem de Bruijn
  2019-07-24 16:58 ` [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap Brian Vazquez
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:58 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

Make libbpf aware of the new BPF_MAP_DUMP command and add bpf_map_dump
and bpf_map_dump_flags so it can be used from the library.
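
As an illustration, a caller might walk an entire u64/u64 hash map with
the new helper along these lines (the buffer size, error handling and
the resume-from-last-key step are assumptions based on the semantics
described in this series):

#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <linux/types.h>
#include <bpf/bpf.h>

static int dump_whole_map(int map_fd)
{
	const __u32 elem_size = sizeof(__u64) * 2;	/* key + value */
	const __u32 buf_len = 64 * elem_size;		/* room for 64 entries */
	void *buf = malloc(buf_len);
	__u64 prev_key = 0;
	int first = 1;

	if (!buf)
		return -ENOMEM;

	for (;;) {
		__u32 len = buf_len;

		if (bpf_map_dump(map_fd, first ? NULL : &prev_key, buf, &len)) {
			free(buf);
			/* ENOENT here means the map was fully walked */
			return errno == ENOENT ? 0 : -errno;
		}

		/* consume len / elem_size entries from buf here ... */

		/* last key returned is where the next call resumes */
		memcpy(&prev_key, (char *)buf + len - elem_size,
		       sizeof(prev_key));
		first = 0;
	}
}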

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 tools/lib/bpf/bpf.c | 28 ++++++++++++++++++++++++++++
 tools/lib/bpf/bpf.h |  4 ++++
 2 files changed, 32 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index c7d7993c44bb0..c1139b7db756a 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -368,6 +368,34 @@ int bpf_map_update_elem(int fd, const void *key, const void *value,
 	return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
 }
 
+int bpf_map_dump(int fd, const void *prev_key, void *buf, void *buf_len)
+{
+	union bpf_attr attr;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.dump.map_fd = fd;
+	attr.dump.prev_key = ptr_to_u64(prev_key);
+	attr.dump.buf = ptr_to_u64(buf);
+	attr.dump.buf_len = ptr_to_u64(buf_len);
+
+	return sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr));
+}
+
+int bpf_map_dump_flags(int fd, const void *prev_key, void *buf, void *buf_len,
+		       __u64 flags)
+{
+	union bpf_attr attr;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.dump.map_fd = fd;
+	attr.dump.prev_key = ptr_to_u64(prev_key);
+	attr.dump.buf = ptr_to_u64(buf);
+	attr.dump.buf_len = ptr_to_u64(buf_len);
+	attr.dump.flags = flags;
+
+	return sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr));
+}
+
 int bpf_map_lookup_elem(int fd, const void *key, void *value)
 {
 	union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index ff42ca043dc8f..86496443440e9 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -112,6 +112,10 @@ LIBBPF_API int bpf_verify_program(enum bpf_prog_type type,
 LIBBPF_API int bpf_map_update_elem(int fd, const void *key, const void *value,
 				   __u64 flags);
 
+LIBBPF_API int bpf_map_dump(int fd, const void *prev_key, void *buf,
+				void *buf_len);
+LIBBPF_API int bpf_map_dump_flags(int fd, const void *prev_key, void *buf,
+				void *buf_len, __u64 flags);
 LIBBPF_API int bpf_map_lookup_elem(int fd, const void *key, void *value);
 LIBBPF_API int bpf_map_lookup_elem_flags(int fd, const void *key, void *value,
 					 __u64 flags);
-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
                   ` (3 preceding siblings ...)
  2019-07-24 16:58 ` [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command Brian Vazquez
@ 2019-07-24 16:58 ` Brian Vazquez
  2019-07-24 21:58   ` Song Liu
  2019-07-24 16:58 ` [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP Brian Vazquez
  2019-07-24 19:20 ` [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Song Liu
  6 siblings, 1 reply; 26+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:58 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

This test exercises the new command on a bpf hashmap and makes sure it
works as expected.

Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 tools/testing/selftests/bpf/test_maps.c | 83 ++++++++++++++++++++++++-
 1 file changed, 81 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 5443b9bd75ed7..f7ab401399d40 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -309,6 +309,86 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	close(fd);
 }
 
+static void test_hashmap_dump(void)
+{
+	int fd, i, max_entries = 5;
+	uint64_t keys[max_entries], values[max_entries];
+	uint64_t key, value, next_key, prev_key;
+	bool next_key_valid = true;
+	void *buf, *elem;
+	u32 buf_len;
+	const int elem_size = sizeof(key) + sizeof(value);
+
+	fd = helper_fill_hashmap(max_entries);
+
+	// Get the elements in the hashmap, and store them in that order
+	assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
+	i = 0;
+	keys[i] = key;
+	for (i = 1; next_key_valid; i++) {
+		next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
+		assert(bpf_map_lookup_elem(fd, &key, &values[i - 1]) == 0);
+		keys[i-1] = key;
+		key = next_key;
+	}
+
+	// Alloc memory for the whole table
+	buf = malloc(elem_size * max_entries);
+	assert(buf != NULL);
+
+	// Check that buf_len < elem_size returns EINVAL
+	buf_len = elem_size-1;
+	errno = 0;
+	assert(bpf_map_dump(fd, NULL, buf, &buf_len) == -1 && errno == EINVAL);
+
+	// Check that it returns the first two elements
+	errno = 0;
+	buf_len = elem_size * 2;
+	i = 0;
+	assert(bpf_map_dump(fd, NULL, buf, &buf_len) == 0 &&
+	       buf_len == 2*elem_size);
+	elem = buf;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+	elem = buf + elem_size;
+	i++;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+	i++;
+
+	/* Check that prev_key contains key from last_elem retrieved in previous
+	 * call
+	 */
+	prev_key = *((uint64_t *)elem);
+	assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == 0 &&
+	       buf_len == elem_size*2);
+	elem = buf;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+	elem = buf + elem_size;
+	i++;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+	i++;
+	assert(prev_key == (*(uint64_t *)elem));
+
+	/* Continue reading from map and verify buf_len only contains 1 element
+	 * even though buf_len is 2 elem_size and it returns err = 0.
+	 */
+	assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == 0 &&
+	       buf_len == elem_size);
+	elem = buf;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+
+	// Verify there's no more entries and err = ENOENT
+	assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == -1 &&
+	       errno == ENOENT);
+
+	free(buf);
+	close(fd);
+}
+
 static void test_hashmap_zero_seed(void)
 {
 	int i, first, second, old_flags;
@@ -1677,6 +1757,7 @@ static void run_all_tests(void)
 	test_hashmap_percpu(0, NULL);
 	test_hashmap_walk(0, NULL);
 	test_hashmap_zero_seed();
+	test_hashmap_dump();
 
 	test_arraymap(0, NULL);
 	test_arraymap_percpu(0, NULL);
@@ -1714,11 +1795,9 @@ int main(void)
 
 	map_flags = BPF_F_NO_PREALLOC;
 	run_all_tests();
-
 #define CALL
 #include <map_tests/tests.h>
 #undef CALL
-
 	printf("test_maps: OK, %d SKIPPED\n", skips);
 	return 0;
 }
-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
                   ` (4 preceding siblings ...)
  2019-07-24 16:58 ` [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap Brian Vazquez
@ 2019-07-24 16:58 ` Brian Vazquez
  2019-07-24 22:01   ` Song Liu
  2019-07-24 19:20 ` [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Song Liu
  6 siblings, 1 reply; 26+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:58 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

This test compares the time it takes to read an entire table of 100K
elements from a bpf hashmap using both BPF_MAP_DUMP and
BPF_MAP_GET_NEXT_KEY + BPF_MAP_LOOKUP_ELEM.

Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 tools/testing/selftests/bpf/test_maps.c | 65 +++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index f7ab401399d40..c4593a8904ca6 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -18,6 +18,7 @@
 #include <sys/socket.h>
 #include <netinet/in.h>
 #include <linux/bpf.h>
+#include <linux/time64.h>
 
 #include <bpf/bpf.h>
 #include <bpf/libbpf.h>
@@ -389,6 +390,69 @@ static void test_hashmap_dump(void)
 	close(fd);
 }
 
+static void test_hashmap_dump_perf(void)
+{
+	int fd, i, max_entries = 100000;
+	uint64_t key, value, next_key;
+	bool next_key_valid = true;
+	void *buf;
+	u32 buf_len, entries;
+	int j = 0;
+	int clk_id = CLOCK_MONOTONIC;
+	struct timespec begin, end;
+	long long time_spent, dump_time_spent;
+	double res;
+	int tests[] = {1, 2, 230, 5000, 73000, 100000, 234567};
+	int test_len = ARRAY_SIZE(tests);
+	const int elem_size = sizeof(key) + sizeof(value);
+
+	fd = helper_fill_hashmap(max_entries);
+	// Alloc memory considering the largest buffer
+	buf = malloc(elem_size * tests[test_len-1]);
+	assert(buf != NULL);
+
+test:
+	entries = tests[j];
+	buf_len = elem_size*tests[j];
+	j++;
+	clock_gettime(clk_id, &begin);
+	errno = 0;
+	i = 0;
+	while (errno == 0) {
+		bpf_map_dump(fd, !i ? NULL : &key,
+				  buf, &buf_len);
+		if (errno)
+			break;
+		if (!i)
+			key = *((uint64_t *)(buf + buf_len - elem_size));
+		i += buf_len / elem_size;
+	}
+	clock_gettime(clk_id, &end);
+	assert(i  == max_entries);
+	dump_time_spent = NSEC_PER_SEC * (end.tv_sec - begin.tv_sec) +
+			  end.tv_nsec - begin.tv_nsec;
+	next_key_valid = true;
+	clock_gettime(clk_id, &begin);
+	assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
+	for (i = 0; next_key_valid; i++) {
+		next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
+		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
+		key = next_key;
+	}
+	clock_gettime(clk_id, &end);
+	time_spent = NSEC_PER_SEC * (end.tv_sec - begin.tv_sec) +
+		     end.tv_nsec - begin.tv_nsec;
+	res = (1-((double)dump_time_spent/time_spent))*100;
+	printf("buf_len_%u:\t %llu entry-by-entry: %llu improvement %lf\n",
+	       entries, dump_time_spent, time_spent, res);
+	assert(i  == max_entries);
+
+	if (j < test_len)
+		goto test;
+	free(buf);
+	close(fd);
+}
+
 static void test_hashmap_zero_seed(void)
 {
 	int i, first, second, old_flags;
@@ -1758,6 +1822,7 @@ static void run_all_tests(void)
 	test_hashmap_walk(0, NULL);
 	test_hashmap_zero_seed();
 	test_hashmap_dump();
+	test_hashmap_dump_perf();
 
 	test_arraymap(0, NULL);
 	test_arraymap_percpu(0, NULL);
-- 
2.22.0.657.g960e92d24f-goog



* Re: [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
                   ` (5 preceding siblings ...)
  2019-07-24 16:58 ` [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP Brian Vazquez
@ 2019-07-24 19:20 ` Song Liu
  2019-07-24 22:15   ` Brian Vazquez
  6 siblings, 1 reply; 26+ messages in thread
From: Song Liu @ 2019-07-24 19:20 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:09 AM Brian Vazquez <brianvv@google.com> wrote:
>
> This introduces a new command to retrieve multiple number of entries
> from a bpf map.
>
> This new command can be executed from the existing BPF syscall as
> follows:
>
> err =  bpf(BPF_MAP_DUMP, union bpf_attr *attr, u32 size)
> using attr->dump.map_fd, attr->dump.prev_key, attr->dump.buf,
> attr->dump.buf_len
> returns zero or negative error, and populates buf and buf_len on
> succees
>
> This implementation is wrapping the existing bpf methods:
> map_get_next_key and map_lookup_elem
>
> Note that this implementation can be extended later to do dump and
> delete by extending map_lookup_and_delete_elem (currently it only works
> for bpf queue/stack maps) and either use a new flag in map_dump or a new
> command map_dump_and_delete.
>
> Results show that even with a 1-elem_size buffer, it runs ~40 faster

Why is the new command 40% faster with 1-elem_size buffer?

> than the current implementation, improvements of ~85% are reported when
> the buffer size is increased, although, after the buffer size is around
> 5% of the total number of entries there's no huge difference in
> increasing it.
>
> Tested:
> Tried different size buffers to handle case where the bulk is bigger, or
> the elements to retrieve are less than the existing ones, all runs read
> a map of 100K entries. Below are the results(in ns) from the different
> runs:
>
> buf_len_1:       69038725 entry-by-entry: 112384424 improvement
> 38.569134
> buf_len_2:       40897447 entry-by-entry: 111030546 improvement
> 63.165590
> buf_len_230:     13652714 entry-by-entry: 111694058 improvement
> 87.776687
> buf_len_5000:    13576271 entry-by-entry: 111101169 improvement
> 87.780263
> buf_len_73000:   14694343 entry-by-entry: 111740162 improvement
> 86.849542
> buf_len_100000:  13745969 entry-by-entry: 114151991 improvement
> 87.958187
> buf_len_234567:  14329834 entry-by-entry: 114427589 improvement
> 87.476941

It took me a while to figure out the meaning of 87.476941. It is probably
a good idea to say 87.5% instead.

Thanks,
Song


* Re: [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command
  2019-07-24 16:58 ` [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command Brian Vazquez
@ 2019-07-24 19:51   ` Willem de Bruijn
  0 siblings, 0 replies; 26+ messages in thread
From: Willem de Bruijn @ 2019-07-24 19:51 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, LKML, Network Development, bpf

On Wed, Jul 24, 2019 at 1:10 PM Brian Vazquez <brianvv@google.com> wrote:
>
> Make libbpf aware of new BPF_MAP_DUMP command and add bpf_map_dump and
> bpf_map_dump_flags to use them from the library.
>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  tools/lib/bpf/bpf.c | 28 ++++++++++++++++++++++++++++
>  tools/lib/bpf/bpf.h |  4 ++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index c7d7993c44bb0..c1139b7db756a 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -368,6 +368,34 @@ int bpf_map_update_elem(int fd, const void *key, const void *value,
>         return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
>  }
>
> +int bpf_map_dump(int fd, const void *prev_key, void *buf, void *buf_len)
> +{
> +       union bpf_attr attr;
> +
> +       memset(&attr, 0, sizeof(attr));
> +       attr.dump.map_fd = fd;
> +       attr.dump.prev_key = ptr_to_u64(prev_key);
> +       attr.dump.buf = ptr_to_u64(buf);
> +       attr.dump.buf_len = ptr_to_u64(buf_len);
> +
> +       return sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr));
> +}

This can call bpf_map_dump_flags internally to avoid code duplication?
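
Something like this, as a sketch:

int bpf_map_dump(int fd, const void *prev_key, void *buf, void *buf_len)
{
	return bpf_map_dump_flags(fd, prev_key, buf, buf_len, 0);
}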

> +
> +int bpf_map_dump_flags(int fd, const void *prev_key, void *buf, void *buf_len,
> +                      __u64 flags)
> +{
> +       union bpf_attr attr;
> +
> +       memset(&attr, 0, sizeof(attr));
> +       attr.dump.map_fd = fd;
> +       attr.dump.prev_key = ptr_to_u64(prev_key);
> +       attr.dump.buf = ptr_to_u64(buf);
> +       attr.dump.buf_len = ptr_to_u64(buf_len);
> +       attr.dump.flags = flags;
> +
> +       return sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr));
> +}
> +


* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
@ 2019-07-24 19:54   ` Willem de Bruijn
  2019-07-24 22:26     ` Brian Vazquez
  2019-07-24 21:40   ` Song Liu
  1 sibling, 1 reply; 26+ messages in thread
From: Willem de Bruijn @ 2019-07-24 19:54 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, LKML, Network Development, bpf

On Wed, Jul 24, 2019 at 1:10 PM Brian Vazquez <brianvv@google.com> wrote:
>
> This introduces a new command to retrieve multiple number of entries
> from a bpf map, wrapping the existing bpf methods:
> map_get_next_key and map_lookup_elem
>
> To start dumping the map from the beginning you must specify NULL as
> the prev_key.
>
> The new API returns 0 when it successfully copied all the elements
> requested or it copied less because there weren't more elements to
> retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.

I think I understand this, but perhaps it can be explained a bit more
concisely without reference to ENOENT and error masking. The function
returns the min of the number of requested elements and the number of
remaining elements in the map from the given starting point
(prev_key).

> On a successful call buf and buf_len will contain correct data and in
> case prev_key was provided (not for the first walk, since prev_key is
> NULL) it will contain the last_key copied into the prev_key which will
> simplify next call.
>
> Only when it can't find a single element it will return -ENOENT meaning
> that the map has been entirely walked. When an error is return buf,
> buf_len and prev_key shouldn't be read nor used.

That's common for error handling. No need to state explicitly.

> Because maps can be called from userspace and kernel code, this function
> can have a scenario where the next_key was found but by the time we
> try to retrieve the value the element is not there, in this case the
> function continues and tries to get a new next_key value, skipping the
> deleted key. If at some point the function find itself trap in a loop,
> it will return -EINTR.

Good to point this out! I don't think that unbounded continue;
statements until an interrupt happens is sufficient. Please bound the
number of retries to a low number.

> The function will try to fit as much as possible in the buf provided and
> will return -EINVAL if buf_len is smaller than elem_size.
>
> QUEUE and STACK maps are not supported.
>
> Note that map_dump doesn't guarantee that reading the entire table is
> consistent since this function is always racing with kernel and user code
> but the same behaviour is found when the entire table is walked using
> the current interfaces: map_get_next_key + map_lookup_elem.

> It is also important to note that with  a locked map, the lock is grabbed
> for 1 entry at the time, meaning that the returned buf might or might not
> be consistent.

Would it be informative to signal to the caller if the read was
complete and consistent (because the entire table was read while the
lock was held)?

>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>


* Re: [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bp_map_copy_value helper functions
  2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bp_map_copy_value helper functions Brian Vazquez
@ 2019-07-24 20:53   ` Song Liu
  0 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-07-24 20:53 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> Move reusable code from map_lookup_elem to helper functions to avoid code
> duplication in kernel/bpf/syscall.c
>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>

Acked-by: Song Liu <songliubraving@fb.com>

Some very minor nits though.

> ---
>  kernel/bpf/syscall.c | 134 +++++++++++++++++++++++--------------------
>  1 file changed, 73 insertions(+), 61 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 5d141f16f6fa9..86cdc2f7bb56e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -126,6 +126,76 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
>         return map;
>  }
>
> +static u32 bpf_map_value_size(struct bpf_map *map)
> +{
> +       if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
> +           map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
> +           map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
> +           map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
> +               return round_up(map->value_size, 8) * num_possible_cpus();
> +       else if (IS_FD_MAP(map))
> +               return sizeof(u32);
> +       else
> +               return  map->value_size;
                  ^ extra space after return

> +}
> +
> +static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
> +                             __u64 flags)
> +{
> +       void *ptr;
> +       int err;
> +
> +       if (bpf_map_is_dev_bound(map))
> +               return  bpf_map_offload_lookup_elem(map, key, value);
                  ^ another extra space after return, did replace? :-)

Thanks,
Song


* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
  2019-07-24 19:54   ` Willem de Bruijn
@ 2019-07-24 21:40   ` Song Liu
  2019-07-24 22:44     ` Brian Vazquez
  1 sibling, 1 reply; 26+ messages in thread
From: Song Liu @ 2019-07-24 21:40 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> This introduces a new command to retrieve multiple number of entries
> from a bpf map, wrapping the existing bpf methods:
> map_get_next_key and map_lookup_elem
>
> To start dumping the map from the beginning you must specify NULL as
> the prev_key.
>
> The new API returns 0 when it successfully copied all the elements
> requested or it copied less because there weren't more elements to
> retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.
>
> On a successful call buf and buf_len will contain correct data and in
> case prev_key was provided (not for the first walk, since prev_key is
> NULL) it will contain the last_key copied into the prev_key which will
> simplify next call.
>
> Only when it can't find a single element it will return -ENOENT meaning
> that the map has been entirely walked. When an error is return buf,
> buf_len and prev_key shouldn't be read nor used.
>
> Because maps can be called from userspace and kernel code, this function
> can have a scenario where the next_key was found but by the time we
> try to retrieve the value the element is not there, in this case the
> function continues and tries to get a new next_key value, skipping the
> deleted key. If at some point the function find itself trap in a loop,
> it will return -EINTR.
>
> The function will try to fit as much as possible in the buf provided and
> will return -EINVAL if buf_len is smaller than elem_size.
>
> QUEUE and STACK maps are not supported.
>
> Note that map_dump doesn't guarantee that reading the entire table is
> consistent since this function is always racing with kernel and user code
> but the same behaviour is found when the entire table is walked using
> the current interfaces: map_get_next_key + map_lookup_elem.
> It is also important to note that with  a locked map, the lock is grabbed
> for 1 entry at the time, meaning that the returned buf might or might not
> be consistent.
>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  include/uapi/linux/bpf.h |   9 +++
>  kernel/bpf/syscall.c     | 117 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 126 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index fa1c753dcdbc7..66dab5385170d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -106,6 +106,7 @@ enum bpf_cmd {
>         BPF_TASK_FD_QUERY,
>         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
>         BPF_MAP_FREEZE,
> +       BPF_MAP_DUMP,
>  };
>
>  enum bpf_map_type {
> @@ -388,6 +389,14 @@ union bpf_attr {
>                 __u64           flags;
>         };
>
> +       struct { /* struct used by BPF_MAP_DUMP command */
> +               __aligned_u64   prev_key;
> +               __aligned_u64   buf;
> +               __aligned_u64   buf_len; /* input/output: len of buf */
> +               __u64           flags;

Please add explanation of flags here. Also, we need to update the
comments of BPF_F_LOCK for BPF_MAP_DUMP.

> +               __u32           map_fd;
> +       } dump;
> +
>         struct { /* anonymous struct used by BPF_PROG_LOAD command */
>                 __u32           prog_type;      /* one of enum bpf_prog_type */
>                 __u32           insn_cnt;
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 86cdc2f7bb56e..0c35505aa219f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
>         return err;
>  }
>
> +/* last field in 'union bpf_attr' used by this command */
> +#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
> +
> +static int map_dump(union bpf_attr *attr)
> +{
> +       void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
> +       void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
> +       u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
> +       int ufd = attr->dump.map_fd;
> +       struct bpf_map *map;
> +       void *buf, *prev_key, *key, *value;
> +       u32 value_size, elem_size, buf_len, cp_len;
> +       struct fd f;
> +       int err;
> +       bool first_key = false;
> +
> +       if (CHECK_ATTR(BPF_MAP_DUMP))
> +               return -EINVAL;
> +
> +       if (attr->dump.flags & ~BPF_F_LOCK)
> +               return -EINVAL;
> +
> +       f = fdget(ufd);
> +       map = __bpf_map_get(f);
> +       if (IS_ERR(map))
> +               return PTR_ERR(map);
> +       if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
> +               err = -EPERM;
> +               goto err_put;
> +       }
> +
> +       if ((attr->dump.flags & BPF_F_LOCK) &&
> +           !map_value_has_spin_lock(map)) {
> +               err = -EINVAL;
> +               goto err_put;
> +       }

We can share these lines with map_lookup_elem(). Maybe
add another helper function?
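
A rough sketch of the kind of helper being suggested (name and exact
shape are made up), assembled from the checks quoted above:

static struct bpf_map *bpf_map_fdget_read(int ufd, u64 flags, struct fd *f)
{
	struct bpf_map *map;

	*f = fdget(ufd);
	map = __bpf_map_get(*f);
	if (IS_ERR(map))
		return map;

	if (!(map_get_sys_perms(map, *f) & FMODE_CAN_READ)) {
		fdput(*f);
		return ERR_PTR(-EPERM);
	}

	if ((flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)) {
		fdput(*f);
		return ERR_PTR(-EINVAL);
	}

	return map;
}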

> +
> +       if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> +           map->map_type == BPF_MAP_TYPE_STACK) {
> +               err = -ENOTSUPP;
> +               goto err_put;
> +       }
> +
> +       value_size = bpf_map_value_size(map);
> +
> +       err = get_user(buf_len, ubuf_len);
> +       if (err)
> +               goto err_put;
> +
> +       elem_size = map->key_size + value_size;
> +       if (buf_len < elem_size) {
> +               err = -EINVAL;
> +               goto err_put;
> +       }
> +
> +       if (ukey) {
> +               prev_key = __bpf_copy_key(ukey, map->key_size);
> +               if (IS_ERR(prev_key)) {
> +                       err = PTR_ERR(prev_key);
> +                       goto err_put;
> +               }
> +       } else {
> +               prev_key = NULL;
> +               first_key = true;
> +       }
> +
> +       err = -ENOMEM;
> +       buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
> +       if (!buf)
> +               goto err_put;
> +
> +       key = buf;
> +       value = key + map->key_size;
> +       for (cp_len = 0; cp_len + elem_size <= buf_len;) {
> +               if (signal_pending(current)) {
> +                       err = -EINTR;
> +                       break;
> +               }
> +
> +               rcu_read_lock();
> +               err = map->ops->map_get_next_key(map, prev_key, key);

If prev_key is deleted before map_get_next_key(), we get the first key
again. This is pretty weird.

> +               rcu_read_unlock();
> +
> +               if (err)
> +                       break;
> +
> +               err = bpf_map_copy_value(map, key, value, attr->dump.flags);
> +
> +               if (err == -ENOENT)
> +                       continue;
> +               if (err)
> +                       goto free_buf;
> +
> +               if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
> +                       err = -EFAULT;
> +                       goto free_buf;
> +               }
> +
> +               prev_key = key;
> +               cp_len += elem_size;
> +       }
> +
> +       if (err == -ENOENT && cp_len)
> +               err = 0;
> +       if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
> +                   (!first_key && copy_to_user(ukey, key, map->key_size))))
> +               err = -EFAULT;
> +free_buf:
> +       kfree(buf);
> +err_put:
> +       fdput(f);
> +       return err;
> +}
> +
>  #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
>
>  static int map_lookup_and_delete_elem(union bpf_attr *attr)
> @@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>         case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
>                 err = map_lookup_and_delete_elem(&attr);
>                 break;
> +       case BPF_MAP_DUMP:
> +               err = map_dump(&attr);
> +               break;
>         default:
>                 err = -EINVAL;
>                 break;
> --
> 2.22.0.657.g960e92d24f-goog
>


* Re: [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/
  2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
@ 2019-07-24 21:41   ` Song Liu
  2019-07-24 23:10   ` Andrii Nakryiko
  1 sibling, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-07-24 21:41 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> Adds bpf_attr.dump structure to libbpf.
>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  tools/include/uapi/linux/bpf.h | 9 +++++++++
>  tools/lib/bpf/libbpf.map       | 2 ++
>  2 files changed, 11 insertions(+)
>
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 4e455018da65f..e127f16e4e932 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -106,6 +106,7 @@ enum bpf_cmd {
>         BPF_TASK_FD_QUERY,
>         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
>         BPF_MAP_FREEZE,
> +       BPF_MAP_DUMP,
>  };
>
>  enum bpf_map_type {
> @@ -388,6 +389,14 @@ union bpf_attr {
>                 __u64           flags;
>         };
>
> +       struct { /* struct used by BPF_MAP_DUMP command */
> +               __aligned_u64   prev_key;
> +               __aligned_u64   buf;
> +               __aligned_u64   buf_len; /* input/output: len of buf */
> +               __u64           flags;
> +               __u32           map_fd;
> +       } dump;
> +
>         struct { /* anonymous struct used by BPF_PROG_LOAD command */
>                 __u32           prog_type;      /* one of enum bpf_prog_type */
>                 __u32           insn_cnt;
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index f9d316e873d8d..cac3723d5c45c 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -183,4 +183,6 @@ LIBBPF_0.0.4 {
>                 perf_buffer__new;
>                 perf_buffer__new_raw;
>                 perf_buffer__poll;
> +               bpf_map_dump;
> +               bpf_map_dump_flags;
>  } LIBBPF_0.0.3;

libbpf.map change should go with 4/6.

> --
> 2.22.0.657.g960e92d24f-goog
>


* Re: [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap
  2019-07-24 16:58 ` [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap Brian Vazquez
@ 2019-07-24 21:58   ` Song Liu
  0 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-07-24 21:58 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> This tests exercise the new command on a bpf hashmap and make sure it
> works as expected.
>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  tools/testing/selftests/bpf/test_maps.c | 83 ++++++++++++++++++++++++-
>  1 file changed, 81 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> index 5443b9bd75ed7..f7ab401399d40 100644
> --- a/tools/testing/selftests/bpf/test_maps.c
> +++ b/tools/testing/selftests/bpf/test_maps.c
> @@ -309,6 +309,86 @@ static void test_hashmap_walk(unsigned int task, void *data)
>         close(fd);
>  }
>
> +static void test_hashmap_dump(void)
> +{
> +       int fd, i, max_entries = 5;

5 is too small for map_dump.

> +       uint64_t keys[max_entries], values[max_entries];
> +       uint64_t key, value, next_key, prev_key;
> +       bool next_key_valid = true;
> +       void *buf, *elem;
> +       u32 buf_len;
> +       const int elem_size = sizeof(key) + sizeof(value);
> +
> +       fd = helper_fill_hashmap(max_entries);
> +
> +       // Get the elements in the hashmap, and store them in that order

Please use /* */ instead of //.

> +       assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
> +       i = 0;
> +       keys[i] = key;
> +       for (i = 1; next_key_valid; i++) {
> +               next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
> +               assert(bpf_map_lookup_elem(fd, &key, &values[i - 1]) == 0);
> +               keys[i-1] = key;
> +               key = next_key;
> +       }
> +
> +       // Alloc memory for the whole table
> +       buf = malloc(elem_size * max_entries);
> +       assert(buf != NULL);
> +
> +       // Check that buf_len < elem_size returns EINVAL
> +       buf_len = elem_size-1;
> +       errno = 0;
> +       assert(bpf_map_dump(fd, NULL, buf, &buf_len) == -1 && errno == EINVAL);
> +
> +       // Check that it returns the first two elements
> +       errno = 0;
> +       buf_len = elem_size * 2;
> +       i = 0;
> +       assert(bpf_map_dump(fd, NULL, buf, &buf_len) == 0 &&
> +              buf_len == 2*elem_size);
> +       elem = buf;
> +       assert((*(uint64_t *)elem) == keys[i] &&
> +              (*(uint64_t *)(elem + sizeof(key))) == values[i]);
> +       elem = buf + elem_size;
> +       i++;
> +       assert((*(uint64_t *)elem) == keys[i] &&
> +              (*(uint64_t *)(elem + sizeof(key))) == values[i]);
> +       i++;
> +
> +       /* Check that prev_key contains key from last_elem retrieved in previous
> +        * call
> +        */
> +       prev_key = *((uint64_t *)elem);
> +       assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == 0 &&
> +              buf_len == elem_size*2);
> +       elem = buf;
> +       assert((*(uint64_t *)elem) == keys[i] &&
> +              (*(uint64_t *)(elem + sizeof(key))) == values[i]);
> +       elem = buf + elem_size;
> +       i++;
> +       assert((*(uint64_t *)elem) == keys[i] &&
> +              (*(uint64_t *)(elem + sizeof(key))) == values[i]);
> +       i++;
> +       assert(prev_key == (*(uint64_t *)elem));
> +
> +       /* Continue reading from map and verify buf_len only contains 1 element
> +        * even though buf_len is 2 elem_size and it returns err = 0.
> +        */
> +       assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == 0 &&
> +              buf_len == elem_size);
> +       elem = buf;
> +       assert((*(uint64_t *)elem) == keys[i] &&
> +              (*(uint64_t *)(elem + sizeof(key))) == values[i]);
> +
> +       // Verify there's no more entries and err = ENOENT
> +       assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == -1 &&
> +              errno == ENOENT);
> +
> +       free(buf);
> +       close(fd);
> +}
> +
>  static void test_hashmap_zero_seed(void)
>  {
>         int i, first, second, old_flags;
> @@ -1677,6 +1757,7 @@ static void run_all_tests(void)
>         test_hashmap_percpu(0, NULL);
>         test_hashmap_walk(0, NULL);
>         test_hashmap_zero_seed();
> +       test_hashmap_dump();
>
>         test_arraymap(0, NULL);
>         test_arraymap_percpu(0, NULL);
> @@ -1714,11 +1795,9 @@ int main(void)
>
>         map_flags = BPF_F_NO_PREALLOC;
>         run_all_tests();
> -
>  #define CALL
>  #include <map_tests/tests.h>
>  #undef CALL
> -
>         printf("test_maps: OK, %d SKIPPED\n", skips);
>         return 0;
>  }
> --
> 2.22.0.657.g960e92d24f-goog
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP
  2019-07-24 16:58 ` [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP Brian Vazquez
@ 2019-07-24 22:01   ` Song Liu
  0 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-07-24 22:01 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> This test compares the amount of time it takes to read an entire
> table of 100K elements on a bpf hashmap using both BPF_MAP_DUMP and
> BPF_MAP_GET_NEXT_KEY + BPF_MAP_LOOKUP_ELEM.
>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  tools/testing/selftests/bpf/test_maps.c | 65 +++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> index f7ab401399d40..c4593a8904ca6 100644
> --- a/tools/testing/selftests/bpf/test_maps.c
> +++ b/tools/testing/selftests/bpf/test_maps.c
> @@ -18,6 +18,7 @@
>  #include <sys/socket.h>
>  #include <netinet/in.h>
>  #include <linux/bpf.h>
> +#include <linux/time64.h>
>
>  #include <bpf/bpf.h>
>  #include <bpf/libbpf.h>
> @@ -389,6 +390,69 @@ static void test_hashmap_dump(void)
>         close(fd);
>  }
>
> +static void test_hashmap_dump_perf(void)
> +{
> +       int fd, i, max_entries = 100000;
> +       uint64_t key, value, next_key;
> +       bool next_key_valid = true;
> +       void *buf;
> +       u32 buf_len, entries;
> +       int j = 0;
> +       int clk_id = CLOCK_MONOTONIC;
> +       struct timespec begin, end;
> +       long long time_spent, dump_time_spent;
> +       double res;
> +       int tests[] = {1, 2, 230, 5000, 73000, 100000, 234567};
> +       int test_len = ARRAY_SIZE(tests);
> +       const int elem_size = sizeof(key) + sizeof(value);
> +
> +       fd = helper_fill_hashmap(max_entries);
> +       // Alloc memory considering the largest buffer
> +       buf = malloc(elem_size * tests[test_len-1]);
> +       assert(buf != NULL);
> +
> +test:
> +       entries = tests[j];
> +       buf_len = elem_size*tests[j];
> +       j++;
> +       clock_gettime(clk_id, &begin);
> +       errno = 0;
> +       i = 0;
> +       while (errno == 0) {
> +               bpf_map_dump(fd, !i ? NULL : &key,
> +                                 buf, &buf_len);
> +               if (errno)
> +                       break;
> +               if (!i)
> +                       key = *((uint64_t *)(buf + buf_len - elem_size));
> +               i += buf_len / elem_size;
> +       }
> +       clock_gettime(clk_id, &end);
> +       assert(i  == max_entries);
> +       dump_time_spent = NSEC_PER_SEC * (end.tv_sec - begin.tv_sec) +
> +                         end.tv_nsec - begin.tv_nsec;
> +       next_key_valid = true;
> +       clock_gettime(clk_id, &begin);
> +       assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
> +       for (i = 0; next_key_valid; i++) {
> +               next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
> +               assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
> +               key = next_key;
> +       }
> +       clock_gettime(clk_id, &end);
> +       time_spent = NSEC_PER_SEC * (end.tv_sec - begin.tv_sec) +
> +                    end.tv_nsec - begin.tv_nsec;
> +       res = (1-((double)dump_time_spent/time_spent))*100;
> +       printf("buf_len_%u:\t %llu entry-by-entry: %llu improvement %lf\n",
> +              entries, dump_time_spent, time_spent, res);
> +       assert(i  == max_entries);
              ^ extra space after i.

> +
> +       if (j < test_len)
> +               goto test;
> +       free(buf);
> +       close(fd);
> +}
> +
>  static void test_hashmap_zero_seed(void)
>  {
>         int i, first, second, old_flags;
> @@ -1758,6 +1822,7 @@ static void run_all_tests(void)
>         test_hashmap_walk(0, NULL);
>         test_hashmap_zero_seed();
>         test_hashmap_dump();
> +       test_hashmap_dump_perf();
>
>         test_arraymap(0, NULL);
>         test_arraymap_percpu(0, NULL);
> --
> 2.22.0.657.g960e92d24f-goog
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 19:20 ` [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Song Liu
@ 2019-07-24 22:15   ` Brian Vazquez
  0 siblings, 0 replies; 26+ messages in thread
From: Brian Vazquez @ 2019-07-24 22:15 UTC (permalink / raw)
  To: Song Liu
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 12:20 PM Song Liu <liu.song.a23@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 10:09 AM Brian Vazquez <brianvv@google.com> wrote:
> >
> > This introduces a new command to retrieve multiple number of entries
> > from a bpf map.
> >
> > This new command can be executed from the existing BPF syscall as
> > follows:
> >
> > err =  bpf(BPF_MAP_DUMP, union bpf_attr *attr, u32 size)
> > using attr->dump.map_fd, attr->dump.prev_key, attr->dump.buf,
> > attr->dump.buf_len
> > returns zero or negative error, and populates buf and buf_len on
> > succees
> >
> > This implementation is wrapping the existing bpf methods:
> > map_get_next_key and map_lookup_elem
> >
> > Note that this implementation can be extended later to do dump and
> > delete by extending map_lookup_and_delete_elem (currently it only works
> > for bpf queue/stack maps) and either use a new flag in map_dump or a new
> > command map_dump_and_delete.
> >
> > Results show that even with a 1-elem_size buffer, it runs ~40 faster
>
> Why is the new command 40% faster with 1-elem_size buffer?

The test uses a really simple map structure (uint64_t key and value),
which makes the lookup_elem logic fast since it doesn't spend much
time copying the value. My conclusion for the 40% was that this
new implementation only needs 1 syscall per element, compared to the 2
syscalls we needed with the previous implementation, so in this
particular case the number of operations is almost halved. I did one
experiment increasing the value_size (448*64 B) and it was only 14%
faster with a 1-elem_size buffer.
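
To make that concrete, the two walks being compared boil down to
something like this (bpf_map_dump() is the libbpf wrapper added in
patch 4 of this series, with the signature used in the selftests; the
helper names are made up and key/value are uint64_t as in the test):

#include <errno.h>
#include <stdint.h>
#include <bpf/bpf.h>

/* 2 syscalls per element with the existing interface */
static void walk_entry_by_entry(int fd)
{
        uint64_t key, next_key, value;
        int more = !bpf_map_get_next_key(fd, NULL, &key);

        while (more) {
                bpf_map_lookup_elem(fd, &key, &value);             /* syscall 1 */
                more = !bpf_map_get_next_key(fd, &key, &next_key); /* syscall 2 */
                if (more)
                        key = next_key;
        }
}

/* 1 syscall per element with BPF_MAP_DUMP and a 1-element buffer */
static void walk_with_dump(int fd)
{
        uint64_t buf[2], prev_key;      /* room for one key/value pair */
        uint32_t buf_len = sizeof(buf);
        void *prev = NULL;

        while (!bpf_map_dump(fd, prev, buf, &buf_len)) {
                prev_key = buf[0];      /* key of the element just returned */
                prev = &prev_key;
                buf_len = sizeof(buf);
        }
        /* the last call fails with errno == ENOENT once the map is exhausted */
}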

> > than the current implementation, improvements of ~85% are reported when
> > the buffer size is increased, although, after the buffer size is around
> > 5% of the total number of entries there's no huge difference in
> > increasing it.
> >
> > Tested:
> > Tried different size buffers to handle case where the bulk is bigger, or
> > the elements to retrieve are less than the existing ones, all runs read
> > a map of 100K entries. Below are the results(in ns) from the different
> > runs:
> >
> > buf_len_1:       69038725 entry-by-entry: 112384424 improvement
> > 38.569134
> > buf_len_2:       40897447 entry-by-entry: 111030546 improvement
> > 63.165590
> > buf_len_230:     13652714 entry-by-entry: 111694058 improvement
> > 87.776687
> > buf_len_5000:    13576271 entry-by-entry: 111101169 improvement
> > 87.780263
> > buf_len_73000:   14694343 entry-by-entry: 111740162 improvement
> > 86.849542
> > buf_len_100000:  13745969 entry-by-entry: 114151991 improvement
> > 87.958187
> > buf_len_234567:  14329834 entry-by-entry: 114427589 improvement
> > 87.476941
>
> It took me a while to figure out the meaning of 87.476941. It is probably
> a good idea to say 87.5% instead.

Right, will change it in the next version.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 19:54   ` Willem de Bruijn
@ 2019-07-24 22:26     ` Brian Vazquez
  2019-07-24 22:33       ` Willem de Bruijn
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Vazquez @ 2019-07-24 22:26 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, LKML, Network Development, bpf

On Wed, Jul 24, 2019 at 12:55 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 1:10 PM Brian Vazquez <brianvv@google.com> wrote:
> >
> > This introduces a new command to retrieve multiple number of entries
> > from a bpf map, wrapping the existing bpf methods:
> > map_get_next_key and map_lookup_elem
> >
> > To start dumping the map from the beginning you must specify NULL as
> > the prev_key.
> >
> > The new API returns 0 when it successfully copied all the elements
> > requested or it copied less because there weren't more elements to
> > retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.
>
> I think I understand this, but perhaps it can be explained a bit more
> concisely without reference to ENOENT and error masking. The function
> returns the min of the number of requested elements and the number of
> remaining elements in the map from the given starting point
> (prev_key).

That sounds better to me. Thanks!
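
For what it's worth, the consumer loop under those semantics would look
roughly like this (bpf_map_dump() as used in the selftests; uint64_t
key/value assumed, and dump_all() is just an illustrative name):

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <bpf/bpf.h>

static void dump_all(int fd, uint32_t batch)
{
        const uint32_t elem_size = 2 * sizeof(uint64_t); /* key + value */
        uint64_t *buf = malloc(batch * elem_size);
        uint64_t prev_key;
        void *prev = NULL;
        uint32_t buf_len, n, i;

        if (!buf)
                return;

        for (;;) {
                buf_len = batch * elem_size;    /* how much we ask for */
                if (bpf_map_dump(fd, prev, buf, &buf_len))
                        break;                  /* ENOENT: nothing left */

                n = buf_len / elem_size;        /* how much we actually got */
                for (i = 0; i < n; i++) {
                        uint64_t key = buf[2 * i];
                        uint64_t value = buf[2 * i + 1];
                        /* ... consume key/value ... */
                        (void)key; (void)value;
                }
                prev_key = buf[2 * (n - 1)];    /* last key we saw */
                prev = &prev_key;
        }
        free(buf);
}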

> > On a successful call buf and buf_len will contain correct data and in
> > case prev_key was provided (not for the first walk, since prev_key is
> > NULL) it will contain the last_key copied into the prev_key which will
> > simplify next call.
> >
> > Only when it can't find a single element it will return -ENOENT meaning
> > that the map has been entirely walked. When an error is return buf,
> > buf_len and prev_key shouldn't be read nor used.
>
> That's common for error handling. No need to state explicitly.

 Ack

>
> > Because maps can be called from userspace and kernel code, this function
> > can have a scenario where the next_key was found but by the time we
> > try to retrieve the value the element is not there, in this case the
> > function continues and tries to get a new next_key value, skipping the
> > deleted key. If at some point the function find itself trap in a loop,
> > it will return -EINTR.
>
> Good to point this out! I don't think that unbounded continue;
> statements until an interrupt happens is sufficient. Please bound the
> number of retries to a low number.

And what would be a good number? Maybe 3 attempts? And in that case,
what error should be reported?
>
> > The function will try to fit as much as possible in the buf provided and
> > will return -EINVAL if buf_len is smaller than elem_size.
> >
> > QUEUE and STACK maps are not supported.
> >
> > Note that map_dump doesn't guarantee that reading the entire table is
> > consistent since this function is always racing with kernel and user code
> > but the same behaviour is found when the entire table is walked using
> > the current interfaces: map_get_next_key + map_lookup_elem.
>
> > It is also important to note that with  a locked map, the lock is grabbed
> > for 1 entry at the time, meaning that the returned buf might or might not
> > be consistent.
>
> Would it be informative to signal to the caller if the read was
> complete and consistent (because the entire table was read while the
> lock was held)?

Mmm.. not sure how we could signal that to the caller. But I don't
think there's a way to know it was consistent (e.g. if one element is
removed from bucket 20 while you are copying the keys in bucket 15,
then when you get to bucket 20 there's no way to know that some entries
were removed before you traversed it). The lock is held for just a
single entry, not the entire table.
Maybe I should clarify that more in the commit message?

Thanks for reviewing!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 22:26     ` Brian Vazquez
@ 2019-07-24 22:33       ` Willem de Bruijn
  0 siblings, 0 replies; 26+ messages in thread
From: Willem de Bruijn @ 2019-07-24 22:33 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, LKML, Network Development, bpf

> > > Because maps can be called from userspace and kernel code, this function
> > > can have a scenario where the next_key was found but by the time we
> > > try to retrieve the value the element is not there, in this case the
> > > function continues and tries to get a new next_key value, skipping the
> > > deleted key. If at some point the function find itself trap in a loop,
> > > it will return -EINTR.
> >
> > Good to point this out! I don't think that unbounded continue;
> > statements until an interrupt happens is sufficient. Please bound the
> > number of retries to a low number.
>
> And what would it be a good number? Maybe 3 attempts?

3 sounds good to me.

> And in that case what error should be reported?

One that's unambiguous and somewhat intuitive for the given issue.
Perhaps EBUSY?
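
Something along these lines is what I have in mind, purely as a sketch
on top of the loop in this patch (the retries counter is new here, and
the limit of 3 and -EBUSY are just the values discussed in this thread,
nothing authoritative):

        int retries = 0;        /* new: bounds the retries on deleted keys */

        key = buf;
        value = key + map->key_size;
        for (cp_len = 0; cp_len + elem_size <= buf_len;) {
                if (signal_pending(current)) {
                        err = -EINTR;
                        break;
                }

                rcu_read_lock();
                err = map->ops->map_get_next_key(map, prev_key, key);
                rcu_read_unlock();

                if (err)
                        break;

                err = bpf_map_copy_value(map, key, value, attr->dump.flags);
                if (err == -ENOENT) {
                        /* Key vanished between get_next_key and the lookup;
                         * retry a bounded number of times instead of spinning
                         * until a signal arrives.
                         */
                        if (++retries > 3) {
                                err = -EBUSY;
                                break;
                        }
                        continue;
                }
                if (err)
                        goto free_buf;

                if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
                        err = -EFAULT;
                        goto free_buf;
                }

                retries = 0;
                prev_key = key;
                cp_len += elem_size;
        }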

> > > The function will try to fit as much as possible in the buf provided and
> > > will return -EINVAL if buf_len is smaller than elem_size.
> > >
> > > QUEUE and STACK maps are not supported.
> > >
> > > Note that map_dump doesn't guarantee that reading the entire table is
> > > consistent since this function is always racing with kernel and user code
> > > but the same behaviour is found when the entire table is walked using
> > > the current interfaces: map_get_next_key + map_lookup_elem.
> >
> > > It is also important to note that with  a locked map, the lock is grabbed
> > > for 1 entry at the time, meaning that the returned buf might or might not
> > > be consistent.
> >
> > Would it be informative to signal to the caller if the read was
> > complete and consistent (because the entire table was read while the
> > lock was held)?
>
> Mmm.. not sure how we could signal that to the caller.  But I don't
> think there's a way to know it was consistent

Okay, that makes for a simple answer :) No need to try to add a signal, then.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 21:40   ` Song Liu
@ 2019-07-24 22:44     ` Brian Vazquez
  2019-07-24 23:04       ` Song Liu
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Vazquez @ 2019-07-24 22:44 UTC (permalink / raw)
  To: Song Liu
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 2:40 PM Song Liu <liu.song.a23@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
> >
> > This introduces a new command to retrieve multiple number of entries
> > from a bpf map, wrapping the existing bpf methods:
> > map_get_next_key and map_lookup_elem
> >
> > To start dumping the map from the beginning you must specify NULL as
> > the prev_key.
> >
> > The new API returns 0 when it successfully copied all the elements
> > requested or it copied less because there weren't more elements to
> > retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.
> >
> > On a successful call buf and buf_len will contain correct data and in
> > case prev_key was provided (not for the first walk, since prev_key is
> > NULL) it will contain the last_key copied into the prev_key which will
> > simplify next call.
> >
> > Only when it can't find a single element it will return -ENOENT meaning
> > that the map has been entirely walked. When an error is return buf,
> > buf_len and prev_key shouldn't be read nor used.
> >
> > Because maps can be called from userspace and kernel code, this function
> > can have a scenario where the next_key was found but by the time we
> > try to retrieve the value the element is not there, in this case the
> > function continues and tries to get a new next_key value, skipping the
> > deleted key. If at some point the function find itself trap in a loop,
> > it will return -EINTR.
> >
> > The function will try to fit as much as possible in the buf provided and
> > will return -EINVAL if buf_len is smaller than elem_size.
> >
> > QUEUE and STACK maps are not supported.
> >
> > Note that map_dump doesn't guarantee that reading the entire table is
> > consistent since this function is always racing with kernel and user code
> > but the same behaviour is found when the entire table is walked using
> > the current interfaces: map_get_next_key + map_lookup_elem.
> > It is also important to note that with  a locked map, the lock is grabbed
> > for 1 entry at the time, meaning that the returned buf might or might not
> > be consistent.
> >
> > Suggested-by: Stanislav Fomichev <sdf@google.com>
> > Signed-off-by: Brian Vazquez <brianvv@google.com>
> > ---
> >  include/uapi/linux/bpf.h |   9 +++
> >  kernel/bpf/syscall.c     | 117 +++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 126 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index fa1c753dcdbc7..66dab5385170d 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -106,6 +106,7 @@ enum bpf_cmd {
> >         BPF_TASK_FD_QUERY,
> >         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> >         BPF_MAP_FREEZE,
> > +       BPF_MAP_DUMP,
> >  };
> >
> >  enum bpf_map_type {
> > @@ -388,6 +389,14 @@ union bpf_attr {
> >                 __u64           flags;
> >         };
> >
> > +       struct { /* struct used by BPF_MAP_DUMP command */
> > +               __aligned_u64   prev_key;
> > +               __aligned_u64   buf;
> > +               __aligned_u64   buf_len; /* input/output: len of buf */
> > +               __u64           flags;
>
> Please add explanation of flags here.

got it!

> Also, we need to update the
> comments of BPF_F_LOCK for BPF_MAP_DUMP.

What do you mean? I didn't get this part.

>
> > +               __u32           map_fd;
> > +       } dump;
> > +
> >         struct { /* anonymous struct used by BPF_PROG_LOAD command */
> >                 __u32           prog_type;      /* one of enum bpf_prog_type */
> >                 __u32           insn_cnt;
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 86cdc2f7bb56e..0c35505aa219f 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
> >         return err;
> >  }
> >
> > +/* last field in 'union bpf_attr' used by this command */
> > +#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
> > +
> > +static int map_dump(union bpf_attr *attr)
> > +{
> > +       void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
> > +       void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
> > +       u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
> > +       int ufd = attr->dump.map_fd;
> > +       struct bpf_map *map;
> > +       void *buf, *prev_key, *key, *value;
> > +       u32 value_size, elem_size, buf_len, cp_len;
> > +       struct fd f;
> > +       int err;
> > +       bool first_key = false;
> > +
> > +       if (CHECK_ATTR(BPF_MAP_DUMP))
> > +               return -EINVAL;
> > +
> > +       if (attr->dump.flags & ~BPF_F_LOCK)
> > +               return -EINVAL;
> > +
> > +       f = fdget(ufd);
> > +       map = __bpf_map_get(f);
> > +       if (IS_ERR(map))
> > +               return PTR_ERR(map);
> > +       if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
> > +               err = -EPERM;
> > +               goto err_put;
> > +       }
> > +
> > +       if ((attr->dump.flags & BPF_F_LOCK) &&
> > +           !map_value_has_spin_lock(map)) {
> > +               err = -EINVAL;
> > +               goto err_put;
> > +       }
>
> We can share these lines with map_lookup_elem(). Maybe
> add another helper function?

Which lines are you referring to? The dump.flags check? It makes
sense; that way, when a new flag is added, you only need to modify
it in one spot.

> > +
> > +       if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> > +           map->map_type == BPF_MAP_TYPE_STACK) {
> > +               err = -ENOTSUPP;
> > +               goto err_put;
> > +       }
> > +
> > +       value_size = bpf_map_value_size(map);
> > +
> > +       err = get_user(buf_len, ubuf_len);
> > +       if (err)
> > +               goto err_put;
> > +
> > +       elem_size = map->key_size + value_size;
> > +       if (buf_len < elem_size) {
> > +               err = -EINVAL;
> > +               goto err_put;
> > +       }
> > +
> > +       if (ukey) {
> > +               prev_key = __bpf_copy_key(ukey, map->key_size);
> > +               if (IS_ERR(prev_key)) {
> > +                       err = PTR_ERR(prev_key);
> > +                       goto err_put;
> > +               }
> > +       } else {
> > +               prev_key = NULL;
> > +               first_key = true;
> > +       }
> > +
> > +       err = -ENOMEM;
> > +       buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
> > +       if (!buf)
> > +               goto err_put;
> > +
> > +       key = buf;
> > +       value = key + map->key_size;
> > +       for (cp_len = 0; cp_len + elem_size <= buf_len;) {
> > +               if (signal_pending(current)) {
> > +                       err = -EINTR;
> > +                       break;
> > +               }
> > +
> > +               rcu_read_lock();
> > +               err = map->ops->map_get_next_key(map, prev_key, key);
>
> If prev_key is deleted before map_get_next_key(), we get the first key
> again. This is pretty weird.

Yes, I know. But note that this scenario happens even with the
old interface (imagine you are walking a map from userspace and the
prev_key you pass to get_next_key has just been removed: you will start
again from the beginning without noticing it).
I tried to send a patch for this in the past but I was missing some
context: before NULL was used to get the very first_key, the interface
relied on a random (non-existent) key to retrieve the first_key in the
map, and I was told that we still have to support that scenario.

>
> > +               rcu_read_unlock();
> > +
> > +               if (err)
> > +                       break;
> > +
> > +               err = bpf_map_copy_value(map, key, value, attr->dump.flags);
> > +
> > +               if (err == -ENOENT)
> > +                       continue;
> > +               if (err)
> > +                       goto free_buf;
> > +
> > +               if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
> > +                       err = -EFAULT;
> > +                       goto free_buf;
> > +               }
> > +
> > +               prev_key = key;
> > +               cp_len += elem_size;
> > +       }
> > +
> > +       if (err == -ENOENT && cp_len)
> > +               err = 0;
> > +       if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
> > +                   (!first_key && copy_to_user(ukey, key, map->key_size))))
> > +               err = -EFAULT;
> > +free_buf:
> > +       kfree(buf);
> > +err_put:
> > +       fdput(f);
> > +       return err;
> > +}
> > +
> >  #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
> >
> >  static int map_lookup_and_delete_elem(union bpf_attr *attr)
> > @@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> >         case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
> >                 err = map_lookup_and_delete_elem(&attr);
> >                 break;
> > +       case BPF_MAP_DUMP:
> > +               err = map_dump(&attr);
> > +               break;
> >         default:
> >                 err = -EINVAL;
> >                 break;
> > --
> > 2.22.0.657.g960e92d24f-goog
> >

Thanks for reviewing it!!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 22:44     ` Brian Vazquez
@ 2019-07-24 23:04       ` Song Liu
  2019-07-25 23:25         ` Brian Vazquez
  0 siblings, 1 reply; 26+ messages in thread
From: Song Liu @ 2019-07-24 23:04 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 3:44 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 2:40 PM Song Liu <liu.song.a23@gmail.com> wrote:
> >
> > On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
> > >
> > > This introduces a new command to retrieve multiple number of entries
> > > from a bpf map, wrapping the existing bpf methods:
> > > map_get_next_key and map_lookup_elem
> > >
> > > To start dumping the map from the beginning you must specify NULL as
> > > the prev_key.
> > >
> > > The new API returns 0 when it successfully copied all the elements
> > > requested or it copied less because there weren't more elements to
> > > retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.
> > >
> > > On a successful call buf and buf_len will contain correct data and in
> > > case prev_key was provided (not for the first walk, since prev_key is
> > > NULL) it will contain the last_key copied into the prev_key which will
> > > simplify next call.
> > >
> > > Only when it can't find a single element it will return -ENOENT meaning
> > > that the map has been entirely walked. When an error is return buf,
> > > buf_len and prev_key shouldn't be read nor used.
> > >
> > > Because maps can be called from userspace and kernel code, this function
> > > can have a scenario where the next_key was found but by the time we
> > > try to retrieve the value the element is not there, in this case the
> > > function continues and tries to get a new next_key value, skipping the
> > > deleted key. If at some point the function find itself trap in a loop,
> > > it will return -EINTR.
> > >
> > > The function will try to fit as much as possible in the buf provided and
> > > will return -EINVAL if buf_len is smaller than elem_size.
> > >
> > > QUEUE and STACK maps are not supported.
> > >
> > > Note that map_dump doesn't guarantee that reading the entire table is
> > > consistent since this function is always racing with kernel and user code
> > > but the same behaviour is found when the entire table is walked using
> > > the current interfaces: map_get_next_key + map_lookup_elem.
> > > It is also important to note that with  a locked map, the lock is grabbed
> > > for 1 entry at the time, meaning that the returned buf might or might not
> > > be consistent.
> > >
> > > Suggested-by: Stanislav Fomichev <sdf@google.com>
> > > Signed-off-by: Brian Vazquez <brianvv@google.com>
> > > ---
> > >  include/uapi/linux/bpf.h |   9 +++
> > >  kernel/bpf/syscall.c     | 117 +++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 126 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index fa1c753dcdbc7..66dab5385170d 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -106,6 +106,7 @@ enum bpf_cmd {
> > >         BPF_TASK_FD_QUERY,
> > >         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> > >         BPF_MAP_FREEZE,
> > > +       BPF_MAP_DUMP,
> > >  };
> > >
> > >  enum bpf_map_type {
> > > @@ -388,6 +389,14 @@ union bpf_attr {
> > >                 __u64           flags;
> > >         };
> > >
> > > +       struct { /* struct used by BPF_MAP_DUMP command */
> > > +               __aligned_u64   prev_key;
> > > +               __aligned_u64   buf;
> > > +               __aligned_u64   buf_len; /* input/output: len of buf */
> > > +               __u64           flags;
> >
> > Please add explanation of flags here.
>
> got it!
>
> > Also, we need to update the
> > comments of BPF_F_LOCK for BPF_MAP_DUMP.
>
> What do you mean? I didn't get this part.

I meant, the current comment says BPF_F_LOCK is for BPF_MAP_UPDATE_ELEM.
But it is also used by BPF_MAP_LOOKUP_ELEM and BPF_MAP_DUMP, so the
current comment is not accurate either.

Maybe fix it while you are at it?
>
> >
> > > +               __u32           map_fd;
> > > +       } dump;
> > > +
> > >         struct { /* anonymous struct used by BPF_PROG_LOAD command */
> > >                 __u32           prog_type;      /* one of enum bpf_prog_type */
> > >                 __u32           insn_cnt;
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 86cdc2f7bb56e..0c35505aa219f 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
> > >         return err;
> > >  }
> > >
> > > +/* last field in 'union bpf_attr' used by this command */
> > > +#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
> > > +
> > > +static int map_dump(union bpf_attr *attr)
> > > +{
> > > +       void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
> > > +       void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
> > > +       u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
> > > +       int ufd = attr->dump.map_fd;
> > > +       struct bpf_map *map;
> > > +       void *buf, *prev_key, *key, *value;
> > > +       u32 value_size, elem_size, buf_len, cp_len;
> > > +       struct fd f;
> > > +       int err;
> > > +       bool first_key = false;
> > > +
> > > +       if (CHECK_ATTR(BPF_MAP_DUMP))
> > > +               return -EINVAL;
> > > +
> > > +       if (attr->dump.flags & ~BPF_F_LOCK)
> > > +               return -EINVAL;
> > > +
> > > +       f = fdget(ufd);
> > > +       map = __bpf_map_get(f);
> > > +       if (IS_ERR(map))
> > > +               return PTR_ERR(map);
> > > +       if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
> > > +               err = -EPERM;
> > > +               goto err_put;
> > > +       }
> > > +
> > > +       if ((attr->dump.flags & BPF_F_LOCK) &&
> > > +           !map_value_has_spin_lock(map)) {
> > > +               err = -EINVAL;
> > > +               goto err_put;
> > > +       }
> >
> > We can share these lines with map_lookup_elem(). Maybe
> > add another helper function?
>
> Which are the lines you are referring to? the dump.flags? It makes
> sense so that way when a new flag is added you only need to modify
> them in one spot.

I think I misread it. attr->dump.flags is not the same as attr->flags.

So never mind.

>
> > > +
> > > +       if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> > > +           map->map_type == BPF_MAP_TYPE_STACK) {
> > > +               err = -ENOTSUPP;
> > > +               goto err_put;
> > > +       }
> > > +
> > > +       value_size = bpf_map_value_size(map);
> > > +
> > > +       err = get_user(buf_len, ubuf_len);
> > > +       if (err)
> > > +               goto err_put;
> > > +
> > > +       elem_size = map->key_size + value_size;
> > > +       if (buf_len < elem_size) {
> > > +               err = -EINVAL;
> > > +               goto err_put;
> > > +       }
> > > +
> > > +       if (ukey) {
> > > +               prev_key = __bpf_copy_key(ukey, map->key_size);
> > > +               if (IS_ERR(prev_key)) {
> > > +                       err = PTR_ERR(prev_key);
> > > +                       goto err_put;
> > > +               }
> > > +       } else {
> > > +               prev_key = NULL;
> > > +               first_key = true;
> > > +       }
> > > +
> > > +       err = -ENOMEM;
> > > +       buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
> > > +       if (!buf)
> > > +               goto err_put;
> > > +
> > > +       key = buf;
> > > +       value = key + map->key_size;
> > > +       for (cp_len = 0; cp_len + elem_size <= buf_len;) {
> > > +               if (signal_pending(current)) {
> > > +                       err = -EINTR;
> > > +                       break;
> > > +               }
> > > +
> > > +               rcu_read_lock();
> > > +               err = map->ops->map_get_next_key(map, prev_key, key);
> >
> > If prev_key is deleted before map_get_next_key(), we get the first key
> > again. This is pretty weird.
>
> Yes, I know. But note that the current scenario happens even for the
> old interface (imagine you are walking a map from userspace and you
> tried get_next_key the prev_key was removed, you will start again from
> the beginning without noticing it).
> I tried to sent a patch in the past but I was missing some context:
> before NULL was used to get the very first_key the interface relied in
> a random (non existent) key to retrieve the first_key in the map, and
> I was told what we still have to support that scenario.

BPF_MAP_DUMP is slightly different, as you may return the first key
multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
don't have to support legacy scenarios.

Since BPF_MAP_DUMP keeps a list of elements, it is possible to try
to look up previous keys. Would something along this direction work?

Thanks,
Song

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/
  2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
  2019-07-24 21:41   ` Song Liu
@ 2019-07-24 23:10   ` Andrii Nakryiko
  2019-07-25 23:27     ` Brian Vazquez
  1 sibling, 1 reply; 26+ messages in thread
From: Andrii Nakryiko @ 2019-07-24 23:10 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> Adds bpf_attr.dump structure to libbpf.
>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  tools/include/uapi/linux/bpf.h | 9 +++++++++
>  tools/lib/bpf/libbpf.map       | 2 ++
>  2 files changed, 11 insertions(+)
>
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 4e455018da65f..e127f16e4e932 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -106,6 +106,7 @@ enum bpf_cmd {
>         BPF_TASK_FD_QUERY,
>         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
>         BPF_MAP_FREEZE,
> +       BPF_MAP_DUMP,
>  };
>
>  enum bpf_map_type {
> @@ -388,6 +389,14 @@ union bpf_attr {
>                 __u64           flags;
>         };
>
> +       struct { /* struct used by BPF_MAP_DUMP command */
> +               __aligned_u64   prev_key;
> +               __aligned_u64   buf;
> +               __aligned_u64   buf_len; /* input/output: len of buf */
> +               __u64           flags;
> +               __u32           map_fd;
> +       } dump;
> +
>         struct { /* anonymous struct used by BPF_PROG_LOAD command */
>                 __u32           prog_type;      /* one of enum bpf_prog_type */
>                 __u32           insn_cnt;
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index f9d316e873d8d..cac3723d5c45c 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -183,4 +183,6 @@ LIBBPF_0.0.4 {

LIBBPF_0.0.4 is closed, this needs to go into LIBBPF_0.0.5.

>                 perf_buffer__new;
>                 perf_buffer__new_raw;
>                 perf_buffer__poll;
> +               bpf_map_dump;
> +               bpf_map_dump_flags;

As the general rule, please keep those lists of functions in alphabetical order.

>  } LIBBPF_0.0.3;
> --
> 2.22.0.657.g960e92d24f-goog
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 23:04       ` Song Liu
@ 2019-07-25 23:25         ` Brian Vazquez
  2019-07-25 23:54           ` Alexei Starovoitov
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Vazquez @ 2019-07-25 23:25 UTC (permalink / raw)
  To: Song Liu
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

> > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > again. This is pretty weird.
> >
> > Yes, I know. But note that the current scenario happens even for the
> > old interface (imagine you are walking a map from userspace and you
> > tried get_next_key the prev_key was removed, you will start again from
> > the beginning without noticing it).
> > I tried to sent a patch in the past but I was missing some context:
> > before NULL was used to get the very first_key the interface relied in
> > a random (non existent) key to retrieve the first_key in the map, and
> > I was told what we still have to support that scenario.
>
> BPF_MAP_DUMP is slightly different, as you may return the first key
> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> don't have to support legacy scenarios.
>
> Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> to look up previous keys. Would something down this direction work?

I've been thinking about it and I think we first need a way to detect
that, because the key was not present, we got the first_key instead:

- One solution I had in mind was to explicitly ask for the first key
with map_get_next_key(map, NULL, first_key) and, while walking the map,
check that map_get_next_key(map, prev_key, key) doesn't return that
same key again. This could be done using memcmp (see the sketch below).
- Discussing with Stan, he mentioned that another option is to support
a flag in map_get_next_key to let it know that we want an error
instead of the first_key.

After detecting the problem we also need to define what we want to do.
Here are some options:

a) Return the error to the caller.
b) Retry with previous keys, if any (which would be limited to the keys
we have traversed so far in this dump call).
c) Continue with the next entries in the map. An array is easy, just get
the next valid key (starting at i+1), but a hashmap might be difficult:
starting at the next bucket could potentially skip some keys that were
concurrently added to the same bucket where the key used to be, and
starting at the same bucket could lead us to return repeated elements.

Or maybe we could support those 3 cases via flags and let the caller
decide which one to use?

Let me know what your thoughts are.
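
To make the first option concrete, from the caller's side the check I'm
picturing is roughly the following (illustrative only: the function name
is made up and keys are uint64_t as in the selftests):

#include <errno.h>
#include <string.h>
#include <stdint.h>
#include <bpf/bpf.h>

/* Returns 0 when the whole map was walked, -EBUSY if the walk wrapped
 * around because prev_key was deleted underneath us, -errno otherwise.
 */
static int walk_map_detect_wrap(int fd)
{
        uint64_t first_key, key, next_key;

        if (bpf_map_get_next_key(fd, NULL, &first_key))
                return errno == ENOENT ? 0 : -errno;    /* empty map */

        key = first_key;
        for (;;) {
                /* ... lookup/consume key here ... */
                if (bpf_map_get_next_key(fd, &key, &next_key))
                        return errno == ENOENT ? 0 : -errno;
                if (!memcmp(&next_key, &first_key, sizeof(first_key)))
                        return -EBUSY;  /* key was removed, walk restarted */
                key = next_key;
        }
}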

Thanks,
Brian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/
  2019-07-24 23:10   ` Andrii Nakryiko
@ 2019-07-25 23:27     ` Brian Vazquez
  0 siblings, 0 replies; 26+ messages in thread
From: Brian Vazquez @ 2019-07-25 23:27 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 4:10 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
> >
> > Adds bpf_attr.dump structure to libbpf.
> >
> > Suggested-by: Stanislav Fomichev <sdf@google.com>
> > Signed-off-by: Brian Vazquez <brianvv@google.com>
> > ---
> >  tools/include/uapi/linux/bpf.h | 9 +++++++++
> >  tools/lib/bpf/libbpf.map       | 2 ++
> >  2 files changed, 11 insertions(+)
> >
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 4e455018da65f..e127f16e4e932 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -106,6 +106,7 @@ enum bpf_cmd {
> >         BPF_TASK_FD_QUERY,
> >         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> >         BPF_MAP_FREEZE,
> > +       BPF_MAP_DUMP,
> >  };
> >
> >  enum bpf_map_type {
> > @@ -388,6 +389,14 @@ union bpf_attr {
> >                 __u64           flags;
> >         };
> >
> > +       struct { /* struct used by BPF_MAP_DUMP command */
> > +               __aligned_u64   prev_key;
> > +               __aligned_u64   buf;
> > +               __aligned_u64   buf_len; /* input/output: len of buf */
> > +               __u64           flags;
> > +               __u32           map_fd;
> > +       } dump;
> > +
> >         struct { /* anonymous struct used by BPF_PROG_LOAD command */
> >                 __u32           prog_type;      /* one of enum bpf_prog_type */
> >                 __u32           insn_cnt;
> > diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> > index f9d316e873d8d..cac3723d5c45c 100644
> > --- a/tools/lib/bpf/libbpf.map
> > +++ b/tools/lib/bpf/libbpf.map
> > @@ -183,4 +183,6 @@ LIBBPF_0.0.4 {
>
> LIBBPF_0.0.4 is closed, this needs to go into LIBBPF_0.0.5.

Sorry, my bad, I didn't look closely at the rebase so I got this wrong.

>
> >                 perf_buffer__new;
> >                 perf_buffer__new_raw;
> >                 perf_buffer__poll;
> > +               bpf_map_dump;
> > +               bpf_map_dump_flags;
>
> As the general rule, please keep those lists of functions in alphabetical order.

right.

I will fix it in the next version, thanks for reviewing it!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-25 23:25         ` Brian Vazquez
@ 2019-07-25 23:54           ` Alexei Starovoitov
  2019-07-26  1:02             ` Willem de Bruijn
  2019-07-26  1:24             ` Brian Vazquez
  0 siblings, 2 replies; 26+ messages in thread
From: Alexei Starovoitov @ 2019-07-25 23:54 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Song Liu, Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> > > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > > again. This is pretty weird.
> > >
> > > Yes, I know. But note that the current scenario happens even for the
> > > old interface (imagine you are walking a map from userspace and you
> > > tried get_next_key the prev_key was removed, you will start again from
> > > the beginning without noticing it).
> > > I tried to sent a patch in the past but I was missing some context:
> > > before NULL was used to get the very first_key the interface relied in
> > > a random (non existent) key to retrieve the first_key in the map, and
> > > I was told what we still have to support that scenario.
> >
> > BPF_MAP_DUMP is slightly different, as you may return the first key
> > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> > don't have to support legacy scenarios.
> >
> > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> > to look up previous keys. Would something down this direction work?
> 
> I've been thinking about it and I think first we need a way to detect
> that since key was not present we got the first_key instead:
> 
> - One solution I had in mind was to explicitly asked for the first key
> with map_get_next_key(map, NULL, first_key) and while walking the map
> check that map_get_next_key(map, prev_key, key) doesn't return the
> same key. This could be done using memcmp.
> - Discussing with Stan, he mentioned that another option is to support
> a flag in map_get_next_key to let it know that we want an error
> instead of the first_key.
> 
> After detecting the problem we also need to define what we want to do,
> here some options:
> 
> a) Return the error to the caller
> b) Try with previous keys if any (which be limited to the keys that we
> have traversed so far in this dump call)
> c) continue with next entries in the map. array is easy just get the
> next valid key (starting on i+1), but hmap might be difficult since
> starting on the next bucket could potentially skip some keys that were
> concurrently added to the same bucket where key used to be, and
> starting on the same bucket could lead us to return repeated elements.
> 
> Or maybe we could support those 3 cases via flags and let the caller
> decide which one to use?

this type of indecision is the reason why I wasn't excited about
batch dumping in the first place and gave a 'soft yes' when Stan
mentioned it during the lsf/mm/bpf uconf.
We probably shouldn't do it.
It feels like this map_dump makes the api more complex and doesn't really
give much benefit to the user other than that a large map dump becomes faster.
I think we gotta solve this problem differently.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-25 23:54           ` Alexei Starovoitov
@ 2019-07-26  1:02             ` Willem de Bruijn
  2019-07-26  1:24             ` Brian Vazquez
  1 sibling, 0 replies; 26+ messages in thread
From: Willem de Bruijn @ 2019-07-26  1:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brian Vazquez, Song Liu, Brian Vazquez, Alexei Starovoitov,
	Daniel Borkmann, David S . Miller, Stanislav Fomichev,
	Petar Penkov, open list, Networking, bpf

On Thu, Jul 25, 2019 at 7:54 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> > > > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > > > again. This is pretty weird.
> > > >
> > > > Yes, I know. But note that the current scenario happens even for the
> > > > old interface (imagine you are walking a map from userspace and you
> > > > tried get_next_key the prev_key was removed, you will start again from
> > > > the beginning without noticing it).
> > > > I tried to sent a patch in the past but I was missing some context:
> > > > before NULL was used to get the very first_key the interface relied in
> > > > a random (non existent) key to retrieve the first_key in the map, and
> > > > I was told what we still have to support that scenario.
> > >
> > > BPF_MAP_DUMP is slightly different, as you may return the first key
> > > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> > > don't have to support legacy scenarios.
> > >
> > > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> > > to look up previous keys. Would something down this direction work?
> >
> > I've been thinking about it and I think first we need a way to detect
> > that since key was not present we got the first_key instead:
> >
> > - One solution I had in mind was to explicitly asked for the first key
> > with map_get_next_key(map, NULL, first_key) and while walking the map
> > check that map_get_next_key(map, prev_key, key) doesn't return the
> > same key. This could be done using memcmp.
> > - Discussing with Stan, he mentioned that another option is to support
> > a flag in map_get_next_key to let it know that we want an error
> > instead of the first_key.
> >
> > After detecting the problem we also need to define what we want to do,
> > here some options:
> >
> > a) Return the error to the caller
> > b) Try with previous keys if any (which be limited to the keys that we
> > have traversed so far in this dump call)
> > c) continue with next entries in the map. array is easy just get the
> > next valid key (starting on i+1), but hmap might be difficult since
> > starting on the next bucket could potentially skip some keys that were
> > concurrently added to the same bucket where key used to be, and
> > starting on the same bucket could lead us to return repeated elements.
> >
> > Or maybe we could support those 3 cases via flags and let the caller
> > decide which one to use?
>
> this type of indecision is the reason why I wasn't excited about
> batch dumping in the first place and gave 'soft yes' when Stan
> mentioned it during lsf/mm/bpf uconf.
> We probably shouldn't do it.
> It feels this map_dump makes api more complex and doesn't really
> give much benefit to the user other than large map dump becomes faster.
> I think we gotta solve this problem differently.

Multiple variants with flags indeed make the API complex. I think the
kernel should expose only the simplest, most obvious behavior that
allows the application to recover. In this case, that sounds like
option (a) and restart.

In practice, the common use case is to allocate enough user memory to
read an entire table in one go, in which case the entire issue is
moot.

The cycle savings of dump are significant for large tables. I'm not
sure how we achieve that differently and even simpler? We originally
looked at shared memory, but that is obviously much more complex.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-25 23:54           ` Alexei Starovoitov
  2019-07-26  1:02             ` Willem de Bruijn
@ 2019-07-26  1:24             ` Brian Vazquez
  1 sibling, 0 replies; 26+ messages in thread
From: Brian Vazquez @ 2019-07-26  1:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Song Liu, Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> > > > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > > > again. This is pretty weird.
> > > >
> > > > Yes, I know. But note that the current scenario happens even for the
> > > > old interface (imagine you are walking a map from userspace and you
> > > > tried get_next_key the prev_key was removed, you will start again from
> > > > the beginning without noticing it).
> > > > I tried to sent a patch in the past but I was missing some context:
> > > > before NULL was used to get the very first_key the interface relied in
> > > > a random (non existent) key to retrieve the first_key in the map, and
> > > > I was told what we still have to support that scenario.
> > >
> > > BPF_MAP_DUMP is slightly different, as you may return the first key
> > > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> > > don't have to support legacy scenarios.
> > >
> > > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> > > to look up previous keys. Would something down this direction work?
> >
> > I've been thinking about it and I think first we need a way to detect
> > that since key was not present we got the first_key instead:
> >
> > - One solution I had in mind was to explicitly asked for the first key
> > with map_get_next_key(map, NULL, first_key) and while walking the map
> > check that map_get_next_key(map, prev_key, key) doesn't return the
> > same key. This could be done using memcmp.
> > - Discussing with Stan, he mentioned that another option is to support
> > a flag in map_get_next_key to let it know that we want an error
> > instead of the first_key.
> >
> > After detecting the problem we also need to define what we want to do,
> > here some options:
> >
> > a) Return the error to the caller
> > b) Try with previous keys if any (which be limited to the keys that we
> > have traversed so far in this dump call)
> > c) continue with next entries in the map. array is easy just get the
> > next valid key (starting on i+1), but hmap might be difficult since
> > starting on the next bucket could potentially skip some keys that were
> > concurrently added to the same bucket where key used to be, and
> > starting on the same bucket could lead us to return repeated elements.
> >
> > Or maybe we could support those 3 cases via flags and let the caller
> > decide which one to use?
>
> this type of indecision is the reason why I wasn't excited about
> batch dumping in the first place and gave 'soft yes' when Stan
> mentioned it during lsf/mm/bpf uconf.
> We probably shouldn't do it.
> It feels this map_dump makes api more complex and doesn't really
> give much benefit to the user other than large map dump becomes faster.
> I think we gotta solve this problem differently.

Some users are working around the dumping problems with the existing
API by creating a bpf_map_get_next_key_and_delete userspace function
(see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/),
which in my opinion is actually a good idea. The only problem with
that is that calling bpf_map_get_next_key(fd, key, next_key) and then
bpf_map_delete_elem(fd, key) from userspace races with kernel code
and it might lose some information when deleting (a rough sketch of
that pattern is below).
We could then do map_dump_and_delete using that idea but in the kernel,
where we could handle the race condition better. In that scenario,
even if we retrieve the same key it will contain different info (the
delta between the old and new value). Would that work?
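
For reference, the userspace pattern I'm referring to is roughly the
following (simplified, uint64_t keys/values assumed, drain_map() is an
illustrative name; the race between lookup and delete is exactly the
problem mentioned above):

#include <errno.h>
#include <stdint.h>
#include <bpf/bpf.h>

/* Walk and delete every element. get_next_key must run before the
 * delete, otherwise deleting key and then asking for its successor
 * restarts the walk from the beginning. Updates that land between the
 * lookup and the delete are still lost.
 */
static int drain_map(int fd)
{
        uint64_t key, next_key, value;
        int has_next, next_errno;

        if (bpf_map_get_next_key(fd, NULL, &key))
                return errno == ENOENT ? 0 : -errno;

        do {
                has_next = !bpf_map_get_next_key(fd, &key, &next_key);
                next_errno = errno;

                if (!bpf_map_lookup_elem(fd, &key, &value)) {
                        /* ... consume value ... */
                }
                bpf_map_delete_elem(fd, &key);  /* races with concurrent updates */

                key = next_key;
        } while (has_next);

        return next_errno == ENOENT ? 0 : -next_errno;
}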

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2019-07-26  1:24 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bp_map_copy_value helper functions Brian Vazquez
2019-07-24 20:53   ` Song Liu
2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
2019-07-24 19:54   ` Willem de Bruijn
2019-07-24 22:26     ` Brian Vazquez
2019-07-24 22:33       ` Willem de Bruijn
2019-07-24 21:40   ` Song Liu
2019-07-24 22:44     ` Brian Vazquez
2019-07-24 23:04       ` Song Liu
2019-07-25 23:25         ` Brian Vazquez
2019-07-25 23:54           ` Alexei Starovoitov
2019-07-26  1:02             ` Willem de Bruijn
2019-07-26  1:24             ` Brian Vazquez
2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
2019-07-24 21:41   ` Song Liu
2019-07-24 23:10   ` Andrii Nakryiko
2019-07-25 23:27     ` Brian Vazquez
2019-07-24 16:58 ` [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command Brian Vazquez
2019-07-24 19:51   ` Willem de Bruijn
2019-07-24 16:58 ` [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap Brian Vazquez
2019-07-24 21:58   ` Song Liu
2019-07-24 16:58 ` [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP Brian Vazquez
2019-07-24 22:01   ` Song Liu
2019-07-24 19:20 ` [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Song Liu
2019-07-24 22:15   ` Brian Vazquez

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).