* [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
@ 2019-07-24 16:57 Brian Vazquez
  2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bpf_map_copy_value helper functions Brian Vazquez
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:57 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

This introduces a new command to retrieve multiple entries from a
bpf map.

This new command can be executed from the existing BPF syscall as
follows:

err = bpf(BPF_MAP_DUMP, union bpf_attr *attr, u32 size)
using attr->dump.map_fd, attr->dump.prev_key, attr->dump.buf and
attr->dump.buf_len.
It returns zero or a negative error, and populates buf and buf_len on
success.
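
As a rough userspace sketch (assuming the uapi additions from this
series; the dump_chunk() helper below is hypothetical and only packs
the attr):

	/* needs <linux/bpf.h>, <string.h>, <sys/syscall.h>, <unistd.h> */
	static int dump_chunk(int map_fd, void *prev_key, void *buf,
			      __u32 *buf_len)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.dump.map_fd   = map_fd;
		attr.dump.prev_key = (__u64)(unsigned long)prev_key;
		attr.dump.buf      = (__u64)(unsigned long)buf;
		attr.dump.buf_len  = (__u64)(unsigned long)buf_len;

		/* returns 0 and rewrites *buf_len on success */
		return syscall(__NR_bpf, BPF_MAP_DUMP, &attr, sizeof(attr));
	}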

This implementation is wrapping the existing bpf methods:
map_get_next_key and map_lookup_elem

Note that this implementation can be extended later to do dump and
delete by extending map_lookup_and_delete_elem (currently it only works
for bpf queue/stack maps), using either a new flag in map_dump or a new
command, map_dump_and_delete.

Results show that even with a 1-elem_size buffer, it runs ~40% faster
than the current implementation; improvements of ~85% are reported when
the buffer size is increased. However, once the buffer size reaches
around 5% of the total number of entries, increasing it further makes
no big difference.

Tested:
Tried different buffer sizes to handle the cases where the bulk is
bigger than, or the elements to retrieve are fewer than, the existing
ones; all runs read a map of 100K entries. Below are the results (in
ns) from the different runs:

buf_len_1:       69038725 entry-by-entry: 112384424 improvement 38.569134
buf_len_2:       40897447 entry-by-entry: 111030546 improvement 63.165590
buf_len_230:     13652714 entry-by-entry: 111694058 improvement 87.776687
buf_len_5000:    13576271 entry-by-entry: 111101169 improvement 87.780263
buf_len_73000:   14694343 entry-by-entry: 111740162 improvement 86.849542
buf_len_100000:  13745969 entry-by-entry: 114151991 improvement 87.958187
buf_len_234567:  14329834 entry-by-entry: 114427589 improvement 87.476941

The series of patches are split as follows:

- First patch moves some map_lookup_elem logic into 2 functions to
deduplicate code: bpf_map_value_size and bpf_map_copy_value
- Second patch introduces the map_dump function
- Third patch syncs the tools/ linux headers
- Fourth patch adds libbpf support
- Last two patches add tests

RFC Changelog:

- remove wrong usage of attr.flags
- move map_fd to remove hole after it

v3:
- add explanation of the API in the commit message
- fix masked errors and return them to user
- copy last_key from return buf into prev_key if it was provided
- run perf test with kpti and retpoline mitigations

v2:
- use proper bpf-next tag

Brian Vazquez (6):
  bpf: add bpf_map_value_size and bpf_map_copy_value helper functions
  bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  bpf: keep bpf.h in sync with tools/
  libbpf: support BPF_MAP_DUMP command
  selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap
  selftests/bpf: add test to measure performance of BPF_MAP_DUMP

 include/uapi/linux/bpf.h                |   9 +
 kernel/bpf/syscall.c                    | 251 ++++++++++++++++++------
 tools/include/uapi/linux/bpf.h          |   9 +
 tools/lib/bpf/bpf.c                     |  28 +++
 tools/lib/bpf/bpf.h                     |   4 +
 tools/lib/bpf/libbpf.map                |   2 +
 tools/testing/selftests/bpf/test_maps.c | 148 +++++++++++++-
 7 files changed, 388 insertions(+), 63 deletions(-)

-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bpf_map_copy_value helper functions
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
@ 2019-07-24 16:57 ` Brian Vazquez
  2019-07-24 20:53   ` Song Liu
  2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:57 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

Move reusable code from map_lookup_elem to helper functions to avoid code
duplication in kernel/bpf/syscall.c

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 kernel/bpf/syscall.c | 134 +++++++++++++++++++++++--------------------
 1 file changed, 73 insertions(+), 61 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5d141f16f6fa9..86cdc2f7bb56e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -126,6 +126,76 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
 	return map;
 }
 
+static u32 bpf_map_value_size(struct bpf_map *map)
+{
+	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
+	    map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
+	    map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
+		return round_up(map->value_size, 8) * num_possible_cpus();
+	else if (IS_FD_MAP(map))
+		return sizeof(u32);
+	else
+		return  map->value_size;
+}
+
+static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
+			      __u64 flags)
+{
+	void *ptr;
+	int err;
+
+	if (bpf_map_is_dev_bound(map))
+		return  bpf_map_offload_lookup_elem(map, key, value);
+
+	preempt_disable();
+	this_cpu_inc(bpf_prog_active);
+	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
+		err = bpf_percpu_hash_copy(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
+		err = bpf_percpu_array_copy(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
+		err = bpf_percpu_cgroup_storage_copy(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
+		err = bpf_stackmap_copy(map, key, value);
+	} else if (IS_FD_ARRAY(map)) {
+		err = bpf_fd_array_map_lookup_elem(map, key, value);
+	} else if (IS_FD_HASH(map)) {
+		err = bpf_fd_htab_map_lookup_elem(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
+		err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+		   map->map_type == BPF_MAP_TYPE_STACK) {
+		err = map->ops->map_peek_elem(map, value);
+	} else {
+		rcu_read_lock();
+		if (map->ops->map_lookup_elem_sys_only)
+			ptr = map->ops->map_lookup_elem_sys_only(map, key);
+		else
+			ptr = map->ops->map_lookup_elem(map, key);
+		if (IS_ERR(ptr)) {
+			err = PTR_ERR(ptr);
+		} else if (!ptr) {
+			err = -ENOENT;
+		} else {
+			err = 0;
+			if (flags & BPF_F_LOCK)
+				/* lock 'ptr' and copy everything but lock */
+				copy_map_value_locked(map, value, ptr, true);
+			else
+				copy_map_value(map, value, ptr);
+			/* mask lock, since value wasn't zero inited */
+			check_and_init_map_lock(map, value);
+		}
+		rcu_read_unlock();
+	}
+	this_cpu_dec(bpf_prog_active);
+	preempt_enable();
+
+	return err;
+}
+
 void *bpf_map_area_alloc(size_t size, int numa_node)
 {
 	/* We really just want to fail instead of triggering OOM killer
@@ -729,7 +799,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 	void __user *uvalue = u64_to_user_ptr(attr->value);
 	int ufd = attr->map_fd;
 	struct bpf_map *map;
-	void *key, *value, *ptr;
+	void *key, *value;
 	u32 value_size;
 	struct fd f;
 	int err;
@@ -761,72 +831,14 @@ static int map_lookup_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
-	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
-	    map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
-	    map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
-		value_size = round_up(map->value_size, 8) * num_possible_cpus();
-	else if (IS_FD_MAP(map))
-		value_size = sizeof(u32);
-	else
-		value_size = map->value_size;
+	value_size = bpf_map_value_size(map);
 
 	err = -ENOMEM;
 	value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
 	if (!value)
 		goto free_key;
 
-	if (bpf_map_is_dev_bound(map)) {
-		err = bpf_map_offload_lookup_elem(map, key, value);
-		goto done;
-	}
-
-	preempt_disable();
-	this_cpu_inc(bpf_prog_active);
-	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
-	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
-		err = bpf_percpu_hash_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
-		err = bpf_percpu_array_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
-		err = bpf_percpu_cgroup_storage_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
-		err = bpf_stackmap_copy(map, key, value);
-	} else if (IS_FD_ARRAY(map)) {
-		err = bpf_fd_array_map_lookup_elem(map, key, value);
-	} else if (IS_FD_HASH(map)) {
-		err = bpf_fd_htab_map_lookup_elem(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
-		err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
-		   map->map_type == BPF_MAP_TYPE_STACK) {
-		err = map->ops->map_peek_elem(map, value);
-	} else {
-		rcu_read_lock();
-		if (map->ops->map_lookup_elem_sys_only)
-			ptr = map->ops->map_lookup_elem_sys_only(map, key);
-		else
-			ptr = map->ops->map_lookup_elem(map, key);
-		if (IS_ERR(ptr)) {
-			err = PTR_ERR(ptr);
-		} else if (!ptr) {
-			err = -ENOENT;
-		} else {
-			err = 0;
-			if (attr->flags & BPF_F_LOCK)
-				/* lock 'ptr' and copy everything but lock */
-				copy_map_value_locked(map, value, ptr, true);
-			else
-				copy_map_value(map, value, ptr);
-			/* mask lock, since value wasn't zero inited */
-			check_and_init_map_lock(map, value);
-		}
-		rcu_read_unlock();
-	}
-	this_cpu_dec(bpf_prog_active);
-	preempt_enable();
-
-done:
+	err = bpf_map_copy_value(map, key, value, attr->flags);
 	if (err)
 		goto free_value;
 
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
  2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bpf_map_copy_value helper functions Brian Vazquez
@ 2019-07-24 16:57 ` Brian Vazquez
  2019-07-24 19:54   ` Willem de Bruijn
  2019-07-24 21:40   ` Song Liu
  2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 31+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:57 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

This introduces a new command to retrieve multiple entries from a bpf
map, wrapping the existing bpf methods:
map_get_next_key and map_lookup_elem

To start dumping the map from the beginning you must specify NULL as
the prev_key.

The new API returns 0 when it successfully copied all the elements
requested, or when it copied fewer because there were no more elements
to retrieve (i.e. err == -ENOENT). In the latter scenario err is masked
to 0.

On a successful call buf and buf_len contain valid data, and if
prev_key was provided (not the case for the first walk, where prev_key
is NULL) the last key copied is written back into prev_key, which
simplifies the next call.

Only when it can't find a single element will it return -ENOENT,
meaning that the map has been entirely walked. When an error is
returned, buf, buf_len and prev_key shouldn't be read or used.
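
As a caller-side sketch of these semantics (assuming u64 keys, a
caller-allocated char buf[BUF_SIZE], elem_size = key_size + value_size
and a hypothetical dump_chunk() wrapper that just issues the
BPF_MAP_DUMP syscall; consume() stands in for whatever processes the
copied elements):

	__u64 prev_key;
	__u32 buf_len;
	bool first = true;
	int err;

	for (;;) {
		buf_len = BUF_SIZE;	/* reset to the full buffer each call */
		err = dump_chunk(map_fd, first ? NULL : &prev_key, buf, &buf_len);
		if (err) {
			if (errno == ENOENT)	/* map entirely walked */
				err = 0;
			break;
		}
		/* buf holds buf_len bytes of key/value pairs; if prev_key was
		 * passed in, the kernel wrote the last copied key back into it
		 */
		consume(buf, buf_len);
		if (first) {
			/* first call passed NULL, so seed prev_key from the
			 * last element returned in buf
			 */
			memcpy(&prev_key, buf + buf_len - elem_size,
			       sizeof(prev_key));
			first = false;
		}
	}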

Because maps can be modified from userspace and kernel code, this
function can hit a scenario where the next_key was found but, by the
time we try to retrieve the value, the element is no longer there. In
this case the function continues and tries to get a new next_key value,
skipping the deleted key. If at some point the function finds itself
trapped in a loop, it will return -EINTR.

The function will try to fit as much as possible in the buf provided and
will return -EINVAL if buf_len is smaller than elem_size.

QUEUE and STACK maps are not supported.

Note that map_dump doesn't guarantee that reading the entire table is
consistent, since this function is always racing with kernel and user
code, but the same behaviour is found when the entire table is walked
using the current interfaces: map_get_next_key + map_lookup_elem.
It is also important to note that with a locked map, the lock is
grabbed for one entry at a time, meaning that the returned buf might or
might not be consistent.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 include/uapi/linux/bpf.h |   9 +++
 kernel/bpf/syscall.c     | 117 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 126 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fa1c753dcdbc7..66dab5385170d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
 	BPF_TASK_FD_QUERY,
 	BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 	BPF_MAP_FREEZE,
+	BPF_MAP_DUMP,
 };
 
 enum bpf_map_type {
@@ -388,6 +389,14 @@ union bpf_attr {
 		__u64		flags;
 	};
 
+	struct { /* struct used by BPF_MAP_DUMP command */
+		__aligned_u64	prev_key;
+		__aligned_u64	buf;
+		__aligned_u64	buf_len; /* input/output: len of buf */
+		__u64		flags;
+		__u32		map_fd;
+	} dump;
+
 	struct { /* anonymous struct used by BPF_PROG_LOAD command */
 		__u32		prog_type;	/* one of enum bpf_prog_type */
 		__u32		insn_cnt;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 86cdc2f7bb56e..0c35505aa219f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
 	return err;
 }
 
+/* last field in 'union bpf_attr' used by this command */
+#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
+
+static int map_dump(union bpf_attr *attr)
+{
+	void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
+	void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
+	u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
+	int ufd = attr->dump.map_fd;
+	struct bpf_map *map;
+	void *buf, *prev_key, *key, *value;
+	u32 value_size, elem_size, buf_len, cp_len;
+	struct fd f;
+	int err;
+	bool first_key = false;
+
+	if (CHECK_ATTR(BPF_MAP_DUMP))
+		return -EINVAL;
+
+	if (attr->dump.flags & ~BPF_F_LOCK)
+		return -EINVAL;
+
+	f = fdget(ufd);
+	map = __bpf_map_get(f);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+	if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
+		err = -EPERM;
+		goto err_put;
+	}
+
+	if ((attr->dump.flags & BPF_F_LOCK) &&
+	    !map_value_has_spin_lock(map)) {
+		err = -EINVAL;
+		goto err_put;
+	}
+
+	if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+	    map->map_type == BPF_MAP_TYPE_STACK) {
+		err = -ENOTSUPP;
+		goto err_put;
+	}
+
+	value_size = bpf_map_value_size(map);
+
+	err = get_user(buf_len, ubuf_len);
+	if (err)
+		goto err_put;
+
+	elem_size = map->key_size + value_size;
+	if (buf_len < elem_size) {
+		err = -EINVAL;
+		goto err_put;
+	}
+
+	if (ukey) {
+		prev_key = __bpf_copy_key(ukey, map->key_size);
+		if (IS_ERR(prev_key)) {
+			err = PTR_ERR(prev_key);
+			goto err_put;
+		}
+	} else {
+		prev_key = NULL;
+		first_key = true;
+	}
+
+	err = -ENOMEM;
+	buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
+	if (!buf)
+		goto err_put;
+
+	key = buf;
+	value = key + map->key_size;
+	for (cp_len = 0; cp_len + elem_size <= buf_len;) {
+		if (signal_pending(current)) {
+			err = -EINTR;
+			break;
+		}
+
+		rcu_read_lock();
+		err = map->ops->map_get_next_key(map, prev_key, key);
+		rcu_read_unlock();
+
+		if (err)
+			break;
+
+		err = bpf_map_copy_value(map, key, value, attr->dump.flags);
+
+		if (err == -ENOENT)
+			continue;
+		if (err)
+			goto free_buf;
+
+		if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
+			err = -EFAULT;
+			goto free_buf;
+		}
+
+		prev_key = key;
+		cp_len += elem_size;
+	}
+
+	if (err == -ENOENT && cp_len)
+		err = 0;
+	if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
+		    (!first_key && copy_to_user(ukey, key, map->key_size))))
+		err = -EFAULT;
+free_buf:
+	kfree(buf);
+err_put:
+	fdput(f);
+	return err;
+}
+
 #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
 
 static int map_lookup_and_delete_elem(union bpf_attr *attr)
@@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
 		err = map_lookup_and_delete_elem(&attr);
 		break;
+	case BPF_MAP_DUMP:
+		err = map_dump(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
  2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bpf_map_copy_value helper functions Brian Vazquez
  2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
@ 2019-07-24 16:58 ` Brian Vazquez
  2019-07-24 21:41   ` Song Liu
  2019-07-24 23:10   ` Andrii Nakryiko
  2019-07-24 16:58 ` [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command Brian Vazquez
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 31+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:58 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

Adds bpf_attr.dump structure to libbpf.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 tools/include/uapi/linux/bpf.h | 9 +++++++++
 tools/lib/bpf/libbpf.map       | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4e455018da65f..e127f16e4e932 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
 	BPF_TASK_FD_QUERY,
 	BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 	BPF_MAP_FREEZE,
+	BPF_MAP_DUMP,
 };
 
 enum bpf_map_type {
@@ -388,6 +389,14 @@ union bpf_attr {
 		__u64		flags;
 	};
 
+	struct { /* struct used by BPF_MAP_DUMP command */
+		__aligned_u64	prev_key;
+		__aligned_u64	buf;
+		__aligned_u64	buf_len; /* input/output: len of buf */
+		__u64		flags;
+		__u32		map_fd;
+	} dump;
+
 	struct { /* anonymous struct used by BPF_PROG_LOAD command */
 		__u32		prog_type;	/* one of enum bpf_prog_type */
 		__u32		insn_cnt;
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index f9d316e873d8d..cac3723d5c45c 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -183,4 +183,6 @@ LIBBPF_0.0.4 {
 		perf_buffer__new;
 		perf_buffer__new_raw;
 		perf_buffer__poll;
+		bpf_map_dump;
+		bpf_map_dump_flags;
 } LIBBPF_0.0.3;
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
                   ` (2 preceding siblings ...)
  2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
@ 2019-07-24 16:58 ` Brian Vazquez
  2019-07-24 19:51   ` Willem de Bruijn
  2019-07-24 16:58 ` [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap Brian Vazquez
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:58 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

Make libbpf aware of the new BPF_MAP_DUMP command and add bpf_map_dump
and bpf_map_dump_flags to use it from the library.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 tools/lib/bpf/bpf.c | 28 ++++++++++++++++++++++++++++
 tools/lib/bpf/bpf.h |  4 ++++
 2 files changed, 32 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index c7d7993c44bb0..c1139b7db756a 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -368,6 +368,34 @@ int bpf_map_update_elem(int fd, const void *key, const void *value,
 	return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
 }
 
+int bpf_map_dump(int fd, const void *prev_key, void *buf, void *buf_len)
+{
+	union bpf_attr attr;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.dump.map_fd = fd;
+	attr.dump.prev_key = ptr_to_u64(prev_key);
+	attr.dump.buf = ptr_to_u64(buf);
+	attr.dump.buf_len = ptr_to_u64(buf_len);
+
+	return sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr));
+}
+
+int bpf_map_dump_flags(int fd, const void *prev_key, void *buf, void *buf_len,
+		       __u64 flags)
+{
+	union bpf_attr attr;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.dump.map_fd = fd;
+	attr.dump.prev_key = ptr_to_u64(prev_key);
+	attr.dump.buf = ptr_to_u64(buf);
+	attr.dump.buf_len = ptr_to_u64(buf_len);
+	attr.dump.flags = flags;
+
+	return sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr));
+}
+
 int bpf_map_lookup_elem(int fd, const void *key, void *value)
 {
 	union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index ff42ca043dc8f..86496443440e9 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -112,6 +112,10 @@ LIBBPF_API int bpf_verify_program(enum bpf_prog_type type,
 LIBBPF_API int bpf_map_update_elem(int fd, const void *key, const void *value,
 				   __u64 flags);
 
+LIBBPF_API int bpf_map_dump(int fd, const void *prev_key, void *buf,
+				void *buf_len);
+LIBBPF_API int bpf_map_dump_flags(int fd, const void *prev_key, void *buf,
+				void *buf_len, __u64 flags);
 LIBBPF_API int bpf_map_lookup_elem(int fd, const void *key, void *value);
 LIBBPF_API int bpf_map_lookup_elem_flags(int fd, const void *key, void *value,
 					 __u64 flags);
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
                   ` (3 preceding siblings ...)
  2019-07-24 16:58 ` [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command Brian Vazquez
@ 2019-07-24 16:58 ` Brian Vazquez
  2019-07-24 21:58   ` Song Liu
  2019-07-24 16:58 ` [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP Brian Vazquez
  2019-07-24 19:20 ` [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Song Liu
  6 siblings, 1 reply; 31+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:58 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

This test exercises the new command on a bpf hashmap and makes sure it
works as expected.

Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 tools/testing/selftests/bpf/test_maps.c | 83 ++++++++++++++++++++++++-
 1 file changed, 81 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 5443b9bd75ed7..f7ab401399d40 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -309,6 +309,86 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	close(fd);
 }
 
+static void test_hashmap_dump(void)
+{
+	int fd, i, max_entries = 5;
+	uint64_t keys[max_entries], values[max_entries];
+	uint64_t key, value, next_key, prev_key;
+	bool next_key_valid = true;
+	void *buf, *elem;
+	u32 buf_len;
+	const int elem_size = sizeof(key) + sizeof(value);
+
+	fd = helper_fill_hashmap(max_entries);
+
+	// Get the elements in the hashmap, and store them in that order
+	assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
+	i = 0;
+	keys[i] = key;
+	for (i = 1; next_key_valid; i++) {
+		next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
+		assert(bpf_map_lookup_elem(fd, &key, &values[i - 1]) == 0);
+		keys[i-1] = key;
+		key = next_key;
+	}
+
+	// Alloc memory for the whole table
+	buf = malloc(elem_size * max_entries);
+	assert(buf != NULL);
+
+	// Check that buf_len < elem_size returns EINVAL
+	buf_len = elem_size-1;
+	errno = 0;
+	assert(bpf_map_dump(fd, NULL, buf, &buf_len) == -1 && errno == EINVAL);
+
+	// Check that it returns the first two elements
+	errno = 0;
+	buf_len = elem_size * 2;
+	i = 0;
+	assert(bpf_map_dump(fd, NULL, buf, &buf_len) == 0 &&
+	       buf_len == 2*elem_size);
+	elem = buf;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+	elem = buf + elem_size;
+	i++;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+	i++;
+
+	/* Check that prev_key contains key from last_elem retrieved in previous
+	 * call
+	 */
+	prev_key = *((uint64_t *)elem);
+	assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == 0 &&
+	       buf_len == elem_size*2);
+	elem = buf;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+	elem = buf + elem_size;
+	i++;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+	i++;
+	assert(prev_key == (*(uint64_t *)elem));
+
+	/* Continue reading from map and verify buf_len only contains 1 element
+	 * even though buf_len is 2 elem_size and it returns err = 0.
+	 */
+	assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == 0 &&
+	       buf_len == elem_size);
+	elem = buf;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+
+	// Verify there's no more entries and err = ENOENT
+	assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == -1 &&
+	       errno == ENOENT);
+
+	free(buf);
+	close(fd);
+}
+
 static void test_hashmap_zero_seed(void)
 {
 	int i, first, second, old_flags;
@@ -1677,6 +1757,7 @@ static void run_all_tests(void)
 	test_hashmap_percpu(0, NULL);
 	test_hashmap_walk(0, NULL);
 	test_hashmap_zero_seed();
+	test_hashmap_dump();
 
 	test_arraymap(0, NULL);
 	test_arraymap_percpu(0, NULL);
@@ -1714,11 +1795,9 @@ int main(void)
 
 	map_flags = BPF_F_NO_PREALLOC;
 	run_all_tests();
-
 #define CALL
 #include <map_tests/tests.h>
 #undef CALL
-
 	printf("test_maps: OK, %d SKIPPED\n", skips);
 	return 0;
 }
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
                   ` (4 preceding siblings ...)
  2019-07-24 16:58 ` [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap Brian Vazquez
@ 2019-07-24 16:58 ` Brian Vazquez
  2019-07-24 22:01   ` Song Liu
  2019-07-24 19:20 ` [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Song Liu
  6 siblings, 1 reply; 31+ messages in thread
From: Brian Vazquez @ 2019-07-24 16:58 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

This test compares the amount of time it takes to read an entire table
of 100K elements from a bpf hashmap using BPF_MAP_DUMP versus
BPF_MAP_GET_NEXT_KEY + BPF_MAP_LOOKUP_ELEM.

Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 tools/testing/selftests/bpf/test_maps.c | 65 +++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index f7ab401399d40..c4593a8904ca6 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -18,6 +18,7 @@
 #include <sys/socket.h>
 #include <netinet/in.h>
 #include <linux/bpf.h>
+#include <linux/time64.h>
 
 #include <bpf/bpf.h>
 #include <bpf/libbpf.h>
@@ -389,6 +390,69 @@ static void test_hashmap_dump(void)
 	close(fd);
 }
 
+static void test_hashmap_dump_perf(void)
+{
+	int fd, i, max_entries = 100000;
+	uint64_t key, value, next_key;
+	bool next_key_valid = true;
+	void *buf;
+	u32 buf_len, entries;
+	int j = 0;
+	int clk_id = CLOCK_MONOTONIC;
+	struct timespec begin, end;
+	long long time_spent, dump_time_spent;
+	double res;
+	int tests[] = {1, 2, 230, 5000, 73000, 100000, 234567};
+	int test_len = ARRAY_SIZE(tests);
+	const int elem_size = sizeof(key) + sizeof(value);
+
+	fd = helper_fill_hashmap(max_entries);
+	// Alloc memory considering the largest buffer
+	buf = malloc(elem_size * tests[test_len-1]);
+	assert(buf != NULL);
+
+test:
+	entries = tests[j];
+	buf_len = elem_size*tests[j];
+	j++;
+	clock_gettime(clk_id, &begin);
+	errno = 0;
+	i = 0;
+	while (errno == 0) {
+		bpf_map_dump(fd, !i ? NULL : &key,
+				  buf, &buf_len);
+		if (errno)
+			break;
+		if (!i)
+			key = *((uint64_t *)(buf + buf_len - elem_size));
+		i += buf_len / elem_size;
+	}
+	clock_gettime(clk_id, &end);
+	assert(i  == max_entries);
+	dump_time_spent = NSEC_PER_SEC * (end.tv_sec - begin.tv_sec) +
+			  end.tv_nsec - begin.tv_nsec;
+	next_key_valid = true;
+	clock_gettime(clk_id, &begin);
+	assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
+	for (i = 0; next_key_valid; i++) {
+		next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
+		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
+		key = next_key;
+	}
+	clock_gettime(clk_id, &end);
+	time_spent = NSEC_PER_SEC * (end.tv_sec - begin.tv_sec) +
+		     end.tv_nsec - begin.tv_nsec;
+	res = (1-((double)dump_time_spent/time_spent))*100;
+	printf("buf_len_%u:\t %llu entry-by-entry: %llu improvement %lf\n",
+	       entries, dump_time_spent, time_spent, res);
+	assert(i  == max_entries);
+
+	if (j < test_len)
+		goto test;
+	free(buf);
+	close(fd);
+}
+
 static void test_hashmap_zero_seed(void)
 {
 	int i, first, second, old_flags;
@@ -1758,6 +1822,7 @@ static void run_all_tests(void)
 	test_hashmap_walk(0, NULL);
 	test_hashmap_zero_seed();
 	test_hashmap_dump();
+	test_hashmap_dump_perf();
 
 	test_arraymap(0, NULL);
 	test_arraymap_percpu(0, NULL);
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
                   ` (5 preceding siblings ...)
  2019-07-24 16:58 ` [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP Brian Vazquez
@ 2019-07-24 19:20 ` Song Liu
  2019-07-24 22:15   ` Brian Vazquez
  6 siblings, 1 reply; 31+ messages in thread
From: Song Liu @ 2019-07-24 19:20 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:09 AM Brian Vazquez <brianvv@google.com> wrote:
>
> This introduces a new command to retrieve multiple number of entries
> from a bpf map.
>
> This new command can be executed from the existing BPF syscall as
> follows:
>
> err =  bpf(BPF_MAP_DUMP, union bpf_attr *attr, u32 size)
> using attr->dump.map_fd, attr->dump.prev_key, attr->dump.buf,
> attr->dump.buf_len
> returns zero or negative error, and populates buf and buf_len on
> succees
>
> This implementation is wrapping the existing bpf methods:
> map_get_next_key and map_lookup_elem
>
> Note that this implementation can be extended later to do dump and
> delete by extending map_lookup_and_delete_elem (currently it only works
> for bpf queue/stack maps) and either use a new flag in map_dump or a new
> command map_dump_and_delete.
>
> Results show that even with a 1-elem_size buffer, it runs ~40 faster

Why is the new command 40% faster with 1-elem_size buffer?

> than the current implementation, improvements of ~85% are reported when
> the buffer size is increased, although, after the buffer size is around
> 5% of the total number of entries there's no huge difference in
> increasing it.
>
> Tested:
> Tried different size buffers to handle case where the bulk is bigger, or
> the elements to retrieve are less than the existing ones, all runs read
> a map of 100K entries. Below are the results(in ns) from the different
> runs:
>
> buf_len_1:       69038725 entry-by-entry: 112384424 improvement
> 38.569134
> buf_len_2:       40897447 entry-by-entry: 111030546 improvement
> 63.165590
> buf_len_230:     13652714 entry-by-entry: 111694058 improvement
> 87.776687
> buf_len_5000:    13576271 entry-by-entry: 111101169 improvement
> 87.780263
> buf_len_73000:   14694343 entry-by-entry: 111740162 improvement
> 86.849542
> buf_len_100000:  13745969 entry-by-entry: 114151991 improvement
> 87.958187
> buf_len_234567:  14329834 entry-by-entry: 114427589 improvement
> 87.476941

It took me a while to figure out the meaning of 87.476941. It is probably
a good idea to say 87.5% instead.

Thanks,
Song

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command
  2019-07-24 16:58 ` [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command Brian Vazquez
@ 2019-07-24 19:51   ` Willem de Bruijn
  0 siblings, 0 replies; 31+ messages in thread
From: Willem de Bruijn @ 2019-07-24 19:51 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, LKML, Network Development, bpf

On Wed, Jul 24, 2019 at 1:10 PM Brian Vazquez <brianvv@google.com> wrote:
>
> Make libbpf aware of new BPF_MAP_DUMP command and add bpf_map_dump and
> bpf_map_dump_flags to use them from the library.
>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  tools/lib/bpf/bpf.c | 28 ++++++++++++++++++++++++++++
>  tools/lib/bpf/bpf.h |  4 ++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index c7d7993c44bb0..c1139b7db756a 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -368,6 +368,34 @@ int bpf_map_update_elem(int fd, const void *key, const void *value,
>         return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
>  }
>
> +int bpf_map_dump(int fd, const void *prev_key, void *buf, void *buf_len)
> +{
> +       union bpf_attr attr;
> +
> +       memset(&attr, 0, sizeof(attr));
> +       attr.dump.map_fd = fd;
> +       attr.dump.prev_key = ptr_to_u64(prev_key);
> +       attr.dump.buf = ptr_to_u64(buf);
> +       attr.dump.buf_len = ptr_to_u64(buf_len);
> +
> +       return sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr));
> +}

This can call bpf_map_dump_flags internally to avoid code duplication?
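
(A minimal sketch of that suggestion, for illustration only:)

	int bpf_map_dump(int fd, const void *prev_key, void *buf, void *buf_len)
	{
		return bpf_map_dump_flags(fd, prev_key, buf, buf_len, 0);
	}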

> +
> +int bpf_map_dump_flags(int fd, const void *prev_key, void *buf, void *buf_len,
> +                      __u64 flags)
> +{
> +       union bpf_attr attr;
> +
> +       memset(&attr, 0, sizeof(attr));
> +       attr.dump.map_fd = fd;
> +       attr.dump.prev_key = ptr_to_u64(prev_key);
> +       attr.dump.buf = ptr_to_u64(buf);
> +       attr.dump.buf_len = ptr_to_u64(buf_len);
> +       attr.dump.flags = flags;
> +
> +       return sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr));
> +}
> +

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
@ 2019-07-24 19:54   ` Willem de Bruijn
  2019-07-24 22:26     ` Brian Vazquez
  2019-07-24 21:40   ` Song Liu
  1 sibling, 1 reply; 31+ messages in thread
From: Willem de Bruijn @ 2019-07-24 19:54 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, LKML, Network Development, bpf

On Wed, Jul 24, 2019 at 1:10 PM Brian Vazquez <brianvv@google.com> wrote:
>
> This introduces a new command to retrieve multiple number of entries
> from a bpf map, wrapping the existing bpf methods:
> map_get_next_key and map_lookup_elem
>
> To start dumping the map from the beginning you must specify NULL as
> the prev_key.
>
> The new API returns 0 when it successfully copied all the elements
> requested or it copied less because there weren't more elements to
> retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.

I think I understand this, but perhaps it can be explained a bit more
concisely without reference to ENOENT and error masking. The function
returns the min of the number of requested elements and the number of
remaining elements in the map from the given starting point
(prev_key).

> On a successful call buf and buf_len will contain correct data and in
> case prev_key was provided (not for the first walk, since prev_key is
> NULL) it will contain the last_key copied into the prev_key which will
> simplify next call.
>
> Only when it can't find a single element it will return -ENOENT meaning
> that the map has been entirely walked. When an error is return buf,
> buf_len and prev_key shouldn't be read nor used.

That's common for error handling. No need to state explicitly.

> Because maps can be called from userspace and kernel code, this function
> can have a scenario where the next_key was found but by the time we
> try to retrieve the value the element is not there, in this case the
> function continues and tries to get a new next_key value, skipping the
> deleted key. If at some point the function find itself trap in a loop,
> it will return -EINTR.

Good to point this out! I don't think that unbounded continue;
statements until an interrupt happens are sufficient. Please bound the
number of retries to a low number.
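
(For example, something along these lines inside map_dump()'s loop,
purely as a sketch; the limit and the local 'retries' counter are made
up:)

		if (err == -ENOENT) {
			/* MAX_DUMP_RETRIES would be a new, small constant */
			if (++retries >= MAX_DUMP_RETRIES) {
				err = -EBUSY;
				break;
			}
			continue;
		}
		retries = 0;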

> The function will try to fit as much as possible in the buf provided and
> will return -EINVAL if buf_len is smaller than elem_size.
>
> QUEUE and STACK maps are not supported.
>
> Note that map_dump doesn't guarantee that reading the entire table is
> consistent since this function is always racing with kernel and user code
> but the same behaviour is found when the entire table is walked using
> the current interfaces: map_get_next_key + map_lookup_elem.

> It is also important to note that with  a locked map, the lock is grabbed
> for 1 entry at the time, meaning that the returned buf might or might not
> be consistent.

Would it be informative to signal to the caller if the read was
complete and consistent (because the entire table was read while the
lock was held)?

>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bpf_map_copy_value helper functions
  2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bpf_map_copy_value helper functions Brian Vazquez
@ 2019-07-24 20:53   ` Song Liu
  0 siblings, 0 replies; 31+ messages in thread
From: Song Liu @ 2019-07-24 20:53 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> Move reusable code from map_lookup_elem to helper functions to avoid code
> duplication in kernel/bpf/syscall.c
>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>

Acked-by: Song Liu <songliubraving@fb.com>

Some very minor nits though.

> ---
>  kernel/bpf/syscall.c | 134 +++++++++++++++++++++++--------------------
>  1 file changed, 73 insertions(+), 61 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 5d141f16f6fa9..86cdc2f7bb56e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -126,6 +126,76 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
>         return map;
>  }
>
> +static u32 bpf_map_value_size(struct bpf_map *map)
> +{
> +       if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
> +           map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
> +           map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
> +           map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
> +               return round_up(map->value_size, 8) * num_possible_cpus();
> +       else if (IS_FD_MAP(map))
> +               return sizeof(u32);
> +       else
> +               return  map->value_size;
                  ^ extra space after return

> +}
> +
> +static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
> +                             __u64 flags)
> +{
> +       void *ptr;
> +       int err;
> +
> +       if (bpf_map_is_dev_bound(map))
> +               return  bpf_map_offload_lookup_elem(map, key, value);
                  ^ another extra space after return, did replace? :-)

Thanks,
Song

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
  2019-07-24 19:54   ` Willem de Bruijn
@ 2019-07-24 21:40   ` Song Liu
  2019-07-24 22:44     ` Brian Vazquez
  1 sibling, 1 reply; 31+ messages in thread
From: Song Liu @ 2019-07-24 21:40 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> This introduces a new command to retrieve multiple number of entries
> from a bpf map, wrapping the existing bpf methods:
> map_get_next_key and map_lookup_elem
>
> To start dumping the map from the beginning you must specify NULL as
> the prev_key.
>
> The new API returns 0 when it successfully copied all the elements
> requested or it copied less because there weren't more elements to
> retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.
>
> On a successful call buf and buf_len will contain correct data and in
> case prev_key was provided (not for the first walk, since prev_key is
> NULL) it will contain the last_key copied into the prev_key which will
> simplify next call.
>
> Only when it can't find a single element it will return -ENOENT meaning
> that the map has been entirely walked. When an error is return buf,
> buf_len and prev_key shouldn't be read nor used.
>
> Because maps can be called from userspace and kernel code, this function
> can have a scenario where the next_key was found but by the time we
> try to retrieve the value the element is not there, in this case the
> function continues and tries to get a new next_key value, skipping the
> deleted key. If at some point the function find itself trap in a loop,
> it will return -EINTR.
>
> The function will try to fit as much as possible in the buf provided and
> will return -EINVAL if buf_len is smaller than elem_size.
>
> QUEUE and STACK maps are not supported.
>
> Note that map_dump doesn't guarantee that reading the entire table is
> consistent since this function is always racing with kernel and user code
> but the same behaviour is found when the entire table is walked using
> the current interfaces: map_get_next_key + map_lookup_elem.
> It is also important to note that with  a locked map, the lock is grabbed
> for 1 entry at the time, meaning that the returned buf might or might not
> be consistent.
>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  include/uapi/linux/bpf.h |   9 +++
>  kernel/bpf/syscall.c     | 117 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 126 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index fa1c753dcdbc7..66dab5385170d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -106,6 +106,7 @@ enum bpf_cmd {
>         BPF_TASK_FD_QUERY,
>         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
>         BPF_MAP_FREEZE,
> +       BPF_MAP_DUMP,
>  };
>
>  enum bpf_map_type {
> @@ -388,6 +389,14 @@ union bpf_attr {
>                 __u64           flags;
>         };
>
> +       struct { /* struct used by BPF_MAP_DUMP command */
> +               __aligned_u64   prev_key;
> +               __aligned_u64   buf;
> +               __aligned_u64   buf_len; /* input/output: len of buf */
> +               __u64           flags;

Please add explanation of flags here. Also, we need to update the
comments of BPF_F_LOCK for BPF_MAP_DUMP.

> +               __u32           map_fd;
> +       } dump;
> +
>         struct { /* anonymous struct used by BPF_PROG_LOAD command */
>                 __u32           prog_type;      /* one of enum bpf_prog_type */
>                 __u32           insn_cnt;
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 86cdc2f7bb56e..0c35505aa219f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
>         return err;
>  }
>
> +/* last field in 'union bpf_attr' used by this command */
> +#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
> +
> +static int map_dump(union bpf_attr *attr)
> +{
> +       void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
> +       void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
> +       u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
> +       int ufd = attr->dump.map_fd;
> +       struct bpf_map *map;
> +       void *buf, *prev_key, *key, *value;
> +       u32 value_size, elem_size, buf_len, cp_len;
> +       struct fd f;
> +       int err;
> +       bool first_key = false;
> +
> +       if (CHECK_ATTR(BPF_MAP_DUMP))
> +               return -EINVAL;
> +
> +       if (attr->dump.flags & ~BPF_F_LOCK)
> +               return -EINVAL;
> +
> +       f = fdget(ufd);
> +       map = __bpf_map_get(f);
> +       if (IS_ERR(map))
> +               return PTR_ERR(map);
> +       if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
> +               err = -EPERM;
> +               goto err_put;
> +       }
> +
> +       if ((attr->dump.flags & BPF_F_LOCK) &&
> +           !map_value_has_spin_lock(map)) {
> +               err = -EINVAL;
> +               goto err_put;
> +       }

We can share these lines with map_lookup_elem(). Maybe
add another helper function?
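
(Roughly something like the following shared helper, as a sketch only;
the name is made up:)

	static struct bpf_map *bpf_map_get_read(int ufd, struct fd *f, u64 flags)
	{
		struct bpf_map *map;

		*f = fdget(ufd);
		map = __bpf_map_get(*f);
		if (IS_ERR(map))
			return map;

		if (!(map_get_sys_perms(map, *f) & FMODE_CAN_READ)) {
			fdput(*f);
			return ERR_PTR(-EPERM);
		}

		if ((flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)) {
			fdput(*f);
			return ERR_PTR(-EINVAL);
		}

		return map;
	}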

> +
> +       if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> +           map->map_type == BPF_MAP_TYPE_STACK) {
> +               err = -ENOTSUPP;
> +               goto err_put;
> +       }
> +
> +       value_size = bpf_map_value_size(map);
> +
> +       err = get_user(buf_len, ubuf_len);
> +       if (err)
> +               goto err_put;
> +
> +       elem_size = map->key_size + value_size;
> +       if (buf_len < elem_size) {
> +               err = -EINVAL;
> +               goto err_put;
> +       }
> +
> +       if (ukey) {
> +               prev_key = __bpf_copy_key(ukey, map->key_size);
> +               if (IS_ERR(prev_key)) {
> +                       err = PTR_ERR(prev_key);
> +                       goto err_put;
> +               }
> +       } else {
> +               prev_key = NULL;
> +               first_key = true;
> +       }
> +
> +       err = -ENOMEM;
> +       buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
> +       if (!buf)
> +               goto err_put;
> +
> +       key = buf;
> +       value = key + map->key_size;
> +       for (cp_len = 0; cp_len + elem_size <= buf_len;) {
> +               if (signal_pending(current)) {
> +                       err = -EINTR;
> +                       break;
> +               }
> +
> +               rcu_read_lock();
> +               err = map->ops->map_get_next_key(map, prev_key, key);

If prev_key is deleted before map_get_next_key(), we get the first key
again. This is pretty weird.

> +               rcu_read_unlock();
> +
> +               if (err)
> +                       break;
> +
> +               err = bpf_map_copy_value(map, key, value, attr->dump.flags);
> +
> +               if (err == -ENOENT)
> +                       continue;
> +               if (err)
> +                       goto free_buf;
> +
> +               if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
> +                       err = -EFAULT;
> +                       goto free_buf;
> +               }
> +
> +               prev_key = key;
> +               cp_len += elem_size;
> +       }
> +
> +       if (err == -ENOENT && cp_len)
> +               err = 0;
> +       if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
> +                   (!first_key && copy_to_user(ukey, key, map->key_size))))
> +               err = -EFAULT;
> +free_buf:
> +       kfree(buf);
> +err_put:
> +       fdput(f);
> +       return err;
> +}
> +
>  #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
>
>  static int map_lookup_and_delete_elem(union bpf_attr *attr)
> @@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>         case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
>                 err = map_lookup_and_delete_elem(&attr);
>                 break;
> +       case BPF_MAP_DUMP:
> +               err = map_dump(&attr);
> +               break;
>         default:
>                 err = -EINVAL;
>                 break;
> --
> 2.22.0.657.g960e92d24f-goog
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/
  2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
@ 2019-07-24 21:41   ` Song Liu
  2019-07-24 23:10   ` Andrii Nakryiko
  1 sibling, 0 replies; 31+ messages in thread
From: Song Liu @ 2019-07-24 21:41 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> Adds bpf_attr.dump structure to libbpf.
>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  tools/include/uapi/linux/bpf.h | 9 +++++++++
>  tools/lib/bpf/libbpf.map       | 2 ++
>  2 files changed, 11 insertions(+)
>
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 4e455018da65f..e127f16e4e932 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -106,6 +106,7 @@ enum bpf_cmd {
>         BPF_TASK_FD_QUERY,
>         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
>         BPF_MAP_FREEZE,
> +       BPF_MAP_DUMP,
>  };
>
>  enum bpf_map_type {
> @@ -388,6 +389,14 @@ union bpf_attr {
>                 __u64           flags;
>         };
>
> +       struct { /* struct used by BPF_MAP_DUMP command */
> +               __aligned_u64   prev_key;
> +               __aligned_u64   buf;
> +               __aligned_u64   buf_len; /* input/output: len of buf */
> +               __u64           flags;
> +               __u32           map_fd;
> +       } dump;
> +
>         struct { /* anonymous struct used by BPF_PROG_LOAD command */
>                 __u32           prog_type;      /* one of enum bpf_prog_type */
>                 __u32           insn_cnt;
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index f9d316e873d8d..cac3723d5c45c 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -183,4 +183,6 @@ LIBBPF_0.0.4 {
>                 perf_buffer__new;
>                 perf_buffer__new_raw;
>                 perf_buffer__poll;
> +               bpf_map_dump;
> +               bpf_map_dump_flags;
>  } LIBBPF_0.0.3;

libbpf.map change should go with 4/6.

> --
> 2.22.0.657.g960e92d24f-goog
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap
  2019-07-24 16:58 ` [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap Brian Vazquez
@ 2019-07-24 21:58   ` Song Liu
  0 siblings, 0 replies; 31+ messages in thread
From: Song Liu @ 2019-07-24 21:58 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> This tests exercise the new command on a bpf hashmap and make sure it
> works as expected.
>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  tools/testing/selftests/bpf/test_maps.c | 83 ++++++++++++++++++++++++-
>  1 file changed, 81 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> index 5443b9bd75ed7..f7ab401399d40 100644
> --- a/tools/testing/selftests/bpf/test_maps.c
> +++ b/tools/testing/selftests/bpf/test_maps.c
> @@ -309,6 +309,86 @@ static void test_hashmap_walk(unsigned int task, void *data)
>         close(fd);
>  }
>
> +static void test_hashmap_dump(void)
> +{
> +       int fd, i, max_entries = 5;

5 is too small for map_dump.

> +       uint64_t keys[max_entries], values[max_entries];
> +       uint64_t key, value, next_key, prev_key;
> +       bool next_key_valid = true;
> +       void *buf, *elem;
> +       u32 buf_len;
> +       const int elem_size = sizeof(key) + sizeof(value);
> +
> +       fd = helper_fill_hashmap(max_entries);
> +
> +       // Get the elements in the hashmap, and store them in that order

Please use /* */ instead of //.

> +       assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
> +       i = 0;
> +       keys[i] = key;
> +       for (i = 1; next_key_valid; i++) {
> +               next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
> +               assert(bpf_map_lookup_elem(fd, &key, &values[i - 1]) == 0);
> +               keys[i-1] = key;
> +               key = next_key;
> +       }
> +
> +       // Alloc memory for the whole table
> +       buf = malloc(elem_size * max_entries);
> +       assert(buf != NULL);
> +
> +       // Check that buf_len < elem_size returns EINVAL
> +       buf_len = elem_size-1;
> +       errno = 0;
> +       assert(bpf_map_dump(fd, NULL, buf, &buf_len) == -1 && errno == EINVAL);
> +
> +       // Check that it returns the first two elements
> +       errno = 0;
> +       buf_len = elem_size * 2;
> +       i = 0;
> +       assert(bpf_map_dump(fd, NULL, buf, &buf_len) == 0 &&
> +              buf_len == 2*elem_size);
> +       elem = buf;
> +       assert((*(uint64_t *)elem) == keys[i] &&
> +              (*(uint64_t *)(elem + sizeof(key))) == values[i]);
> +       elem = buf + elem_size;
> +       i++;
> +       assert((*(uint64_t *)elem) == keys[i] &&
> +              (*(uint64_t *)(elem + sizeof(key))) == values[i]);
> +       i++;
> +
> +       /* Check that prev_key contains key from last_elem retrieved in previous
> +        * call
> +        */
> +       prev_key = *((uint64_t *)elem);
> +       assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == 0 &&
> +              buf_len == elem_size*2);
> +       elem = buf;
> +       assert((*(uint64_t *)elem) == keys[i] &&
> +              (*(uint64_t *)(elem + sizeof(key))) == values[i]);
> +       elem = buf + elem_size;
> +       i++;
> +       assert((*(uint64_t *)elem) == keys[i] &&
> +              (*(uint64_t *)(elem + sizeof(key))) == values[i]);
> +       i++;
> +       assert(prev_key == (*(uint64_t *)elem));
> +
> +       /* Continue reading from map and verify buf_len only contains 1 element
> +        * even though buf_len is 2 elem_size and it returns err = 0.
> +        */
> +       assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == 0 &&
> +              buf_len == elem_size);
> +       elem = buf;
> +       assert((*(uint64_t *)elem) == keys[i] &&
> +              (*(uint64_t *)(elem + sizeof(key))) == values[i]);
> +
> +       // Verify there's no more entries and err = ENOENT
> +       assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == -1 &&
> +              errno == ENOENT);
> +
> +       free(buf);
> +       close(fd);
> +}
> +
>  static void test_hashmap_zero_seed(void)
>  {
>         int i, first, second, old_flags;
> @@ -1677,6 +1757,7 @@ static void run_all_tests(void)
>         test_hashmap_percpu(0, NULL);
>         test_hashmap_walk(0, NULL);
>         test_hashmap_zero_seed();
> +       test_hashmap_dump();
>
>         test_arraymap(0, NULL);
>         test_arraymap_percpu(0, NULL);
> @@ -1714,11 +1795,9 @@ int main(void)
>
>         map_flags = BPF_F_NO_PREALLOC;
>         run_all_tests();
> -
>  #define CALL
>  #include <map_tests/tests.h>
>  #undef CALL
> -
>         printf("test_maps: OK, %d SKIPPED\n", skips);
>         return 0;
>  }
> --
> 2.22.0.657.g960e92d24f-goog
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP
  2019-07-24 16:58 ` [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP Brian Vazquez
@ 2019-07-24 22:01   ` Song Liu
  0 siblings, 0 replies; 31+ messages in thread
From: Song Liu @ 2019-07-24 22:01 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> This test compares the amount of time it takes to read an entire
> table of 100K elements from a bpf hashmap using both BPF_MAP_DUMP and
> BPF_MAP_GET_NEXT_KEY + BPF_MAP_LOOKUP_ELEM.
>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  tools/testing/selftests/bpf/test_maps.c | 65 +++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> index f7ab401399d40..c4593a8904ca6 100644
> --- a/tools/testing/selftests/bpf/test_maps.c
> +++ b/tools/testing/selftests/bpf/test_maps.c
> @@ -18,6 +18,7 @@
>  #include <sys/socket.h>
>  #include <netinet/in.h>
>  #include <linux/bpf.h>
> +#include <linux/time64.h>
>
>  #include <bpf/bpf.h>
>  #include <bpf/libbpf.h>
> @@ -389,6 +390,69 @@ static void test_hashmap_dump(void)
>         close(fd);
>  }
>
> +static void test_hashmap_dump_perf(void)
> +{
> +       int fd, i, max_entries = 100000;
> +       uint64_t key, value, next_key;
> +       bool next_key_valid = true;
> +       void *buf;
> +       u32 buf_len, entries;
> +       int j = 0;
> +       int clk_id = CLOCK_MONOTONIC;
> +       struct timespec begin, end;
> +       long long time_spent, dump_time_spent;
> +       double res;
> +       int tests[] = {1, 2, 230, 5000, 73000, 100000, 234567};
> +       int test_len = ARRAY_SIZE(tests);
> +       const int elem_size = sizeof(key) + sizeof(value);
> +
> +       fd = helper_fill_hashmap(max_entries);
> +       // Alloc memory considering the largest buffer
> +       buf = malloc(elem_size * tests[test_len-1]);
> +       assert(buf != NULL);
> +
> +test:
> +       entries = tests[j];
> +       buf_len = elem_size*tests[j];
> +       j++;
> +       clock_gettime(clk_id, &begin);
> +       errno = 0;
> +       i = 0;
> +       while (errno == 0) {
> +               bpf_map_dump(fd, !i ? NULL : &key,
> +                                 buf, &buf_len);
> +               if (errno)
> +                       break;
> +               if (!i)
> +                       key = *((uint64_t *)(buf + buf_len - elem_size));
> +               i += buf_len / elem_size;
> +       }
> +       clock_gettime(clk_id, &end);
> +       assert(i  == max_entries);
> +       dump_time_spent = NSEC_PER_SEC * (end.tv_sec - begin.tv_sec) +
> +                         end.tv_nsec - begin.tv_nsec;
> +       next_key_valid = true;
> +       clock_gettime(clk_id, &begin);
> +       assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
> +       for (i = 0; next_key_valid; i++) {
> +               next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
> +               assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
> +               key = next_key;
> +       }
> +       clock_gettime(clk_id, &end);
> +       time_spent = NSEC_PER_SEC * (end.tv_sec - begin.tv_sec) +
> +                    end.tv_nsec - begin.tv_nsec;
> +       res = (1-((double)dump_time_spent/time_spent))*100;
> +       printf("buf_len_%u:\t %llu entry-by-entry: %llu improvement %lf\n",
> +              entries, dump_time_spent, time_spent, res);
> +       assert(i  == max_entries);
              ^ extra space after i.

> +
> +       if (j < test_len)
> +               goto test;
> +       free(buf);
> +       close(fd);
> +}
> +
>  static void test_hashmap_zero_seed(void)
>  {
>         int i, first, second, old_flags;
> @@ -1758,6 +1822,7 @@ static void run_all_tests(void)
>         test_hashmap_walk(0, NULL);
>         test_hashmap_zero_seed();
>         test_hashmap_dump();
> +       test_hashmap_dump_perf();
>
>         test_arraymap(0, NULL);
>         test_arraymap_percpu(0, NULL);
> --
> 2.22.0.657.g960e92d24f-goog
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 19:20 ` [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Song Liu
@ 2019-07-24 22:15   ` Brian Vazquez
  0 siblings, 0 replies; 31+ messages in thread
From: Brian Vazquez @ 2019-07-24 22:15 UTC (permalink / raw)
  To: Song Liu
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 12:20 PM Song Liu <liu.song.a23@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 10:09 AM Brian Vazquez <brianvv@google.com> wrote:
> >
> > This introduces a new command to retrieve multiple number of entries
> > from a bpf map.
> >
> > This new command can be executed from the existing BPF syscall as
> > follows:
> >
> > err =  bpf(BPF_MAP_DUMP, union bpf_attr *attr, u32 size)
> > using attr->dump.map_fd, attr->dump.prev_key, attr->dump.buf,
> > attr->dump.buf_len
> > returns zero or negative error, and populates buf and buf_len on
> > succees
> >
> > This implementation is wrapping the existing bpf methods:
> > map_get_next_key and map_lookup_elem
> >
> > Note that this implementation can be extended later to do dump and
> > delete by extending map_lookup_and_delete_elem (currently it only works
> > for bpf queue/stack maps) and either use a new flag in map_dump or a new
> > command map_dump_and_delete.
> >
> > Results show that even with a 1-elem_size buffer, it runs ~40 faster
>
> Why is the new command 40% faster with 1-elem_size buffer?

The test is using a really simple map structure: uint64_t key, val,
which makes the lookup_elem logic faster since it doesn't spend much
time copying the value. My conclusion from the 40% was that this new
implementation only needs 1 syscall per element, compared to the 2
syscalls that we needed with the previous implementation, so in this
particular case the number of syscalls we issue is almost halved.
I did one experiment increasing the value_size (448*64B) and it was
only 14% faster with a 1-elem_size buffer.
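
As a rough back-of-the-envelope check (my numbers, not from the
patchset): for the 100K-entry map, the entry-by-entry walk issues
about 2 * 100K = 200K bpf() syscalls (one BPF_MAP_GET_NEXT_KEY plus
one BPF_MAP_LOOKUP_ELEM per element), while BPF_MAP_DUMP with a
1-elem_size buffer issues roughly 100K (one per element), i.e. about
half, which lines up with the ~40% win for this tiny key/value pair.
With a 230-element buffer it drops to about 100000 / 230, i.e. roughly
435 syscalls, which matches the gains flattening out around ~87%.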

> > than the current implementation, improvements of ~85% are reported when
> > the buffer size is increased, although, after the buffer size is around
> > 5% of the total number of entries there's no huge difference in
> > increasing it.
> >
> > Tested:
> > Tried different size buffers to handle case where the bulk is bigger, or
> > the elements to retrieve are less than the existing ones, all runs read
> > a map of 100K entries. Below are the results(in ns) from the different
> > runs:
> >
> > buf_len_1:       69038725 entry-by-entry: 112384424 improvement
> > 38.569134
> > buf_len_2:       40897447 entry-by-entry: 111030546 improvement
> > 63.165590
> > buf_len_230:     13652714 entry-by-entry: 111694058 improvement
> > 87.776687
> > buf_len_5000:    13576271 entry-by-entry: 111101169 improvement
> > 87.780263
> > buf_len_73000:   14694343 entry-by-entry: 111740162 improvement
> > 86.849542
> > buf_len_100000:  13745969 entry-by-entry: 114151991 improvement
> > 87.958187
> > buf_len_234567:  14329834 entry-by-entry: 114427589 improvement
> > 87.476941
>
> It took me a while to figure out the meaning of 87.476941. It is probably
> a good idea to say 87.5% instead.

right, will change it in the next version.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 19:54   ` Willem de Bruijn
@ 2019-07-24 22:26     ` Brian Vazquez
  2019-07-24 22:33       ` Willem de Bruijn
  0 siblings, 1 reply; 31+ messages in thread
From: Brian Vazquez @ 2019-07-24 22:26 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, LKML, Network Development, bpf

On Wed, Jul 24, 2019 at 12:55 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 1:10 PM Brian Vazquez <brianvv@google.com> wrote:
> >
> > This introduces a new command to retrieve multiple number of entries
> > from a bpf map, wrapping the existing bpf methods:
> > map_get_next_key and map_lookup_elem
> >
> > To start dumping the map from the beginning you must specify NULL as
> > the prev_key.
> >
> > The new API returns 0 when it successfully copied all the elements
> > requested or it copied less because there weren't more elements to
> > retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.
>
> I think I understand this, but perhaps it can be explained a bit more
> concisely without reference to ENOENT and error masking. The function
> returns the min of the number of requested elements and the number of
> remaining elements in the map from the given starting point
> (prev_key).

That sounds better to me. Thanks!

> > On a successful call buf and buf_len will contain correct data and in
> > case prev_key was provided (not for the first walk, since prev_key is
> > NULL) it will contain the last_key copied into the prev_key which will
> > simplify next call.
> >
> > Only when it can't find a single element it will return -ENOENT meaning
> > that the map has been entirely walked. When an error is return buf,
> > buf_len and prev_key shouldn't be read nor used.
>
> That's common for error handling. No need to state explicitly.

 Ack

>
> > Because maps can be called from userspace and kernel code, this function
> > can have a scenario where the next_key was found but by the time we
> > try to retrieve the value the element is not there, in this case the
> > function continues and tries to get a new next_key value, skipping the
> > deleted key. If at some point the function find itself trap in a loop,
> > it will return -EINTR.
>
> Good to point this out! I don't think that unbounded continue;
> statements until an interrupt happens is sufficient. Please bound the
> number of retries to a low number.

And what would be a good number? Maybe 3 attempts? And in that case,
what error should be reported?
>
> > The function will try to fit as much as possible in the buf provided and
> > will return -EINVAL if buf_len is smaller than elem_size.
> >
> > QUEUE and STACK maps are not supported.
> >
> > Note that map_dump doesn't guarantee that reading the entire table is
> > consistent since this function is always racing with kernel and user code
> > but the same behaviour is found when the entire table is walked using
> > the current interfaces: map_get_next_key + map_lookup_elem.
>
> > It is also important to note that with  a locked map, the lock is grabbed
> > for 1 entry at the time, meaning that the returned buf might or might not
> > be consistent.
>
> Would it be informative to signal to the caller if the read was
> complete and consistent (because the entire table was read while the
> lock was held)?

Mmm.. not sure how we could signal that to the caller. But I don't
think there's a way to know it was consistent (i.e. if one element was
removed in bucket 20 while you are copying the keys in bucket 15, when
you get to bucket 20 there's no way to know that some entries were
removed as you traversed them). The lock is held for just one single
entry, not the entire table.
Maybe I should clarify that more in the commit message?

Thanks for reviewing!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 22:26     ` Brian Vazquez
@ 2019-07-24 22:33       ` Willem de Bruijn
  0 siblings, 0 replies; 31+ messages in thread
From: Willem de Bruijn @ 2019-07-24 22:33 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, LKML, Network Development, bpf

> > > Because maps can be called from userspace and kernel code, this function
> > > can have a scenario where the next_key was found but by the time we
> > > try to retrieve the value the element is not there, in this case the
> > > function continues and tries to get a new next_key value, skipping the
> > > deleted key. If at some point the function find itself trap in a loop,
> > > it will return -EINTR.
> >
> > Good to point this out! I don't think that unbounded continue;
> > statements until an interrupt happens is sufficient. Please bound the
> > number of retries to a low number.
>
> And what would it be a good number? Maybe 3 attempts?

3 sounds good to me.

> And in that case what error should be reported?

One that's unambiguous and somewhat intuitive for the given issue.
Perhaps EBUSY?

> > > The function will try to fit as much as possible in the buf provided and
> > > will return -EINVAL if buf_len is smaller than elem_size.
> > >
> > > QUEUE and STACK maps are not supported.
> > >
> > > Note that map_dump doesn't guarantee that reading the entire table is
> > > consistent since this function is always racing with kernel and user code
> > > but the same behaviour is found when the entire table is walked using
> > > the current interfaces: map_get_next_key + map_lookup_elem.
> >
> > > It is also important to note that with  a locked map, the lock is grabbed
> > > for 1 entry at the time, meaning that the returned buf might or might not
> > > be consistent.
> >
> > Would it be informative to signal to the caller if the read was
> > complete and consistent (because the entire table was read while the
> > lock was held)?
>
> Mmm.. not sure how we could signal that to the caller.  But I don't
> think there's a way to know it was consistent

Okay, that makes for a simple answer :) No need to try to add a signal, then.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 21:40   ` Song Liu
@ 2019-07-24 22:44     ` Brian Vazquez
  2019-07-24 23:04       ` Song Liu
  0 siblings, 1 reply; 31+ messages in thread
From: Brian Vazquez @ 2019-07-24 22:44 UTC (permalink / raw)
  To: Song Liu
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 2:40 PM Song Liu <liu.song.a23@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
> >
> > This introduces a new command to retrieve multiple number of entries
> > from a bpf map, wrapping the existing bpf methods:
> > map_get_next_key and map_lookup_elem
> >
> > To start dumping the map from the beginning you must specify NULL as
> > the prev_key.
> >
> > The new API returns 0 when it successfully copied all the elements
> > requested or it copied less because there weren't more elements to
> > retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.
> >
> > On a successful call buf and buf_len will contain correct data and in
> > case prev_key was provided (not for the first walk, since prev_key is
> > NULL) it will contain the last_key copied into the prev_key which will
> > simplify next call.
> >
> > Only when it can't find a single element it will return -ENOENT meaning
> > that the map has been entirely walked. When an error is return buf,
> > buf_len and prev_key shouldn't be read nor used.
> >
> > Because maps can be called from userspace and kernel code, this function
> > can have a scenario where the next_key was found but by the time we
> > try to retrieve the value the element is not there, in this case the
> > function continues and tries to get a new next_key value, skipping the
> > deleted key. If at some point the function find itself trap in a loop,
> > it will return -EINTR.
> >
> > The function will try to fit as much as possible in the buf provided and
> > will return -EINVAL if buf_len is smaller than elem_size.
> >
> > QUEUE and STACK maps are not supported.
> >
> > Note that map_dump doesn't guarantee that reading the entire table is
> > consistent since this function is always racing with kernel and user code
> > but the same behaviour is found when the entire table is walked using
> > the current interfaces: map_get_next_key + map_lookup_elem.
> > It is also important to note that with  a locked map, the lock is grabbed
> > for 1 entry at the time, meaning that the returned buf might or might not
> > be consistent.
> >
> > Suggested-by: Stanislav Fomichev <sdf@google.com>
> > Signed-off-by: Brian Vazquez <brianvv@google.com>
> > ---
> >  include/uapi/linux/bpf.h |   9 +++
> >  kernel/bpf/syscall.c     | 117 +++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 126 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index fa1c753dcdbc7..66dab5385170d 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -106,6 +106,7 @@ enum bpf_cmd {
> >         BPF_TASK_FD_QUERY,
> >         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> >         BPF_MAP_FREEZE,
> > +       BPF_MAP_DUMP,
> >  };
> >
> >  enum bpf_map_type {
> > @@ -388,6 +389,14 @@ union bpf_attr {
> >                 __u64           flags;
> >         };
> >
> > +       struct { /* struct used by BPF_MAP_DUMP command */
> > +               __aligned_u64   prev_key;
> > +               __aligned_u64   buf;
> > +               __aligned_u64   buf_len; /* input/output: len of buf */
> > +               __u64           flags;
>
> Please add explanation of flags here.

got it!

> Also, we need to update the
> comments of BPF_F_LOCK for BPF_MAP_DUMP.

What do you mean? I didn't get this part.

>
> > +               __u32           map_fd;
> > +       } dump;
> > +
> >         struct { /* anonymous struct used by BPF_PROG_LOAD command */
> >                 __u32           prog_type;      /* one of enum bpf_prog_type */
> >                 __u32           insn_cnt;
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 86cdc2f7bb56e..0c35505aa219f 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
> >         return err;
> >  }
> >
> > +/* last field in 'union bpf_attr' used by this command */
> > +#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
> > +
> > +static int map_dump(union bpf_attr *attr)
> > +{
> > +       void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
> > +       void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
> > +       u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
> > +       int ufd = attr->dump.map_fd;
> > +       struct bpf_map *map;
> > +       void *buf, *prev_key, *key, *value;
> > +       u32 value_size, elem_size, buf_len, cp_len;
> > +       struct fd f;
> > +       int err;
> > +       bool first_key = false;
> > +
> > +       if (CHECK_ATTR(BPF_MAP_DUMP))
> > +               return -EINVAL;
> > +
> > +       if (attr->dump.flags & ~BPF_F_LOCK)
> > +               return -EINVAL;
> > +
> > +       f = fdget(ufd);
> > +       map = __bpf_map_get(f);
> > +       if (IS_ERR(map))
> > +               return PTR_ERR(map);
> > +       if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
> > +               err = -EPERM;
> > +               goto err_put;
> > +       }
> > +
> > +       if ((attr->dump.flags & BPF_F_LOCK) &&
> > +           !map_value_has_spin_lock(map)) {
> > +               err = -EINVAL;
> > +               goto err_put;
> > +       }
>
> We can share these lines with map_lookup_elem(). Maybe
> add another helper function?

Which lines are you referring to? The dump.flags checks? It makes
sense; that way, when a new flag is added, you only need to modify
them in one spot.

> > +
> > +       if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> > +           map->map_type == BPF_MAP_TYPE_STACK) {
> > +               err = -ENOTSUPP;
> > +               goto err_put;
> > +       }
> > +
> > +       value_size = bpf_map_value_size(map);
> > +
> > +       err = get_user(buf_len, ubuf_len);
> > +       if (err)
> > +               goto err_put;
> > +
> > +       elem_size = map->key_size + value_size;
> > +       if (buf_len < elem_size) {
> > +               err = -EINVAL;
> > +               goto err_put;
> > +       }
> > +
> > +       if (ukey) {
> > +               prev_key = __bpf_copy_key(ukey, map->key_size);
> > +               if (IS_ERR(prev_key)) {
> > +                       err = PTR_ERR(prev_key);
> > +                       goto err_put;
> > +               }
> > +       } else {
> > +               prev_key = NULL;
> > +               first_key = true;
> > +       }
> > +
> > +       err = -ENOMEM;
> > +       buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
> > +       if (!buf)
> > +               goto err_put;
> > +
> > +       key = buf;
> > +       value = key + map->key_size;
> > +       for (cp_len = 0; cp_len + elem_size <= buf_len;) {
> > +               if (signal_pending(current)) {
> > +                       err = -EINTR;
> > +                       break;
> > +               }
> > +
> > +               rcu_read_lock();
> > +               err = map->ops->map_get_next_key(map, prev_key, key);
>
> If prev_key is deleted before map_get_next_key(), we get the first key
> again. This is pretty weird.

Yes, I know. But note that the current scenario happens even with the
old interface (imagine you are walking a map from userspace and you
tried get_next_key after the prev_key was removed: you will start again
from the beginning without noticing it).
I tried to send a patch in the past but I was missing some context:
before NULL was used to get the very first_key, the interface relied on
a random (non-existent) key to retrieve the first_key in the map, and
I was told that we still have to support that scenario.

>
> > +               rcu_read_unlock();
> > +
> > +               if (err)
> > +                       break;
> > +
> > +               err = bpf_map_copy_value(map, key, value, attr->dump.flags);
> > +
> > +               if (err == -ENOENT)
> > +                       continue;
> > +               if (err)
> > +                       goto free_buf;
> > +
> > +               if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
> > +                       err = -EFAULT;
> > +                       goto free_buf;
> > +               }
> > +
> > +               prev_key = key;
> > +               cp_len += elem_size;
> > +       }
> > +
> > +       if (err == -ENOENT && cp_len)
> > +               err = 0;
> > +       if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
> > +                   (!first_key && copy_to_user(ukey, key, map->key_size))))
> > +               err = -EFAULT;
> > +free_buf:
> > +       kfree(buf);
> > +err_put:
> > +       fdput(f);
> > +       return err;
> > +}
> > +
> >  #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
> >
> >  static int map_lookup_and_delete_elem(union bpf_attr *attr)
> > @@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> >         case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
> >                 err = map_lookup_and_delete_elem(&attr);
> >                 break;
> > +       case BPF_MAP_DUMP:
> > +               err = map_dump(&attr);
> > +               break;
> >         default:
> >                 err = -EINVAL;
> >                 break;
> > --
> > 2.22.0.657.g960e92d24f-goog
> >

Thanks for reviewing it!!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 22:44     ` Brian Vazquez
@ 2019-07-24 23:04       ` Song Liu
  2019-07-25 23:25         ` Brian Vazquez
  0 siblings, 1 reply; 31+ messages in thread
From: Song Liu @ 2019-07-24 23:04 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 3:44 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 2:40 PM Song Liu <liu.song.a23@gmail.com> wrote:
> >
> > On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
> > >
> > > This introduces a new command to retrieve multiple number of entries
> > > from a bpf map, wrapping the existing bpf methods:
> > > map_get_next_key and map_lookup_elem
> > >
> > > To start dumping the map from the beginning you must specify NULL as
> > > the prev_key.
> > >
> > > The new API returns 0 when it successfully copied all the elements
> > > requested or it copied less because there weren't more elements to
> > > retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.
> > >
> > > On a successful call buf and buf_len will contain correct data and in
> > > case prev_key was provided (not for the first walk, since prev_key is
> > > NULL) it will contain the last_key copied into the prev_key which will
> > > simplify next call.
> > >
> > > Only when it can't find a single element it will return -ENOENT meaning
> > > that the map has been entirely walked. When an error is return buf,
> > > buf_len and prev_key shouldn't be read nor used.
> > >
> > > Because maps can be called from userspace and kernel code, this function
> > > can have a scenario where the next_key was found but by the time we
> > > try to retrieve the value the element is not there, in this case the
> > > function continues and tries to get a new next_key value, skipping the
> > > deleted key. If at some point the function find itself trap in a loop,
> > > it will return -EINTR.
> > >
> > > The function will try to fit as much as possible in the buf provided and
> > > will return -EINVAL if buf_len is smaller than elem_size.
> > >
> > > QUEUE and STACK maps are not supported.
> > >
> > > Note that map_dump doesn't guarantee that reading the entire table is
> > > consistent since this function is always racing with kernel and user code
> > > but the same behaviour is found when the entire table is walked using
> > > the current interfaces: map_get_next_key + map_lookup_elem.
> > > It is also important to note that with  a locked map, the lock is grabbed
> > > for 1 entry at the time, meaning that the returned buf might or might not
> > > be consistent.
> > >
> > > Suggested-by: Stanislav Fomichev <sdf@google.com>
> > > Signed-off-by: Brian Vazquez <brianvv@google.com>
> > > ---
> > >  include/uapi/linux/bpf.h |   9 +++
> > >  kernel/bpf/syscall.c     | 117 +++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 126 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index fa1c753dcdbc7..66dab5385170d 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -106,6 +106,7 @@ enum bpf_cmd {
> > >         BPF_TASK_FD_QUERY,
> > >         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> > >         BPF_MAP_FREEZE,
> > > +       BPF_MAP_DUMP,
> > >  };
> > >
> > >  enum bpf_map_type {
> > > @@ -388,6 +389,14 @@ union bpf_attr {
> > >                 __u64           flags;
> > >         };
> > >
> > > +       struct { /* struct used by BPF_MAP_DUMP command */
> > > +               __aligned_u64   prev_key;
> > > +               __aligned_u64   buf;
> > > +               __aligned_u64   buf_len; /* input/output: len of buf */
> > > +               __u64           flags;
> >
> > Please add explanation of flags here.
>
> got it!
>
> > Also, we need to update the
> > comments of BPF_F_LOCK for BPF_MAP_DUMP.
>
> What do you mean? I didn't get this part.

I meant that the current comment says BPF_F_LOCK is for BPF_MAP_UPDATE_ELEM.
But it is also used by BPF_MAP_LOOKUP_ELEM and BPF_MAP_DUMP, so the
current comment is not accurate either.

Maybe fix it while you are at it?
>
> >
> > > +               __u32           map_fd;
> > > +       } dump;
> > > +
> > >         struct { /* anonymous struct used by BPF_PROG_LOAD command */
> > >                 __u32           prog_type;      /* one of enum bpf_prog_type */
> > >                 __u32           insn_cnt;
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 86cdc2f7bb56e..0c35505aa219f 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
> > >         return err;
> > >  }
> > >
> > > +/* last field in 'union bpf_attr' used by this command */
> > > +#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
> > > +
> > > +static int map_dump(union bpf_attr *attr)
> > > +{
> > > +       void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
> > > +       void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
> > > +       u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
> > > +       int ufd = attr->dump.map_fd;
> > > +       struct bpf_map *map;
> > > +       void *buf, *prev_key, *key, *value;
> > > +       u32 value_size, elem_size, buf_len, cp_len;
> > > +       struct fd f;
> > > +       int err;
> > > +       bool first_key = false;
> > > +
> > > +       if (CHECK_ATTR(BPF_MAP_DUMP))
> > > +               return -EINVAL;
> > > +
> > > +       if (attr->dump.flags & ~BPF_F_LOCK)
> > > +               return -EINVAL;
> > > +
> > > +       f = fdget(ufd);
> > > +       map = __bpf_map_get(f);
> > > +       if (IS_ERR(map))
> > > +               return PTR_ERR(map);
> > > +       if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
> > > +               err = -EPERM;
> > > +               goto err_put;
> > > +       }
> > > +
> > > +       if ((attr->dump.flags & BPF_F_LOCK) &&
> > > +           !map_value_has_spin_lock(map)) {
> > > +               err = -EINVAL;
> > > +               goto err_put;
> > > +       }
> >
> > We can share these lines with map_lookup_elem(). Maybe
> > add another helper function?
>
> Which are the lines you are referring to? the dump.flags? It makes
> sense so that way when a new flag is added you only need to modify
> them in one spot.

I think I misread it; attr->dump.flags is not the same as attr->flags.

So never mind.

>
> > > +
> > > +       if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> > > +           map->map_type == BPF_MAP_TYPE_STACK) {
> > > +               err = -ENOTSUPP;
> > > +               goto err_put;
> > > +       }
> > > +
> > > +       value_size = bpf_map_value_size(map);
> > > +
> > > +       err = get_user(buf_len, ubuf_len);
> > > +       if (err)
> > > +               goto err_put;
> > > +
> > > +       elem_size = map->key_size + value_size;
> > > +       if (buf_len < elem_size) {
> > > +               err = -EINVAL;
> > > +               goto err_put;
> > > +       }
> > > +
> > > +       if (ukey) {
> > > +               prev_key = __bpf_copy_key(ukey, map->key_size);
> > > +               if (IS_ERR(prev_key)) {
> > > +                       err = PTR_ERR(prev_key);
> > > +                       goto err_put;
> > > +               }
> > > +       } else {
> > > +               prev_key = NULL;
> > > +               first_key = true;
> > > +       }
> > > +
> > > +       err = -ENOMEM;
> > > +       buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
> > > +       if (!buf)
> > > +               goto err_put;
> > > +
> > > +       key = buf;
> > > +       value = key + map->key_size;
> > > +       for (cp_len = 0; cp_len + elem_size <= buf_len;) {
> > > +               if (signal_pending(current)) {
> > > +                       err = -EINTR;
> > > +                       break;
> > > +               }
> > > +
> > > +               rcu_read_lock();
> > > +               err = map->ops->map_get_next_key(map, prev_key, key);
> >
> > If prev_key is deleted before map_get_next_key(), we get the first key
> > again. This is pretty weird.
>
> Yes, I know. But note that the current scenario happens even for the
> old interface (imagine you are walking a map from userspace and you
> tried get_next_key the prev_key was removed, you will start again from
> the beginning without noticing it).
> I tried to sent a patch in the past but I was missing some context:
> before NULL was used to get the very first_key the interface relied in
> a random (non existent) key to retrieve the first_key in the map, and
> I was told what we still have to support that scenario.

BPF_MAP_DUMP is slightly different, as you may return the first key
multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
don't have to support legacy scenarios.

Since BPF_MAP_DUMP keeps a list of elements, it is possible to try
to look up previous keys. Would something along this direction work?

Thanks,
Song

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/
  2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
  2019-07-24 21:41   ` Song Liu
@ 2019-07-24 23:10   ` Andrii Nakryiko
  2019-07-25 23:27     ` Brian Vazquez
  1 sibling, 1 reply; 31+ messages in thread
From: Andrii Nakryiko @ 2019-07-24 23:10 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> Adds bpf_attr.dump structure to libbpf.
>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  tools/include/uapi/linux/bpf.h | 9 +++++++++
>  tools/lib/bpf/libbpf.map       | 2 ++
>  2 files changed, 11 insertions(+)
>
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 4e455018da65f..e127f16e4e932 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -106,6 +106,7 @@ enum bpf_cmd {
>         BPF_TASK_FD_QUERY,
>         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
>         BPF_MAP_FREEZE,
> +       BPF_MAP_DUMP,
>  };
>
>  enum bpf_map_type {
> @@ -388,6 +389,14 @@ union bpf_attr {
>                 __u64           flags;
>         };
>
> +       struct { /* struct used by BPF_MAP_DUMP command */
> +               __aligned_u64   prev_key;
> +               __aligned_u64   buf;
> +               __aligned_u64   buf_len; /* input/output: len of buf */
> +               __u64           flags;
> +               __u32           map_fd;
> +       } dump;
> +
>         struct { /* anonymous struct used by BPF_PROG_LOAD command */
>                 __u32           prog_type;      /* one of enum bpf_prog_type */
>                 __u32           insn_cnt;
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index f9d316e873d8d..cac3723d5c45c 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -183,4 +183,6 @@ LIBBPF_0.0.4 {

LIBBPF_0.0.4 is closed, this needs to go into LIBBPF_0.0.5.

>                 perf_buffer__new;
>                 perf_buffer__new_raw;
>                 perf_buffer__poll;
> +               bpf_map_dump;
> +               bpf_map_dump_flags;

As a general rule, please keep those lists of functions in alphabetical order.

>  } LIBBPF_0.0.3;
> --
> 2.22.0.657.g960e92d24f-goog
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-24 23:04       ` Song Liu
@ 2019-07-25 23:25         ` Brian Vazquez
  2019-07-25 23:54           ` Alexei Starovoitov
  0 siblings, 1 reply; 31+ messages in thread
From: Brian Vazquez @ 2019-07-25 23:25 UTC (permalink / raw)
  To: Song Liu
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

> > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > again. This is pretty weird.
> >
> > Yes, I know. But note that the current scenario happens even for the
> > old interface (imagine you are walking a map from userspace and you
> > tried get_next_key the prev_key was removed, you will start again from
> > the beginning without noticing it).
> > I tried to sent a patch in the past but I was missing some context:
> > before NULL was used to get the very first_key the interface relied in
> > a random (non existent) key to retrieve the first_key in the map, and
> > I was told what we still have to support that scenario.
>
> BPF_MAP_DUMP is slightly different, as you may return the first key
> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> don't have to support legacy scenarios.
>
> Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> to look up previous keys. Would something down this direction work?

I've been thinking about it, and I think first we need a way to detect
that, because the key was not present, we got the first_key instead:

- One solution I had in mind was to explicitly ask for the first key
with map_get_next_key(map, NULL, first_key) and, while walking the map,
check that map_get_next_key(map, prev_key, key) doesn't return that
same key. This could be done using memcmp (rough sketch below).
- Discussing with Stan, he mentioned that another option is to support
a flag in map_get_next_key to tell it that we want an error
instead of the first_key.
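
Something like this minimal userspace sketch of the first option
(purely illustrative, the function name and loop are mine, and the
check is only approximate since the map's first key can itself change):

#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <bpf/bpf.h>

static int walk_map(int fd, void *key, void *next_key, size_t key_size)
{
        void *first_key = malloc(key_size);
        int err;

        if (!first_key)
                return -ENOMEM;

        err = bpf_map_get_next_key(fd, NULL, first_key);
        if (err)
                goto out;                       /* empty map */

        memcpy(key, first_key, key_size);
        for (;;) {
                /* ... look up and consume 'key' here ... */
                err = bpf_map_get_next_key(fd, key, next_key);
                if (err)
                        break;                  /* no successor: walk finished */
                if (!memcmp(next_key, first_key, key_size)) {
                        err = -EBUSY;           /* 'key' vanished, we wrapped around */
                        break;
                }
                memcpy(key, next_key, key_size);
        }
out:
        free(first_key);
        return err;
}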

After detecting the problem we also need to define what we want to do;
here are some options:

a) Return the error to the caller.
b) Retry with previous keys, if any (this would be limited to the keys
that we have traversed so far in this dump call).
c) Continue with the next entries in the map. Arrays are easy, just get
the next valid key (starting at i+1), but hashmaps might be difficult,
since starting at the next bucket could potentially skip some keys that
were concurrently added to the same bucket where the key used to be, and
starting at the same bucket could lead us to return repeated elements.

Or maybe we could support those 3 cases via flags and let the caller
decide which one to use?

Let me know what are your thoughts.

Thanks,
Brian

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/
  2019-07-24 23:10   ` Andrii Nakryiko
@ 2019-07-25 23:27     ` Brian Vazquez
  0 siblings, 0 replies; 31+ messages in thread
From: Brian Vazquez @ 2019-07-25 23:27 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Wed, Jul 24, 2019 at 4:10 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
> >
> > Adds bpf_attr.dump structure to libbpf.
> >
> > Suggested-by: Stanislav Fomichev <sdf@google.com>
> > Signed-off-by: Brian Vazquez <brianvv@google.com>
> > ---
> >  tools/include/uapi/linux/bpf.h | 9 +++++++++
> >  tools/lib/bpf/libbpf.map       | 2 ++
> >  2 files changed, 11 insertions(+)
> >
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 4e455018da65f..e127f16e4e932 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -106,6 +106,7 @@ enum bpf_cmd {
> >         BPF_TASK_FD_QUERY,
> >         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> >         BPF_MAP_FREEZE,
> > +       BPF_MAP_DUMP,
> >  };
> >
> >  enum bpf_map_type {
> > @@ -388,6 +389,14 @@ union bpf_attr {
> >                 __u64           flags;
> >         };
> >
> > +       struct { /* struct used by BPF_MAP_DUMP command */
> > +               __aligned_u64   prev_key;
> > +               __aligned_u64   buf;
> > +               __aligned_u64   buf_len; /* input/output: len of buf */
> > +               __u64           flags;
> > +               __u32           map_fd;
> > +       } dump;
> > +
> >         struct { /* anonymous struct used by BPF_PROG_LOAD command */
> >                 __u32           prog_type;      /* one of enum bpf_prog_type */
> >                 __u32           insn_cnt;
> > diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> > index f9d316e873d8d..cac3723d5c45c 100644
> > --- a/tools/lib/bpf/libbpf.map
> > +++ b/tools/lib/bpf/libbpf.map
> > @@ -183,4 +183,6 @@ LIBBPF_0.0.4 {
>
> LIBBPF_0.0.4 is closed, this needs to go into LIBBPF_0.0.5.

Sorry, my bad. I didn't look closely at the rebase, so I got this wrong.

>
> >                 perf_buffer__new;
> >                 perf_buffer__new_raw;
> >                 perf_buffer__poll;
> > +               bpf_map_dump;
> > +               bpf_map_dump_flags;
>
> As the general rule, please keep those lists of functions in alphabetical order.

right.

I will fix it in the next version, thanks for reviewing it!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-25 23:25         ` Brian Vazquez
@ 2019-07-25 23:54           ` Alexei Starovoitov
  2019-07-26  1:02             ` Willem de Bruijn
  2019-07-26  1:24             ` Brian Vazquez
  0 siblings, 2 replies; 31+ messages in thread
From: Alexei Starovoitov @ 2019-07-25 23:54 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Song Liu, Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> > > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > > again. This is pretty weird.
> > >
> > > Yes, I know. But note that the current scenario happens even for the
> > > old interface (imagine you are walking a map from userspace and you
> > > tried get_next_key the prev_key was removed, you will start again from
> > > the beginning without noticing it).
> > > I tried to sent a patch in the past but I was missing some context:
> > > before NULL was used to get the very first_key the interface relied in
> > > a random (non existent) key to retrieve the first_key in the map, and
> > > I was told what we still have to support that scenario.
> >
> > BPF_MAP_DUMP is slightly different, as you may return the first key
> > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> > don't have to support legacy scenarios.
> >
> > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> > to look up previous keys. Would something down this direction work?
> 
> I've been thinking about it and I think first we need a way to detect
> that since key was not present we got the first_key instead:
> 
> - One solution I had in mind was to explicitly asked for the first key
> with map_get_next_key(map, NULL, first_key) and while walking the map
> check that map_get_next_key(map, prev_key, key) doesn't return the
> same key. This could be done using memcmp.
> - Discussing with Stan, he mentioned that another option is to support
> a flag in map_get_next_key to let it know that we want an error
> instead of the first_key.
> 
> After detecting the problem we also need to define what we want to do,
> here some options:
> 
> a) Return the error to the caller
> b) Try with previous keys if any (which be limited to the keys that we
> have traversed so far in this dump call)
> c) continue with next entries in the map. array is easy just get the
> next valid key (starting on i+1), but hmap might be difficult since
> starting on the next bucket could potentially skip some keys that were
> concurrently added to the same bucket where key used to be, and
> starting on the same bucket could lead us to return repeated elements.
> 
> Or maybe we could support those 3 cases via flags and let the caller
> decide which one to use?

this type of indecision is the reason why I wasn't excited about
batch dumping in the first place and gave a 'soft yes' when Stan
mentioned it during the lsf/mm/bpf uconf.
We probably shouldn't do it.
It feels like this map_dump makes the api more complex and doesn't
really give much benefit to the user other than making large map dumps
faster. I think we gotta solve this problem differently.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-25 23:54           ` Alexei Starovoitov
@ 2019-07-26  1:02             ` Willem de Bruijn
  2019-07-26  1:24             ` Brian Vazquez
  1 sibling, 0 replies; 31+ messages in thread
From: Willem de Bruijn @ 2019-07-26  1:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brian Vazquez, Song Liu, Brian Vazquez, Alexei Starovoitov,
	Daniel Borkmann, David S . Miller, Stanislav Fomichev,
	Petar Penkov, open list, Networking, bpf

On Thu, Jul 25, 2019 at 7:54 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> > > > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > > > again. This is pretty weird.
> > > >
> > > > Yes, I know. But note that the current scenario happens even for the
> > > > old interface (imagine you are walking a map from userspace and you
> > > > tried get_next_key the prev_key was removed, you will start again from
> > > > the beginning without noticing it).
> > > > I tried to sent a patch in the past but I was missing some context:
> > > > before NULL was used to get the very first_key the interface relied in
> > > > a random (non existent) key to retrieve the first_key in the map, and
> > > > I was told what we still have to support that scenario.
> > >
> > > BPF_MAP_DUMP is slightly different, as you may return the first key
> > > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> > > don't have to support legacy scenarios.
> > >
> > > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> > > to look up previous keys. Would something down this direction work?
> >
> > I've been thinking about it and I think first we need a way to detect
> > that since key was not present we got the first_key instead:
> >
> > - One solution I had in mind was to explicitly asked for the first key
> > with map_get_next_key(map, NULL, first_key) and while walking the map
> > check that map_get_next_key(map, prev_key, key) doesn't return the
> > same key. This could be done using memcmp.
> > - Discussing with Stan, he mentioned that another option is to support
> > a flag in map_get_next_key to let it know that we want an error
> > instead of the first_key.
> >
> > After detecting the problem we also need to define what we want to do,
> > here some options:
> >
> > a) Return the error to the caller
> > b) Try with previous keys if any (which be limited to the keys that we
> > have traversed so far in this dump call)
> > c) continue with next entries in the map. array is easy just get the
> > next valid key (starting on i+1), but hmap might be difficult since
> > starting on the next bucket could potentially skip some keys that were
> > concurrently added to the same bucket where key used to be, and
> > starting on the same bucket could lead us to return repeated elements.
> >
> > Or maybe we could support those 3 cases via flags and let the caller
> > decide which one to use?
>
> this type of indecision is the reason why I wasn't excited about
> batch dumping in the first place and gave 'soft yes' when Stan
> mentioned it during lsf/mm/bpf uconf.
> We probably shouldn't do it.
> It feels this map_dump makes api more complex and doesn't really
> give much benefit to the user other than large map dump becomes faster.
> I think we gotta solve this problem differently.

Multiple variants with flags indeed makes the API complex. I think the
kernel should expose only the simplest, most obvious behavior that
allows the application to recover. In this case, that sounds like
option (a) and restart.

In practice, the common use case is to allocate enough user memory to
read an entire table in one go, in which case the entire issue is
moot.

The cycle savings of dump are significant for large tables. I'm not
sure how we achieve that differently and even simpler? We originally
looked at shared memory, but that is obviously much more complex.
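
For reference, that common case is tiny with the proposed API; a
minimal sketch (assuming the bpf_map_dump() wrapper from patch 4 with
the in/out buf_len in bytes, as used in the selftest):

#include <errno.h>
#include <stdlib.h>
#include <linux/types.h>
#include <bpf/bpf.h>

static int dump_whole_map(int fd, __u32 max_entries, __u32 elem_size)
{
        __u32 buf_len = max_entries * elem_size;
        void *buf = malloc(buf_len);
        int err;

        if (!buf)
                return -ENOMEM;

        err = bpf_map_dump(fd, NULL, buf, &buf_len);
        /* on success buf_len is updated to the bytes actually copied;
         * a real caller would consume buf here before freeing it
         */
        free(buf);
        return err;
}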

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-25 23:54           ` Alexei Starovoitov
  2019-07-26  1:02             ` Willem de Bruijn
@ 2019-07-26  1:24             ` Brian Vazquez
  2019-07-26  1:47               ` Alexei Starovoitov
  1 sibling, 1 reply; 31+ messages in thread
From: Brian Vazquez @ 2019-07-26  1:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Song Liu, Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf

On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> > > > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > > > again. This is pretty weird.
> > > >
> > > > Yes, I know. But note that the current scenario happens even for the
> > > > old interface (imagine you are walking a map from userspace and you
> > > > tried get_next_key the prev_key was removed, you will start again from
> > > > the beginning without noticing it).
> > > > I tried to sent a patch in the past but I was missing some context:
> > > > before NULL was used to get the very first_key the interface relied in
> > > > a random (non existent) key to retrieve the first_key in the map, and
> > > > I was told what we still have to support that scenario.
> > >
> > > BPF_MAP_DUMP is slightly different, as you may return the first key
> > > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> > > don't have to support legacy scenarios.
> > >
> > > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> > > to look up previous keys. Would something down this direction work?
> >
> > I've been thinking about it and I think first we need a way to detect
> > that since key was not present we got the first_key instead:
> >
> > - One solution I had in mind was to explicitly asked for the first key
> > with map_get_next_key(map, NULL, first_key) and while walking the map
> > check that map_get_next_key(map, prev_key, key) doesn't return the
> > same key. This could be done using memcmp.
> > - Discussing with Stan, he mentioned that another option is to support
> > a flag in map_get_next_key to let it know that we want an error
> > instead of the first_key.
> >
> > After detecting the problem we also need to define what we want to do,
> > here some options:
> >
> > a) Return the error to the caller
> > b) Try with previous keys if any (which be limited to the keys that we
> > have traversed so far in this dump call)
> > c) continue with next entries in the map. array is easy just get the
> > next valid key (starting on i+1), but hmap might be difficult since
> > starting on the next bucket could potentially skip some keys that were
> > concurrently added to the same bucket where key used to be, and
> > starting on the same bucket could lead us to return repeated elements.
> >
> > Or maybe we could support those 3 cases via flags and let the caller
> > decide which one to use?
>
> this type of indecision is the reason why I wasn't excited about
> batch dumping in the first place and gave 'soft yes' when Stan
> mentioned it during lsf/mm/bpf uconf.
> We probably shouldn't do it.
> It feels this map_dump makes api more complex and doesn't really
> give much benefit to the user other than large map dump becomes faster.
> I think we gotta solve this problem differently.

Some users are working around the dumping problems with the existing
api by creating a bpf_map_get_next_key_and_delete userspace function
(see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/),
which in my opinion is actually a good idea. The only problem with
that is that calling bpf_map_get_next_key(fd, key, next_key) and then
bpf_map_delete_elem(fd, key) from userspace is racing with kernel code,
and it might lose some information when deleting.
We could then do map_dump_and_delete using that idea, but in the kernel,
where we could better handle the race condition. In that scenario,
even if we retrieve the same key it will contain different info (the
delta between the old and new value). Would that work?
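
For context, the userspace workaround from that post boils down to
something like this (rough sketch, names are mine; the race noted
above remains because the delete happens in a separate syscall):

#include <string.h>
#include <bpf/bpf.h>

static void drain_map(int fd, void *key, void *next_key, void *value,
                      size_t key_size,
                      void (*consume)(const void *k, const void *v))
{
        int last;

        if (bpf_map_get_next_key(fd, NULL, key))
                return;                         /* empty map */

        for (;;) {
                /* fetch the successor *before* deleting the current key so
                 * the walk doesn't restart from the beginning of the map
                 */
                last = bpf_map_get_next_key(fd, key, next_key);

                if (!bpf_map_lookup_elem(fd, key, value))
                        consume(key, value);
                bpf_map_delete_elem(fd, key);   /* still racy vs concurrent updates */

                if (last)
                        break;                  /* no successor left */
                memcpy(key, next_key, key_size);
        }
}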

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-26  1:24             ` Brian Vazquez
@ 2019-07-26  1:47               ` Alexei Starovoitov
  2019-07-26  6:10                 ` Yonghong Song
  0 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2019-07-26  1:47 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Song Liu, Brian Vazquez, Daniel Borkmann, David S . Miller,
	Stanislav Fomichev, Willem de Bruijn, Petar Penkov, Networking,
	bpf, Yonghong Song

On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote:
>
> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> > > > > > If prev_key is deleted before map_get_next_key(), we get the first key
> > > > > > again. This is pretty weird.
> > > > >
> > > > > Yes, I know. But note that the current scenario happens even for the
> > > > > old interface (imagine you are walking a map from userspace and you
> > > > > tried get_next_key the prev_key was removed, you will start again from
> > > > > the beginning without noticing it).
> > > > > I tried to sent a patch in the past but I was missing some context:
> > > > > before NULL was used to get the very first_key the interface relied in
> > > > > a random (non existent) key to retrieve the first_key in the map, and
> > > > > I was told what we still have to support that scenario.
> > > >
> > > > BPF_MAP_DUMP is slightly different, as you may return the first key
> > > > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> > > > don't have to support legacy scenarios.
> > > >
> > > > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> > > > to look up previous keys. Would something down this direction work?
> > >
> > > I've been thinking about it and I think first we need a way to detect
> > > that since key was not present we got the first_key instead:
> > >
> > > - One solution I had in mind was to explicitly asked for the first key
> > > with map_get_next_key(map, NULL, first_key) and while walking the map
> > > check that map_get_next_key(map, prev_key, key) doesn't return the
> > > same key. This could be done using memcmp.
> > > - Discussing with Stan, he mentioned that another option is to support
> > > a flag in map_get_next_key to let it know that we want an error
> > > instead of the first_key.
> > >
> > > After detecting the problem we also need to define what we want to do,
> > > here some options:
> > >
> > > a) Return the error to the caller
> > > b) Try with previous keys if any (which be limited to the keys that we
> > > have traversed so far in this dump call)
> > > c) continue with next entries in the map. array is easy just get the
> > > next valid key (starting on i+1), but hmap might be difficult since
> > > starting on the next bucket could potentially skip some keys that were
> > > concurrently added to the same bucket where key used to be, and
> > > starting on the same bucket could lead us to return repeated elements.
> > >
> > > Or maybe we could support those 3 cases via flags and let the caller
> > > decide which one to use?
> >
> > this type of indecision is the reason why I wasn't excited about
> > batch dumping in the first place and gave 'soft yes' when Stan
> > mentioned it during lsf/mm/bpf uconf.
> > We probably shouldn't do it.
> > It feels this map_dump makes api more complex and doesn't really
> > give much benefit to the user other than large map dump becomes faster.
> > I think we gotta solve this problem differently.
>
> Some users are working around the dumping problems with the existing
> api by creating a bpf_map_get_next_key_and_delete userspace function
> (see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/)
> which in my opinion is actually a good idea. The only problem with
> that is that calling bpf_map_get_next_key(fd, key, next_key) and then
> bpf_map_delete_elem(fd, key) from userspace is racing with kernel code
> and it might lose some information when deleting.
> We could then do map_dump_and_delete using that idea but in the kernel
> where we could better handle the racing condition. In that scenario
> even if we retrieve the same key it will contain different info ( the
> delta between old and new value). Would that work?

you mean get_next+lookup+delete at once?
Sounds useful.
Yonghong has been thinking about batching api as well.

I think if we cannot figure out how to make a batch of two commands
get_next + lookup to work correctly then we need to identify/invent one
command and make batching more generic.
Like make one jumbo/compound/atomic command to be get_next+lookup+delete.
Define the semantics of this single compound command.
And then let batching be a multiplier of such a command.
In the sense that a multiplier of 1 or N should behave the same way.
No extra flags to alter the batching.
The high level description of the batch would be:
pls execute get_next,lookup,delete and repeat it N times.
or
pls execute get_next,lookup and repeat N times.
where each command action is defined to be composable.

Just a rough idea.
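
To make that a bit more concrete, here is a purely hypothetical sketch
of what one such composable unit could look like on the kernel side;
none of these helper names exist in this series, and locking/RCU and
per-map-type details are omitted. Batching would then just repeat this
unit N times, so N == 1 and N > 1 behave the same way:

/* Hypothetical sketch only, not actual kernel code. */
static int map_walk_one(struct bpf_map *map, void *prev_key, void *key,
                        void *value, u32 value_size, bool do_delete)
{
    void *ptr;
    int err;

    err = map->ops->map_get_next_key(map, prev_key, key);
    if (err)
        return err;             /* e.g. -ENOENT at the end of the map */

    ptr = map->ops->map_lookup_elem(map, key);
    if (!ptr)
        return -ENOENT;         /* key raced with a concurrent delete */
    memcpy(value, ptr, value_size);

    return do_delete ? map->ops->map_delete_elem(map, key) : 0;
}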

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-26  1:47               ` Alexei Starovoitov
@ 2019-07-26  6:10                 ` Yonghong Song
  2019-07-26 23:36                   ` Brian Vazquez
  0 siblings, 1 reply; 31+ messages in thread
From: Yonghong Song @ 2019-07-26  6:10 UTC (permalink / raw)
  To: Alexei Starovoitov, Brian Vazquez
  Cc: Song Liu, Brian Vazquez, Daniel Borkmann, David S . Miller,
	Stanislav Fomichev, Willem de Bruijn, Petar Penkov, Networking,
	bpf



On 7/25/19 6:47 PM, Alexei Starovoitov wrote:
> On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote:
>>
>> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>>
>>> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
>>>>>>> If prev_key is deleted before map_get_next_key(), we get the first key
>>>>>>> again. This is pretty weird.
>>>>>>
>>>>>> Yes, I know. But note that the current scenario happens even for the
>>>>>> old interface (imagine you are walking a map from userspace and you
>>>>>> tried get_next_key the prev_key was removed, you will start again from
>>>>>> the beginning without noticing it).
>>>>>> I tried to sent a patch in the past but I was missing some context:
>>>>>> before NULL was used to get the very first_key the interface relied in
>>>>>> a random (non existent) key to retrieve the first_key in the map, and
>>>>>> I was told what we still have to support that scenario.
>>>>>
>>>>> BPF_MAP_DUMP is slightly different, as you may return the first key
>>>>> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
>>>>> don't have to support legacy scenarios.
>>>>>
>>>>> Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
>>>>> to look up previous keys. Would something down this direction work?
>>>>
>>>> I've been thinking about it and I think first we need a way to detect
>>>> that since key was not present we got the first_key instead:
>>>>
>>>> - One solution I had in mind was to explicitly asked for the first key
>>>> with map_get_next_key(map, NULL, first_key) and while walking the map
>>>> check that map_get_next_key(map, prev_key, key) doesn't return the
>>>> same key. This could be done using memcmp.
>>>> - Discussing with Stan, he mentioned that another option is to support
>>>> a flag in map_get_next_key to let it know that we want an error
>>>> instead of the first_key.
>>>>
>>>> After detecting the problem we also need to define what we want to do,
>>>> here some options:
>>>>
>>>> a) Return the error to the caller
>>>> b) Try with previous keys if any (which be limited to the keys that we
>>>> have traversed so far in this dump call)
>>>> c) continue with next entries in the map. array is easy just get the
>>>> next valid key (starting on i+1), but hmap might be difficult since
>>>> starting on the next bucket could potentially skip some keys that were
>>>> concurrently added to the same bucket where key used to be, and
>>>> starting on the same bucket could lead us to return repeated elements.
>>>>
>>>> Or maybe we could support those 3 cases via flags and let the caller
>>>> decide which one to use?
>>>
>>> this type of indecision is the reason why I wasn't excited about
>>> batch dumping in the first place and gave 'soft yes' when Stan
>>> mentioned it during lsf/mm/bpf uconf.
>>> We probably shouldn't do it.
>>> It feels this map_dump makes api more complex and doesn't really
>>> give much benefit to the user other than large map dump becomes faster.
>>> I think we gotta solve this problem differently.
>>
>> Some users are working around the dumping problems with the existing
>> api by creating a bpf_map_get_next_key_and_delete userspace function
>> (see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/)
>> which in my opinion is actually a good idea. The only problem with
>> that is that calling bpf_map_get_next_key(fd, key, next_key) and then
>> bpf_map_delete_elem(fd, key) from userspace is racing with kernel code
>> and it might lose some information when deleting.
>> We could then do map_dump_and_delete using that idea but in the kernel
>> where we could better handle the racing condition. In that scenario
>> even if we retrieve the same key it will contain different info ( the
>> delta between old and new value). Would that work?
> 
> you mean get_next+lookup+delete at once?
> Sounds useful.
> Yonghong has been thinking about batching api as well.

In bcc, we have many instances like this:
    getting all (key, value) pairs, do some analysis and output,
    delete all keys

The implementation typically looks like
    /* to get all (key, value) pairs */
    while(bpf_get_next_key() == 0)
      bpf_map_lookup()
    /* do analysis and output */
    for (all keys)
      bpf_map_delete()
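
Spelled out with the existing libbpf wrappers, a sketch of that pattern
(assuming 4-byte keys, 8-byte values, and an illustrative fixed-size
key array) would be:

#include <stdio.h>
#include <bpf/bpf.h>

#define MAX_KEYS 100000

static __u32 keys[MAX_KEYS];

static void collect_and_clear(int map_fd)
{
    __u32 next_key, *prev = NULL, n = 0;
    long value;

    /* get all (key, value) pairs */
    while (n < MAX_KEYS &&
           bpf_map_get_next_key(map_fd, prev, &next_key) == 0) {
        keys[n] = next_key;
        if (bpf_map_lookup_elem(map_fd, &next_key, &value) == 0)
            printf("key %u value %ld\n", next_key, value); /* analysis/output */
        prev = &keys[n];
        n++;
    }

    /* delete all keys collected above; anything inserted or updated
     * after the first loop finished is lost, which is the race being
     * discussed in this thread */
    for (__u32 i = 0; i < n; i++)
        bpf_map_delete_elem(map_fd, &keys[i]);
}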

get_next+lookup+delete will definitely be useful.
batching will be even better to save the number of syscalls.

An alternative is to do batch get_next+lookup and batch delete
to achieve a similar goal as the above code.

There is a minor difference between this approach
and the above get_next+lookup+delete.
While scanning the hash map, get_next+lookup may return fewer
elements than get_next+lookup+delete, as the latter may also pick up
hash elements inserted after the operation started. But both are
inaccurate, so the difference is probably minor.

> 
> I think if we cannot figure out how to make a batch of two commands
> get_next + lookup to work correctly then we need to identify/invent one
> command and make batching more generic.

not 100% sure. It will be hard to define what "correctly" means.
For a map that is not changing, looping over (get_next, lookup) and
batch get_next+lookup should give the same results.
For a constantly changing map, not sure how to define which one
is correct. If users have concerns, they may need to just pick the one
that gives them more comfort.

> Like make one jumbo/compound/atomic command to be get_next+lookup+delete.
> Define the semantics of this single compound command.
> And then let batching to be a multiplier of such command.
> In a sense that multiplier 1 or N should be have the same way.
> No extra flags to alter the batching.
> The high level description of the batch would be:
> pls execute get_next,lookup,delete and repeat it N times.
> or
> pls execute get_next,lookup and repeat N times.
> where each command action is defined to be composable.
> 
> Just a rough idea.
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-26  6:10                 ` Yonghong Song
@ 2019-07-26 23:36                   ` Brian Vazquez
  2019-07-27  0:02                     ` Jakub Kicinski
  2019-07-27 17:54                     ` Yonghong Song
  0 siblings, 2 replies; 31+ messages in thread
From: Brian Vazquez @ 2019-07-26 23:36 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Song Liu, Brian Vazquez, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, Networking, bpf

On Thu, Jul 25, 2019 at 11:10 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/25/19 6:47 PM, Alexei Starovoitov wrote:
> > On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote:
> >>
> >> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >>>
> >>> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> >>>>>>> If prev_key is deleted before map_get_next_key(), we get the first key
> >>>>>>> again. This is pretty weird.
> >>>>>>
> >>>>>> Yes, I know. But note that the current scenario happens even for the
> >>>>>> old interface (imagine you are walking a map from userspace and you
> >>>>>> tried get_next_key the prev_key was removed, you will start again from
> >>>>>> the beginning without noticing it).
> >>>>>> I tried to sent a patch in the past but I was missing some context:
> >>>>>> before NULL was used to get the very first_key the interface relied in
> >>>>>> a random (non existent) key to retrieve the first_key in the map, and
> >>>>>> I was told what we still have to support that scenario.
> >>>>>
> >>>>> BPF_MAP_DUMP is slightly different, as you may return the first key
> >>>>> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> >>>>> don't have to support legacy scenarios.
> >>>>>
> >>>>> Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> >>>>> to look up previous keys. Would something down this direction work?
> >>>>
> >>>> I've been thinking about it and I think first we need a way to detect
> >>>> that since key was not present we got the first_key instead:
> >>>>
> >>>> - One solution I had in mind was to explicitly asked for the first key
> >>>> with map_get_next_key(map, NULL, first_key) and while walking the map
> >>>> check that map_get_next_key(map, prev_key, key) doesn't return the
> >>>> same key. This could be done using memcmp.
> >>>> - Discussing with Stan, he mentioned that another option is to support
> >>>> a flag in map_get_next_key to let it know that we want an error
> >>>> instead of the first_key.
> >>>>
> >>>> After detecting the problem we also need to define what we want to do,
> >>>> here some options:
> >>>>
> >>>> a) Return the error to the caller
> >>>> b) Try with previous keys if any (which be limited to the keys that we
> >>>> have traversed so far in this dump call)
> >>>> c) continue with next entries in the map. array is easy just get the
> >>>> next valid key (starting on i+1), but hmap might be difficult since
> >>>> starting on the next bucket could potentially skip some keys that were
> >>>> concurrently added to the same bucket where key used to be, and
> >>>> starting on the same bucket could lead us to return repeated elements.
> >>>>
> >>>> Or maybe we could support those 3 cases via flags and let the caller
> >>>> decide which one to use?
> >>>
> >>> this type of indecision is the reason why I wasn't excited about
> >>> batch dumping in the first place and gave 'soft yes' when Stan
> >>> mentioned it during lsf/mm/bpf uconf.
> >>> We probably shouldn't do it.
> >>> It feels this map_dump makes api more complex and doesn't really
> >>> give much benefit to the user other than large map dump becomes faster.
> >>> I think we gotta solve this problem differently.
> >>
> >> Some users are working around the dumping problems with the existing
> >> api by creating a bpf_map_get_next_key_and_delete userspace function
> >> (see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/)
> >> which in my opinion is actually a good idea. The only problem with
> >> that is that calling bpf_map_get_next_key(fd, key, next_key) and then
> >> bpf_map_delete_elem(fd, key) from userspace is racing with kernel code
> >> and it might lose some information when deleting.
> >> We could then do map_dump_and_delete using that idea but in the kernel
> >> where we could better handle the racing condition. In that scenario
> >> even if we retrieve the same key it will contain different info ( the
> >> delta between old and new value). Would that work?
> >
> > you mean get_next+lookup+delete at once?
> > Sounds useful.
> > Yonghong has been thinking about batching api as well.
>
> In bcc, we have many instances like this:
>     getting all (key value) pairs, do some analysis and output,
>     delete all keys
>
> The implementation typically like
>     /* to get all (key, value) pairs */
>     while(bpf_get_next_key() == 0)
>       bpf_map_lookup()
>     /* do analysis and output */
>     for (all keys)
>       bpf_map_delete()

If you do that in a map that is being modified while you are doing the
analysis and output, you will lose some new data by deleting the keys,
right?

> get_next+lookup+delete will be definitely useful.
> batching will be even better to save the number of syscalls.
>
> An alternative is to do batch get_next+lookup and batch delete
> to achieve similar goal as the above code.

What I mentioned above is what makes me think that, with deletion
involved, it'd be better to perform these 3 operations at once:
get_next+lookup+delete as a jumbo/atomic command, and batch them later?

>
> There is a minor difference between this approach
> and the above get_next+lookup+delete.
> During scanning the hash map, get_next+lookup may get less number
> of elements compared to get_next+lookup+delete as the latter
> may have more later-inserted hash elements after the operation
> start. But both are inaccurate, so probably the difference
> is minor.
>
> >
> > I think if we cannot figure out how to make a batch of two commands
> > get_next + lookup to work correctly then we need to identify/invent one
> > command and make batching more generic.
>
> not 100% sure. It will be hard to define what is "correctly".

I agree, it'll be hard to define what the right behavior is.

> For not changing map, looping of (get_next, lookup) and batch
> get_next+lookup should have the same results.

This is true for the api I'm presenting; the only thing I was
missing was what to do for changing maps to avoid the weird scenario
(getting the first key due to a concurrent deletion). And, in my opinion,
the way to go should be what Willem also supported: return the error to
the caller and restart the dumping. I could do this with the existing
code just by detecting that we did provide a prev_key but got the
first_key instead of the next_key, or even implement a new function if
you want me to.

> For constant changing loops, not sure how to define which one
> is correct. If users have concerns, they may need to just pick one
> which gives them more comfort.
>
> > Like make one jumbo/compound/atomic command to be get_next+lookup+delete.
> > Define the semantics of this single compound command.
> > And then let batching to be a multiplier of such command.
> > In a sense that multiplier 1 or N should be have the same way.
> > No extra flags to alter the batching.
> > The high level description of the batch would be:
> > pls execute get_next,lookup,delete and repeat it N times.
> > or
> > pls execute get_next,lookup and repeat N times.

But any attempt to do get_next+lookup will have the same problem with
deletions, right?

I don't see how we could do it more consistently than what I'm
proposing. Let's just support one case: report an error if the
prev_key was not found, instead of retrieving the first_key. Would that
work?

> > where each command action is defined to be composable.
> >
> > Just a rough idea.
> >

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-26 23:36                   ` Brian Vazquez
@ 2019-07-27  0:02                     ` Jakub Kicinski
  2019-07-27 17:54                     ` Yonghong Song
  1 sibling, 0 replies; 31+ messages in thread
From: Jakub Kicinski @ 2019-07-27  0:02 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Yonghong Song, Alexei Starovoitov, Song Liu, Brian Vazquez,
	Daniel Borkmann, David S . Miller, Stanislav Fomichev,
	Willem de Bruijn, Petar Penkov, Networking, bpf

On Fri, 26 Jul 2019 16:36:19 -0700, Brian Vazquez wrote:
> > In bcc, we have many instances like this:
> >     getting all (key value) pairs, do some analysis and output,
> >     delete all keys
> >
> > The implementation typically like
> >     /* to get all (key, value) pairs */
> >     while(bpf_get_next_key() == 0)
> >       bpf_map_lookup()
> >     /* do analysis and output */
> >     for (all keys)
> >       bpf_map_delete()  
> 
> If you do that in a map that is being modified while you are doing the
> analysis and output, you will lose some new data by deleting the keys,
> right?
> 
> > get_next+lookup+delete will be definitely useful.
> > batching will be even better to save the number of syscalls.
> >
> > An alternative is to do batch get_next+lookup and batch delete
> > to achieve similar goal as the above code.  
> 
> What I mentioned above is what it makes me think that with the
> deletion it'd be better if we perform these 3 operations at once:
> get_next+lookup+delete in a jumbo/atomic command and batch them later?

Hm. The lookup+delete is only "atomic" if we insert an RCU sync in
between, right? The elements' lifetime is protected by RCU.

	CPU 1				CPU 2
					val = lookup(map, key)				
	val = lookup(map, key)
	delete(map, key)
	dump(val)
					val->counter++

So we'd need to walk the hash table, disconnect all lists from the
buckets and splice them onto one combined list, then synchronize_rcu(),
and only after that can we dump the elements and free them.
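
A very rough sketch of that ordering (the structures and helpers here
are made up for illustration; the real hashtab uses hlist_nulls chains
and per-bucket locks):

/* Hypothetical sketch only, not actual bpf/hashtab.c code. */
static void drain_dump_and_free(struct fake_htab *htab)
{
    struct fake_elem *e, *tmp;
    LIST_HEAD(drained);
    u32 i;

    /* 1. Under each bucket lock, move the whole chain onto one combined
     *    private list; RCU readers may still hold pointers to these
     *    elements. */
    for (i = 0; i < htab->n_buckets; i++) {
        spin_lock(&htab->buckets[i].lock);
        list_splice_init(&htab->buckets[i].head, &drained);
        spin_unlock(&htab->buckets[i].lock);
    }

    /* 2. Wait for all in-flight readers (e.g. BPF programs still
     *    updating a value they looked up) to go away. */
    synchronize_rcu();

    /* 3. Only now is it safe to copy values out to the dump buffer
     *    and free the elements. */
    list_for_each_entry_safe(e, tmp, &drained, node) {
        copy_elem_to_dump_buf(e);   /* hypothetical helper */
        list_del(&e->node);
        kfree(e);
    }
}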

> > There is a minor difference between this approach
> > and the above get_next+lookup+delete.
> > During scanning the hash map, get_next+lookup may get less number
> > of elements compared to get_next+lookup+delete as the latter
> > may have more later-inserted hash elements after the operation
> > start. But both are inaccurate, so probably the difference
> > is minor.

> > For not changing map, looping of (get_next, lookup) and batch
> > get_next+lookup should have the same results.  
> 
> This is true for the api I'm presenting the only think that I was
> missing was what to do for changing maps to avoid the weird scenario
> (getting the first key due a concurrent deletion). And, in my opinion
> the way to go should be what also Willem supported: return the err to
> the caller and restart the dumping. I could do this with existing code
> just by detecting that we do provide a prev_key and got the first_key
> instead of the next_key or even implement a new function if you want
> to.

My knee-jerk reaction to Willem's proposal was to nitpick that sizing
the dump space to the map's max entries doesn't guarantee all entries
will fit if an entry can be removed and re-added in a different place
while the dumper proceeds.

If we keep elements disconnected from the hash array _without_
decrementing the element count until synchronize_rcu() completes, we have
that guarantee, but OTOH it may not be possible to add any new entries
from the datapath in the meantime, so that doesn't sound great either :/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  2019-07-26 23:36                   ` Brian Vazquez
  2019-07-27  0:02                     ` Jakub Kicinski
@ 2019-07-27 17:54                     ` Yonghong Song
  1 sibling, 0 replies; 31+ messages in thread
From: Yonghong Song @ 2019-07-27 17:54 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Alexei Starovoitov, Song Liu, Brian Vazquez, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, Networking, bpf



On 7/26/19 4:36 PM, Brian Vazquez wrote:
> On Thu, Jul 25, 2019 at 11:10 PM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 7/25/19 6:47 PM, Alexei Starovoitov wrote:
>>> On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote:
>>>>
>>>> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>
>>>>> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
>>>>>>>>> If prev_key is deleted before map_get_next_key(), we get the first key
>>>>>>>>> again. This is pretty weird.
>>>>>>>>
>>>>>>>> Yes, I know. But note that the current scenario happens even for the
>>>>>>>> old interface (imagine you are walking a map from userspace and you
>>>>>>>> tried get_next_key the prev_key was removed, you will start again from
>>>>>>>> the beginning without noticing it).
>>>>>>>> I tried to sent a patch in the past but I was missing some context:
>>>>>>>> before NULL was used to get the very first_key the interface relied in
>>>>>>>> a random (non existent) key to retrieve the first_key in the map, and
>>>>>>>> I was told what we still have to support that scenario.
>>>>>>>
>>>>>>> BPF_MAP_DUMP is slightly different, as you may return the first key
>>>>>>> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
>>>>>>> don't have to support legacy scenarios.
>>>>>>>
>>>>>>> Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
>>>>>>> to look up previous keys. Would something down this direction work?
>>>>>>
>>>>>> I've been thinking about it and I think first we need a way to detect
>>>>>> that since key was not present we got the first_key instead:
>>>>>>
>>>>>> - One solution I had in mind was to explicitly asked for the first key
>>>>>> with map_get_next_key(map, NULL, first_key) and while walking the map
>>>>>> check that map_get_next_key(map, prev_key, key) doesn't return the
>>>>>> same key. This could be done using memcmp.
>>>>>> - Discussing with Stan, he mentioned that another option is to support
>>>>>> a flag in map_get_next_key to let it know that we want an error
>>>>>> instead of the first_key.
>>>>>>
>>>>>> After detecting the problem we also need to define what we want to do,
>>>>>> here some options:
>>>>>>
>>>>>> a) Return the error to the caller
>>>>>> b) Try with previous keys if any (which be limited to the keys that we
>>>>>> have traversed so far in this dump call)
>>>>>> c) continue with next entries in the map. array is easy just get the
>>>>>> next valid key (starting on i+1), but hmap might be difficult since
>>>>>> starting on the next bucket could potentially skip some keys that were
>>>>>> concurrently added to the same bucket where key used to be, and
>>>>>> starting on the same bucket could lead us to return repeated elements.
>>>>>>
>>>>>> Or maybe we could support those 3 cases via flags and let the caller
>>>>>> decide which one to use?
>>>>>
>>>>> this type of indecision is the reason why I wasn't excited about
>>>>> batch dumping in the first place and gave 'soft yes' when Stan
>>>>> mentioned it during lsf/mm/bpf uconf.
>>>>> We probably shouldn't do it.
>>>>> It feels this map_dump makes api more complex and doesn't really
>>>>> give much benefit to the user other than large map dump becomes faster.
>>>>> I think we gotta solve this problem differently.
>>>>
>>>> Some users are working around the dumping problems with the existing
>>>> api by creating a bpf_map_get_next_key_and_delete userspace function
>>>> (see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/)
>>>> which in my opinion is actually a good idea. The only problem with
>>>> that is that calling bpf_map_get_next_key(fd, key, next_key) and then
>>>> bpf_map_delete_elem(fd, key) from userspace is racing with kernel code
>>>> and it might lose some information when deleting.
>>>> We could then do map_dump_and_delete using that idea but in the kernel
>>>> where we could better handle the racing condition. In that scenario
>>>> even if we retrieve the same key it will contain different info ( the
>>>> delta between old and new value). Would that work?
>>>
>>> you mean get_next+lookup+delete at once?
>>> Sounds useful.
>>> Yonghong has been thinking about batching api as well.
>>
>> In bcc, we have many instances like this:
>>      getting all (key value) pairs, do some analysis and output,
>>      delete all keys
>>
>> The implementation typically like
>>      /* to get all (key, value) pairs */
>>      while(bpf_get_next_key() == 0)
>>        bpf_map_lookup()
>>      /* do analysis and output */
>>      for (all keys)
>>        bpf_map_delete()
> 
> If you do that in a map that is being modified while you are doing the
> analysis and output, you will lose some new data by deleting the keys,
> right?

Agreed, it is possible that if the same keys are reused to generate data
during the analysis and output period, we will miss that data by
deleting the keys. From that perspective, your above approach
       while (bpf_get_next_key())
           bpf_map_delete(prev_key)
           bpf_map_lookup()
           reset prev_key
should provide a better alternative.

> 
>> get_next+lookup+delete will be definitely useful.
>> batching will be even better to save the number of syscalls.
>>
>> An alternative is to do batch get_next+lookup and batch delete
>> to achieve similar goal as the above code.
> 
> What I mentioned above is what it makes me think that with the
> deletion it'd be better if we perform these 3 operations at once:
> get_next+lookup+delete in a jumbo/atomic command and batch them later?

Agree. This is indeed the one most useful for the bcc use case as well.

> 
>>
>> There is a minor difference between this approach
>> and the above get_next+lookup+delete.
>> During scanning the hash map, get_next+lookup may get less number
>> of elements compared to get_next+lookup+delete as the latter
>> may have more later-inserted hash elements after the operation
>> start. But both are inaccurate, so probably the difference
>> is minor.
>>
>>>
>>> I think if we cannot figure out how to make a batch of two commands
>>> get_next + lookup to work correctly then we need to identify/invent one
>>> command and make batching more generic.
>>
>> not 100% sure. It will be hard to define what is "correctly".
> 
> I agree, it'll be hard to define what is the right behavior.
> 
>> For not changing map, looping of (get_next, lookup) and batch
>> get_next+lookup should have the same results.
> 
> This is true for the api I'm presenting the only think that I was
> missing was what to do for changing maps to avoid the weird scenario
> (getting the first key due a concurrent deletion). And, in my opinion
> the way to go should be what also Willem supported: return the err to
> the caller and restart the dumping. I could do this with existing code
> just by detecting that we do provide a prev_key and got the first_key
> instead of the next_key or even implement a new function if you want
> to.

Always starting from the first key has a drawback: we keep getting
the new elements if they are constantly populated, which may skew
the results for a large hash table.

Maybe we can just do lookup+delete or batch lookup+delete?
The user passing NULL means starting the lookup/delete from the first key.
Every (batch) lookup+delete deletes one key or a set of keys.
The set of keys is retrieved using an internal get_next.
The (batch) lookup+delete returns the next available key, which
the user can use for the next (batch) lookup+delete.
If the user-provided key does not match, the user can provide a flag
to either go back to the first key or return an error.
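
From the caller's point of view, such a (batch) lookup+delete could be
used roughly as sketched below; map_batch_lookup_and_delete() and its
arguments are hypothetical, only meant to illustrate the
NULL-means-first-key and returned-resume-key semantics described above:

#include <linux/types.h>

/* Hypothetical API sketch, not an existing syscall or libbpf function:
 * looks up and deletes up to *count elements starting after in_key
 * (NULL means start from the first key), returns the key to resume from
 * in out_key, and returns < 0 when the map is exhausted or, with a
 * strict flag, when in_key is no longer present. */
int map_batch_lookup_and_delete(int map_fd, const void *in_key,
                                void *out_key, void *keys, void *values,
                                __u32 *count, __u64 flags);

static void drain_map(int map_fd)
{
    __u32 batch_keys[64], resume, *in_key = NULL;  /* NULL: first key */
    long batch_vals[64];
    __u32 count;
    int err;

    do {
        count = 64;
        err = map_batch_lookup_and_delete(map_fd, in_key, &resume,
                                          batch_keys, batch_vals,
                                          &count, 0);
        if (err)
            break;              /* exhausted (or in_key vanished) */
        /* process 'count' (key, value) pairs here */
        in_key = &resume;       /* resume from the returned key */
    } while (count);
}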


> 
>> For constant changing loops, not sure how to define which one
>> is correct. If users have concerns, they may need to just pick one
>> which gives them more comfort.
>>
>>> Like make one jumbo/compound/atomic command to be get_next+lookup+delete.
>>> Define the semantics of this single compound command.
>>> And then let batching to be a multiplier of such command.
>>> In a sense that multiplier 1 or N should be have the same way.
>>> No extra flags to alter the batching.
>>> The high level description of the batch would be:
>>> pls execute get_next,lookup,delete and repeat it N times.
>>> or
>>> pls execute get_next,lookup and repeat N times.
> 
> But any attempt to do get_next+lookup will have same problem with
> deletions right?
> 
> I don't see how we could do it more consistent than what I'm
> proposing. Let's just support one case: report an error if the
> prev_key was not found instead of retrieving the first_key. Would that
> work?
> 
>>> where each command action is defined to be composable.
>>>
>>> Just a rough idea.
>>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2019-07-27 17:55 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-24 16:57 [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
2019-07-24 16:57 ` [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bp_map_copy_value helper functions Brian Vazquez
2019-07-24 20:53   ` Song Liu
2019-07-24 16:57 ` [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Brian Vazquez
2019-07-24 19:54   ` Willem de Bruijn
2019-07-24 22:26     ` Brian Vazquez
2019-07-24 22:33       ` Willem de Bruijn
2019-07-24 21:40   ` Song Liu
2019-07-24 22:44     ` Brian Vazquez
2019-07-24 23:04       ` Song Liu
2019-07-25 23:25         ` Brian Vazquez
2019-07-25 23:54           ` Alexei Starovoitov
2019-07-26  1:02             ` Willem de Bruijn
2019-07-26  1:24             ` Brian Vazquez
2019-07-26  1:47               ` Alexei Starovoitov
2019-07-26  6:10                 ` Yonghong Song
2019-07-26 23:36                   ` Brian Vazquez
2019-07-27  0:02                     ` Jakub Kicinski
2019-07-27 17:54                     ` Yonghong Song
2019-07-24 16:58 ` [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/ Brian Vazquez
2019-07-24 21:41   ` Song Liu
2019-07-24 23:10   ` Andrii Nakryiko
2019-07-25 23:27     ` Brian Vazquez
2019-07-24 16:58 ` [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command Brian Vazquez
2019-07-24 19:51   ` Willem de Bruijn
2019-07-24 16:58 ` [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap Brian Vazquez
2019-07-24 21:58   ` Song Liu
2019-07-24 16:58 ` [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP Brian Vazquez
2019-07-24 22:01   ` Song Liu
2019-07-24 19:20 ` [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call Song Liu
2019-07-24 22:15   ` Brian Vazquez

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.