* [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
@ 2019-08-13 13:09 Quentin Monnet
  2019-08-13 13:09 ` [RFC bpf-next 1/3] tools: bpftool: clean up dump_map_elem() return value Quentin Monnet
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Quentin Monnet @ 2019-08-13 13:09 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet

This series adds a "bpftool map count" subcommand to count the number of
entries present in a BPF map. This results from a customer request for a
tool to count the number of entries in BPF maps used in production (for
example, to know how many free entries are left in a given map).

The first two commits actually contain some clean-up in preparation for the
new subcommand.

The third commit adds the new subcommand. Because what data should count as
an entry is not entirely clear for all map types, we actually dump several
counters, and leave it to the users to interpret the values.

Sending as an RFC because I'm looking for feedback on the approach. Is
printing several values the right thing to do? Also, note that some map
types, such as queue/stack maps, do not support any kind of counting; this
would need to be implemented in the kernel, I believe.

More generally, we have a use case where (hash) maps are under pressure
(many additions/deletions from the BPF program), and counting the entries
by iterating over the different keys is not at all reliable. Would it
make sense to add a new bpf() subcommand to count the entries on the kernel
side instead of cycling over the entries in bpftool? If so, we would need
to agree on what makes an entry for each kind of map.

Note that we are also facing similar issues when purging maps of all their
entries (deleting all entries at once). We can iterate over the keys and
delete elements one by one, but this is very inefficient when entries are
being added/removed in parallel from the BPF program, and having another
dedicated command accessible from the bpf() system call might help here as
well.

Quentin Monnet (3):
  tools: bpftool: clean up dump_map_elem() return value
  tools: bpftool: make comment more explicit for count of dumped entries
  tools: bpftool: add "bpftool map count" to count entries in map

 .../bpf/bpftool/Documentation/bpftool-map.rst |  15 +++
 tools/bpf/bpftool/bash-completion/bpftool     |   4 +-
 tools/bpf/bpftool/map.c                       | 110 ++++++++++++++++--
 3 files changed, 119 insertions(+), 10 deletions(-)

-- 
2.17.1



* [RFC bpf-next 1/3] tools: bpftool: clean up dump_map_elem() return value
  2019-08-13 13:09 [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries Quentin Monnet
@ 2019-08-13 13:09 ` Quentin Monnet
  2019-08-13 13:09 ` [RFC bpf-next 2/3] tools: bpftool: make comment more explicit for count of dumped entries Quentin Monnet
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Quentin Monnet @ 2019-08-13 13:09 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet

The code for dumping a map entry (as part of a full map dump) was moved
to a specific function dump_map_elem() in commit 18a781daa93e
("tools/bpf: bpftool, split the function do_dump()"). The "num_elems"
variable was moved into that function, incremented on success, and
returned to be immediately added to the counter in do_dump().

Returning the count of elements dumped, which is either 0 or 1, is not
really consistent with the rest of the function, especially because the
name "dump_map_elem()" does not suggest that it returns a counter.
Furthermore, the counter is not incremented when the entry is dumped in
JSON. This has no visible effect, because the number of elements
successfully dumped is not printed for JSON output.

Still, let's remove "num_elems" from the function and make it return 0
or -1 in case of success or failure, respectively. This is more correct,
and more consistent with the rest of the code.

It is unclear whether an error value should indeed be returned for maps
of maps or maps of progs, but this has no effect on the output either, so
the visible behaviour is left unchanged.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 tools/bpf/bpftool/map.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index bfbbc6b4cb83..206ee46189d9 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -686,7 +686,6 @@ static int dump_map_elem(int fd, void *key, void *value,
 			 struct bpf_map_info *map_info, struct btf *btf,
 			 json_writer_t *btf_wtr)
 {
-	int num_elems = 0;
 	int lookup_errno;
 
 	if (!bpf_map_lookup_elem(fd, key, value)) {
@@ -704,9 +703,8 @@ static int dump_map_elem(int fd, void *key, void *value,
 			} else {
 				print_entry_plain(map_info, key, value);
 			}
-			num_elems++;
 		}
-		return num_elems;
+		return 0;
 	}
 
 	/* lookup error handling */
@@ -714,7 +712,7 @@ static int dump_map_elem(int fd, void *key, void *value,
 
 	if (map_is_map_of_maps(map_info->type) ||
 	    map_is_map_of_progs(map_info->type))
-		return 0;
+		return -1;
 
 	if (json_output) {
 		jsonw_start_object(json_wtr);
@@ -738,7 +736,7 @@ static int dump_map_elem(int fd, void *key, void *value,
 				  msg ? : strerror(lookup_errno));
 	}
 
-	return 0;
+	return -1;
 }
 
 static int do_dump(int argc, char **argv)
@@ -800,7 +798,8 @@ static int do_dump(int argc, char **argv)
 				err = 0;
 			break;
 		}
-		num_elems += dump_map_elem(fd, key, value, &info, btf, btf_wtr);
+		if (!dump_map_elem(fd, key, value, &info, btf, btf_wtr))
+			num_elems++;
 		prev_key = key;
 	}
 
-- 
2.17.1



* [RFC bpf-next 2/3] tools: bpftool: make comment more explicit for count of dumped entries
  2019-08-13 13:09 [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries Quentin Monnet
  2019-08-13 13:09 ` [RFC bpf-next 1/3] tools: bpftool: clean up dump_map_elem() return value Quentin Monnet
@ 2019-08-13 13:09 ` Quentin Monnet
  2019-08-13 13:09 ` [RFC bpf-next 3/3] tools: bpftool: add "bpftool map count" to count entries in map Quentin Monnet
  2019-08-14  1:51 ` [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries Alexei Starovoitov
  3 siblings, 0 replies; 15+ messages in thread
From: Quentin Monnet @ 2019-08-13 13:09 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet

The counter printed at the end of a plain map dump does not reflect the
exact number of entries in the map, but the number of entries bpftool
managed to dump (some entries could not be read, or make no sense to
dump, such as for map-in-map).

Slightly edit the message to make this more explicit.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 tools/bpf/bpftool/map.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 206ee46189d9..cead639b3ab1 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -809,7 +809,7 @@ static int do_dump(int argc, char **argv)
 		jsonw_end_array(btf_wtr);
 		jsonw_destroy(&btf_wtr);
 	} else {
-		printf("Found %u element%s\n", num_elems,
+		printf("Found %u element%s to dump\n", num_elems,
 		       num_elems != 1 ? "s" : "");
 	}
 
-- 
2.17.1



* [RFC bpf-next 3/3] tools: bpftool: add "bpftool map count" to count entries in map
  2019-08-13 13:09 [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries Quentin Monnet
  2019-08-13 13:09 ` [RFC bpf-next 1/3] tools: bpftool: clean up dump_map_elem() return value Quentin Monnet
  2019-08-13 13:09 ` [RFC bpf-next 2/3] tools: bpftool: make comment more explicit for count of dumped entries Quentin Monnet
@ 2019-08-13 13:09 ` Quentin Monnet
  2019-08-14  1:51 ` [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries Alexei Starovoitov
  3 siblings, 0 replies; 15+ messages in thread
From: Quentin Monnet @ 2019-08-13 13:09 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet

Add a "map count" subcommand for counting the number of entries in a
map.

Because of the variety of BPF map types, it is not entirely clear what
should count as an "entry". We could count all entries for which we
have keys; but then for all array derivatives, it would simply come down
to printing the maximum number of entries (already accessible through
"bpftool map show").

Several map types set errno to ENOENT when they consider there is no
value associated with a key, so we could maybe use that... But then there
are also some map types that simply reject lookup attempts with
EOPNOTSUPP (xskmap, sock_map, sock_hash): not being able to look up a
value in such maps does not mean they hold no values.

Instead of trying to enforce a definition for a "map entry", the
approach selected in this patch is to dump several counters and let the
user decide how to interpret them. The values printed are:

  - Max number of entries
  - Number of keys found
  - Number of successful lookups
  - Number of failed lookups, broken down into the most frequent values
    for errno (ENOENT, EOPNOTSUPP, EINVAL, EPERM).

Not all possible errno values are included (ENOMEM or EFAULT, for
example); they can be added in the future if necessary.

Below is some sample output for different map types.

Array map:

        # bpftool map count id 11
        max entries:            2
        keys found:             2
        successful lookups:     2

Empty prog_array map:

        # bpftool map count id 13
        max entries:            5
        keys found:             5
        successful lookups:     0
        failed lookups: 5, of which:
          - errno set to ENOENT:        5

Empty xskmap:

        # bpftool map count id 14
        max entries:            5
        keys found:             5
        successful lookups:     0
        failed lookups: 5, of which:
          - errno set to EOPNOTSUPP:    5

JSON for the array map:

        # bpftool -j map count id 11
        {
            "max_entries": 2,
            "n_keys": 2,
            "n_lookup_success": 2,
            "lookup_failures": {
                "enoent": 0,
                "eopnotsupp": 0,
                "einval": 0,
                "eperm": 0,
            }
        }

Queue map containing 3 items:

        # bpftool map count id 12
        failed to get next key, interrupting count: Invalid argument
        max entries:            5
        keys found:             0
        successful lookups:     0

Note that counting entries for queue and stack maps is not supported
(beyond max_entries), as these types do not support cycling over the
keys.

This commit also adds relevant documentation and bash completion.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 .../bpf/bpftool/Documentation/bpftool-map.rst | 15 +++
 tools/bpf/bpftool/bash-completion/bpftool     |  4 +-
 tools/bpf/bpftool/map.c                       | 97 ++++++++++++++++++-
 3 files changed, 113 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index 61d1d270eb5e..ccc19bdd2ca3 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -25,6 +25,7 @@ MAP COMMANDS
 |	**bpftool** **map create**     *FILE* **type** *TYPE* **key** *KEY_SIZE* **value** *VALUE_SIZE* \
 |		**entries** *MAX_ENTRIES* **name** *NAME* [**flags** *FLAGS*] [**dev** *NAME*]
 |	**bpftool** **map dump**       *MAP*
+|	**bpftool** **map count**      *MAP*
 |	**bpftool** **map update**     *MAP* [**key** *DATA*] [**value** *VALUE*] [*UPDATE_FLAGS*]
 |	**bpftool** **map lookup**     *MAP* [**key** *DATA*]
 |	**bpftool** **map getnext**    *MAP* [**key** *DATA*]
@@ -67,6 +68,20 @@ DESCRIPTION
 	**bpftool map dump**    *MAP*
 		  Dump all entries in a given *MAP*.
 
+	**bpftool map count**   *MAP*
+		  Count the number of entries in a given *MAP*. Several values
+		  are printed: the maximum number of entries, the number of
+		  keys found, the number of successful lookups with those keys.
+		  The report for failed lookups is broken down to give values
+		  for the most frequent **errno** values.
+
+		  Note that the counters may not be accurate if the map is
+		  being modified (for example by a running BPF program). For
+		  example, if an element gets removed while being dumped, and
+		  then passed in as the "previous key" while cycling over map
+		  keys, the dump will restart and bpftool will count the
+		  entries multiple times.
+
 	**bpftool map update**  *MAP* [**key** *DATA*] [**value** *VALUE*] [*UPDATE_FLAGS*]
 		  Update map entry for a given *KEY*.
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index df16c5415444..764c88bfe9da 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -449,7 +449,7 @@ _bpftool()
         map)
             local MAP_TYPE='id pinned'
             case $command in
-                show|list|dump|peek|pop|dequeue)
+                show|list|dump|count|peek|pop|dequeue)
                     case $prev in
                         $command)
                             COMPREPLY=( $( compgen -W "$MAP_TYPE" -- "$cur" ) )
@@ -642,7 +642,7 @@ _bpftool()
                     [[ $prev == $object ]] && \
                         COMPREPLY=( $( compgen -W 'delete dump getnext help \
                             lookup pin event_pipe show list update create \
-                            peek push enqueue pop dequeue' -- \
+                            peek push enqueue pop dequeue count' -- \
                             "$cur" ) )
                     ;;
             esac
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index cead639b3ab1..918d08d1676e 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -822,6 +822,98 @@ static int do_dump(int argc, char **argv)
 	return err;
 }
 
+static int do_count(int argc, char **argv)
+{
+	unsigned int num_keys = 0, num_lookups = 0;
+	unsigned int err_cnts[1024] = {};
+	struct bpf_map_info info = {};
+	void *key, *value, *prev_key;
+	__u32 len = sizeof(info);
+	int err, fd;
+
+	if (!REQ_ARGS(2))
+		return -1;
+
+	fd = map_parse_fd_and_info(&argc, &argv, &info, &len);
+	if (fd < 0)
+		return -1;
+
+	key = malloc(info.key_size);
+	value = alloc_value(&info);
+	if (!key || !value) {
+		p_err("mem alloc failed");
+		err = -1;
+		goto exit_free;
+	}
+
+	prev_key = NULL;
+	while (true) {
+		int res;
+
+		err = bpf_map_get_next_key(fd, prev_key, key);
+		if (err) {
+			if (errno == ENOENT)
+				err = 0;
+			else
+				p_info("failed to get next key, interrupting count: %s",
+				       strerror(errno));
+			break;
+		}
+
+		num_keys++;
+		res = bpf_map_lookup_elem(fd, key, value);
+		if (res) {
+			if (errno < (int)ARRAY_SIZE(err_cnts))
+				err_cnts[errno]++;
+		} else {
+			num_lookups++;
+		}
+		prev_key = key;
+	}
+
+	if (json_output) {
+		jsonw_start_object(json_wtr);	/* root */
+		jsonw_uint_field(json_wtr, "max_entries", info.max_entries);
+		jsonw_uint_field(json_wtr, "n_keys", num_keys);
+		jsonw_uint_field(json_wtr, "n_lookup_success", num_lookups);
+		jsonw_name(json_wtr, "lookup_failures");
+		jsonw_start_object(json_wtr);	/* lookup_failures */
+		jsonw_uint_field(json_wtr, "enoent", err_cnts[ENOENT]);
+		jsonw_uint_field(json_wtr, "eopnotsupp", err_cnts[EOPNOTSUPP]);
+		jsonw_uint_field(json_wtr, "einval", err_cnts[EINVAL]);
+		jsonw_uint_field(json_wtr, "eperm", err_cnts[EPERM]);
+		jsonw_end_object(json_wtr);	/* lookup_failures */
+		jsonw_end_object(json_wtr);	/* root */
+	} else {
+		printf("max entries:\t\t%u\n", info.max_entries);
+		printf("keys found:\t\t%u\n", num_keys);
+		printf("successful lookups:\t%u\n", num_lookups);
+		if (num_lookups != num_keys) {
+			printf("failed lookups:\t%u, of which:\n",
+			       num_keys - num_lookups);
+			if (err_cnts[ENOENT])
+				printf("  - errno set to ENOENT:\t%u\n",
+				       err_cnts[ENOENT]);
+			if (err_cnts[EOPNOTSUPP])
+				printf("  - errno set to EOPNOTSUPP:\t%u\n",
+				       err_cnts[EOPNOTSUPP]);
+			if (err_cnts[EINVAL])
+				printf("  - errno set to EINVAL:\t%u\n",
+				       err_cnts[EINVAL]);
+			if (err_cnts[EPERM])
+				printf("  - errno set to EPERM:\t\t%u\n",
+				       err_cnts[EPERM]);
+		}
+	}
+
+exit_free:
+	free(key);
+	free(value);
+	close(fd);
+
+	return err;
+}
+
 static int alloc_key_value(struct bpf_map_info *info, void **key, void **value)
 {
 	*key = NULL;
@@ -1250,6 +1342,7 @@ static int do_help(int argc, char **argv)
 		"                              entries MAX_ENTRIES name NAME [flags FLAGS] \\\n"
 		"                              [dev NAME]\n"
 		"       %s %s dump       MAP\n"
+		"       %s %s count      MAP\n"
 		"       %s %s update     MAP [key DATA] [value VALUE] [UPDATE_FLAGS]\n"
 		"       %s %s lookup     MAP [key DATA]\n"
 		"       %s %s getnext    MAP [key DATA]\n"
@@ -1279,7 +1372,8 @@ static int do_help(int argc, char **argv)
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
-		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
+		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
+		bin_name, argv[-2]);
 
 	return 0;
 }
@@ -1301,6 +1395,7 @@ static const struct cmd cmds[] = {
 	{ "enqueue",	do_update },
 	{ "pop",	do_pop_dequeue },
 	{ "dequeue",	do_pop_dequeue },
+	{ "count",	do_count },
 	{ 0 }
 };
 
-- 
2.17.1



* Re: [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
  2019-08-13 13:09 [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries Quentin Monnet
                   ` (2 preceding siblings ...)
  2019-08-13 13:09 ` [RFC bpf-next 3/3] tools: bpftool: add "bpftool map count" to count entries in map Quentin Monnet
@ 2019-08-14  1:51 ` Alexei Starovoitov
  2019-08-14  9:42   ` Quentin Monnet
  3 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2019-08-14  1:51 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, netdev, oss-drivers

On Tue, Aug 13, 2019 at 02:09:18PM +0100, Quentin Monnet wrote:
> This series adds a "bpftool map count" subcommand to count the number of
> entries present in a BPF map. This results from a customer request for a
> tool to count the number of entries in BPF maps used in production (for
> example, to know how many free entries are left in a given map).
> 
> The first two commits actually contain some clean-up in preparation for the
> new subcommand.
> 
> The third commit adds the new subcommand. Because what data should count as
> an entry is not entirely clear for all map types, we actually dump several
> counters, and leave it to the users to interpret the values.
> 
> Sending as a RFC because I'm looking for feedback on the approach. Is
> printing several values the good thing to do? Also, note that some map
> types such as queue/stack maps do not support any type of counting, this
> would need to be implemented in the kernel I believe.
> 
> More generally, we have a use case where (hash) maps are under pressure
> (many additions/deletions from the BPF program), and counting the entries
> by iterating other the different keys is not at all reliable. Would that
> make sense to add a new bpf() subcommand to count the entries on the kernel
> side instead of cycling over the entries in bpftool? If so, we would need
> to agree on what makes an entry for each kind of map.

I don't mind new bpftool sub-command, but against adding kernel interface.
Can you elaborate what is the actual use case?
The same can be achieved by 'bpftool map dump|grep key|wc -l', no?

> Note that we are also facing similar issues for purging map from their
> entries (deleting all entries at once). We can iterate on the keys and
> delete elements one by one, but this is very inefficient when entries are
> being added/removed in parallel from the BPF program, and having another
> dedicated command accessible from the bpf() system call might help here as
> well.

I think that fits into the batch processing of map commands discussion.



* Re: [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
  2019-08-14  1:51 ` [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries Alexei Starovoitov
@ 2019-08-14  9:42   ` Quentin Monnet
  2019-08-14 16:45     ` Edward Cree
  0 siblings, 1 reply; 15+ messages in thread
From: Quentin Monnet @ 2019-08-14  9:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, netdev, oss-drivers

2019-08-13 18:51 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Tue, Aug 13, 2019 at 02:09:18PM +0100, Quentin Monnet wrote:
>> This series adds a "bpftool map count" subcommand to count the number of
>> entries present in a BPF map. This results from a customer request for a
>> tool to count the number of entries in BPF maps used in production (for
>> example, to know how many free entries are left in a given map).
>>
>> The first two commits actually contain some clean-up in preparation for the
>> new subcommand.
>>
>> The third commit adds the new subcommand. Because what data should count as
>> an entry is not entirely clear for all map types, we actually dump several
>> counters, and leave it to the users to interpret the values.
>>
>> Sending as a RFC because I'm looking for feedback on the approach. Is
>> printing several values the good thing to do? Also, note that some map
>> types such as queue/stack maps do not support any type of counting, this
>> would need to be implemented in the kernel I believe.
>>
>> More generally, we have a use case where (hash) maps are under pressure
>> (many additions/deletions from the BPF program), and counting the entries
>> by iterating other the different keys is not at all reliable. Would that
>> make sense to add a new bpf() subcommand to count the entries on the kernel
>> side instead of cycling over the entries in bpftool? If so, we would need
>> to agree on what makes an entry for each kind of map.
> 
> I don't mind new bpftool sub-command, but against adding kernel interface.
> Can you elaborate what is the actual use case?

Hi Alexei, thanks for your feedback.

The use case is a network processing application (close to a NAT), where
a hash map is used to keep track of flows, many of them being
short-lived. The BPF program spends a good chunk of time adding and
deleting entries to/from the map. The overall size (number of entries)
increases slowly, and when it grows past a certain threshold some action
must be taken (some flows are deleted from user space, possibly copied
to another map or whatever) to ensure we still have some room for new
incoming flows.

> The same can be achieved by 'bpftool map dump|grep key|wc -l', no?

To some extent (with subtleties for some other map types); and we use a
similar command line as a workaround for now. But because of the rate of
inserts/deletes in the map, the process often reports a number higher
than the max number of entries (we observed up to ~750k when max_entries
is 500k), even if the map is only half-full on average during the count.
In the worst case (though not frequent), an entry is deleted just before
we get the next key from it, and iteration starts all over again. This
is not a reliable way to determine how much space is left in the map.

I cannot see a solution that would provide a more accurate count from
user space, when the map is under pressure?

> 
>> Note that we are also facing similar issues for purging map from their
>> entries (deleting all entries at once). We can iterate on the keys and
>> delete elements one by one, but this is very inefficient when entries are
>> being added/removed in parallel from the BPF program, and having another
>> dedicated command accessible from the bpf() system call might help here as
>> well.
> 
> I think that fits into the batch processing of map commands discussion.
> 

This is also what we do at the moment, but we hit similar limitations
when iterating over the keys.

Thanks,
Quentin


* Re: [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
  2019-08-14  9:42   ` Quentin Monnet
@ 2019-08-14 16:45     ` Edward Cree
  2019-08-14 16:58       ` Alexei Starovoitov
  2019-08-14 16:58       ` Quentin Monnet
  0 siblings, 2 replies; 15+ messages in thread
From: Edward Cree @ 2019-08-14 16:45 UTC (permalink / raw)
  To: Quentin Monnet, Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, netdev, oss-drivers

On 14/08/2019 10:42, Quentin Monnet wrote:
> 2019-08-13 18:51 UTC-0700 ~ Alexei Starovoitov
> <alexei.starovoitov@gmail.com>
>> The same can be achieved by 'bpftool map dump|grep key|wc -l', no?
> To some extent (with subtleties for some other map types); and we use a
> similar command line as a workaround for now. But because of the rate of
> inserts/deletes in the map, the process often reports a number higher
> than the max number of entries (we observed up to ~750k when max_entries
> is 500k), even is the map is only half-full on average during the count.
> On the worst case (though not frequent), an entry is deleted just before
> we get the next key from it, and iteration starts all over again. This
> is not reliable to determine how much space is left in the map.
>
> I cannot see a solution that would provide a more accurate count from
> user space, when the map is under pressure?
This might be a really dumb suggestion, but: you're wanting to collect a
 summary statistic over an in-kernel data structure in a single syscall,
 because making a series of syscalls to examine every entry is slow and
 racy.  Isn't that exactly a job for an in-kernel virtual machine, and
 could you not supply an eBPF program which the kernel runs on each entry
 in the map, thus supporting people who want to calculate something else
 (mean, min and max, whatever) instead of count?


* Re: [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
  2019-08-14 16:45     ` Edward Cree
@ 2019-08-14 16:58       ` Alexei Starovoitov
  2019-08-14 17:12         ` Quentin Monnet
  2019-08-14 16:58       ` Quentin Monnet
  1 sibling, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2019-08-14 16:58 UTC (permalink / raw)
  To: Edward Cree
  Cc: Quentin Monnet, Alexei Starovoitov, Daniel Borkmann, bpf,
	Network Development, oss-drivers

On Wed, Aug 14, 2019 at 9:45 AM Edward Cree <ecree@solarflare.com> wrote:
>
> On 14/08/2019 10:42, Quentin Monnet wrote:
> > 2019-08-13 18:51 UTC-0700 ~ Alexei Starovoitov
> > <alexei.starovoitov@gmail.com>
> >> The same can be achieved by 'bpftool map dump|grep key|wc -l', no?
> > To some extent (with subtleties for some other map types); and we use a
> > similar command line as a workaround for now. But because of the rate of
> > inserts/deletes in the map, the process often reports a number higher
> > than the max number of entries (we observed up to ~750k when max_entries
> > is 500k), even is the map is only half-full on average during the count.
> > On the worst case (though not frequent), an entry is deleted just before
> > we get the next key from it, and iteration starts all over again. This
> > is not reliable to determine how much space is left in the map.
> >
> > I cannot see a solution that would provide a more accurate count from
> > user space, when the map is under pressure?
> This might be a really dumb suggestion, but: you're wanting to collect a
>  summary statistic over an in-kernel data structure in a single syscall,
>  because making a series of syscalls to examine every entry is slow and
>  racy.  Isn't that exactly a job for an in-kernel virtual machine, and
>  could you not supply an eBPF program which the kernel runs on each entry
>  in the map, thus supporting people who want to calculate something else
>  (mean, min and max, whatever) instead of count?

Pretty much my suggestion as well :)

It seems the better fix for your nat threshold is to keep count of
elements in the map in a separate global variable that
bpf program manually increments and decrements.
bpftool will dump it just as regular map of single element.
(I believe it doesn't recognize global variables properly yet)
and BTF will be there to pick exactly that 'count' variable.
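
A minimal sketch of that approach, assuming a clang/libbpf combination
recent enough for BTF-defined maps and global data (the "flows" map and
its key/value types below are hypothetical placeholders, not code from
this series):

/* Sketch only: keep the element count in a global variable that is
 * updated next to the hash map operations. Names and sizes are
 * hypothetical.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_key { __u32 saddr, daddr; __u16 sport, dport; };
struct flow_val { __u64 packets; };

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 500000);
	__type(key, struct flow_key);
	__type(value, struct flow_val);
} flows SEC(".maps");

__u64 flow_count;	/* lands in the .bss map, readable from user space */

SEC("xdp")
int track_flows(struct xdp_md *ctx)
{
	struct flow_key key = {};	/* filled from packet headers in a real program */
	struct flow_val val = { .packets = 1 };

	/* BPF_NOEXIST: only count entries that are actually added. */
	if (!bpf_map_update_elem(&flows, &key, &val, BPF_NOEXIST))
		__sync_fetch_and_add(&flow_count, 1);

	/* On flow teardown, the symmetric operation would be:
	 *   if (!bpf_map_delete_elem(&flows, &key))
	 *           __sync_fetch_and_add(&flow_count, -1);
	 */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";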


* Re: [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
  2019-08-14 16:45     ` Edward Cree
  2019-08-14 16:58       ` Alexei Starovoitov
@ 2019-08-14 16:58       ` Quentin Monnet
  2019-08-14 17:14         ` Edward Cree
  1 sibling, 1 reply; 15+ messages in thread
From: Quentin Monnet @ 2019-08-14 16:58 UTC (permalink / raw)
  To: Edward Cree, Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, netdev, oss-drivers

2019-08-14 17:45 UTC+0100 ~ Edward Cree <ecree@solarflare.com>
> On 14/08/2019 10:42, Quentin Monnet wrote:
>> 2019-08-13 18:51 UTC-0700 ~ Alexei Starovoitov
>> <alexei.starovoitov@gmail.com>
>>> The same can be achieved by 'bpftool map dump|grep key|wc -l', no?
>> To some extent (with subtleties for some other map types); and we use a
>> similar command line as a workaround for now. But because of the rate of
>> inserts/deletes in the map, the process often reports a number higher
>> than the max number of entries (we observed up to ~750k when max_entries
>> is 500k), even is the map is only half-full on average during the count.
>> On the worst case (though not frequent), an entry is deleted just before
>> we get the next key from it, and iteration starts all over again. This
>> is not reliable to determine how much space is left in the map.
>>
>> I cannot see a solution that would provide a more accurate count from
>> user space, when the map is under pressure?
> This might be a really dumb suggestion, but: you're wanting to collect a
>  summary statistic over an in-kernel data structure in a single syscall,
>  because making a series of syscalls to examine every entry is slow and
>  racy.  Isn't that exactly a job for an in-kernel virtual machine, and
>  could you not supply an eBPF program which the kernel runs on each entry
>  in the map, thus supporting people who want to calculate something else
>  (mean, min and max, whatever) instead of count?
> 

Hi Edward, I like the approach, thanks for the suggestion.

But I did not mention that we were using offloaded maps: Tracing the
kernel would probably work for programs running on the host, but this is
not a solution we could extend to hardware offload.

Best regards,
Quentin


* Re: [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
  2019-08-14 16:58       ` Alexei Starovoitov
@ 2019-08-14 17:12         ` Quentin Monnet
  2019-08-14 20:18           ` Andrii Nakryiko
  0 siblings, 1 reply; 15+ messages in thread
From: Quentin Monnet @ 2019-08-14 17:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Edward Cree
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Network Development,
	oss-drivers

2019-08-14 09:58 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Wed, Aug 14, 2019 at 9:45 AM Edward Cree <ecree@solarflare.com> wrote:
>>
>> On 14/08/2019 10:42, Quentin Monnet wrote:
>>> 2019-08-13 18:51 UTC-0700 ~ Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com>
>>>> The same can be achieved by 'bpftool map dump|grep key|wc -l', no?
>>> To some extent (with subtleties for some other map types); and we use a
>>> similar command line as a workaround for now. But because of the rate of
>>> inserts/deletes in the map, the process often reports a number higher
>>> than the max number of entries (we observed up to ~750k when max_entries
>>> is 500k), even is the map is only half-full on average during the count.
>>> On the worst case (though not frequent), an entry is deleted just before
>>> we get the next key from it, and iteration starts all over again. This
>>> is not reliable to determine how much space is left in the map.
>>>
>>> I cannot see a solution that would provide a more accurate count from
>>> user space, when the map is under pressure?
>> This might be a really dumb suggestion, but: you're wanting to collect a
>>  summary statistic over an in-kernel data structure in a single syscall,
>>  because making a series of syscalls to examine every entry is slow and
>>  racy.  Isn't that exactly a job for an in-kernel virtual machine, and
>>  could you not supply an eBPF program which the kernel runs on each entry
>>  in the map, thus supporting people who want to calculate something else
>>  (mean, min and max, whatever) instead of count?
> 
> Pretty much my suggestion as well :)
> 
> It seems the better fix for your nat threshold is to keep count of
> elements in the map in a separate global variable that
> bpf program manually increments and decrements.
> bpftool will dump it just as regular map of single element.
> (I believe it doesn't recognize global variables properly yet)
> and BTF will be there to pick exactly that 'count' variable.
> 

It would be with an offloaded map, but yes, I suppose we could keep
track of the numbers in a separate map. We'll have a look into this.

Thanks to both of you for the suggestions.
Quentin


* Re: [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
  2019-08-14 16:58       ` Quentin Monnet
@ 2019-08-14 17:14         ` Edward Cree
  2019-08-15 14:15           ` Quentin Monnet
  0 siblings, 1 reply; 15+ messages in thread
From: Edward Cree @ 2019-08-14 17:14 UTC (permalink / raw)
  To: Quentin Monnet, Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, netdev, oss-drivers

On 14/08/2019 17:58, Quentin Monnet wrote:
> 2019-08-14 17:45 UTC+0100 ~ Edward Cree <ecree@solarflare.com>
>> This might be a really dumb suggestion, but: you're wanting to collect a
>>  summary statistic over an in-kernel data structure in a single syscall,
>>  because making a series of syscalls to examine every entry is slow and
>>  racy.  Isn't that exactly a job for an in-kernel virtual machine, and
>>  could you not supply an eBPF program which the kernel runs on each entry
>>  in the map, thus supporting people who want to calculate something else
>>  (mean, min and max, whatever) instead of count?
>>
> Hi Edward, I like the approach, thanks for the suggestion.
>
> But I did not mention that we were using offloaded maps: Tracing the
> kernel would probably work for programs running on the host, but this is
> not a solution we could extend to hardware offload.
I don't see where "tracing" comes into it; this is a new program type and
 a new map op under the bpf() syscall.
Could the user-supplied BPF program not then be passed down to the device
 for it to run against its offloaded maps?


* Re: [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
  2019-08-14 17:12         ` Quentin Monnet
@ 2019-08-14 20:18           ` Andrii Nakryiko
  2019-08-15 14:02             ` Quentin Monnet
  0 siblings, 1 reply; 15+ messages in thread
From: Andrii Nakryiko @ 2019-08-14 20:18 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Alexei Starovoitov, Edward Cree, Alexei Starovoitov,
	Daniel Borkmann, bpf, Network Development, oss-drivers

On Wed, Aug 14, 2019 at 10:12 AM Quentin Monnet
<quentin.monnet@netronome.com> wrote:
>
> 2019-08-14 09:58 UTC-0700 ~ Alexei Starovoitov
> <alexei.starovoitov@gmail.com>
> > On Wed, Aug 14, 2019 at 9:45 AM Edward Cree <ecree@solarflare.com> wrote:
> >>
> >> On 14/08/2019 10:42, Quentin Monnet wrote:
> >>> 2019-08-13 18:51 UTC-0700 ~ Alexei Starovoitov
> >>> <alexei.starovoitov@gmail.com>
> >>>> The same can be achieved by 'bpftool map dump|grep key|wc -l', no?
> >>> To some extent (with subtleties for some other map types); and we use a
> >>> similar command line as a workaround for now. But because of the rate of
> >>> inserts/deletes in the map, the process often reports a number higher
> >>> than the max number of entries (we observed up to ~750k when max_entries
> >>> is 500k), even is the map is only half-full on average during the count.
> >>> On the worst case (though not frequent), an entry is deleted just before
> >>> we get the next key from it, and iteration starts all over again. This
> >>> is not reliable to determine how much space is left in the map.
> >>>
> >>> I cannot see a solution that would provide a more accurate count from
> >>> user space, when the map is under pressure?
> >> This might be a really dumb suggestion, but: you're wanting to collect a
> >>  summary statistic over an in-kernel data structure in a single syscall,
> >>  because making a series of syscalls to examine every entry is slow and
> >>  racy.  Isn't that exactly a job for an in-kernel virtual machine, and
> >>  could you not supply an eBPF program which the kernel runs on each entry
> >>  in the map, thus supporting people who want to calculate something else
> >>  (mean, min and max, whatever) instead of count?
> >
> > Pretty much my suggestion as well :)

I also support the suggestion to count it from BPF side. It's flexible
and powerful approach and doesn't require adding more and more nuanced
sub-APIs to kernel to support subset of bulk operations on map
(subset, because we'll expose count, but what about, e.g., p50, etc,
there will always be something more that someone will want and it just
doesn't scale).

> >
> > It seems the better fix for your nat threshold is to keep count of
> > elements in the map in a separate global variable that
> > bpf program manually increments and decrements.
> > bpftool will dump it just as regular map of single element.
> > (I believe it doesn't recognize global variables properly yet)
> > and BTF will be there to pick exactly that 'count' variable.
> >
>
> It would be with an offloaded map, but yes, I suppose we could keep
> track of the numbers in a separate map. We'll have a look into this.

See if you can use a global variable, that way you completely
eliminate any overhead from BPF side of things, except for atomic
increment.
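
Once the count lives in a global variable, the user-space side also
becomes a single memory read rather than a walk over the map; a sketch,
reusing the hypothetical names above and assuming the BPF skeleton
support ("bpftool gen skeleton") that libbpf gained later:

/* Sketch: read the global counter from user space via a generated
 * skeleton. "flows.skel.h" and flow_count are hypothetical names.
 */
#include <stdio.h>
#include "flows.skel.h"

int main(void)
{
	struct flows *skel;

	skel = flows__open_and_load();
	if (!skel)
		return 1;

	/* ... attach the programs, let traffic flow ... */

	/* Global variables sit in the mmap-ed .bss map, so this read
	 * reflects the value maintained by the BPF program.
	 */
	printf("flows currently in map: %llu\n", skel->bss->flow_count);

	flows__destroy(skel);
	return 0;
}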

>
> Thanks to both of you for the suggestions.
> Quentin


* Re: [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
  2019-08-14 20:18           ` Andrii Nakryiko
@ 2019-08-15 14:02             ` Quentin Monnet
  0 siblings, 0 replies; 15+ messages in thread
From: Quentin Monnet @ 2019-08-15 14:02 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Edward Cree, Alexei Starovoitov,
	Daniel Borkmann, bpf, Network Development, oss-drivers

2019-08-14 13:18 UTC-0700 ~ Andrii Nakryiko <andrii.nakryiko@gmail.com>
> On Wed, Aug 14, 2019 at 10:12 AM Quentin Monnet
> <quentin.monnet@netronome.com> wrote:
>>
>> 2019-08-14 09:58 UTC-0700 ~ Alexei Starovoitov
>> <alexei.starovoitov@gmail.com>
>>> On Wed, Aug 14, 2019 at 9:45 AM Edward Cree <ecree@solarflare.com> wrote:
>>>>
>>>> On 14/08/2019 10:42, Quentin Monnet wrote:
>>>>> 2019-08-13 18:51 UTC-0700 ~ Alexei Starovoitov
>>>>> <alexei.starovoitov@gmail.com>
>>>>>> The same can be achieved by 'bpftool map dump|grep key|wc -l', no?
>>>>> To some extent (with subtleties for some other map types); and we use a
>>>>> similar command line as a workaround for now. But because of the rate of
>>>>> inserts/deletes in the map, the process often reports a number higher
>>>>> than the max number of entries (we observed up to ~750k when max_entries
>>>>> is 500k), even is the map is only half-full on average during the count.
>>>>> On the worst case (though not frequent), an entry is deleted just before
>>>>> we get the next key from it, and iteration starts all over again. This
>>>>> is not reliable to determine how much space is left in the map.
>>>>>
>>>>> I cannot see a solution that would provide a more accurate count from
>>>>> user space, when the map is under pressure?
>>>> This might be a really dumb suggestion, but: you're wanting to collect a
>>>>  summary statistic over an in-kernel data structure in a single syscall,
>>>>  because making a series of syscalls to examine every entry is slow and
>>>>  racy.  Isn't that exactly a job for an in-kernel virtual machine, and
>>>>  could you not supply an eBPF program which the kernel runs on each entry
>>>>  in the map, thus supporting people who want to calculate something else
>>>>  (mean, min and max, whatever) instead of count?
>>>
>>> Pretty much my suggestion as well :)
> 
> I also support the suggestion to count it from BPF side. It's flexible
> and powerful approach and doesn't require adding more and more nuanced
> sub-APIs to kernel to support subset of bulk operations on map
> (subset, because we'll expose count, but what about, e.g., p50, etc,
> there will always be something more that someone will want and it just
> doesn't scale).

Hi Andrii,
Yes, that makes sense.

> 
>>>
>>> It seems the better fix for your nat threshold is to keep count of
>>> elements in the map in a separate global variable that
>>> bpf program manually increments and decrements.
>>> bpftool will dump it just as regular map of single element.
>>> (I believe it doesn't recognize global variables properly yet)
>>> and BTF will be there to pick exactly that 'count' variable.
>>>
>>
>> It would be with an offloaded map, but yes, I suppose we could keep
>> track of the numbers in a separate map. We'll have a look into this.
> 
> See if you can use a global variable, that way you completely
> eliminate any overhead from BPF side of things, except for atomic
> increment.

Offloaded maps do not implement the map_direct_value_addr() operation,
so global variables are not supported at the moment. I need to dive
deeper into this and see what is required to add that support.

Thanks for your advice!
Quentin


* Re: [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
  2019-08-14 17:14         ` Edward Cree
@ 2019-08-15 14:15           ` Quentin Monnet
  2019-08-16 18:13             ` Edward Cree
  0 siblings, 1 reply; 15+ messages in thread
From: Quentin Monnet @ 2019-08-15 14:15 UTC (permalink / raw)
  To: Edward Cree, Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, netdev, oss-drivers

2019-08-14 18:14 UTC+0100 ~ Edward Cree <ecree@solarflare.com>
> On 14/08/2019 17:58, Quentin Monnet wrote:
>> 2019-08-14 17:45 UTC+0100 ~ Edward Cree <ecree@solarflare.com>
>>> This might be a really dumb suggestion, but: you're wanting to collect a
>>>  summary statistic over an in-kernel data structure in a single syscall,
>>>  because making a series of syscalls to examine every entry is slow and
>>>  racy.  Isn't that exactly a job for an in-kernel virtual machine, and
>>>  could you not supply an eBPF program which the kernel runs on each entry
>>>  in the map, thus supporting people who want to calculate something else
>>>  (mean, min and max, whatever) instead of count?
>>>
>> Hi Edward, I like the approach, thanks for the suggestion.
>>
>> But I did not mention that we were using offloaded maps: Tracing the
>> kernel would probably work for programs running on the host, but this is
>> not a solution we could extend to hardware offload.
> I don't see where "tracing" comes into it; this is a new program type and
>  a new map op under the bpf() syscall.
> Could the user-supplied BPF program not then be passed down to the device
>  for it to run against its offloaded maps?
> 

Sorry, I misunderstood your suggestion :s (I thought you meant tracing
the insert and delete operations).

So if I understand correctly, we would use the bpf() syscall to trigger
a run of such a program on all map entries (for maps implementing the new
operation), and the context would include pointers to the key and the
value of the entry being processed, so we could count/sum/compute an
average of the values or do any other kind of processing?
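
Something along the lines of the sketch below, perhaps? (This assumes an
interface of the kind that later materialised upstream as the
bpf_for_each_map_elem() helper and BPF_PROG_TYPE_SYSCALL programs run via
BPF_PROG_TEST_RUN; the "flows" map and its types are hypothetical.)

/* Sketch only: a syscall-triggered program that iterates over all map
 * entries in the kernel and aggregates a statistic (here, a count).
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct bpf_map;		/* opaque to the program, only passed to the callback */

struct flow_key { __u32 saddr, daddr; __u16 sport, dport; };
struct flow_val { __u64 packets; };

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 500000);
	__type(key, struct flow_key);
	__type(value, struct flow_val);
} flows SEC(".maps");

__u64 nr_entries;	/* result, read back from user space after the run */

static long count_elem(struct bpf_map *map, struct flow_key *key,
		       struct flow_val *val, void *ctx)
{
	nr_entries++;
	return 0;	/* 0: keep iterating, 1: stop */
}

SEC("syscall")
int count_entries(void *ctx)
{
	nr_entries = 0;
	bpf_for_each_map_elem(&flows, count_elem, NULL, 0);
	return 0;
}

char _license[] SEC("license") = "GPL";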

Thanks,
Quentin


* Re: [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
  2019-08-15 14:15           ` Quentin Monnet
@ 2019-08-16 18:13             ` Edward Cree
  0 siblings, 0 replies; 15+ messages in thread
From: Edward Cree @ 2019-08-16 18:13 UTC (permalink / raw)
  To: Quentin Monnet, Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, netdev, oss-drivers

On 15/08/2019 15:15, Quentin Monnet wrote:
> So if I understand correctly, we would use the bpf() syscall to trigger
> a run of such program on all map entries (for map implementing the new
> operation), and the context would include pointers to the key and the
> value for the entry being processed so we can count/sum/compute an
> average of the values or any other kind of processing?
Yep, that's pretty much exactly what I had in mind.

-Ed

