bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf] bpf: improve htab_map_get_next_key behaviour during races
@ 2019-10-18 13:43 Lorenz Bauer
  2019-10-18 16:03 ` Alexei Starovoitov
  0 siblings, 1 reply; 3+ messages in thread
From: Lorenz Bauer @ 2019-10-18 13:43 UTC (permalink / raw)
  To: bpf, ast, daniel; +Cc: joe, Lorenz Bauer

To iterate a BPF map, userspace must use MAP_GET_NEXT_KEY and provide
the last retrieved key. The code then scans the hash table bucket
for the key and returns the key of the next item.

This presents a problem if the last retrieved key isn't present in the
hash table anymore, e.g. due to concurrent deletion. It's not possible
to ascertain the location of a key in a given bucket, so there isn't
really a correct answer. The implementation currently returns the
first key in the first bucket. This guarantees that we never skip an
existing key. However, it means that a user space program iterating
a heavily modified map may never reach the end of the hash table,
forever restarting at the beginning.

Fixing this outright is rather involved. However, we can improve slightly
by never revisiting earlier buckets. Instead of the first key in the
first bucket we return the first key in the "current" bucket. This
doesn't eliminate the problem, but makes it less likely to occur.

Prior to commit 8fe45924387b ("bpf: map_get_next_key to return first key on NULL")
passing a non-existent key to MAP_GET_NEXT_KEY was the only way to
find the first key. Hence there is a small chance that there is code that
will be broken by this change.

Fixes: 8fe45924387b ("bpf: map_get_next_key to return first key on NULL")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
---
 kernel/bpf/hashtab.c                    | 4 +++-
 tools/testing/selftests/bpf/test_maps.c | 2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 22066a62c8c9..30f0dab488f0 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -613,6 +613,9 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 
 	head = select_bucket(htab, hash);
 
+	/* don't iterate previous buckets */
+	i = hash & (htab->n_buckets - 1);
+
 	/* lookup the key */
 	l = lookup_nulls_elem_raw(head, hash, key, key_size, htab->n_buckets);
 
@@ -630,7 +633,6 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 	}
 
 	/* no more elements in this hash list, go to the next bucket */
-	i = hash & (htab->n_buckets - 1);
 	i++;
 
 find_first_elem:
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 806b298397d3..6f351e532ddc 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -100,7 +100,7 @@ static void test_hashmap(unsigned int task, void *data)
 	assert(bpf_map_get_next_key(fd, NULL, &first_key) == 0 &&
 	       (first_key == 1 || first_key == 2));
 	assert(bpf_map_get_next_key(fd, &key, &next_key) == 0 &&
-	       (next_key == first_key));
+	       (next_key == 1 || next_key == 2));
 	assert(bpf_map_get_next_key(fd, &next_key, &next_key) == 0 &&
 	       (next_key == 1 || next_key == 2) &&
 	       (next_key != first_key));
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH bpf] bpf: improve htab_map_get_next_key behaviour during races
  2019-10-18 13:43 [PATCH bpf] bpf: improve htab_map_get_next_key behaviour during races Lorenz Bauer
@ 2019-10-18 16:03 ` Alexei Starovoitov
  2019-10-18 20:56   ` Yonghong Song
  0 siblings, 1 reply; 3+ messages in thread
From: Alexei Starovoitov @ 2019-10-18 16:03 UTC (permalink / raw)
  To: Lorenz Bauer; +Cc: bpf, ast, daniel, joe

On Fri, Oct 18, 2019 at 02:43:11PM +0100, Lorenz Bauer wrote:
> To iterate a BPF map, userspace must use MAP_GET_NEXT_KEY and provide
> the last retrieved key. The code then scans the hash table bucket
> for the key and returns the key of the next item.
> 
> This presents a problem if the last retrieved key isn't present in the
> hash table anymore, e.g. due to concurrent deletion. It's not possible
> to ascertain the location of a key in a given bucket, so there isn't
> really a correct answer. The implementation currently returns the
> first key in the first bucket. This guarantees that we never skip an
> existing key. However, it means that a user space program iterating
> a heavily modified map may never reach the end of the hash table,
> forever restarting at the beginning.
> 
> Fixing this outright is rather involved. However, we can improve slightly
> by never revisiting earlier buckets. Instead of the first key in the
> first bucket we return the first key in the "current" bucket. This
> doesn't eliminate the problem, but makes it less likely to occur.
> 
> Prior to commit 8fe45924387b ("bpf: map_get_next_key to return first key on NULL")
> passing a non-existent key to MAP_GET_NEXT_KEY was the only way to
> find the first key. Hence there is a small chance that there is code that
> will be broken by this change.

It is 100% chance that it will break older bcc tools that were written
before NULL was possible argument for get_next_key.
Please see Yonghong's patches for batched map lookup.
That's the proper way to solve your problem.


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH bpf] bpf: improve htab_map_get_next_key behaviour during races
  2019-10-18 16:03 ` Alexei Starovoitov
@ 2019-10-18 20:56   ` Yonghong Song
  0 siblings, 0 replies; 3+ messages in thread
From: Yonghong Song @ 2019-10-18 20:56 UTC (permalink / raw)
  To: Alexei Starovoitov, Lorenz Bauer; +Cc: bpf, ast, daniel, joe



On 10/18/19 9:03 AM, Alexei Starovoitov wrote:
> On Fri, Oct 18, 2019 at 02:43:11PM +0100, Lorenz Bauer wrote:
>> To iterate a BPF map, userspace must use MAP_GET_NEXT_KEY and provide
>> the last retrieved key. The code then scans the hash table bucket
>> for the key and returns the key of the next item.
>>
>> This presents a problem if the last retrieved key isn't present in the
>> hash table anymore, e.g. due to concurrent deletion. It's not possible
>> to ascertain the location of a key in a given bucket, so there isn't
>> really a correct answer. The implementation currently returns the
>> first key in the first bucket. This guarantees that we never skip an
>> existing key. However, it means that a user space program iterating
>> a heavily modified map may never reach the end of the hash table,
>> forever restarting at the beginning.
>>
>> Fixing this outright is rather involved. However, we can improve slightly
>> by never revisiting earlier buckets. Instead of the first key in the
>> first bucket we return the first key in the "current" bucket. This
>> doesn't eliminate the problem, but makes it less likely to occur.
>>
>> Prior to commit 8fe45924387b ("bpf: map_get_next_key to return first key on NULL")
>> passing a non-existent key to MAP_GET_NEXT_KEY was the only way to
>> find the first key. Hence there is a small chance that there is code that
>> will be broken by this change.
> 
> It is 100% chance that it will break older bcc tools that were written
> before NULL was possible argument for get_next_key.

The referenced bcc code is in below:
https://github.com/iovisor/bcc/blob/master/src/cc/libbpf.c#L301-L330

> Please see Yonghong's patches for batched map lookup.

This is my RFC patch.
https://lore.kernel.org/bpf/20190906225434.3635421-1-yhs@fb.com/T/#t

I have not got time to finish it as a proper patch set yet.
But hopefully soon can find some time to work on this.


> That's the proper way to solve your problem.
> 

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2019-10-18 20:57 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-18 13:43 [PATCH bpf] bpf: improve htab_map_get_next_key behaviour during races Lorenz Bauer
2019-10-18 16:03 ` Alexei Starovoitov
2019-10-18 20:56   ` Yonghong Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).