Re: [PATCH V2 bpf-next] bpf: make __htab_lookup_and_delete_batch faster when map is almost empty

From: Yonghong Song <yhs@fb.com>
To: Brian Vazquez <brianvv@google.com>,
	Brian Vazquez <brianvv.kernel@gmail.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"David S . Miller" <davem@davemloft.net>
Cc: <linux-kernel@vger.kernel.org>, <netdev@vger.kernel.org>,
	<bpf@vger.kernel.org>, Luigi Rizzo <lrizzo@google.com>
Subject: Re: [PATCH V2 bpf-next] bpf: make __htab_lookup_and_delete_batch faster when map is almost empty
Date: Sat, 1 Aug 2020 22:23:57 -0700	[thread overview]
Message-ID: <e0caa616-ea4e-82b9-6fae-6f7318f88ab3@fb.com> (raw)
In-Reply-To: <20200801180927.1003340-1-brianvv@google.com>

On 8/1/20 11:09 AM, Brian Vazquez wrote:
> While running some experiments it was observed that map_lookup_batch was 2x
> slower than get_next_key + lookup when the syscall overhead is minimal.
> This was because the map_lookup_batch implementation was more expensive
> traversing empty buckets, this can be really costly when the pre-allocated
> map is too big.
> 
> This patch optimizes the case when the bucket is empty so we can move
> quickly to next bucket.
> 
> The Benchmark was generated using the google/benchmark library[1]. When
> the benckmark is executed the number of iterations is governed by the
> amount of time the benckmarks takes, the number of iterations is at
> least 1 and not more than 1e9, until CPU time(of the entire binary, not
> just the part to measure), is greater than 0.5s. Time and CPU reported
> are the average of a single iteration over the iteration runs.
> 
> The experiments to exercise the empty buckets are as follows:
> 
> -The map was populated with a single entry to make sure that the syscall
> overhead is not helping the map_batch_lookup.
> -The size of the preallocated map was increased to show the effect of
> traversing empty buckets.
> 
> To interpret the results, Benchmark is the name of the experiment where
> the first number correspond to the number of elements in the map, and
> the next one correspond to the size of the pre-allocated map. Time and
> CPU are average and correspond to the time elapsed per iteration and the
> system time consumtion per iteration.

thanks for explanation!

> 
> Results:
> 
>    Using get_next_key + lookup:
> 
>    Benchmark                Time(ns)        CPU(ns)     Iteration
>    ---------------------------------------------------------------
>    BM_DumpHashMap/1/1k          3593           3586         192680
>    BM_DumpHashMap/1/4k          6004           5972         100000
>    BM_DumpHashMap/1/16k        15755          15710          44341
>    BM_DumpHashMap/1/64k        59525          59376          10000
> 
>    Using htab_lookup_batch before this patch:
>    Benchmark                Time(ns)        CPU(ns)     Iterations
>    ---------------------------------------------------------------
>    BM_DumpHashMap/1/1k          3933           3927         177978
>    BM_DumpHashMap/1/4k          9192           9177          73951
>    BM_DumpHashMap/1/16k        42011          41970          16789
>    BM_DumpHashMap/1/64k       117895         117661           6135
> 
>    Using htab_lookup_batch with this patch:
>    Benchmark                Time(ns)        CPU(ns)     Iterations
>    ---------------------------------------------------------------
>    BM_DumpHashMap/1/1k          2809           2803         249212
>    BM_DumpHashMap/1/4k          5318           5316         100000
>    BM_DumpHashMap/1/16k        14925          14895          47448
>    BM_DumpHashMap/1/64k        58870          58674          10000
> 
> [1] https://github.com/google/benchmark.git
> 
> Changelog:
> 
> v1 -> v2:
>   - Add more information about how to interpret the results
> 
> Suggested-by: Luigi Rizzo <lrizzo@google.com>
> Cc: Yonghong Song <yhs@fb.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>   kernel/bpf/hashtab.c | 23 ++++++++---------------
>   1 file changed, 8 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 024276787055..b6d28bd6345b 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1349,7 +1349,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
>   	struct hlist_nulls_head *head;
>   	struct hlist_nulls_node *n;
>   	unsigned long flags = 0;
> -	bool locked = false;
>   	struct htab_elem *l;
>   	struct bucket *b;
>   	int ret = 0;
> @@ -1408,19 +1407,19 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
>   	dst_val = values;
>   	b = &htab->buckets[batch];
>   	head = &b->head;
> -	/* do not grab the lock unless need it (bucket_cnt > 0). */
> -	if (locked)
> -		flags = htab_lock_bucket(htab, b);
>   
> +	l = hlist_nulls_entry_safe(rcu_dereference_raw(hlist_nulls_first_rcu(head)),
> +					struct htab_elem, hash_node);
> +	if (!l && (batch + 1 < htab->n_buckets)) {
> +		batch++;
> +		goto again_nocopy;
> +	}

In this case, if batch + 1 == htab->n_buckets, we still go through
htab_lock_bucket/htab_unlock_bucket which is really not needed.

So since we are trying to optimize for performance, let us handle
the above case as well. We can do
	if (!l) {
		if (batch + 1 < htab->n_buckets) {
			batch++;
			goto again_nocopy;
		}
		bucket_cnt = 0;
		goto done_bucket;
	}

...
done_bucket:
	rcu_read_unlock();
	bpf_enable_instrumentation();
	...

what do you think?

> +
> +	flags = htab_lock_bucket(htab, b);
>   	bucket_cnt = 0;
>   	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
>   		bucket_cnt++;
>   
> -	if (bucket_cnt && !locked) {
> -		locked = true;
> -		goto again_nocopy;
> -	}
> -
>   	if (bucket_cnt > (max_count - total)) {
>   		if (total == 0)
>   			ret = -ENOSPC;
> @@ -1446,10 +1445,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
>   		goto alloc;
>   	}
>   
> -	/* Next block is only safe to run if you have grabbed the lock */
> -	if (!locked)
> -		goto next_batch;
> -
>   	hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
>   		memcpy(dst_key, l->key, key_size);
>   
> @@ -1492,7 +1487,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
>   	}
>   
>   	htab_unlock_bucket(htab, b, flags);
> -	locked = false;
>   
>   	while (node_to_free) {
>   		l = node_to_free;
> @@ -1500,7 +1494,6 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
>   		bpf_lru_push_free(&htab->lru, &l->lru_node);
>   	}
>   
> -next_batch:
>   	/* If we are not copying data, we can go to next bucket and avoid
>   	 * unlocking the rcu.
>   	 */
>