* [PATCH bpf-next 0/2] Misc optimizations for bpf mem allocator
@ 2022-12-06  4:29 Hou Tao
  2022-12-06  4:29 ` [PATCH bpf-next 1/2] bpf: Reuse freed element in free_by_rcu during allocation Hou Tao
  2022-12-06  4:29 ` [PATCH bpf-next 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true Hou Tao
  0 siblings, 2 replies; 12+ messages in thread
From: Hou Tao @ 2022-12-06  4:29 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

Hi,

The patchset contains a couple of misc optimizations for the bpf memory
allocator. Patch 1 fixes the OOM problem found while running the hash-table
update benchmark from the qp-trie patchset [0]. The benchmark adds htab
elements in batches and then deletes them in batches, so the freed objects
stack up on free_by_rcu and wait for the expiration of the RCU grace period.
There can be tens of thousands of freed objects, and because these objects
are not available for reuse, adding htab elements keeps doing new
allocations.
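
For reference, the access pattern of the benchmark is roughly equivalent to
the following userspace loop (an illustrative sketch only, not the actual
bench code from the qp-trie patchset; churn_htab(), map_fd and max_entries
are placeholders):

#include <bpf/bpf.h>
#include <string.h>

static void churn_htab(int map_fd, unsigned int max_entries)
{
	char key[255] = {};	/* key_size is 255 */
	unsigned int value = 0;	/* value_size is 4 */
	unsigned int i;

	/* add htab elements in batch ... */
	for (i = 0; i < max_entries; i++) {
		memcpy(key, &i, sizeof(i));
		bpf_map_update_elem(map_fd, key, &value, BPF_ANY);
	}

	/* ... then delete them in batch; the freed elements pile up on the
	 * per-cpu free_by_rcu list and wait for an RCU grace period before
	 * they can be reclaimed, while new updates keep allocating fresh
	 * objects.
	 */
	for (i = 0; i < max_entries; i++) {
		memcpy(key, &i, sizeof(i));
		bpf_map_delete_elem(map_fd, key);
	}
}

The real benchmark runs this add/delete cycle continuously on multiple
producer threads, which is what makes the freed objects pile up so quickly.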

For the benchmark command "./bench -w3 -d10 -a htab-update -p 16", even
though the maximum number of htab entries is only 16384, key_size is 255 and
value_size is 4, the peak memory usage reaches 14GB or more. Increasing
rcupdate.rcu_task_enqueue_lim decreases the peak memory to 860MB, but that is
still far too much. Although the above case is contrived, it is better to fix
it, and the fix is simple: just reuse the freed objects in free_by_rcu during
allocation. After the fix, the peak memory usage decreases to 26MB. Besides
the above case, the memory blow-up problem is also possible when allocation
and freeing are done on totally different CPUs. I'm trying to fix that
blow-up by using a global per-cpu work to free the objects on free_by_rcu in
a timely manner, but it doesn't work very well yet and I am still digging
into it.

Patch 2 is a left-over patch from the rcu_trace_implies_rcu_gp() patchset
[1]. After discussing with Paul [2], I think it is also safe to skip
rcu_barrier() when rcu_trace_implies_rcu_gp() returns true, because in that
case the allocator never chains call_rcu(..., __free_rcu), so there is no
pending __free_rcu() callback for rcu_barrier() to wait for.

Comments are always welcome.

[0]: https://lore.kernel.org/bpf/20220924133620.4147153-13-houtao@huaweicloud.com/
[1]: https://lore.kernel.org/bpf/20221014113946.965131-1-houtao@huaweicloud.com/
[2]: https://lore.kernel.org/bpf/20221021185002.GP5600@paulmck-ThinkPad-P17-Gen-1/

Hou Tao (2):
  bpf: Reuse freed element in free_by_rcu during allocation
  bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true

 kernel/bpf/memalloc.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

-- 
2.29.2



* [PATCH bpf-next 1/2] bpf: Reuse freed element in free_by_rcu during allocation
  2022-12-06  4:29 [PATCH bpf-next 0/2] Misc optimizations for bpf mem allocator Hou Tao
@ 2022-12-06  4:29 ` Hou Tao
  2022-12-07  1:52   ` Yonghong Song
  2022-12-06  4:29 ` [PATCH bpf-next 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true Hou Tao
  1 sibling, 1 reply; 12+ messages in thread
From: Hou Tao @ 2022-12-06  4:29 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

When there are batched freeing operations on a specific CPU, part of
the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
be moved to waiting_for_gp list and the remaining part will be left in
free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
period and the next invocation of free_bulk().

So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
allocate a new object, in alloc_bulk() just check whether or not there
is freed element in free_by_rcu and reuse it if available.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/memalloc.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 8f0d65f2474a..7daf147bc8f6 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -171,9 +171,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	memcg = get_memcg(c);
 	old_memcg = set_active_memcg(memcg);
 	for (i = 0; i < cnt; i++) {
-		obj = __alloc(c, node);
-		if (!obj)
-			break;
+		/*
+		 * free_by_rcu is only manipulated by irq work refill_work().
+		 * IRQ works on the same CPU are called sequentially, so it is
+		 * safe to use __llist_del_first() here. If alloc_bulk() is
+		 * invoked by the initial prefill, there will be no running
+		 * irq work, so __llist_del_first() is fine as well.
+		 *
+		 * In most cases, objects on free_by_rcu are from the same CPU.
+		 * If some objects come from other CPUs, it doesn't incur any
+		 * harm because NUMA_NO_NODE means the preference for current
+		 * numa node and it is not a guarantee.
+		 */
+		obj = __llist_del_first(&c->free_by_rcu);
+		if (!obj) {
+			obj = __alloc(c, node);
+			if (!obj)
+				break;
+		}
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			/* In RT irq_work runs in per-cpu kthread, so disable
 			 * interrupts to avoid preemption and interrupts and
-- 
2.29.2



* [PATCH bpf-next 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
  2022-12-06  4:29 [PATCH bpf-next 0/2] Misc optimizations for bpf mem allocator Hou Tao
  2022-12-06  4:29 ` [PATCH bpf-next 1/2] bpf: Reuse freed element in free_by_rcu during allocation Hou Tao
@ 2022-12-06  4:29 ` Hou Tao
  2022-12-07  2:24   ` Hou Tao
  2022-12-07  4:00   ` Yonghong Song
  1 sibling, 2 replies; 12+ messages in thread
From: Hou Tao @ 2022-12-06  4:29 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

From: Hou Tao <houtao1@huawei.com>

If there are pending rcu callback, free_mem_alloc() will use
rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending
__free_rcu_tasks_trace() and __free_rcu() callback.

If rcu_trace_implies_rcu_gp() is true, there will be no pending
__free_rcu(), so it will be OK to skip rcu_barrier() as well.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/memalloc.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 7daf147bc8f6..d43991fafc4f 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -464,9 +464,17 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
 {
 	/* waiting_for_gp lists was drained, but __free_rcu might
 	 * still execute. Wait for it now before we freeing percpu caches.
+	 *
+	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
+	 * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
+	 * to wait for the pending __free_rcu_tasks_trace() and __free_rcu(),
+	 * so if call_rcu(head, __free_rcu) is skipped due to
+	 * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
+	 * using rcu_trace_implies_rcu_gp() as well.
 	 */
 	rcu_barrier_tasks_trace();
-	rcu_barrier();
+	if (!rcu_trace_implies_rcu_gp())
+		rcu_barrier();
 	free_mem_alloc_no_barrier(ma);
 }
 
-- 
2.29.2



* Re: [PATCH bpf-next 1/2] bpf: Reuse freed element in free_by_rcu during allocation
  2022-12-06  4:29 ` [PATCH bpf-next 1/2] bpf: Reuse freed element in free_by_rcu during allocation Hou Tao
@ 2022-12-07  1:52   ` Yonghong Song
  2022-12-07  2:20     ` Hou Tao
  0 siblings, 1 reply; 12+ messages in thread
From: Yonghong Song @ 2022-12-07  1:52 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1



On 12/5/22 8:29 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> When there are batched freeing operations on a specific CPU, part of
> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
> be moved to waiting_for_gp list and the remaining part will be left in
> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
> period and the next invocation of free_bulk().

The change below LGTM. However, the above description seems not precise.
IIUC, free_by_rcu list => waiting_for_gp is controlled by whether
call_rcu_in_progress is true or not. If it is true, free_by_rcu list
will remain intact and not moving into waiting_for_gp list.
So it is not 'the remaining part will be left in free_by_rcu'.
It is all elements in free_by_rcu to waiting_for_gp or none.

> 
> So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
> allocate a new object, in alloc_bulk() just check whether or not there
> is freed element in free_by_rcu and reuse it if available.
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>

LGTM except the above suggestions.

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>   kernel/bpf/memalloc.c | 21 ++++++++++++++++++---
>   1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 8f0d65f2474a..7daf147bc8f6 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -171,9 +171,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>   	memcg = get_memcg(c);
>   	old_memcg = set_active_memcg(memcg);
>   	for (i = 0; i < cnt; i++) {
> -		obj = __alloc(c, node);
> -		if (!obj)
> -			break;
> +		/*
> +		 * free_by_rcu is only manipulated by irq work refill_work().
> +		 * IRQ works on the same CPU are called sequentially, so it is
> +		 * safe to use __llist_del_first() here. If alloc_bulk() is
> +		 * invoked by the initial prefill, there will be no running
> +		 * irq work, so __llist_del_first() is fine as well.
> +		 *
> +		 * In most cases, objects on free_by_rcu are from the same CPU.
> +		 * If some objects come from other CPUs, it doesn't incur any
> +		 * harm because NUMA_NO_NODE means the preference for current
> +		 * numa node and it is not a guarantee.
> +		 */
> +		obj = __llist_del_first(&c->free_by_rcu);
> +		if (!obj) {
> +			obj = __alloc(c, node);
> +			if (!obj)
> +				break;
> +		}
>   		if (IS_ENABLED(CONFIG_PREEMPT_RT))
>   			/* In RT irq_work runs in per-cpu kthread, so disable
>   			 * interrupts to avoid preemption and interrupts and


* Re: [PATCH bpf-next 1/2] bpf: Reuse freed element in free_by_rcu during allocation
  2022-12-07  1:52   ` Yonghong Song
@ 2022-12-07  2:20     ` Hou Tao
  2022-12-07  2:58       ` Alexei Starovoitov
  0 siblings, 1 reply; 12+ messages in thread
From: Hou Tao @ 2022-12-07  2:20 UTC (permalink / raw)
  To: Yonghong Song, bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

Hi,

On 12/7/2022 9:52 AM, Yonghong Song wrote:
>
>
> On 12/5/22 8:29 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When there are batched freeing operations on a specific CPU, part of
>> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
>> be moved to waiting_for_gp list and the remaining part will be left in
>> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
>> period and the next invocation of free_bulk().
>
> The change below LGTM. However, the above description seems not precise.
> IIUC, free_by_rcu list => waiting_for_gp is controlled by whether
> call_rcu_in_progress is true or not. If it is true, free_by_rcu list
> will remain intact and not moving into waiting_for_gp list.
> So it is not 'the remaining part will be left in free_by_rcu'.
> It is all elements in free_by_rcu to waiting_for_gp or none.
Thanks for the review and the suggestions. I tried to say that moving from
free_by_rcu to waiting_for_gp is slow, and there can be many free elements being
stacked on free_by_rcu list. So how about the following rephrasing or do you
still prefer "It is all elements in free_by_rcu to waiting_for_gp or none."  ?

When there are batched freeing operations on a specific CPU, part of the freed
elements ((high_watermark - lower_watermark) / 2 + 1) will be moved to
waiting_for_gp list  and the remaining part will be left in free_by_rcu list.
These elements in free_by_rcu list will be moved into waiting_for_gp list after
one RCU-tasks-trace grace period and another invocation of free_bulk(), so there
may be many free elements being stacked on free_by_rcu_list.

>>
>> So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
>> allocate a new object, in alloc_bulk() just check whether or not there
>> is freed element in free_by_rcu and reuse it if available.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>
> LGTM except the above suggestions.
>
> Acked-by: Yonghong Song <yhs@fb.com>
>
>> ---
>>   kernel/bpf/memalloc.c | 21 ++++++++++++++++++---
>>   1 file changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 8f0d65f2474a..7daf147bc8f6 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -171,9 +171,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt,
>> int node)
>>       memcg = get_memcg(c);
>>       old_memcg = set_active_memcg(memcg);
>>       for (i = 0; i < cnt; i++) {
>> -        obj = __alloc(c, node);
>> -        if (!obj)
>> -            break;
>> +        /*
>> +         * free_by_rcu is only manipulated by irq work refill_work().
>> +         * IRQ works on the same CPU are called sequentially, so it is
>> +         * safe to use __llist_del_first() here. If alloc_bulk() is
>> +         * invoked by the initial prefill, there will be no running
>> +         * irq work, so __llist_del_first() is fine as well.
>> +         *
>> +         * In most cases, objects on free_by_rcu are from the same CPU.
>> +         * If some objects come from other CPUs, it doesn't incur any
>> +         * harm because NUMA_NO_NODE means the preference for current
>> +         * numa node and it is not a guarantee.
>> +         */
>> +        obj = __llist_del_first(&c->free_by_rcu);
>> +        if (!obj) {
>> +            obj = __alloc(c, node);
>> +            if (!obj)
>> +                break;
>> +        }
>>           if (IS_ENABLED(CONFIG_PREEMPT_RT))
>>               /* In RT irq_work runs in per-cpu kthread, so disable
>>                * interrupts to avoid preemption and interrupts and
>
> .



* Re: [PATCH bpf-next 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
  2022-12-06  4:29 ` [PATCH bpf-next 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true Hou Tao
@ 2022-12-07  2:24   ` Hou Tao
  2022-12-07 22:28     ` Paul E. McKenney
  2022-12-07  4:00   ` Yonghong Song
  1 sibling, 1 reply; 12+ messages in thread
From: Hou Tao @ 2022-12-07  2:24 UTC (permalink / raw)
  To: Hou Tao, bpf, Paul E. McKenney, rcu
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1

Forget to cc Paul and RCU maillist for more comments.

On 12/6/2022 12:29 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> If there are pending rcu callback, free_mem_alloc() will use
> rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending
> __free_rcu_tasks_trace() and __free_rcu() callback.
>
> If rcu_trace_implies_rcu_gp() is true, there will be no pending
> __free_rcu(), so it will be OK to skip rcu_barrier() as well.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  kernel/bpf/memalloc.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 7daf147bc8f6..d43991fafc4f 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -464,9 +464,17 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
>  {
>  	/* waiting_for_gp lists was drained, but __free_rcu might
>  	 * still execute. Wait for it now before we freeing percpu caches.
> +	 *
> +	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
> +	 * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
> +	 * to wait for the pending __free_rcu_tasks_trace() and __free_rcu(),
> +	 * so if call_rcu(head, __free_rcu) is skipped due to
> +	 * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
> +	 * using rcu_trace_implies_rcu_gp() as well.
>  	 */
>  	rcu_barrier_tasks_trace();
> -	rcu_barrier();
> +	if (!rcu_trace_implies_rcu_gp())
> +		rcu_barrier();
>  	free_mem_alloc_no_barrier(ma);
>  }
>  



* Re: [PATCH bpf-next 1/2] bpf: Reuse freed element in free_by_rcu during allocation
  2022-12-07  2:20     ` Hou Tao
@ 2022-12-07  2:58       ` Alexei Starovoitov
  2022-12-08  1:49         ` Hou Tao
  0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2022-12-07  2:58 UTC (permalink / raw)
  To: Hou Tao
  Cc: Yonghong Song, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, Stanislav Fomichev, Jiri Olsa, John Fastabend, Hou Tao

On Tue, Dec 6, 2022 at 6:20 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 12/7/2022 9:52 AM, Yonghong Song wrote:
> >
> >
> > On 12/5/22 8:29 PM, Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> When there are batched freeing operations on a specific CPU, part of
> >> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
> >> be moved to waiting_for_gp list and the remaining part will be left in
> >> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
> >> period and the next invocation of free_bulk().
> >
> > The change below LGTM. However, the above description seems not precise.
> > IIUC, free_by_rcu list => waiting_for_gp is controlled by whether
> > call_rcu_in_progress is true or not. If it is true, free_by_rcu list
> > will remain intact and not moving into waiting_for_gp list.
> > So it is not 'the remaining part will be left in free_by_rcu'.
> > It is all elements in free_by_rcu to waiting_for_gp or none.
> Thanks for the review and the suggestions. I tried to say that moving from
> free_by_rcu to waiting_for_gp is slow, and there can be many free elements being
> stacked on free_by_rcu list. So how about the following rephrasing or do you
> still prefer "It is all elements in free_by_rcu to waiting_for_gp or none."  ?
>
> When there are batched freeing operations on a specific CPU, part of the freed
> elements ((high_watermark - lower_watermark) / 2 + 1) will be moved to
> waiting_for_gp list  and the remaining part will be left in free_by_rcu list.

I agree with Yonghong.
The above sentence is not just not precise.
'elements moved to waiting_for_gp list' part is not correct.
The elements never moved into it directly.
Only via free_by_rcu list.

All or none also matters.


* Re: [PATCH bpf-next 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
  2022-12-06  4:29 ` [PATCH bpf-next 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true Hou Tao
  2022-12-07  2:24   ` Hou Tao
@ 2022-12-07  4:00   ` Yonghong Song
  1 sibling, 0 replies; 12+ messages in thread
From: Yonghong Song @ 2022-12-07  4:00 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, houtao1



On 12/5/22 8:29 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> If there are pending rcu callback, free_mem_alloc() will use
> rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending
> __free_rcu_tasks_trace() and __free_rcu() callback.
> 
> If rcu_trace_implies_rcu_gp() is true, there will be no pending
> __free_rcu(), so it will be OK to skip rcu_barrier() as well.
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>

Acked-by: Yonghong Song <yhs@fb.com>


* Re: [PATCH bpf-next 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
  2022-12-07  2:24   ` Hou Tao
@ 2022-12-07 22:28     ` Paul E. McKenney
  2022-12-08  1:27       ` Hou Tao
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2022-12-07 22:28 UTC (permalink / raw)
  To: Hou Tao
  Cc: Hou Tao, bpf, rcu, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, Stanislav Fomichev, Jiri Olsa, John Fastabend

On Wed, Dec 07, 2022 at 10:24:55AM +0800, Hou Tao wrote:
> Forget to cc Paul and RCU maillist for more comments.
> 
> On 12/6/2022 12:29 PM, Hou Tao wrote:
> > From: Hou Tao <houtao1@huawei.com>
> >
> > If there are pending rcu callback, free_mem_alloc() will use
> > rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending
> > __free_rcu_tasks_trace() and __free_rcu() callback.
> >
> > If rcu_trace_implies_rcu_gp() is true, there will be no pending
> > __free_rcu(), so it will be OK to skip rcu_barrier() as well.

The bit about there being no pending __free_rcu() is guaranteed by
your algorithm, correct?  As in you have something like this somewhere
else in the code?

	if (!rcu_trace_implies_rcu_gp())
		call_rcu(...);

Or am I missing something more subtle?

							Thanx, Paul

> > Signed-off-by: Hou Tao <houtao1@huawei.com>
> > ---
> >  kernel/bpf/memalloc.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index 7daf147bc8f6..d43991fafc4f 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -464,9 +464,17 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
> >  {
> >  	/* waiting_for_gp lists was drained, but __free_rcu might
> >  	 * still execute. Wait for it now before we freeing percpu caches.
> > +	 *
> > +	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
> > +	 * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
> > +	 * to wait for the pending __free_rcu_tasks_trace() and __free_rcu(),
> > +	 * so if call_rcu(head, __free_rcu) is skipped due to
> > +	 * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
> > +	 * using rcu_trace_implies_rcu_gp() as well.
> >  	 */
> >  	rcu_barrier_tasks_trace();
> > -	rcu_barrier();
> > +	if (!rcu_trace_implies_rcu_gp())
> > +		rcu_barrier();
> >  	free_mem_alloc_no_barrier(ma);
> >  }
> >  
> 


* Re: [PATCH bpf-next 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
  2022-12-07 22:28     ` Paul E. McKenney
@ 2022-12-08  1:27       ` Hou Tao
  2022-12-08  1:42         ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: Hou Tao @ 2022-12-08  1:27 UTC (permalink / raw)
  To: paulmck, Hou Tao
  Cc: bpf, rcu, Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
	Yonghong Song, Alexei Starovoitov, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend

Hi,

On 12/8/2022 6:28 AM, Paul E. McKenney wrote:
> On Wed, Dec 07, 2022 at 10:24:55AM +0800, Hou Tao wrote:
>> Forget to cc Paul and RCU maillist for more comments.
>>
>> On 12/6/2022 12:29 PM, Hou Tao wrote:
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> If there are pending rcu callback, free_mem_alloc() will use
>>> rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending
>>> __free_rcu_tasks_trace() and __free_rcu() callback.
>>>
>>> If rcu_trace_implies_rcu_gp() is true, there will be no pending
>>> __free_rcu(), so it will be OK to skip rcu_barrier() as well.
> The bit about there being no pending __free_rcu() is guaranteed by
> your algorithm, correct?  As in you have something like this somewhere
> else in the code?
>
> 	if (!rcu_trace_implies_rcu_gp())
> 		call_rcu(...);
>
> Or am I missing something more subtle?
Yes. It is guaranteed by the implementation of bpf mem allocator: if
rcu_trace_implies_rcu_gp() is true, there will be no call_rcu() in bpf memory
allocator.
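
The relevant code is __free_rcu_tasks_trace() in kernel/bpf/memalloc.c, which
looks roughly like the snippet below (quoted from memory, so the exact
comment wording may differ):

	static void __free_rcu_tasks_trace(struct rcu_head *head)
	{
		/* If RCU Tasks Trace grace period implies RCU grace period,
		 * there is no need to invoke call_rcu().
		 */
		if (rcu_trace_implies_rcu_gp())
			__free_rcu(head);
		else
			call_rcu(head, __free_rcu);
	}

So it is exactly the pattern you describe, just with the branches written the
other way around.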
>
> 							Thanx, Paul
>
>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>>> ---
>>>  kernel/bpf/memalloc.c | 10 +++++++++-
>>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>> index 7daf147bc8f6..d43991fafc4f 100644
>>> --- a/kernel/bpf/memalloc.c
>>> +++ b/kernel/bpf/memalloc.c
>>> @@ -464,9 +464,17 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
>>>  {
>>>  	/* waiting_for_gp lists was drained, but __free_rcu might
>>>  	 * still execute. Wait for it now before we freeing percpu caches.
>>> +	 *
>>> +	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
>>> +	 * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
>>> +	 * to wait for the pending __free_rcu_tasks_trace() and __free_rcu(),
>>> +	 * so if call_rcu(head, __free_rcu) is skipped due to
>>> +	 * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
>>> +	 * using rcu_trace_implies_rcu_gp() as well.
>>>  	 */
>>>  	rcu_barrier_tasks_trace();
>>> -	rcu_barrier();
>>> +	if (!rcu_trace_implies_rcu_gp())
>>> +		rcu_barrier();
>>>  	free_mem_alloc_no_barrier(ma);
>>>  }
>>>  



* Re: [PATCH bpf-next 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
  2022-12-08  1:27       ` Hou Tao
@ 2022-12-08  1:42         ` Paul E. McKenney
  0 siblings, 0 replies; 12+ messages in thread
From: Paul E. McKenney @ 2022-12-08  1:42 UTC (permalink / raw)
  To: Hou Tao
  Cc: Hou Tao, bpf, rcu, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, Stanislav Fomichev, Jiri Olsa, John Fastabend

On Thu, Dec 08, 2022 at 09:27:05AM +0800, Hou Tao wrote:
> Hi,
> 
> On 12/8/2022 6:28 AM, Paul E. McKenney wrote:
> > On Wed, Dec 07, 2022 at 10:24:55AM +0800, Hou Tao wrote:
> >> Forget to cc Paul and RCU maillist for more comments.
> >>
> >> On 12/6/2022 12:29 PM, Hou Tao wrote:
> >>> From: Hou Tao <houtao1@huawei.com>
> >>>
> >>> If there are pending rcu callback, free_mem_alloc() will use
> >>> rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending
> >>> __free_rcu_tasks_trace() and __free_rcu() callback.
> >>>
> >>> If rcu_trace_implies_rcu_gp() is true, there will be no pending
> >>> __free_rcu(), so it will be OK to skip rcu_barrier() as well.
> > The bit about there being no pending __free_rcu() is guaranteed by
> > your algorithm, correct?  As in you have something like this somewhere
> > else in the code?
> >
> > 	if (!rcu_trace_implies_rcu_gp())
> > 		call_rcu(...);
> >
> > Or am I missing something more subtle?
> Yes. It is guaranteed by the implementation of bpf mem allocator: if
> rcu_trace_implies_rcu_gp() is true, there will be no call_rcu() in bpf memory
> allocator.

Very well, then from an RCU perspective:

Acked-by: Paul E. McKenney <paulmck@kernel.org>

							Thanx, Paul

> >>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >>> ---
> >>>  kernel/bpf/memalloc.c | 10 +++++++++-
> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >>> index 7daf147bc8f6..d43991fafc4f 100644
> >>> --- a/kernel/bpf/memalloc.c
> >>> +++ b/kernel/bpf/memalloc.c
> >>> @@ -464,9 +464,17 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
> >>>  {
> >>>  	/* waiting_for_gp lists was drained, but __free_rcu might
> >>>  	 * still execute. Wait for it now before we freeing percpu caches.
> >>> +	 *
> >>> +	 * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
> >>> +	 * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
> >>> +	 * to wait for the pending __free_rcu_tasks_trace() and __free_rcu(),
> >>> +	 * so if call_rcu(head, __free_rcu) is skipped due to
> >>> +	 * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
> >>> +	 * using rcu_trace_implies_rcu_gp() as well.
> >>>  	 */
> >>>  	rcu_barrier_tasks_trace();
> >>> -	rcu_barrier();
> >>> +	if (!rcu_trace_implies_rcu_gp())
> >>> +		rcu_barrier();
> >>>  	free_mem_alloc_no_barrier(ma);
> >>>  }
> >>>  
> 


* Re: [PATCH bpf-next 1/2] bpf: Reuse freed element in free_by_rcu during allocation
  2022-12-07  2:58       ` Alexei Starovoitov
@ 2022-12-08  1:49         ` Hou Tao
  0 siblings, 0 replies; 12+ messages in thread
From: Hou Tao @ 2022-12-08  1:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, bpf, Martin KaFai Lau, Andrii Nakryiko, Song Liu,
	Hao Luo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	KP Singh, Stanislav Fomichev, Jiri Olsa, John Fastabend, Hou Tao

Hi,

On 12/7/2022 10:58 AM, Alexei Starovoitov wrote:
> On Tue, Dec 6, 2022 at 6:20 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 12/7/2022 9:52 AM, Yonghong Song wrote:
>>>
>>> On 12/5/22 8:29 PM, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> When there are batched freeing operations on a specific CPU, part of
>>>> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
>>>> be moved to waiting_for_gp list and the remaining part will be left in
>>>> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
>>>> period and the next invocation of free_bulk().
>>> The change below LGTM. However, the above description seems not precise.
>>> IIUC, free_by_rcu list => waiting_for_gp is controlled by whether
>>> call_rcu_in_progress is true or not. If it is true, free_by_rcu list
>>> will remain intact and not moving into waiting_for_gp list.
>>> So it is not 'the remaining part will be left in free_by_rcu'.
>>> It is all elements in free_by_rcu to waiting_for_gp or none.
>> Thanks for the review and the suggestions. I tried to say that moving from
>> free_by_rcu to waiting_for_gp is slow, and there can be many free elements being
>> stacked on free_by_rcu list. So how about the following rephrasing or do you
>> still prefer "It is all elements in free_by_rcu to waiting_for_gp or none."  ?
>>
>> When there are batched freeing operations on a specific CPU, part of the freed
>> elements ((high_watermark - lower_watermark) / 2 + 1) will be moved to
>> waiting_for_gp list  and the remaining part will be left in free_by_rcu list.
> I agree with Yonghong.
> The above sentence is not just not precise.
> 'elements moved to waiting_for_gp list' part is not correct.
> The elements never moved into it directly.
> Only via free_by_rcu list.
Yes.
>
> All or none also matters.
I am still confused about the "All or none". Does it mean that all elements
in the free_by_rcu list will be moved into waiting_for_gp, or that none will
be moved if call_rcu_in_progress is true?

So how about the following rephrasing?

When there are batched freeing operations on a specific CPU, part of the freed
elements ((high_watermark - lower_watermark) / 2 + 1) will be indirectly moved
into waiting_for_gp list through free_by_rcu list. After call_rcu_in_progress
becomes false again, the remaining elements in free_by_rcu list will be moved to
waiting_for_gp list by the next invocation of free_bulk(). However, if the
expiration of the RCU tasks trace grace period is relatively slow, no element in
free_by_rcu list will be moved.

So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to allocate a new
object, in alloc_bulk() just check whether or not there is a freed element in
the free_by_rcu list and reuse it if available.

> .



Thread overview: 12+ messages
2022-12-06  4:29 [PATCH bpf-next 0/2] Misc optimizations for bpf mem allocator Hou Tao
2022-12-06  4:29 ` [PATCH bpf-next 1/2] bpf: Reuse freed element in free_by_rcu during allocation Hou Tao
2022-12-07  1:52   ` Yonghong Song
2022-12-07  2:20     ` Hou Tao
2022-12-07  2:58       ` Alexei Starovoitov
2022-12-08  1:49         ` Hou Tao
2022-12-06  4:29 ` [PATCH bpf-next 2/2] bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true Hou Tao
2022-12-07  2:24   ` Hou Tao
2022-12-07 22:28     ` Paul E. McKenney
2022-12-08  1:27       ` Hou Tao
2022-12-08  1:42         ` Paul E. McKenney
2022-12-07  4:00   ` Yonghong Song
