* [PATCH bpf 0/2] Wait for busy refill_work when destroying bpf memory allocator
@ 2022-10-19 11:55 Hou Tao
2022-10-19 11:55 ` [PATCH bpf 1/2] bpf: " Hou Tao
2022-10-19 11:55 ` [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possible during memory draining Hou Tao
0 siblings, 2 replies; 11+ messages in thread
From: Hou Tao @ 2022-10-19 11:55 UTC
To: bpf, Alexei Starovoitov
Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
Jiri Olsa, John Fastabend, houtao1
From: Hou Tao <houtao1@huawei.com>
Hi,
This patchset fixes a problem with bpf memory allocator destruction on
PREEMPT_RT kernels and on kernels with arch_irq_work_has_interrupt()
being false (e.g. a 1-cpu arm32 host). The root cause is that
refill_work may still be busy while the allocator is being destroyed,
which may incur an oops or other problems, as shown in patch #1. Patch
#1 fixes the problem by waiting for the completion of the irq work
during destruction, and patch #2 is a clean-up on top of patch #1.
Please see the individual patches for more details.
Comments are always welcome.
Hou Tao (2):
bpf: Wait for busy refill_work when destroying bpf memory allocator
bpf: Use __llist_del_all() whenever possible during memory draining
kernel/bpf/memalloc.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
--
2.29.2
* [PATCH bpf 1/2] bpf: Wait for busy refill_work when destroying bpf memory allocator
2022-10-19 11:55 [PATCH bpf 0/2] Wait for busy refill_work when destroying bpf memory allocator Hou Tao
@ 2022-10-19 11:55 ` Hou Tao
2022-10-19 18:38 ` sdf
2022-10-19 11:55 ` [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possible during memory draining Hou Tao
1 sibling, 1 reply; 11+ messages in thread
From: Hou Tao @ 2022-10-19 11:55 UTC
To: bpf, Alexei Starovoitov
Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
Jiri Olsa, John Fastabend, houtao1
From: Hou Tao <houtao1@huawei.com>
A busy irq work is an unfinished irq work: it is either in the
pending state or in the running state. When destroying the bpf memory
allocator, refill_work may be busy on a PREEMPT_RT kernel, where irq
work is invoked in a per-CPU RT kthread. It is also possible on a kernel
with arch_irq_work_has_interrupt() being false (e.g. a 1-cpu arm32
host), where irq work is invoked in the timer interrupt.
The busy refill_work leads to various issues. The obvious one is that
there will be concurrent operations on free_by_rcu and free_llist
between the irq work and memory draining. Another one is that
call_rcu_in_progress is not reliable for checking pending RCU callbacks,
because do_call_rcu() may not have been invoked by the irq work yet. The
last one is a use-after-free if the irq work is freed before its
callback is invoked, as shown below:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
Oops: 0010 [#1] PREEMPT_RT SMP
CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
......
Call Trace:
<TASK>
irq_work_single+0x24/0x60
irq_work_run_list+0x24/0x30
run_irq_workd+0x23/0x30
smpboot_thread_fn+0x203/0x300
kthread+0x126/0x150
ret_from_fork+0x1f/0x30
</TASK>
Considering the ease of concurrency handling and the short wait time
of irq_work_sync() under PREEMPT_RT (when running two test_maps on a
PREEMPT_RT kernel on a 72-cpu host, the max wait time is about 8ms and
the 99th percentile is 10us), just wait for the busy refill_work to
complete before memory draining and memory freeing.
Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 94f0f63443a6..48e606aaacf0 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
rcu_in_progress = 0;
for_each_possible_cpu(cpu) {
c = per_cpu_ptr(ma->cache, cpu);
+ /*
+ * refill_work may be unfinished for PREEMPT_RT kernel
+ * in which irq work is invoked in a per-CPU RT thread.
+ * It is also possible for kernel with
+ * arch_irq_work_has_interrupt() being false and irq
+ * work is inovked in timer interrupt. So wait for the
+ * completion of irq work to ease the handling of
+ * concurrency.
+ */
+ irq_work_sync(&c->refill_work);
drain_mem_cache(c);
rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
}
@@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
cc = per_cpu_ptr(ma->caches, cpu);
for (i = 0; i < NUM_CACHES; i++) {
c = &cc->cache[i];
+ irq_work_sync(&c->refill_work);
drain_mem_cache(c);
rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
}
--
2.29.2
* [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possible during memory draining
2022-10-19 11:55 [PATCH bpf 0/2] Wait for busy refill_work when destroying bpf memory allocator Hou Tao
2022-10-19 11:55 ` [PATCH bpf 1/2] bpf: " Hou Tao
@ 2022-10-19 11:55 ` Hou Tao
2022-10-19 19:00 ` sdf
1 sibling, 1 reply; 11+ messages in thread
From: Hou Tao @ 2022-10-19 11:55 UTC
To: bpf, Alexei Starovoitov
Cc: Martin KaFai Lau, Andrii Nakryiko, Song Liu, Hao Luo,
Yonghong Song, Daniel Borkmann, KP Singh, Stanislav Fomichev,
Jiri Olsa, John Fastabend, houtao1
From: Hou Tao <houtao1@huawei.com>
Except for waiting_for_gp list, there are no concurrent operations on
free_by_rcu, free_llist and free_llist_extra lists, so use
__llist_del_all() instead of llist_del_all(). waiting_for_gp list can be
deleted by RCU callback concurrently, so still use llist_del_all().
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
kernel/bpf/memalloc.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 48e606aaacf0..7f45744a09f7 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -422,14 +422,17 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
/* No progs are using this bpf_mem_cache, but htab_map_free() called
* bpf_mem_cache_free() for all remaining elements and they can be in
* free_by_rcu or in waiting_for_gp lists, so drain those lists now.
+ *
+ * Except for waiting_for_gp list, there are no concurrent operations
+ * on these lists, so it is safe to use __llist_del_all().
*/
llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
free_one(c, llnode);
llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
free_one(c, llnode);
- llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist))
+ llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist))
free_one(c, llnode);
- llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
+ llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra))
free_one(c, llnode);
}
--
2.29.2
* Re: [PATCH bpf 1/2] bpf: Wait for busy refill_work when destroying bpf memory allocator
2022-10-19 11:55 ` [PATCH bpf 1/2] bpf: " Hou Tao
@ 2022-10-19 18:38 ` sdf
2022-10-20 1:07 ` Hou Tao
0 siblings, 1 reply; 11+ messages in thread
From: sdf @ 2022-10-19 18:38 UTC
To: Hou Tao
Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
Jiri Olsa, John Fastabend, houtao1
On 10/19, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> A busy irq work is an unfinished irq work: it is either in the
> pending state or in the running state. When destroying the bpf memory
> allocator, refill_work may be busy on a PREEMPT_RT kernel, where irq
> work is invoked in a per-CPU RT kthread. It is also possible on a kernel
> with arch_irq_work_has_interrupt() being false (e.g. a 1-cpu arm32
> host), where irq work is invoked in the timer interrupt.
> The busy refill_work leads to various issues. The obvious one is that
> there will be concurrent operations on free_by_rcu and free_llist
> between the irq work and memory draining. Another one is that
> call_rcu_in_progress is not reliable for checking pending RCU callbacks,
> because do_call_rcu() may not have been invoked by the irq work yet. The
> last one is a use-after-free if the irq work is freed before its
> callback is invoked, as shown below:
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page
> PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
> Oops: 0010 [#1] PREEMPT_RT SMP
> CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
> RIP: 0010:0x0
> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
> RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
> ......
> Call Trace:
> <TASK>
> irq_work_single+0x24/0x60
> irq_work_run_list+0x24/0x30
> run_irq_workd+0x23/0x30
> smpboot_thread_fn+0x203/0x300
> kthread+0x126/0x150
> ret_from_fork+0x1f/0x30
> </TASK>
> Considering the ease of concurrency handling and the short wait time
> of irq_work_sync() under PREEMPT_RT (when running two test_maps on a
> PREEMPT_RT kernel on a 72-cpu host, the max wait time is about 8ms and
> the 99th percentile is 10us), just wait for the busy refill_work to
> complete before memory draining and memory freeing.
> Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory
> allocator.")
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
> kernel/bpf/memalloc.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 94f0f63443a6..48e606aaacf0 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> rcu_in_progress = 0;
> for_each_possible_cpu(cpu) {
> c = per_cpu_ptr(ma->cache, cpu);
> + /*
> + * refill_work may be unfinished for PREEMPT_RT kernel
> + * in which irq work is invoked in a per-CPU RT thread.
> + * It is also possible for kernel with
> + * arch_irq_work_has_interrupt() being false and irq
> + * work is inovked in timer interrupt. So wait for the
> + * completion of irq work to ease the handling of
> + * concurrency.
> + */
> + irq_work_sync(&c->refill_work);
Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ?
We do have a bunch of them sprinkled already to run alloc/free with
irqs disabled.
I was also trying to see if adding local_irq_save inside drain_mem_cache
to pair with the ones from refill might work, but waiting for irq to
finish seems easier...
Maybe also move both of these into some new "static void irq_work_wait"
to make it clear that the PREEMPT_RT comment applies to both of them?
Or maybe that helper should do 'for_each_possible_cpu(cpu)
irq_work_sync(&c->refill_work);'
in the PREEMPT_RT case so we don't have to call it twice?
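Something like this completely untested sketch (reusing the
ma->cache/ma->caches layout and NUM_CACHES from this file):

static void irq_work_wait(struct bpf_mem_alloc *ma)
{
	int cpu, i;

	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		return;
	if (ma->cache) {
		for_each_possible_cpu(cpu)
			irq_work_sync(&per_cpu_ptr(ma->cache, cpu)->refill_work);
	}
	if (ma->caches) {
		for_each_possible_cpu(cpu) {
			struct bpf_mem_caches *cc = per_cpu_ptr(ma->caches, cpu);

			for (i = 0; i < NUM_CACHES; i++)
				irq_work_sync(&cc->cache[i].refill_work);
		}
	}
}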
> drain_mem_cache(c);
> rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> }
> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> cc = per_cpu_ptr(ma->caches, cpu);
> for (i = 0; i < NUM_CACHES; i++) {
> c = &cc->cache[i];
> + irq_work_sync(&c->refill_work);
> drain_mem_cache(c);
> rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> }
> --
> 2.29.2
* Re: [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possible during memory draining
2022-10-19 11:55 ` [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possible during memory draining Hou Tao
@ 2022-10-19 19:00 ` sdf
2022-10-20 1:17 ` Hou Tao
0 siblings, 1 reply; 11+ messages in thread
From: sdf @ 2022-10-19 19:00 UTC
To: Hou Tao
Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
Jiri Olsa, John Fastabend, houtao1
On 10/19, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> Except for waiting_for_gp list, there are no concurrent operations on
> free_by_rcu, free_llist and free_llist_extra lists, so use
> __llist_del_all() instead of llist_del_all(). waiting_for_gp list can be
> deleted by RCU callback concurrently, so still use llist_del_all().
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
> kernel/bpf/memalloc.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 48e606aaacf0..7f45744a09f7 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -422,14 +422,17 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
> /* No progs are using this bpf_mem_cache, but htab_map_free() called
> * bpf_mem_cache_free() for all remaining elements and they can be in
> * free_by_rcu or in waiting_for_gp lists, so drain those lists now.
> + *
> + * Except for waiting_for_gp list, there are no concurrent operations
> + * on these lists, so it is safe to use __llist_del_all().
> */
> llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
> free_one(c, llnode);
> llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
> free_one(c, llnode);
> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist))
> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist))
> free_one(c, llnode);
> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra))
> free_one(c, llnode);
Acked-by: Stanislav Fomichev <sdf@google.com>
Seems safe even without the previous patch? OTOH, do we really care
about __llist vs llist in the cleanup path? Might be safer to always
do llist_del_all everywhere?
> }
> --
> 2.29.2
* Re: [PATCH bpf 1/2] bpf: Wait for busy refill_work when destroying bpf memory allocator
2022-10-19 18:38 ` sdf
@ 2022-10-20 1:07 ` Hou Tao
2022-10-20 17:49 ` Stanislav Fomichev
0 siblings, 1 reply; 11+ messages in thread
From: Hou Tao @ 2022-10-20 1:07 UTC
To: sdf
Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
Jiri Olsa, John Fastabend, houtao1
Hi,
On 10/20/2022 2:38 AM, sdf@google.com wrote:
> On 10/19, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>
>> A busy irq work is an unfinished irq work: it is either in the
>> pending state or in the running state. When destroying the bpf memory
>> allocator, refill_work may be busy on a PREEMPT_RT kernel, where irq
>> work is invoked in a per-CPU RT kthread. It is also possible on a kernel
>> with arch_irq_work_has_interrupt() being false (e.g. a 1-cpu arm32
>> host), where irq work is invoked in the timer interrupt.
>
>> The busy refill_work leads to various issues. The obvious one is that
>> there will be concurrent operations on free_by_rcu and free_llist
>> between the irq work and memory draining. Another one is that
>> call_rcu_in_progress is not reliable for checking pending RCU callbacks,
>> because do_call_rcu() may not have been invoked by the irq work yet. The
>> last one is a use-after-free if the irq work is freed before its
>> callback is invoked, as shown below:
>
>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>> #PF: supervisor instruction fetch in kernel mode
>> #PF: error_code(0x0010) - not-present page
>> PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
>> Oops: 0010 [#1] PREEMPT_RT SMP
>> CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
>> RIP: 0010:0x0
>> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
>> RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
>> RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
>> RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
>> ......
>> Call Trace:
>> <TASK>
>> irq_work_single+0x24/0x60
>> irq_work_run_list+0x24/0x30
>> run_irq_workd+0x23/0x30
>> smpboot_thread_fn+0x203/0x300
>> kthread+0x126/0x150
>> ret_from_fork+0x1f/0x30
>> </TASK>
>
>> Considering the ease of concurrency handling and the short wait time
>> of irq_work_sync() under PREEMPT_RT (when running two test_maps on a
>> PREEMPT_RT kernel on a 72-cpu host, the max wait time is about 8ms and
>> the 99th percentile is 10us), just wait for the busy refill_work to
>> complete before memory draining and memory freeing.
>
>> Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory
>> allocator.")
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>> kernel/bpf/memalloc.c | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 94f0f63443a6..48e606aaacf0 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>> rcu_in_progress = 0;
>> for_each_possible_cpu(cpu) {
>> c = per_cpu_ptr(ma->cache, cpu);
>> + /*
>> + * refill_work may be unfinished for PREEMPT_RT kernel
>> + * in which irq work is invoked in a per-CPU RT thread.
>> + * It is also possible for kernel with
>> + * arch_irq_work_has_interrupt() being false and irq
>> + * work is inovked in timer interrupt. So wait for the
>> + * completion of irq work to ease the handling of
>> + * concurrency.
>> + */
>> + irq_work_sync(&c->refill_work);
>
> Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ?
> We do have a bunch of them sprinkled already to run alloc/free with
> irqs disabled.
No. As said in the commit message and the comments, irq_work_sync() is
needed both for PREEMPT_RT kernels and for kernels with
arch_irq_work_has_interrupt() being false. And for other kernels,
irq_work_sync() doesn't incur any overhead, because it is just a simple
memory read through irq_work_is_busy() and nothing else: on those
kernels the irq work must already have completed before
bpf_mem_alloc_destroy() is invoked.
void irq_work_sync(struct irq_work *work)
{
/* handling for PREEMPT_RT and !arch_irq_work_has_interrupt() elided */
while (irq_work_is_busy(work))
cpu_relax();
}
>
> I was also trying to see if adding local_irq_save inside drain_mem_cache
> to pair with the ones from refill might work, but waiting for irq to
> finish seems easier...
Disabling hard irqs works, but irq_work_sync() is still needed to
ensure the irq work has completed before its memory is freed.
>
> Maybe also move both of these in some new "static void irq_work_wait"
> to make it clear that the PREEMT_RT comment applies to both of them?
>
> Or maybe that helper should do 'for_each_possible_cpu(cpu)
> irq_work_sync(&c->refill_work);'
> in the PREEMPT_RT case so we don't have to call it twice?
drain_mem_cache() is also time-consuming sometimes, so I think it is
better to interleave irq_work_sync() and drain_mem_cache() to reduce
the waiting time.
>
>> drain_mem_cache(c);
>> rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>> }
>> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>> cc = per_cpu_ptr(ma->caches, cpu);
>> for (i = 0; i < NUM_CACHES; i++) {
>> c = &cc->cache[i];
>> + irq_work_sync(&c->refill_work);
>> drain_mem_cache(c);
>> rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>> }
>> --
>> 2.29.2
>
> .
* Re: [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possible during memory draining
2022-10-19 19:00 ` sdf
@ 2022-10-20 1:17 ` Hou Tao
2022-10-20 17:52 ` Stanislav Fomichev
0 siblings, 1 reply; 11+ messages in thread
From: Hou Tao @ 2022-10-20 1:17 UTC (permalink / raw)
To: sdf
Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
Jiri Olsa, John Fastabend, houtao1
Hi,
On 10/20/2022 3:00 AM, sdf@google.com wrote:
> On 10/19, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>
>> Except for waiting_for_gp list, there are no concurrent operations on
>> free_by_rcu, free_llist and free_llist_extra lists, so use
>> __llist_del_all() instead of llist_del_all(). waiting_for_gp list can be
>> deleted by RCU callback concurrently, so still use llist_del_all().
>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>> kernel/bpf/memalloc.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 48e606aaacf0..7f45744a09f7 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -422,14 +422,17 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
>> /* No progs are using this bpf_mem_cache, but htab_map_free() called
>> * bpf_mem_cache_free() for all remaining elements and they can be in
>> * free_by_rcu or in waiting_for_gp lists, so drain those lists now.
>> + *
>> + * Except for waiting_for_gp list, there are no concurrent operations
>> + * on these lists, so it is safe to use __llist_del_all().
>> */
>> llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
>> free_one(c, llnode);
>> llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
>> free_one(c, llnode);
>> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist))
>> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist))
>> free_one(c, llnode);
>> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
>> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra))
>> free_one(c, llnode);
>
> Acked-by: Stanislav Fomichev <sdf@google.com>
Thanks for the Acked-by.
>
> Seems safe even without the previous patch? OTOH, do we really care
> about __llist vs llist in the cleanup path? Might be safer to always
> do llist_del_all everywhere?
No. free_llist is manipulated concurrently by both the irq work and
memory draining before patch #1. Using llist_del_all(&c->free_llist)
also doesn't help, because the irq work uses the
__llist_add()/__llist_del() helpers. Basically there is no difference
between the __llist and llist helpers for the cleanup path, but I think
it is better to clarify the possible concurrent accesses and codify
these assumptions.
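To illustrate (a conceptual sketch, not actual memalloc.c code; 'c' is
a struct bpf_mem_cache and 'llnode' a struct llist_node):

/* refill_work (irq work) uses the non-atomic helper: */
__llist_add(llnode, &c->free_llist);     /* plain store to head->first */
/* while memory draining uses the atomic one: */
llnode = llist_del_all(&c->free_llist);  /* xchg(&head->first, NULL) */

The xchg() in llist_del_all() only serializes against other atomic
llist operations, so it can not fix the race with the plain store in
__llist_add(). Waiting for the irq work to finish (patch #1) is what
actually makes the draining safe.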
>
>> }
>
>> --
>> 2.29.2
>
> .
* Re: [PATCH bpf 1/2] bpf: Wait for busy refill_work when destroying bpf memory allocator
2022-10-20 1:07 ` Hou Tao
@ 2022-10-20 17:49 ` Stanislav Fomichev
2022-10-21 1:06 ` Hou Tao
0 siblings, 1 reply; 11+ messages in thread
From: Stanislav Fomichev @ 2022-10-20 17:49 UTC
To: Hou Tao
Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
Jiri Olsa, John Fastabend, houtao1
On Wed, Oct 19, 2022 at 6:08 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 10/20/2022 2:38 AM, sdf@google.com wrote:
> > On 10/19, Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >
> >> A busy irq work is an unfinished irq work: it is either in the
> >> pending state or in the running state. When destroying the bpf memory
> >> allocator, refill_work may be busy on a PREEMPT_RT kernel, where irq
> >> work is invoked in a per-CPU RT kthread. It is also possible on a kernel
> >> with arch_irq_work_has_interrupt() being false (e.g. a 1-cpu arm32
> >> host), where irq work is invoked in the timer interrupt.
> >
> >> The busy refill_work leads to various issues. The obvious one is that
> >> there will be concurrent operations on free_by_rcu and free_llist
> >> between the irq work and memory draining. Another one is that
> >> call_rcu_in_progress is not reliable for checking pending RCU callbacks,
> >> because do_call_rcu() may not have been invoked by the irq work yet. The
> >> last one is a use-after-free if the irq work is freed before its
> >> callback is invoked, as shown below:
> >
> >> BUG: kernel NULL pointer dereference, address: 0000000000000000
> >> #PF: supervisor instruction fetch in kernel mode
> >> #PF: error_code(0x0010) - not-present page
> >> PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
> >> Oops: 0010 [#1] PREEMPT_RT SMP
> >> CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
> >> RIP: 0010:0x0
> >> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> >> RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
> >> RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
> >> RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
> >> ......
> >> Call Trace:
> >> <TASK>
> >> irq_work_single+0x24/0x60
> >> irq_work_run_list+0x24/0x30
> >> run_irq_workd+0x23/0x30
> >> smpboot_thread_fn+0x203/0x300
> >> kthread+0x126/0x150
> >> ret_from_fork+0x1f/0x30
> >> </TASK>
> >
> >> Considering the ease of concurrency handling and the short wait time
> >> of irq_work_sync() under PREEMPT_RT (when running two test_maps on a
> >> PREEMPT_RT kernel on a 72-cpu host, the max wait time is about 8ms and
> >> the 99th percentile is 10us), just wait for the busy refill_work to
> >> complete before memory draining and memory freeing.
> >
> >> Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory
> >> allocator.")
> >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >> ---
> >> kernel/bpf/memalloc.c | 11 +++++++++++
> >> 1 file changed, 11 insertions(+)
> >
> >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >> index 94f0f63443a6..48e606aaacf0 100644
> >> --- a/kernel/bpf/memalloc.c
> >> +++ b/kernel/bpf/memalloc.c
> >> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> >> rcu_in_progress = 0;
> >> for_each_possible_cpu(cpu) {
> >> c = per_cpu_ptr(ma->cache, cpu);
> >> + /*
> >> + * refill_work may be unfinished for PREEMPT_RT kernel
> >> + * in which irq work is invoked in a per-CPU RT thread.
> >> + * It is also possible for kernel with
> >> + * arch_irq_work_has_interrupt() being false and irq
> >> + * work is inovked in timer interrupt. So wait for the
> >> + * completion of irq work to ease the handling of
> >> + * concurrency.
> >> + */
> >> + irq_work_sync(&c->refill_work);
> >
> > Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ?
> > We do have a bunch of them sprinkled already to run alloc/free with
> > irqs disabled.
> No. As said in the commit message and the comments, irq_work_sync() is
> needed both for PREEMPT_RT kernels and for kernels with
> arch_irq_work_has_interrupt() being false. And for other kernels,
> irq_work_sync() doesn't incur any overhead, because it is just a simple
> memory read through irq_work_is_busy() and nothing else: on those
> kernels the irq work must already have completed before
> bpf_mem_alloc_destroy() is invoked.
>
> void irq_work_sync(struct irq_work *work)
> {
> /* handling for PREEMPT_RT and !arch_irq_work_has_interrupt() elided */
> while (irq_work_is_busy(work))
> cpu_relax();
> }
I see, thanks for clarifying! I was so carried away with that
PREEMPT_RT that I missed the fact that arch_irq_work_has_interrupt is
a separate thing. Agreed that doing irq_work_sync won't hurt in a
non-preempt/has_interrupt case.
In this case, can you still do a respin and fix the spelling issue in
the comment? You can slap my acked-by for the v2:
Acked-by: Stanislav Fomichev <sdf@google.com>
s/work is inovked in timer interrupt. So wait for the/... invoked .../
> >
> > I was also trying to see if adding local_irq_save inside drain_mem_cache
> > to pair with the ones from refill might work, but waiting for irq to
> > finish seems easier...
> Disabling hard irqs works, but irq_work_sync() is still needed to
> ensure the irq work has completed before its memory is freed.
> >
> > Maybe also move both of these into some new "static void irq_work_wait"
> > to make it clear that the PREEMPT_RT comment applies to both of them?
> >
> > Or maybe that helper should do 'for_each_possible_cpu(cpu)
> > irq_work_sync(&c->refill_work);'
> > in the PREEMPT_RT case so we don't have to call it twice?
> drain_mem_cache() is also time-consuming sometimes, so I think it is
> better to interleave irq_work_sync() and drain_mem_cache() to reduce
> the waiting time.
>
> >
> >> drain_mem_cache(c);
> >> rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> >> }
> >> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> >> cc = per_cpu_ptr(ma->caches, cpu);
> >> for (i = 0; i < NUM_CACHES; i++) {
> >> c = &cc->cache[i];
> >> + irq_work_sync(&c->refill_work);
> >> drain_mem_cache(c);
> >> rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> >> }
> >> --
> >> 2.29.2
> >
> > .
>
* Re: [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possible during memory draining
2022-10-20 1:17 ` Hou Tao
@ 2022-10-20 17:52 ` Stanislav Fomichev
2022-10-21 1:09 ` Hou Tao
0 siblings, 1 reply; 11+ messages in thread
From: Stanislav Fomichev @ 2022-10-20 17:52 UTC
To: Hou Tao
Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
Jiri Olsa, John Fastabend, houtao1
On Wed, Oct 19, 2022 at 6:18 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 10/20/2022 3:00 AM, sdf@google.com wrote:
> > On 10/19, Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >
> >> Except for waiting_for_gp list, there are no concurrent operations on
> >> free_by_rcu, free_llist and free_llist_extra lists, so use
> >> __llist_del_all() instead of llist_del_all(). waiting_for_gp list can be
> >> deleted by RCU callback concurrently, so still use llist_del_all().
> >
> >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >> ---
> >> kernel/bpf/memalloc.c | 7 +++++--
> >> 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >> index 48e606aaacf0..7f45744a09f7 100644
> >> --- a/kernel/bpf/memalloc.c
> >> +++ b/kernel/bpf/memalloc.c
> >> @@ -422,14 +422,17 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
> >> /* No progs are using this bpf_mem_cache, but htab_map_free() called
> >> * bpf_mem_cache_free() for all remaining elements and they can be in
> >> * free_by_rcu or in waiting_for_gp lists, so drain those lists now.
> >> + *
> >> + * Except for waiting_for_gp list, there are no concurrent operations
> >> + * on these lists, so it is safe to use __llist_del_all().
> >> */
> >> llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
> >> free_one(c, llnode);
> >> llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
> >> free_one(c, llnode);
> >> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist))
> >> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist))
> >> free_one(c, llnode);
> >> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
> >> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra))
> >> free_one(c, llnode);
> >
> > Acked-by: Stanislav Fomichev <sdf@google.com>
> Thanks for the Acked-by.
> >
> > Seems safe even without the previous patch? OTOH, do we really care
> > about __llist vs llist in the cleanup path? Might be safer to always
> > do llist_del_all everywhere?
>> No. free_llist is manipulated concurrently by both the irq work and
>> memory draining before patch #1. Using llist_del_all(&c->free_llist)
>> also doesn't help, because the irq work uses the
>> __llist_add()/__llist_del() helpers. Basically there is no difference
>> between the __llist and llist helpers for the cleanup path, but I think
>> it is better to clarify the possible concurrent accesses and codify
>> these assumptions.
But this is still mostly relevant only for the PREEMPT_RT /
!arch_irq_work_has_interrupt() case, right?
For non-preempt, the irq work should've finished long before we got to
drain_mem_cache.
* Re: [PATCH bpf 1/2] bpf: Wait for busy refill_work when destroying bpf memory allocator
2022-10-20 17:49 ` Stanislav Fomichev
@ 2022-10-21 1:06 ` Hou Tao
0 siblings, 0 replies; 11+ messages in thread
From: Hou Tao @ 2022-10-21 1:06 UTC
To: Stanislav Fomichev
Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
Jiri Olsa, John Fastabend, houtao1
Hi,
On 10/21/2022 1:49 AM, Stanislav Fomichev wrote:
> On Wed, Oct 19, 2022 at 6:08 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 10/20/2022 2:38 AM, sdf@google.com wrote:
>>> On 10/19, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
SNIP
>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>> index 94f0f63443a6..48e606aaacf0 100644
>>>> --- a/kernel/bpf/memalloc.c
>>>> +++ b/kernel/bpf/memalloc.c
>>>> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>>> rcu_in_progress = 0;
>>>> for_each_possible_cpu(cpu) {
>>>> c = per_cpu_ptr(ma->cache, cpu);
>>>> + /*
>>>> + * refill_work may be unfinished for PREEMPT_RT kernel
>>>> + * in which irq work is invoked in a per-CPU RT thread.
>>>> + * It is also possible for kernel with
>>>> + * arch_irq_work_has_interrupt() being false and irq
>>>> + * work is inovked in timer interrupt. So wait for the
>>>> + * completion of irq work to ease the handling of
>>>> + * concurrency.
>>>> + */
>>>> + irq_work_sync(&c->refill_work);
>>> Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ?
>>> We do have a bunch of them sprinkled already to run alloc/free with
>>> irqs disabled.
>> No. As said in the commit message and the comments, irq_work_sync() is
>> needed both for PREEMPT_RT kernels and for kernels with
>> arch_irq_work_has_interrupt() being false. And for other kernels,
>> irq_work_sync() doesn't incur any overhead, because it is just a simple
>> memory read through irq_work_is_busy() and nothing else: on those
>> kernels the irq work must already have completed before
>> bpf_mem_alloc_destroy() is invoked.
>>
>> void irq_work_sync(struct irq_work *work)
>> {
>> /* handling for PREEMPT_RT and !arch_irq_work_has_interrupt() elided */
>> while (irq_work_is_busy(work))
>> cpu_relax();
>> }
> I see, thanks for clarifying! I was so carried away with that
> PREEMPT_RT that I missed the fact that arch_irq_work_has_interrupt is
> a separate thing. Agreed that doing irq_work_sync won't hurt in a
> non-preempt/has_interrupt case.
>
> In this case, can you still do a respin and fix the spelling issue in
> the comment? You can slap my acked-by for the v2:
>
> Acked-by: Stanislav Fomichev <sdf@google.com>
>
> s/work is inovked in timer interrupt. So wait for the/... invoked .../
Thanks. Will update the commit message and the comments in v2 to fix
the typos, and add a note about the fact that there is no overhead on
non-PREEMPT_RT kernels with arch_irq_work_has_interrupt().
>
>>> I was also trying to see if adding local_irq_save inside drain_mem_cache
>>> to pair with the ones from refill might work, but waiting for irq to
>>> finish seems easier...
>> Disabling hard irqs works, but irq_work_sync() is still needed to
>> ensure the irq work has completed before its memory is freed.
>>> Maybe also move both of these into some new "static void irq_work_wait"
>>> to make it clear that the PREEMPT_RT comment applies to both of them?
>>>
>>> Or maybe that helper should do 'for_each_possible_cpu(cpu)
>>> irq_work_sync(&c->refill_work);'
>>> in the PREEMPT_RT case so we don't have to call it twice?
>> drain_mem_cache() is also time-consuming sometimes, so I think it is
>> better to interleave irq_work_sync() and drain_mem_cache() to reduce
>> the waiting time.
>>
>>>> drain_mem_cache(c);
>>>> rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>>>> }
>>>> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>>> cc = per_cpu_ptr(ma->caches, cpu);
>>>> for (i = 0; i < NUM_CACHES; i++) {
>>>> c = &cc->cache[i];
>>>> + irq_work_sync(&c->refill_work);
>>>> drain_mem_cache(c);
>>>> rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>>>> }
>>>> --
>>>> 2.29.2
>>> .
> .
* Re: [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possible during memory draining
2022-10-20 17:52 ` Stanislav Fomichev
@ 2022-10-21 1:09 ` Hou Tao
0 siblings, 0 replies; 11+ messages in thread
From: Hou Tao @ 2022-10-21 1:09 UTC
To: Stanislav Fomichev
Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
Jiri Olsa, John Fastabend, houtao1
Hi,
On 10/21/2022 1:52 AM, Stanislav Fomichev wrote:
> On Wed, Oct 19, 2022 at 6:18 PM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>> index 48e606aaacf0..7f45744a09f7 100644
>>>> --- a/kernel/bpf/memalloc.c
>>>> +++ b/kernel/bpf/memalloc.c
>>>> @@ -422,14 +422,17 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
>>>> /* No progs are using this bpf_mem_cache, but htab_map_free() called
>>>> * bpf_mem_cache_free() for all remaining elements and they can be in
>>>> * free_by_rcu or in waiting_for_gp lists, so drain those lists now.
>>>> + *
>>>> + * Except for waiting_for_gp list, there are no concurrent operations
>>>> + * on these lists, so it is safe to use __llist_del_all().
>>>> */
>>>> llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
>>>> free_one(c, llnode);
>>>> llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
>>>> free_one(c, llnode);
>>>> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist))
>>>> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist))
>>>> free_one(c, llnode);
>>>> - llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
>>>> + llist_for_each_safe(llnode, t, __llist_del_all(&c->free_llist_extra))
>>>> free_one(c, llnode);
>>> Acked-by: Stanislav Fomichev <sdf@google.com>
>> Thanks for the Acked-by.
>>> Seems safe even without the previous patch? OTOH, do we really care
>>> about __llist vs llist in the cleanup path? Might be safer to always
>>> do llist_del_all everywhere?
>> No. free_llist is manipulated concurrently by both the irq work and
>> memory draining before patch #1. Using llist_del_all(&c->free_llist)
>> also doesn't help, because the irq work uses the
>> __llist_add()/__llist_del() helpers. Basically there is no difference
>> between the __llist and llist helpers for the cleanup path, but I think
>> it is better to clarify the possible concurrent accesses and codify
>> these assumptions.
> But this is still mostly relevant only for the PREEMPT_RT /
> !arch_irq_work_has_interrupt() case, right?
> For non-preempt, the irq work should've finished long before we got to
> drain_mem_cache.
Yes. Concurrent access to free_llist is only possible in the
PREEMPT_RT / !arch_irq_work_has_interrupt() cases.
Thread overview: 11+ messages
2022-10-19 11:55 [PATCH bpf 0/2] Wait for busy refill_work when destroying bpf memory allocator Hou Tao
2022-10-19 11:55 ` [PATCH bpf 1/2] bpf: " Hou Tao
2022-10-19 18:38 ` sdf
2022-10-20 1:07 ` Hou Tao
2022-10-20 17:49 ` Stanislav Fomichev
2022-10-21 1:06 ` Hou Tao
2022-10-19 11:55 ` [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possible during memory draining Hou Tao
2022-10-19 19:00 ` sdf
2022-10-20 1:17 ` Hou Tao
2022-10-20 17:52 ` Stanislav Fomichev
2022-10-21 1:09 ` Hou Tao