* [PATCH bpf 0/2] bpf/memalloc: Allow non-atomic alloc_bulk
@ 2023-07-20 20:44 YiFei Zhu
  2023-07-20 20:44 ` [PATCH bpf 1/2] bpf/memalloc: Non-atomically allocate freelist during prefill YiFei Zhu
  2023-07-20 20:44 ` [PATCH bpf 2/2] bpf/memalloc: Schedule highprio wq for non-atomic alloc when atomic fails YiFei Zhu
  0 siblings, 2 replies; 11+ messages in thread
From: YiFei Zhu @ 2023-07-20 20:44 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Stanislav Fomichev,
	Martin KaFai Lau, Andrii Nakryiko

In internal testing of test_maps, we sometimes observed failures like:

  test_maps: test_maps.c:173: void test_hashmap_percpu(unsigned int, void *):
  Assertion `bpf_map_update_elem(fd, &key, value, BPF_ANY) == 0' failed.

where the errno is ENOMEM. After some troubleshooting and enabling
the warnings, we saw:

[   91.304708] percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left
[   91.304716] CPU: 51 PID: 24145 Comm: test_maps Kdump: loaded Tainted: G N 6.1.38-smp-DEV #7
[   91.304719] Hardware name: Google Astoria/astoria, BIOS 0.20230627.0-0 06/27/2023
[   91.304721] Call Trace:
[   91.304724]  <TASK>
[   91.304730]  [<ffffffffa7ef83b9>] dump_stack_lvl+0x59/0x88
[   91.304737]  [<ffffffffa7ef83f8>] dump_stack+0x10/0x18
[   91.304738]  [<ffffffffa75caa0c>] pcpu_alloc+0x6fc/0x870
[   91.304741]  [<ffffffffa75ca302>] __alloc_percpu_gfp+0x12/0x20
[   91.304743]  [<ffffffffa756785e>] alloc_bulk+0xde/0x1e0
[   91.304746]  [<ffffffffa7566c02>] bpf_mem_alloc_init+0xd2/0x2f0
[   91.304747]  [<ffffffffa7547c69>] htab_map_alloc+0x479/0x650
[   91.304750]  [<ffffffffa751d6e0>] map_create+0x140/0x2e0
[   91.304752]  [<ffffffffa751d413>] __sys_bpf+0x5a3/0x6c0
[   91.304753]  [<ffffffffa751c3ec>] __x64_sys_bpf+0x1c/0x30
[   91.304754]  [<ffffffffa7ef847a>] do_syscall_64+0x5a/0x80
[   91.304756]  [<ffffffffa800009b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

This makes sense, because in atomic context percpu allocation will not
create new chunks; new chunks are only created in non-atomic contexts.

This series adds ways for the allocation to occur non-atomically,
allowing the allocator to take mutexes, perform IO, and/or sleep.

Patch 1 addresses the prefill case. Since prefill already runs in a
sleepable context, this simply means conditionally using GFP_KERNEL
instead of GFP_NOWAIT.

Patch 2 addresses the refill case. Since refill always runs in atomic
context, we instead schedule workqueue work to perform the non-atomic
allocation when the atomic attempt fails.

YiFei Zhu (2):
  bpf/memalloc: Non-atomically allocate freelist during prefill
  bpf/memalloc: Schedule highprio wq for non-atomic alloc when atomic
    fails

 kernel/bpf/memalloc.c | 59 ++++++++++++++++++++++++++++++-------------
 1 file changed, 41 insertions(+), 18 deletions(-)

-- 
2.41.0.487.g6d72f3e995-goog

^ permalink raw reply	[flat|nested] 11+ messages in thread
* [PATCH bpf 1/2] bpf/memalloc: Non-atomically allocate freelist during prefill
  2023-07-20 20:44 [PATCH bpf 0/2] bpf/memalloc: Allow non-atomic alloc_bulk YiFei Zhu
@ 2023-07-20 20:44 ` YiFei Zhu
  2023-07-21  1:44   ` Hou Tao
  2023-07-20 20:44 ` [PATCH bpf 2/2] bpf/memalloc: Schedule highprio wq for non-atomic alloc when atomic fails YiFei Zhu
  1 sibling, 1 reply; 11+ messages in thread
From: YiFei Zhu @ 2023-07-20 20:44 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Stanislav Fomichev,
	Martin KaFai Lau, Andrii Nakryiko

Sometimes during prefill all percpu chunks are full and atomic
__alloc_percpu_gfp would not allocate new chunks. This will cause
-ENOMEM immediately upon next unit_alloc.

Prefill phase does not actually run in atomic context, so we can
use this fact to allocate non-atomically with GFP_KERNEL instead
of GFP_NOWAIT. This avoids the immediate -ENOMEM. Unfortunately
unit_alloc runs in atomic context, even from map item allocation in
syscalls, due to rcu_read_lock, so we can't do non-atomic
workarounds in unit_alloc.

Fixes: 4ab67149f3c6 ("bpf: Add percpu allocation support to bpf_mem_alloc.")
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
---
 kernel/bpf/memalloc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 0668bcd7c926..016249672b43 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -154,13 +154,17 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
 }
 
 /* Mostly runs from irq_work except __init phase. */
-static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
+static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
 {
 	struct mem_cgroup *memcg = NULL, *old_memcg;
 	unsigned long flags;
+	gfp_t gfp;
 	void *obj;
 	int i;
 
+	gfp = __GFP_NOWARN | __GFP_ACCOUNT;
+	gfp |= atomic ? GFP_NOWAIT : GFP_KERNEL;
+
 	memcg = get_memcg(c);
 	old_memcg = set_active_memcg(memcg);
 	for (i = 0; i < cnt; i++) {
@@ -183,7 +187,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 		 * will allocate from the current numa node which is what we
 		 * want here.
 		 */
-		obj = __alloc(c, node, GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT);
+		obj = __alloc(c, node, gfp);
 		if (!obj)
 			break;
 	}
@@ -321,7 +325,7 @@ static void bpf_mem_refill(struct irq_work *work)
 		/* irq_work runs on this cpu and kmalloc will allocate
 		 * from the current numa node which is what we want here.
 		 */
-		alloc_bulk(c, c->batch, NUMA_NO_NODE);
+		alloc_bulk(c, c->batch, NUMA_NO_NODE, true);
 	else if (cnt > c->high_watermark)
 		free_bulk(c);
 }
@@ -367,7 +371,7 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
 	 * prog won't be doing more than 4 map_update_elem from
 	 * irq disabled region
 	 */
-	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu));
+	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu), false);
 }
 
 /* When size != 0 bpf_mem_cache for each cpu.
-- 
2.41.0.487.g6d72f3e995-goog

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: [PATCH bpf 1/2] bpf/memalloc: Non-atomically allocate freelist during prefill
  2023-07-20 20:44 ` [PATCH bpf 1/2] bpf/memalloc: Non-atomically allocate freelist during prefill YiFei Zhu
@ 2023-07-21  1:44   ` Hou Tao
  2023-07-21  2:31     ` YiFei Zhu
  0 siblings, 1 reply; 11+ messages in thread
From: Hou Tao @ 2023-07-21 1:44 UTC (permalink / raw)
  To: YiFei Zhu, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Stanislav Fomichev,
	Martin KaFai Lau, Andrii Nakryiko

On 7/21/2023 4:44 AM, YiFei Zhu wrote:
> Sometimes during prefill all precpu chunks are full and atomic
> __alloc_percpu_gfp would not allocate new chunks. This will cause
> -ENOMEM immediately upon next unit_alloc.
>
> Prefill phase does not actually run in atomic context, so we can
> use this fact to allocate non-atomically with GFP_KERNEL instead
> of GFP_NOWAIT. This avoids the immediate -ENOMEM. Unfortunately
> unit_alloc runs in atomic context, even from map item allocation in
> syscalls, due to rcu_read_lock, so we can't do non-atomic
> workarounds in unit_alloc.
>
> Fixes: 4ab67149f3c6 ("bpf: Add percpu allocation support to bpf_mem_alloc.")
> Signed-off-by: YiFei Zhu <zhuyifei@google.com>

Make sense to me, so

Acked-by: Hou Tao <houtao1@huawei.com>

But I don't know whether or not it is suitable for bpf tree.

> ---
>  kernel/bpf/memalloc.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 0668bcd7c926..016249672b43 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -154,13 +154,17 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
>  }
>  
>  /* Mostly runs from irq_work except __init phase. */
> -static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> +static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
>  {
>  	struct mem_cgroup *memcg = NULL, *old_memcg;
>  	unsigned long flags;
> +	gfp_t gfp;
>  	void *obj;
>  	int i;
>  
> +	gfp = __GFP_NOWARN | __GFP_ACCOUNT;
> +	gfp |= atomic ? GFP_NOWAIT : GFP_KERNEL;
> +
>  	memcg = get_memcg(c);
>  	old_memcg = set_active_memcg(memcg);
>  	for (i = 0; i < cnt; i++) {
> @@ -183,7 +187,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>  		 * will allocate from the current numa node which is what we
>  		 * want here.
>  		 */
> -		obj = __alloc(c, node, GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT);
> +		obj = __alloc(c, node, gfp);
>  		if (!obj)
>  			break;
>  	}
> @@ -321,7 +325,7 @@ static void bpf_mem_refill(struct irq_work *work)
>  		/* irq_work runs on this cpu and kmalloc will allocate
>  		 * from the current numa node which is what we want here.
>  		 */
> -		alloc_bulk(c, c->batch, NUMA_NO_NODE);
> +		alloc_bulk(c, c->batch, NUMA_NO_NODE, true);
>  	else if (cnt > c->high_watermark)
>  		free_bulk(c);
>  }
> @@ -367,7 +371,7 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
>  	 * prog won't be doing more than 4 map_update_elem from
>  	 * irq disabled region
>  	 */
> -	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu));
> +	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu), false);
>  }
>  
>  /* When size != 0 bpf_mem_cache for each cpu.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH bpf 1/2] bpf/memalloc: Non-atomically allocate freelist during prefill
  2023-07-21  1:44   ` Hou Tao
@ 2023-07-21  2:31     ` YiFei Zhu
  2023-07-26 11:38       ` Hou Tao
  0 siblings, 1 reply; 11+ messages in thread
From: YiFei Zhu @ 2023-07-21 2:31 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Stanislav Fomichev,
	Martin KaFai Lau, Andrii Nakryiko

On Thu, Jul 20, 2023 at 6:45 PM Hou Tao <houtao@huaweicloud.com> wrote:
> On 7/21/2023 4:44 AM, YiFei Zhu wrote:
> > Sometimes during prefill all precpu chunks are full and atomic
> > __alloc_percpu_gfp would not allocate new chunks. This will cause
> > -ENOMEM immediately upon next unit_alloc.
> >
> > Prefill phase does not actually run in atomic context, so we can
> > use this fact to allocate non-atomically with GFP_KERNEL instead
> > of GFP_NOWAIT. This avoids the immediate -ENOMEM. Unfortunately
> > unit_alloc runs in atomic context, even from map item allocation in
> > syscalls, due to rcu_read_lock, so we can't do non-atomic
> > workarounds in unit_alloc.
> >
> > Fixes: 4ab67149f3c6 ("bpf: Add percpu allocation support to bpf_mem_alloc.")
> > Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>
> Make sense to me, so
>
> Acked-by: Hou Tao <houtao1@huawei.com>
>
> But I don't know whether or not it is suitable for bpf tree.

I don't mind either way :) If changing to bpf-next requires a resend I
can do that too.

YiFei Zhu

> > ---
> > kernel/bpf/memalloc.c | 12 ++++++++----
> > 1 file changed, 8 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index 0668bcd7c926..016249672b43 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -154,13 +154,17 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
> > }
> >
> > /* Mostly runs from irq_work except __init phase. */
> > -static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> > +static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
> > {
> > 	struct mem_cgroup *memcg = NULL, *old_memcg;
> > 	unsigned long flags;
> > +	gfp_t gfp;
> > 	void *obj;
> > 	int i;
> >
> > +	gfp = __GFP_NOWARN | __GFP_ACCOUNT;
> > +	gfp |= atomic ? GFP_NOWAIT : GFP_KERNEL;
> > +
> > 	memcg = get_memcg(c);
> > 	old_memcg = set_active_memcg(memcg);
> > 	for (i = 0; i < cnt; i++) {
> > @@ -183,7 +187,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> > 	 * will allocate from the current numa node which is what we
> > 	 * want here.
> > 	 */
> > -	obj = __alloc(c, node, GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT);
> > +	obj = __alloc(c, node, gfp);
> > 	if (!obj)
> > 		break;
> > 	}
> > @@ -321,7 +325,7 @@ static void bpf_mem_refill(struct irq_work *work)
> > 	/* irq_work runs on this cpu and kmalloc will allocate
> > 	 * from the current numa node which is what we want here.
> > 	 */
> > -	alloc_bulk(c, c->batch, NUMA_NO_NODE);
> > +	alloc_bulk(c, c->batch, NUMA_NO_NODE, true);
> > 	else if (cnt > c->high_watermark)
> > 		free_bulk(c);
> > }
> > @@ -367,7 +371,7 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
> > 	 * prog won't be doing more than 4 map_update_elem from
> > 	 * irq disabled region
> > 	 */
> > -	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu));
> > +	alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu), false);
> > }
> >
> > /* When size != 0 bpf_mem_cache for each cpu.
>

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH bpf 1/2] bpf/memalloc: Non-atomically allocate freelist during prefill
  2023-07-21  2:31     ` YiFei Zhu
@ 2023-07-26 11:38       ` Hou Tao
  2023-07-26 18:44         ` YiFei Zhu
  0 siblings, 1 reply; 11+ messages in thread
From: Hou Tao @ 2023-07-26 11:38 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Stanislav Fomichev,
	Martin KaFai Lau, Andrii Nakryiko

Hi,

On 7/21/2023 10:31 AM, YiFei Zhu wrote:
> On Thu, Jul 20, 2023 at 6:45 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> On 7/21/2023 4:44 AM, YiFei Zhu wrote:
>>> Sometimes during prefill all precpu chunks are full and atomic
>>> __alloc_percpu_gfp would not allocate new chunks. This will cause
>>> -ENOMEM immediately upon next unit_alloc.
>>>
>>> Prefill phase does not actually run in atomic context, so we can
>>> use this fact to allocate non-atomically with GFP_KERNEL instead
>>> of GFP_NOWAIT. This avoids the immediate -ENOMEM. Unfortunately
>>> unit_alloc runs in atomic context, even from map item allocation in
>>> syscalls, due to rcu_read_lock, so we can't do non-atomic
>>> workarounds in unit_alloc.
>>>
>>> Fixes: 4ab67149f3c6 ("bpf: Add percpu allocation support to bpf_mem_alloc.")
>>> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>> Make sense to me, so
>>
>> Acked-by: Hou Tao <houtao1@huawei.com>
>>
>> But I don't know whether or not it is suitable for bpf tree.
> I don't mind either way :) If changing to bpf-next requires a resend I
> can do that too.

Please resend and rebase the patch against the bpf-next tree.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH bpf 1/2] bpf/memalloc: Non-atomically allocate freelist during prefill
  2023-07-26 11:38       ` Hou Tao
@ 2023-07-26 18:44         ` YiFei Zhu
  2023-07-27  1:21           ` Hou Tao
  0 siblings, 1 reply; 11+ messages in thread
From: YiFei Zhu @ 2023-07-26 18:44 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Stanislav Fomichev,
	Martin KaFai Lau, Andrii Nakryiko

On Wed, Jul 26, 2023 at 4:38 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 7/21/2023 10:31 AM, YiFei Zhu wrote:
> > On Thu, Jul 20, 2023 at 6:45 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> On 7/21/2023 4:44 AM, YiFei Zhu wrote:
> >>> Sometimes during prefill all precpu chunks are full and atomic
> >>> __alloc_percpu_gfp would not allocate new chunks. This will cause
> >>> -ENOMEM immediately upon next unit_alloc.
> >>>
> >>> Prefill phase does not actually run in atomic context, so we can
> >>> use this fact to allocate non-atomically with GFP_KERNEL instead
> >>> of GFP_NOWAIT. This avoids the immediate -ENOMEM. Unfortunately
> >>> unit_alloc runs in atomic context, even from map item allocation in
> >>> syscalls, due to rcu_read_lock, so we can't do non-atomic
> >>> workarounds in unit_alloc.
> >>>
> >>> Fixes: 4ab67149f3c6 ("bpf: Add percpu allocation support to bpf_mem_alloc.")
> >>> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
> >> Make sense to me, so
> >>
> >> Acked-by: Hou Tao <houtao1@huawei.com>
> >>
> >> But I don't know whether or not it is suitable for bpf tree.
> > I don't mind either way :) If changing to bpf-next requires a resend I
> > can do that too.
>
> Please resend and rebase the patch again bpf-next tree.
>

Will do. Should I drop the Fixes tag then?

YiFei Zhu

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH bpf 1/2] bpf/memalloc: Non-atomically allocate freelist during prefill
  2023-07-26 18:44         ` YiFei Zhu
@ 2023-07-27  1:21           ` Hou Tao
  0 siblings, 0 replies; 11+ messages in thread
From: Hou Tao @ 2023-07-27 1:21 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Stanislav Fomichev,
	Martin KaFai Lau, Andrii Nakryiko

Hi,

On 7/27/2023 2:44 AM, YiFei Zhu wrote:
> On Wed, Jul 26, 2023 at 4:38 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 7/21/2023 10:31 AM, YiFei Zhu wrote:
>>> On Thu, Jul 20, 2023 at 6:45 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> On 7/21/2023 4:44 AM, YiFei Zhu wrote:
>>>>> Sometimes during prefill all precpu chunks are full and atomic
>>>>> __alloc_percpu_gfp would not allocate new chunks. This will cause
>>>>> -ENOMEM immediately upon next unit_alloc.
>>>>>
>>>>> Prefill phase does not actually run in atomic context, so we can
>>>>> use this fact to allocate non-atomically with GFP_KERNEL instead
>>>>> of GFP_NOWAIT. This avoids the immediate -ENOMEM. Unfortunately
>>>>> unit_alloc runs in atomic context, even from map item allocation in
>>>>> syscalls, due to rcu_read_lock, so we can't do non-atomic
>>>>> workarounds in unit_alloc.
>>>>>
>>>>> Fixes: 4ab67149f3c6 ("bpf: Add percpu allocation support to bpf_mem_alloc.")
>>>>> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>>>> Make sense to me, so
>>>>
>>>> Acked-by: Hou Tao <houtao1@huawei.com>
>>>>
>>>> But I don't know whether or not it is suitable for bpf tree.
>>> I don't mind either way :) If changing to bpf-next requires a resend I
>>> can do that too.
>> Please resend and rebase the patch again bpf-next tree.
>>
> Will do. Should I drop the Fixes tag then?

Before the introduction of the bpf memory allocator, the allocation flag
for per-cpu memory allocation in hash map was GFP_NOWAIT. The bpf memory
allocator doesn't change that, so I think we could drop the Fixes tag.

> 
> YiFei Zhu
> 
> .

^ permalink raw reply	[flat|nested] 11+ messages in thread
* [PATCH bpf 2/2] bpf/memalloc: Schedule highprio wq for non-atomic alloc when atomic fails
  2023-07-20 20:44 [PATCH bpf 0/2] bpf/memalloc: Allow non-atomic alloc_bulk YiFei Zhu
  2023-07-20 20:44 ` [PATCH bpf 1/2] bpf/memalloc: Non-atomically allocate freelist during prefill YiFei Zhu
@ 2023-07-20 20:44 ` YiFei Zhu
  2023-07-21  2:24   ` Hou Tao
  1 sibling, 1 reply; 11+ messages in thread
From: YiFei Zhu @ 2023-07-20 20:44 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Stanislav Fomichev,
	Martin KaFai Lau, Andrii Nakryiko

Atomic refill can fail, such as when all percpu chunks are full,
and when that happens there's no guarantee when more space will be
available for atomic allocations.

Instead of having the caller wait for memory to be available by
retrying until the related BPF API no longer gives -ENOMEM, we can
kick off a non-atomic GFP_KERNEL allocation with a highprio
workqueue. This should make it much less likely for those APIs to
return -ENOMEM.

Because alloc_bulk can now be called from the workqueue,
non-atomic calls now also call local_irq_save/restore to reduce
the chance of races.

Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.")
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
---
 kernel/bpf/memalloc.c | 47 ++++++++++++++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 016249672b43..2915639a5e16 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -84,14 +84,15 @@ struct bpf_mem_cache {
 	struct llist_head free_llist;
 	local_t active;
 
-	/* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill
+	/* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill_*
 	 * are sequenced by per-cpu 'active' counter. But unit_free() cannot
 	 * fail. When 'active' is busy the unit_free() will add an object to
 	 * free_llist_extra.
 	 */
 	struct llist_head free_llist_extra;
 
-	struct irq_work refill_work;
+	struct irq_work refill_work_irq;
+	struct work_struct refill_work_wq;
 	struct obj_cgroup *objcg;
 	int unit_size;
 	/* count of objects in free_llist */
@@ -153,7 +154,7 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
 #endif
 }
 
-/* Mostly runs from irq_work except __init phase. */
+/* Mostly runs from irq_work except workqueue and __init phase. */
 static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
 {
 	struct mem_cgroup *memcg = NULL, *old_memcg;
@@ -188,10 +189,18 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
 		 * want here.
 		 */
 		obj = __alloc(c, node, gfp);
-		if (!obj)
+		if (!obj) {
+			/* We might have exhausted the percpu chunks, schedule
+			 * non-atomic allocation so hopefully caller can get
+			 * a free unit upon next invocation.
+			 */
+			if (atomic)
+				queue_work_on(smp_processor_id(),
+					      system_highpri_wq, &c->refill_work_wq);
 			break;
+		}
 	}
-	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) || !atomic)
 		/* In RT irq_work runs in per-cpu kthread, so disable
 		 * interrupts to avoid preemption and interrupts and
 		 * reduce the chance of bpf prog executing on this cpu
@@ -208,7 +217,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
 		__llist_add(obj, &c->free_llist);
 		c->free_cnt++;
 		local_dec(&c->active);
-		if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		if (IS_ENABLED(CONFIG_PREEMPT_RT) || !atomic)
 			local_irq_restore(flags);
 	}
 	set_active_memcg(old_memcg);
@@ -314,9 +323,9 @@ static void free_bulk(struct bpf_mem_cache *c)
 	do_call_rcu(c);
 }
 
-static void bpf_mem_refill(struct irq_work *work)
+static void bpf_mem_refill_irq(struct irq_work *work)
 {
-	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
+	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work_irq);
 	int cnt;
 
 	/* Racy access to free_cnt. It doesn't need to be 100% accurate */
@@ -332,7 +341,14 @@
 
 static void notrace irq_work_raise(struct bpf_mem_cache *c)
 {
-	irq_work_queue(&c->refill_work);
+	irq_work_queue(&c->refill_work_irq);
+}
+
+static void bpf_mem_refill_wq(struct work_struct *work)
+{
+	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work_wq);
+
+	alloc_bulk(c, c->batch, NUMA_NO_NODE, false);
 }
 
 /* For typical bpf map case that uses bpf_mem_cache_alloc and single bucket
@@ -352,7 +368,8 @@ static void notrace irq_work_raise(struct bpf_mem_cache *c)
 
 static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
 {
-	init_irq_work(&c->refill_work, bpf_mem_refill);
+	init_irq_work(&c->refill_work_irq, bpf_mem_refill_irq);
+	INIT_WORK(&c->refill_work_wq, bpf_mem_refill_wq);
 	if (c->unit_size <= 256) {
 		c->low_watermark = 32;
 		c->high_watermark = 96;
@@ -529,7 +546,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 	for_each_possible_cpu(cpu) {
 		c = per_cpu_ptr(ma->cache, cpu);
 		/*
-		 * refill_work may be unfinished for PREEMPT_RT kernel
+		 * refill_work_irq may be unfinished for PREEMPT_RT kernel
 		 * in which irq work is invoked in a per-CPU RT thread.
 		 * It is also possible for kernel with
 		 * arch_irq_work_has_interrupt() being false and irq
@@ -537,7 +554,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 		 * the completion of irq work to ease the handling of
 		 * concurrency.
 		 */
-		irq_work_sync(&c->refill_work);
+		irq_work_sync(&c->refill_work_irq);
+		cancel_work_sync(&c->refill_work_wq);
 		drain_mem_cache(c);
 		rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 	}
@@ -552,7 +570,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 		cc = per_cpu_ptr(ma->caches, cpu);
 		for (i = 0; i < NUM_CACHES; i++) {
 			c = &cc->cache[i];
-			irq_work_sync(&c->refill_work);
+			irq_work_sync(&c->refill_work_irq);
+			cancel_work_sync(&c->refill_work_wq);
 			drain_mem_cache(c);
 			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 		}
@@ -580,7 +599,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
 	 *
 	 * but prog_B could be a perf_event NMI prog.
 	 * Use per-cpu 'active' counter to order free_list access between
-	 * unit_alloc/unit_free/bpf_mem_refill.
+	 * unit_alloc/unit_free/bpf_mem_refill_*.
 	 */
 	local_irq_save(flags);
 	if (local_inc_return(&c->active) == 1) {
-- 
2.41.0.487.g6d72f3e995-goog

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: [PATCH bpf 2/2] bpf/memalloc: Schedule highprio wq for non-atomic alloc when atomic fails
  2023-07-20 20:44 ` [PATCH bpf 2/2] bpf/memalloc: Schedule highprio wq for non-atomic alloc when atomic fails YiFei Zhu
@ 2023-07-21  2:24   ` Hou Tao
  2023-07-21  3:02     ` YiFei Zhu
  0 siblings, 1 reply; 11+ messages in thread
From: Hou Tao @ 2023-07-21 2:24 UTC (permalink / raw)
  To: YiFei Zhu, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Stanislav Fomichev,
	Martin KaFai Lau, Andrii Nakryiko

Hi,

On 7/21/2023 4:44 AM, YiFei Zhu wrote:
> Atomic refill can fail, such as when all percpu chunks are full,
> and when that happens there's no guarantee when more space will be
> available for atomic allocations.
>
> Instead of having the caller wait for memory to be available by
> retrying until the related BPF API no longer gives -ENOMEM, we can
> kick off a non-atomic GFP_KERNEL allocation with highprio workqueue.
> This should make it much less likely for those APIs to return
> -ENOMEM.
>
> Because alloc_bulk can now be called from the workqueue,
> non-atomic calls now also calls local_irq_save/restore to reduce
> the chance of races.
>
> Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.")
> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
> ---
>  kernel/bpf/memalloc.c | 47 ++++++++++++++++++++++++++++++-------------
>  1 file changed, 33 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 016249672b43..2915639a5e16 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -84,14 +84,15 @@ struct bpf_mem_cache {
>  	struct llist_head free_llist;
>  	local_t active;
>  
> -	/* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill
> +	/* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill_*
>  	 * are sequenced by per-cpu 'active' counter. But unit_free() cannot
>  	 * fail. When 'active' is busy the unit_free() will add an object to
>  	 * free_llist_extra.
>  	 */
>  	struct llist_head free_llist_extra;
>  
> -	struct irq_work refill_work;
> +	struct irq_work refill_work_irq;
> +	struct work_struct refill_work_wq;
>  	struct obj_cgroup *objcg;
>  	int unit_size;
>  	/* count of objects in free_llist */
> @@ -153,7 +154,7 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
>  #endif
>  }
>  
> -/* Mostly runs from irq_work except __init phase. */
> +/* Mostly runs from irq_work except workqueue and __init phase. */
>  static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
>  {
>  	struct mem_cgroup *memcg = NULL, *old_memcg;
> @@ -188,10 +189,18 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
>  		 * want here.
>  		 */
>  		obj = __alloc(c, node, gfp);
> -		if (!obj)
> +		if (!obj) {
> +			/* We might have exhausted the percpu chunks, schedule
> +			 * non-atomic allocation so hopefully caller can get
> +			 * a free unit upon next invocation.
> +			 */
> +			if (atomic)
> +				queue_work_on(smp_processor_id(),
> +					      system_highpri_wq, &c->refill_work_wq);

I am not an MM expert, but according to the code in pcpu_balance_workfn(),
it will try to do pcpu_create_chunk() when pcpu_atomic_alloc_failed is true.
So is the reason for introducing refill_work_wq that pcpu_balance_workfn()
is too slow to fulfill the allocation request from the bpf memory allocator?

>  			break;
> +		}
>  	}
> -	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) || !atomic)
>  		/* In RT irq_work runs in per-cpu kthread, so disable
>  		 * interrupts to avoid preemption and interrupts and
>  		 * reduce the chance of bpf prog executing on this cpu
> @@ -208,7 +217,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
>  		__llist_add(obj, &c->free_llist);
>  		c->free_cnt++;
>  		local_dec(&c->active);
> -		if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		if (IS_ENABLED(CONFIG_PREEMPT_RT) || !atomic)
>  			local_irq_restore(flags);
>  	}
>  	set_active_memcg(old_memcg);
> @@ -314,9 +323,9 @@ static void free_bulk(struct bpf_mem_cache *c)
>  	do_call_rcu(c);
>  }
>  
> -static void bpf_mem_refill(struct irq_work *work)
> +static void bpf_mem_refill_irq(struct irq_work *work)
>  {
> -	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
> +	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work_irq);
>  	int cnt;
>  
>  	/* Racy access to free_cnt. It doesn't need to be 100% accurate */
> @@ -332,7 +341,14 @@
>  
>  static void notrace irq_work_raise(struct bpf_mem_cache *c)
>  {
> -	irq_work_queue(&c->refill_work);
> +	irq_work_queue(&c->refill_work_irq);
> +}
> +
> +static void bpf_mem_refill_wq(struct work_struct *work)
> +{
> +	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work_wq);
> +
> +	alloc_bulk(c, c->batch, NUMA_NO_NODE, false);

Considering that the kworker may be interrupted by irq work, there will
be concurrent __llist_del_first() operations on free_by_rcu, and I think
it is not safe to call alloc_bulk directly here. Maybe we can just skip
__llist_del_first() for the !atomic context.

>  }
>  
> /* For typical bpf map case that uses bpf_mem_cache_alloc and single bucket
> @@ -352,7 +368,8 @@ static void notrace irq_work_raise(struct bpf_mem_cache *c)
>  
>  static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
>  {
> -	init_irq_work(&c->refill_work, bpf_mem_refill);
> +	init_irq_work(&c->refill_work_irq, bpf_mem_refill_irq);
> +	INIT_WORK(&c->refill_work_wq, bpf_mem_refill_wq);
>  	if (c->unit_size <= 256) {
>  		c->low_watermark = 32;
>  		c->high_watermark = 96;
> @@ -529,7 +546,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>  	for_each_possible_cpu(cpu) {
>  		c = per_cpu_ptr(ma->cache, cpu);
>  		/*
> -		 * refill_work may be unfinished for PREEMPT_RT kernel
> +		 * refill_work_irq may be unfinished for PREEMPT_RT kernel
>  		 * in which irq work is invoked in a per-CPU RT thread.
>  		 * It is also possible for kernel with
>  		 * arch_irq_work_has_interrupt() being false and irq
> @@ -537,7 +554,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>  		 * the completion of irq work to ease the handling of
>  		 * concurrency.
>  		 */
> -		irq_work_sync(&c->refill_work);
> +		irq_work_sync(&c->refill_work_irq);
> +		cancel_work_sync(&c->refill_work_wq);

cancel_work_sync() may be time-consuming. We may need to move it to
free_mem_alloc_deferred() to prevent blocking the destroy of the bpf
memory allocator.

>  		drain_mem_cache(c);
>  		rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>  	}
> @@ -552,7 +570,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>  		cc = per_cpu_ptr(ma->caches, cpu);
>  		for (i = 0; i < NUM_CACHES; i++) {
>  			c = &cc->cache[i];
> -			irq_work_sync(&c->refill_work);
> +			irq_work_sync(&c->refill_work_irq);
> +			cancel_work_sync(&c->refill_work_wq);
>  			drain_mem_cache(c);
>  			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>  		}
> @@ -580,7 +599,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>  	 *
>  	 * but prog_B could be a perf_event NMI prog.
>  	 * Use per-cpu 'active' counter to order free_list access between
> -	 * unit_alloc/unit_free/bpf_mem_refill.
> +	 * unit_alloc/unit_free/bpf_mem_refill_*.
>  	 */
>  	local_irq_save(flags);
>  	if (local_inc_return(&c->active) == 1) {

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH bpf 2/2] bpf/memalloc: Schedule highprio wq for non-atomic alloc when atomic fails
  2023-07-21  2:24   ` Hou Tao
@ 2023-07-21  3:02     ` YiFei Zhu
  2023-07-26 11:37       ` Hou Tao
  0 siblings, 1 reply; 11+ messages in thread
From: YiFei Zhu @ 2023-07-21 3:02 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Stanislav Fomichev,
	Martin KaFai Lau, Andrii Nakryiko

On Thu, Jul 20, 2023 at 7:24 PM Hou Tao <houtao@huaweicloud.com> wrote:
> Hi,
>
> On 7/21/2023 4:44 AM, YiFei Zhu wrote:
> > Atomic refill can fail, such as when all percpu chunks are full,
> > and when that happens there's no guarantee when more space will be
> > available for atomic allocations.
> >
> > Instead of having the caller wait for memory to be available by
> > retrying until the related BPF API no longer gives -ENOMEM, we can
> > kick off a non-atomic GFP_KERNEL allocation with highprio workqueue.
> > This should make it much less likely for those APIs to return
> > -ENOMEM.
> >
> > Because alloc_bulk can now be called from the workqueue,
> > non-atomic calls now also calls local_irq_save/restore to reduce
> > the chance of races.
> >
> > Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.")
> > Signed-off-by: YiFei Zhu <zhuyifei@google.com>
> > ---
> > kernel/bpf/memalloc.c | 47 ++++++++++++++++++++++++++++++-------------
> > 1 file changed, 33 insertions(+), 14 deletions(-)
> >
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index 016249672b43..2915639a5e16 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -84,14 +84,15 @@ struct bpf_mem_cache {
> > 	struct llist_head free_llist;
> > 	local_t active;
> >
> > -	/* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill
> > +	/* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill_*
> > 	 * are sequenced by per-cpu 'active' counter. But unit_free() cannot
> > 	 * fail. When 'active' is busy the unit_free() will add an object to
> > 	 * free_llist_extra.
> > 	 */
> > 	struct llist_head free_llist_extra;
> >
> > -	struct irq_work refill_work;
> > +	struct irq_work refill_work_irq;
> > +	struct work_struct refill_work_wq;
> > 	struct obj_cgroup *objcg;
> > 	int unit_size;
> > 	/* count of objects in free_llist */
> > @@ -153,7 +154,7 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
> > #endif
> > }
> >
> > -/* Mostly runs from irq_work except __init phase. */
> > +/* Mostly runs from irq_work except workqueue and __init phase. */
> > static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
> > {
> > 	struct mem_cgroup *memcg = NULL, *old_memcg;
> > @@ -188,10 +189,18 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
> > 	 * want here.
> > 	 */
> > 	obj = __alloc(c, node, gfp);
> > -	if (!obj)
> > +	if (!obj) {
> > +		/* We might have exhausted the percpu chunks, schedule
> > +		 * non-atomic allocation so hopefully caller can get
> > +		 * a free unit upon next invocation.
> > +		 */
> > +		if (atomic)
> > +			queue_work_on(smp_processor_id(),
> > +				      system_highpri_wq, &c->refill_work_wq);
>
> I am not a MM expert. But according to the code in
> pcpu_balance_workfn(), it will try to do pcpu_create_chunk() when
> pcpu_atomic_alloc_failed is true, so the reason for introducing
> refill_work_wq is that pcpu_balance_workfn is too slow to fulfill the
> allocation request from bpf memory allocator ?

Oh I missed that part of the code. In one of my tests I had the
previous patch applied, and I had a lot of assertions around the code
(for debugging-by-kdumping), and I was able to get some crashes that
suggested I needed to add more, so I wrote this. However I wasn't able
to reproduce that again.

Though, after giving it another thought, this sequence of events could
still happen:

  initial condition: free_cnt = 1, low_watermark = 1
  unit_alloc() sets free_cnt = 0
  free_cnt < low_watermark
  irq_work_raise()
  irq work: bpf_mem_refill()
    alloc_bulk()
      __alloc()
        __alloc_percpu_gfp() fails
        pcpu_schedule_balance_work()
        return NULL
  pcpu_balance_workfn() succeeds, next __alloc_percpu_gfp will succeed
  unit_alloc()
    free_cnt is still 0
    return NULL

The thing here is that, even if pcpu_balance_workfn is fast enough to
run before the next unit_alloc, unit_alloc will still return NULL. I'm
not sure if this is desired, but this should be a very rare condition
requiring 8k unit_size.

I'm not exactly sure what happened in that dump. And since I'm unable
to reproduce this again, and if we are okay with the rare case above,
I'm happy to drop this patch until I have a better idea of what
happened (or it was just my bad assertions, which could very well be
what happened).

> > 		break;
> > +	}
> > 	}
> > -	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) || !atomic)
> > 	/* In RT irq_work runs in per-cpu kthread, so disable
> > 	 * interrupts to avoid preemption and interrupts and
> > 	 * reduce the chance of bpf prog executing on this cpu
> > @@ -208,7 +217,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
> > 	__llist_add(obj, &c->free_llist);
> > 	c->free_cnt++;
> > 	local_dec(&c->active);
> > -	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) || !atomic)
> > 		local_irq_restore(flags);
> > 	}
> > 	set_active_memcg(old_memcg);
> > @@ -314,9 +323,9 @@ static void free_bulk(struct bpf_mem_cache *c)
> > 	do_call_rcu(c);
> > }
> >
> > -static void bpf_mem_refill(struct irq_work *work)
> > +static void bpf_mem_refill_irq(struct irq_work *work)
> > {
> > -	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache,
refill_work_irq); > > int cnt; > > > > /* Racy access to free_cnt. It doesn't need to be 100% accurate */ > > @@ -332,7 +341,14 @@ static void bpf_mem_refill(struct irq_work *work) > > > > static void notrace irq_work_raise(struct bpf_mem_cache *c) > > { > > - irq_work_queue(&c->refill_work); > > + irq_work_queue(&c->refill_work_irq); > > +} > > + > > +static void bpf_mem_refill_wq(struct work_struct *work) > > +{ > > + struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work_wq); > > + > > + alloc_bulk(c, c->batch, NUMA_NO_NODE, false); > > Considering that the kworker may be interrupted by irq work, so there > will be concurrent __llist_del_first() operations on free_by_rcu, and I > think it is not safe to call alloc_bulk directly here. Maybe we can just > skip __llist_del_first() for !atomic context. Ack. > > } > > > > /* For typical bpf map case that uses bpf_mem_cache_alloc and single bucket > > @@ -352,7 +368,8 @@ static void notrace irq_work_raise(struct bpf_mem_cache *c) > > > > static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu) > > { > > - init_irq_work(&c->refill_work, bpf_mem_refill); > > + init_irq_work(&c->refill_work_irq, bpf_mem_refill_irq); > > + INIT_WORK(&c->refill_work_wq, bpf_mem_refill_wq); > > if (c->unit_size <= 256) { > > c->low_watermark = 32; > > c->high_watermark = 96; > > @@ -529,7 +546,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) > > for_each_possible_cpu(cpu) { > > c = per_cpu_ptr(ma->cache, cpu); > > /* > > - * refill_work may be unfinished for PREEMPT_RT kernel > > + * refill_work_irq may be unfinished for PREEMPT_RT kernel > > * in which irq work is invoked in a per-CPU RT thread. > > * It is also possible for kernel with > > * arch_irq_work_has_interrupt() being false and irq > > @@ -537,7 +554,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) > > * the completion of irq work to ease the handling of > > * concurrency.
> > */ > > - irq_work_sync(&c->refill_work); > > + irq_work_sync(&c->refill_work_irq); > > + cancel_work_sync(&c->refill_work_wq); > > cancel_work_sync() may be time-consuming. We may need to move it to > free_mem_alloc_deferred() to prevent blocking the destroy of bpf memory > allocator. Ack. > > drain_mem_cache(c); > > rcu_in_progress += atomic_read(&c->call_rcu_in_progress); > > } > > @@ -552,7 +570,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) > > cc = per_cpu_ptr(ma->caches, cpu); > > for (i = 0; i < NUM_CACHES; i++) { > > c = &cc->cache[i]; > > - irq_work_sync(&c->refill_work); > > + irq_work_sync(&c->refill_work_irq); > > + cancel_work_sync(&c->refill_work_wq); > > drain_mem_cache(c); > > rcu_in_progress += atomic_read(&c->call_rcu_in_progress); > > } > > @@ -580,7 +599,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) > > * > > * but prog_B could be a perf_event NMI prog. > > * Use per-cpu 'active' counter to order free_list access between > > - * unit_alloc/unit_free/bpf_mem_refill. > > + * unit_alloc/unit_free/bpf_mem_refill_*. > > */ > > local_irq_save(flags); > > if (local_inc_return(&c->active) == 1) {
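The window YiFei traces above, where refill has been raised but not yet run so the next unit_alloc() still fails, can be reproduced with a small single-threaded model. Names echo kernel/bpf/memalloc.c, but this is an illustrative sketch, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* unit_alloc() only takes from the freelist and raises deferred
 * refill work; it never allocates inline.  Once the freelist
 * drains, the next unit_alloc() returns NULL even if the deferred
 * refill (pcpu_balance_workfn in the trace above) would succeed
 * immediately afterwards. */

struct mem_cache {
    int free_cnt;
    int low_watermark;
    int batch;
    bool refill_pending;     /* stands in for irq_work_raise() */
};

static char unit;            /* dummy object */

static void *unit_alloc(struct mem_cache *c)
{
    void *obj = NULL;

    if (c->free_cnt > 0) {
        c->free_cnt--;
        obj = &unit;
    }
    if (c->free_cnt < c->low_watermark)
        c->refill_pending = true;
    return obj;
}

static void refill(struct mem_cache *c)  /* the deferred work finally runs */
{
    c->free_cnt += c->batch;
    c->refill_pending = false;
}
```

Starting from free_cnt = 1 and low_watermark = 1, the first unit_alloc() succeeds and raises refill, the second returns NULL because the deferred work has not run yet, and only after refill() does the third succeed: the one-shot ENOMEM window discussed here.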
* Re: [PATCH bpf 2/2] bpf/memalloc: Schedule highprio wq for non-atomic alloc when atomic fails 2023-07-21 3:02 ` YiFei Zhu @ 2023-07-26 11:37 ` Hou Tao 0 siblings, 0 replies; 11+ messages in thread From: Hou Tao @ 2023-07-26 11:37 UTC (permalink / raw) To: YiFei Zhu, Alexei Starovoitov Cc: bpf, Daniel Borkmann, Stanislav Fomichev, Martin KaFai Lau, Andrii Nakryiko Hi, On 7/21/2023 11:02 AM, YiFei Zhu wrote: > On Thu, Jul 20, 2023 at 7:24 PM Hou Tao <houtao@huaweicloud.com> wrote: >> Hi, >> >> On 7/21/2023 4:44 AM, YiFei Zhu wrote: >>> Atomic refill can fail, such as when all percpu chunks are full, >>> and when that happens there's no guarantee when more space will be >>> available for atomic allocations. >>> >>> Instead of having the caller wait for memory to be available by >>> retrying until the related BPF API no longer gives -ENOMEM, we can >>> kick off a non-atomic GFP_KERNEL allocation with highprio workqueue. >>> This should make it much less likely for those APIs to return >>> -ENOMEM. >>> >>> Because alloc_bulk can now be called from the workqueue, >>> non-atomic calls now also calls local_irq_save/restore to reduce >>> the chance of races. >>> >>> Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.") >>> Signed-off-by: YiFei Zhu <zhuyifei@google.com> >>> --- >>> kernel/bpf/memalloc.c | 47 ++++++++++++++++++++++++++++++------------- >>> 1 file changed, 33 insertions(+), 14 deletions(-) >>> >>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c >>> index 016249672b43..2915639a5e16 100644 >>> --- a/kernel/bpf/memalloc.c >>> +++ b/kernel/bpf/memalloc.c >>> @@ -84,14 +84,15 @@ struct bpf_mem_cache { >>> struct llist_head free_llist; >>> local_t active; >>> >>> - /* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill >>> + /* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill_* >>> * are sequenced by per-cpu 'active' counter. But unit_free() cannot >>> * fail. 
When 'active' is busy the unit_free() will add an object to >>> * free_llist_extra. >>> */ >>> struct llist_head free_llist_extra; >>> >>> - struct irq_work refill_work; >>> + struct irq_work refill_work_irq; >>> + struct work_struct refill_work_wq; >>> struct obj_cgroup *objcg; >>> int unit_size; >>> /* count of objects in free_llist */ >>> @@ -153,7 +154,7 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c) >>> #endif >>> } >>> >>> -/* Mostly runs from irq_work except __init phase. */ >>> +/* Mostly runs from irq_work except workqueue and __init phase. */ >>> static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic) >>> { >>> struct mem_cgroup *memcg = NULL, *old_memcg; >>> @@ -188,10 +189,18 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic) >>> * want here. >>> */ >>> obj = __alloc(c, node, gfp); >>> - if (!obj) >>> + if (!obj) { >>> + /* We might have exhausted the percpu chunks, schedule >>> + * non-atomic allocation so hopefully caller can get >>> + * a free unit upon next invocation. >>> + */ >>> + if (atomic) >>> + queue_work_on(smp_processor_id(), >>> + system_highpri_wq, &c->refill_work_wq); >> I am not a MM expert. But according to the code in >> pcpu_balance_workfn(), it will try to do pcpu_create_chunk() when >> pcpu_atomic_alloc_failed is true, so the reason for introducing >> refill_work_wq is that pcpu_balance_workfn is too slow to fulfill the >> allocation request from bpf memory allocator ? > Oh I missed that part of the code. In one of my tests I had the > previous patch applied, and I had a lot of assertions around the code > (for debugging-by-kdumping), and I was able to get some crashes that > suggested I needed to add more, so I wrote this. However I wasn't able > to reproduce that again. 
Though, after giving it another thought, this > sequence of events could still happen: > > initial condition: free_cnt = 1, low_watermark = 1 > unit_alloc() > sets free_cnt = 0 > free_cnt < low_watermark > irq_work_raise() > irq work: bpf_mem_refill() > alloc_bulk() > __alloc() > __alloc_percpu_gfp() > fails > pcpu_schedule_balance_work() > return NULL > pcpu_balance_workfn() > succeeds, next __alloc_percpu_gfp will succeed > unit_alloc() > free_cnt is still 0 > return NULL bpf_mem_refill_wq is also running asynchronously. So if preemption is enabled, the next unit_alloc() will fail as well if bpf_mem_refill_wq is preempted. > > The thing here is that, even if pcpu_balance_workfn is fast enough to > run before the next unit_alloc, unit_alloc will still return NULL. I'm > not sure if this is desired, but this should be a very rare condition > requiring 8k unit_size. I'm not exactly sure what happened in that > dump. And since I'm unable to reproduce this again, and if we are okay > with the rare case above, I'm happy to drop this patch until I have a > better idea of what happened (or it was just my bad assertions, which > could very well be what happened). I think the patch raises a good question about whether or not GFP_NOWAIT could allocate most of the available memory timely. If the answer is yes, I think the mitigation proposed in the patch will be unnecessary. But I am not a MM expert and I don't have an answer for the question. 
> >>> break; >>> + } >>> } >>> - if (IS_ENABLED(CONFIG_PREEMPT_RT)) >>> + if (IS_ENABLED(CONFIG_PREEMPT_RT) || !atomic) >>> /* In RT irq_work runs in per-cpu kthread, so disable >>> * interrupts to avoid preemption and interrupts and >>> * reduce the chance of bpf prog executing on this cpu >>> @@ -208,7 +217,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic) >>> __llist_add(obj, &c->free_llist); >>> c->free_cnt++; >>> local_dec(&c->active); >>> - if (IS_ENABLED(CONFIG_PREEMPT_RT)) >>> + if (IS_ENABLED(CONFIG_PREEMPT_RT) || !atomic) >>> local_irq_restore(flags); >>> } >>> set_active_memcg(old_memcg); >>> @@ -314,9 +323,9 @@ static void free_bulk(struct bpf_mem_cache *c) >>> do_call_rcu(c); >>> } >>> >>> -static void bpf_mem_refill(struct irq_work *work) >>> +static void bpf_mem_refill_irq(struct irq_work *work) >>> { >>> - struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work); >>> + struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work_irq); >>> int cnt; >>> >>> /* Racy access to free_cnt. It doesn't need to be 100% accurate */ >>> @@ -332,7 +341,14 @@ static void bpf_mem_refill(struct irq_work *work) >>> >>> static void notrace irq_work_raise(struct bpf_mem_cache *c) >>> { >>> - irq_work_queue(&c->refill_work); >>> + irq_work_queue(&c->refill_work_irq); >>> +} >>> + >>> +static void bpf_mem_refill_wq(struct work_struct *work) >>> +{ >>> + struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work_wq); >>> + >>> + alloc_bulk(c, c->batch, NUMA_NO_NODE, false); >> Considering that the kworker may be interrupted by irq work, so there >> will be concurrent __llist_del_first() operations on free_by_rcu, and I >> think it is not safe to call alloc_bulk directly here. Maybe we can just >> skip __llist_del_first() for !atomic context. > Ack.
> >>> } >>> >>> /* For typical bpf map case that uses bpf_mem_cache_alloc and single bucket >>> @@ -352,7 +368,8 @@ static void notrace irq_work_raise(struct bpf_mem_cache *c) >>> >>> static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu) >>> { >>> - init_irq_work(&c->refill_work, bpf_mem_refill); >>> + init_irq_work(&c->refill_work_irq, bpf_mem_refill_irq); >>> + INIT_WORK(&c->refill_work_wq, bpf_mem_refill_wq); >>> if (c->unit_size <= 256) { >>> c->low_watermark = 32; >>> c->high_watermark = 96; >>> @@ -529,7 +546,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) >>> for_each_possible_cpu(cpu) { >>> c = per_cpu_ptr(ma->cache, cpu); >>> /* >>> - * refill_work may be unfinished for PREEMPT_RT kernel >>> + * refill_work_irq may be unfinished for PREEMPT_RT kernel >>> * in which irq work is invoked in a per-CPU RT thread. >>> * It is also possible for kernel with >>> * arch_irq_work_has_interrupt() being false and irq >>> @@ -537,7 +554,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) >>> * the completion of irq work to ease the handling of >>> * concurrency. >>> */ >>> - irq_work_sync(&c->refill_work); >>> + irq_work_sync(&c->refill_work_irq); >>> + cancel_work_sync(&c->refill_work_wq); >> cancel_work_sync() may be time-consuming. We may need to move it to >> free_mem_alloc_deferred() to prevent blocking the destroy of bpf memory >> allocator. > Ack. 
> >>> drain_mem_cache(c); >>> rcu_in_progress += atomic_read(&c->call_rcu_in_progress); >>> } >>> @@ -552,7 +570,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma) >>> cc = per_cpu_ptr(ma->caches, cpu); >>> for (i = 0; i < NUM_CACHES; i++) { >>> c = &cc->cache[i]; >>> - irq_work_sync(&c->refill_work); >>> + irq_work_sync(&c->refill_work_irq); >>> + cancel_work_sync(&c->refill_work_wq); >>> drain_mem_cache(c); >>> rcu_in_progress += atomic_read(&c->call_rcu_in_progress); >>> } >>> @@ -580,7 +599,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) >>> * >>> * but prog_B could be a perf_event NMI prog. >>> * Use per-cpu 'active' counter to order free_list access between >>> - * unit_alloc/unit_free/bpf_mem_refill. >>> + * unit_alloc/unit_free/bpf_mem_refill_*. >>> */ >>> local_irq_save(flags); >>> if (local_inc_return(&c->active) == 1) { > .
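Hou Tao's suggestion, skipping __llist_del_first() on free_by_rcu when running from the workqueue, can be sketched as follows. This is a simplified userspace model with no real llist, RCU, or memory allocation; it only illustrates the proposed atomic-only reuse guard:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Only the irq-work (atomic) path may pop from free_by_rcu, so the
 * lockless __llist_del_first() is never raced by the kworker; the
 * workqueue (!atomic) path skips reuse and goes straight to a fresh
 * allocation, modeled here by just counting. */

struct node { struct node *next; };

struct mem_cache {
    struct node *free_by_rcu;   /* objects waiting to be reused */
    int free_cnt;               /* length of the refilled freelist */
    int reused, fresh;          /* counters for illustration */
};

static struct node *del_first(struct node **head)  /* __llist_del_first() stand-in */
{
    struct node *n = *head;

    if (n)
        *head = n->next;
    return n;
}

static void alloc_bulk(struct mem_cache *c, int cnt, bool atomic)
{
    for (int i = 0; i < cnt; i++) {
        struct node *obj = NULL;

        if (atomic)                 /* only irq work may reuse */
            obj = del_first(&c->free_by_rcu);
        if (obj)
            c->reused++;            /* recycled from free_by_rcu */
        else
            c->fresh++;             /* fell back to a fresh __alloc() */
        c->free_cnt++;              /* either way, the freelist grows */
    }
}
```

With two objects parked on free_by_rcu, a workqueue refill leaves them untouched while a subsequent irq-work refill drains them.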
Thread overview: 11+ messages
2023-07-20 20:44 [PATCH bpf 0/2] bpf/memalloc: Allow non-atomic alloc_bulk YiFei Zhu
2023-07-20 20:44 ` [PATCH bpf 1/2] bpf/memalloc: Non-atomically allocate freelist during prefill YiFei Zhu
2023-07-21  1:44   ` Hou Tao
2023-07-21  2:31     ` YiFei Zhu
2023-07-26 11:38       ` Hou Tao
2023-07-26 18:44         ` YiFei Zhu
2023-07-27  1:21           ` Hou Tao
2023-07-20 20:44 ` [PATCH bpf 2/2] bpf/memalloc: Schedule highprio wq for non-atomic alloc when atomic fails YiFei Zhu
2023-07-21  2:24   ` Hou Tao
2023-07-21  3:02     ` YiFei Zhu
2023-07-26 11:37       ` Hou Tao