Re: [PATCH bpf 2/2] bpf/memalloc: Schedule highprio wq for non-atomic alloc when atomic fails

From: Hou Tao <houtao@huaweicloud.com>
To: YiFei Zhu <zhuyifei@google.com>, Alexei Starovoitov <ast@kernel.org>
Cc: bpf@vger.kernel.org, Daniel Borkmann <daniel@iogearbox.net>,
	Stanislav Fomichev <sdf@google.com>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Andrii Nakryiko <andrii@kernel.org>
Subject: Re: [PATCH bpf 2/2] bpf/memalloc: Schedule highprio wq for non-atomic alloc when atomic fails
Date: Wed, 26 Jul 2023 19:37:46 +0800	[thread overview]
Message-ID: <0a5e2443-d8e5-b498-4cff-1dd9e394f015@huaweicloud.com> (raw)
In-Reply-To: <CAA-VZPnVE7MSnXn-5pMun2D_naMSU9Q6XFost7ZgncyJDtnnAg@mail.gmail.com>

Hi,

On 7/21/2023 11:02 AM, YiFei Zhu wrote:
> On Thu, Jul 20, 2023 at 7:24 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 7/21/2023 4:44 AM, YiFei Zhu wrote:
>>> Atomic refill can fail, such as when all percpu chunks are full,
>>> and when that happens there's no guarantee when more space will be
>>> available for atomic allocations.
>>>
>>> Instead of having the caller wait for memory to be available by
>>> retrying until the related BPF API no longer gives -ENOMEM, we can
>>> kick off a non-atomic GFP_KERNEL allocation with highprio workqueue.
>>> This should make it much less likely for those APIs to return
>>> -ENOMEM.
>>>
>>> Because alloc_bulk can now be called from the workqueue,
>>> non-atomic calls now also calls local_irq_save/restore to reduce
>>> the chance of races.
>>>
>>> Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.")
>>> Signed-off-by: YiFei Zhu <zhuyifei@google.com>
>>> ---
>>>  kernel/bpf/memalloc.c | 47 ++++++++++++++++++++++++++++++-------------
>>>  1 file changed, 33 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>> index 016249672b43..2915639a5e16 100644
>>> --- a/kernel/bpf/memalloc.c
>>> +++ b/kernel/bpf/memalloc.c
>>> @@ -84,14 +84,15 @@ struct bpf_mem_cache {
>>>       struct llist_head free_llist;
>>>       local_t active;
>>>
>>> -     /* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill
>>> +     /* Operations on the free_list from unit_alloc/unit_free/bpf_mem_refill_*
>>>        * are sequenced by per-cpu 'active' counter. But unit_free() cannot
>>>        * fail. When 'active' is busy the unit_free() will add an object to
>>>        * free_llist_extra.
>>>        */
>>>       struct llist_head free_llist_extra;
>>>
>>> -     struct irq_work refill_work;
>>> +     struct irq_work refill_work_irq;
>>> +     struct work_struct refill_work_wq;
>>>       struct obj_cgroup *objcg;
>>>       int unit_size;
>>>       /* count of objects in free_llist */
>>> @@ -153,7 +154,7 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
>>>  #endif
>>>  }
>>>
>>> -/* Mostly runs from irq_work except __init phase. */
>>> +/* Mostly runs from irq_work except workqueue and __init phase. */
>>>  static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
>>>  {
>>>       struct mem_cgroup *memcg = NULL, *old_memcg;
>>> @@ -188,10 +189,18 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
>>>                        * want here.
>>>                        */
>>>                       obj = __alloc(c, node, gfp);
>>> -                     if (!obj)
>>> +                     if (!obj) {
>>> +                             /* We might have exhausted the percpu chunks, schedule
>>> +                              * non-atomic allocation so hopefully caller can get
>>> +                              * a free unit upon next invocation.
>>> +                              */
>>> +                             if (atomic)
>>> +                                     queue_work_on(smp_processor_id(),
>>> +                                                   system_highpri_wq, &c->refill_work_wq);
>> I am not a MM expert. But according to the code in
>> pcpu_balance_workfn(), it will try to do pcpu_create_chunk() when
>> pcpu_atomic_alloc_failed is true, so the reason for introducing
>> refill_work_wq is that pcpu_balance_workfn is too slow to fulfill the
>> allocation request from bpf memory allocator ?
> Oh I missed that part of the code. In one of my tests I had the
> previous patch applied, and I had a lot of assertions around the code
> (for debugging-by-kdumping), and I was able to get some crashes that
> suggested I needed to add more, so I wrote this. However I wasn't able
> to reproduce that again. Though, after giving it another thought, this
> sequence of events could still happen:
>
>   initial condition: free_cnt = 1, low_watermark = 1
>   unit_alloc()
>     sets free_cnt = 0
>     free_cnt < low_watermark
>       irq_work_raise()
>   irq work: bpf_mem_refill()
>     alloc_bulk()
>       __alloc()
>         __alloc_percpu_gfp()
>           fails
>           pcpu_schedule_balance_work()
>           return NULL
>   pcpu_balance_workfn()
>     succeeds, next __alloc_percpu_gfp will succeed
>   unit_alloc()
>     free_cnt is still 0
>     return NULL

bpf_mem_refill_wq is also running asynchronously. So if preemption is
enabled, the next unit_alloc() will fail as well if bpf_mem_refill_wq is
preempted.
>
> The thing here is that, even if pcpu_balance_workfn is fast enough to
> run before the next unit_alloc, unit_alloc will still return NULL. I'm
> not sure if this is desired, but this should be a very rare condition
> requiring 8k unit_size. I'm not exactly sure what happened in that
> dump. And since I'm unable to reproduce this again, and if we are okay
> with the rare case above, I'm happy to drop this patch until I have a
> better idea of what happened (or it was just my bad assertions, which
> could very well be what happened).

I think the patch raises a good question about whether or not GFP_NOWAIT
could allocate most of the available memory timely. If the answer is
yes, I think the mitigation proposed in the patch will be unnecessary.
But I am not a MM expert and I don't have an answer for the question.
>
>>>                               break;
>>> +                     }
>>>               }
>>> -             if (IS_ENABLED(CONFIG_PREEMPT_RT))
>>> +             if (IS_ENABLED(CONFIG_PREEMPT_RT) || !atomic)
>>>                       /* In RT irq_work runs in per-cpu kthread, so disable
>>>                        * interrupts to avoid preemption and interrupts and
>>>                        * reduce the chance of bpf prog executing on this cpu
>>> @@ -208,7 +217,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node, bool atomic)
>>>               __llist_add(obj, &c->free_llist);
>>>               c->free_cnt++;
>>>               local_dec(&c->active);
>>> -             if (IS_ENABLED(CONFIG_PREEMPT_RT))
>>> +             if (IS_ENABLED(CONFIG_PREEMPT_RT) || !atomic)
>>>                       local_irq_restore(flags);
>>>       }
>>>       set_active_memcg(old_memcg);
>>> @@ -314,9 +323,9 @@ static void free_bulk(struct bpf_mem_cache *c)
>>>       do_call_rcu(c);
>>>  }
>>>
>>> -static void bpf_mem_refill(struct irq_work *work)
>>> +static void bpf_mem_refill_irq(struct irq_work *work)
>>>  {
>>> -     struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
>>> +     struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work_irq);
>>>       int cnt;
>>>
>>>       /* Racy access to free_cnt. It doesn't need to be 100% accurate */
>>> @@ -332,7 +341,14 @@ static void bpf_mem_refill(struct irq_work *work)
>>>
>>>  static void notrace irq_work_raise(struct bpf_mem_cache *c)
>>>  {
>>> -     irq_work_queue(&c->refill_work);
>>> +     irq_work_queue(&c->refill_work_irq);
>>> +}
>>> +
>>> +static void bpf_mem_refill_wq(struct work_struct *work)
>>> +{
>>> +     struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work_wq);
>>> +
>>> +     alloc_bulk(c, c->batch, NUMA_NO_NODE, false);
>> Considering that the kworker may be interrupted by irq work, so there
>> will be concurrent __llist_del_first() operations on free_by_rcu, andI
>> think it is not safe to call alloc_bulk directly here. Maybe we can just
>> skip __llist_del_first() for !atomic context.
> Ack.
>
>>>  }
>>>
>>>  /* For typical bpf map case that uses bpf_mem_cache_alloc and single bucket
>>> @@ -352,7 +368,8 @@ static void notrace irq_work_raise(struct bpf_mem_cache *c)
>>>
>>>  static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
>>>  {
>>> -     init_irq_work(&c->refill_work, bpf_mem_refill);
>>> +     init_irq_work(&c->refill_work_irq, bpf_mem_refill_irq);
>>> +     INIT_WORK(&c->refill_work_wq, bpf_mem_refill_wq);
>>>       if (c->unit_size <= 256) {
>>>               c->low_watermark = 32;
>>>               c->high_watermark = 96;
>>> @@ -529,7 +546,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>>               for_each_possible_cpu(cpu) {
>>>                       c = per_cpu_ptr(ma->cache, cpu);
>>>                       /*
>>> -                      * refill_work may be unfinished for PREEMPT_RT kernel
>>> +                      * refill_work_irq may be unfinished for PREEMPT_RT kernel
>>>                        * in which irq work is invoked in a per-CPU RT thread.
>>>                        * It is also possible for kernel with
>>>                        * arch_irq_work_has_interrupt() being false and irq
>>> @@ -537,7 +554,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>>                        * the completion of irq work to ease the handling of
>>>                        * concurrency.
>>>                        */
>>> -                     irq_work_sync(&c->refill_work);
>>> +                     irq_work_sync(&c->refill_work_irq);
>>> +                     cancel_work_sync(&c->refill_work_wq);
>> cancel_work_sync() may be time-consuming. We may need to move it to
>> free_mem_alloc_deferred() to prevent blocking the destroy of bpf memory
>> allocator.
> Ack.
>
>>>                       drain_mem_cache(c);
>>>                       rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>>>               }
>>> @@ -552,7 +570,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>>                       cc = per_cpu_ptr(ma->caches, cpu);
>>>                       for (i = 0; i < NUM_CACHES; i++) {
>>>                               c = &cc->cache[i];
>>> -                             irq_work_sync(&c->refill_work);
>>> +                             irq_work_sync(&c->refill_work_irq);
>>> +                             cancel_work_sync(&c->refill_work_wq);
>>>                               drain_mem_cache(c);
>>>                               rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>>>                       }
>>> @@ -580,7 +599,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
>>>        *
>>>        * but prog_B could be a perf_event NMI prog.
>>>        * Use per-cpu 'active' counter to order free_list access between
>>> -      * unit_alloc/unit_free/bpf_mem_refill.
>>> +      * unit_alloc/unit_free/bpf_mem_refill_*.
>>>        */
>>>       local_irq_save(flags);
>>>       if (local_inc_return(&c->active) == 1) {
> .