From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Hou Tao <houtao@huaweicloud.com>
Cc: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
Daniel Borkmann <daniel@iogearbox.net>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
John Fastabend <john.fastabend@gmail.com>,
"Paul E . McKenney" <paulmck@kernel.org>,
rcu@vger.kernel.org, houtao1@huawei.com
Subject: Re: [RFC bpf-next v2 4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP
Date: Fri, 21 Apr 2023 20:12:13 -0700
Message-ID: <20230422031213.ubhzng67qf7axt7x@dhcp-172-26-102-232.dhcp.thefacebook.com>
In-Reply-To: <20230408141846.1878768-5-houtao@huaweicloud.com>
On Sat, Apr 08, 2023 at 10:18:46PM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Currently the freed objects in the bpf memory allocator may be reused
> immediately by a new allocation. This introduces a use-after-bpf-ma-free
> problem for non-preallocated hash maps and can make the lookup procedure
> return an incorrect result. The immediate reuse also makes introducing
> new use cases (e.g. qp-trie) more difficult.
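To make the failure mode concrete, the race being described is roughly the
following (my own sketch of the scenario, not code from this series):

/*
 * CPU 0 (lookup, under RCU)          CPU 1 (update/delete path)
 * -------------------------          ----------------------------------
 * l = lookup_elem_raw(...)
 *                                    delete the element:
 *                                      bpf_mem_cache_free(l)
 *                                    insert a new element:
 *                                      bpf_mem_cache_alloc() returns the
 *                                      same object and overwrites its
 *                                      key and value
 * read the value through l
 *   -> the lookup returns data that
 *      belongs to a different key
 */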
>
> So introduce BPF_MA_REUSE_AFTER_RCU_GP to solve these problems. With
> BPF_MA_REUSE_AFTER_RCU_GP, freed objects are reused only after one RCU
> grace period and may be returned to the slab system only after an
> additional RCU-tasks-trace grace period. So bpf programs which care
> about the reuse problem can use bpf_rcu_read_{lock,unlock}() to access
> these freed objects safely, and for programs which don't care, a
> use-after-bpf-ma-free is still safe because the objects have not yet
> been released by the bpf memory allocator.
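To spell out the contract from the program side, this is roughly what I
understand a program that cares about reuse would do (just a sketch with
hypothetical map/value names; it only assumes the existing
bpf_rcu_read_lock()/bpf_rcu_read_unlock() kfuncs):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* hypothetical value layout, for illustration only */
struct val {
        __u64 data;
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, __u32);
        __type(value, struct val);
} htab SEC(".maps");

extern void bpf_rcu_read_lock(void) __ksym;
extern void bpf_rcu_read_unlock(void) __ksym;

__u64 seen;

SEC("lsm.s/bpf")        /* sleepable, so RCU is not implicitly held */
int BPF_PROG(read_val, int cmd, union bpf_attr *attr, unsigned int size)
{
        __u32 key = 0;
        struct val *v;

        bpf_rcu_read_lock();
        v = bpf_map_lookup_elem(&htab, &key);
        if (v)
                /* The element may be recycled by a concurrent update
                 * (use-after-bpf-ma-free), but with REUSE_AFTER_RCU_GP it
                 * cannot be returned to the slab inside this RCU read
                 * section, so the load never touches freed kernel memory.
                 */
                seen = v->data;
        bpf_rcu_read_unlock();

        return 0;
}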
>
> To make the freed elements reusable quickly, BPF_MA_REUSE_AFTER_RCU_GP
> dynamically allocates memory for multiple in-flight RCU callbacks which
> mark the freed elements as reusable. The memory used for each
> bpf_reuse_batch is freed when its RCU callback completes. When no memory
> is available for a batch, synchronize_rcu_expedited() is used instead to
> make the freed elements reusable. In order to reduce the risk of OOM,
> part of the reusable memory is freed back to the slab after an
> RCU-tasks-trace grace period. Until that freeing actually happens, the
> memory is still available for reuse.
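And just to restate the element life cycle to check my reading of the patch
(all names are from the code below):

    unit_free()
      -> c->prepare_reuse_head                  (per-cpu, protected by c->active)
      -> queue_work(bpf_ma_wq, &c->reuse_work)  once over the high watermark
      -> bpf_ma_prepare_reuse_work()
           -> call_rcu(&batch->rcu, bpf_ma_reuse_cb), or
              synchronize_rcu_expedited() when the batch kmalloc fails
      -> c->reuse_ready_head                    <- allocatable again from here on
      -> c->wait_for_free + call_rcu_tasks_trace(&c->rcu, bpf_ma_free_reusable_cb)
      -> free_all()                             (still allocatable until the
                                                 tasks-trace callback runs)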
>
> The following are the benchmark results when comparing different
> flavors of the bpf memory allocator. These results show:
> * The performance of the reuse-after-rcu-gp bpf ma is better than no
>   bpf ma. Its memory usage is also better than no bpf ma, except for
>   the add_del_on_diff_cpu case.
> * The memory usage of the reuse-after-rcu-gp bpf ma increases a lot
>   compared with the normal bpf ma.
> * The memory usage of the free-after-rcu-gp bpf ma is better than the
>   reuse-after-rcu-gp bpf ma, but its performance is worse than
>   reuse-after-rcu-gp because it doesn't do reuse.
>
> (1) no bpf memory allocator (v6.0.19)
meaning that htab is using kmalloc and call_rcu to free, right?
> | name | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | -- | -- | -- | -- |
> | no_op | 1187 | 1.05 | 1.05 |
> | overwrite | 3.74 | 32.52 | 84.18 |
> | batch_add_batch_del | 2.23 | 26.38 | 48.75 |
> | add_del_on_diff_cpu | 3.92 | 33.72 | 48.96 |
>
> (2) normal bpf memory allocator
> | name | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | -- | -- | -- | -- |
> | no_op | 1187 | 0.96 | 1.00 |
> | overwrite | 27.12 | 2.5 | 2.99 |
> | batch_add_batch_del | 8.9 | 2.77 | 3.24 |
> | add_del_on_diff_cpu | 11.30 | 218.54 | 440.37 |
>
> (3) reuse-after-rcu-gp bpf memory allocator
that's the one you're implementing below, right?
> | name | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | -- | -- | -- | -- |
> | no_op | 1276 | 0.96 | 1.00 |
> | overwrite | 15.66 | 25.00 | 33.07 |
> | batch_add_batch_del | 10.32 | 18.84 | 22.64 |
> | add_del_on_diff_cpu | 13.00 | 550.50 | 748.74 |
>
> (4) free-after-rcu-gp bpf memory allocator (free directly through call_rcu)
What do you mean? htab uses bpf_ma, but does call_rcu before doing bpf_mem_free?
>
> | name | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | -- | -- | -- | -- |
> | no_op | 1263 | 0.96 | 1.00 |
> | overwrite | 10.73 | 12.33 | 20.32 |
> | batch_add_batch_del | 7.02 | 9.45 | 14.07 |
> | add_del_on_diff_cpu | 8.99 | 131.64 | 204.42 |
Depending on what we care about, all the extra complexity in bpf_ma with reuse-after-rcu-gp
buys us a bit better perf, but many times worse memory consumption?
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
> include/linux/bpf_mem_alloc.h | 1 +
> kernel/bpf/memalloc.c | 353 +++++++++++++++++++++++++++++++---
> 2 files changed, 326 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
> index 148347950e16..e7f68432713b 100644
> --- a/include/linux/bpf_mem_alloc.h
> +++ b/include/linux/bpf_mem_alloc.h
> @@ -18,6 +18,7 @@ struct bpf_mem_alloc {
> /* flags for bpf_mem_alloc_init() */
> enum {
> BPF_MA_PERCPU = 1U << 0,
> + BPF_MA_REUSE_AFTER_RCU_GP = 1U << 1,
> };
>
> /* 'size != 0' is for bpf_mem_alloc which manages fixed-size objects.
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 072102476019..262100f89610 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -63,6 +63,10 @@ static u8 size_index[24] __ro_after_init = {
> 2 /* 192 */
> };
>
> +static struct workqueue_struct *bpf_ma_wq;
> +
> +static void bpf_ma_prepare_reuse_work(struct work_struct *work);
> +
> static int bpf_mem_cache_idx(size_t size)
> {
> if (!size || size > 4096)
> @@ -98,18 +102,36 @@ struct bpf_mem_cache {
> int free_cnt;
> int low_watermark, high_watermark, batch;
> int percpu_size;
> + int cpu;
> unsigned int flags;
>
> + raw_spinlock_t reuse_lock;
> + bool abort_reuse;
> + struct llist_head reuse_ready_head;
> + struct llist_node *reuse_ready_tail;
> + struct llist_head wait_for_free;
> + struct llist_head prepare_reuse_head;
> + struct llist_node *prepare_reuse_tail;
> + unsigned int prepare_reuse_cnt;
> + atomic_t reuse_cb_in_progress;
> + struct work_struct reuse_work;
> +
> struct rcu_head rcu;
> struct llist_head free_by_rcu;
> struct llist_head waiting_for_gp;
> - atomic_t call_rcu_in_progress;
> + atomic_t free_cb_in_progress;
> };
>
> struct bpf_mem_caches {
> struct bpf_mem_cache cache[NUM_CACHES];
> };
>
> +struct bpf_reuse_batch {
> + struct bpf_mem_cache *c;
> + struct llist_node *head, *tail;
> + struct rcu_head rcu;
> +};
> +
> static struct llist_node notrace *__llist_del_first(struct llist_head *head)
> {
> struct llist_node *entry, *next;
> @@ -154,6 +176,45 @@ static struct mem_cgroup *get_memcg(const struct bpf_mem_cache *c)
> #endif
> }
>
> +static void *bpf_ma_get_reusable_obj(struct bpf_mem_cache *c)
> +{
> + if (c->flags & BPF_MA_REUSE_AFTER_RCU_GP) {
> + unsigned long flags;
> + void *obj;
> +
> + if (llist_empty(&c->reuse_ready_head) && llist_empty(&c->wait_for_free))
> + return NULL;
> +
> + /* reuse_ready_head and wait_for_free may be manipulated by
> + * kworker and RCU callbacks.
> + */
> + raw_spin_lock_irqsave(&c->reuse_lock, flags);
> + obj = __llist_del_first(&c->reuse_ready_head);
> + if (obj) {
> + if (llist_empty(&c->reuse_ready_head))
> + c->reuse_ready_tail = NULL;
> + } else {
> + obj = __llist_del_first(&c->wait_for_free);
> + }
> + raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
> + return obj;
> + }
> +
> + /*
> + * free_by_rcu is only manipulated by irq work refill_work().
> + * IRQ works on the same CPU are called sequentially, so it is
> + * safe to use __llist_del_first() here. If alloc_bulk() is
> + * invoked by the initial prefill, there will be no running
> + * refill_work(), so __llist_del_first() is fine as well.
> + *
> + * In most cases, objects on free_by_rcu are from the same CPU.
> + * If some objects come from other CPUs, it doesn't incur any
> + * harm because NUMA_NO_NODE means the preference for current
> + * numa node and it is not a guarantee.
> + */
> + return __llist_del_first(&c->free_by_rcu);
> +}
> +
> /* Mostly runs from irq_work except __init phase. */
> static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> {
> @@ -165,19 +226,7 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> memcg = get_memcg(c);
> old_memcg = set_active_memcg(memcg);
> for (i = 0; i < cnt; i++) {
> - /*
> - * free_by_rcu is only manipulated by irq work refill_work().
> - * IRQ works on the same CPU are called sequentially, so it is
> - * safe to use __llist_del_first() here. If alloc_bulk() is
> - * invoked by the initial prefill, there will be no running
> - * refill_work(), so __llist_del_first() is fine as well.
> - *
> - * In most cases, objects on free_by_rcu are from the same CPU.
> - * If some objects come from other CPUs, it doesn't incur any
> - * harm because NUMA_NO_NODE means the preference for current
> - * numa node and it is not a guarantee.
> - */
> - obj = __llist_del_first(&c->free_by_rcu);
> + obj = bpf_ma_get_reusable_obj(c);
> if (!obj) {
> /* Allocate, but don't deplete atomic reserves that typical
> * GFP_ATOMIC would do. irq_work runs on this cpu and kmalloc
> @@ -236,7 +285,7 @@ static void __free_rcu(struct rcu_head *head)
> struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>
> free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size);
> - atomic_set(&c->call_rcu_in_progress, 0);
> + atomic_set(&c->free_cb_in_progress, 0);
> }
>
> static void __free_rcu_tasks_trace(struct rcu_head *head)
> @@ -264,7 +313,7 @@ static void do_call_rcu(struct bpf_mem_cache *c)
> {
> struct llist_node *llnode, *t;
>
> - if (atomic_xchg(&c->call_rcu_in_progress, 1))
> + if (atomic_xchg(&c->free_cb_in_progress, 1))
> return;
>
> WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
> @@ -409,6 +458,8 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
> c->objcg = objcg;
> c->percpu_size = percpu_size;
> c->flags = flags;
> + c->cpu = cpu;
> + INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
> prefill_mem_cache(c, cpu);
> }
> ma->cache = pc;
> @@ -433,6 +484,8 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
> c->unit_size = sizes[i];
> c->objcg = objcg;
> c->flags = flags;
> + c->cpu = cpu;
> + INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
> prefill_mem_cache(c, cpu);
> }
> }
> @@ -444,18 +497,40 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, unsigned int flags)
> static void drain_mem_cache(struct bpf_mem_cache *c)
> {
> bool percpu = !!c->percpu_size;
> + struct llist_node *head[3];
> + unsigned long flags;
>
> /* No progs are using this bpf_mem_cache, but htab_map_free() called
> * bpf_mem_cache_free() for all remaining elements and they can be in
> * free_by_rcu or in waiting_for_gp lists, so drain those lists now.
> *
> - * Except for waiting_for_gp list, there are no concurrent operations
> - * on these lists, so it is safe to use __llist_del_all().
> + * Except for waiting_for_gp and free_llist_extra list, there are no
> + * concurrent operations on these lists, so it is safe to use
> + * __llist_del_all().
> */
> free_all(__llist_del_all(&c->free_by_rcu), percpu);
> free_all(llist_del_all(&c->waiting_for_gp), percpu);
> free_all(__llist_del_all(&c->free_llist), percpu);
> - free_all(__llist_del_all(&c->free_llist_extra), percpu);
> + free_all(llist_del_all(&c->free_llist_extra), percpu);
> +
> + if (!(c->flags & BPF_MA_REUSE_AFTER_RCU_GP))
> + return;
> +
> + raw_spin_lock_irqsave(&c->reuse_lock, flags);
> + /* Indicate kworker and RCU callback to free elements directly
> + * instead of adding new elements into these lists.
> + */
> + c->abort_reuse = true;
> + head[0] = __llist_del_all(&c->prepare_reuse_head);
> + c->prepare_reuse_tail = NULL;
> + head[1] = __llist_del_all(&c->reuse_ready_head);
> + c->reuse_ready_tail = NULL;
> + head[2] = __llist_del_all(&c->wait_for_free);
> + raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
> +
> + free_all(head[0], percpu);
> + free_all(head[1], percpu);
> + free_all(head[2], percpu);
> }
>
> static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
> @@ -466,10 +541,39 @@ static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
> ma->caches = NULL;
> }
>
> +static void bpf_ma_cancel_reuse_work(struct bpf_mem_alloc *ma)
> +{
> + struct bpf_mem_caches *cc;
> + struct bpf_mem_cache *c;
> + int cpu, i;
> +
> + if (ma->cache) {
> + for_each_possible_cpu(cpu) {
> + c = per_cpu_ptr(ma->cache, cpu);
> + cancel_work_sync(&c->reuse_work);
> + }
> + }
> + if (ma->caches) {
> + for_each_possible_cpu(cpu) {
> + cc = per_cpu_ptr(ma->caches, cpu);
> + for (i = 0; i < NUM_CACHES; i++) {
> + c = &cc->cache[i];
> + cancel_work_sync(&c->reuse_work);
> + }
> + }
> + }
> +}
> +
> static void free_mem_alloc(struct bpf_mem_alloc *ma)
> {
> - /* waiting_for_gp lists was drained, but __free_rcu might
> - * still execute. Wait for it now before we freeing percpu caches.
> + bool reuse_after_rcu_gp = ma->flags & BPF_MA_REUSE_AFTER_RCU_GP;
> +
> + /* Cancel the inflight kworkers */
> + if (reuse_after_rcu_gp)
> + bpf_ma_cancel_reuse_work(ma);
> +
> + /* For normal bpf ma, waiting_for_gp lists was drained, but __free_rcu
> + * might still execute. Wait for it now before we freeing percpu caches.
> *
> * rcu_barrier_tasks_trace() doesn't imply synchronize_rcu_tasks_trace(),
> * but rcu_barrier_tasks_trace() and rcu_barrier() below are only used
> @@ -477,9 +581,13 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
> * so if call_rcu(head, __free_rcu) is skipped due to
> * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
> * using rcu_trace_implies_rcu_gp() as well.
> + *
> + * For reuse-after-rcu-gp bpf ma, use rcu_barrier_tasks_trace() to
> + * wait for the pending bpf_ma_free_reusable_cb() and use rcu_barrier()
> + * to wait for the pending bpf_ma_reuse_cb().
> */
> rcu_barrier_tasks_trace();
> - if (!rcu_trace_implies_rcu_gp())
> + if (reuse_after_rcu_gp || !rcu_trace_implies_rcu_gp())
> rcu_barrier();
> free_mem_alloc_no_barrier(ma);
> }
> @@ -512,6 +620,7 @@ static void destroy_mem_alloc(struct bpf_mem_alloc *ma, int rcu_in_progress)
> }
>
> /* Defer barriers into worker to let the rest of map memory to be freed */
> + copy->flags = ma->flags;
> copy->cache = ma->cache;
> ma->cache = NULL;
> copy->caches = ma->caches;
> @@ -541,7 +650,9 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> */
> irq_work_sync(&c->refill_work);
> drain_mem_cache(c);
> - rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> + rcu_in_progress += atomic_read(&c->free_cb_in_progress);
> + /* Pending kworkers or RCU callbacks */
> + rcu_in_progress += atomic_read(&c->reuse_cb_in_progress);
> }
> /* objcg is the same across cpus */
> if (c->objcg)
> @@ -556,7 +667,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> c = &cc->cache[i];
> irq_work_sync(&c->refill_work);
> drain_mem_cache(c);
> - rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> + rcu_in_progress += atomic_read(&c->free_cb_in_progress);
> + rcu_in_progress += atomic_read(&c->reuse_cb_in_progress);
> }
> }
> if (c->objcg)
> @@ -600,18 +712,183 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
> return llnode;
> }
>
> +static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c, struct llist_node *head,
> + struct llist_node *tail)
> +{
> + unsigned long flags;
> + bool abort;
> +
> + raw_spin_lock_irqsave(&c->reuse_lock, flags);
> + abort = c->abort_reuse;
> + if (!abort) {
> + if (llist_empty(&c->reuse_ready_head))
> + c->reuse_ready_tail = tail;
> + __llist_add_batch(head, tail, &c->reuse_ready_head);
> + }
> + raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
> +
> + /* Don't move these objects to reuse_ready list and free
> + * these objects directly.
> + */
> + if (abort)
> + free_all(head, !!c->percpu_size);
> +}
> +
> +static void bpf_ma_reuse_cb(struct rcu_head *rcu)
> +{
> + struct bpf_reuse_batch *batch = container_of(rcu, struct bpf_reuse_batch, rcu);
> + struct bpf_mem_cache *c = batch->c;
> +
> + bpf_ma_add_to_reuse_ready_or_free(c, batch->head, batch->tail);
> + atomic_dec(&c->reuse_cb_in_progress);
> + kfree(batch);
> +}
> +
> +static bool bpf_ma_try_free_reuse_objs(struct bpf_mem_cache *c)
> +{
> + struct llist_node *head, *tail;
> + bool do_free;
> +
> + if (llist_empty(&c->reuse_ready_head))
> + return false;
> +
> + do_free = !atomic_xchg(&c->free_cb_in_progress, 1);
> + if (!do_free)
> + return false;
> +
> + head = __llist_del_all(&c->reuse_ready_head);
> + tail = c->reuse_ready_tail;
> + c->reuse_ready_tail = NULL;
> +
> + __llist_add_batch(head, tail, &c->wait_for_free);
> +
> + return true;
> +}
> +
> +static void bpf_ma_free_reusable_cb(struct rcu_head *rcu)
> +{
> + struct bpf_mem_cache *c = container_of(rcu, struct bpf_mem_cache, rcu);
> + struct llist_node *head;
> + unsigned long flags;
> +
> + raw_spin_lock_irqsave(&c->reuse_lock, flags);
> + head = __llist_del_all(&c->wait_for_free);
> + raw_spin_unlock_irqrestore(&c->reuse_lock, flags);
> +
> + free_all(head, !!c->percpu_size);
> + atomic_set(&c->free_cb_in_progress, 0);
> +}
> +
> +static void bpf_ma_prepare_reuse_work(struct work_struct *work)
> +{
> + struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, reuse_work);
> + struct llist_node *head, *tail, *llnode, *tmp;
> + struct bpf_reuse_batch *batch;
> + unsigned long flags;
> + bool do_free;
> +
> + local_irq_save(flags);
> + /* When CPU is offline, the running CPU may be different with
> + * the CPU which submitted the work. When these two CPUs are the same,
> + * kworker may be interrupted by NMI, so increase active to protect
> + * again such concurrency.
> + */
> + if (c->cpu == smp_processor_id())
> + WARN_ON_ONCE(local_inc_return(&c->active) != 1);
> + raw_spin_lock(&c->reuse_lock);
> + head = __llist_del_all(&c->prepare_reuse_head);
> + tail = c->prepare_reuse_tail;
> + c->prepare_reuse_tail = NULL;
> + c->prepare_reuse_cnt = 0;
> + if (c->cpu == smp_processor_id())
> + local_dec(&c->active);
> +
> + /* Try to free elements in reusable list. Before these elements are
> + * freed in RCU cb, these element will still be available for reuse.
> + */
> + do_free = bpf_ma_try_free_reuse_objs(c);
> + raw_spin_unlock(&c->reuse_lock);
> + local_irq_restore(flags);
> +
> + if (do_free)
> + call_rcu_tasks_trace(&c->rcu, bpf_ma_free_reusable_cb);
> +
> + llist_for_each_safe(llnode, tmp, llist_del_all(&c->free_llist_extra)) {
> + if (!head)
> + tail = llnode;
> + llnode->next = head;
> + head = llnode->next;
> + }
> + /* Draining is in progress ? */
> + if (!head) {
> + /* kworker completes and no RCU callback */
> + atomic_dec(&c->reuse_cb_in_progress);
> + return;
> + }
> +
> + batch = kmalloc(sizeof(*batch), GFP_KERNEL);
> + if (!batch) {
> + synchronize_rcu_expedited();
> + bpf_ma_add_to_reuse_ready_or_free(c, head, tail);
> + /* kworker completes and no RCU callback */
> + atomic_dec(&c->reuse_cb_in_progress);
> + return;
> + }
> +
> + batch->c = c;
> + batch->head = head;
> + batch->tail = tail;
> + call_rcu(&batch->rcu, bpf_ma_reuse_cb);
> +}
> +
> +static void notrace wait_gp_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
> +{
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + /* In case a NMI-context bpf program is also freeing object. */
> + if (local_inc_return(&c->active) == 1) {
> + bool try_queue_work = false;
> +
> + /* kworker may remove elements from prepare_reuse_head */
> + raw_spin_lock(&c->reuse_lock);
> + if (llist_empty(&c->prepare_reuse_head))
> + c->prepare_reuse_tail = llnode;
> + __llist_add(llnode, &c->prepare_reuse_head);
> + if (++c->prepare_reuse_cnt > c->high_watermark) {
> + /* Zero out prepare_reuse_cnt early to prevent
> + * unnecessary queue_work().
> + */
> + c->prepare_reuse_cnt = 0;
> + try_queue_work = true;
> + }
> + raw_spin_unlock(&c->reuse_lock);
> +
> + if (try_queue_work && !work_pending(&c->reuse_work)) {
> + /* Use reuse_cb_in_progress to indicate there is
> + * inflight reuse kworker or reuse RCU callback.
> + */
> + atomic_inc(&c->reuse_cb_in_progress);
> + /* Already queued */
> + if (!queue_work(bpf_ma_wq, &c->reuse_work))
how many kthreads are spawned by wq in the peak?
> + atomic_dec(&c->reuse_cb_in_progress);
> + }
> + } else {
> + llist_add(llnode, &c->free_llist_extra);
> + }
> + local_dec(&c->active);
> + local_irq_restore(flags);
> +}
> +
> /* Though 'ptr' object could have been allocated on a different cpu
> * add it to the free_llist of the current cpu.
> * Let kfree() logic deal with it when it's later called from irq_work.
> */
> -static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
> +static void notrace immediate_reuse_free(struct bpf_mem_cache *c, struct llist_node *llnode)
> {
> - struct llist_node *llnode = ptr - LLIST_NODE_SZ;
> unsigned long flags;
> int cnt = 0;
>
> - BUILD_BUG_ON(LLIST_NODE_SZ > 8);
> -
> local_irq_save(flags);
> if (local_inc_return(&c->active) == 1) {
> __llist_add(llnode, &c->free_llist);
> @@ -633,6 +910,18 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
> irq_work_raise(c);
> }
>
> +static inline void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
> +{
> + struct llist_node *llnode = ptr - LLIST_NODE_SZ;
> +
> + BUILD_BUG_ON(LLIST_NODE_SZ > 8);
> +
> + if (c->flags & BPF_MA_REUSE_AFTER_RCU_GP)
> + wait_gp_reuse_free(c, llnode);
> + else
> + immediate_reuse_free(c, llnode);
> +}
> +
> /* Called from BPF program or from sys_bpf syscall.
> * In both cases migration is disabled.
> */
> @@ -724,3 +1013,11 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags)
>
> return !ret ? NULL : ret + LLIST_NODE_SZ;
> }
> +
> +static int __init bpf_ma_init(void)
> +{
> + bpf_ma_wq = alloc_workqueue("bpf_ma", WQ_MEM_RECLAIM, 0);
> + BUG_ON(!bpf_ma_wq);
> + return 0;
> +}
> +late_initcall(bpf_ma_init);
> --
> 2.29.2
>