Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular

From: Delyan Kratunov <delyank@fb.com>
To: "alexei.starovoitov@gmail.com" <alexei.starovoitov@gmail.com>
Cc: "davem@davemloft.net" <davem@davemloft.net>,
	"joannelkoong@gmail.com" <joannelkoong@gmail.com>,
	"andrii@kernel.org" <andrii@kernel.org>,
	"tj@kernel.org" <tj@kernel.org>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	Dave Marchevsky <davemarchevsky@fb.com>,
	"memxor@gmail.com" <memxor@gmail.com>,
	Kernel Team <Kernel-team@fb.com>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>
Subject: Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular
Date: Fri, 2 Sep 2022 01:40:29 +0000	[thread overview]
Message-ID: <667008debe17cf6ced894b63841670daccbe9f4c.camel@fb.com> (raw)
In-Reply-To: <20220902001211.wrgwpvquuky5wpgn@macbook-pro-4.dhcp.thefacebook.com>

On Thu, 2022-09-01 at 17:12 -0700, Alexei Starovoitov wrote:
> On Thu, Sep 01, 2022 at 10:46:09PM +0000, Delyan Kratunov wrote:
> > On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote:
> > > On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote:
> > > > Given that tracing programs can't really maintain their own freelists safely (I think
> > > > they're missing the building blocks - you can't cmpxchg kptrs),
> > > 
> > > Today? yes, but soon we will have link lists supported natively.
> > > 
> > > > I do feel like
> > > > isolated allocators are a requirement here. Without them, allocations can fail and
> > > > there's no way to write a reliable program.
> > > 
> > > Completely agree that there should be a way for programs
> > > to guarantee availability of the element.
> > > Inline allocation can fail regardless whether allocation pool
> > > is shared by multiple programs or a single program owns an allocator.
> > 
> > I'm not sure I understand this point. 
> > If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs
> > for struct foo to succeed. In what situations can this fail if I'm the only program
> > using it _and_ I've calculated the prefill amount correctly?
> > 
> > Unless you're saying that the prefill wouldn't adjust the freelist limits, in which
> > case, I think that's a mistake - prefill should effectively _set_ the freelist
> > limits.
> 
> There is no prefill implementation today, so we're just guessing, but let's try.

Well, initial capacity was going to be part of the map API, so I always considered it
part of the same work.

> prefill would probably have to adjust high-watermark limit.
> That makes sense, but for how long? Should the watermark go back after time
> or after N objects were consumed?

Neither, if you want your pool of objects to not vanish from under you.

> What prefill is going to do? Prefill on current cpu only ?
> but it doesn't help the prog to be deterministic in consuming them.
> Prefill on all cpu-s? That's possible,
> but for_each_possible_cpu() {irq_work_queue_on(cpu);}
> looks to be a significant memory and run-time overhead.

No, that's overkill imo, especially on 100+ core systems.
I was imagining the allocator consuming the current cpu freelist first, then stealing
from other cpus, and only if they are empty, giving up and scheduling irq_work. 

A little complex to implement but it's possible. It does require atomics everywhere
though.

> When freelist is managed by the program it may contain just N elements
> that progs needs.
> 
> > > In that sense, allowing multiple programs to create an instance
> > > of an allocator doesn't solve this problem.
> > > Short free list inside bpf_mem_cache is an implementation detail.
> > > "prefill to guarantee successful alloc" is a bit out of scope
> > > of an allocator.
> > 
> > I disagree that it's out of scope. This is the only access to dynamic memory from a
> > bpf program, it makes sense that it covers the requirements of bpf programs,
> > including prefill/freelist behavior, so all programs can safely use it.
> > 
> > > "allocate a set and stash it" should be a separate building block
> > > available to bpf progs when step "allocate" can fail and
> > > efficient "stash it" can probably be done on top of the link list.
> > 
> > Do you imagine a BPF object that every user has to link into their programs (yuck),
> > or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things
> > this way.
> 
> I'm assuming Kumar's proposed list api:
> struct bpf_list_head head;
> struct bpf_list_node node;
> bpf_list_insert(&node, &head);
> 
> will work out.

Given the assumed locking in that design, I don't see how it would help nmi programs
tbh. This is list_head, we need llist_head, relatively speaking.

> 
> > If it's distributed separately, I think that's an unexpected burden on developers
> > (I'm thinking especially of tools not writing programs in C or using libbpf/bpftool
> > skels). There are no other bpf features that require a userspace support library like
> > this. (USDT doesn't count, uprobes are the underlying bpf feature and that is useful
> > without a library)
> 
> bpf progs must not pay for what they don't use. Hence all building blocks should be
> small. We will have libc-like bpf libraries with bigger blocks eventually. 

I'm not sure I understand how having the mechanism in helpers and managed by the
kernel is paying for something they don't use?

> 
> > > I think the disagreement here is that per-prog allocator based
> > > on bpf_mem_alloc isn't going to be any more deterministic than
> > > one global bpf_mem_alloc for all progs.
> > > Per-prog short free list of ~64 elements vs
> > > global free list of ~64 elements.
> > 
> > Right, I think I had a hidden assumption here that we've exposed. 
> > Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be
> > 1000 struct foos on the freelist and the freelist thresholds would be adjusted
> > accordingly (i.e., you can free all of them and allocate them again, all from the
> > freelist).
> > 
> > Ultimately, that's what nmi programs actually need but I see why that's not an
> > obvious behavior.
> 
> How prefill is going to work is still to-be-designed.

That's the new part for me, though - the maps design had a mechanism to specify
initial capacity, and it worked for nmi programs. That's why I'm pulling on this
thread, it's the hardest thing to get right _and_ it needs to exist before deferred
work can be useful.

> In addition to current-cpu vs on-all-cpu question above...
> Will prefill() helper just do irq_work ?
> If so then it doesn't help nmi and irq-disabled progs at all.
> prefill helper working asynchronously doesn't guarantee availability of objects
> later to the program.
> prefill() becomes a hint and probably useless as such.

Agreed.

> So it should probably be synchronous and fail when in-nmi or in-irq?
> But bpf prog cannot know its context, so only safe synchronous prefill()
> would be out of sleepable progs.

Initial maps capacity would've come from the syscall, so the program itself would not
contain a prefill() call. 

We covered this in our initial discussions - I also think that requiring every
perf_event program to setup a uprobe or syscall program to fill the object pool
(internal or external) is also a bad design.

If we're going for a global allocator, I suppose we could encode these requirements
in BTF and satisfy them on program load? .alloc map with some predefined names or
something?

> 
> > > In both cases these lists will have to do irq_work and refill
> > > out of global slabs.
> > 
> > If a tracing program needs irq_work to refill, then hasn't the API already failed the
> > program writer? I'd have to remind myself how irq_work actually works but given that
> > it's a soft/hardirq, an nmi program can trivially exhaust the entire allocator before
> > irq_work has a chance to refill it. I don't see how you'd write reliable programs
> > this way.
> 
> The only way nmi-prog can guarantee availability if it allocates and reserves
> objects in a different non-nmi program.
> If everything runs in nmi there is nothing can be done.

See above, we were using the map load syscall to satisfy this before. We could
probably do the same here but it's just documenting requirements as opposed to also
introducing ownership/lifetime problems.

> 
> > > 
> > > > Besides, if we punt the freelists to bpf, then we get absolutely no control over the
> > > > memory usage, which is strictly worse for us (and worse developer experience on top).
> > > 
> > > I don't understand this point.
> > > All allocations are still coming out of bpf_mem_alloc.
> > > We can have debug mode with memleak support and other debug
> > > mechanisms.
> > 
> > I mostly mean accounting here. If we segment the allocated objects by finer-grained
> > allocators, we can attribute them to individual programs better. But, I agree, this
> > can be implemented in other ways, it can just be counts/tables on bpf_prog.
> 
> mem accounting is the whole different, huge and largely unsovled topic.
> The thread about memcg and bpf is still ongoing.
> The fine-grained bpf_mem_alloc isn't going to magically solve it.
> 
> > > 
> > > What is 'freelist determinism' ?
> > 
> > The property that prefill keeps all objects on the freelist, so the following
> > sequence doesn't observe allocation failures:
> > 
> > bpf_mem_prefill(1000, struct foo);
> > run_1000_times { alloc(struct foo); }
> > run_1000_times { free(struct foo); }
> > run_1000_times { alloc(struct foo); }
> > alloc(struct foo) // this can observe a failure
> 
> we cannot evalute the above until we answer current-cpu vs on-all-cpus question
> and whether bpf_mem_prefill is sync or async.
> 
> I still think designing prefill and guaranteed availability is out of scope
> of allocator.

It was in the maps design on purpose though, I need it for deferred work to be useful
(remember that build id EFAULT thread? only way to fix it for good requires that work
submission never fails, which needs allocations from nmi to never fail).

> 
> > > Are you talking about some other freelist on top of bpf_mem_alloc's
> > > free lists ?
> > 
> > Well, that's the question, isn't it? I think it should be part of the bpf kernel
> > ecosystem (helper/kfunc) but it doesn't have to be bpf_mem_alloc itself. And, if it's
> > instantiated per-program, that's easiest to reason about.
> 
> There should be a way. For sure. Helper/kfunc or trivial stuff on top of bpf_link_list
> is still a question. Bundling this feature together with an allocator feels artificial.
> In user space C you wouldn't combine tcmalloc with custom free list.

Userspace doesn't have nmi or need allocators that work from signal handlers, for a
more appropriate analogy. We actually need this to work reliably from nmi, so we can
shift work _away_ from nmi. If we didn't have this use case, I would've folded on the
entire issue and kicked the can down the road (plenty of helpers don't work in nmi).

-- Delyan

> During early days of bpf bundling would totally make sense, but right now we're able to
> create much smaller building blocks than in the past. I don't think we'll be adding
> any more new map types. We probably won't be adding any new program types either.