Re: [PATCH RFC bpf-next 2/4] bpf: support BPF ksym variables in kernel modules

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>, bpf <bpf@vger.kernel.org>,
	Networking <netdev@vger.kernel.org>,
	Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH RFC bpf-next 2/4] bpf: support BPF ksym variables in kernel modules
Date: Fri, 11 Dec 2020 17:52:15 -0800	[thread overview]
Message-ID: <20201212015215.zmychededhpv55th@ast-mbp> (raw)
In-Reply-To: <CAEf4BzbZK8uZOprwHq_+mh=2Lb27POv5VMW4kB6eyPc_6bcSPg@mail.gmail.com>

On Fri, Dec 11, 2020 at 02:15:28PM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 11, 2020 at 1:27 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Dec 10, 2020 at 08:27:32PM -0800, Andrii Nakryiko wrote:
> > > During BPF program load time, verifier will resolve FD to BTF object and will
> > > take reference on BTF object itself and, for module BTFs, corresponding module
> > > as well, to make sure it won't be unloaded from under running BPF program. The
> > > mechanism used is similar to how bpf_prog keeps track of used bpf_maps.
> > ...
> > > +
> > > +     /* if we reference variables from kernel module, bump its refcount */
> > > +     if (btf_is_module(btf)) {
> > > +             btf_mod->module = btf_try_get_module(btf);
> >
> > Is it necessary to refcnt the module? Correct me if I'm wrong, but
> > for module's BTF we register a notifier. Then the module can be rmmod-ed
> > at any time and we will do btf_put() for corresponding BTF, but that BTF may
> > stay around because bpftool or something is looking at it.
> 
> Correct, BTF object itself doesn't take a refcnt on module.
> 
> > Similarly when prog is attached to raw_tp in a module we currently do try_module_get(),
> > but is it really necessary ? When bpf is attached to a netdev the netdev can
> > be removed and the link will be dangling. May be it makes sense to do the same
> > with modules?  The raw_tp can become dangling after rmmod and the prog won't be
> 
> So for raw_tp it's not the case today. I tested, I attached raw_tp,
> kept triggering it in a loop, and tried to rmmod bpf_testmod. It
> failed, because raw tracepoint takes refcnt on module. rmmod -f

Right. I meant that we can change that behavior if it would make sense to do so.

> bpf_testmod also didn't work, but it's because my kernel wasn't built
> with force-unload enabled for modules. But force-unload is an entirely
> different matter and it's inherently dangerous to do, it can crash and
> corrupt anything in the kernel.
> 
> > executed anymore. So hard coded address of a per-cpu var in a ksym will
> > be pointing to freed mod memory after rmmod, but that's ok, since that prog will
> > never execute.
> 
> Not so fast :) Indeed, if somehow module gets unloaded while we keep
> BPF program loaded, we'll point to unallocated memory **OR** to a
> memory re-used for something else. That's bad. Nothing will crash even
> if it's unmapped memory (due to bpf_probe_read semantics), but we will
> potentially be reading some garbage (not zeroes), if some other module
> re-uses that per-CPU memory.
> 
> As for the BPF program won't be triggered. That's not true in general,
> as you mention yourself below.
> 
> > On the other side if we envision a bpf prog attaching to a vmlinux function
> > and accessing per-cpu or normal ksym in some module it would need to inc refcnt
> > of that module, since we won't be able to guarantee that this prog will
> > not execute any more. So we cannot allow dangling memory addresses.
> 
> That's what my new selftest is doing actually. It's a generic
> sys_enter raw_tp, which doesn't attach to the module, but it does read
> module's per-CPU variable. 

Got it. I see that now.

> So I actually ran a test before posting. I
> successfully unloaded bpf_testmod, but kept running the prog. And it
> kept returning *correct* per-CPU value. Most probably due to per-CPU
> memory not unmapped and not yet reused for something else. But it's a
> really nasty and surprising situation.

you mean you managed to unload early during development before
you've introduced refcnting of modules?

> Keep in mind, also, that whenever BPF program declares per-cpu
> variable extern, it doesn't know or care whether it will get resolved
> to built-in vmlinux per-CPU variable or module per-CPU variable.
> Restricting attachment to only module-provided hooks is both tedious
> and might be quite surprising sometimes, seems not worth the pain.
> 
> > If latter is what we want to allow then we probably need a test case for it and
> > document the reasons for keeping modules pinned while progs access their data.
> > Since such pinning behavior is different from other bpf attaching cases where
> > underlying objects (like netdev and cgroup) can go away.
> 
> See above, that's already the case for module tracepoints.
> 
> So in summary, I think we should take a refcnt on module, as that's
> already the case for stuff like raw_tp. I can add more comments to
> make this clear, of course.

ok. agreed.

Regarding fd+id in upper/lower 32-bit of ld_imm64...
That works for ksyms because at that end the pair is converted to single
address that fits into ld_imm64. That won't work for Alan's case
where btf_obj pointer and btf_id are two values (64-bit and 32-bit).
So api-wise it's fine here, but cannot adopt the same idea everywhere.

re: patch 4
Please add non-percpu var to the test. Just for completeness.
The pair fd+id should be enough to disambiguate, right?

re: patch 1.
Instead of copy paste that hack please convert it to sys_membarrier(MEMBARRIER_CMD_GLOBAL).

The rest looks good to me.