Re: bpf_jit_limit close shave

From: Daniel Borkmann <daniel@iogearbox.net>
To: Lorenz Bauer <lmb@cloudflare.com>
Cc: Frank Hofmann <fhofmann@cloudflare.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	kernel-team <kernel-team@cloudflare.com>
Subject: Re: bpf_jit_limit close shave
Date: Thu, 23 Sep 2021 13:52:05 +0200
Message-ID: <53e09160-f30d-7d23-e3d0-8f636cd82117@iogearbox.net>
In-Reply-To: <CACAyw9-Ha9RQC_VijJAE02mCX3E09vmDji__Ts8YrsSH4cGiyg@mail.gmail.com>

On 9/23/21 11:16 AM, Lorenz Bauer wrote:
> On Wed, 22 Sept 2021 at 22:51, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 9/22/21 1:07 PM, Lorenz Bauer wrote:
>>> On Wed, 22 Sept 2021 at 09:20, Frank Hofmann <fhofmann@cloudflare.com> wrote:
>>>>
>>>>> That jit limit is not there on older kernels and doesn't apply to root.
>>>>> How would you notice such a kernel bug in such conditions?
>>>>
>>>> I'm talking about bpf_jit_current - it's an "overall gauge" for
>>>> allocation, priv and unpriv. I understood Lorenz' note as "change it
>>>> so it only tracks unpriv BPF mem usage - since we'll never act on
>>>> privileged usage anyway"
>>>
>>> Yes, that was my suggestion indeed. What Frank is saying: it looks
>>> like our leak of JIT memory is due to a privileged process. By
>>> exempting privileged processes it would be even harder to notice /
>>> debug. That's true, and brings me back to my question: what is
>>> different about JIT memory that we can't do a better limit?
>>
>> The knob with the limit was basically added back then as a band-aid to avoid
>> unprivileged BPF JIT (cBPF or eBPF) eating up all the module memory to the
>> point where we cannot even load kernel modules anymore. Given that memory
>> resource is global, we added the bpf_jit_limit / bpf_jit_current accounting
>> as a fix/heuristic via ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict
>> unpriv allocations"). If we didn't account for root, how would such a detection
>> proposal work to block unprivileged users? I don't think it's feasible to
>> only account the latter given privileged progs might have occupied most of the
>> budget already.
> 
> Thanks, that was the part I was missing. JITed BPF programs are
> treated like modules (why?). There is a limited space reserved for
> kernel modules.

See bpf_jit_alloc_exec(), which calls module_alloc() for the r+x memory of the
image holding the generated opcodes; there is only one such pool for the whole
system. On x86 in particular, the rationale for using module_alloc() is that
the image is guaranteed to be within +/- 2GB of where the kernel image
resides. See the encoding of BPF_CALL with __bpf_call_base + imm32, for example.
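
To illustrate the +/- 2GB constraint, here is a minimal userspace sketch, not
kernel code. It assumes a call target is encoded as a signed 32-bit offset from
a base address, in the spirit of __bpf_call_base + imm32. The helper names
fits_imm32() and resolve_call() are hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: check whether 'target' is reachable from 'base'
 * with a signed 32-bit offset (roughly +/- 2GB) and, if so, store it.
 */
static bool fits_imm32(uint64_t base, uint64_t target, int32_t *imm)
{
	/* Signed distance between the two addresses. */
	int64_t delta = (int64_t)(target - base);

	if (delta < INT32_MIN || delta > INT32_MAX)
		return false;	/* too far away to encode in imm32 */

	*imm = (int32_t)delta;
	return true;
}

/* Hypothetical helper: recover the absolute address from base + imm32. */
static uint64_t resolve_call(uint64_t base, int32_t imm)
{
	return base + (uint64_t)(int64_t)imm;
}
```

If the image were placed at an arbitrary address, fits_imm32() would fail for
most targets, which is why the allocation needs to come from a region with a
guaranteed distance to the kernel image.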

> How does the knob solve the "can't load a new module" problem if our
> suggestion / preference is to steer people towards CAP_BPF anyways
> (since unpriv BPF is trouble)? Over time all BPF will be privileged
> and we're in the same mess again?

Keep in mind that the knob was added before CAP_BPF. In general, unprivileged
cBPF->eBPF is also using the same bpf_jit_alloc_exec() for the JIT, so that
needs to be taken into consideration as well, but if you grant an application
CAP_BPF then you're essentially privileged. The knob's point was to prevent
fully unprivileged users from playing bad games.
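
As a rough illustration of why bpf_jit_current has to count privileged
allocations too, here is a simplified userspace model of the accounting. It is
only a sketch under my own assumptions, not the kernel implementation: struct
jit_budget, jit_charge() and jit_uncharge() are hypothetical names, and the
real code uses atomics and a capability check rather than a bool.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the global JIT budget, counted in pages. */
struct jit_budget {
	long current_pages;	/* plays the role of bpf_jit_current */
	long limit_pages;	/* plays the role of bpf_jit_limit */
};

/*
 * Charge 'pages' to the shared budget. Every caller is counted, but only
 * an unprivileged caller is refused once the budget is exceeded.
 */
static int jit_charge(struct jit_budget *b, long pages, bool privileged)
{
	b->current_pages += pages;
	if (b->current_pages > b->limit_pages && !privileged) {
		b->current_pages -= pages;	/* roll back the charge */
		return -EPERM;
	}
	return 0;
}

/* Return 'pages' to the budget when an image is freed. */
static void jit_uncharge(struct jit_budget *b, long pages)
{
	b->current_pages -= pages;
}
```

In this model, if privileged charges were not counted, an unprivileged request
would still be admitted even though privileged programs had already consumed
most of the pool.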

Thanks,
Daniel