From: Song Liu <songliubraving@fb.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: bpf <bpf@vger.kernel.org>, Networking <netdev@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <Kernel-team@fb.com>,
	Christoph Hellwig <hch@infradead.org>,
	Peter Zijlstra <peterz@infradead.org>, Song Liu <song@kernel.org>
Subject: Re: [PATCH bpf v2 0/3] bpf: invalidate unused part of bpf_prog_pack
Date: Thu, 28 Apr 2022 06:48:51 +0000	[thread overview]
Message-ID: <57DBEBDB-71AF-4A85-AB8D-8274541E0F3C@fb.com> (raw)
In-Reply-To: <CAHk-=wgA1Uku=ejwknv11ssNhz2pswhD=mJFBPEMQtCspz0YEQ@mail.gmail.com>

Hi Linus, 

Thanks for your thorough analysis of the situation; it makes a lot of
sense.

> On Apr 27, 2022, at 6:45 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Wed, Apr 27, 2022 at 3:24 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> Could you please share your suggestions on this set? Shall we ship it
>> with 5.18?
> 
> I'd personally prefer to just not do the prog_pack thing at all, since
> I don't think it was actually in a "ready to ship" state for this
> merge window, and the hugepage mapping protection games I'm still
> leery of.
> 
> Yes, the hugepage protection things probably do work from what I saw
> when I looked through them, but that x86 vmalloc hugepage code was
> really designed for another use (non-refcounted device pages), so the
> fact that it all actually seems surprisingly ok certainly wasn't
> because the code was designed to do that new case.
> 
> Does the prog_pack thing work with small pages?
> 
> Yes. But that wasn't what it was designed for or its selling point, so
> it all is a bit suspect to me.

prog_pack on small pages also reduces direct map fragmentation. This
is because libbpf uses tiny BPF programs to probe kernel features.
Before prog_pack, every one of these programs could split a direct
map page. For example, runqslower (tools/bpf/runqslower/) loads a
total of 7 BPF programs (3 actual programs and 4 tiny probe
programs), each of which may cause direct map fragmentation. With
prog_pack, OTOH, these programs all fit in a single page (or can even
share pages with other tools).
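
To illustrate the idea (a simplified sketch only, not the actual
kernel/bpf/core.c implementation; the names, sizes, and pack_alloc()
helper here are made up): a pack is one shared RO+X region, and a
bitmap allocator hands out small chunks of it, so tiny programs no
longer each split a direct map page:

  /* Hypothetical, simplified model of the prog_pack idea. */
  #define PACK_SIZE       PAGE_SIZE       /* one 4kB page per pack  */
  #define CHUNK_SIZE      64              /* allocation granularity */
  #define CHUNKS_PER_PACK (PACK_SIZE / CHUNK_SIZE)

  struct prog_pack {
          void *image;                            /* shared RO+X region */
          DECLARE_BITMAP(used, CHUNKS_PER_PACK);  /* chunk allocator    */
  };

  /* Carve a small JIT image out of the shared region instead of
   * giving every program its own page. */
  static void *pack_alloc(struct prog_pack *pack, size_t size)
  {
          unsigned long nbits = DIV_ROUND_UP(size, CHUNK_SIZE);
          unsigned long pos;

          pos = bitmap_find_next_zero_area(pack->used, CHUNKS_PER_PACK,
                                           0, nbits, 0);
          if (pos >= CHUNKS_PER_PACK)
                  return NULL;    /* pack is full, need a new one */
          bitmap_set(pack->used, pos, nbits);
          return pack->image + pos * CHUNK_SIZE;
  }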

> 
> And I'm looking at those set_memory_xyz() calls, and I'm going "yeah,
> I think it works on x86, but on ppc I look at it and I see
> 
>  static inline int set_memory_ro(unsigned long addr, int numpages)
>  {
>        return change_memory_attr(addr, numpages, SET_MEMORY_RO);
>  }
> 
> and then in change_memory_attr() it does
> 
>        if (WARN_ON_ONCE(is_vmalloc_or_module_addr((void *)addr) &&
>                         is_vm_area_hugepages((void *)addr)))
>                return -EINVAL;
> 
> and I'm "this is all supposedly generic code, but I'm not seeing how
> it works cross-architecture".

AFAICT, we did spend the time and effort to design bpf_prog_pack to
work cross-architecture. However, since prog_pack requires relatively
new building blocks in multiple layers (vmalloc, set_memory_XXX,
bpf_jit, etc.), non-x86 architectures have more missing pieces.

The fact that we cannot rely on set_vm_flush_reset_perms() for huge
pages on x86 does leak some architectural details into generic code.
But I guess we don't really need that hack (skipping
set_vm_flush_reset_perms() and calling set_memory_*() manually) for
prog_pack on small pages?
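
For small pages, the usual pattern should work. A rough sketch
(assuming a single-page allocation from module_alloc(), as the JIT
uses, so numpages == 1):

  void *image = module_alloc(PAGE_SIZE);

  /* Normal path: mark the vm area so that vfree() resets the
   * direct map permissions and flushes TLBs for us. */
  set_vm_flush_reset_perms(image);
  set_memory_ro((unsigned long)image, 1);
  set_memory_x((unsigned long)image, 1);
  ...
  vfree(image);

  /* The hack huge pages currently need: undo the permissions by
   * hand before freeing, because VM_FLUSH_RESET_PERMS does not
   * handle hugepage vmalloc mappings. */
  set_memory_nx((unsigned long)image, 1);
  set_memory_rw((unsigned long)image, 1);
  vfree(image);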

> 
> I *think* it's actually because this is all basically x86-specific due
> to x86 being the only architecture that implements
> bpf_arch_text_copy(), and everybody else then ends up erroring out and
> not using the prog_pack thing after all.
> 
> And then one of the two places that use bpf_arch_text_copy() doesn't
> even check the returned error code.
> 
> So this all ends up just depending on "only x86 will actually succeed
> in bpf_jit_binary_pack_finalize(), everybody else will fail after
> having done all the common setup".
> 
> End result: it all seems a bit broken right now. The "generic" code
> only works on x86, and on other architectures it goes through the
> motions and then fails at the end. And even on x86 I worry about
> actually enabling it fully.
> 
> I'm not having the warm and fuzzies about this all, in other words.
> 
> Maybe people can convince me otherwise, but I think you need to work at it.
> 
> And even for 5.19+ kind of timeframes, I'd actually like the x86
> people who maintain arch/x86/mm/pat/set_memory.c also sign off on
> using that code for hugepage vmalloc mappings too.
> 
> I *think* it does. But that code has various subtle things going on.
> 
> I see PeterZ is cc'd (presumably because of the text_poke() stuff, not
> because of the whole "call set_memory_ro() on virtually mapped kernel
> largepage memory".
> 
> Did people even talk to x86 people about this, or did the whole "it
> works, except it turns out set_vm_flush_reset_perms() doesn't work"
> mean that the authors of that code never got involved?

We did CC x86 folks (at least on some versions of the set), but we
didn't get much feedback until recently. We should definitely do
better on this in the future.

set_vm_flush_reset_perms() is clearly a problem here. But if we don't
use huge pages (current upstream + this set), we shouldn't need that
workaround.

Overall, we do hope to eliminate (or at least reduce) the system
slowdown caused by direct map fragmentation. Ideally, we want to
achieve this with a small number of huge pages. If huge pages don't
work out here, small pages would still help. Given that we already
call set_memory_*() on BPF programs (in bpf_jit_binary_lock_ro), I
think the risk with small pages is pretty low. And we should still
see a non-trivial performance improvement from a 4kB bpf_prog_pack
(I don't have full benchmark results at the moment).
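
For reference, bpf_jit_binary_lock_ro() today is (roughly, from
include/linux/filter.h):

  static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
  {
          set_vm_flush_reset_perms(hdr);
          set_memory_ro((unsigned long)hdr, hdr->size >> PAGE_SHIFT);
          set_memory_x((unsigned long)hdr, hdr->size >> PAGE_SHIFT);
  }

So small-page prog_pack exercises the same set_memory_*() paths we
have relied on for years.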

Thanks again,
Song




