From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
To: "rppt@kernel.org" <rppt@kernel.org>
Cc: "peterz@infradead.org" <peterz@infradead.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"song@kernel.org" <song@kernel.org>, "hch@lst.de" <hch@lst.de>,
	"x86@kernel.org" <x86@kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"mcgrof@kernel.org" <mcgrof@kernel.org>,
	"Lu, Aaron" <aaron.lu@intel.com>
Subject: Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
Date: Wed, 9 Nov 2022 17:04:25 +0000
Message-ID: <bcdc5a31570f87267183496f06963ac58b41bfe1.camel@intel.com>
In-Reply-To: <Y2uMWvmiPlaNXlZz@kernel.org>

On Wed, 2022-11-09 at 13:17 +0200, Mike Rapoport wrote:
> On Tue, Nov 08, 2022 at 04:51:12PM +0000, Edgecombe, Rick P wrote:
> > On Tue, 2022-11-08 at 13:27 +0200, Mike Rapoport wrote:
> > > > Based on our experiments [5], we measured 0.5% performance
> > > > improvement from bpf_prog_pack. This patchset further boosts the
> > > > improvement to 0.7%. The difference is because bpf_prog_pack uses
> > > > 512x 4kB pages instead of 1x 2MB page; bpf_prog_pack as-is doesn't
> > > > resolve #2 above.
> > > > 
> > > > This patchset replaces bpf_prog_pack with a better API and makes
> > > > it available for other dynamic kernel text, such as modules,
> > > > ftrace, and kprobes.
> > > 
> > >   
The proposed execmem_alloc() looks to me very much tailored for x86
> > > to be used as a replacement for module_alloc(). Some architectures
> > > have a module_alloc() that is quite different from the default or
> > > x86 version, so I'd expect at least some explanation of how modules
> > > etc. can use the execmem_ APIs without breaking !x86 architectures.
> > 
> > I think this is fair, but I think we should ask ourselves - how much
> > should we do in one step?
> 
> I think that at least we need evidence that execmem_alloc() etc. can
> actually be used by modules/ftrace/kprobes. Luis said that RFC v2
> didn't work for him at all, so having a core MM API for code
> allocation that only works with BPF on x86 seems not right to me.

Those module changes wouldn't work on non-x86 either. Most of the
modules code is cross-arch, so this kind of has to work for
non-text_poke() architectures, or the modules code needs to be
refactored.

>  
> > For non-text_poke() architectures, the way you can make it work is
> > to have the API look like:
> > execmem_alloc()  <- Does the allocation, but not necessarily usable yet
> > execmem_write()  <- Loads the mapping, doesn't work after finish()
> > execmem_finish() <- Makes the mapping live (loaded, executable, ready)
> > 
> > So for text_poke():
> > execmem_alloc()  <- reserves the mapping
> > execmem_write()  <- text_poke()s to the mapping
> > execmem_finish() <- does nothing
> > 
> > And for non-text_poke():
> > execmem_alloc()  <- Allocates a regular RW vmalloc allocation
> > execmem_write()  <- Writes normally to it
> > execmem_finish() <- does set_memory_ro()/set_memory_x() on it
> > 
> > Non-text_poke() only gets the benefits of centralized logic, but the
> > interface works for both. This is pretty much what the perm_alloc()
> > RFC did to make it work with other arches and modules. But to fit
> > with the existing modules code (which is actually spread all over)
> > and also handle RO sections, it also needed some additional bells
> > and whistles.
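
To make that concrete, here is a rough sketch of the two flavors.
This is not from the posted patches: the config symbol is made up,
and the real allocator would hand out ranges carved from the shared
huge pages rather than a plain vmalloc():

#include <linux/mm.h>
#include <linux/string.h>
#include <linux/vmalloc.h>
#include <linux/set_memory.h>

/* Reserve space; on text_poke() arches this would come out RO+X. */
void *execmem_alloc(size_t size)
{
	return vmalloc(size);	/* stand-in for the real allocator */
}

#ifdef CONFIG_ARCH_HAS_TEXT_POKE	/* hypothetical symbol */

/* Writes always go through the temporary text_poke() mapping. */
int execmem_write(void *dst, const void *src, size_t len)
{
	text_poke_copy(dst, src, len);	/* x86: asm/text-patching.h */
	return 0;
}

/* Nothing to do: the mapping was never writable in the first place. */
int execmem_finish(void *addr, size_t size)
{
	return 0;
}

#else /* !CONFIG_ARCH_HAS_TEXT_POKE */

/* The allocation is ordinary RW memory until finish(). */
int execmem_write(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return 0;
}

/* Flip the whole allocation to RO+X; no writes allowed after this. */
int execmem_finish(void *addr, size_t size)
{
	int nr = PAGE_ALIGN(size) >> PAGE_SHIFT;

	set_memory_ro((unsigned long)addr, nr);
	return set_memory_x((unsigned long)addr, nr);
}

#endif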
> 
> I'm less concerned about the non-text_poke() part, but rather about
> restrictions on where code and data can live on different
> architectures, and whether these restrictions won't lead to an
> inability to use the centralized logic on, say, arm64 and powerpc.
> 
> For instance, if we use execmem_alloc() for modules, it means that
> data sections should be allocated separately with plain vmalloc().
> Will this work universally? Or will this require special care with
> additional complexity in the modules code?

Good point. If the module data were still in the modules range, I
would hope it would still work, but there are a lot of architectures
to check. Some might care whether the data is really close to the
text. I'm not sure.

The perm_alloc() stuff did some hacks to force the allocations close
to each other out of paranoia about this. It basically started with
one allocation, but then tracked the pieces separately so arches
could separate them if they wanted. But I wondered whether that was
really needed.

>  
> > So the question I'm trying to ask is: how much should we target for
> > the next step? I first thought that this functionality was so
> > intertwined that it would be too hard to do iteratively. So if we
> > want to try iteratively, I'm OK if it doesn't solve everything.
> 
>  
> With execmem_alloc() as the first step, I'm failing to see the big
> picture. If we want to use it for modules, how will we allocate RO
> data?

Similar to the perm_alloc() hacks?

> with similar rodata_alloc() that uses yet another tree in vmalloc? 

It would have to group them together, at least. I'm not sure whether
it needs a separate tree or not. I would think permission flags would
be better than a new function for each memory type.
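
Roughly what I mean, as a hypothetical sketch (the flag names are
made up, not from any posted patch):

#include <linux/bits.h>

#define EXECMEM_RO	BIT(0)	/* read-only once finish() runs */
#define EXECMEM_X	BIT(1)	/* executable once finish() runs */

void *execmem_alloc(size_t size, unsigned long flags);

Then inside a loader, something like:

	text   = execmem_alloc(text_size, EXECMEM_RO | EXECMEM_X);
	rodata = execmem_alloc(ro_size,   EXECMEM_RO);
	data   = execmem_alloc(data_size, 0);	/* plain RW */

The allocator could keep the pieces grouped near each other (or not,
per arch) without growing a separate entry point for every kind of
memory.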

> How can the caching of large pages in vmalloc be made useful for use
> cases like secretmem and PKS?

This part is easy, I think. If we had an unmapped page allocator, it
could just feed this. Do you have any idea when you might pick that
stuff up again?
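
Something like this, say (unmapped_pages_alloc() standing in for
whatever your unmapped page allocator ends up exporting):

/* Refill the executable large-page cache with pages that have no
 * linear-map alias, so the same cache could serve secretmem/PKS
 * style users as well. */
static struct page *execmem_cache_grow(unsigned int order)
{
	return unmapped_pages_alloc(GFP_KERNEL, order);
}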

To answer my own question, I think a good first step would be to make
the interface also work for non-text_poke() so it could really be
cross-arch, then use it for everything except modules. The benefit to
the other arches at that point is centralized handling of loading
text.

Thread overview: 91+ messages
2022-11-07 22:39 [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Song Liu
2022-11-07 22:39 ` [PATCH bpf-next v2 1/5] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill Song Liu
2022-11-07 22:39 ` [PATCH bpf-next v2 2/5] x86/alternative: support execmem_alloc() and execmem_free() Song Liu
2022-11-07 22:39 ` [PATCH bpf-next v2 3/5] bpf: use execmem_alloc for bpf program and bpf dispatcher Song Liu
2022-11-07 22:39 ` [PATCH bpf-next v2 4/5] vmalloc: introduce register_text_tail_vm() Song Liu
2022-11-07 22:39 ` [PATCH bpf-next v2 5/5] x86: use register_text_tail_vm Song Liu
2022-11-08 19:04   ` Edgecombe, Rick P
2022-11-08 22:15     ` Song Liu
2022-11-15 17:28       ` Edgecombe, Rick P
2022-11-07 22:55 ` [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Luis Chamberlain
2022-11-07 23:13   ` Song Liu
2022-11-07 23:39     ` Luis Chamberlain
2022-11-08  0:13       ` Edgecombe, Rick P
2022-11-08  2:45         ` Luis Chamberlain
2022-11-08 18:20         ` Song Liu
2022-11-08 18:12       ` Song Liu
2022-11-08 11:27 ` Mike Rapoport
2022-11-08 12:38   ` Aaron Lu
2022-11-09  6:55     ` Christoph Hellwig
2022-11-09 11:05       ` Peter Zijlstra
2022-11-08 16:51   ` Edgecombe, Rick P
2022-11-08 18:50     ` Song Liu
2022-11-09 11:17     ` Mike Rapoport
2022-11-09 17:04       ` Edgecombe, Rick P [this message]
2022-11-09 17:53         ` Song Liu
2022-11-13 10:34         ` Mike Rapoport
2022-11-14 20:30           ` Song Liu
2022-11-15 21:18             ` Luis Chamberlain
2022-11-15 21:39               ` Edgecombe, Rick P
2022-11-16 22:34                 ` Luis Chamberlain
2022-11-17  8:50             ` Mike Rapoport
2022-11-17 18:36               ` Song Liu
2022-11-20 10:41                 ` Mike Rapoport
2022-11-21 14:52                   ` Song Liu
2022-11-30  9:39                     ` Mike Rapoport
2022-11-09 17:43       ` Song Liu
2022-11-09 21:23         ` Christophe Leroy
2022-11-10  1:50           ` Song Liu
2022-11-13 10:42         ` Mike Rapoport
2022-11-14 20:45           ` Song Liu
2022-11-15 20:51             ` Luis Chamberlain
2022-11-20 10:44             ` Mike Rapoport
2022-11-08 18:41   ` Song Liu
2022-11-08 19:43     ` Christophe Leroy
2022-11-08 21:40       ` Song Liu
2022-11-13  9:58     ` Mike Rapoport
2022-11-14 20:13       ` Song Liu
2022-11-08 11:44 ` Christophe Leroy
2022-11-08 18:47   ` Song Liu
2022-11-08 19:32     ` Christophe Leroy
2022-11-08 11:48 ` Mike Rapoport
2022-11-15  1:30 ` Song Liu
2022-11-15 17:34   ` Edgecombe, Rick P
2022-11-15 21:54     ` Song Liu
2022-11-15 22:14       ` Edgecombe, Rick P
2022-11-15 22:32         ` Song Liu
2022-11-16  1:20         ` Song Liu
2022-11-16 21:22           ` Edgecombe, Rick P
2022-11-16 22:03             ` Song Liu
2022-11-15 21:09   ` Luis Chamberlain
2022-11-15 21:32     ` Luis Chamberlain
2022-11-15 22:48     ` Song Liu
2022-11-16 22:33       ` Luis Chamberlain
2022-11-16 22:47         ` Edgecombe, Rick P
2022-11-16 23:53           ` Luis Chamberlain
2022-11-17  1:17             ` Song Liu
2022-11-17  9:37         ` Mike Rapoport
2022-11-29 10:23   ` Thomas Gleixner
2022-11-29 17:26     ` Song Liu
2022-11-29 23:56       ` Thomas Gleixner
2022-11-30 16:18         ` Song Liu
2022-12-01  9:08           ` Thomas Gleixner
2022-12-01 19:31             ` Song Liu
2022-12-02  1:38               ` Thomas Gleixner
2022-12-02  8:38                 ` Song Liu
2022-12-02  9:22                   ` Thomas Gleixner
2022-12-06 20:25                     ` Song Liu
2022-12-07 15:36                       ` Thomas Gleixner
2022-12-07 16:53                         ` Christophe Leroy
2022-12-07 19:29                           ` Song Liu
2022-12-07 21:04                           ` Thomas Gleixner
2022-12-07 21:48                             ` Christophe Leroy
2022-12-07 19:26                         ` Song Liu
2022-12-07 20:57                           ` Thomas Gleixner
2022-12-07 23:17                             ` Song Liu
2022-12-02 10:46                 ` Christophe Leroy
2022-12-02 17:43                   ` Thomas Gleixner
2022-12-01 20:23             ` Mike Rapoport
2022-12-01 22:34               ` Thomas Gleixner
2022-12-03 14:46                 ` Mike Rapoport
2022-12-03 20:58                   ` Thomas Gleixner
