Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs

From: Song Liu <song@kernel.org>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, peterz@infradead.org,
	akpm@linux-foundation.org, x86@kernel.org, hch@lst.de,
	rick.p.edgecombe@intel.com, aaron.lu@intel.com, rppt@kernel.org,
	mcgrof@kernel.org
Subject: Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
Date: Tue, 6 Dec 2022 12:25:07 -0800	[thread overview]
Message-ID: <CAPhsuW65K5TBbT_noTMnAEQ58rNGe-MfnjHF-arG8SZV9nfhzg@mail.gmail.com> (raw)
In-Reply-To: <878rjqqhxf.ffs@tglx>

Hi Thomas,

Thanks again for your suggestions. Here is my homework so far.

On Fri, Dec 2, 2022 at 1:22 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Song!
>
> On Fri, Dec 02 2022 at 00:38, Song Liu wrote:
> > Thanks for all these suggestions!
>
> Welcome.
>
> > On Thu, Dec 1, 2022 at 5:38 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> You have to be aware, that the rodata space needs to be page granular
> >> while text and data can really aggregate below the page alignment, but
> >> again might have different alignment requirements.
> >
> > I don't quite follow why rodata space needs to be page granular. If text can
> > go below page granular, rodata should also do that, no?
>
> Of course it can, except for the case of ro_after_init_data, because
> that needs to be RW during module_init() and is then switched to RO when
> module_init() returns success. So for that you need page granular maps
> per module, right?
>
> Sure you can have a separate space for rodata and ro_after_init_data,
> but as I said to Mike:
>
>   "The point is, that rodata and ro_after_init_data is a pretty small
>    portion of modules as far as my limited analysis of a distro build
>    shows.
>
>    The bulk is in text and data. So if we preserve 2M pages for text and
>    for RW data and bite the bullet to split one 2M page for
>    ro[_after_init_]data, we get the maximum benefit for the least
>    complexity."
>
> So under the assumption that rodata is small, it's questionable whether
> the split of rodata and ro_after_init_data makes a lot of difference. It
> might, but that needs to be investigated.
>
> That's not a fundamental conceptual problem because adding a 4th type to
> the concept we outlined so far is straight forward, right?
>
> > I guess I will do my homework, and come back with as much information
> > as possible for #1 + #2 + #3. Then, we can discuss whether it makes
> > sense at all.
>
> Correct. Please have a close look at the 11 architecture specific
> module_alloc() variants so you can see what kind of tweaks and magic
> they need, which lets you better specify the needs for the
> initialization parameter set required.

Survey of the 11 architecture specific module_alloc(). They basically do
the following magic:

1. Modify MODULES_VADDR and/or MODULES_END. There are multiple
  reasons behind this, some arch does this for KASLR, some other archs
  have different MODULES_[VADDR|END] for different processors (32b vs.
  64b for example), some archs use some module address space for other
  things (i.e. _exiprom on arm).

Archs need 1: x86, arm64, arm, mips, ppc, riscv, s390, loongarch, sparc

2. Use kasan_alloc_module_shadow()

Archs need 2: x86, arm64, s390

3. A secondary module address space. There is a smaller preferred
  address space for modules. Once the preferred space runs out, allocate
  memory from a secondary address space.

Archs need 3: some ppc, arm, arm64 (PLTS on arm and arm64)

4. User different pgprot_t (PAGE_KERNEL, PAGE_KERNEL_EXEC, etc.)

5. sparc does memset(ptr, 0, size) in module_alloc()

6. nios2 uses kmalloc() for modules. Based on the comment, this is
  probably only because it needs different MODULES_[VADDR|END].

I think we can handle all these with a single module_alloc() and a few
module_arch_* functions().

unsigned long module_arch_vaddr(void);
unsigned long module_arch_end(void);
unsigned long module_arch_secondary_vaddr(void);
unsigned long module_arch_secondary_end(void);
pgprot_t module_arch_pgprot(alloc_type type);
void *module_arch_initialize(void *s, size_t n);
bool module_arch_do_kasan_shadow(void);

So module_alloc() would look like:

void *module_alloc(unsigned long size, pgprot_t prot, unsigned long align,
       unsigned long granularity, alloc_type type)
{
    unsigned long vm_flags = VM_FLUSH_RESET_PERMS |
        (module_arch_do_kasan_shadow() ? VM_DEFER_KMEMLEAK : 0);
    void *ptr;

    ptr = __vmalloc_node_range(size, align, module_arch_vaddr(),
        module_arch_end(), GFP_KERNEL, module_arch_pgprot(type), vm_flags,
        NUMA_NO_NODE, __builtin_return_address(0));

    if (!ptr && module_arch_secondary_vaddr() != module_arch_secondary_end())
        ptr = __vmalloc_node_range(size, align, module_arch_secondary_vaddr(),
        module_arch_secondary_end(), GFP_KERNEL, module_arch_pgprot(type),
        vm_flags, NUMA_NO_NODE, __builtin_return_address(0));

    if (p && module_arch_do_kasan_shadow() &&
        (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
        vfree(p);
        return NULL;
    }
    module_arch_initialize(ptr, size);
    return p;
}

This is not really pretty, but I don't have a better idea at the moment.

For the allocation type, there are technically 5 of them:

ALLOC_TYPE_RX,   /* text */
ALLOC_TYPE_RW,    /* rw data */
ALLOC_TYPE_RO,    /* ro data */
ALLOC_TYPE_RO_AFTER_INIT,
ALLOC_TYPE_RWX,   /* legacy, existing module_alloc behavior */

Given RO and RO_AFTER_INIT require PAGE alignment and are
relatively small. I think we can merge them with RWX.

For RX and RW, we can allocate huge pages, and cut subpage
chunks out for users (something similar to 1/6 of the set).

For RWX, we have 2 options:
1. Use similar logic as RX and RW, but use PAGE granularity, and
    do set_memory_ro on it.
2. Keep current module_alloc behavior.

1 is better at protecting direct map (less fragmentation); while
2 is probably a little simpler. Given module load/unload are
rare events in most systems, I personally think we can start
with option 2.

We also need to redesign module_layout. Right now, we have
up to 3 layouts: core, init, and data. We will need 6 allocations:
  core text,
  core rw data,
  core ro and ro_after_init data (one allocation)
  init text,
  init rw data,
  init ro data.

PS: how much do we benefit with separate core and init.
Maybe it is time to merge the two? (keep init part around until
the module unloads).

The above is my Problem analysis and Concepts.

For data structures, I propose we use two extra trees for RX
and RW allocation (similar to 1/6 of current version, but 2x
trees). For RWX, we keep current module_alloc() behavior,
so no new data structure is needed.

The new module_layout will be something like:

struct module_layout {
    void *ptr;
    unsigned int size;  /* text size, rw data size, ro + ro_after_init size */
    unsigned int ro_size;    /* ro_size for ro + ro_after_init allocation */
};

So that's all I have so far. Please share your comments and
suggestions on it.

One more question: shall we make module sections page
aligned without STRICT_MODULE_RWX? It appears to be
a good way to simplify the logic. But it may cause too much
memory waste for smaller processors?

Thanks,
Song