linux-modules.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
       [not found] ` <20220713071846.3286727-2-song@kernel.org>
@ 2022-07-13  9:53   ` Peter Zijlstra
  2022-07-13 10:08   ` Christoph Hellwig
  2022-07-13 10:20   ` Peter Zijlstra
  2 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2022-07-13  9:53 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-kernel, linux-mm, linux-modules, mcgrof, rostedt,
	tglx, mingo, bp, mhiramat, naveen.n.rao, davem,
	anil.s.keshavamurthy, keescook, hch, dave, daniel, kernel-team,
	x86, dave.hansen, rick.p.edgecombe, akpm

On Wed, Jul 13, 2022 at 12:18:44AM -0700, Song Liu wrote:
> +/**
> + * vmalloc_exec - allocate RO+X memory in kernel text space
> + * @size:	allocation size
> + *
> + * Allocate @size of RO+X memory in kernel text space. This memory can be
> + * used to serve dynamic kernel text, such as BPF programs.
> + *
> + * The memory allocated is filled illegal instructions.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vmalloc_exec(size_t size)
> +{
> +	return vmalloc_exec_pack_alloc(size);
> +}
> +EXPORT_SYMBOL_GPL(vmalloc_exec);

NAK! modules do *NOT* get to allocate text.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
       [not found] ` <20220713071846.3286727-2-song@kernel.org>
  2022-07-13  9:53   ` [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory Peter Zijlstra
@ 2022-07-13 10:08   ` Christoph Hellwig
  2022-07-13 15:49     ` Song Liu
  2022-07-13 10:20   ` Peter Zijlstra
  2 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-07-13 10:08 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-kernel, linux-mm, linux-modules, mcgrof, peterz,
	rostedt, tglx, mingo, bp, mhiramat, naveen.n.rao, davem,
	anil.s.keshavamurthy, keescook, hch, dave, daniel, kernel-team,
	x86, dave.hansen, rick.p.edgecombe, akpm

NAK.  This is not something that should be an exported public API
ever.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
       [not found] ` <20220713071846.3286727-2-song@kernel.org>
  2022-07-13  9:53   ` [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory Peter Zijlstra
  2022-07-13 10:08   ` Christoph Hellwig
@ 2022-07-13 10:20   ` Peter Zijlstra
  2022-07-13 15:48     ` Song Liu
                       ` (3 more replies)
  2 siblings, 4 replies; 16+ messages in thread
From: Peter Zijlstra @ 2022-07-13 10:20 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-kernel, linux-mm, linux-modules, mcgrof, rostedt,
	tglx, mingo, bp, mhiramat, naveen.n.rao, davem,
	anil.s.keshavamurthy, keescook, hch, dave, daniel, kernel-team,
	x86, dave.hansen, rick.p.edgecombe, akpm

On Wed, Jul 13, 2022 at 12:18:44AM -0700, Song Liu wrote:
> Dynamically allocated kernel texts, such as module texts, bpf programs,
> and ftrace trampolines, are used in more and more scenarios. Currently,
> these users allocate meory with module_alloc, fill the memory with text,
> and then use set_memory_[ro|x] to protect the memory.
> 
> This approach has two issues:
>  1) each of these user occupies one or more RO+X page, and thus one or
>     more entry in the page table and the iTLB;
>  2) frequent allocate/free of RO+X pages causes fragmentation of kernel
>     direct map [1].
> 
> BPF prog pack [2] addresses this from the BPF side. Now, make the same
> logic available to other users of dynamic kernel text.
> 
> The new API is like:
> 
>   void *vmalloc_exec(size_t size);
>   void vfree_exec(void *addr, size_t size);
> 
> vmalloc_exec has different handling for small and big allocations
> (> PMD_SIZE * num_possible_nodes). bigger allocations have dedicated
> vmalloc allocation; while small allocations share a vmalloc_exec_pack
> with other allocations.
> 
> Once allocated, the vmalloc_exec_pack is filled with invalid instructions

*sigh*, again, INT3 is a *VALID* instruction.

> and protected with RO+X. Some text_poke feature is required to make
> changes to the vmalloc_exec_pack. Therefore, vmalloc_exec requires changes
> from the arch (to provide text_poke family APIs), and the user (to use
> text poke APIs to make any changes to the memory).

I hate the naming; this isn't just vmalloc, this is a whole different
allocator build on top of things.

I'm also not convinced this is the right way to go about doing this;
much of the design here is because of how the module range is mixing
text and data and working around that.

So how about instead we separate them? Then much of the problem goes
away, you don't need to track these 2M chunks at all.

Start by adding VM_TOPDOWN_VMAP, which instead of returning the lowest
(leftmost) vmap_area that fits, picks the higests (rightmost).

Then add module_alloc_data() that uses VM_TOPDOWN_VMAP and make
ARCH_WANTS_MODULE_DATA_IN_VMALLOC use that instead of vmalloc (with a
weak function doing the vmalloc).

This gets you bottom of module range is RO+X only, top is shattered
between different !X types.

Then track the boundary between X and !X and ensure module_alloc_data()
and module_alloc() never cross over and stay strictly separated.

Then change all module_alloc() users to expect RO+X memory, instead of
RW.

Then make sure any extention of the X range is 2M aligned.

And presto, *everybody* always uses 2M TLB for text, modules, bpf,
ftrace, the lot and nobody is tracking chunks.

Maybe migration can be eased by instead providing module_alloc_text()
and ARCH_WANTS_MODULE_ALLOC_TEXT.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-13 10:20   ` Peter Zijlstra
@ 2022-07-13 15:48     ` Song Liu
  2022-07-13 20:26       ` Peter Zijlstra
  2022-07-14  5:16     ` Christoph Hellwig
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2022-07-13 15:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, bpf, lkml, Linux-MM, linux-modules, Luis Chamberlain,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Masami Hiramatsu, naveen.n.rao, davem, anil.s.keshavamurthy,
	keescook, hch, dave, daniel, Kernel Team, x86, dave.hansen,
	rick.p.edgecombe, akpm



> On Jul 13, 2022, at 3:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Jul 13, 2022 at 12:18:44AM -0700, Song Liu wrote:
>> Dynamically allocated kernel texts, such as module texts, bpf programs,
>> and ftrace trampolines, are used in more and more scenarios. Currently,
>> these users allocate meory with module_alloc, fill the memory with text,
>> and then use set_memory_[ro|x] to protect the memory.
>> 
>> This approach has two issues:
>> 1) each of these user occupies one or more RO+X page, and thus one or
>>    more entry in the page table and the iTLB;
>> 2) frequent allocate/free of RO+X pages causes fragmentation of kernel
>>    direct map [1].
>> 
>> BPF prog pack [2] addresses this from the BPF side. Now, make the same
>> logic available to other users of dynamic kernel text.
>> 
>> The new API is like:
>> 
>>  void *vmalloc_exec(size_t size);
>>  void vfree_exec(void *addr, size_t size);
>> 
>> vmalloc_exec has different handling for small and big allocations
>> (> PMD_SIZE * num_possible_nodes). bigger allocations have dedicated
>> vmalloc allocation; while small allocations share a vmalloc_exec_pack
>> with other allocations.
>> 
>> Once allocated, the vmalloc_exec_pack is filled with invalid instructions
> 
> *sigh*, again, INT3 is a *VALID* instruction.

I am fully aware "invalid" or "illegal" is not accurate, but I am not 
sure what to use. Shall we call them "safe" instructions?

> 
>> and protected with RO+X. Some text_poke feature is required to make
>> changes to the vmalloc_exec_pack. Therefore, vmalloc_exec requires changes
>> from the arch (to provide text_poke family APIs), and the user (to use
>> text poke APIs to make any changes to the memory).
> 
> I hate the naming; this isn't just vmalloc, this is a whole different
> allocator build on top of things.
> 
> I'm also not convinced this is the right way to go about doing this;
> much of the design here is because of how the module range is mixing
> text and data and working around that.

Hmm.. I am not sure mixed data/text is the only problem here. 

> 
> So how about instead we separate them? Then much of the problem goes
> away, you don't need to track these 2M chunks at all.

If we manage the memory in < 2MiB granularity, either 4kB or smaller, 
we still need some way to track which parts are being used, no? I mean
the bitmap.  

> 
> Start by adding VM_TOPDOWN_VMAP, which instead of returning the lowest
> (leftmost) vmap_area that fits, picks the higests (rightmost).
> 
> Then add module_alloc_data() that uses VM_TOPDOWN_VMAP and make
> ARCH_WANTS_MODULE_DATA_IN_VMALLOC use that instead of vmalloc (with a
> weak function doing the vmalloc).
> 
> This gets you bottom of module range is RO+X only, top is shattered
> between different !X types.
> 
> Then track the boundary between X and !X and ensure module_alloc_data()
> and module_alloc() never cross over and stay strictly separated.
> 
> Then change all module_alloc() users to expect RO+X memory, instead of
> RW.
> 
> Then make sure any extention of the X range is 2M aligned.
> 
> And presto, *everybody* always uses 2M TLB for text, modules, bpf,
> ftrace, the lot and nobody is tracking chunks.
> 
> Maybe migration can be eased by instead providing module_alloc_text()
> and ARCH_WANTS_MODULE_ALLOC_TEXT.

If we have the text/data separation, can we just put text after _etext? 

Right now, we allocate huge pages for _stext to round_down(_etext, 2MB),
and 4kB pages for round_down(_etext, 2MB) to round_up(_etext, 4kB). To 
make this more efficient, we can allocate huge pages for _stext to 
round_up(_etext, 2MB), and use _etext to round_up(_etext, 2MB) as the
first pool of memory for module_alloc_text(). Once we used all the 
memory there, we allocate more huge pages after round_up(_etext, 2MB).

I am not sure how to make this work, but I guess this is similar to 
the idea you are describing here? However, we will need some bitmap 
to track the usage of these memory pools, right?

Thanks,
Song

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-13 10:08   ` Christoph Hellwig
@ 2022-07-13 15:49     ` Song Liu
  2022-07-14  4:23       ` Christoph Hellwig
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2022-07-13 15:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Song Liu, bpf, linux-kernel, linux-mm, linux-modules, mcgrof,
	peterz, rostedt, tglx, mingo, bp, mhiramat, naveen.n.rao, davem,
	anil.s.keshavamurthy, keescook, dave, daniel, Kernel Team, x86,
	dave.hansen, rick.p.edgecombe, akpm



> On Jul 13, 2022, at 3:08 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> NAK.  This is not something that should be an exported public API
> ever.

Hmm.. I will remove EXPORT_SYMBOL_GPL (if we ever do a v2 of this..)

Thanks,
Song

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-13 15:48     ` Song Liu
@ 2022-07-13 20:26       ` Peter Zijlstra
  2022-07-13 21:20         ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2022-07-13 20:26 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, bpf, lkml, Linux-MM, linux-modules, Luis Chamberlain,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Masami Hiramatsu, naveen.n.rao, davem, anil.s.keshavamurthy,
	keescook, hch, dave, daniel, Kernel Team, x86, dave.hansen,
	rick.p.edgecombe, akpm

On Wed, Jul 13, 2022 at 03:48:35PM +0000, Song Liu wrote:

> > So how about instead we separate them? Then much of the problem goes
> > away, you don't need to track these 2M chunks at all.
> 
> If we manage the memory in < 2MiB granularity, either 4kB or smaller, 
> we still need some way to track which parts are being used, no? I mean
> the bitmap.  

I was thinking the vmalloc vmap_area tree could help out there.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-13 20:26       ` Peter Zijlstra
@ 2022-07-13 21:20         ` Song Liu
  2022-07-14 10:10           ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2022-07-13 21:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, bpf, lkml, Linux-MM, linux-modules, Luis Chamberlain,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Masami Hiramatsu, naveen.n.rao, davem, anil.s.keshavamurthy,
	keescook, hch, dave, daniel, Kernel Team, x86, dave.hansen,
	rick.p.edgecombe, akpm



> On Jul 13, 2022, at 1:26 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Jul 13, 2022 at 03:48:35PM +0000, Song Liu wrote:
> 
>>> So how about instead we separate them? Then much of the problem goes
>>> away, you don't need to track these 2M chunks at all.
>> 
>> If we manage the memory in < 2MiB granularity, either 4kB or smaller, 
>> we still need some way to track which parts are being used, no? I mean
>> the bitmap.  
> 
> I was thinking the vmalloc vmap_area tree could help out there.

Interesting. vmap_area tree indeed keeps a lot of useful information. 

Currently, powerpc supports CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, 
which leaves module_alloc just for module text. If this works, we get
separation between RO+X and RW memory. What would it take to enable
CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC for x86_64? 

Thanks,
Song

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-13 15:49     ` Song Liu
@ 2022-07-14  4:23       ` Christoph Hellwig
  2022-07-14  4:54         ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-07-14  4:23 UTC (permalink / raw)
  To: Song Liu
  Cc: Christoph Hellwig, Song Liu, bpf, linux-kernel, linux-mm,
	linux-modules, mcgrof, peterz, rostedt, tglx, mingo, bp,
	mhiramat, naveen.n.rao, davem, anil.s.keshavamurthy, keescook,
	dave, daniel, Kernel Team, x86, dave.hansen, rick.p.edgecombe,
	akpm

On Wed, Jul 13, 2022 at 03:49:45PM +0000, Song Liu wrote:
> 
> 
> > On Jul 13, 2022, at 3:08 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > NAK.  This is not something that should be an exported public API
> > ever.
> 
> Hmm.. I will remove EXPORT_SYMBOL_GPL (if we ever do a v2 of this..)

Even without that it really is not a vmalloc API anyway.  Executable
memory needs to be written first, so we should allocate it in that state
and only mark it executable after that write has completed.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-14  4:23       ` Christoph Hellwig
@ 2022-07-14  4:54         ` Song Liu
  2022-07-14 18:15           ` Uladzislau Rezki
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2022-07-14  4:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Song Liu, bpf, linux-kernel, linux-mm, linux-modules, mcgrof,
	peterz, rostedt, tglx, mingo, bp, mhiramat, naveen.n.rao, davem,
	anil.s.keshavamurthy, keescook, dave, daniel, Kernel Team, x86,
	dave.hansen, rick.p.edgecombe, akpm



> On Jul 13, 2022, at 9:23 PM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Wed, Jul 13, 2022 at 03:49:45PM +0000, Song Liu wrote:
>> 
>> 
>>> On Jul 13, 2022, at 3:08 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>> 
>>> NAK.  This is not something that should be an exported public API
>>> ever.
>> 
>> Hmm.. I will remove EXPORT_SYMBOL_GPL (if we ever do a v2 of this..)
> 
> Even without that it really is not a vmalloc API anyway.  

This ...

> Executable
> memory needs to be written first, so we should allocate it in that state
> and only mark it executable after that write has completed.

... and this are two separate NAKs.

For the first NAK, I agree that my version is another layer on top of 
vmalloc. But what do you think about Peter's idea? AFAICT, that fits
well in vmalloc logic. 

For the second NAK, I acknowledge the concern. However, I think we 
will need some mechanism to update text without flipping W and X bit in 
the page table. Otherwise, set_memory_* w/ alias will fragment the 
direct map, and cause significant performance drop over time. If this
really doesn't work because of this concern, we will need to look into 
other solutions discussed in LSFMMBPF [1]. 

Thanks,
Song
[1] https://lwn.net/Articles/894557/



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-13 10:20   ` Peter Zijlstra
  2022-07-13 15:48     ` Song Liu
@ 2022-07-14  5:16     ` Christoph Hellwig
  2022-07-14  7:26       ` Peter Zijlstra
  2022-08-05  5:29     ` Song Liu
  2022-08-05  5:29     ` Song Liu
  3 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-07-14  5:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, bpf, linux-kernel, linux-mm, linux-modules, mcgrof,
	rostedt, tglx, mingo, bp, mhiramat, naveen.n.rao, davem,
	anil.s.keshavamurthy, keescook, hch, dave, daniel, kernel-team,
	x86, dave.hansen, rick.p.edgecombe, akpm

On Wed, Jul 13, 2022 at 12:20:09PM +0200, Peter Zijlstra wrote:
> Start by adding VM_TOPDOWN_VMAP, which instead of returning the lowest
> (leftmost) vmap_area that fits, picks the higests (rightmost).
> 
> Then add module_alloc_data() that uses VM_TOPDOWN_VMAP and make
> ARCH_WANTS_MODULE_DATA_IN_VMALLOC use that instead of vmalloc (with a
> weak function doing the vmalloc).
> 
> This gets you bottom of module range is RO+X only, top is shattered
> between different !X types.
> 
> Then track the boundary between X and !X and ensure module_alloc_data()
> and module_alloc() never cross over and stay strictly separated.
> 
> Then change all module_alloc() users to expect RO+X memory, instead of
> RW.
> 
> Then make sure any extention of the X range is 2M aligned.
> 
> And presto, *everybody* always uses 2M TLB for text, modules, bpf,
> ftrace, the lot and nobody is tracking chunks.
> 
> Maybe migration can be eased by instead providing module_alloc_text()
> and ARCH_WANTS_MODULE_ALLOC_TEXT.

This all looks pretty sensible.  How are we going to do the initial
write to the executable memory, though?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-14  5:16     ` Christoph Hellwig
@ 2022-07-14  7:26       ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2022-07-14  7:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Song Liu, bpf, linux-kernel, linux-mm, linux-modules, mcgrof,
	rostedt, tglx, mingo, bp, mhiramat, naveen.n.rao, davem,
	anil.s.keshavamurthy, keescook, dave, daniel, kernel-team, x86,
	dave.hansen, rick.p.edgecombe, akpm

On Wed, Jul 13, 2022 at 10:16:36PM -0700, Christoph Hellwig wrote:
> On Wed, Jul 13, 2022 at 12:20:09PM +0200, Peter Zijlstra wrote:
> > Start by adding VM_TOPDOWN_VMAP, which instead of returning the lowest
> > (leftmost) vmap_area that fits, picks the higests (rightmost).
> > 
> > Then add module_alloc_data() that uses VM_TOPDOWN_VMAP and make
> > ARCH_WANTS_MODULE_DATA_IN_VMALLOC use that instead of vmalloc (with a
> > weak function doing the vmalloc).
> > 
> > This gets you bottom of module range is RO+X only, top is shattered
> > between different !X types.
> > 
> > Then track the boundary between X and !X and ensure module_alloc_data()
> > and module_alloc() never cross over and stay strictly separated.
> > 
> > Then change all module_alloc() users to expect RO+X memory, instead of
> > RW.
> > 
> > Then make sure any extention of the X range is 2M aligned.
> > 
> > And presto, *everybody* always uses 2M TLB for text, modules, bpf,
> > ftrace, the lot and nobody is tracking chunks.
> > 
> > Maybe migration can be eased by instead providing module_alloc_text()
> > and ARCH_WANTS_MODULE_ALLOC_TEXT.
> 
> This all looks pretty sensible.  How are we going to do the initial
> write to the executable memory, though?

With something like text_poke_memcpy(). I suppose that the proposed
ARCH_WANTS_MODULE_ALLOC_TEXT needs to imply availability of that too.

If the 4K copy thing ends up being a bottleneck we can easily extend
that to have a 2M option as well.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-13 21:20         ` Song Liu
@ 2022-07-14 10:10           ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2022-07-14 10:10 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, bpf, lkml, Linux-MM, linux-modules, Luis Chamberlain,
	Steven Rostedt, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Masami Hiramatsu, naveen.n.rao, davem, anil.s.keshavamurthy,
	keescook, hch, dave, daniel, Kernel Team, x86, dave.hansen,
	rick.p.edgecombe, akpm

On Wed, Jul 13, 2022 at 09:20:55PM +0000, Song Liu wrote:
> 
> 
> > On Jul 13, 2022, at 1:26 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Wed, Jul 13, 2022 at 03:48:35PM +0000, Song Liu wrote:
> > 
> >>> So how about instead we separate them? Then much of the problem goes
> >>> away, you don't need to track these 2M chunks at all.
> >> 
> >> If we manage the memory in < 2MiB granularity, either 4kB or smaller, 
> >> we still need some way to track which parts are being used, no? I mean
> >> the bitmap.  
> > 
> > I was thinking the vmalloc vmap_area tree could help out there.
> 
> Interesting. vmap_area tree indeed keeps a lot of useful information. 
> 
> Currently, powerpc supports CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, 

Only PPC32; and it's due to a constraint in their MMU vs page
protections.

> which leaves module_alloc just for module text. If this works, we get
> separation between RO+X and RW memory. What would it take to enable
> CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC for x86_64? 

The VM_TOPDOWN_VMAP flag and ensuring the data and code regions never
overlap. Once you have that you can enable it.

Specifically the problem is that data needs to be in the s32 immediate
range just like code, so we're constrained to the module range. Given
that constraint, the easiest solution is to use the different ends of
that range.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-14  4:54         ` Song Liu
@ 2022-07-14 18:15           ` Uladzislau Rezki
  2022-07-15  0:24             ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Uladzislau Rezki @ 2022-07-14 18:15 UTC (permalink / raw)
  To: Song Liu
  Cc: Christoph Hellwig, Song Liu, bpf, linux-kernel, linux-mm,
	linux-modules, mcgrof, peterz, rostedt, tglx, mingo, bp,
	mhiramat, naveen.n.rao, davem, anil.s.keshavamurthy, keescook,
	dave, daniel, Kernel Team, x86, dave.hansen, rick.p.edgecombe,
	akpm

On Thu, Jul 14, 2022 at 04:54:40AM +0000, Song Liu wrote:
> 
> 
> > On Jul 13, 2022, at 9:23 PM, Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > On Wed, Jul 13, 2022 at 03:49:45PM +0000, Song Liu wrote:
> >> 
> >> 
> >>> On Jul 13, 2022, at 3:08 AM, Christoph Hellwig <hch@infradead.org> wrote:
> >>> 
> >>> NAK.  This is not something that should be an exported public API
> >>> ever.
> >> 
> >> Hmm.. I will remove EXPORT_SYMBOL_GPL (if we ever do a v2 of this..)
> > 
> > Even without that it really is not a vmalloc API anyway.  
> 
> This ...
> 
> > Executable
> > memory needs to be written first, so we should allocate it in that state
> > and only mark it executable after that write has completed.
> 
> ... and this are two separate NAKs.
> 
> For the first NAK, I agree that my version is another layer on top of 
> vmalloc. But what do you think about Peter's idea? AFAICT, that fits
> well in vmalloc logic. 
> 
I am not able to find the patch/change to see what you have done. But
please do not build a new allocator on top of vmalloc code. We have
three different ones what make things to be complicated :)

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-14 18:15           ` Uladzislau Rezki
@ 2022-07-15  0:24             ` Song Liu
  0 siblings, 0 replies; 16+ messages in thread
From: Song Liu @ 2022-07-15  0:24 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Christoph Hellwig, Song Liu, bpf, linux-kernel, linux-mm,
	linux-modules, mcgrof, peterz, rostedt, tglx, mingo, bp,
	mhiramat, naveen.n.rao, davem, anil.s.keshavamurthy, keescook,
	dave, daniel, Kernel Team, x86, dave.hansen, rick.p.edgecombe,
	akpm



> On Jul 14, 2022, at 11:15 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> On Thu, Jul 14, 2022 at 04:54:40AM +0000, Song Liu wrote:
>> 
>> 
>>> On Jul 13, 2022, at 9:23 PM, Christoph Hellwig <hch@infradead.org> wrote:
>>> 
>>> On Wed, Jul 13, 2022 at 03:49:45PM +0000, Song Liu wrote:
>>>> 
>>>> 
>>>>> On Jul 13, 2022, at 3:08 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>>>> 
>>>>> NAK.  This is not something that should be an exported public API
>>>>> ever.
>>>> 
>>>> Hmm.. I will remove EXPORT_SYMBOL_GPL (if we ever do a v2 of this..)
>>> 
>>> Even without that it really is not a vmalloc API anyway.  
>> 
>> This ...
>> 
>>> Executable
>>> memory needs to be written first, so we should allocate it in that state
>>> and only mark it executable after that write has completed.
>> 
>> ... and this are two separate NAKs.
>> 
>> For the first NAK, I agree that my version is another layer on top of 
>> vmalloc. But what do you think about Peter's idea? AFAICT, that fits
>> well in vmalloc logic. 
>> 
> I am not able to find the patch/change to see what you have done.

vger dropped my patch again. :(

> But
> please do not build a new allocator on top of vmalloc code. We have
> three different ones what make things to be complicated :)

It was a bpf_prog_pack like allocator, but named as vmalloc_exec(), 
vfree_exec(). I guess I have got enough NAKs for it. 

Thanks,
Song

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-13 10:20   ` Peter Zijlstra
  2022-07-13 15:48     ` Song Liu
  2022-07-14  5:16     ` Christoph Hellwig
@ 2022-08-05  5:29     ` Song Liu
  2022-08-05  5:29     ` Song Liu
  3 siblings, 0 replies; 16+ messages in thread
From: Song Liu @ 2022-08-05  5:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, bpf, linux-kernel, linux-mm, linux-modules, mcgrof,
	rostedt, tglx, mingo, bp, mhiramat, naveen.n.rao, davem,
	anil.s.keshavamurthy, keescook, hch, dave, daniel, Kernel Team,
	x86, dave.hansen, rick.p.edgecombe, akpm

Hi Peter,

> On Jul 13, 2022, at 3:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 

[...]

> 
> So how about instead we separate them? Then much of the problem goes
> away, you don't need to track these 2M chunks at all.
> 
> Start by adding VM_TOPDOWN_VMAP, which instead of returning the lowest
> (leftmost) vmap_area that fits, picks the higests (rightmost).
> 
> Then add module_alloc_data() that uses VM_TOPDOWN_VMAP and make
> ARCH_WANTS_MODULE_DATA_IN_VMALLOC use that instead of vmalloc (with a
> weak function doing the vmalloc).
> 
> This gets you bottom of module range is RO+X only, top is shattered
> between different !X types.
> 
> Then track the boundary between X and !X and ensure module_alloc_data()
> and module_alloc() never cross over and stay strictly separated.
> 
> Then change all module_alloc() users to expect RO+X memory, instead of
> RW.
> 
> Then make sure any extention of the X range is 2M aligned.
> 
> And presto, *everybody* always uses 2M TLB for text, modules, bpf,
> ftrace, the lot and nobody is tracking chunks.
> 
> Maybe migration can be eased by instead providing module_alloc_text()
> and ARCH_WANTS_MODULE_ALLOC_TEXT.

I finally got some time to look into the code. A few questions:

1. AFAICT, vmap_area tree only works with PAGE_SIZE aligned addresses. 
   For the sharing to be more efficient, I think we need to go with
   smaller granularity. Will this work? Shall we pick a smaller 
   granularity, say 64 bytes? Or shall we go all the way to 1 byte?

2. I think we will need multiple vmap_area's sharing the same vm_struct. 
   Do we need to add refcount to vm_struct?

Thanks,
Song



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory
  2022-07-13 10:20   ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  2022-08-05  5:29     ` Song Liu
@ 2022-08-05  5:29     ` Song Liu
  3 siblings, 0 replies; 16+ messages in thread
From: Song Liu @ 2022-08-05  5:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, bpf, linux-kernel, linux-mm, linux-modules, mcgrof,
	rostedt, tglx, mingo, bp, mhiramat, naveen.n.rao, davem,
	anil.s.keshavamurthy, keescook, hch, dave, daniel, Kernel Team,
	x86, dave.hansen, rick.p.edgecombe, akpm

Hi Peter,

> On Jul 13, 2022, at 3:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 

[...]

> 
> So how about instead we separate them? Then much of the problem goes
> away, you don't need to track these 2M chunks at all.
> 
> Start by adding VM_TOPDOWN_VMAP, which instead of returning the lowest
> (leftmost) vmap_area that fits, picks the higests (rightmost).
> 
> Then add module_alloc_data() that uses VM_TOPDOWN_VMAP and make
> ARCH_WANTS_MODULE_DATA_IN_VMALLOC use that instead of vmalloc (with a
> weak function doing the vmalloc).
> 
> This gets you bottom of module range is RO+X only, top is shattered
> between different !X types.
> 
> Then track the boundary between X and !X and ensure module_alloc_data()
> and module_alloc() never cross over and stay strictly separated.
> 
> Then change all module_alloc() users to expect RO+X memory, instead of
> RW.
> 
> Then make sure any extention of the X range is 2M aligned.
> 
> And presto, *everybody* always uses 2M TLB for text, modules, bpf,
> ftrace, the lot and nobody is tracking chunks.
> 
> Maybe migration can be eased by instead providing module_alloc_text()
> and ARCH_WANTS_MODULE_ALLOC_TEXT.

I finally got some time to look into the code. A few questions:

1. AFAICT, vmap_area tree only works with PAGE_SIZE aligned addresses. 
   For the sharing to be more efficient, I think we need to go with
   smaller granularity. Will this work? Shall we pick a smaller 
   granularity, say 64 bytes? Or shall we go all the way to 1 byte?

2. I think we will need multiple vmap_area's sharing the same vm_struct. 
   Do we need to add refcount to vm_struct?

Thanks,
Song



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-08-05  5:30 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20220713071846.3286727-1-song@kernel.org>
     [not found] ` <20220713071846.3286727-2-song@kernel.org>
2022-07-13  9:53   ` [PATCH bpf-next 1/3] mm/vmalloc: introduce vmalloc_exec which allocates RO+X memory Peter Zijlstra
2022-07-13 10:08   ` Christoph Hellwig
2022-07-13 15:49     ` Song Liu
2022-07-14  4:23       ` Christoph Hellwig
2022-07-14  4:54         ` Song Liu
2022-07-14 18:15           ` Uladzislau Rezki
2022-07-15  0:24             ` Song Liu
2022-07-13 10:20   ` Peter Zijlstra
2022-07-13 15:48     ` Song Liu
2022-07-13 20:26       ` Peter Zijlstra
2022-07-13 21:20         ` Song Liu
2022-07-14 10:10           ` Peter Zijlstra
2022-07-14  5:16     ` Christoph Hellwig
2022-07-14  7:26       ` Peter Zijlstra
2022-08-05  5:29     ` Song Liu
2022-08-05  5:29     ` Song Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).