linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
       [not found] <20221007234315.2877365-1-song@kernel.org>
@ 2022-10-08  0:17 ` Song Liu
       [not found] ` <20221007234315.2877365-2-song@kernel.org>
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Song Liu @ 2022-10-08  0:17 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Linux-MM, lkml, Andrew Morton, X86 ML, Peter Zijlstra,
	Christoph Hellwig, Kernel Team, Edgecombe, Rick P, dave.hansen,
	urezki


+ Luis.
> On Oct 7, 2022, at 4:43 PM, Song Liu <song@kernel.org> wrote:
> 
> Changes RFC v1 => RFC v2:
> 1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
>   work fine with BPF programs (patch 1, 2, 4). But module side (patch 3)
>   still need some work.
> 
> This set is a prototype that allows dynamic kernel text (modules, bpf
> programs, various trampolines, etc.) to share huge pages. The idea is
> similar to Peter's suggestion in [1]. Please refer to each patch for
> more detais.
> 
> The ultimate goal is to only host kernel text in 2MB pages (for x86_64).
> 
> Please share your comments on this.
> 
> Thanks!
> 
> [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
> [2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/
> 
> Song Liu (4):
>  vmalloc: introduce vmalloc_exec and vfree_exec
>  bpf: use vmalloc_exec
>  modules, x86: use vmalloc_exec for module core
>  vmalloc_exec: share a huge page with kernel text
> 
> arch/x86/Kconfig              |   1 +
> arch/x86/kernel/alternative.c |  30 +++-
> arch/x86/kernel/module.c      |   1 +
> arch/x86/mm/init_64.c         |   3 +-
> include/linux/vmalloc.h       |   2 +
> kernel/bpf/core.c             | 155 ++----------------
> kernel/module/main.c          |  23 +--
> kernel/module/strict_rwx.c    |   3 -
> kernel/trace/ftrace.c         |   3 +-
> mm/nommu.c                    |   7 +
> mm/vmalloc.c                  | 296 ++++++++++++++++++++++++++++++++++
> 11 files changed, 358 insertions(+), 166 deletions(-)
> 
> --
> 2.30.2


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec
       [not found] ` <20221007234315.2877365-2-song@kernel.org>
@ 2022-10-10 18:13   ` Edgecombe, Rick P
  2022-10-10 19:04     ` Song Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Edgecombe, Rick P @ 2022-10-10 18:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm, song
  Cc: hch, kernel-team, peterz, urezki, akpm, x86, Hansen, Dave

How does this work with kasan?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
       [not found] ` <20221007234315.2877365-5-song@kernel.org>
@ 2022-10-10 18:32   ` Edgecombe, Rick P
  2022-10-10 19:08     ` Song Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Edgecombe, Rick P @ 2022-10-10 18:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm, song
  Cc: hch, kernel-team, peterz, urezki, akpm, x86, Hansen, Dave

On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
> On x86 kernel, we allocate 2MB pages for kernel text up to
> round_down(_etext, 2MB). Therefore, some of the kernel text is still
> on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
> round_up(_etext, 2MB), and use the rest of the page for modules and
> BPF programs.
> 
> Here is an example:
> 
> [root@eth50-1 ~]# grep _etext /proc/kallsyms
> ffffffff82202a08 T _etext
> 
> [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
> ffffffff8220f920 t
> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
> ffffffff8220fa28 t
> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
> ffffffff8220fad4 t
> bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
> 
> [root@eth50-1 ~]#  grep 0xffffffff82200000
> /sys/kernel/debug/page_tables/kernel
> 0xffffffff82200000-
> 0xffffffff82400000     2M     ro   PSE         x  pmd
> 
> [root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
> ffffffff822ba910 t xfs_flush_inodes_worker      [xfs]
> ffffffff822bc580 t xfs_flush_inodes     [xfs]
> 
> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text,
> xfs
> module, and bpf programs.

Can this memory range be freed as part of a vfree_exec() call then?
Does vmalloc actually try to unmap it? If so, it could get complicated
with PTI.

It probably should be a special case that never gets fully freed.

> 
> Signed-off-by: Song Liu <song@kernel.org>
> ---
>  arch/x86/mm/init_64.c |  3 ++-
>  mm/vmalloc.c          | 24 ++++++++++++++++++++++++
>  2 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0fe690ebc269..d94f196c541a 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1367,12 +1367,13 @@ int __init
> deferred_page_init_max_threads(const struct cpumask *node_cpumask)
>  
>  int kernel_set_to_readonly;
>  
> +#define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) &
> PMD_MASK)
>  void mark_rodata_ro(void)
>  {
>  	unsigned long start = PFN_ALIGN(_text);
>  	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
>  	unsigned long end = (unsigned long)__end_rodata_hpage_align;
> -	unsigned long text_end = PFN_ALIGN(_etext);
> +	unsigned long text_end = PMD_ALIGN(_etext);

This should probably have more logic and adjustments. If etext is PMD
aligned, some of the stuff outside the diff won't do anything.

Also, if a kernel doesn't have modules or BPF JIT it would be a waste
of memory.

>  	unsigned long rodata_end = PFN_ALIGN(__end_rodata);
>  	unsigned long all_end;
>  
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 9212ff96b871..41509bbec583 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -75,6 +75,9 @@ static const bool vmap_allow_huge = false;
>  #define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
>  #define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
>  
> +static struct vm_struct text_tail_vm;
> +static struct vmap_area text_tail_va;
> +
>  bool is_vmalloc_addr(const void *x)
>  {
>  	unsigned long addr = (unsigned long)kasan_reset_tag(x);
> @@ -637,6 +640,8 @@ int is_vmalloc_or_module_addr(const void *x)
>  	unsigned long addr = (unsigned long)kasan_reset_tag(x);
>  	if (addr >= MODULES_VADDR && addr < MODULES_END)
>  		return 1;
> +	if (addr >= text_tail_va.va_start && addr <
> text_tail_va.va_end)
> +		return 1;
>  #endif
>  	return is_vmalloc_addr(x);
>  }
> @@ -2422,6 +2427,24 @@ static void vmap_init_free_space(void)
>  	}
>  }
>  
> +static void register_text_tail_vm(void)
> +{
> +	unsigned long start = PFN_ALIGN((unsigned long)_etext);
> +	unsigned long end = PMD_ALIGN((unsigned long)_etext);
> +	struct vmap_area *va;
> +
> +	va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> +	if (WARN_ON_ONCE(!va))
> +		return;
> +	text_tail_vm.addr = (void *)start;
> +	text_tail_vm.size = end - start;
> +	text_tail_va.va_start = start;
> +	text_tail_va.va_end = end;
> +	text_tail_va.vm = &text_tail_vm;
> +	memcpy(va, &text_tail_va, sizeof(*va));
> +	insert_vmap_area_augment(va, NULL, &free_text_area_root,
> &free_text_area_list);
> +}
> +
>  void __init vmalloc_init(void)
>  {
>  	struct vmap_area *va;
> @@ -2432,6 +2455,7 @@ void __init vmalloc_init(void)
>  	 * Create the cache for vmap_area objects.
>  	 */
>  	vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> +	register_text_tail_vm();
>  
>  	for_each_possible_cpu(i) {
>  		struct vmap_block_queue *vbq;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec
  2022-10-10 18:13   ` [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec Edgecombe, Rick P
@ 2022-10-10 19:04     ` Song Liu
  2022-10-10 19:59       ` Edgecombe, Rick P
  0 siblings, 1 reply; 24+ messages in thread
From: Song Liu @ 2022-10-10 19:04 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, linux-mm, song, hch, Kernel Team, peterz, urezki,
	akpm, x86, Hansen, Dave



> On Oct 10, 2022, at 11:13 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> How does this work with kasan?

With KASAN enabled, BPF works fine. But module load does hit some issues. 

I guess we are missing some kasan_*() calls?

Thanks,
Song


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-10 18:32   ` [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text Edgecombe, Rick P
@ 2022-10-10 19:08     ` Song Liu
  2022-10-10 20:09       ` Edgecombe, Rick P
  0 siblings, 1 reply; 24+ messages in thread
From: Song Liu @ 2022-10-10 19:08 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, linux-mm, song, hch, Kernel Team, peterz, urezki,
	akpm, x86, Hansen, Dave



> On Oct 10, 2022, at 11:32 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
>> On x86 kernel, we allocate 2MB pages for kernel text up to
>> round_down(_etext, 2MB). Therefore, some of the kernel text is still
>> on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
>> round_up(_etext, 2MB), and use the rest of the page for modules and
>> BPF programs.
>> 
>> Here is an example:
>> 
>> [root@eth50-1 ~]# grep _etext /proc/kallsyms
>> ffffffff82202a08 T _etext
>> 
>> [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
>> ffffffff8220f920 t
>> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
>> ffffffff8220fa28 t
>> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
>> ffffffff8220fad4 t
>> bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
>> 
>> [root@eth50-1 ~]#  grep 0xffffffff82200000
>> /sys/kernel/debug/page_tables/kernel
>> 0xffffffff82200000-
>> 0xffffffff82400000     2M     ro   PSE         x  pmd
>> 
>> [root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
>> ffffffff822ba910 t xfs_flush_inodes_worker      [xfs]
>> ffffffff822bc580 t xfs_flush_inodes     [xfs]
>> 
>> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text,
>> xfs
>> module, and bpf programs.
> 
> Can this memory range be freed as part of a vfree_exec() call then?
> Does vmalloc actually try to unmap it? If so, it could get complicated
> with PTI.
> 
> It probably should be a special case that never gets fully freed.

Right, this is never freed. 

> 
>> 
>> Signed-off-by: Song Liu <song@kernel.org>
>> ---
>> arch/x86/mm/init_64.c |  3 ++-
>> mm/vmalloc.c          | 24 ++++++++++++++++++++++++
>> 2 files changed, 26 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 0fe690ebc269..d94f196c541a 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -1367,12 +1367,13 @@ int __init
>> deferred_page_init_max_threads(const struct cpumask *node_cpumask)
>> 
>> int kernel_set_to_readonly;
>> 
>> +#define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) &
>> PMD_MASK)
>> void mark_rodata_ro(void)
>> {
>> 	unsigned long start = PFN_ALIGN(_text);
>> 	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
>> 	unsigned long end = (unsigned long)__end_rodata_hpage_align;
>> -	unsigned long text_end = PFN_ALIGN(_etext);
>> +	unsigned long text_end = PMD_ALIGN(_etext);
> 
> This should probably have more logic and adjustments. If etext is PMD
> aligned, some of the stuff outside the diff won't do anything.

Hmm.. I don't quite follow this comment. If the etext is PMD aligned, 
we can still use vmalloc_exec to allocate memory. So it shouldn't 
matter, no?

> 
> Also, if a kernel doesn't have modules or BPF JIT it would be a waste
> of memory.

I guess we can add a command line argument for these corner cases? 

Thanks,
Song

> 
>> 	unsigned long rodata_end = PFN_ALIGN(__end_rodata);
>> 	unsigned long all_end;
>> 
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 9212ff96b871..41509bbec583 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -75,6 +75,9 @@ static const bool vmap_allow_huge = false;
>> #define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
>> #define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
>> 
>> +static struct vm_struct text_tail_vm;
>> +static struct vmap_area text_tail_va;
>> +
>> bool is_vmalloc_addr(const void *x)
>> {
>> 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
>> @@ -637,6 +640,8 @@ int is_vmalloc_or_module_addr(const void *x)
>> 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
>> 	if (addr >= MODULES_VADDR && addr < MODULES_END)
>> 		return 1;
>> +	if (addr >= text_tail_va.va_start && addr <
>> text_tail_va.va_end)
>> +		return 1;
>> #endif
>> 	return is_vmalloc_addr(x);
>> }
>> @@ -2422,6 +2427,24 @@ static void vmap_init_free_space(void)
>> 	}
>> }
>> 
>> +static void register_text_tail_vm(void)
>> +{
>> +	unsigned long start = PFN_ALIGN((unsigned long)_etext);
>> +	unsigned long end = PMD_ALIGN((unsigned long)_etext);
>> +	struct vmap_area *va;
>> +
>> +	va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
>> +	if (WARN_ON_ONCE(!va))
>> +		return;
>> +	text_tail_vm.addr = (void *)start;
>> +	text_tail_vm.size = end - start;
>> +	text_tail_va.va_start = start;
>> +	text_tail_va.va_end = end;
>> +	text_tail_va.vm = &text_tail_vm;
>> +	memcpy(va, &text_tail_va, sizeof(*va));
>> +	insert_vmap_area_augment(va, NULL, &free_text_area_root,
>> &free_text_area_list);
>> +}
>> +
>> void __init vmalloc_init(void)
>> {
>> 	struct vmap_area *va;
>> @@ -2432,6 +2455,7 @@ void __init vmalloc_init(void)
>> 	 * Create the cache for vmap_area objects.
>> 	 */
>> 	vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
>> +	register_text_tail_vm();
>> 
>> 	for_each_possible_cpu(i) {
>> 		struct vmap_block_queue *vbq;


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec
  2022-10-10 19:04     ` Song Liu
@ 2022-10-10 19:59       ` Edgecombe, Rick P
  0 siblings, 0 replies; 24+ messages in thread
From: Edgecombe, Rick P @ 2022-10-10 19:59 UTC (permalink / raw)
  To: songliubraving
  Cc: linux-kernel, peterz, Kernel-team, linux-mm, song, hch, x86,
	akpm, Hansen, Dave, urezki

On Mon, 2022-10-10 at 19:04 +0000, Song Liu wrote:
> > On Oct 10, 2022, at 11:13 AM, Edgecombe, Rick P <
> > rick.p.edgecombe@intel.com> wrote:
> > 
> > How does this work with kasan?
> 
> With KASAN enabled, BPF works fine. But module load does hit some
> issues. 
> 
> I guess we are missing some kasan_*() calls?

I'm not sure what the desired behavior should be. Do you mark the
individual vmalloc_exec() areas as allocated/freed, or the bigger
allocations that contain them? Since this is text, it is going to be
getting mostly fetches which kasan is not going to check. Not sure
which is right, but there should probably be some specific kasan
behavior.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-10 19:08     ` Song Liu
@ 2022-10-10 20:09       ` Edgecombe, Rick P
  2022-10-11 16:25         ` Song Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Edgecombe, Rick P @ 2022-10-10 20:09 UTC (permalink / raw)
  To: songliubraving
  Cc: linux-kernel, peterz, Kernel-team, linux-mm, song, hch, x86,
	akpm, Hansen, Dave, urezki

On Mon, 2022-10-10 at 19:08 +0000, Song Liu wrote:
> > On Oct 10, 2022, at 11:32 AM, Edgecombe, Rick P <
> > rick.p.edgecombe@intel.com> wrote:
> > 
> > On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
> > > On x86 kernel, we allocate 2MB pages for kernel text up to
> > > round_down(_etext, 2MB). Therefore, some of the kernel text is
> > > still
> > > on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
> > > round_up(_etext, 2MB), and use the rest of the page for modules
> > > and
> > > BPF programs.
> > > 
> > > Here is an example:
> > > 
> > > [root@eth50-1 ~]# grep _etext /proc/kallsyms
> > > ffffffff82202a08 T _etext
> > > 
> > > [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
> > > ffffffff8220f920 t
> > > bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
> > > ffffffff8220fa28 t
> > > bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
> > > ffffffff8220fad4 t
> > > bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
> > > 
> > > [root@eth50-1 ~]#  grep 0xffffffff82200000
> > > /sys/kernel/debug/page_tables/kernel
> > > 0xffffffff82200000-
> > > 0xffffffff82400000     2M     ro   PSE         x  pmd
> > > 
> > > [root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
> > > ffffffff822ba910 t xfs_flush_inodes_worker      [xfs]
> > > ffffffff822bc580 t xfs_flush_inodes     [xfs]
> > > 
> > > ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel
> > > text,
> > > xfs
> > > module, and bpf programs.
> > 
> > Can this memory range be freed as part of a vfree_exec() call then?
> > Does vmalloc actually try to unmap it? If so, it could get
> > complicated
> > with PTI.
> > 
> > It probably should be a special case that never gets fully freed.
> 
> Right, this is never freed. 

Can we get a comment somewhere highlighting how this is avoided?

Maybe this is just me missing some vmalloc understanding, but this
pointer to an all zero vm_struct seems weird too. Are there other vmap
allocations like this? Which vmap APIs work with this and which don't?

> 
> > 
> > > 
> > > Signed-off-by: Song Liu <song@kernel.org>
> > > ---
> > > arch/x86/mm/init_64.c |  3 ++-
> > > mm/vmalloc.c          | 24 ++++++++++++++++++++++++
> > > 2 files changed, 26 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > > index 0fe690ebc269..d94f196c541a 100644
> > > --- a/arch/x86/mm/init_64.c
> > > +++ b/arch/x86/mm/init_64.c
> > > @@ -1367,12 +1367,13 @@ int __init
> > > deferred_page_init_max_threads(const struct cpumask
> > > *node_cpumask)
> > > 
> > > int kernel_set_to_readonly;
> > > 
> > > +#define PMD_ALIGN(x)        (((unsigned long)(x) + (PMD_SIZE -
> > > 1)) &
> > > PMD_MASK)
> > > void mark_rodata_ro(void)
> > > {
> > >       unsigned long start = PFN_ALIGN(_text);
> > >       unsigned long rodata_start = PFN_ALIGN(__start_rodata);
> > >       unsigned long end = (unsigned
> > > long)__end_rodata_hpage_align;
> > > -    unsigned long text_end = PFN_ALIGN(_etext);
> > > +    unsigned long text_end = PMD_ALIGN(_etext);
> > 
> > This should probably have more logic and adjustments. If etext is
> > PMD
> > aligned, some of the stuff outside the diff won't do anything.
> 
> Hmm.. I don't quite follow this comment. If the etext is PMD
> aligned, 
> we can still use vmalloc_exec to allocate memory. So it shouldn't 
> matter, no?

Maybe this doesn't matter since PMD alignment must happen naturally
sometimes. I was just noticing the attempts to operate on this region
between etext and start_rodata (free_init_pages(), etc). If this was
never not PMD aligned they could be dropped. But if you are going to
adjust the behavior for !CONFIG_MODULES, etc, then it is still needed.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-10 20:09       ` Edgecombe, Rick P
@ 2022-10-11 16:25         ` Song Liu
  2022-10-11 20:40           ` Edgecombe, Rick P
  0 siblings, 1 reply; 24+ messages in thread
From: Song Liu @ 2022-10-11 16:25 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, peterz, Kernel Team, linux-mm, song, hch, x86,
	akpm, Hansen, Dave, urezki



> On Oct 10, 2022, at 1:09 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Mon, 2022-10-10 at 19:08 +0000, Song Liu wrote:
>>> On Oct 10, 2022, at 11:32 AM, Edgecombe, Rick P <
>>> rick.p.edgecombe@intel.com> wrote:
>>> 
>>> On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
>>>> On x86 kernel, we allocate 2MB pages for kernel text up to
>>>> round_down(_etext, 2MB). Therefore, some of the kernel text is
>>>> still
>>>> on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
>>>> round_up(_etext, 2MB), and use the rest of the page for modules
>>>> and
>>>> BPF programs.
>>>> 
>>>> Here is an example:
>>>> 
>>>> [root@eth50-1 ~]# grep _etext /proc/kallsyms
>>>> ffffffff82202a08 T _etext
>>>> 
>>>> [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
>>>> ffffffff8220f920 t
>>>> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
>>>> ffffffff8220fa28 t
>>>> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
>>>> ffffffff8220fad4 t
>>>> bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
>>>> 
>>>> [root@eth50-1 ~]#  grep 0xffffffff82200000
>>>> /sys/kernel/debug/page_tables/kernel
>>>> 0xffffffff82200000-
>>>> 0xffffffff82400000     2M     ro   PSE         x  pmd
>>>> 
>>>> [root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
>>>> ffffffff822ba910 t xfs_flush_inodes_worker      [xfs]
>>>> ffffffff822bc580 t xfs_flush_inodes     [xfs]
>>>> 
>>>> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel
>>>> text,
>>>> xfs
>>>> module, and bpf programs.
>>> 
>>> Can this memory range be freed as part of a vfree_exec() call then?
>>> Does vmalloc actually try to unmap it? If so, it could get
>>> complicated
>>> with PTI.
>>> 
>>> It probably should be a special case that never gets fully freed.
>> 
>> Right, this is never freed. 
> 
> Can we get a comment somewhere highlighting how this is avoided?

vfree_exec() only frees the range when we get PMD aligned range >=
PMD_SIZE. Since this range is smaller than PMD_SIZe, it is never
freed. I will add a comment for this in the never version. 
 
> 
> Maybe this is just me missing some vmalloc understanding, but this
> pointer to an all zero vm_struct seems weird too. Are there other vmap
> allocations like this? Which vmap APIs work with this and which don't?

There are two vmap trees at the moment: free_area_ tree and 
vmap_area_ tree. free_area_ tree uses vmap->subtree_max_size, while 
vmap_area_ tree contains vmap backed by vm_struct, and thus uses 
vmap->vm. 

This set add a new tree, free_text_area_. This tree is different to 
the other two, as it uses subtree_max_size, and it is also backed 
by vm_struct. To handle this requirement without growing vmap_struct,
we introduced all_text_vm to store the vm_struct for free_text_area_
tree. 

free_text_area_ tree is different to vmap_area_ tree. Each vmap in
vmap_area_ tree has its own vm_struct (1 to 1 mapping), while 
multiple vmap in free_text_area_ tree map to a single vm_struct.

Also, free_text_area_ handles granularity < PAGE_SIZE; while the
other two trees only work with PAGE_SIZE aligned memory. 

Does this answer your questions? 

> 
>> 
>>> 
>>>> 
>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>> ---
>>>> arch/x86/mm/init_64.c |  3 ++-
>>>> mm/vmalloc.c          | 24 ++++++++++++++++++++++++
>>>> 2 files changed, 26 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>>> index 0fe690ebc269..d94f196c541a 100644
>>>> --- a/arch/x86/mm/init_64.c
>>>> +++ b/arch/x86/mm/init_64.c
>>>> @@ -1367,12 +1367,13 @@ int __init
>>>> deferred_page_init_max_threads(const struct cpumask
>>>> *node_cpumask)
>>>> 
>>>> int kernel_set_to_readonly;
>>>> 
>>>> +#define PMD_ALIGN(x)        (((unsigned long)(x) + (PMD_SIZE -
>>>> 1)) &
>>>> PMD_MASK)
>>>> void mark_rodata_ro(void)
>>>> {
>>>>      unsigned long start = PFN_ALIGN(_text);
>>>>      unsigned long rodata_start = PFN_ALIGN(__start_rodata);
>>>>      unsigned long end = (unsigned
>>>> long)__end_rodata_hpage_align;
>>>> -    unsigned long text_end = PFN_ALIGN(_etext);
>>>> +    unsigned long text_end = PMD_ALIGN(_etext);
>>> 
>>> This should probably have more logic and adjustments. If etext is
>>> PMD
>>> aligned, some of the stuff outside the diff won't do anything.
>> 
>> Hmm.. I don't quite follow this comment. If the etext is PMD
>> aligned, 
>> we can still use vmalloc_exec to allocate memory. So it shouldn't 
>> matter, no?
> 
> Maybe this doesn't matter since PMD alignment must happen naturally
> sometimes. I was just noticing the attempts to operate on this region
> between etext and start_rodata (free_init_pages(), etc). If this was
> never not PMD aligned they could be dropped. But if you are going to
> adjust the behavior for !CONFIG_MODULES, etc, then it is still needed.

I guess we can add special handling for !CONFIG_MODULES && !CONFIG_BPF
&& !CONFIG_FTRACE cases, where we will not allocate this memory. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-11 16:25         ` Song Liu
@ 2022-10-11 20:40           ` Edgecombe, Rick P
  2022-10-12  5:37             ` Song Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Edgecombe, Rick P @ 2022-10-11 20:40 UTC (permalink / raw)
  To: songliubraving
  Cc: linux-kernel, peterz, Kernel-team, linux-mm, song, hch, x86,
	akpm, Hansen, Dave, urezki

On Tue, 2022-10-11 at 16:25 +0000, Song Liu wrote:
> > Maybe this is just me missing some vmalloc understanding, but this
> > pointer to an all zero vm_struct seems weird too. Are there other
> > vmap
> > allocations like this? Which vmap APIs work with this and which
> > don't?
> 
> There are two vmap trees at the moment: free_area_ tree and 
> vmap_area_ tree. free_area_ tree uses vmap->subtree_max_size, while 
> vmap_area_ tree contains vmap backed by vm_struct, and thus uses 
> vmap->vm. 
> 
> This set add a new tree, free_text_area_. This tree is different to 
> the other two, as it uses subtree_max_size, and it is also backed 
> by vm_struct. To handle this requirement without growing vmap_struct,
> we introduced all_text_vm to store the vm_struct for free_text_area_
> tree. 
> 
> free_text_area_ tree is different to vmap_area_ tree. Each vmap in
> vmap_area_ tree has its own vm_struct (1 to 1 mapping), while 
> multiple vmap in free_text_area_ tree map to a single vm_struct.
> 
> Also, free_text_area_ handles granularity < PAGE_SIZE; while the
> other two trees only work with PAGE_SIZE aligned memory. 
> 
> Does this answer your questions? 

I mean from the perspective of someone trying to use this without
diving into the entire implementation.

The function is called vmalloc_exec() and is freed with vfree_exec().
Makes sense. But with the other vmallocs_foo's (including previous
vmalloc_exec() implementations) you can call find_vm_area(), etc on
them. They show in "vmallocinfo" and generally behave similarly. That
isn't true for these new allocations, right?

Then you have code that operates on module text like:
if (is_vmalloc_or_module_addr(addr))
	pfn = vmalloc_to_pfn(addr);

It looks like it would work (on x86 at least). Should it be expected
to?

Especially after this patch, where there is memory that isn't even
tracked by the original vmap_area trees, it is pretty much a separate
allocator. So I think it might be nice to spell out which other vmalloc
APIs work with these new functions since they are named "vmalloc".
Maybe just say none of them do.


Separate from that, I guess you are planning to make this limited to
certain architectures? It might be better to put logic with assumptions
about x86 boot time page table details inside arch/x86 somewhere.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-11 20:40           ` Edgecombe, Rick P
@ 2022-10-12  5:37             ` Song Liu
  2022-10-12 18:38               ` Edgecombe, Rick P
  0 siblings, 1 reply; 24+ messages in thread
From: Song Liu @ 2022-10-12  5:37 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Song Liu, linux-kernel, peterz, Kernel Team, linux-mm, song, hch,
	x86, akpm, Hansen, Dave, urezki



> On Oct 11, 2022, at 1:40 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Tue, 2022-10-11 at 16:25 +0000, Song Liu wrote:
>>> Maybe this is just me missing some vmalloc understanding, but this
>>> pointer to an all zero vm_struct seems weird too. Are there other
>>> vmap
>>> allocations like this? Which vmap APIs work with this and which
>>> don't?
>> 
>> There are two vmap trees at the moment: free_area_ tree and 
>> vmap_area_ tree. free_area_ tree uses vmap->subtree_max_size, while 
>> vmap_area_ tree contains vmap backed by vm_struct, and thus uses 
>> vmap->vm. 
>> 
>> This set add a new tree, free_text_area_. This tree is different to 
>> the other two, as it uses subtree_max_size, and it is also backed 
>> by vm_struct. To handle this requirement without growing vmap_struct,
>> we introduced all_text_vm to store the vm_struct for free_text_area_
>> tree. 
>> 
>> free_text_area_ tree is different to vmap_area_ tree. Each vmap in
>> vmap_area_ tree has its own vm_struct (1 to 1 mapping), while 
>> multiple vmap in free_text_area_ tree map to a single vm_struct.
>> 
>> Also, free_text_area_ handles granularity < PAGE_SIZE; while the
>> other two trees only work with PAGE_SIZE aligned memory. 
>> 
>> Does this answer your questions? 
> 
> I mean from the perspective of someone trying to use this without
> diving into the entire implementation.
> 
> The function is called vmalloc_exec() and is freed with vfree_exec().
> Makes sense. But with the other vmallocs_foo's (including previous
> vmalloc_exec() implementations) you can call find_vm_area(), etc on
> them. They show in "vmallocinfo" and generally behave similarly. That
> isn't true for these new allocations, right?

That's right. These operations are not supported (at least for now). 

> 
> Then you have code that operates on module text like:
> if (is_vmalloc_or_module_addr(addr))
> 	pfn = vmalloc_to_pfn(addr);
> 
> It looks like it would work (on x86 at least). Should it be expected
> to?
> 
> Especially after this patch, where there is memory that isn't even
> tracked by the original vmap_area trees, it is pretty much a separate
> allocator. So I think it might be nice to spell out which other vmalloc
> APIs work with these new functions since they are named "vmalloc".
> Maybe just say none of them do.

I guess it is fair to call this a separate allocator. Maybe 
vmalloc_exec is not the right name? I do think this is the best 
way to build an allocator with vmap tree logic. 

> 
> 
> Separate from that, I guess you are planning to make this limited to
> certain architectures? It might be better to put logic with assumptions
> about x86 boot time page table details inside arch/x86 somewhere.

Yes, the architecture need some text_poke mechanism to use this. 
On BPF side, x86_64 calls this directly from arch code (jit engine), 
so it is mostly covered. For modules, we need to handle this better. 

Thanks,
Song

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-12  5:37             ` Song Liu
@ 2022-10-12 18:38               ` Edgecombe, Rick P
  2022-10-12 19:01                 ` Song Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Edgecombe, Rick P @ 2022-10-12 18:38 UTC (permalink / raw)
  To: songliubraving
  Cc: linux-kernel, peterz, Kernel-team, linux-mm, song, hch, x86,
	akpm, Hansen, Dave, urezki

On Wed, 2022-10-12 at 05:37 +0000, Song Liu wrote:
> > Then you have code that operates on module text like:
> > if (is_vmalloc_or_module_addr(addr))
> >        pfn = vmalloc_to_pfn(addr);
> > 
> > It looks like it would work (on x86 at least). Should it be
> > expected
> > to?
> > 
> > Especially after this patch, where there is memory that isn't even
> > tracked by the original vmap_area trees, it is pretty much a
> > separate
> > allocator. So I think it might be nice to spell out which other
> > vmalloc
> > APIs work with these new functions since they are named "vmalloc".
> > Maybe just say none of them do.
> 
> I guess it is fair to call this a separate allocator. Maybe 
> vmalloc_exec is not the right name? I do think this is the best 
> way to build an allocator with vmap tree logic. 

Yea, I don't know about the name. I think someone else suggested it
specifically, right?

I had called mine perm_alloc() so it could also handle read-only and
other permissions. If you keep vmalloc_exec() it needs some big
comments about which APIs can work with it, and an audit of the
existing code that works on module and JIT text.

> 
> > 
> > 
> > Separate from that, I guess you are planning to make this limited
> > to
> > certain architectures? It might be better to put logic with
> > assumptions
> > about x86 boot time page table details inside arch/x86 somewhere.
> 
> Yes, the architecture need some text_poke mechanism to use this. 

It also depends on the space between _etext and the PMD aligned _etext
to be present and not get used by anything else. For other
architectures, there might be rodata there or other things.

> On BPF side, x86_64 calls this directly from arch code (jit engine), 
> so it is mostly covered. For modules, we need to handle this better. 

That old RFC has some ideas around this. I kind of like your
incremental approach though. To me it seems to be moving in the right
direction.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-12 18:38               ` Edgecombe, Rick P
@ 2022-10-12 19:01                 ` Song Liu
  0 siblings, 0 replies; 24+ messages in thread
From: Song Liu @ 2022-10-12 19:01 UTC (permalink / raw)
  To: Edgecombe, Rick P, Luis Chamberlain
  Cc: Song Liu, linux-kernel, peterz, Kernel Team, linux-mm, song, hch,
	x86, akpm, Hansen, Dave, urezki



> On Oct 12, 2022, at 11:38 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Wed, 2022-10-12 at 05:37 +0000, Song Liu wrote:
>>> Then you have code that operates on module text like:
>>> if (is_vmalloc_or_module_addr(addr))
>>>       pfn = vmalloc_to_pfn(addr);
>>> 
>>> It looks like it would work (on x86 at least). Should it be
>>> expected
>>> to?
>>> 
>>> Especially after this patch, where there is memory that isn't even
>>> tracked by the original vmap_area trees, it is pretty much a
>>> separate
>>> allocator. So I think it might be nice to spell out which other
>>> vmalloc
>>> APIs work with these new functions since they are named "vmalloc".
>>> Maybe just say none of them do.
>> 
>> I guess it is fair to call this a separate allocator. Maybe 
>> vmalloc_exec is not the right name? I do think this is the best 
>> way to build an allocator with vmap tree logic. 
> 
> Yea, I don't know about the name. I think someone else suggested it
> specifically, right?

I think Luis suggested rename module_alloc to vmalloc_exec. But I 
guess we still need module_alloc for module data allocations. 

> 
> I had called mine perm_alloc() so it could also handle read-only and
> other permissions.

What are other permissions that we use? We can probably duplicate
the free_text_are_ tree logic for other cases. 


> If you keep vmalloc_exec() it needs some big
> comments about which APIs can work with it, and an audit of the
> existing code that works on module and JIT text.
> 
>> 
>>> 
>>> 
>>> Separate from that, I guess you are planning to make this limited
>>> to
>>> certain architectures? It might be better to put logic with
>>> assumptions
>>> about x86 boot time page table details inside arch/x86 somewhere.
>> 
>> Yes, the architecture need some text_poke mechanism to use this. 
> 
> It also depends on the space between _etext and the PMD aligned _etext
> to be present and not get used by anything else. For other
> architectures, there might be rodata there or other things.

Good point! We need to make sure this part is not used by other things.

> 
>> On BPF side, x86_64 calls this directly from arch code (jit engine), 
>> so it is mostly covered. For modules, we need to handle this better. 
> 
> That old RFC has some ideas around this. I kind of like your
> incremental approach though. To me it seems to be moving in the right
> direction.

Thanks!
Song

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
       [not found] <20221007234315.2877365-1-song@kernel.org>
                   ` (2 preceding siblings ...)
       [not found] ` <20221007234315.2877365-5-song@kernel.org>
@ 2022-10-12 19:03 ` Song Liu
       [not found] ` <20221007234315.2877365-4-song@kernel.org>
  2022-10-17  7:26 ` [RFC v2 0/4] vmalloc_exec for modules and BPF programs Christoph Hellwig
  5 siblings, 0 replies; 24+ messages in thread
From: Song Liu @ 2022-10-12 19:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, lkml, Andrew Morton, X86 ML, Christoph Hellwig,
	Kernel Team, Edgecombe, Rick P, Hansen, Dave, Uladzislau Rezki,
	Luis Chamberlain, Song Liu

Hi Peter,

> On Oct 7, 2022, at 4:43 PM, Song Liu <song@kernel.org> wrote:
> 
> Changes RFC v1 => RFC v2:
> 1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
>   work fine with BPF programs (patch 1, 2, 4). But module side (patch 3)
>   still need some work.
> 
> This set is a prototype that allows dynamic kernel text (modules, bpf
> programs, various trampolines, etc.) to share huge pages. The idea is
> similar to Peter's suggestion in [1]. Please refer to each patch for
> more detais.
> 
> The ultimate goal is to only host kernel text in 2MB pages (for x86_64).
> 
> Please share your comments on this.
> 
> Thanks!
> 
> [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
> [2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/

Could you please share your comments on this version? Is this the 
right direction to move this work?

Thanks,
Song

> 
> Song Liu (4):
>  vmalloc: introduce vmalloc_exec and vfree_exec
>  bpf: use vmalloc_exec
>  modules, x86: use vmalloc_exec for module core
>  vmalloc_exec: share a huge page with kernel text
> 
> arch/x86/Kconfig              |   1 +
> arch/x86/kernel/alternative.c |  30 +++-
> arch/x86/kernel/module.c      |   1 +
> arch/x86/mm/init_64.c         |   3 +-
> include/linux/vmalloc.h       |   2 +
> kernel/bpf/core.c             | 155 ++----------------
> kernel/module/main.c          |  23 +--
> kernel/module/strict_rwx.c    |   3 -
> kernel/trace/ftrace.c         |   3 +-
> mm/nommu.c                    |   7 +
> mm/vmalloc.c                  | 296 ++++++++++++++++++++++++++++++++++
> 11 files changed, 358 insertions(+), 166 deletions(-)
> 
> --
> 2.30.2


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core
       [not found] ` <20221007234315.2877365-4-song@kernel.org>
@ 2022-10-14  3:48   ` Aaron Lu
  2022-10-14  6:07     ` Song Liu
  2022-10-14 15:42   ` Edgecombe, Rick P
  1 sibling, 1 reply; 24+ messages in thread
From: Aaron Lu @ 2022-10-14  3:48 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-mm, linux-kernel, akpm, x86, peterz, hch, kernel-team,
	rick.p.edgecombe, dave.hansen, urezki

On Fri, Oct 07, 2022 at 04:43:14PM -0700, Song Liu wrote:
> This is a prototype that allows modules to share 2MB text pages with other
> modules and BPF programs.
> 
> Current version only covers core_layout.
> ---
>  arch/x86/Kconfig              |  1 +
>  arch/x86/kernel/alternative.c | 30 ++++++++++++++++++++++++------
>  arch/x86/kernel/module.c      |  1 +
>  kernel/module/main.c          | 23 +++++++++++++----------
>  kernel/module/strict_rwx.c    |  3 ---
>  kernel/trace/ftrace.c         |  3 ++-
>  6 files changed, 41 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index f9920f1341c8..0b1ea05a1da6 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -91,6 +91,7 @@ config X86
>  	select ARCH_HAS_SET_DIRECT_MAP
>  	select ARCH_HAS_STRICT_KERNEL_RWX
>  	select ARCH_HAS_STRICT_MODULE_RWX
> +	select ARCH_WANTS_MODULES_DATA_IN_VMALLOC	if X86_64
>  	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>  	select ARCH_HAS_SYSCALL_WRAPPER
>  	select ARCH_HAS_UBSAN_SANITIZE_ALL
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 4f3204364caa..0e47a558c5bc 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -332,7 +332,13 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
>  
>  		DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
>  
> -		text_poke_early(instr, insn_buff, insn_buff_sz);
> +		if (system_state < SYSTEM_RUNNING) {
> +			text_poke_early(instr, insn_buff, insn_buff_sz);
> +		} else {
> +			mutex_lock(&text_mutex);
> +			text_poke(instr, insn_buff, insn_buff_sz);
> +			mutex_unlock(&text_mutex);
> +		}
>  
>  next:
>  		optimize_nops(instr, a->instrlen);
> @@ -503,7 +509,13 @@ void __init_or_module noinline apply_retpolines(s32 *start, s32 *end)
>  			optimize_nops(bytes, len);
>  			DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
>  			DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
> -			text_poke_early(addr, bytes, len);
> +			if (system_state == SYSTEM_BOOTING) {
> +				text_poke_early(addr, bytes, len);
> +			} else {
> +				mutex_lock(&text_mutex);
> +				text_poke(addr, bytes, len);
> +				mutex_unlock(&text_mutex);
> +			}
>  		}
>  	}
>  }
> @@ -568,7 +580,13 @@ void __init_or_module noinline apply_returns(s32 *start, s32 *end)
>  		if (len == insn.length) {
>  			DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
>  			DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
> -			text_poke_early(addr, bytes, len);
> +			if (unlikely(system_state == SYSTEM_BOOTING)) {
> +				text_poke_early(addr, bytes, len);
> +			} else {
> +				mutex_lock(&text_mutex);
> +				text_poke(addr, bytes, len);
> +				mutex_unlock(&text_mutex);
> +			}
>  		}
>  	}
>  }
> @@ -609,7 +627,7 @@ void __init_or_module noinline apply_ibt_endbr(s32 *start, s32 *end)
>  		 */
>  		DUMP_BYTES(((u8*)addr), 4, "%px: orig: ", addr);
>  		DUMP_BYTES(((u8*)&poison), 4, "%px: repl: ", addr);
> -		text_poke_early(addr, &poison, 4);
> +		text_poke(addr, &poison, 4);
>  	}
>  }
>  
> @@ -791,7 +809,7 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
>  
>  		/* Pad the rest with nops */
>  		add_nops(insn_buff + used, p->len - used);
> -		text_poke_early(p->instr, insn_buff, p->len);
> +		text_poke(p->instr, insn_buff, p->len);

Got below warning when booting a VM:

[    0.190098] ------------[ cut here ]------------
[    0.190377] WARNING: CPU: 0 PID: 0 at /home/aaron/linux/src/arch/x86/kernel/alternative.c:1224 text_poke+0x53/0x60
[    0.191083] Modules linked in:
[    0.191269] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.0.0-00004-gc49d19177d78 #5
[    0.191721] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[    0.192083] RIP: 0010:text_poke+0x53/0x60
[    0.192326] Code: c7 c7 20 e7 02 81 5b 5d e9 2a f8 ff ff be ff ff ff ff 48 c7 c7 b0 6d 06 83 48 89 14 24 e8 75 fd bf 00 85 c0 48 8b 14 24 75 c8 <0f> 0b eb c4 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 41 56
[    0.193083] RSP: 0000:ffffffff83003d60 EFLAGS: 00010246
[    0.194083] RAX: 0000000000000000 RBX: ffffffff810295b7 RCX: 0000000000000001
[    0.194506] RDX: 0000000000000006 RSI: ffffffff828b01c5 RDI: ffffffff8293898e
[    0.195083] RBP: ffffffff83003d82 R08: ffffffff82206520 R09: 0000000000000001
[    0.195506] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8a9949c0
[    0.195929] R13: ffffffff8a95f400 R14: 00000000ffffffff R15: 00000000ffffffff
[    0.196083] FS:  0000000000000000(0000) GS:ffff88842de00000(0000) knlGS:0000000000000000
[    0.196562] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.197083] CR2: ffff88843ffff000 CR3: 0000000003012001 CR4: 0000000000770ef0
[    0.197508] PKRU: 55555554
[    0.197673] Call Trace:
[    0.197822]  <TASK>
[    0.198084]  apply_paravirt+0xaf/0x150
[    0.198313]  ? __might_resched+0x3f/0x280
[    0.198557]  ? synchronize_rcu+0xe0/0x1c0
[    0.198799]  ? lock_release+0x230/0x450
[    0.199030]  ? _raw_spin_unlock_irqrestore+0x30/0x60
[    0.199083]  ? lockdep_hardirqs_on+0x79/0x100
[    0.199345]  ? _raw_spin_unlock_irqrestore+0x3b/0x60
[    0.199643]  ? atomic_notifier_chain_unregister+0x51/0x80
[    0.200084]  alternative_instructions+0x27/0xfa
[    0.200357]  check_bugs+0xe08/0xe82
[    0.200570]  start_kernel+0x692/0x6cc
[    0.200797]  secondary_startup_64_no_verify+0xe0/0xeb
[    0.201088]  </TASK>
[    0.201223] irq event stamp: 13575
[    0.201428] hardirqs last  enabled at (13583): [<ffffffff811193c2>] __up_console_sem+0x52/0x60
[    0.202083] hardirqs last disabled at (13592): [<ffffffff811193a7>] __up_console_sem+0x37/0x60
[    0.202594] softirqs last  enabled at (12762): [<ffffffff8117e169>] cgroup_idr_alloc.constprop.60+0x59/0x100
[    0.203083] softirqs last disabled at (12750): [<ffffffff8117e13d>] cgroup_idr_alloc.constprop.60+0x2d/0x100
[    0.203665] ---[ end trace 0000000000000000 ]---

Looks like it is also necessary to differentiate system_state in
apply_paravirt() like you did in the other apply_XXX() functions.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core
  2022-10-14  3:48   ` [RFC v2 3/4] modules, x86: use vmalloc_exec for module core Aaron Lu
@ 2022-10-14  6:07     ` Song Liu
  0 siblings, 0 replies; 24+ messages in thread
From: Song Liu @ 2022-10-14  6:07 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, akpm, x86, peterz, hch, kernel-team,
	rick.p.edgecombe, dave.hansen, urezki

On Thu, Oct 13, 2022 at 8:49 PM Aaron Lu <aaron.lu@intel.com> wrote:
>
> On Fri, Oct 07, 2022 at 04:43:14PM -0700, Song Liu wrote:
> > This is a prototype that allows modules to share 2MB text pages with other
> > modules and BPF programs.
> >
> > Current version only covers core_layout.
> > ---
> >  arch/x86/Kconfig              |  1 +
> >  arch/x86/kernel/alternative.c | 30 ++++++++++++++++++++++++------
> >  arch/x86/kernel/module.c      |  1 +
> >  kernel/module/main.c          | 23 +++++++++++++----------
> >  kernel/module/strict_rwx.c    |  3 ---
> >  kernel/trace/ftrace.c         |  3 ++-
> >  6 files changed, 41 insertions(+), 20 deletions(-)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index f9920f1341c8..0b1ea05a1da6 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -91,6 +91,7 @@ config X86
> >       select ARCH_HAS_SET_DIRECT_MAP
> >       select ARCH_HAS_STRICT_KERNEL_RWX
> >       select ARCH_HAS_STRICT_MODULE_RWX
> > +     select ARCH_WANTS_MODULES_DATA_IN_VMALLOC       if X86_64
> >       select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> >       select ARCH_HAS_SYSCALL_WRAPPER
> >       select ARCH_HAS_UBSAN_SANITIZE_ALL
> > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> > index 4f3204364caa..0e47a558c5bc 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -332,7 +332,13 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
> >
> >               DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
> >
> > -             text_poke_early(instr, insn_buff, insn_buff_sz);
> > +             if (system_state < SYSTEM_RUNNING) {
> > +                     text_poke_early(instr, insn_buff, insn_buff_sz);
> > +             } else {
> > +                     mutex_lock(&text_mutex);
> > +                     text_poke(instr, insn_buff, insn_buff_sz);
> > +                     mutex_unlock(&text_mutex);
> > +             }
> >
> >  next:
> >               optimize_nops(instr, a->instrlen);
> > @@ -503,7 +509,13 @@ void __init_or_module noinline apply_retpolines(s32 *start, s32 *end)
> >                       optimize_nops(bytes, len);
> >                       DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
> >                       DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
> > -                     text_poke_early(addr, bytes, len);
> > +                     if (system_state == SYSTEM_BOOTING) {
> > +                             text_poke_early(addr, bytes, len);
> > +                     } else {
> > +                             mutex_lock(&text_mutex);
> > +                             text_poke(addr, bytes, len);
> > +                             mutex_unlock(&text_mutex);
> > +                     }
> >               }
> >       }
> >  }
> > @@ -568,7 +580,13 @@ void __init_or_module noinline apply_returns(s32 *start, s32 *end)
> >               if (len == insn.length) {
> >                       DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
> >                       DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
> > -                     text_poke_early(addr, bytes, len);
> > +                     if (unlikely(system_state == SYSTEM_BOOTING)) {
> > +                             text_poke_early(addr, bytes, len);
> > +                     } else {
> > +                             mutex_lock(&text_mutex);
> > +                             text_poke(addr, bytes, len);
> > +                             mutex_unlock(&text_mutex);
> > +                     }
> >               }
> >       }
> >  }
> > @@ -609,7 +627,7 @@ void __init_or_module noinline apply_ibt_endbr(s32 *start, s32 *end)
> >                */
> >               DUMP_BYTES(((u8*)addr), 4, "%px: orig: ", addr);
> >               DUMP_BYTES(((u8*)&poison), 4, "%px: repl: ", addr);
> > -             text_poke_early(addr, &poison, 4);
> > +             text_poke(addr, &poison, 4);
> >       }
> >  }
> >
> > @@ -791,7 +809,7 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
> >
> >               /* Pad the rest with nops */
> >               add_nops(insn_buff + used, p->len - used);
> > -             text_poke_early(p->instr, insn_buff, p->len);
> > +             text_poke(p->instr, insn_buff, p->len);
>
> Got below warning when booting a VM:
>
> [    0.190098] ------------[ cut here ]------------
> [    0.190377] WARNING: CPU: 0 PID: 0 at /home/aaron/linux/src/arch/x86/kernel/alternative.c:1224 text_poke+0x53/0x60
> [    0.191083] Modules linked in:
> [    0.191269] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.0.0-00004-gc49d19177d78 #5
> [    0.191721] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> [    0.192083] RIP: 0010:text_poke+0x53/0x60
> [    0.192326] Code: c7 c7 20 e7 02 81 5b 5d e9 2a f8 ff ff be ff ff ff ff 48 c7 c7 b0 6d 06 83 48 89 14 24 e8 75 fd bf 00 85 c0 48 8b 14 24 75 c8 <0f> 0b eb c4 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 41 56
> [    0.193083] RSP: 0000:ffffffff83003d60 EFLAGS: 00010246
> [    0.194083] RAX: 0000000000000000 RBX: ffffffff810295b7 RCX: 0000000000000001
> [    0.194506] RDX: 0000000000000006 RSI: ffffffff828b01c5 RDI: ffffffff8293898e
> [    0.195083] RBP: ffffffff83003d82 R08: ffffffff82206520 R09: 0000000000000001
> [    0.195506] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8a9949c0
> [    0.195929] R13: ffffffff8a95f400 R14: 00000000ffffffff R15: 00000000ffffffff
> [    0.196083] FS:  0000000000000000(0000) GS:ffff88842de00000(0000) knlGS:0000000000000000
> [    0.196562] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.197083] CR2: ffff88843ffff000 CR3: 0000000003012001 CR4: 0000000000770ef0
> [    0.197508] PKRU: 55555554
> [    0.197673] Call Trace:
> [    0.197822]  <TASK>
> [    0.198084]  apply_paravirt+0xaf/0x150
> [    0.198313]  ? __might_resched+0x3f/0x280
> [    0.198557]  ? synchronize_rcu+0xe0/0x1c0
> [    0.198799]  ? lock_release+0x230/0x450
> [    0.199030]  ? _raw_spin_unlock_irqrestore+0x30/0x60
> [    0.199083]  ? lockdep_hardirqs_on+0x79/0x100
> [    0.199345]  ? _raw_spin_unlock_irqrestore+0x3b/0x60
> [    0.199643]  ? atomic_notifier_chain_unregister+0x51/0x80
> [    0.200084]  alternative_instructions+0x27/0xfa
> [    0.200357]  check_bugs+0xe08/0xe82
> [    0.200570]  start_kernel+0x692/0x6cc
> [    0.200797]  secondary_startup_64_no_verify+0xe0/0xeb
> [    0.201088]  </TASK>
> [    0.201223] irq event stamp: 13575
> [    0.201428] hardirqs last  enabled at (13583): [<ffffffff811193c2>] __up_console_sem+0x52/0x60
> [    0.202083] hardirqs last disabled at (13592): [<ffffffff811193a7>] __up_console_sem+0x37/0x60
> [    0.202594] softirqs last  enabled at (12762): [<ffffffff8117e169>] cgroup_idr_alloc.constprop.60+0x59/0x100
> [    0.203083] softirqs last disabled at (12750): [<ffffffff8117e13d>] cgroup_idr_alloc.constprop.60+0x2d/0x100
> [    0.203665] ---[ end trace 0000000000000000 ]---
>
> Looks like it is also necessary to differentiate system_state in
> apply_paravirt() like you did in the other apply_XXX() functions.

Thanks for the report! Somehow I didn't see this in my qemu vm.

Song

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core
       [not found] ` <20221007234315.2877365-4-song@kernel.org>
  2022-10-14  3:48   ` [RFC v2 3/4] modules, x86: use vmalloc_exec for module core Aaron Lu
@ 2022-10-14 15:42   ` Edgecombe, Rick P
  2022-10-14 18:26     ` Song Liu
  1 sibling, 1 reply; 24+ messages in thread
From: Edgecombe, Rick P @ 2022-10-14 15:42 UTC (permalink / raw)
  To: linux-kernel, linux-mm, song
  Cc: hch, kernel-team, peterz, urezki, akpm, x86, Hansen, Dave

On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index a4e4d84b6f4e..b44806e31a56 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -53,6 +53,7 @@
>  #include <linux/bsearch.h>
>  #include <linux/dynamic_debug.h>
>  #include <linux/audit.h>
> +#include <linux/bpf.h>
>  #include <uapi/linux/module.h>
>  #include "internal.h"
>  
> @@ -1203,7 +1204,7 @@ static void free_module(struct module *mod)
>         lockdep_free_key_range(mod->data_layout.base, mod-
> >data_layout.size);
>  
>         /* Finally, free the core (containing the module structure)
> */
> -       module_memfree(mod->core_layout.base);
> +       vfree_exec(mod->core_layout.base);
>  #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
>         vfree(mod->data_layout.base);
>  #endif
> @@ -1321,7 +1322,8 @@ static int simplify_symbols(struct module *mod,
> const struct load_info *info)
>                         ksym = resolve_symbol_wait(mod, info, name);
>                         /* Ok if resolved.  */
>                         if (ksym && !IS_ERR(ksym)) {
> -                               sym[i].st_value =
> kernel_symbol_value(ksym);
> +                               unsigned long val =
> kernel_symbol_value(ksym);
> +                               bpf_arch_text_copy(&sym[i].st_value,
> &val, sizeof(val));

Why bpf_arch_text_copy()? This of course won't work for other
architectures. So there needs to be fallback method. That RFC broke the
operation into two stages: Loading and finalized. When loading, on non-
x86 the writes would simply be to the allocation mapped as writable.
When it was finalized it changed it to it's final permission (RO, etc).
Then for x86 it does text_pokes() for the writes and has it RO from the
beginning.

I ended up needing a staging buffer for modules too, so that the code
could operate on it directly. I can't remember why that was, it might
be unneeded now since you moved data out of the core allocation.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core
  2022-10-14 15:42   ` Edgecombe, Rick P
@ 2022-10-14 18:26     ` Song Liu
  2022-10-14 18:41       ` Edgecombe, Rick P
  0 siblings, 1 reply; 24+ messages in thread
From: Song Liu @ 2022-10-14 18:26 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, linux-mm, song, hch, Kernel Team, peterz, urezki,
	akpm, x86, Hansen, Dave



> On Oct 14, 2022, at 8:42 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
>> diff --git a/kernel/module/main.c b/kernel/module/main.c
>> index a4e4d84b6f4e..b44806e31a56 100644
>> --- a/kernel/module/main.c
>> +++ b/kernel/module/main.c
>> @@ -53,6 +53,7 @@
>> #include <linux/bsearch.h>
>> #include <linux/dynamic_debug.h>
>> #include <linux/audit.h>
>> +#include <linux/bpf.h>
>> #include <uapi/linux/module.h>
>> #include "internal.h"
>> 
>> @@ -1203,7 +1204,7 @@ static void free_module(struct module *mod)
>>        lockdep_free_key_range(mod->data_layout.base, mod-
>>> data_layout.size);
>> 
>>        /* Finally, free the core (containing the module structure)
>> */
>> -       module_memfree(mod->core_layout.base);
>> +       vfree_exec(mod->core_layout.base);
>> #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
>>        vfree(mod->data_layout.base);
>> #endif
>> @@ -1321,7 +1322,8 @@ static int simplify_symbols(struct module *mod,
>> const struct load_info *info)
>>                        ksym = resolve_symbol_wait(mod, info, name);
>>                        /* Ok if resolved.  */
>>                        if (ksym && !IS_ERR(ksym)) {
>> -                               sym[i].st_value =
>> kernel_symbol_value(ksym);
>> +                               unsigned long val =
>> kernel_symbol_value(ksym);
>> +                               bpf_arch_text_copy(&sym[i].st_value,
>> &val, sizeof(val));
> 
> Why bpf_arch_text_copy()? This of course won't work for other
> architectures. So there needs to be fallback method. That RFC broke the
> operation into two stages: Loading and finalized. When loading, on non-
> x86 the writes would simply be to the allocation mapped as writable.
> When it was finalized it changed it to it's final permission (RO, etc).
> Then for x86 it does text_pokes() for the writes and has it RO from the
> beginning.

Yeah, this one (3/4) is really a prototype to show vmalloc_exec could 
work for modules (with a lot more work of course). And something to
replace bpf_arch_text_copy() is one of the issues we need to address in
the future. 

> 
> I ended up needing a staging buffer for modules too, so that the code
> could operate on it directly. I can't remember why that was, it might
> be unneeded now since you moved data out of the core allocation.

Both bpf_jit and bpf_dispather uses a staging buffer with bpf_prog_pack. 
The benefit of this approach is that it minimizes the number of 
text_poke/copy() calls. OTOH, it is quite a pain to make all the 
relative calls correct, as the staging buffer has different address to 
the final allocation. 

I think we may not need the staging buffer for modules, as module 
load/unload happens less often than BPF program JITs (so it is ok for 
it to be slightly slower). 

btw: I cannot take credit for split module data out of core allocation,
Christophe Leroy did the work. :)

Thanks,
Song

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core
  2022-10-14 18:26     ` Song Liu
@ 2022-10-14 18:41       ` Edgecombe, Rick P
  0 siblings, 0 replies; 24+ messages in thread
From: Edgecombe, Rick P @ 2022-10-14 18:41 UTC (permalink / raw)
  To: songliubraving, mcgrof
  Cc: linux-kernel, peterz, Kernel-team, linux-mm, song, hch, x86,
	akpm, Hansen, Dave, urezki

On Fri, 2022-10-14 at 18:26 +0000, Song Liu wrote:
> > On Oct 14, 2022, at 8:42 AM, Edgecombe, Rick P <
> > rick.p.edgecombe@intel.com> wrote:
> > 
> > On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
> > > diff --git a/kernel/module/main.c b/kernel/module/main.c
> > > index a4e4d84b6f4e..b44806e31a56 100644
> > > --- a/kernel/module/main.c
> > > +++ b/kernel/module/main.c
> > > @@ -53,6 +53,7 @@
> > > #include <linux/bsearch.h>
> > > #include <linux/dynamic_debug.h>
> > > #include <linux/audit.h>
> > > +#include <linux/bpf.h>
> > > #include <uapi/linux/module.h>
> > > #include "internal.h"
> > > 
> > > @@ -1203,7 +1204,7 @@ static void free_module(struct module *mod)
> > >         lockdep_free_key_range(mod->data_layout.base, mod-
> > > > data_layout.size);
> > > 
> > >         /* Finally, free the core (containing the module
> > > structure)
> > > */
> > > -       module_memfree(mod->core_layout.base);
> > > +       vfree_exec(mod->core_layout.base);
> > > #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> > >         vfree(mod->data_layout.base);
> > > #endif
> > > @@ -1321,7 +1322,8 @@ static int simplify_symbols(struct module
> > > *mod,
> > > const struct load_info *info)
> > >                         ksym = resolve_symbol_wait(mod, info,
> > > name);
> > >                         /* Ok if resolved.  */
> > >                         if (ksym && !IS_ERR(ksym)) {
> > > -                               sym[i].st_value =
> > > kernel_symbol_value(ksym);
> > > +                               unsigned long val =
> > > kernel_symbol_value(ksym);
> > > +                              
> > > bpf_arch_text_copy(&sym[i].st_value,
> > > &val, sizeof(val));
> > 
> > Why bpf_arch_text_copy()? This of course won't work for other
> > architectures. So there needs to be fallback method. That RFC broke
> > the
> > operation into two stages: Loading and finalized. When loading, on
> > non-
> > x86 the writes would simply be to the allocation mapped as
> > writable.
> > When it was finalized it changed it to it's final permission (RO,
> > etc).
> > Then for x86 it does text_pokes() for the writes and has it RO from
> > the
> > beginning.
> 
> Yeah, this one (3/4) is really a prototype to show vmalloc_exec
> could 
> work for modules (with a lot more work of course). And something to
> replace bpf_arch_text_copy() is one of the issues we need to address
> in
> the future. 

Right, I think making it work both with and without text_poke() is
needed to ever get it to work with modules. Since so much of it is
cross arch. Oops, it looks like we lost Luis on most of these
responses.

If we don't have that, then the modules RFC is kind of a distraction.
But bpf, kprobes and ftrace is still nice.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
       [not found] <20221007234315.2877365-1-song@kernel.org>
                   ` (4 preceding siblings ...)
       [not found] ` <20221007234315.2877365-4-song@kernel.org>
@ 2022-10-17  7:26 ` Christoph Hellwig
  2022-10-17 16:23   ` Song Liu
  5 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2022-10-17  7:26 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-mm, linux-kernel, akpm, x86, peterz, hch, kernel-team,
	rick.p.edgecombe, dave.hansen, urezki

On Fri, Oct 07, 2022 at 04:43:11PM -0700, Song Liu wrote:
> Changes RFC v1 => RFC v2:
> 1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
>    work fine with BPF programs (patch 1, 2, 4). But module side (patch 3)
>    still need some work.

Can you please move the changelog under the description of WTF the
series actually does like the normal kernel process?  Explaining the
changes from a previous version before you even describe what the series
does is completely incoherent.

> This set is a prototype that allows dynamic kernel text (modules, bpf
> programs, various trampolines, etc.) to share huge pages. The idea is
> similar to Peter's suggestion in [1]. Please refer to each patch for
> more detais.

Well, nothing explains what the method is to avoid having memory
that is mapped writable and executable at the same time, which really
could use some explanation here (and in the main patch as well).

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-17  7:26 ` [RFC v2 0/4] vmalloc_exec for modules and BPF programs Christoph Hellwig
@ 2022-10-17 16:23   ` Song Liu
  2022-10-18 14:50     ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Song Liu @ 2022-10-17 16:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Song Liu, Linux-MM, lkml, Andrew Morton, X86 ML, Peter Zijlstra,
	Kernel Team, Edgecombe, Rick P, Hansen, Dave, urezki

Hi Chritoph, 

> On Oct 17, 2022, at 12:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Fri, Oct 07, 2022 at 04:43:11PM -0700, Song Liu wrote:
>> Changes RFC v1 => RFC v2:
>> 1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
>>   work fine with BPF programs (patch 1, 2, 4). But module side (patch 3)
>>   still need some work.
> 
> Can you please move the changelog under the description of WTF the
> series actually does like the normal kernel process?  Explaining the
> changes from a previous version before you even describe what the series
> does is completely incoherent.

Will fix in the next version. 

> 
>> This set is a prototype that allows dynamic kernel text (modules, bpf
>> programs, various trampolines, etc.) to share huge pages. The idea is
>> similar to Peter's suggestion in [1]. Please refer to each patch for
>> more detais.
> 
> Well, nothing explains what the method is to avoid having memory
> that is mapped writable and executable at the same time, which really
> could use some explanation here (and in the main patch as well).

Thanks for the feedback. I will add this. 

Does the code look good to you? I personally think patch 1, 2, 4 could
ship with a little more work. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-17 16:23   ` Song Liu
@ 2022-10-18 14:50     ` Christoph Hellwig
  2022-10-18 15:05       ` Song Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2022-10-18 14:50 UTC (permalink / raw)
  To: Song Liu
  Cc: Christoph Hellwig, Song Liu, Linux-MM, lkml, Andrew Morton,
	X86 ML, Peter Zijlstra, Kernel Team, Edgecombe, Rick P, Hansen,
	Dave, urezki

On Mon, Oct 17, 2022 at 04:23:52PM +0000, Song Liu wrote:
> > Well, nothing explains what the method is to avoid having memory
> > that is mapped writable and executable at the same time, which really
> > could use some explanation here (and in the main patch as well).
> 
> Thanks for the feedback. I will add this. 
> 
> Does the code look good to you? I personally think patch 1, 2, 4 could
> ship with a little more work. 

I only took a quick look and I'm not sure how the W^X actually works.
Yes, it alls into the text poke helpers, but how do these work on
less than page sized allocations?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-18 14:50     ` Christoph Hellwig
@ 2022-10-18 15:05       ` Song Liu
  2022-10-18 15:40         ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Song Liu @ 2022-10-18 15:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Song Liu, Song Liu, Linux-MM, lkml, Andrew Morton, X86 ML,
	Peter Zijlstra, Kernel Team, Edgecombe, Rick P, Hansen, Dave,
	urezki



> On Oct 18, 2022, at 7:50 AM, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Mon, Oct 17, 2022 at 04:23:52PM +0000, Song Liu wrote:
>>> Well, nothing explains what the method is to avoid having memory
>>> that is mapped writable and executable at the same time, which really
>>> could use some explanation here (and in the main patch as well).
>> 
>> Thanks for the feedback. I will add this. 
>> 
>> Does the code look good to you? I personally think patch 1, 2, 4 could
>> ship with a little more work. 
> 
> I only took a quick look and I'm not sure how the W^X actually works.
> Yes, it alls into the text poke helpers, but how do these work on
> less than page sized allocations?

Aha, I guess I understand your point (and concern) now. 

It is the same as text poke into static kernel text: we create a local 
writable mapping to the memory we need to update. For less than page
sized allocation, this mapping does have access to X memory that may 
belong to a different allocation, just like text poke into static 
kernel text. 

Maybe we need something like vcopy_exec(x_mem, tmp_buf, size), where
we explicitly check the allowed memory of x_mem is bigger or equal to
size. And users of vmalloc_exec should only use vcopy_exec to update
memory from vmalloc_exec. 

Does this make sense? Did I understand your concern correctly? 

Thanks,
Song

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-18 15:05       ` Song Liu
@ 2022-10-18 15:40         ` Christoph Hellwig
  2022-10-18 15:40           ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2022-10-18 15:40 UTC (permalink / raw)
  To: Song Liu
  Cc: Christoph Hellwig, Song Liu, Linux-MM, lkml, Andrew Morton,
	X86 ML, Peter Zijlstra, Kernel Team, Edgecombe, Rick P, Hansen,
	Dave, urezki

On Tue, Oct 18, 2022 at 03:05:24PM +0000, Song Liu wrote:
> Maybe we need something like vcopy_exec(x_mem, tmp_buf, size), where
> we explicitly check the allowed memory of x_mem is bigger or equal to
> size. And users of vmalloc_exec should only use vcopy_exec to update
> memory from vmalloc_exec. 
> 
> Does this make sense? Did I understand your concern correctly? 

It sounds sensible.  Make sure it iѕ properly documented and reviewed
by people that actually know this area.  I just know enough to ask
stupid question, not to very something is actually correct :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-18 15:40         ` Christoph Hellwig
@ 2022-10-18 15:40           ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2022-10-18 15:40 UTC (permalink / raw)
  To: Song Liu
  Cc: Christoph Hellwig, Song Liu, Linux-MM, lkml, Andrew Morton,
	X86 ML, Peter Zijlstra, Kernel Team, Edgecombe, Rick P, Hansen,
	Dave, urezki

On Tue, Oct 18, 2022 at 05:40:20PM +0200, Christoph Hellwig wrote:
> It sounds sensible.  Make sure it iѕ properly documented and reviewed
> by people that actually know this area.  I just know enough to ask
> stupid question, not to very something is actually correct :)

s/very/verify/

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-10-18 15:40 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20221007234315.2877365-1-song@kernel.org>
2022-10-08  0:17 ` [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
     [not found] ` <20221007234315.2877365-2-song@kernel.org>
2022-10-10 18:13   ` [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec Edgecombe, Rick P
2022-10-10 19:04     ` Song Liu
2022-10-10 19:59       ` Edgecombe, Rick P
     [not found] ` <20221007234315.2877365-5-song@kernel.org>
2022-10-10 18:32   ` [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text Edgecombe, Rick P
2022-10-10 19:08     ` Song Liu
2022-10-10 20:09       ` Edgecombe, Rick P
2022-10-11 16:25         ` Song Liu
2022-10-11 20:40           ` Edgecombe, Rick P
2022-10-12  5:37             ` Song Liu
2022-10-12 18:38               ` Edgecombe, Rick P
2022-10-12 19:01                 ` Song Liu
2022-10-12 19:03 ` [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
     [not found] ` <20221007234315.2877365-4-song@kernel.org>
2022-10-14  3:48   ` [RFC v2 3/4] modules, x86: use vmalloc_exec for module core Aaron Lu
2022-10-14  6:07     ` Song Liu
2022-10-14 15:42   ` Edgecombe, Rick P
2022-10-14 18:26     ` Song Liu
2022-10-14 18:41       ` Edgecombe, Rick P
2022-10-17  7:26 ` [RFC v2 0/4] vmalloc_exec for modules and BPF programs Christoph Hellwig
2022-10-17 16:23   ` Song Liu
2022-10-18 14:50     ` Christoph Hellwig
2022-10-18 15:05       ` Song Liu
2022-10-18 15:40         ` Christoph Hellwig
2022-10-18 15:40           ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).