* [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge
2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
@ 2022-06-24 21:57 ` Song Liu
2022-07-01 23:20 ` Luis Chamberlain
2022-06-24 21:57 ` [PATCH v5 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
` (4 subsequent siblings)
5 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2022-06-24 21:57 UTC (permalink / raw)
To: linux-kernel, netdev, x86
Cc: dave.hansen, mcgrof, rick.p.edgecombe, kernel-team, daniel, Song Liu
Introduce module_alloc_huge, which allocates huge page backed memory in
module memory space. The primary user of this memory is bpf_prog_pack
(multiple BPF programs sharing a huge page).
Signed-off-by: Song Liu <song@kernel.org>
---
arch/x86/kernel/module.c | 21 +++++++++++++++++++++
include/linux/moduleloader.h | 5 +++++
kernel/module/main.c | 8 ++++++++
3 files changed, 34 insertions(+)
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b98ffcf4d250..63f6a16c70dc 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -86,6 +86,27 @@ void *module_alloc(unsigned long size)
return p;
}
+void *module_alloc_huge(unsigned long size)
+{
+ gfp_t gfp_mask = GFP_KERNEL;
+ void *p;
+
+ if (PAGE_ALIGN(size) > MODULES_LEN)
+ return NULL;
+
+ p = __vmalloc_node_range(size, MODULE_ALIGN,
+ MODULES_VADDR + get_module_load_offset(),
+ MODULES_END, gfp_mask, PAGE_KERNEL,
+ VM_DEFER_KMEMLEAK | VM_ALLOW_HUGE_VMAP,
+ NUMA_NO_NODE, __builtin_return_address(0));
+ if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
+ vfree(p);
+ return NULL;
+ }
+
+ return p;
+}
+
#ifdef CONFIG_X86_32
int apply_relocate(Elf32_Shdr *sechdrs,
const char *strtab,
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 9e09d11ffe5b..d34743a88938 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -26,6 +26,11 @@ unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
sections. Returns NULL on failure. */
void *module_alloc(unsigned long size);
+/* Allocator used for allocating memory in module memory space. If size is
+ * greater than PMD_SIZE, allow using huge pages. Returns NULL on failure.
+ */
+void *module_alloc_huge(unsigned long size);
+
/* Free memory returned from module_alloc. */
void module_memfree(void *module_region);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index fed58d30725d..349b2a8bd20f 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1613,6 +1613,14 @@ void * __weak module_alloc(unsigned long size)
NUMA_NO_NODE, __builtin_return_address(0));
}
+void * __weak module_alloc_huge(unsigned long size)
+{
+ return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+ GFP_KERNEL, PAGE_KERNEL_EXEC,
+ VM_FLUSH_RESET_PERMS | VM_ALLOW_HUGE_VMAP,
+ NUMA_NO_NODE, __builtin_return_address(0));
+}
+
bool __weak module_init_section(const char *name)
{
return strstarts(name, ".init");
--
2.30.2
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge
2022-06-24 21:57 ` [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
@ 2022-07-01 23:20 ` Luis Chamberlain
2022-07-06 4:39 ` Song Liu
0 siblings, 1 reply; 10+ messages in thread
From: Luis Chamberlain @ 2022-07-01 23:20 UTC (permalink / raw)
To: Song Liu
Cc: linux-kernel, netdev, x86, dave.hansen, rick.p.edgecombe,
kernel-team, daniel
On Fri, Jun 24, 2022 at 02:57:08PM -0700, Song Liu wrote:
> Introduce module_alloc_huge, which allocates huge page backed memory in
> module memory space. The primary user of this memory is bpf_prog_pack
> (multiple BPF programs sharing a huge page).
>
> Signed-off-by: Song Liu <song@kernel.org>
I see mm not Cc'd. I'd like review from them.
> ---
> arch/x86/kernel/module.c | 21 +++++++++++++++++++++
> include/linux/moduleloader.h | 5 +++++
> kernel/module/main.c | 8 ++++++++
> 3 files changed, 34 insertions(+)
>
> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> index b98ffcf4d250..63f6a16c70dc 100644
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -86,6 +86,27 @@ void *module_alloc(unsigned long size)
> return p;
> }
>
> +void *module_alloc_huge(unsigned long size)
> +{
> + gfp_t gfp_mask = GFP_KERNEL;
> + void *p;
> +
> + if (PAGE_ALIGN(size) > MODULES_LEN)
> + return NULL;
> +
> + p = __vmalloc_node_range(size, MODULE_ALIGN,
> + MODULES_VADDR + get_module_load_offset(),
> + MODULES_END, gfp_mask, PAGE_KERNEL,
> + VM_DEFER_KMEMLEAK | VM_ALLOW_HUGE_VMAP,
> + NUMA_NO_NODE, __builtin_return_address(0));
> + if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
> + vfree(p);
> + return NULL;
> + }
> +
> + return p;
> +}
1) When things like kernel/bpf/core.c start using a module alloc it
is time to consider genearlizing this.
2) How we free is important, and each arch does something funky for
this. This is not addressed here.
And yes I welcome generalizing generic module_alloc() too as suggested
before. The concern on my part is the sloppiness this enables.
Luis
> +
> #ifdef CONFIG_X86_32
> int apply_relocate(Elf32_Shdr *sechdrs,
> const char *strtab,
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index 9e09d11ffe5b..d34743a88938 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -26,6 +26,11 @@ unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
> sections. Returns NULL on failure. */
> void *module_alloc(unsigned long size);
>
> +/* Allocator used for allocating memory in module memory space. If size is
> + * greater than PMD_SIZE, allow using huge pages. Returns NULL on failure.
> + */
> +void *module_alloc_huge(unsigned long size);
> +
> /* Free memory returned from module_alloc. */
> void module_memfree(void *module_region);
>
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index fed58d30725d..349b2a8bd20f 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1613,6 +1613,14 @@ void * __weak module_alloc(unsigned long size)
> NUMA_NO_NODE, __builtin_return_address(0));
> }
>
> +void * __weak module_alloc_huge(unsigned long size)
> +{
> + return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> + GFP_KERNEL, PAGE_KERNEL_EXEC,
> + VM_FLUSH_RESET_PERMS | VM_ALLOW_HUGE_VMAP,
> + NUMA_NO_NODE, __builtin_return_address(0));
> +}
> +
> bool __weak module_init_section(const char *name)
> {
> return strstarts(name, ".init");
> --
> 2.30.2
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge
2022-07-01 23:20 ` Luis Chamberlain
@ 2022-07-06 4:39 ` Song Liu
2022-07-07 20:11 ` Luis Chamberlain
0 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2022-07-06 4:39 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Song Liu, lkml, netdev, x86, dave.hansen, rick.p.edgecombe,
Kernel Team, daniel
> On Jul 1, 2022, at 4:20 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Fri, Jun 24, 2022 at 02:57:08PM -0700, Song Liu wrote:
>> Introduce module_alloc_huge, which allocates huge page backed memory in
>> module memory space. The primary user of this memory is bpf_prog_pack
>> (multiple BPF programs sharing a huge page).
>>
>> Signed-off-by: Song Liu <song@kernel.org>
>
> I see mm not Cc'd. I'd like review from them.
I will CC mm in the next version (or resend). Thanks for the reminder.
>
>> ---
>> arch/x86/kernel/module.c | 21 +++++++++++++++++++++
>> include/linux/moduleloader.h | 5 +++++
>> kernel/module/main.c | 8 ++++++++
>> 3 files changed, 34 insertions(+)
>>
>> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
>> index b98ffcf4d250..63f6a16c70dc 100644
>> --- a/arch/x86/kernel/module.c
>> +++ b/arch/x86/kernel/module.c
>> @@ -86,6 +86,27 @@ void *module_alloc(unsigned long size)
>> return p;
>> }
>>
>> +void *module_alloc_huge(unsigned long size)
>> +{
>> + gfp_t gfp_mask = GFP_KERNEL;
>> + void *p;
>> +
>> + if (PAGE_ALIGN(size) > MODULES_LEN)
>> + return NULL;
>> +
>> + p = __vmalloc_node_range(size, MODULE_ALIGN,
>> + MODULES_VADDR + get_module_load_offset(),
>> + MODULES_END, gfp_mask, PAGE_KERNEL,
>> + VM_DEFER_KMEMLEAK | VM_ALLOW_HUGE_VMAP,
>> + NUMA_NO_NODE, __builtin_return_address(0));
>> + if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
>> + vfree(p);
>> + return NULL;
>> + }
>> +
>> + return p;
>> +}
>
> 1) When things like kernel/bpf/core.c start using a module alloc it
> is time to consider genearlizing this.
I am not quite sure what the generalization would look like. IMHO, the
ideal case would have:
a) A kernel_text_rw_allocator, similar to current module_alloc.
b) A kernel_text_ro_allocator, similar to current bpf_prog_pack_alloc.
This is built on top of kernel_text_rw_allocator. Different
allocations could share a page, thus it requires text_poke like
support from the arch.
c) If the arch supports text_poke, kprobes, ftrace trampolines, and
bpf trampolines should use kernel_text_ro_allocator.
d) Major archs should support CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC,
and they should use kernel_text_ro_allocator for module text.
Does this sound reasonable to you?
I tried to enable CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC for x86_64,
but that doesn't really work. Do we have plan to make this combination
work?
>
> 2) How we free is important, and each arch does something funky for
> this. This is not addressed here.
How should we address this? IIUC, x86_64 just calls vfree.
>
> And yes I welcome generalizing generic module_alloc() too as suggested
> before. The concern on my part is the sloppiness this enables.
One question I have is, does module_alloc (or kernel_text_*_allocator
above) belong to module code, or mm code (maybe vmalloc)?
I am planning to let BPF trampoline use bpf_prog_pack on x86_64, which
is another baby step of c) above.
Thanks,
Song
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge
2022-07-06 4:39 ` Song Liu
@ 2022-07-07 20:11 ` Luis Chamberlain
0 siblings, 0 replies; 10+ messages in thread
From: Luis Chamberlain @ 2022-07-07 20:11 UTC (permalink / raw)
To: Song Liu
Cc: Song Liu, lkml, netdev, x86, dave.hansen, rick.p.edgecombe,
Kernel Team, daniel
On Wed, Jul 06, 2022 at 04:39:13AM +0000, Song Liu wrote:
> > On Jul 1, 2022, at 4:20 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> > On Fri, Jun 24, 2022 at 02:57:08PM -0700, Song Liu wrote:
> >> +void *module_alloc_huge(unsigned long size)
> >> +{
> >> + gfp_t gfp_mask = GFP_KERNEL;
> >> + void *p;
> >> +
> >> + if (PAGE_ALIGN(size) > MODULES_LEN)
> >> + return NULL;
> >> +
> >> + p = __vmalloc_node_range(size, MODULE_ALIGN,
> >> + MODULES_VADDR + get_module_load_offset(),
> >> + MODULES_END, gfp_mask, PAGE_KERNEL,
> >> + VM_DEFER_KMEMLEAK | VM_ALLOW_HUGE_VMAP,
> >> + NUMA_NO_NODE, __builtin_return_address(0));
> >> + if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
> >> + vfree(p);
> >> + return NULL;
> >> + }
> >> +
> >> + return p;
> >> +}
> >
> > 1) When things like kernel/bpf/core.c start using a module alloc it
> > is time to consider genearlizing this.
>
> I am not quite sure what the generalization would look like. IMHO, the
> ideal case would have:
> a) A kernel_text_rw_allocator, similar to current module_alloc.
> b) A kernel_text_ro_allocator, similar to current bpf_prog_pack_alloc.
> This is built on top of kernel_text_rw_allocator. Different
> allocations could share a page, thus it requires text_poke like
> support from the arch.
> c) If the arch supports text_poke, kprobes, ftrace trampolines, and
> bpf trampolines should use kernel_text_ro_allocator.
> d) Major archs should support CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC,
> and they should use kernel_text_ro_allocator for module text.
>
> Does this sound reasonable to you?
Yes, and a respective free call may have an arch specific stuff which
removes exec stuff.
In so far as the bikeshedding, I do think this is generic so
vmalloc_text_*() suffices or vmalloc_exec_*() take your pick for
a starter and let the world throw in their preference.
> I tried to enable CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC for x86_64,
> but that doesn't really work. Do we have plan to make this combination
> work?
Oh nice.
Good stuff. Perhaps it just requires a little love from mm folks.
Don't beat yourself up if it does not yet. We can work towards that
later.
> > 2) How we free is important, and each arch does something funky for
> > this. This is not addressed here.
>
> How should we address this? IIUC, x86_64 just calls vfree.
That's not the case for all archs is it? I'm talking about the generic
module_alloc() too. I'd like to see that go the way we discussed above.
> > And yes I welcome generalizing generic module_alloc() too as suggested
> > before. The concern on my part is the sloppiness this enables.
>
> One question I have is, does module_alloc (or kernel_text_*_allocator
> above) belong to module code, or mm code (maybe vmalloc)?
The evolution of its uses indicates it is growing outside of modules and
so mm should be the new home.
> I am planning to let BPF trampoline use bpf_prog_pack on x86_64, which
> is another baby step of c) above.
OK!
Luis
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v5 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack
2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
2022-06-24 21:57 ` [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
@ 2022-06-24 21:57 ` Song Liu
2022-06-24 21:57 ` [PATCH v5 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages Song Liu
` (3 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Song Liu @ 2022-06-24 21:57 UTC (permalink / raw)
To: linux-kernel, netdev, x86
Cc: dave.hansen, mcgrof, rick.p.edgecombe, kernel-team, daniel, Song Liu
Use module_alloc_huge for bpf_prog_pack so that BPF programs sit on
PMD_SIZE pages. This benefits system performance by reducing iTLB miss
rate. Benchmark of a real web service workload shows this change gives
another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
(which improve system throughput by ~0.5%).
Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and use
set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]
[1] https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
kernel/bpf/core.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index f023cb399e3f..83f5a98517d5 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
void *ptr;
size = BPF_HPAGE_SIZE * num_online_nodes();
- ptr = module_alloc(size);
+ ptr = module_alloc_huge(size);
/* Test whether we can get huge pages. If not just use PAGE_SIZE
* packs.
@@ -881,7 +881,7 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
GFP_KERNEL);
if (!pack)
return NULL;
- pack->ptr = module_alloc(bpf_prog_pack_size);
+ pack->ptr = module_alloc_huge(bpf_prog_pack_size);
if (!pack->ptr) {
kfree(pack);
return NULL;
@@ -890,7 +890,6 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
bitmap_zero(pack->bitmap, bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE);
list_add_tail(&pack->list, &pack_list);
- set_vm_flush_reset_perms(pack->ptr);
set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
return pack;
@@ -909,10 +908,9 @@ static void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insn
if (size > bpf_prog_pack_size) {
size = round_up(size, PAGE_SIZE);
- ptr = module_alloc(size);
+ ptr = module_alloc_huge(size);
if (ptr) {
bpf_fill_ill_insns(ptr, size);
- set_vm_flush_reset_perms(ptr);
set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
set_memory_x((unsigned long)ptr, size / PAGE_SIZE);
}
@@ -949,6 +947,8 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
mutex_lock(&pack_mutex);
if (hdr->size > bpf_prog_pack_size) {
+ set_memory_nx((unsigned long)hdr, hdr->size / PAGE_SIZE);
+ set_memory_rw((unsigned long)hdr, hdr->size / PAGE_SIZE);
module_memfree(hdr);
goto out;
}
@@ -975,6 +975,8 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
if (bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0,
bpf_prog_chunk_count(), 0) == 0) {
list_del(&pack->list);
+ set_memory_nx((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
+ set_memory_rw((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
module_memfree(pack->ptr);
kfree(pack);
}
--
2.30.2
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v5 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages
2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
2022-06-24 21:57 ` [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
2022-06-24 21:57 ` [PATCH v5 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
@ 2022-06-24 21:57 ` Song Liu
2022-06-24 21:57 ` [PATCH v5 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported Song Liu
` (2 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Song Liu @ 2022-06-24 21:57 UTC (permalink / raw)
To: linux-kernel, netdev, x86
Cc: dave.hansen, mcgrof, rick.p.edgecombe, kernel-team, daniel, Song Liu
VM_FLUSH_RESET_PERMS is not yet ready for huge pages, add a WARN to
catch misuse soon.
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
include/linux/vmalloc.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..59d3e1f3e108 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -239,6 +239,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
{
struct vm_struct *vm = find_vm_area(addr);
+ WARN_ON_ONCE(is_vm_area_hugepages(addr));
if (vm)
vm->flags |= VM_FLUSH_RESET_PERMS;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v5 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported
2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
` (2 preceding siblings ...)
2022-06-24 21:57 ` [PATCH v5 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages Song Liu
@ 2022-06-24 21:57 ` Song Liu
2022-06-24 21:57 ` [PATCH v5 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size Song Liu
2022-06-24 22:00 ` [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
5 siblings, 0 replies; 10+ messages in thread
From: Song Liu @ 2022-06-24 21:57 UTC (permalink / raw)
To: linux-kernel, netdev, x86
Cc: dave.hansen, mcgrof, rick.p.edgecombe, kernel-team, daniel, Song Liu
huge_vmalloc_supported() exposes vmap_allow_huge so that users of vmalloc
APIs could know whether vmalloc will return huge pages.
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
include/linux/vmalloc.h | 6 ++++++
mm/vmalloc.c | 5 +++++
2 files changed, 11 insertions(+)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 59d3e1f3e108..aa2182959fc5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -243,11 +243,17 @@ static inline void set_vm_flush_reset_perms(void *addr)
if (vm)
vm->flags |= VM_FLUSH_RESET_PERMS;
}
+bool huge_vmalloc_supported(void);
#else
static inline void set_vm_flush_reset_perms(void *addr)
{
}
+
+static inline bool huge_vmalloc_supported(void)
+{
+ return false;
+}
#endif
/* for /proc/kcore */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index effd1ff6a4b4..0a5add4b5b2d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -72,6 +72,11 @@ early_param("nohugevmalloc", set_nohugevmalloc);
static const bool vmap_allow_huge = false;
#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+bool huge_vmalloc_supported(void)
+{
+ return vmap_allow_huge;
+}
+
bool is_vmalloc_addr(const void *x)
{
unsigned long addr = (unsigned long)kasan_reset_tag(x);
--
2.30.2
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v5 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size
2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
` (3 preceding siblings ...)
2022-06-24 21:57 ` [PATCH v5 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported Song Liu
@ 2022-06-24 21:57 ` Song Liu
2022-06-24 22:00 ` [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
5 siblings, 0 replies; 10+ messages in thread
From: Song Liu @ 2022-06-24 21:57 UTC (permalink / raw)
To: linux-kernel, netdev, x86
Cc: dave.hansen, mcgrof, rick.p.edgecombe, kernel-team, daniel, Song Liu
Use huge_vmalloc_supported to simplify select_bpf_prog_pack_size, so that
we don't allocate some huge pages and free them immediately.
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
kernel/bpf/core.c | 15 ++++-----------
1 file changed, 4 insertions(+), 11 deletions(-)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 83f5a98517d5..fbbfb7534979 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -854,22 +854,15 @@ static LIST_HEAD(pack_list);
static size_t select_bpf_prog_pack_size(void)
{
size_t size;
- void *ptr;
-
- size = BPF_HPAGE_SIZE * num_online_nodes();
- ptr = module_alloc_huge(size);
- /* Test whether we can get huge pages. If not just use PAGE_SIZE
- * packs.
- */
- if (!ptr || !is_vm_area_hugepages(ptr)) {
+ if (huge_vmalloc_supported()) {
+ size = BPF_HPAGE_SIZE * num_online_nodes();
+ bpf_prog_pack_mask = BPF_HPAGE_MASK;
+ } else {
size = PAGE_SIZE;
bpf_prog_pack_mask = PAGE_MASK;
- } else {
- bpf_prog_pack_mask = BPF_HPAGE_MASK;
}
- vfree(ptr);
return size;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v5 bpf-next 0/5] bpf_prog_pack followup
2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
` (4 preceding siblings ...)
2022-06-24 21:57 ` [PATCH v5 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size Song Liu
@ 2022-06-24 22:00 ` Song Liu
5 siblings, 0 replies; 10+ messages in thread
From: Song Liu @ 2022-06-24 22:00 UTC (permalink / raw)
To: Song Liu
Cc: lkml, Networking, Dave Hansen, Luis Chamberlain, Edgecombe,
Rick P, Kernel Team, Daniel Borkmann, X86 ML
oops, wrong address for x86@.
CC x86@kernel.org
> On Jun 24, 2022, at 2:57 PM, Song Liu <song@kernel.org> wrote:
>
> This set is the second half of v4 [1].
>
> Changes v4 => v5:
> 1. Rebase and resolve conflicts due to module.c split.
> 2. Update experiment results (below).
>
> For our web service production benchmark, bpf_prog_pack on 4kB pages
> gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
> bpf_prog_pack on 2MB pages 0.6% to 0.9% more throughput than not using
> bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
> believe this is also significant for other companies with many thousand
> servers.
>
> Update: Further experiments (suggested by Rick Edgecombe) showed that most
> of benefit on the web service benchmark came from less direct map
> fragmentation. The experiment is as follows:
>
> Side A: 2MB bpf prog pack on a single 2MB page;
> Side B: 2MB bpf prog pack on 512x 4kB pages;
>
> The system only uses about 200kB for BPF programs, but 2MB is allocated
> for bpf_prog_pack (for both A and B). Therefore, direct map fragmentation
> caused by BPF programs is elminated, and we are only measuring the
> performance difference of 1x 2MB page vs. ~50 4kB pages (we only use
> about 50 out of the 512 pages). For these two sides, the difference in
> system throughput is within the noise. I also measured iTLB-load-misses
> caused by bpf programs, which is ~300/s for case A, and ~1600/s for case B.
> The overall iTLB-load-misses is about 1.5M/s on these hosts. Therefore,
> we can clearly see 2MB page reduces iTLB misses, but the difference is not
> enough to have visible impact on system throughput.
>
> Of course, the impact of iTLB miss will be more significant for systems
> with more BPF programs loaded.
>
> [1] https://lore.kernel.org/bpf/20220520235758.1858153-1-song@kernel.org/
>
> Song Liu (5):
> module: introduce module_alloc_huge
> bpf: use module_alloc_huge for bpf_prog_pack
> vmalloc: WARN for set_vm_flush_reset_perms() on huge pages
> vmalloc: introduce huge_vmalloc_supported
> bpf: simplify select_bpf_prog_pack_size
>
> arch/x86/kernel/module.c | 21 +++++++++++++++++++++
> include/linux/moduleloader.h | 5 +++++
> include/linux/vmalloc.h | 7 +++++++
> kernel/bpf/core.c | 25 ++++++++++---------------
> kernel/module/main.c | 8 ++++++++
> mm/vmalloc.c | 5 +++++
> 6 files changed, 56 insertions(+), 15 deletions(-)
>
> --
> 2.30.2
^ permalink raw reply [flat|nested] 10+ messages in thread