linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 bpf-next 0/5] bpf_prog_pack followup
@ 2022-06-24 21:57 Song Liu
  2022-06-24 21:57 ` [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Song Liu @ 2022-06-24 21:57 UTC (permalink / raw)
  To: linux-kernel, netdev, x86
  Cc: dave.hansen, mcgrof, rick.p.edgecombe, kernel-team, daniel, Song Liu

This set is the second half of v4 [1].

Changes v4 => v5:
1. Rebase and resolve conflicts due to module.c split.
2. Update experiment results (below).

For our web service production benchmark, bpf_prog_pack on 4kB pages
gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
bpf_prog_pack on 2MB pages 0.6% to 0.9% more throughput than not using
bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
believe this is also significant for other companies with many thousand
servers.

Update: Further experiments (suggested by Rick Edgecombe) showed that most
of benefit on the web service benchmark came from less direct map
fragmentation. The experiment is as follows:

Side A: 2MB bpf prog pack on a single 2MB page;
Side B: 2MB bpf prog pack on 512x 4kB pages;

The system only uses about 200kB for BPF programs, but 2MB is allocated
for bpf_prog_pack (for both A and B). Therefore, direct map fragmentation
caused by BPF programs is elminated, and we are only measuring the
performance difference of 1x 2MB page vs. ~50 4kB pages (we only use
about 50 out of the 512 pages). For these two sides, the difference in
system throughput is within the noise. I also measured iTLB-load-misses
caused by bpf programs, which is ~300/s for case A, and ~1600/s for case B.
The overall iTLB-load-misses is about 1.5M/s on these hosts. Therefore,
we can clearly see 2MB page reduces iTLB misses, but the difference is not
enough to have visible impact on system throughput.

Of course, the impact of iTLB miss will be more significant for systems
with more BPF programs loaded.

[1] https://lore.kernel.org/bpf/20220520235758.1858153-1-song@kernel.org/

Song Liu (5):
  module: introduce module_alloc_huge
  bpf: use module_alloc_huge for bpf_prog_pack
  vmalloc: WARN for set_vm_flush_reset_perms() on huge pages
  vmalloc: introduce huge_vmalloc_supported
  bpf: simplify select_bpf_prog_pack_size

 arch/x86/kernel/module.c     | 21 +++++++++++++++++++++
 include/linux/moduleloader.h |  5 +++++
 include/linux/vmalloc.h      |  7 +++++++
 kernel/bpf/core.c            | 25 ++++++++++---------------
 kernel/module/main.c         |  8 ++++++++
 mm/vmalloc.c                 |  5 +++++
 6 files changed, 56 insertions(+), 15 deletions(-)

--
2.30.2

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge
  2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
@ 2022-06-24 21:57 ` Song Liu
  2022-07-01 23:20   ` Luis Chamberlain
  2022-06-24 21:57 ` [PATCH v5 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2022-06-24 21:57 UTC (permalink / raw)
  To: linux-kernel, netdev, x86
  Cc: dave.hansen, mcgrof, rick.p.edgecombe, kernel-team, daniel, Song Liu

Introduce module_alloc_huge, which allocates huge page backed memory in
module memory space. The primary user of this memory is bpf_prog_pack
(multiple BPF programs sharing a huge page).

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/kernel/module.c     | 21 +++++++++++++++++++++
 include/linux/moduleloader.h |  5 +++++
 kernel/module/main.c         |  8 ++++++++
 3 files changed, 34 insertions(+)

diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b98ffcf4d250..63f6a16c70dc 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -86,6 +86,27 @@ void *module_alloc(unsigned long size)
 	return p;
 }
 
+void *module_alloc_huge(unsigned long size)
+{
+	gfp_t gfp_mask = GFP_KERNEL;
+	void *p;
+
+	if (PAGE_ALIGN(size) > MODULES_LEN)
+		return NULL;
+
+	p = __vmalloc_node_range(size, MODULE_ALIGN,
+				 MODULES_VADDR + get_module_load_offset(),
+				 MODULES_END, gfp_mask, PAGE_KERNEL,
+				 VM_DEFER_KMEMLEAK | VM_ALLOW_HUGE_VMAP,
+				 NUMA_NO_NODE, __builtin_return_address(0));
+	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
+		vfree(p);
+		return NULL;
+	}
+
+	return p;
+}
+
 #ifdef CONFIG_X86_32
 int apply_relocate(Elf32_Shdr *sechdrs,
 		   const char *strtab,
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 9e09d11ffe5b..d34743a88938 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -26,6 +26,11 @@ unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
    sections.  Returns NULL on failure. */
 void *module_alloc(unsigned long size);
 
+/* Allocator used for allocating memory in module memory space. If size is
+ * greater than PMD_SIZE, allow using huge pages. Returns NULL on failure.
+ */
+void *module_alloc_huge(unsigned long size);
+
 /* Free memory returned from module_alloc. */
 void module_memfree(void *module_region);
 
diff --git a/kernel/module/main.c b/kernel/module/main.c
index fed58d30725d..349b2a8bd20f 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1613,6 +1613,14 @@ void * __weak module_alloc(unsigned long size)
 			NUMA_NO_NODE, __builtin_return_address(0));
 }
 
+void * __weak module_alloc_huge(unsigned long size)
+{
+	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+				    GFP_KERNEL, PAGE_KERNEL_EXEC,
+				    VM_FLUSH_RESET_PERMS | VM_ALLOW_HUGE_VMAP,
+				    NUMA_NO_NODE, __builtin_return_address(0));
+}
+
 bool __weak module_init_section(const char *name)
 {
 	return strstarts(name, ".init");
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v5 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
  2022-06-24 21:57 ` [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
@ 2022-06-24 21:57 ` Song Liu
  2022-06-24 21:57 ` [PATCH v5 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages Song Liu
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Song Liu @ 2022-06-24 21:57 UTC (permalink / raw)
  To: linux-kernel, netdev, x86
  Cc: dave.hansen, mcgrof, rick.p.edgecombe, kernel-team, daniel, Song Liu

Use module_alloc_huge for bpf_prog_pack so that BPF programs sit on
PMD_SIZE pages. This benefits system performance by reducing iTLB miss
rate. Benchmark of a real web service workload shows this change gives
another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
(which improve system throughput by ~0.5%).

Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and use
set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]

[1] https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/core.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index f023cb399e3f..83f5a98517d5 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
 	void *ptr;
 
 	size = BPF_HPAGE_SIZE * num_online_nodes();
-	ptr = module_alloc(size);
+	ptr = module_alloc_huge(size);
 
 	/* Test whether we can get huge pages. If not just use PAGE_SIZE
 	 * packs.
@@ -881,7 +881,7 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
 		       GFP_KERNEL);
 	if (!pack)
 		return NULL;
-	pack->ptr = module_alloc(bpf_prog_pack_size);
+	pack->ptr = module_alloc_huge(bpf_prog_pack_size);
 	if (!pack->ptr) {
 		kfree(pack);
 		return NULL;
@@ -890,7 +890,6 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
 	bitmap_zero(pack->bitmap, bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE);
 	list_add_tail(&pack->list, &pack_list);
 
-	set_vm_flush_reset_perms(pack->ptr);
 	set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
 	set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
 	return pack;
@@ -909,10 +908,9 @@ static void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insn
 
 	if (size > bpf_prog_pack_size) {
 		size = round_up(size, PAGE_SIZE);
-		ptr = module_alloc(size);
+		ptr = module_alloc_huge(size);
 		if (ptr) {
 			bpf_fill_ill_insns(ptr, size);
-			set_vm_flush_reset_perms(ptr);
 			set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
 			set_memory_x((unsigned long)ptr, size / PAGE_SIZE);
 		}
@@ -949,6 +947,8 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 
 	mutex_lock(&pack_mutex);
 	if (hdr->size > bpf_prog_pack_size) {
+		set_memory_nx((unsigned long)hdr, hdr->size / PAGE_SIZE);
+		set_memory_rw((unsigned long)hdr, hdr->size / PAGE_SIZE);
 		module_memfree(hdr);
 		goto out;
 	}
@@ -975,6 +975,8 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 	if (bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0,
 				       bpf_prog_chunk_count(), 0) == 0) {
 		list_del(&pack->list);
+		set_memory_nx((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
+		set_memory_rw((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
 		module_memfree(pack->ptr);
 		kfree(pack);
 	}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v5 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages
  2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
  2022-06-24 21:57 ` [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
  2022-06-24 21:57 ` [PATCH v5 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
@ 2022-06-24 21:57 ` Song Liu
  2022-06-24 21:57 ` [PATCH v5 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported Song Liu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Song Liu @ 2022-06-24 21:57 UTC (permalink / raw)
  To: linux-kernel, netdev, x86
  Cc: dave.hansen, mcgrof, rick.p.edgecombe, kernel-team, daniel, Song Liu

VM_FLUSH_RESET_PERMS is not yet ready for huge pages, add a WARN to
catch misuse soon.

Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/vmalloc.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..59d3e1f3e108 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -239,6 +239,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
 {
 	struct vm_struct *vm = find_vm_area(addr);
 
+	WARN_ON_ONCE(is_vm_area_hugepages(addr));
 	if (vm)
 		vm->flags |= VM_FLUSH_RESET_PERMS;
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v5 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported
  2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
                   ` (2 preceding siblings ...)
  2022-06-24 21:57 ` [PATCH v5 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages Song Liu
@ 2022-06-24 21:57 ` Song Liu
  2022-06-24 21:57 ` [PATCH v5 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size Song Liu
  2022-06-24 22:00 ` [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
  5 siblings, 0 replies; 10+ messages in thread
From: Song Liu @ 2022-06-24 21:57 UTC (permalink / raw)
  To: linux-kernel, netdev, x86
  Cc: dave.hansen, mcgrof, rick.p.edgecombe, kernel-team, daniel, Song Liu

huge_vmalloc_supported() exposes vmap_allow_huge so that users of vmalloc
APIs could know whether vmalloc will return huge pages.

Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/vmalloc.h | 6 ++++++
 mm/vmalloc.c            | 5 +++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 59d3e1f3e108..aa2182959fc5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -243,11 +243,17 @@ static inline void set_vm_flush_reset_perms(void *addr)
 	if (vm)
 		vm->flags |= VM_FLUSH_RESET_PERMS;
 }
+bool huge_vmalloc_supported(void);
 
 #else
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 }
+
+static inline bool huge_vmalloc_supported(void)
+{
+	return false;
+}
 #endif
 
 /* for /proc/kcore */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index effd1ff6a4b4..0a5add4b5b2d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -72,6 +72,11 @@ early_param("nohugevmalloc", set_nohugevmalloc);
 static const bool vmap_allow_huge = false;
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
 
+bool huge_vmalloc_supported(void)
+{
+	return vmap_allow_huge;
+}
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v5 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size
  2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
                   ` (3 preceding siblings ...)
  2022-06-24 21:57 ` [PATCH v5 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported Song Liu
@ 2022-06-24 21:57 ` Song Liu
  2022-06-24 22:00 ` [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
  5 siblings, 0 replies; 10+ messages in thread
From: Song Liu @ 2022-06-24 21:57 UTC (permalink / raw)
  To: linux-kernel, netdev, x86
  Cc: dave.hansen, mcgrof, rick.p.edgecombe, kernel-team, daniel, Song Liu

Use huge_vmalloc_supported to simplify select_bpf_prog_pack_size, so that
we don't allocate some huge pages and free them immediately.

Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/core.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 83f5a98517d5..fbbfb7534979 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -854,22 +854,15 @@ static LIST_HEAD(pack_list);
 static size_t select_bpf_prog_pack_size(void)
 {
 	size_t size;
-	void *ptr;
-
-	size = BPF_HPAGE_SIZE * num_online_nodes();
-	ptr = module_alloc_huge(size);
 
-	/* Test whether we can get huge pages. If not just use PAGE_SIZE
-	 * packs.
-	 */
-	if (!ptr || !is_vm_area_hugepages(ptr)) {
+	if (huge_vmalloc_supported()) {
+		size = BPF_HPAGE_SIZE * num_online_nodes();
+		bpf_prog_pack_mask = BPF_HPAGE_MASK;
+	} else {
 		size = PAGE_SIZE;
 		bpf_prog_pack_mask = PAGE_MASK;
-	} else {
-		bpf_prog_pack_mask = BPF_HPAGE_MASK;
 	}
 
-	vfree(ptr);
 	return size;
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf_prog_pack followup
  2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
                   ` (4 preceding siblings ...)
  2022-06-24 21:57 ` [PATCH v5 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size Song Liu
@ 2022-06-24 22:00 ` Song Liu
  5 siblings, 0 replies; 10+ messages in thread
From: Song Liu @ 2022-06-24 22:00 UTC (permalink / raw)
  To: Song Liu
  Cc: lkml, Networking, Dave Hansen, Luis Chamberlain, Edgecombe,
	Rick P, Kernel Team, Daniel Borkmann, X86 ML

oops, wrong address for x86@. 

CC x86@kernel.org

> On Jun 24, 2022, at 2:57 PM, Song Liu <song@kernel.org> wrote:
> 
> This set is the second half of v4 [1].
> 
> Changes v4 => v5:
> 1. Rebase and resolve conflicts due to module.c split.
> 2. Update experiment results (below).
> 
> For our web service production benchmark, bpf_prog_pack on 4kB pages
> gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
> bpf_prog_pack on 2MB pages 0.6% to 0.9% more throughput than not using
> bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
> believe this is also significant for other companies with many thousand
> servers.
> 
> Update: Further experiments (suggested by Rick Edgecombe) showed that most
> of benefit on the web service benchmark came from less direct map
> fragmentation. The experiment is as follows:
> 
> Side A: 2MB bpf prog pack on a single 2MB page;
> Side B: 2MB bpf prog pack on 512x 4kB pages;
> 
> The system only uses about 200kB for BPF programs, but 2MB is allocated
> for bpf_prog_pack (for both A and B). Therefore, direct map fragmentation
> caused by BPF programs is elminated, and we are only measuring the
> performance difference of 1x 2MB page vs. ~50 4kB pages (we only use
> about 50 out of the 512 pages). For these two sides, the difference in
> system throughput is within the noise. I also measured iTLB-load-misses
> caused by bpf programs, which is ~300/s for case A, and ~1600/s for case B.
> The overall iTLB-load-misses is about 1.5M/s on these hosts. Therefore,
> we can clearly see 2MB page reduces iTLB misses, but the difference is not
> enough to have visible impact on system throughput.
> 
> Of course, the impact of iTLB miss will be more significant for systems
> with more BPF programs loaded.
> 
> [1] https://lore.kernel.org/bpf/20220520235758.1858153-1-song@kernel.org/
> 
> Song Liu (5):
>  module: introduce module_alloc_huge
>  bpf: use module_alloc_huge for bpf_prog_pack
>  vmalloc: WARN for set_vm_flush_reset_perms() on huge pages
>  vmalloc: introduce huge_vmalloc_supported
>  bpf: simplify select_bpf_prog_pack_size
> 
> arch/x86/kernel/module.c     | 21 +++++++++++++++++++++
> include/linux/moduleloader.h |  5 +++++
> include/linux/vmalloc.h      |  7 +++++++
> kernel/bpf/core.c            | 25 ++++++++++---------------
> kernel/module/main.c         |  8 ++++++++
> mm/vmalloc.c                 |  5 +++++
> 6 files changed, 56 insertions(+), 15 deletions(-)
> 
> --
> 2.30.2


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge
  2022-06-24 21:57 ` [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
@ 2022-07-01 23:20   ` Luis Chamberlain
  2022-07-06  4:39     ` Song Liu
  0 siblings, 1 reply; 10+ messages in thread
From: Luis Chamberlain @ 2022-07-01 23:20 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, netdev, x86, dave.hansen, rick.p.edgecombe,
	kernel-team, daniel

On Fri, Jun 24, 2022 at 02:57:08PM -0700, Song Liu wrote:
> Introduce module_alloc_huge, which allocates huge page backed memory in
> module memory space. The primary user of this memory is bpf_prog_pack
> (multiple BPF programs sharing a huge page).
> 
> Signed-off-by: Song Liu <song@kernel.org>

I see mm not Cc'd. I'd like review from them.

> ---
>  arch/x86/kernel/module.c     | 21 +++++++++++++++++++++
>  include/linux/moduleloader.h |  5 +++++
>  kernel/module/main.c         |  8 ++++++++
>  3 files changed, 34 insertions(+)
> 
> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> index b98ffcf4d250..63f6a16c70dc 100644
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -86,6 +86,27 @@ void *module_alloc(unsigned long size)
>  	return p;
>  }
>  
> +void *module_alloc_huge(unsigned long size)
> +{
> +	gfp_t gfp_mask = GFP_KERNEL;
> +	void *p;
> +
> +	if (PAGE_ALIGN(size) > MODULES_LEN)
> +		return NULL;
> +
> +	p = __vmalloc_node_range(size, MODULE_ALIGN,
> +				 MODULES_VADDR + get_module_load_offset(),
> +				 MODULES_END, gfp_mask, PAGE_KERNEL,
> +				 VM_DEFER_KMEMLEAK | VM_ALLOW_HUGE_VMAP,
> +				 NUMA_NO_NODE, __builtin_return_address(0));
> +	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
> +		vfree(p);
> +		return NULL;
> +	}
> +
> +	return p;
> +}

1) When things like kernel/bpf/core.c start using a module alloc it
   is time to consider genearlizing this.

2) How we free is important, and each arch does something funky for
   this. This is not addressed here.

And yes I welcome generalizing generic module_alloc() too as suggested
before. The concern on my part is the sloppiness this enables.

  Luis

> +
>  #ifdef CONFIG_X86_32
>  int apply_relocate(Elf32_Shdr *sechdrs,
>  		   const char *strtab,
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index 9e09d11ffe5b..d34743a88938 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -26,6 +26,11 @@ unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
>     sections.  Returns NULL on failure. */
>  void *module_alloc(unsigned long size);
>  
> +/* Allocator used for allocating memory in module memory space. If size is
> + * greater than PMD_SIZE, allow using huge pages. Returns NULL on failure.
> + */
> +void *module_alloc_huge(unsigned long size);
> +
>  /* Free memory returned from module_alloc. */
>  void module_memfree(void *module_region);
>  
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index fed58d30725d..349b2a8bd20f 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1613,6 +1613,14 @@ void * __weak module_alloc(unsigned long size)
>  			NUMA_NO_NODE, __builtin_return_address(0));
>  }
>  
> +void * __weak module_alloc_huge(unsigned long size)
> +{
> +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> +				    GFP_KERNEL, PAGE_KERNEL_EXEC,
> +				    VM_FLUSH_RESET_PERMS | VM_ALLOW_HUGE_VMAP,
> +				    NUMA_NO_NODE, __builtin_return_address(0));
> +}
> +
>  bool __weak module_init_section(const char *name)
>  {
>  	return strstarts(name, ".init");
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge
  2022-07-01 23:20   ` Luis Chamberlain
@ 2022-07-06  4:39     ` Song Liu
  2022-07-07 20:11       ` Luis Chamberlain
  0 siblings, 1 reply; 10+ messages in thread
From: Song Liu @ 2022-07-06  4:39 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Song Liu, lkml, netdev, x86, dave.hansen, rick.p.edgecombe,
	Kernel Team, daniel



> On Jul 1, 2022, at 4:20 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Fri, Jun 24, 2022 at 02:57:08PM -0700, Song Liu wrote:
>> Introduce module_alloc_huge, which allocates huge page backed memory in
>> module memory space. The primary user of this memory is bpf_prog_pack
>> (multiple BPF programs sharing a huge page).
>> 
>> Signed-off-by: Song Liu <song@kernel.org>
> 
> I see mm not Cc'd. I'd like review from them.

I will CC mm in the next version (or resend). Thanks for the reminder. 

> 
>> ---
>> arch/x86/kernel/module.c | 21 +++++++++++++++++++++
>> include/linux/moduleloader.h | 5 +++++
>> kernel/module/main.c | 8 ++++++++
>> 3 files changed, 34 insertions(+)
>> 
>> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
>> index b98ffcf4d250..63f6a16c70dc 100644
>> --- a/arch/x86/kernel/module.c
>> +++ b/arch/x86/kernel/module.c
>> @@ -86,6 +86,27 @@ void *module_alloc(unsigned long size)
>> 	return p;
>> }
>> 
>> +void *module_alloc_huge(unsigned long size)
>> +{
>> +	gfp_t gfp_mask = GFP_KERNEL;
>> +	void *p;
>> +
>> +	if (PAGE_ALIGN(size) > MODULES_LEN)
>> +		return NULL;
>> +
>> +	p = __vmalloc_node_range(size, MODULE_ALIGN,
>> +				 MODULES_VADDR + get_module_load_offset(),
>> +				 MODULES_END, gfp_mask, PAGE_KERNEL,
>> +				 VM_DEFER_KMEMLEAK | VM_ALLOW_HUGE_VMAP,
>> +				 NUMA_NO_NODE, __builtin_return_address(0));
>> +	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
>> +		vfree(p);
>> +		return NULL;
>> +	}
>> +
>> +	return p;
>> +}
> 
> 1) When things like kernel/bpf/core.c start using a module alloc it
> is time to consider genearlizing this.

I am not quite sure what the generalization would look like. IMHO, the
ideal case would have:
  a) A kernel_text_rw_allocator, similar to current module_alloc.
  b) A kernel_text_ro_allocator, similar to current bpf_prog_pack_alloc.
     This is built on top of kernel_text_rw_allocator. Different 
     allocations could share a page, thus it requires text_poke like 
     support from the arch. 
  c) If the arch supports text_poke, kprobes, ftrace trampolines, and
     bpf trampolines should use kernel_text_ro_allocator.
  d) Major archs should support CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC,
     and they should use kernel_text_ro_allocator for module text. 

Does this sound reasonable to you?

I tried to enable CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC for x86_64, 
but that doesn't really work. Do we have plan to make this combination
work?

> 
> 2) How we free is important, and each arch does something funky for
> this. This is not addressed here.

How should we address this? IIUC, x86_64 just calls vfree. 

> 
> And yes I welcome generalizing generic module_alloc() too as suggested
> before. The concern on my part is the sloppiness this enables.

One question I have is, does module_alloc (or kernel_text_*_allocator 
above) belong to module code, or mm code (maybe vmalloc)?

I am planning to let BPF trampoline use bpf_prog_pack on x86_64, which 
is another baby step of c) above. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge
  2022-07-06  4:39     ` Song Liu
@ 2022-07-07 20:11       ` Luis Chamberlain
  0 siblings, 0 replies; 10+ messages in thread
From: Luis Chamberlain @ 2022-07-07 20:11 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, lkml, netdev, x86, dave.hansen, rick.p.edgecombe,
	Kernel Team, daniel

On Wed, Jul 06, 2022 at 04:39:13AM +0000, Song Liu wrote:
> > On Jul 1, 2022, at 4:20 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> > On Fri, Jun 24, 2022 at 02:57:08PM -0700, Song Liu wrote:
> >> +void *module_alloc_huge(unsigned long size)
> >> +{
> >> +	gfp_t gfp_mask = GFP_KERNEL;
> >> +	void *p;
> >> +
> >> +	if (PAGE_ALIGN(size) > MODULES_LEN)
> >> +		return NULL;
> >> +
> >> +	p = __vmalloc_node_range(size, MODULE_ALIGN,
> >> +				 MODULES_VADDR + get_module_load_offset(),
> >> +				 MODULES_END, gfp_mask, PAGE_KERNEL,
> >> +				 VM_DEFER_KMEMLEAK | VM_ALLOW_HUGE_VMAP,
> >> +				 NUMA_NO_NODE, __builtin_return_address(0));
> >> +	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
> >> +		vfree(p);
> >> +		return NULL;
> >> +	}
> >> +
> >> +	return p;
> >> +}
> > 
> > 1) When things like kernel/bpf/core.c start using a module alloc it
> > is time to consider genearlizing this.
> 
> I am not quite sure what the generalization would look like. IMHO, the
> ideal case would have:
>   a) A kernel_text_rw_allocator, similar to current module_alloc.
>   b) A kernel_text_ro_allocator, similar to current bpf_prog_pack_alloc.
>      This is built on top of kernel_text_rw_allocator. Different 
>      allocations could share a page, thus it requires text_poke like 
>      support from the arch. 
>   c) If the arch supports text_poke, kprobes, ftrace trampolines, and
>      bpf trampolines should use kernel_text_ro_allocator.
>   d) Major archs should support CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC,
>      and they should use kernel_text_ro_allocator for module text. 
> 
> Does this sound reasonable to you?

Yes, and a respective free call may have an arch specific stuff which
removes exec stuff.

In so far as the bikeshedding, I do think this is generic so
vmalloc_text_*() suffices or vmalloc_exec_*() take your pick for
a starter and let the world throw in their preference.

> I tried to enable CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC for x86_64, 
> but that doesn't really work. Do we have plan to make this combination
> work?

Oh nice.

Good stuff. Perhaps it just requires a little love from mm folks.
Don't beat yourself up if it does not yet. We can work towards that
later.

> > 2) How we free is important, and each arch does something funky for
> > this. This is not addressed here.
> 
> How should we address this? IIUC, x86_64 just calls vfree. 

That's not the case for all archs is it? I'm talking about the generic
module_alloc() too. I'd like to see that go the way we discussed above.

> > And yes I welcome generalizing generic module_alloc() too as suggested
> > before. The concern on my part is the sloppiness this enables.
> 
> One question I have is, does module_alloc (or kernel_text_*_allocator 
> above) belong to module code, or mm code (maybe vmalloc)?

The evolution of its uses indicates it is growing outside of modules and
so mm should be the new home.

> I am planning to let BPF trampoline use bpf_prog_pack on x86_64, which 
> is another baby step of c) above. 

OK!

  Luis

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-07-07 20:11 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-24 21:57 [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu
2022-06-24 21:57 ` [PATCH v5 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
2022-07-01 23:20   ` Luis Chamberlain
2022-07-06  4:39     ` Song Liu
2022-07-07 20:11       ` Luis Chamberlain
2022-06-24 21:57 ` [PATCH v5 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
2022-06-24 21:57 ` [PATCH v5 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages Song Liu
2022-06-24 21:57 ` [PATCH v5 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported Song Liu
2022-06-24 21:57 ` [PATCH v5 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size Song Liu
2022-06-24 22:00 ` [PATCH v5 bpf-next 0/5] bpf_prog_pack followup Song Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).