* [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
@ 2022-07-07 22:35 Song Liu
  2022-07-07 22:35 ` [PATCH v6 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Song Liu @ 2022-07-07 22:35 UTC (permalink / raw)
  To: bpf, linux-kernel, linux-mm
  Cc: daniel, kernel-team, x86, dave.hansen, rick.p.edgecombe, mcgrof,
	Song Liu

This set is the second half of v4 [1].

Changes v5 => v6:
1. Rebase and extend CC list.

Changes v4 => v5:
1. Rebase and resolve conflicts due to module.c split.
2. Update experiment results (below).

For our web service production benchmark, bpf_prog_pack on 4kB pages
gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
bpf_prog_pack on 2MB pages gives 0.6% to 0.9% more throughput than not
using bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
believe this is also significant for other companies with many thousands
of servers.

Update: Further experiments (suggested by Rick Edgecombe) showed that most
of the benefit on the web service benchmark came from reduced direct map
fragmentation. The experiment is as follows:

Side A: 2MB bpf prog pack on a single 2MB page;
Side B: 2MB bpf prog pack on 512x 4kB pages;

The system only uses about 200kB for BPF programs, but 2MB is allocated
for bpf_prog_pack (for both A and B). Therefore, direct map fragmentation
caused by BPF programs is eliminated, and we are only measuring the
performance difference of 1x 2MB page vs. ~50 4kB pages (we only use
about 50 out of the 512 pages). For these two sides, the difference in
system throughput is within the noise. I also measured iTLB-load-misses
caused by BPF programs, which is ~300/s for case A, and ~1600/s for case B.
The overall iTLB-load-misses rate is about 1.5M/s on these hosts. Therefore,
we can clearly see that the 2MB page reduces iTLB misses, but the difference
is not enough to have a visible impact on system throughput.

Of course, the impact of iTLB misses will be more significant for systems
with more BPF programs loaded.

[1] https://lore.kernel.org/bpf/20220520235758.1858153-1-song@kernel.org/

Song Liu (5):
  module: introduce module_alloc_huge
  bpf: use module_alloc_huge for bpf_prog_pack
  vmalloc: WARN for set_vm_flush_reset_perms() on huge pages
  vmalloc: introduce huge_vmalloc_supported
  bpf: simplify select_bpf_prog_pack_size

 arch/x86/kernel/module.c     | 21 +++++++++++++++++++++
 include/linux/moduleloader.h |  5 +++++
 include/linux/vmalloc.h      |  7 +++++++
 kernel/bpf/core.c            | 25 ++++++++++---------------
 kernel/module/main.c         |  8 ++++++++
 mm/vmalloc.c                 |  5 +++++
 6 files changed, 56 insertions(+), 15 deletions(-)

--
2.30.2

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v6 bpf-next 1/5] module: introduce module_alloc_huge
  2022-07-07 22:35 [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Song Liu
@ 2022-07-07 22:35 ` Song Liu
  2022-07-07 22:35 ` [PATCH v6 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2022-07-07 22:35 UTC (permalink / raw)
  To: bpf, linux-kernel, linux-mm
  Cc: daniel, kernel-team, x86, dave.hansen, rick.p.edgecombe, mcgrof,
	Song Liu

Introduce module_alloc_huge, which allocates huge page backed memory in
module memory space. The primary user of this memory is bpf_prog_pack
(multiple BPF programs sharing a huge page).

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/kernel/module.c     | 21 +++++++++++++++++++++
 include/linux/moduleloader.h |  5 +++++
 kernel/module/main.c         |  8 ++++++++
 3 files changed, 34 insertions(+)

diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b98ffcf4d250..63f6a16c70dc 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -86,6 +86,27 @@ void *module_alloc(unsigned long size)
 	return p;
 }
 
+void *module_alloc_huge(unsigned long size)
+{
+	gfp_t gfp_mask = GFP_KERNEL;
+	void *p;
+
+	if (PAGE_ALIGN(size) > MODULES_LEN)
+		return NULL;
+
+	p = __vmalloc_node_range(size, MODULE_ALIGN,
+				 MODULES_VADDR + get_module_load_offset(),
+				 MODULES_END, gfp_mask, PAGE_KERNEL,
+				 VM_DEFER_KMEMLEAK | VM_ALLOW_HUGE_VMAP,
+				 NUMA_NO_NODE, __builtin_return_address(0));
+	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
+		vfree(p);
+		return NULL;
+	}
+
+	return p;
+}
+
 #ifdef CONFIG_X86_32
 int apply_relocate(Elf32_Shdr *sechdrs,
 		   const char *strtab,
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 9e09d11ffe5b..d34743a88938 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -26,6 +26,11 @@ unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
    sections.  Returns NULL on failure. */
 void *module_alloc(unsigned long size);
 
+/* Allocator used for allocating memory in module memory space. If size is
+ * greater than PMD_SIZE, allow using huge pages. Returns NULL on failure.
+ */
+void *module_alloc_huge(unsigned long size);
+
 /* Free memory returned from module_alloc. */
 void module_memfree(void *module_region);
 
diff --git a/kernel/module/main.c b/kernel/module/main.c
index fed58d30725d..349b2a8bd20f 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1613,6 +1613,14 @@ void * __weak module_alloc(unsigned long size)
 			NUMA_NO_NODE, __builtin_return_address(0));
 }
 
+void * __weak module_alloc_huge(unsigned long size)
+{
+	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+				    GFP_KERNEL, PAGE_KERNEL_EXEC,
+				    VM_FLUSH_RESET_PERMS | VM_ALLOW_HUGE_VMAP,
+				    NUMA_NO_NODE, __builtin_return_address(0));
+}
+
 bool __weak module_init_section(const char *name)
 {
 	return strstarts(name, ".init");
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v6 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-07-07 22:35 [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Song Liu
  2022-07-07 22:35 ` [PATCH v6 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
@ 2022-07-07 22:35 ` Song Liu
  2022-07-07 22:35 ` [PATCH v6 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages Song Liu
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2022-07-07 22:35 UTC (permalink / raw)
  To: bpf, linux-kernel, linux-mm
  Cc: daniel, kernel-team, x86, dave.hansen, rick.p.edgecombe, mcgrof,
	Song Liu

Use module_alloc_huge for bpf_prog_pack so that BPF programs sit on
PMD_SIZE pages. This benefits system performance by reducing the iTLB miss
rate. A benchmark of a real web service workload shows this change gives
another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
(which improves system throughput by ~0.5%).

Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and use
set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]

[1] https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/core.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 805c2ad5c793..d1f32ac354d3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -860,7 +860,7 @@ static size_t select_bpf_prog_pack_size(void)
 	void *ptr;
 
 	size = BPF_HPAGE_SIZE * num_online_nodes();
-	ptr = module_alloc(size);
+	ptr = module_alloc_huge(size);
 
 	/* Test whether we can get huge pages. If not just use PAGE_SIZE
 	 * packs.
@@ -884,7 +884,7 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
 		       GFP_KERNEL);
 	if (!pack)
 		return NULL;
-	pack->ptr = module_alloc(bpf_prog_pack_size);
+	pack->ptr = module_alloc_huge(bpf_prog_pack_size);
 	if (!pack->ptr) {
 		kfree(pack);
 		return NULL;
@@ -893,7 +893,6 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
 	bitmap_zero(pack->bitmap, bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE);
 	list_add_tail(&pack->list, &pack_list);
 
-	set_vm_flush_reset_perms(pack->ptr);
 	set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
 	set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
 	return pack;
@@ -912,10 +911,9 @@ static void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insn
 
 	if (size > bpf_prog_pack_size) {
 		size = round_up(size, PAGE_SIZE);
-		ptr = module_alloc(size);
+		ptr = module_alloc_huge(size);
 		if (ptr) {
 			bpf_fill_ill_insns(ptr, size);
-			set_vm_flush_reset_perms(ptr);
 			set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
 			set_memory_x((unsigned long)ptr, size / PAGE_SIZE);
 		}
@@ -952,6 +950,8 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 
 	mutex_lock(&pack_mutex);
 	if (hdr->size > bpf_prog_pack_size) {
+		set_memory_nx((unsigned long)hdr, hdr->size / PAGE_SIZE);
+		set_memory_rw((unsigned long)hdr, hdr->size / PAGE_SIZE);
 		module_memfree(hdr);
 		goto out;
 	}
@@ -978,6 +978,8 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 	if (bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0,
 				       bpf_prog_chunk_count(), 0) == 0) {
 		list_del(&pack->list);
+		set_memory_nx((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
+		set_memory_rw((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
 		module_memfree(pack->ptr);
 		kfree(pack);
 	}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v6 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages
  2022-07-07 22:35 [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Song Liu
  2022-07-07 22:35 ` [PATCH v6 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
  2022-07-07 22:35 ` [PATCH v6 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
@ 2022-07-07 22:35 ` Song Liu
  2022-07-07 22:35 ` [PATCH v6 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported Song Liu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2022-07-07 22:35 UTC (permalink / raw)
  To: bpf, linux-kernel, linux-mm
  Cc: daniel, kernel-team, x86, dave.hansen, rick.p.edgecombe, mcgrof,
	Song Liu

VM_FLUSH_RESET_PERMS is not yet ready for huge pages. Add a WARN to
catch misuse early.

Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/vmalloc.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..59d3e1f3e108 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -239,6 +239,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
 {
 	struct vm_struct *vm = find_vm_area(addr);
 
+	WARN_ON_ONCE(is_vm_area_hugepages(addr));
 	if (vm)
 		vm->flags |= VM_FLUSH_RESET_PERMS;
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v6 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported
  2022-07-07 22:35 [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Song Liu
                   ` (2 preceding siblings ...)
  2022-07-07 22:35 ` [PATCH v6 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages Song Liu
@ 2022-07-07 22:35 ` Song Liu
  2022-07-07 22:35 ` [PATCH v6 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size Song Liu
  2022-07-07 22:59 ` [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Luis Chamberlain
  5 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2022-07-07 22:35 UTC (permalink / raw)
  To: bpf, linux-kernel, linux-mm
  Cc: daniel, kernel-team, x86, dave.hansen, rick.p.edgecombe, mcgrof,
	Song Liu

huge_vmalloc_supported() exposes vmap_allow_huge so that users of the
vmalloc APIs can know whether vmalloc will return huge pages.

Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/vmalloc.h | 6 ++++++
 mm/vmalloc.c            | 5 +++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 59d3e1f3e108..aa2182959fc5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -243,11 +243,17 @@ static inline void set_vm_flush_reset_perms(void *addr)
 	if (vm)
 		vm->flags |= VM_FLUSH_RESET_PERMS;
 }
+bool huge_vmalloc_supported(void);
 
 #else
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 }
+
+static inline bool huge_vmalloc_supported(void)
+{
+	return false;
+}
 #endif
 
 /* for /proc/kcore */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index effd1ff6a4b4..0a5add4b5b2d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -72,6 +72,11 @@ early_param("nohugevmalloc", set_nohugevmalloc);
 static const bool vmap_allow_huge = false;
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
 
+bool huge_vmalloc_supported(void)
+{
+	return vmap_allow_huge;
+}
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v6 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size
  2022-07-07 22:35 [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Song Liu
                   ` (3 preceding siblings ...)
  2022-07-07 22:35 ` [PATCH v6 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported Song Liu
@ 2022-07-07 22:35 ` Song Liu
  2022-07-07 22:59 ` [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Luis Chamberlain
  5 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2022-07-07 22:35 UTC (permalink / raw)
  To: bpf, linux-kernel, linux-mm
  Cc: daniel, kernel-team, x86, dave.hansen, rick.p.edgecombe, mcgrof,
	Song Liu

Use huge_vmalloc_supported to simplify select_bpf_prog_pack_size, so that
we don't allocate huge pages just to free them immediately.

Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/core.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d1f32ac354d3..e1f8d36fb95c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -857,22 +857,15 @@ static LIST_HEAD(pack_list);
 static size_t select_bpf_prog_pack_size(void)
 {
 	size_t size;
-	void *ptr;
-
-	size = BPF_HPAGE_SIZE * num_online_nodes();
-	ptr = module_alloc_huge(size);
 
-	/* Test whether we can get huge pages. If not just use PAGE_SIZE
-	 * packs.
-	 */
-	if (!ptr || !is_vm_area_hugepages(ptr)) {
+	if (huge_vmalloc_supported()) {
+		size = BPF_HPAGE_SIZE * num_online_nodes();
+		bpf_prog_pack_mask = BPF_HPAGE_MASK;
+	} else {
 		size = PAGE_SIZE;
 		bpf_prog_pack_mask = PAGE_MASK;
-	} else {
-		bpf_prog_pack_mask = BPF_HPAGE_MASK;
 	}
 
-	vfree(ptr);
 	return size;
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-07 22:35 [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Song Liu
                   ` (4 preceding siblings ...)
  2022-07-07 22:35 ` [PATCH v6 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size Song Liu
@ 2022-07-07 22:59 ` Luis Chamberlain
  2022-07-07 23:52   ` Song Liu
  5 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2022-07-07 22:59 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-kernel, linux-mm, daniel, kernel-team, x86,
	dave.hansen, rick.p.edgecombe

On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
> This set is the second half of v4 [1].
> 
> Changes v5 => v6:
> 1. Rebase and extend CC list.

Why post a new iteration so soon without completing the discussion we
had? It seems like we were at least going somewhere. If it's just
to include mm as I requested, sure, that's fine, but this does not
provide context as to what we last were talking about.

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-07 22:59 ` [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Luis Chamberlain
@ 2022-07-07 23:52   ` Song Liu
  2022-07-08  0:53     ` Luis Chamberlain
  0 siblings, 1 reply; 21+ messages in thread
From: Song Liu @ 2022-07-07 23:52 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Song Liu, bpf, lkml, Linux-MM, Daniel Borkmann, Kernel Team, x86,
	dave.hansen, rick.p.edgecombe



> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
>> This set is the second half of v4 [1].
>> 
>> Changes v5 => v6:
>> 1. Rebase and extend CC list.
> 
> Why post a new iteration so soon without completing the discussion we
> had? It seems like we were at least going somewhere. If it's just
> to include mm as I requested, sure, that's fine, but this does not
> provide context as to what we last were talking about.

Sorry for sending v6 too soon. The primary reason was to extend the CC
list and add it back to patchwork (v5 somehow got archived). 

Also, I think vmalloc_exec_ work would be a separate project, while this 
set is the followup work of bpf_prog_pack. Does this make sense? 

Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
more efficient to discuss this in person. 

Thanks,
Song

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-07 23:52   ` Song Liu
@ 2022-07-08  0:53     ` Luis Chamberlain
  2022-07-08  1:36       ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2022-07-08  0:53 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, bpf, lkml, Linux-MM, Daniel Borkmann, Kernel Team, x86,
	dave.hansen, rick.p.edgecombe

On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
> > On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> > 
> > On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
> >> This set is the second half of v4 [1].
> >> 
> >> Changes v5 => v6:
> >> 1. Rebase and extend CC list.
> > 
> > Why post a new iteration so soon without completing the discussion we
> > had? It seems like we were at least going somewhere. If it's just
> > to include mm as I requested, sure, that's fine, but this does not
> > provide context as to what we last were talking about.
> 
> Sorry for sending v6 too soon. The primary reason was to extend the CC
> list and add it back to patchwork (v5 somehow got archived). 
> 
> Also, I think vmalloc_exec_ work would be a separate project, while this 
> set is the followup work of bpf_prog_pack. Does this make sense? 
> 
> Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
> more efficient to discuss this in person. 

What we need is input from mm / arch folks. What is not done here is
explaining what that stuff we're talking about is, and so mm folks can't
guess. My preference is to address that.

I don't think in-person discussion is needed if the only folks
discussing this topic so far are just you and me.

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-08  0:53     ` Luis Chamberlain
@ 2022-07-08  1:36       ` Song Liu
  2022-07-08 15:58         ` Luis Chamberlain
  0 siblings, 1 reply; 21+ messages in thread
From: Song Liu @ 2022-07-08  1:36 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Song Liu, bpf, lkml, Linux-MM, Daniel Borkmann, Kernel Team, x86,
	dave.hansen, rick.p.edgecombe



> On Jul 7, 2022, at 5:53 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
>>> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>> 
>>> On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
>>>> This set is the second half of v4 [1].
>>>> 
>>>> Changes v5 => v6:
>>>> 1. Rebase and extend CC list.
>>> 
>>> Why post a new iteration so soon without completing the discussion we
>>> had? It seems like we were at least going somewhere. If it's just
>>> to include mm as I requested, sure, that's fine, but this does not
>>> provide context as to what we last were talking about.
>> 
>> Sorry for sending v6 too soon. The primary reason was to extend the CC
>> list and add it back to patchwork (v5 somehow got archived). 
>> 
>> Also, I think vmalloc_exec_ work would be a separate project, while this 
>> set is the followup work of bpf_prog_pack. Does this make sense? 
>> 
>> Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
>> more efficient to discuss this in person. 
> 
> What we need is input from mm / arch folks. What is not done here is
> what that stuff we're talking about is and so mm folks can't guess. My
> preference is to address that.
> 
> I don't think in person discussion is needed if the only folks
> discussing this topic so far is just you and me.

How about we start a thread with mm / arch folks for the vmalloc_exec_*
topic? I will summarize previous discussions and include pointers to 
these discussions. If necessary, we can continue the discussion at LPC.

OTOH, I guess the outcome of that discussion should not change this set? 
If we have concern about module_alloc_huge(), maybe we can have bpf code 
call vmalloc directly (until we have vmalloc_exec_)? 

What do you think about this plan?

Thanks,
Song

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-08  1:36       ` Song Liu
@ 2022-07-08 15:58         ` Luis Chamberlain
  2022-07-08 19:58           ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2022-07-08 15:58 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, bpf, lkml, Linux-MM, Daniel Borkmann, Kernel Team, x86,
	dave.hansen, rick.p.edgecombe

On Fri, Jul 08, 2022 at 01:36:25AM +0000, Song Liu wrote:
> 
> 
> > On Jul 7, 2022, at 5:53 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> > 
> > On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
> >>> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> >>> 
> >>> On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
> >>>> This set is the second half of v4 [1].
> >>>> 
> >>>> Changes v5 => v6:
> >>>> 1. Rebase and extend CC list.
> >>> 
> >>> Why post a new iteration so soon without completing the discussion we
> >>> had? It seems like we were at least going somewhere. If it's just
> >>> to include mm as I requested, sure, that's fine, but this does not
> >>> provide context as to what we last were talking about.
> >> 
> >> Sorry for sending v6 too soon. The primary reason was to extend the CC
> >> list and add it back to patchwork (v5 somehow got archived). 
> >> 
> >> Also, I think vmalloc_exec_ work would be a separate project, while this 
> >> set is the followup work of bpf_prog_pack. Does this make sense? 
> >> 
> >> Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
> >> more efficient to discuss this in person. 
> > 
> > What we need is input from mm / arch folks. What is not done here is
> > what that stuff we're talking about is and so mm folks can't guess. My
> > preference is to address that.
> > 
> > I don't think in person discussion is needed if the only folks
> > discussing this topic so far is just you and me.
> 
> How about we start a thread with mm / arch folks for the vmalloc_exec_*
> topic? I will summarize previous discussions and include pointers to 
> these discussions. If necessary, we can continue the discussion at LPC.

This sounds like a nice thread to use as this is why we are talking
about that topic.

> OTOH, I guess the outcome of that discussion should not change this set? 

If the above is done right then actually I think it would show similar
considerations for a respective free for module_alloc_huge().

> If we have concern about module_alloc_huge(), maybe we can have bpf code 
> call vmalloc directly (until we have vmalloc_exec_)? 

You'd need to then still open code in a similar way the same things
which we are trying to reach consensus on.

> What do you think about this plan?

I think we should strive to not be lazy and sloppy, and prevent growth
of sloppy code. So long as we do that I think this is all reasonable.

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-08 15:58         ` Luis Chamberlain
@ 2022-07-08 19:58           ` Song Liu
  2022-07-08 22:24             ` Luis Chamberlain
  0 siblings, 1 reply; 21+ messages in thread
From: Song Liu @ 2022-07-08 19:58 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Song Liu, bpf, lkml, Linux-MM, Daniel Borkmann, Kernel Team, x86,
	dave.hansen, rick.p.edgecombe



> On Jul 8, 2022, at 8:58 AM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Fri, Jul 08, 2022 at 01:36:25AM +0000, Song Liu wrote:
>> 
>> 
>>> On Jul 7, 2022, at 5:53 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>> 
>>> On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
>>>>> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>>>> 
>>>>> On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
>>>>>> This set is the second half of v4 [1].
>>>>>> 
>>>>>> Changes v5 => v6:
>>>>>> 1. Rebase and extend CC list.
>>>>> 
>>>>> Why post a new iteration so soon without completing the discussion we
>>>>> had? It seems like we were at least going somewhere. If it's just
>>>>> to include mm as I requested, sure, that's fine, but this does not
>>>>> provide context as to what we last were talking about.
>>>> 
>>>> Sorry for sending v6 too soon. The primary reason was to extend the CC
>>>> list and add it back to patchwork (v5 somehow got archived). 
>>>> 
>>>> Also, I think vmalloc_exec_ work would be a separate project, while this 
>>>> set is the followup work of bpf_prog_pack. Does this make sense? 
>>>> 
>>>> Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
>>>> more efficient to discuss this in person. 
>>> 
>>> What we need is input from mm / arch folks. What is not done here is
>>> what that stuff we're talking about is and so mm folks can't guess. My
>>> preference is to address that.
>>> 
>>> I don't think in person discussion is needed if the only folks
>>> discussing this topic so far is just you and me.
>> 
>> How about we start a thread with mm / arch folks for the vmalloc_exec_*
>> topic? I will summarize previous discussions and include pointers to 
>> these discussions. If necessary, we can continue the discussion at LPC.
> 
> This sounds like a nice thread to use as this is why we are talking
> about that topic.
> 
>> OTOH, I guess the outcome of that discussion should not change this set? 
> 
> If the above is done right then actually I think it would show similar
> considerations for a respective free for module_alloc_huge().
> 
>> If we have concern about module_alloc_huge(), maybe we can have bpf code 
>> call vmalloc directly (until we have vmalloc_exec_)? 
> 
> You'd need to then still open code in a similar way the same things
> which we are trying to reach consensus on.

>> What do you think about this plan?
> 
> I think we should strive to not be lazy and sloppy, and prevent growth
> of sloppy code. So long as we do that I think this is all reasoanble.

Let me try to understand your concerns here. Say we want module code
to be a temporary home for module_alloc_huge before we move it to mm
code. Would you think it is ready to ship if we:

1) Rename module_alloc_huge as module_alloc_text_huge();
2) Add module_free_text_huge();
3) Move set_memory_* and fill_ill_insn logic into module_alloc_text_huge()
  and module_free_text_huge(). 
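
Roughly, in code (a sketch only: these names follow 1)-3) above and are
not an existing API; the fill_ill_insns step is omitted and a
page-aligned size is assumed):

```
/* hypothetical, per 1)-3) above: the allocation side flips protections */
void *module_alloc_text_huge(unsigned long size)
{
	void *p = module_alloc_huge(size);	/* patch 1 of this set */

	if (p) {
		set_memory_ro((unsigned long)p, size / PAGE_SIZE);
		set_memory_x((unsigned long)p, size / PAGE_SIZE);
	}
	return p;
}

/* hypothetical: the free side resets protections before module_memfree() */
void module_free_text_huge(void *p, unsigned long size)
{
	set_memory_nx((unsigned long)p, size / PAGE_SIZE);
	set_memory_rw((unsigned long)p, size / PAGE_SIZE);
	module_memfree(p);
}
```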

Are these steps in the right direction? Did I miss anything important?

Thanks,
Song


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-08 19:58           ` Song Liu
@ 2022-07-08 22:24             ` Luis Chamberlain
  2022-07-09  1:14               ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2022-07-08 22:24 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, bpf, Christoph Hellwig, Davidlohr Bueso, lkml,
	Linux-MM, Daniel Borkmann, Kernel Team, x86, dave.hansen,
	rick.p.edgecombe, linux-modules

On Fri, Jul 08, 2022 at 07:58:44PM +0000, Song Liu wrote:
> 
> 
> > On Jul 8, 2022, at 8:58 AM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> > 
> > On Fri, Jul 08, 2022 at 01:36:25AM +0000, Song Liu wrote:
> >> 
> >> 
> >>> On Jul 7, 2022, at 5:53 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> >>> 
> >>> On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
> >>>>> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> >>>>> 
> >>>>> On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
> >>>>>> This set is the second half of v4 [1].
> >>>>>> 
> >>>>>> Changes v5 => v6:
> >>>>>> 1. Rebase and extend CC list.
> >>>>> 
> >>>>> Why post a new iteration so soon without completing the discussion we
> >>>>> had? It seems like we were at least going somewhere. If it's just
> >>>>> to include mm as I requested, sure, that's fine, but this does not
> >>>>> provide context as to what we last were talking about.
> >>>> 
> >>>> Sorry for sending v6 too soon. The primary reason was to extend the CC
> >>>> list and add it back to patchwork (v5 somehow got archived). 
> >>>> 
> >>>> Also, I think vmalloc_exec_ work would be a separate project, while this 
> >>>> set is the followup work of bpf_prog_pack. Does this make sense? 
> >>>> 
> >>>> Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
> >>>> more efficient to discuss this in person. 
> >>> 
> >>> What we need is input from mm / arch folks. What is not done here is
> >>> what that stuff we're talking about is and so mm folks can't guess. My
> >>> preference is to address that.
> >>> 
> >>> I don't think in person discussion is needed if the only folks
> >>> discussing this topic so far is just you and me.
> >> 
> >> How about we start a thread with mm / arch folks for the vmalloc_exec_*
> >> topic? I will summarize previous discussions and include pointers to 
> >> these discussions. If necessary, we can continue the discussion at LPC.
> > 
> > This sounds like a nice thread to use as this is why we are talking
> > about that topic.
> > 
> >> OTOH, I guess the outcome of that discussion should not change this set? 
> > 
> > If the above is done right then actually I think it would show similar
> > considerations for a respective free for module_alloc_huge().
> > 
> >> If we have concern about module_alloc_huge(), maybe we can have bpf code 
> >> call vmalloc directly (until we have vmalloc_exec_)? 
> > 
> > You'd need to then still open code in a similar way the same things
> > which we are trying to reach consensus on.
> 
> >> What do you think about this plan?
> > 
> > I think we should strive to not be lazy and sloppy, and prevent growth
> > of sloppy code. So long as we do that I think this is all reasonable.
> 
> Let me try to understand your concerns here. Say if we want module code
> to be a temporary home for module_alloc_huge before we move it to mm
> code. Would you think it is ready to ship if we:

Please CC Christoph and linux-modules@vger.kernel.org on future patches
and discussions around this, and all others now CC'd.

> 1) Rename module_alloc_huge as module_alloc_text_huge();

module_alloc_text_huge() is too long, but I've suggested names before
which are short and generic, and also suggested that if modules are
not the only users this needs to go outside of modules and so
vmalloc_text_huge() or whatever.

To do this right it begs the question why we don't do that for the
existing module_alloc(), as the users of this code are well outside of
modules now. Last time a similar generic name was used all the special
arch stuff was left to be done by the module code still, but still
non-modules were still using that allocator. From my perspective the
right thing to do is to deal with all the arch stuff as well in the
generic handler, and have the module code *and* the other users which
use module_alloc() to use that new caller as well.

> 2) Add module_free_text_huge();

Right, we have special handling for how we free this special code for regular
module_alloc() and so similar considerations would be needed here for
the huge stuff.

> 3) Move set_memory_* and fill_ill_insn logic into module_alloc_text_huge()
>   and module_free_text_huge(). 

Yes, that's a bit hairy now, and so a saner and consistent way to do
this would be best.

> Are these on the right direction? Did I miss anything important?

I've also hinted before that another way to help here is to draw
up a simple lib/test_vmalloc_text.c or something like that which would
enable a selftest to ensure correctness of this code on different archs
and maybe even let you do performance analysis using perf [0]. You have
good reasons to move to the huge allocator and the performance metrics
are an abstract load, however perf measurements can also give you real
raw data which you can reproduce and enable others to do similar
comparisons later.

The last thing I'd ask is just ensure you Cc folks who have already been in
these discussions.

[0] https://lkml.kernel.org/r/Yog+d+oR5TtPp2cs@bombadil.infradead.org

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-08 22:24             ` Luis Chamberlain
@ 2022-07-09  1:14               ` Song Liu
  2022-07-12  4:18                 ` Luis Chamberlain
  0 siblings, 1 reply; 21+ messages in thread
From: Song Liu @ 2022-07-09  1:14 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Song Liu, bpf, Christoph Hellwig, Davidlohr Bueso, lkml,
	Linux-MM, Daniel Borkmann, Kernel Team, x86, dave.hansen,
	rick.p.edgecombe, linux-modules



> On Jul 8, 2022, at 3:24 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Fri, Jul 08, 2022 at 07:58:44PM +0000, Song Liu wrote:
>> 
>> 
>>> On Jul 8, 2022, at 8:58 AM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>> 
>>> On Fri, Jul 08, 2022 at 01:36:25AM +0000, Song Liu wrote:
>>>> 
>>>> 
>>>>> On Jul 7, 2022, at 5:53 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>>>> 
>>>>> On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
>>>>>>> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>>>>>> 
>>>>>>> On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
>>>>>>>> This set is the second half of v4 [1].
>>>>>>>> 
>>>>>>>> Changes v5 => v6:
>>>>>>>> 1. Rebase and extend CC list.
>>>>>>> 
>>>>>>> Why post a new iteration so soon without completing the discussion we
>>>>>>> had? It seems like we were at least going somewhere. If it's just
>>>>>>> to include mm as I requested, sure, that's fine, but this does not
>>>>>>> provide context as to what we last were talking about.
>>>>>> 
>>>>>> Sorry for sending v6 too soon. The primary reason was to extend the CC
>>>>>> list and add it back to patchwork (v5 somehow got archived). 
>>>>>> 
>>>>>> Also, I think vmalloc_exec_ work would be a separate project, while this 
>>>>>> set is the followup work of bpf_prog_pack. Does this make sense? 
>>>>>> 
>>>>>> Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
>>>>>> more efficient to discuss this in person. 
>>>>> 
>>>>> What we need is input from mm / arch folks. What is not done here is
>>>>> what that stuff we're talking about is and so mm folks can't guess. My
>>>>> preference is to address that.
>>>>> 
>>>>> I don't think in person discussion is needed if the only folks
>>>>> discussing this topic so far is just you and me.
>>>> 
>>>> How about we start a thread with mm / arch folks for the vmalloc_exec_*
>>>> topic? I will summarize previous discussions and include pointers to 
>>>> these discussions. If necessary, we can continue the discussion at LPC.
>>> 
>>> This sounds like a nice thread to use as this is why we are talking
>>> about that topic.
>>> 
>>>> OTOH, I guess the outcome of that discussion should not change this set? 
>>> 
>>> If the above is done right then actually I think it would show similar
>>> considerations for a respective free for module_alloc_huge().
>>> 
>>>> If we have concern about module_alloc_huge(), maybe we can have bpf code 
>>>> call vmalloc directly (until we have vmalloc_exec_)? 
>>> 
>>> You'd need to then still open code in a similar way the same things
>>> which we are trying to reach consensus on.
>> 
>>>> What do you think about this plan?
>>> 
>>> I think we should strive to not be lazy and sloppy, and prevent growth
>>> of sloppy code. So long as we do that I think this is all reasonable.
>> 
>> Let me try to understand your concerns here. Say if we want module code
>> to be a temporary home for module_alloc_huge before we move it to mm
>> code. Would you think it is ready to ship if we:
> 
> Please CC Christoph and linux-modules@vger.kernel.org on future patches
> and discussions around this, and all others now CC'd.

Sometimes, vger drops my patch because the CC list is too long. That's 
the reason I often trim the CC list. I will try to keep folks in this
thread CC'ed. 

> 
>> 1) Rename module_alloc_huge as module_alloc_text_huge();
> 
> module_alloc_text_huge() is too long, but I've suggested names before
> which are short and generic, and also suggested that if modules are
> not the only users this needs to go outside of modules and so
> vmalloc_text_huge() or whatever.
> 
> To do this right it begs the question why we don't do that for the
> existing module_alloc(), as the users of this code is well outside of
> modules now. Last time a similar generic name was used all the special
> arch stuff was left to be done by the module code still, but still
> non-modules were still using that allocator. From my perspective the
> right thing to do is to deal with all the arch stuff as well in the
> generic handler, and have the module code *and* the other users which
> use module_alloc() to use that new caller as well.

The key difference between module_alloc() and the new API is that the
API will return RO+X memory, and the user needs a text-poke-like API to
modify this buffer. Archs that do not support text-poke will not be
able to use the new API. Does this sound like a reasonable design?
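
To make the contrast concrete, a rough sketch of the two usage patterns
(the second allocator name is a placeholder for the API under
discussion, not an existing function):

```
/* today: module_alloc() returns RW memory; the caller flips protections */
void *p = module_alloc(size);

memcpy(p, image, len);			/* plain write; mapping is still RW */
set_memory_ro((unsigned long)p, npages);
set_memory_x((unsigned long)p, npages);

/* proposed: allocation comes back RO+X; writes go through text poking */
void *q = new_ro_exec_alloc(size);	/* placeholder name, not a real API */

text_poke_copy(q, image, len);		/* no permission flips needed */
```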

> 
>> 2) Add module_free_text_huge();
> 
> Right, we have special handling for how we free this special code for regular
> module_alloc() and so similar considerations would be needed here for
> the huge stuff.
> 
>> 3) Move set_memory_* and fill_ill_insn logic into module_alloc_text_huge()
>> and module_free_text_huge(). 
> 
> Yes, that's a bit hairy now, and so a saner and consistent way to do
> this would be best.

Thanks for this information. I will try to go in this direction. 

> 
>> Are these on the right direction? Did I miss anything important?
> 
> I've also hinted before that another way to help here is to draw
> up a simple lib/test_vmalloc_text.c or something like that which would
> enable a selftest to ensure correctness of this code on different archs
> and maybe even let you do performance analysis using perf [0]. You have
> good reasons to move to the huge allocator and the performance metrics
> are an abstract load, however perf measurements can also give you real
> raw data which you can reproduce and enable others to do similar
> comparisons later.
> 
> The last thing I'd ask is just ensure you Cc folks who have already been in
> these discussions.
> 
> [0] https://lkml.kernel.org/r/Yog+d+oR5TtPp2cs@bombadil.infradead.org

Let me see how we can test it. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-09  1:14               ` Song Liu
@ 2022-07-12  4:18                 ` Luis Chamberlain
  2022-07-12  4:24                   ` Luis Chamberlain
  2022-07-12  5:49                   ` Song Liu
  0 siblings, 2 replies; 21+ messages in thread
From: Luis Chamberlain @ 2022-07-12  4:18 UTC (permalink / raw)
  To: Song Liu, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Masami Hiramatsu, Naveen N. Rao,
	David S. Miller, Anil S Keshavamurthy, Kees Cook
  Cc: Song Liu, bpf, Christoph Hellwig, Davidlohr Bueso, lkml,
	Linux-MM, Daniel Borkmann, Kernel Team, x86, dave.hansen,
	rick.p.edgecombe, linux-modules

On Sat, Jul 09, 2022 at 01:14:23AM +0000, Song Liu wrote:
> > On Jul 8, 2022, at 3:24 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> > 
> >> 1) Rename module_alloc_huge as module_alloc_text_huge();
> > 
> > module_alloc_text_huge() is too long, but I've suggested names before
> > which are short and generic, and also suggested that if modules are
> > not the only users this needs to go outside of modules and so
> > vmalloc_text_huge() or whatever.
> > 
> > To do this right it begs the question why we don't do that for the
> > existing module_alloc(), as the users of this code is well outside of
> > modules now. Last time a similar generic name was used all the special
> > arch stuff was left to be done by the module code still, but still
> > non-modules were still using that allocator. From my perspective the
> > right thing to do is to deal with all the arch stuff as well in the
> > generic handler, and have the module code *and* the other users which
> > use module_alloc() to use that new caller as well.
> 
> The key difference between module_alloc() and the new API is that the 
> API will return RO+X memory, and the user needs a text-poke-like API to
> modify this buffer. Archs that do not support text-poke will not be
> able to use the new API. Does this sound like a reasonable design?

I'm adding kprobe + ftrace folks.

I can't see why we need to *require* text_poke for just a
module_alloc_huge(). Enhancements on module_alloc() are just
enhancements, not requirements. So we have these for instance:

``` from arch/Kconfig
config ARCH_OPTIONAL_KERNEL_RWX
	def_bool n

config ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	def_bool n

config ARCH_HAS_STRICT_KERNEL_RWX
	def_bool n

config STRICT_KERNEL_RWX
	bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX
	depends on ARCH_HAS_STRICT_KERNEL_RWX
	default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	help
	  If this is set, kernel text and rodata memory will be made read-only,
	  and non-text memory will be made non-executable. This provides
	  protection against certain security exploits (e.g. executing the heap
	  or modifying text)

	  These features are considered standard security practice these days.
	  You should say Y here in almost all cases.

config ARCH_HAS_STRICT_MODULE_RWX
	def_bool n

config STRICT_MODULE_RWX
	bool "Set loadable kernel module data as NX and text as RO" if ARCH_OPTIONAL_KERNEL_RWX
	depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES
	default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	help
	  If this is set, module text and rodata memory will be made read-only,
	  and non-text memory will be made non-executable. This provides
	  protection against certain security exploits (e.g. writing to text)
```

With module_alloc() we have the above symbols to tell us when we *can*
support strict module rwx. So the way the kernel's modules are allocated
and used is:

for each module section:
	module_alloc()
module_enable_ro()
module_enable_nx()
module_enable_x()

The above can be read in the code as:

load_module() -->
	layout_and_allocate()
	complete_formation()

Then there is the consideration of set_vm_flush_reset_perms() for
freeing. In the module code we use this for the RO+X stuff (core_layout,
init_layout), but now that is a bit obfuscated due to the placement of
the call. It would seem the other users use it for the same:

 * ebpf
 * kprobes
 * ftrace

I believe you are mentioning requiring text_poke() because the way
eBPF code uses module_alloc() is different. Correct me if I'm
wrong, but from what I gather you use text_poke_copy() as the data
is already RO+X, contrary to module_alloc() use cases. You do this since your
bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
module_alloc() and before you can use this memory. This is a different type
of allocator. And, again please correct me if I'm wrong, but now you want to
share *one* 2 MiB huge page for multiple BPF programs to help with the
impact of TLB misses.

A vmalloc_ro_exec() by definition would imply a text_poke().

Can kprobes, ftrace and modules use it too? It would be nice
so as not to have to deal with the loose semantics of the user
having to call set_vm_flush_reset_perms() on RO+X memory later, but
I think this can be addressed separately on a case-by-case basis.

But a vmalloc_ro_exec() with a respective free can remove the
requirement to do set_vm_flush_reset_perms().
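
For reference, a condensed sketch of the pattern that would go away
(this is how module_alloc() users handle RO+X today; error handling
omitted):

```
void *p = module_alloc(size);

set_vm_flush_reset_perms(p);	/* have vfree() reset W^X and flush TLBs */
set_memory_ro((unsigned long)p, npages);
set_memory_x((unsigned long)p, npages);
/* ... use the RO+X memory ... */
vfree(p);			/* safe because of VM_FLUSH_RESET_PERMS */
```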

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-12  4:18                 ` Luis Chamberlain
@ 2022-07-12  4:24                   ` Luis Chamberlain
  2022-07-12  5:49                   ` Song Liu
  1 sibling, 0 replies; 21+ messages in thread
From: Luis Chamberlain @ 2022-07-12  4:24 UTC (permalink / raw)
  To: Song Liu, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Masami Hiramatsu, Naveen N. Rao,
	David S. Miller, Anil S Keshavamurthy, Kees Cook
  Cc: Song Liu, bpf, Christoph Hellwig, Davidlohr Bueso, lkml,
	Linux-MM, Daniel Borkmann, Kernel Team, x86, dave.hansen,
	rick.p.edgecombe, linux-modules

On Mon, Jul 11, 2022 at 09:18:53PM -0700, Luis Chamberlain wrote:
> A vmalloc_ro_exec() by definition would imply a text_poke().
> 
> Can kprobes, ftrace and modules use it too? It would be nice
> so to not have to deal with the loose semantics on the user to
> have to use set_vm_flush_reset_perms() on ro+x later, but
> I think this can be addressed separately on a case by case basis.

And who knows: if they can, and can also share a huge page allocator,
then they may also see similar performance improvements.

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-12  4:18                 ` Luis Chamberlain
  2022-07-12  4:24                   ` Luis Chamberlain
@ 2022-07-12  5:49                   ` Song Liu
  2022-07-12 19:04                     ` Luis Chamberlain
  1 sibling, 1 reply; 21+ messages in thread
From: Song Liu @ 2022-07-12  5:49 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Masami Hiramatsu, Naveen N. Rao,
	David S. Miller, Anil S Keshavamurthy, Kees Cook, Song Liu, bpf,
	Christoph Hellwig, Davidlohr Bueso, lkml, Linux-MM,
	Daniel Borkmann, Kernel Team, x86, dave.hansen, rick.p.edgecombe,
	linux-modules



> On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Sat, Jul 09, 2022 at 01:14:23AM +0000, Song Liu wrote:
>>> On Jul 8, 2022, at 3:24 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>> 
>>>> 1) Rename module_alloc_huge as module_alloc_text_huge();
>>> 
>>> module_alloc_text_huge() is too long, but I've suggested names before
>>> which are short and generic, and also suggested that if modules are
>>> not the only users this needs to go outside of modules and so
>>> vmalloc_text_huge() or whatever.
>>> 
>>> To do this right it begs the question why we don't do that for the
>>> existing module_alloc(), as the users of this code is well outside of
>>> modules now. Last time a similar generic name was used all the special
>>> arch stuff was left to be done by the module code still, but still
>>> non-modules were still using that allocator. From my perspective the
>>> right thing to do is to deal with all the arch stuff as well in the
>>> generic handler, and have the module code *and* the other users which
>>> use module_alloc() to use that new caller as well.
>> 
>> The key difference between module_alloc() and the new API is that the 
>> API will return RO+X memory, and the user needs a text-poke-like API to
>> modify this buffer. Archs that do not support text-poke will not be
>> able to use the new API. Does this sound like a reasonable design?

[...]

> I believe you are mentioning requiring text_poke() because the way
> eBPF code uses the module_alloc() is different. Correct me if I'm
> wrong, but from what I gather is you use the text_poke_copy() as the data
> is already RO+X, contrary module_alloc() use cases. You do this since your
> bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
> module_alloc() and before you can use this memory. This is a different type
> of allocator. And, again please correct me if I'm wrong but now you want to
> share *one* 2 MiB huge-page for multiple BPF programs to help with the
> impact of TLB misses.

Yes, sharing 1x 2MiB huge page is the main reason to require text_poke. 
OTOH, 2MiB huge pages without sharing are not really useful. Both kprobe
and ftrace only use a fraction of a 4kB page. Most BPF programs and 
modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
much value on top of the current module_alloc(). 

> A vmalloc_ro_exec() by definition would imply a text_poke().
> 
> Can kprobes, ftrace and modules use it too? It would be nice
> so to not have to deal with the loose semantics on the user to
> have to use set_vm_flush_reset_perms() on ro+x later, but
> I think this can be addressed separately on a case by case basis.

I am pretty confident that kprobe and ftrace can share huge pages with 
BPF programs. I haven't looked into all the details with modules, but 
given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
possible. 

Once this is done, a regular system (without huge BPF programs or huge
modules) will just use 1x 2MB page for text from modules, ftrace, kprobes, 
and BPF programs. 

> 
> But a vmalloc_ro_exec() with a respective free can remove the
> requirement to do set_vm_flush_reset_perms().

Removing the requirement to set_vm_flush_reset_perms() is the other
reason to go directly to vmalloc_ro_exec(). 

My current version looks like this:

void *vmalloc_exec(unsigned long size);
void vfree_exec(void *ptr, unsigned int size);

ro is eliminated as there is no rw version of the API. 

The ugly part is @size for vfree_exec(). We need it to share huge 
pages. 

Under the hood, it looks similar to current bpf_prog_pack_alloc
and bpf_prog_pack_free. 
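
A hedged sketch of how a caller might use the pair, assuming the
text_poke_copy() based write path discussed earlier (the API is still a
draft):

```
void *img = vmalloc_exec(prog_size);	/* draft API: returns RO+X memory */

if (img) {
	text_poke_copy(img, insns, prog_size);	/* write via the poking API */
	/* ... run the program ... */
	vfree_exec(img, prog_size);	/* @size locates the shared huge page */
}
```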

Thanks,
Song

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-12  5:49                   ` Song Liu
@ 2022-07-12 19:04                     ` Luis Chamberlain
  2022-07-12 23:12                       ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2022-07-12 19:04 UTC (permalink / raw)
  To: Song Liu
  Cc: Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Masami Hiramatsu, Naveen N. Rao,
	David S. Miller, Anil S Keshavamurthy, Kees Cook, Song Liu, bpf,
	Christoph Hellwig, Davidlohr Bueso, lkml, Linux-MM,
	Daniel Borkmann, Kernel Team, x86, dave.hansen, rick.p.edgecombe,
	linux-modules

On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
> > On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> > I believe you are mentioning requiring text_poke() because the way
> > eBPF code uses the module_alloc() is different. Correct me if I'm
> > wrong, but from what I gather is you use the text_poke_copy() as the data
> > is already RO+X, contrary module_alloc() use cases. You do this since your
> > bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
> > module_alloc() and before you can use this memory. This is a different type
> > of allocator. And, again please correct me if I'm wrong but now you want to
> > share *one* 2 MiB huge-page for multiple BPF programs to help with the
> > impact of TLB misses.
> 
> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke. 
> OTOH, 2MiB huge pages without sharing is not really useful. Both kprobe
> and ftrace only uses a fraction of a 4kB page. Most BPF programs and 
> modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
> much value on top of current module_alloc(). 

Thanks for the clarification.

> > A vmalloc_ro_exec() by definition would imply a text_poke().
> > 
> > Can kprobes, ftrace and modules use it too? It would be nice
> > so to not have to deal with the loose semantics on the user to
> > have to use set_vm_flush_reset_perms() on ro+x later, but
> > I think this can be addressed separately on a case by case basis.
> 
> I am pretty confident that kprobe and ftrace can share huge pages with 
> BPF programs.

Wonderful, then we know where to go in terms of a new API, as it
can be shared in the future and there are gains.

> I haven't looked into all the details with modules, but 
> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
> possible.

Sure.

> Once this is done, a regular system (without huge BPF program or huge
> modules) will just use 1x 2MB page for text from module, ftrace, kprobe, 
> and bpf programs. 

That would be nice, if possible; however, modules will likely require their
own thing. On my system I see about 57 MiB used on coresize alone:

lsmod | grep -v Module | cut -f1 -d ' ' | \
	xargs sudo modinfo | grep filename | \
	grep -o '/.*' | xargs stat -c "%s - %n" | \
	awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
60001272

And so perhaps we need such a pool size to be configurable.

> > But a vmalloc_ro_exec() with a respective free can remove the
> > requirement to do set_vm_flush_reset_perms().
> 
> Removing the requirement to set_vm_flush_reset_perms() is the other
> reason to go directly to vmalloc_ro_exec(). 

Yes fantastic.

> My current version looks like this:
> 
> void *vmalloc_exec(unsigned long size);
> void vfree_exec(void *ptr, unsigned int size);
> 
> ro is eliminated as there is no rw version of the API. 

Alright.

I am not sure if 2 MiB will suffice given what I mentioned above, or
what to do to ensure this grows at a reasonable pace. Then, at least for
usage across all architectures, since not all will support text_poke(), we
will want to consider a way to make it easy for users to fall back to
non-huge pages, but that would be up to those users, so we can wait for
that.

> The ugly part is @size for vfree_exec(). We need it to share huge 
> pages. 

I suppose this will become evident during patch review.

> Under the hood, it looks similar to current bpf_prog_pack_alloc
> and bpf_prog_pack_free. 

Groovy.

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-12 19:04                     ` Luis Chamberlain
@ 2022-07-12 23:12                       ` Song Liu
  2022-07-12 23:42                         ` Luis Chamberlain
  0 siblings, 1 reply; 21+ messages in thread
From: Song Liu @ 2022-07-12 23:12 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Masami Hiramatsu, Naveen N. Rao,
	David S. Miller, Anil S Keshavamurthy, Kees Cook, Song Liu, bpf,
	Christoph Hellwig, Davidlohr Bueso, lkml, Linux-MM,
	Daniel Borkmann, Kernel Team, x86, dave.hansen, rick.p.edgecombe,
	linux-modules



> On Jul 12, 2022, at 12:04 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
>>> On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>> 
>>> I believe you are mentioning requiring text_poke() because the way
>>> eBPF code uses the module_alloc() is different. Correct me if I'm
>>> wrong, but from what I gather is you use the text_poke_copy() as the data
>>> is already RO+X, contrary module_alloc() use cases. You do this since your
>>> bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
>>> module_alloc() and before you can use this memory. This is a different type
>>> of allocator. And, again please correct me if I'm wrong but now you want to
>>> share *one* 2 MiB huge-page for multiple BPF programs to help with the
>>> impact of TLB misses.
>> 
>> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke.
>> OTOH, 2MiB huge pages without sharing are not really useful. Both kprobe
>> and ftrace only use a fraction of a 4kB page. Most BPF programs and
>> modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
>> much value on top of the current module_alloc().
> 
> Thanks for the clarification.
> 
>>> A vmalloc_ro_exec() by definition would imply a text_poke().
>>> 
>>> Can kprobes, ftrace and modules use it too? It would be nice,
>>> so users would not have to deal with the loose semantics of having
>>> to call set_vm_flush_reset_perms() on ro+x memory later, but
>>> I think this can be addressed separately on a case-by-case basis.
>> 
>> I am pretty confident that kprobe and ftrace can share huge pages with 
>> BPF programs.
> 
> Then wonderful, we know where to go in terms of a new API then as it
> can be shared in the future for sure and there are gains.
> 
>> I haven't looked into all the details with modules, but 
>> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
>> possible.
> 
> Sure.
> 
>> Once this is done, a regular system (without huge BPF programs or huge
>> modules) will just use 1x 2MB page for text from modules, ftrace, kprobe,
>> and bpf programs.
> 
> That would be nice, if possible; however, modules will likely require
> their own thing. On my system I see about 57 MiB used on coresize alone.
> 
> lsmod | grep -v Module | cut -f1 -d ' ' | \
> 	xargs sudo modinfo | grep filename | \
> 	grep -o '/.*' | xargs stat -c "%s - %n" | \
> 	awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
> 60001272
> 
> And so perhaps we need such a pool size to be configurable.
> 
>>> But a vmalloc_ro_exec() with a respective free can remove the
>>> requirement to do set_vm_flush_reset_perms().
>> 
>> Removing the requirement to set_vm_flush_reset_perms() is the other
>> reason to go directly to vmalloc_ro_exec(). 
> 
> Yes fantastic.
> 
>> My current version looks like this:
>> 
>> void *vmalloc_exec(unsigned long size);
>> void vfree_exec(void *ptr, unsigned int size);
>> 
>> ro is eliminated as there is no rw version of the API. 
> 
> Alright.
> 
> I am not sure if 2 MiB will suffice given what I mentioned above, nor
> what to do to ensure this grows at a reasonable pace. Then, for usage
> across all architectures, since not all will support text_poke(), we
> will want to consider a way to make it easy for users to fall back to
> non-huge pages; but that would be up to those users, so we can wait for
> that.

We are not limited to 2MiB total. The logic works like this:

1. Anything bigger than 2MiB gets its own allocation.
2. We maintain a list of 2MiB pages, and bitmaps showing which parts of
   these pages are in use.
3. For objects smaller than 2MiB, we will try to fit them in one of these
   pages.
   3. a) If there isn't a page with big enough contiguous free space, we
        will allocate a new 2MiB page.

(For a system with n NUMA nodes, multiply the 2MiB above by n.)

So, if we have 100 kernel modules using 1MiB each, they will share 50x
2MiB pages. 
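
A minimal sketch of that allocation logic (all names here, such as
exec_pack, PACK_CHUNK and the pack_* helpers, are placeholders, not
the actual implementation):

#include <linux/bitmap.h>
#include <linux/kernel.h>
#include <linux/list.h>

#define PACK_SIZE	(2UL << 20)			/* one 2MiB huge page */
#define PACK_CHUNK	64				/* allocation granularity */
#define PACK_NBITS	(PACK_SIZE / PACK_CHUNK)

struct exec_pack {
	struct list_head list;
	void *base;					/* 2MiB huge-page backed */
	DECLARE_BITMAP(inuse, PACK_NBITS);		/* which chunks are taken */
};

static LIST_HEAD(pack_list);

static void *pack_alloc_large(unsigned long size);	/* hypothetical, case 1 */
static void *pack_alloc_new(unsigned long size);	/* hypothetical, case 3a */
static struct exec_pack *pack_find(const void *ptr);	/* hypothetical lookup */

static void *pack_alloc(unsigned long size)
{
	unsigned long nbits = DIV_ROUND_UP(size, PACK_CHUNK);
	struct exec_pack *p;
	unsigned long pos;

	if (size > PACK_SIZE)				/* case 1 */
		return pack_alloc_large(size);

	list_for_each_entry(p, &pack_list, list) {	/* case 3 */
		pos = bitmap_find_next_zero_area(p->inuse, PACK_NBITS,
						 0, nbits, 0);
		if (pos < PACK_NBITS) {
			bitmap_set(p->inuse, pos, nbits);
			return p->base + pos * PACK_CHUNK;
		}
	}
	return pack_alloc_new(size);			/* case 3a */
}

/*
 * This is also why vfree_exec() takes @size: the free side needs to
 * know how many bits to clear in the shared page's bitmap.
 */
static void pack_free(void *ptr, unsigned int size)
{
	struct exec_pack *p = pack_find(ptr);

	bitmap_clear(p->inuse, (ptr - p->base) / PACK_CHUNK,
		     DIV_ROUND_UP(size, PACK_CHUNK));
}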

Thanks,
Song

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-12 23:12                       ` Song Liu
@ 2022-07-12 23:42                         ` Luis Chamberlain
  2022-07-13  1:00                           ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Luis Chamberlain @ 2022-07-12 23:42 UTC (permalink / raw)
  To: Song Liu
  Cc: Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Masami Hiramatsu, Naveen N. Rao,
	David S. Miller, Anil S Keshavamurthy, Kees Cook, Song Liu, bpf,
	Christoph Hellwig, Davidlohr Bueso, lkml, Linux-MM,
	Daniel Borkmann, Kernel Team, x86, dave.hansen, rick.p.edgecombe,
	linux-modules

On Tue, Jul 12, 2022 at 11:12:22PM +0000, Song Liu wrote:
> 
> 
> > On Jul 12, 2022, at 12:04 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> > 
> > On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
> >>> On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> >> 
> >>> I believe you are mentioning requiring text_poke() because the way
> >>> eBPF code uses module_alloc() is different. Correct me if I'm
> >>> wrong, but from what I gather you use text_poke_copy() as the data
> >>> is already RO+X, contrary to module_alloc() use cases. You do this since your
> >>> bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
> >>> module_alloc() and before you can use this memory. This is a different type
> >>> of allocator. And, again, please correct me if I'm wrong, but now you want to
> >>> share *one* 2 MiB huge page for multiple BPF programs to help with the
> >>> impact of TLB misses.
> >> 
> >> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke.
> >> OTOH, 2MiB huge pages without sharing are not really useful. Both kprobe
> >> and ftrace only use a fraction of a 4kB page. Most BPF programs and
> >> modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
> >> much value on top of the current module_alloc().
> > 
> > Thanks for the clarification.
> > 
> >>> A vmalloc_ro_exec() by definition would imply a text_poke().
> >>> 
> >>> Can kprobes, ftrace and modules use it too? It would be nice,
> >>> so users would not have to deal with the loose semantics of having
> >>> to call set_vm_flush_reset_perms() on ro+x memory later, but
> >>> I think this can be addressed separately on a case-by-case basis.
> >> 
> >> I am pretty confident that kprobe and ftrace can share huge pages with 
> >> BPF programs.
> > 
> > Then wonderful, we know where to go in terms of a new API then as it
> > can be shared in the future for sure and there are gains.
> > 
> >> I haven't looked into all the details with modules, but 
> >> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
> >> possible.
> > 
> > Sure.
> > 
> >> Once this is done, a regular system (without huge BPF programs or huge
> >> modules) will just use 1x 2MB page for text from modules, ftrace, kprobe,
> >> and bpf programs.
> > 
> > That would be nice, if possible; however, modules will likely require
> > their own thing. On my system I see about 57 MiB used on coresize alone.
> > 
> > lsmod | grep -v Module | cut -f1 -d ' ' | \
> > 	xargs sudo modinfo | grep filename | \
> > 	grep -o '/.*' | xargs stat -c "%s - %n" | \
> > 	awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
> > 60001272
> > 
> > And so perhaps we need such a pool size to be configurable.
> > 
> >>> But a vmalloc_ro_exec() with a respective free can remove the
> >>> requirement to do set_vm_flush_reset_perms().
> >> 
> >> Removing the requirement to set_vm_flush_reset_perms() is the other
> >> reason to go directly to vmalloc_ro_exec(). 
> > 
> > Yes fantastic.
> > 
> >> My current version looks like this:
> >> 
> >> void *vmalloc_exec(unsigned long size);
> >> void vfree_exec(void *ptr, unsigned int size);
> >> 
> >> ro is eliminated as there is no rw version of the API. 
> > 
> > Alright.
> > 
> > I am not sure if 2 MiB will suffice given what I mentioned above, nor
> > what to do to ensure this grows at a reasonable pace. Then, for usage
> > across all architectures, since not all will support text_poke(), we
> > will want to consider a way to make it easy for users to fall back to
> > non-huge pages; but that would be up to those users, so we can wait for
> > that.
> 
> We are not limited to 2MiB total. The logic works like this:
> 
> 1. Anything bigger than 2MiB gets its own allocation.

And does that allocation get split up into a few 2 MiB huge pages?
When freed, does that go back into the pool of available 2 MiB pages?

> 2. We maintain a list of 2MiB pages, and bitmaps showing which parts of 
>    these pages are in use. 

How many 2 MiB huge pages are allocated initially? Do we have a cap?

> 3. For objects smaller than 2MiB, we will try to fit them in one of these
>    pages.
>    3. a) If there isn't a page with big enough contiguous free space, we
>         will allocate a new 2MiB page.
> 
> (For a system with n NUMA nodes, multiply the 2MiB above by n.)
> 
> So, if we have 100 kernel modules using 1MiB each, they will share 50x
> 2MiB pages. 

lsmod | grep -v Module | cut -f1 -d ' ' | \
	xargs sudo modinfo | grep filename |\
	grep -o '/.*' | xargs stat -c "%s - %n" | \
	awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 
271.273

On average my system's modules are 271 KiB.

Then I only have 6 out of 216 modules which use more than 2 MiB of
memory for coresize. So roughly 97% of my modules would be covered
by this. Not bad.

The monsters:

lsmod | grep -v Module | cut -f1 -d ' ' | xargs sudo modinfo \
	| grep filename |grep -o '/.*' | xargs stat -c "%s %n" | \
	sort -n -k 1 -r | head -10 | \
	awk '{print $1/1024/1024" "$2}'
6.50775 /lib/modules/5.17.0-1-amd64/kernel/drivers/gpu/drm/i915/i915.ko
3.6847 /lib/modules/5.17.0-1-amd64/kernel/fs/xfs/xfs.ko
3.34252 /lib/modules/5.17.0-1-amd64/kernel/fs/btrfs/btrfs.ko
2.37677 /lib/modules/5.17.0-1-amd64/kernel/net/mac80211/mac80211.ko
2.2972 /lib/modules/5.17.0-1-amd64/kernel/net/wireless/cfg80211.ko
2.05754 /lib/modules/5.17.0-1-amd64/kernel/arch/x86/kvm/kvm.ko
1.96126 /lib/modules/5.17.0-1-amd64/kernel/net/bluetooth/bluetooth.ko
1.83429 /lib/modules/5.17.0-1-amd64/kernel/fs/ext4/ext4.ko
1.7724 /lib/modules/5.17.0-1-amd64/kernel/fs/nfsd/nfsd.ko
1.60539 /lib/modules/5.17.0-1-amd64/kernel/net/sunrpc/sunrpc.ko

On a big iron server I have 149 modules and the situation is better
there:

3.69791 /lib/modules/5.16.0-6-amd64/kernel/fs/xfs/xfs.ko
3.35575 /lib/modules/5.16.0-6-amd64/kernel/fs/btrfs/btrfs.ko
3.21056 /lib/modules/5.16.0-6-amd64/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
2.02773 /lib/modules/5.16.0-6-amd64/kernel/arch/x86/kvm/kvm.ko
1.82574 /lib/modules/5.16.0-6-amd64/kernel/fs/ext4/ext4.ko
1.36571 /lib/modules/5.16.0-6-amd64/kernel/net/sunrpc/sunrpc.ko
1.32686 /lib/modules/5.16.0-6-amd64/kernel/fs/nfsd/nfsd.ko
1.12648 /lib/modules/5.16.0-6-amd64/kernel/drivers/gpu/drm/drm.ko
0.898623 /lib/modules/5.16.0-6-amd64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
0.86922 /lib/modules/5.16.0-6-amd64/kernel/drivers/infiniband/core/ib_core.ko

So this may just work nicely.

  Luis

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
  2022-07-12 23:42                         ` Luis Chamberlain
@ 2022-07-13  1:00                           ` Song Liu
  0 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2022-07-13  1:00 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Masami Hiramatsu, Naveen N. Rao,
	David S. Miller, Anil S Keshavamurthy, Kees Cook, Song Liu, bpf,
	Christoph Hellwig, Davidlohr Bueso, lkml, Linux-MM,
	Daniel Borkmann, Kernel Team, x86, dave.hansen, rick.p.edgecombe,
	linux-modules



> On Jul 12, 2022, at 4:42 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Tue, Jul 12, 2022 at 11:12:22PM +0000, Song Liu wrote:
>> 
>> 
>>> On Jul 12, 2022, at 12:04 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>> 
>>> On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
>>>>> On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>>> 
>>>>> I believe you are mentioning requiring text_poke() because the way
>>>>> eBPF code uses module_alloc() is different. Correct me if I'm
>>>>> wrong, but from what I gather you use text_poke_copy() as the data
>>>>> is already RO+X, contrary to module_alloc() use cases. You do this since your
>>>>> bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
>>>>> module_alloc() and before you can use this memory. This is a different type
>>>>> of allocator. And, again, please correct me if I'm wrong, but now you want to
>>>>> share *one* 2 MiB huge page for multiple BPF programs to help with the
>>>>> impact of TLB misses.
>>>> 
>>>> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke.
>>>> OTOH, 2MiB huge pages without sharing are not really useful. Both kprobe
>>>> and ftrace only use a fraction of a 4kB page. Most BPF programs and
>>>> modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
>>>> much value on top of the current module_alloc().
>>> 
>>> Thanks for the clarification.
>>> 
>>>>> A vmalloc_ro_exec() by definition would imply a text_poke().
>>>>> 
>>>>> Can kprobes, ftrace and modules use it too? It would be nice,
>>>>> so users would not have to deal with the loose semantics of having
>>>>> to call set_vm_flush_reset_perms() on ro+x memory later, but
>>>>> I think this can be addressed separately on a case-by-case basis.
>>>> 
>>>> I am pretty confident that kprobe and ftrace can share huge pages with 
>>>> BPF programs.
>>> 
>>> Then wonderful, we know where to go in terms of a new API then as it
>>> can be shared in the future for sure and there are gains.
>>> 
>>>> I haven't looked into all the details with modules, but 
>>>> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
>>>> possible.
>>> 
>>> Sure.
>>> 
>>>> Once this is done, a regular system (without huge BPF programs or huge
>>>> modules) will just use 1x 2MB page for text from modules, ftrace, kprobe,
>>>> and bpf programs.
>>> 
>>> That would be nice, if possible; however, modules will likely require
>>> their own thing. On my system I see about 57 MiB used on coresize alone.
>>> 
>>> lsmod | grep -v Module | cut -f1 -d ' ' | \
>>> 	xargs sudo modinfo | grep filename | \
>>> 	grep -o '/.*' | xargs stat -c "%s - %n" | \
>>> 	awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
>>> 60001272
>>> 
>>> And so perhaps we need such a pool size to be configurable.
>>> 
>>>>> But a vmalloc_ro_exec() with a respective free can remove the
>>>>> requirement to do set_vm_flush_reset_perms().
>>>> 
>>>> Removing the requirement to set_vm_flush_reset_perms() is the other
>>>> reason to go directly to vmalloc_ro_exec(). 
>>> 
>>> Yes fantastic.
>>> 
>>>> My current version looks like this:
>>>> 
>>>> void *vmalloc_exec(unsigned long size);
>>>> void vfree_exec(void *ptr, unsigned int size);
>>>> 
>>>> ro is eliminated as there is no rw version of the API. 
>>> 
>>> Alright.
>>> 
>>> I am not sure if 2 MiB will suffice given what I mentioned above, nor
>>> what to do to ensure this grows at a reasonable pace. Then, for usage
>>> across all architectures, since not all will support text_poke(), we
>>> will want to consider a way to make it easy for users to fall back to
>>> non-huge pages; but that would be up to those users, so we can wait for
>>> that.
>> 
>> We are not limited to 2MiB total. The logic works like this:
>> 
>> 1. Anything bigger than 2MiB gets its own allocation.
> 
> And does that allocation get split up into a few 2 MiB huge pages?
> When freed, does that go back into the pool of available 2 MiB pages?

This would have some 2MiB pages and some 4kB pages. For example, if we
need 4MiB + 5kB, it will allocate 2x 2MiB pages, and 2x 4kB pages (5kB
rounded up to 8kB).

On free, these will not go back to the pool. Instead, they will be
vfree()'ed.
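
Roughly, the split would be computed like this (a sketch using standard
kernel macros, not the actual code):

	/* e.g. size = 4MiB + 5kB */
	unsigned long huge = round_down(size, PMD_SIZE);  /* 4MiB: 2x 2MiB pages */
	unsigned long tail = PAGE_ALIGN(size - huge);     /* 5kB -> 8kB: 2x 4kB pages */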

> 
>> 2. We maintain a list of 2MiB pages, and bitmaps showing which parts of 
>>   these pages are in use. 
> 
> How many 2 MiB huge pages are allocated initially? Do we have a cap?

Current logic just allocates 1 huge page at a time. No cap. 

> 
>> 3. For objects smaller than 2MiB, we will try to fit them in one of these
>>   pages.
>>   3. a) If there isn't a page with big enough contiguous free space, we
>>        will allocate a new 2MiB page.
>> 
>> (For a system with n NUMA nodes, multiply the 2MiB above by n.)
>> 
>> So, if we have 100 kernel modules using 1MiB each, they will share 50x
>> 2MiB pages. 
> 
> lsmod | grep -v Module | cut -f1 -d ' ' | \
> 	xargs sudo modinfo | grep filename |\
> 	grep -o '/.*' | xargs stat -c "%s - %n" | \
> 	awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 
> 271.273
> 
> On average my system's modules are 271 KiB.
> 
> Then I only have 6 out of 216 modules which use more than 2 MiB of
> memory for coresize. So roughly 97% of my modules would be covered
> by this. Not bad.

Are these all the modules we have in tree? ;)

Thanks,
Song

> 
> The monsters:
> 
> lsmod | grep -v Module | cut -f1 -d ' ' | xargs sudo modinfo \
> 	| grep filename |grep -o '/.*' | xargs stat -c "%s %n" | \
> 	sort -n -k 1 -r | head -10 | \
> 	awk '{print $1/1024/1024" "$2}'
> 6.50775 /lib/modules/5.17.0-1-amd64/kernel/drivers/gpu/drm/i915/i915.ko
> 3.6847 /lib/modules/5.17.0-1-amd64/kernel/fs/xfs/xfs.ko
> 3.34252 /lib/modules/5.17.0-1-amd64/kernel/fs/btrfs/btrfs.ko
> 2.37677 /lib/modules/5.17.0-1-amd64/kernel/net/mac80211/mac80211.ko
> 2.2972 /lib/modules/5.17.0-1-amd64/kernel/net/wireless/cfg80211.ko
> 2.05754 /lib/modules/5.17.0-1-amd64/kernel/arch/x86/kvm/kvm.ko
> 1.96126 /lib/modules/5.17.0-1-amd64/kernel/net/bluetooth/bluetooth.ko
> 1.83429 /lib/modules/5.17.0-1-amd64/kernel/fs/ext4/ext4.ko
> 1.7724 /lib/modules/5.17.0-1-amd64/kernel/fs/nfsd/nfsd.ko
> 1.60539 /lib/modules/5.17.0-1-amd64/kernel/net/sunrpc/sunrpc.ko
> 
> On a big iron server I have 149 modules and the situation is better
> there:
> 
> 3.69791 /lib/modules/5.16.0-6-amd64/kernel/fs/xfs/xfs.ko
> 3.35575 /lib/modules/5.16.0-6-amd64/kernel/fs/btrfs/btrfs.ko
> 3.21056 /lib/modules/5.16.0-6-amd64/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
> 2.02773 /lib/modules/5.16.0-6-amd64/kernel/arch/x86/kvm/kvm.ko
> 1.82574 /lib/modules/5.16.0-6-amd64/kernel/fs/ext4/ext4.ko
> 1.36571 /lib/modules/5.16.0-6-amd64/kernel/net/sunrpc/sunrpc.ko
> 1.32686 /lib/modules/5.16.0-6-amd64/kernel/fs/nfsd/nfsd.ko
> 1.12648 /lib/modules/5.16.0-6-amd64/kernel/drivers/gpu/drm/drm.ko
> 0.898623 /lib/modules/5.16.0-6-amd64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
> 0.86922 /lib/modules/5.16.0-6-amd64/kernel/drivers/infiniband/core/ib_core.ko
> 
> So this may just work nicely.
> 
>  Luis


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-07-13  1:00 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-07 22:35 [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size Song Liu
2022-07-07 22:59 ` [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Luis Chamberlain
2022-07-07 23:52   ` Song Liu
2022-07-08  0:53     ` Luis Chamberlain
2022-07-08  1:36       ` Song Liu
2022-07-08 15:58         ` Luis Chamberlain
2022-07-08 19:58           ` Song Liu
2022-07-08 22:24             ` Luis Chamberlain
2022-07-09  1:14               ` Song Liu
2022-07-12  4:18                 ` Luis Chamberlain
2022-07-12  4:24                   ` Luis Chamberlain
2022-07-12  5:49                   ` Song Liu
2022-07-12 19:04                     ` Luis Chamberlain
2022-07-12 23:12                       ` Song Liu
2022-07-12 23:42                         ` Luis Chamberlain
2022-07-13  1:00                           ` Song Liu
