All of lore.kernel.org
* [PATCH bpf-next 0/5] bpf_prog_pack followup
@ 2022-05-16  5:40 Song Liu
  2022-05-16  5:40 ` [PATCH bpf-next 1/5] bpf: fill new bpf_prog_pack with illegal instructions Song Liu
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Song Liu @ 2022-05-16  5:40 UTC (permalink / raw)
  To: linux-kernel, bpf
  Cc: ast, daniel, peterz, mcgrof, torvalds, rick.p.edgecombe,
	kernel-team, Song Liu

Resending the set, as the original one didn't make it through the mailing list.

As of 5.18-rc6, x86_64 uses bpf_prog_pack on 4kB pages. This set contains
two followups:
  1/5 - 3/5 fill the unused part of bpf_prog_pack with illegal instructions.
  4/5 - 5/5 enable bpf_prog_pack on 2MB pages.

The primary goal of bpf_prog_pack is to reduce iTLB miss rate and reduce
direct memory mapping fragmentation. This leads to non-trivial performance
improvements.

For our web service production benchmark, bpf_prog_pack on 4kB pages
gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
bpf_prog_pack on 2MB pages gives 0.6% to 0.9% more throughput than not
using bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
believe this is also significant for other companies with many thousands
of servers.

bpf_prog_pack on 2MB pages may use slightly more memory for systems
without many BPF programs. However, such memory waste (<2MB) is within
the noise for modern x86_64 systems.
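
The packing scheme can be sketched in plain userspace C (names and sizes
here are illustrative; the real allocator lives in kernel/bpf/core.c):
one large region is carved into 64-byte chunks tracked by a bitmap, and
each JITed program reserves a contiguous run of chunks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PACK_SIZE (2u << 20)   /* one 2MB "page" (illustrative) */
#define CHUNK     64u          /* allocation granularity */
#define NCHUNKS   (PACK_SIZE / CHUNK)

static uint8_t bitmap[NCHUNKS]; /* 1 = chunk in use */

/* Find a run of free chunks large enough for size bytes;
 * return the first chunk index, or -1 if the pack is full. */
static int pack_alloc(unsigned int size)
{
	unsigned int nbits = (size + CHUNK - 1) / CHUNK;

	for (unsigned int i = 0; i + nbits <= NCHUNKS; i++) {
		unsigned int j;

		for (j = 0; j < nbits; j++)
			if (bitmap[i + j])
				break;
		if (j == nbits) {
			memset(&bitmap[i], 1, nbits);
			return (int)i;
		}
		i += j; /* skip past the in-use chunk we hit */
	}
	return -1;
}

static void pack_free(int pos, unsigned int size)
{
	unsigned int nbits = (size + CHUNK - 1) / CHUNK;

	memset(&bitmap[pos], 0, nbits);
}
```

Because many small programs share one mapping, they also share iTLB
entries, which is where the throughput win comes from.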

Song Liu (5):
  bpf: fill new bpf_prog_pack with illegal instructions
  x86/alternative: introduce text_poke_set
  bpf: introduce bpf_arch_text_invalidate for bpf_prog_pack
  module: introduce module_alloc_huge
  bpf: use module_alloc_huge for bpf_prog_pack

 arch/x86/include/asm/text-patching.h |  1 +
 arch/x86/kernel/alternative.c        | 70 ++++++++++++++++++++++++----
 arch/x86/kernel/module.c             | 21 +++++++++
 arch/x86/net/bpf_jit_comp.c          |  5 ++
 include/linux/bpf.h                  |  1 +
 include/linux/moduleloader.h         |  5 ++
 kernel/bpf/core.c                    | 30 ++++++++----
 kernel/module.c                      |  8 ++++
 8 files changed, 122 insertions(+), 19 deletions(-)

--
2.30.2


* [PATCH bpf-next 1/5] bpf: fill new bpf_prog_pack with illegal instructions
  2022-05-16  5:40 [PATCH bpf-next 0/5] bpf_prog_pack followup Song Liu
@ 2022-05-16  5:40 ` Song Liu
  2022-05-16  5:40 ` [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set Song Liu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-05-16  5:40 UTC (permalink / raw)
  To: linux-kernel, bpf
  Cc: ast, daniel, peterz, mcgrof, torvalds, rick.p.edgecombe,
	kernel-team, Song Liu

bpf_prog_pack enables sharing huge pages among multiple BPF programs.
These pages are marked as executable before the JIT engine fills them
with BPF programs. To make these pages safe, fill the whole
bpf_prog_pack with illegal instructions before making it executable.
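
As a userspace illustration (the real callback is the x86
jit_fill_hole() in arch/x86/net/bpf_jit_comp.c), the fill is simply a
memset with the architecture's one-byte trap instruction, 0xcc (int3)
on x86, so a stray jump into a hole traps immediately instead of
executing whatever bytes happened to be there:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pad unused executable memory with 0xcc (int3 on x86): any
 * control-flow transfer into the padding raises #BP right away. */
static void fill_ill_insns(void *area, size_t size)
{
	memset(area, 0xcc, size);
}
```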

Fixes: 57631054fae6 ("bpf: Introduce bpf_prog_pack allocator")
Fixes: 33c9805860e5 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/core.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 9cc91f0f3115..2d0c9d4696ad 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -873,7 +873,7 @@ static size_t select_bpf_prog_pack_size(void)
 	return size;
 }
 
-static struct bpf_prog_pack *alloc_new_pack(void)
+static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_insns)
 {
 	struct bpf_prog_pack *pack;
 
@@ -886,6 +886,7 @@ static struct bpf_prog_pack *alloc_new_pack(void)
 		kfree(pack);
 		return NULL;
 	}
+	bpf_fill_ill_insns(pack->ptr, bpf_prog_pack_size);
 	bitmap_zero(pack->bitmap, bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE);
 	list_add_tail(&pack->list, &pack_list);
 
@@ -895,7 +896,7 @@ static struct bpf_prog_pack *alloc_new_pack(void)
 	return pack;
 }
 
-static void *bpf_prog_pack_alloc(u32 size)
+static void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns)
 {
 	unsigned int nbits = BPF_PROG_SIZE_TO_NBITS(size);
 	struct bpf_prog_pack *pack;
@@ -910,6 +911,7 @@ static void *bpf_prog_pack_alloc(u32 size)
 		size = round_up(size, PAGE_SIZE);
 		ptr = module_alloc(size);
 		if (ptr) {
+			bpf_fill_ill_insns(ptr, size);
 			set_vm_flush_reset_perms(ptr);
 			set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
 			set_memory_x((unsigned long)ptr, size / PAGE_SIZE);
@@ -923,7 +925,7 @@ static void *bpf_prog_pack_alloc(u32 size)
 			goto found_free_area;
 	}
 
-	pack = alloc_new_pack();
+	pack = alloc_new_pack(bpf_fill_ill_insns);
 	if (!pack)
 		goto out;
 
@@ -1102,7 +1104,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 
 	if (bpf_jit_charge_modmem(size))
 		return NULL;
-	ro_header = bpf_prog_pack_alloc(size);
+	ro_header = bpf_prog_pack_alloc(size, bpf_fill_ill_insns);
 	if (!ro_header) {
 		bpf_jit_uncharge_modmem(size);
 		return NULL;
-- 
2.30.2



* [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set
  2022-05-16  5:40 [PATCH bpf-next 0/5] bpf_prog_pack followup Song Liu
  2022-05-16  5:40 ` [PATCH bpf-next 1/5] bpf: fill new bpf_prog_pack with illegal instructions Song Liu
@ 2022-05-16  5:40 ` Song Liu
  2022-05-17 19:16   ` Edgecombe, Rick P
                     ` (2 more replies)
  2022-05-16  5:40 ` [PATCH bpf-next 3/5] bpf: introduce bpf_arch_text_invalidate for bpf_prog_pack Song Liu
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 25+ messages in thread
From: Song Liu @ 2022-05-16  5:40 UTC (permalink / raw)
  To: linux-kernel, bpf
  Cc: ast, daniel, peterz, mcgrof, torvalds, rick.p.edgecombe,
	kernel-team, Song Liu

Introduce a memset-like API for text_poke. This will be used to fill
the unused RX memory with illegal instructions.
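
The refactoring pattern in this patch, passing the write operation into
a shared poke routine so that memcpy and memset share one code path, can
be sketched in userspace as follows. Here poke() is only a stand-in for
the kernel's __text_poke(), which additionally remaps the target through
a temporary mm; the callback signatures mirror the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void write_op_t(void *dst, const void *src, size_t len);

static void op_memcpy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

static void op_memset(void *dst, const void *src, size_t len)
{
	/* src carries the fill byte, matching text_poke_memset() */
	memset(dst, *(const int *)src, len);
}

/* Stand-in for __text_poke(): apply the chosen operation to the
 * (here: directly writable) destination. */
static void poke(write_op_t *op, void *dst, const void *src, size_t len)
{
	op(dst, src, len);
}
```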

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/include/asm/text-patching.h |  1 +
 arch/x86/kernel/alternative.c        | 70 ++++++++++++++++++++++++----
 2 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index d20ab0921480..1cc15528ce29 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -45,6 +45,7 @@ extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void text_poke_sync(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
+extern void *text_poke_set(void *addr, int c, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
 
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index d374cb3cf024..732814065389 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -994,7 +994,21 @@ static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
 __ro_after_init struct mm_struct *poking_mm;
 __ro_after_init unsigned long poking_addr;
 
-static void *__text_poke(void *addr, const void *opcode, size_t len)
+static void text_poke_memcpy(void *dst, const void *src, size_t len)
+{
+	memcpy(dst, src, len);
+}
+
+static void text_poke_memset(void *dst, const void *src, size_t len)
+{
+	int c = *(int *)src;
+
+	memset(dst, c, len);
+}
+
+typedef void text_poke_f(void *dst, const void *src, size_t len);
+
+static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t len)
 {
 	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
 	struct page *pages[2] = {NULL};
@@ -1059,7 +1073,7 @@ static void *__text_poke(void *addr, const void *opcode, size_t len)
 	prev = use_temporary_mm(poking_mm);
 
 	kasan_disable_current();
-	memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
+	func((u8 *)poking_addr + offset_in_page(addr), src, len);
 	kasan_enable_current();
 
 	/*
@@ -1087,11 +1101,13 @@ static void *__text_poke(void *addr, const void *opcode, size_t len)
 			   (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
 			   PAGE_SHIFT, false);
 
-	/*
-	 * If the text does not match what we just wrote then something is
-	 * fundamentally screwy; there's nothing we can really do about that.
-	 */
-	BUG_ON(memcmp(addr, opcode, len));
+	if (func == text_poke_memcpy) {
+		/*
+		 * If the text does not match what we just wrote then something is
+		 * fundamentally screwy; there's nothing we can really do about that.
+		 */
+		BUG_ON(memcmp(addr, src, len));
+	}
 
 	local_irq_restore(flags);
 	pte_unmap_unlock(ptep, ptl);
@@ -1118,7 +1134,7 @@ void *text_poke(void *addr, const void *opcode, size_t len)
 {
 	lockdep_assert_held(&text_mutex);
 
-	return __text_poke(addr, opcode, len);
+	return __text_poke(text_poke_memcpy, addr, opcode, len);
 }
 
 /**
@@ -1137,7 +1153,7 @@ void *text_poke(void *addr, const void *opcode, size_t len)
  */
 void *text_poke_kgdb(void *addr, const void *opcode, size_t len)
 {
-	return __text_poke(addr, opcode, len);
+	return __text_poke(text_poke_memcpy, addr, opcode, len);
 }
 
 /**
@@ -1167,7 +1183,41 @@ void *text_poke_copy(void *addr, const void *opcode, size_t len)
 
 		s = min_t(size_t, PAGE_SIZE * 2 - offset_in_page(ptr), len - patched);
 
-		__text_poke((void *)ptr, opcode + patched, s);
+		__text_poke(text_poke_memcpy, (void *)ptr, opcode + patched, s);
+		patched += s;
+	}
+	mutex_unlock(&text_mutex);
+	return addr;
+}
+
+/**
+ * text_poke_set - memset into (an unused part of) RX memory
+ * @addr: address to modify
+ * @c: the byte to fill the area with
+ * @len: length to copy, could be more than 2x PAGE_SIZE
+ *
+ * Not safe against concurrent execution; useful for JITs to dump
+ * new code blocks into unused regions of RX memory. Can be used in
+ * conjunction with synchronize_rcu_tasks() to wait for existing
+ * execution to quiesce after having made sure no existing functions
+ * pointers are live.
+ */
+void *text_poke_set(void *addr, int c, size_t len)
+{
+	unsigned long start = (unsigned long)addr;
+	size_t patched = 0;
+
+	if (WARN_ON_ONCE(core_kernel_text(start)))
+		return NULL;
+
+	mutex_lock(&text_mutex);
+	while (patched < len) {
+		unsigned long ptr = start + patched;
+		size_t s;
+
+		s = min_t(size_t, PAGE_SIZE * 2 - offset_in_page(ptr), len - patched);
+
+		__text_poke(text_poke_memset, (void *)ptr, (void *)&c, s);
 		patched += s;
 	}
 	mutex_unlock(&text_mutex);
-- 
2.30.2



* [PATCH bpf-next 3/5] bpf: introduce bpf_arch_text_invalidate for bpf_prog_pack
  2022-05-16  5:40 [PATCH bpf-next 0/5] bpf_prog_pack followup Song Liu
  2022-05-16  5:40 ` [PATCH bpf-next 1/5] bpf: fill new bpf_prog_pack with illegal instructions Song Liu
  2022-05-16  5:40 ` [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set Song Liu
@ 2022-05-16  5:40 ` Song Liu
  2022-05-16  5:40 ` [PATCH bpf-next 4/5] module: introduce module_alloc_huge Song Liu
  2022-05-16  5:40 ` [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
  4 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-05-16  5:40 UTC (permalink / raw)
  To: linux-kernel, bpf
  Cc: ast, daniel, peterz, mcgrof, torvalds, rick.p.edgecombe,
	kernel-team, Song Liu

Introduce bpf_arch_text_invalidate and use it to fill the unused part of
the bpf_prog_pack with illegal instructions when a BPF program is freed.
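
The patch relies on weak-symbol linkage: kernel/bpf/core.c provides a
__weak default that reports failure, and architectures that implement
invalidation (here, x86 via text_poke_set()) override it with a strong
definition. A compilable sketch of the default follows; ENOTSUPP is the
kernel-internal errno value (524), defined locally since userspace errno.h
does not have it:

```c
#include <assert.h>
#include <stddef.h>

#define ENOTSUPP 524 /* kernel-internal "operation not supported" */

/* Weak default: used whenever no architecture supplies a strong
 * definition at link time, as in kernel/bpf/core.c. */
__attribute__((weak)) int bpf_arch_text_invalidate(void *dst, size_t len)
{
	(void)dst;
	(void)len;
	return -ENOTSUPP;
}
```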

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c | 5 +++++
 include/linux/bpf.h         | 1 +
 kernel/bpf/core.c           | 8 ++++++++
 3 files changed, 14 insertions(+)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index a2b6d197c226..f298b18a9a3d 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -228,6 +228,11 @@ static void jit_fill_hole(void *area, unsigned int size)
 	memset(area, 0xcc, size);
 }
 
+int bpf_arch_text_invalidate(void *dst, size_t len)
+{
+	return IS_ERR_OR_NULL(text_poke_set(dst, 0xcc, len));
+}
+
 struct jit_context {
 	int cleanup_addr; /* Epilogue code offset */
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5061ccd8b2dc..0288a6464236 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2362,6 +2362,7 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 		       void *addr1, void *addr2);
 
 void *bpf_arch_text_copy(void *dst, void *src, size_t len);
+int bpf_arch_text_invalidate(void *dst, size_t len);
 
 struct btf_id_set;
 bool btf_id_set_contains(const struct btf_id_set *set, u32 id);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 2d0c9d4696ad..cacd8684c3c4 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -968,6 +968,9 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 	nbits = BPF_PROG_SIZE_TO_NBITS(hdr->size);
 	pos = ((unsigned long)hdr - (unsigned long)pack_ptr) >> BPF_PROG_CHUNK_SHIFT;
 
+	WARN_ONCE(bpf_arch_text_invalidate(hdr, hdr->size),
+		  "bpf_prog_pack bug: missing bpf_arch_text_invalidate?\n");
+
 	bitmap_clear(pack->bitmap, pos, nbits);
 	if (bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0,
 				       bpf_prog_chunk_count(), 0) == 0) {
@@ -2740,6 +2743,11 @@ void * __weak bpf_arch_text_copy(void *dst, void *src, size_t len)
 	return ERR_PTR(-ENOTSUPP);
 }
 
+int __weak bpf_arch_text_invalidate(void *dst, size_t len)
+{
+	return -ENOTSUPP;
+}
+
 DEFINE_STATIC_KEY_FALSE(bpf_stats_enabled_key);
 EXPORT_SYMBOL(bpf_stats_enabled_key);
 
-- 
2.30.2



* [PATCH bpf-next 4/5] module: introduce module_alloc_huge
  2022-05-16  5:40 [PATCH bpf-next 0/5] bpf_prog_pack followup Song Liu
                   ` (2 preceding siblings ...)
  2022-05-16  5:40 ` [PATCH bpf-next 3/5] bpf: introduce bpf_arch_text_invalidate for bpf_prog_pack Song Liu
@ 2022-05-16  5:40 ` Song Liu
  2022-05-18  6:28   ` Song Liu
  2022-05-16  5:40 ` [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
  4 siblings, 1 reply; 25+ messages in thread
From: Song Liu @ 2022-05-16  5:40 UTC (permalink / raw)
  To: linux-kernel, bpf
  Cc: ast, daniel, peterz, mcgrof, torvalds, rick.p.edgecombe,
	kernel-team, Song Liu, Stephen Rothwell

Introduce module_alloc_huge, which allocates huge page backed memory in
module memory space. The primary user of this memory is bpf_prog_pack
(multiple BPF programs sharing a huge page).
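
A rough userspace analogue of the contract module_alloc_huge() needs
from the allocator: the region must be a PMD-aligned multiple of
PMD_SIZE before the kernel can back it with huge pages. This sketch
demonstrates only the size/alignment rounding, not the actual
VM_ALLOW_HUGE_VMAP page-table mapping:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PMD_SIZE (2u << 20) /* 2MB on x86_64 */

/* Round the request up to a PMD_SIZE multiple and return a
 * PMD-aligned region, mirroring the precondition for huge-page
 * backing. Returns NULL on failure, like module_alloc_huge(). */
static void *alloc_huge_aligned(size_t size)
{
	size_t rounded = (size + PMD_SIZE - 1) & ~(size_t)(PMD_SIZE - 1);

	return aligned_alloc(PMD_SIZE, rounded);
}
```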

Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Song Liu <song@kernel.org>

---
Note: This conflicts with the module.c => module/ split in modules-next.
Current patch is for module.c in the bpf-next tree. After the split,
__weak module_alloc_huge() should be added to kernel/module/main.c.
---
 arch/x86/kernel/module.c     | 21 +++++++++++++++++++++
 include/linux/moduleloader.h |  5 +++++
 kernel/module.c              |  8 ++++++++
 3 files changed, 34 insertions(+)

diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b98ffcf4d250..63f6a16c70dc 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -86,6 +86,27 @@ void *module_alloc(unsigned long size)
 	return p;
 }
 
+void *module_alloc_huge(unsigned long size)
+{
+	gfp_t gfp_mask = GFP_KERNEL;
+	void *p;
+
+	if (PAGE_ALIGN(size) > MODULES_LEN)
+		return NULL;
+
+	p = __vmalloc_node_range(size, MODULE_ALIGN,
+				 MODULES_VADDR + get_module_load_offset(),
+				 MODULES_END, gfp_mask, PAGE_KERNEL,
+				 VM_DEFER_KMEMLEAK | VM_ALLOW_HUGE_VMAP,
+				 NUMA_NO_NODE, __builtin_return_address(0));
+	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
+		vfree(p);
+		return NULL;
+	}
+
+	return p;
+}
+
 #ifdef CONFIG_X86_32
 int apply_relocate(Elf32_Shdr *sechdrs,
 		   const char *strtab,
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 9e09d11ffe5b..d34743a88938 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -26,6 +26,11 @@ unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
    sections.  Returns NULL on failure. */
 void *module_alloc(unsigned long size);
 
+/* Allocator used for allocating memory in module memory space. If size is
+ * greater than PMD_SIZE, allow using huge pages. Returns NULL on failure.
+ */
+void *module_alloc_huge(unsigned long size);
+
 /* Free memory returned from module_alloc. */
 void module_memfree(void *module_region);
 
diff --git a/kernel/module.c b/kernel/module.c
index 6cea788fd965..2af20ac3209c 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2839,6 +2839,14 @@ void * __weak module_alloc(unsigned long size)
 			NUMA_NO_NODE, __builtin_return_address(0));
 }
 
+void * __weak module_alloc_huge(unsigned long size)
+{
+	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+				    GFP_KERNEL, PAGE_KERNEL_EXEC,
+				    VM_FLUSH_RESET_PERMS | VM_ALLOW_HUGE_VMAP,
+				    NUMA_NO_NODE, __builtin_return_address(0));
+}
+
 bool __weak module_init_section(const char *name)
 {
 	return strstarts(name, ".init");
-- 
2.30.2



* [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-05-16  5:40 [PATCH bpf-next 0/5] bpf_prog_pack followup Song Liu
                   ` (3 preceding siblings ...)
  2022-05-16  5:40 ` [PATCH bpf-next 4/5] module: introduce module_alloc_huge Song Liu
@ 2022-05-16  5:40 ` Song Liu
  2022-05-17 19:15   ` Edgecombe, Rick P
  4 siblings, 1 reply; 25+ messages in thread
From: Song Liu @ 2022-05-16  5:40 UTC (permalink / raw)
  To: linux-kernel, bpf
  Cc: ast, daniel, peterz, mcgrof, torvalds, rick.p.edgecombe,
	kernel-team, Song Liu

Use module_alloc_huge for bpf_prog_pack so that BPF programs sit on
PMD_SIZE pages. This benefits system performance by reducing iTLB miss
rate. A benchmark of a real web service workload shows this change gives
another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
(which improves system throughput by ~0.5%).

Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and use
set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]

[1] https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/core.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index cacd8684c3c4..b64d91fcb0ba 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
 	void *ptr;
 
 	size = BPF_HPAGE_SIZE * num_online_nodes();
-	ptr = module_alloc(size);
+	ptr = module_alloc_huge(size);
 
 	/* Test whether we can get huge pages. If not just use PAGE_SIZE
 	 * packs.
@@ -881,7 +881,7 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
 		       GFP_KERNEL);
 	if (!pack)
 		return NULL;
-	pack->ptr = module_alloc(bpf_prog_pack_size);
+	pack->ptr = module_alloc_huge(bpf_prog_pack_size);
 	if (!pack->ptr) {
 		kfree(pack);
 		return NULL;
@@ -890,7 +890,6 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
 	bitmap_zero(pack->bitmap, bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE);
 	list_add_tail(&pack->list, &pack_list);
 
-	set_vm_flush_reset_perms(pack->ptr);
 	set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
 	set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
 	return pack;
@@ -909,10 +908,9 @@ static void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insn
 
 	if (size > bpf_prog_pack_size) {
 		size = round_up(size, PAGE_SIZE);
-		ptr = module_alloc(size);
+		ptr = module_alloc_huge(size);
 		if (ptr) {
 			bpf_fill_ill_insns(ptr, size);
-			set_vm_flush_reset_perms(ptr);
 			set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
 			set_memory_x((unsigned long)ptr, size / PAGE_SIZE);
 		}
@@ -949,6 +947,8 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 
 	mutex_lock(&pack_mutex);
 	if (hdr->size > bpf_prog_pack_size) {
+		set_memory_nx((unsigned long)hdr, hdr->size / PAGE_SIZE);
+		set_memory_rw((unsigned long)hdr, hdr->size / PAGE_SIZE);
 		module_memfree(hdr);
 		goto out;
 	}
@@ -975,6 +975,8 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 	if (bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0,
 				       bpf_prog_chunk_count(), 0) == 0) {
 		list_del(&pack->list);
+		set_memory_nx((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
+		set_memory_rw((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
 		module_memfree(pack->ptr);
 		kfree(pack);
 	}
-- 
2.30.2



* Re: [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-05-16  5:40 ` [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
@ 2022-05-17 19:15   ` Edgecombe, Rick P
  2022-05-17 21:08     ` Song Liu
  0 siblings, 1 reply; 25+ messages in thread
From: Edgecombe, Rick P @ 2022-05-17 19:15 UTC (permalink / raw)
  To: linux-kernel, song, bpf
  Cc: daniel, Torvalds, Linus, peterz, kernel-team, ast, mcgrof

On Sun, 2022-05-15 at 22:40 -0700, Song Liu wrote:
> Use module_alloc_huge for bpf_prog_pack so that BPF programs sit on
> PMD_SIZE pages. This benefits system performance by reducing iTLB
> miss
> rate. Benchmark of a real web service workload shows this change
> gives
> another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
> (which improve system throughput by ~0.5%).

0.7% sounds good as a whole. How sure are you of that +0.2%? Was this a
big averaged test?

> 
> Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and use
> set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
> VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]
> 
> [1] 
> https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

As I said before, I think this will work functionally. But I meant it
as a quick fix when we were talking about patching this up to keep it
enabled upstream.

So now, should we make VM_FLUSH_RESET_PERMS work properly with huge
pages? The main benefit would be to keep the tear down of these types
of allocations consistent for correctness reasons. The TLB flush
minimizing differences are probably less impactful given the caching
introduced here. At the very least though, we should have (or have
already had) some WARN if people try to use it with huge pages.

> Signed-off-by: Song Liu <song@kernel.org>
> ---
>  kernel/bpf/core.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index cacd8684c3c4..b64d91fcb0ba 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
>  	void *ptr;
>  
>  	size = BPF_HPAGE_SIZE * num_online_nodes();
> -	ptr = module_alloc(size);
> +	ptr = module_alloc_huge(size);

This select_bpf_prog_pack_size() function always seemed weird - doing a
big allocation and then immediately freeing. Can't it check a config
for vmalloc huge page support?

>  
>  	/* Test whether we can get huge pages. If not just use
> PAGE_SIZE
>  	 * packs.
> @@ -881,7 +881,7 @@ static struct bpf_prog_pack
> *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
>  		       GFP_KERNEL);
>  	if (!pack)
>  		return NULL;
> -	pack->ptr = module_alloc(bpf_prog_pack_size);
> +	pack->ptr = module_alloc_huge(bpf_prog_pack_size);
>  	if (!pack->ptr) {
>  		kfree(pack);
>  		return NULL;
> @@ -890,7 +890,6 @@ static struct bpf_prog_pack
> *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
>  	bitmap_zero(pack->bitmap, bpf_prog_pack_size /
> BPF_PROG_CHUNK_SIZE);
>  	list_add_tail(&pack->list, &pack_list);
>  
> -	set_vm_flush_reset_perms(pack->ptr);
>  	set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size /
> PAGE_SIZE);
>  	set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size /
> PAGE_SIZE);
>  	return pack;
> @@ -909,10 +908,9 @@ static void *bpf_prog_pack_alloc(u32 size,
> bpf_jit_fill_hole_t bpf_fill_ill_insn
>  
>  	if (size > bpf_prog_pack_size) {
>  		size = round_up(size, PAGE_SIZE);
> -		ptr = module_alloc(size);
> +		ptr = module_alloc_huge(size);
>  		if (ptr) {
>  			bpf_fill_ill_insns(ptr, size);
> -			set_vm_flush_reset_perms(ptr);
>  			set_memory_ro((unsigned long)ptr, size /
> PAGE_SIZE);
>  			set_memory_x((unsigned long)ptr, size /
> PAGE_SIZE);
>  		}
> @@ -949,6 +947,8 @@ static void bpf_prog_pack_free(struct
> bpf_binary_header *hdr)
>  
>  	mutex_lock(&pack_mutex);
>  	if (hdr->size > bpf_prog_pack_size) {
> +		set_memory_nx((unsigned long)hdr, hdr->size /
> PAGE_SIZE);
> +		set_memory_rw((unsigned long)hdr, hdr->size /
> PAGE_SIZE);
>  		module_memfree(hdr);
>  		goto out;
>  	}
> @@ -975,6 +975,8 @@ static void bpf_prog_pack_free(struct
> bpf_binary_header *hdr)
>  	if (bitmap_find_next_zero_area(pack->bitmap,
> bpf_prog_chunk_count(), 0,
>  				       bpf_prog_chunk_count(), 0) == 0)
> {
>  		list_del(&pack->list);
> +		set_memory_nx((unsigned long)pack->ptr,
> bpf_prog_pack_size / PAGE_SIZE);
> +		set_memory_rw((unsigned long)pack->ptr,
> bpf_prog_pack_size / PAGE_SIZE);
>  		module_memfree(pack->ptr);
>  		kfree(pack);
>  	}


* Re: [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set
  2022-05-16  5:40 ` [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set Song Liu
@ 2022-05-17 19:16   ` Edgecombe, Rick P
  2022-05-17 21:09     ` Song Liu
  2022-05-18  6:58   ` Song Liu
  2022-05-18 17:09   ` Peter Zijlstra
  2 siblings, 1 reply; 25+ messages in thread
From: Edgecombe, Rick P @ 2022-05-17 19:16 UTC (permalink / raw)
  To: linux-kernel, song, bpf
  Cc: daniel, Torvalds, Linus, peterz, kernel-team, ast, mcgrof

On Sun, 2022-05-15 at 22:40 -0700, Song Liu wrote:
> +static void text_poke_memset(void *dst, const void *src, size_t len)
> +{
> +       int c = *(int *)src;

It casts away the const unnecessarily. It could be *(const int *)src.

> +
> +       memset(dst, c, len);
> +}
> +


* Re: [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-05-17 19:15   ` Edgecombe, Rick P
@ 2022-05-17 21:08     ` Song Liu
  2022-05-17 23:58       ` Edgecombe, Rick P
  0 siblings, 1 reply; 25+ messages in thread
From: Song Liu @ 2022-05-17 21:08 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, song, bpf, daniel, Torvalds, Linus, peterz,
	Kernel Team, ast, mcgrof



> On May 17, 2022, at 12:15 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Sun, 2022-05-15 at 22:40 -0700, Song Liu wrote:
>> Use module_alloc_huge for bpf_prog_pack so that BPF programs sit on
>> PMD_SIZE pages. This benefits system performance by reducing iTLB
>> miss
>> rate. Benchmark of a real web service workload shows this change
>> gives
>> another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
>> (which improve system throughput by ~0.5%).
> 
> 0.7% sounds good as a whole. How sure are you of that +0.2%? Was this a
> big averaged test?

Yes, this was a test between two tiers with 10+ servers on each tier.  
We took the average performance over a few hours of shadow workload. 

> 
>> 
>> Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and use
>> set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
>> VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]
>> 
>> [1] 
>> https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
>> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> As I said before, I think this will work functionally. But I meant it
> as a quick fix when we were talking about patching this up to keep it
> enabled upstream.
> 
> So now, should we make VM_FLUSH_RESET_PERMS work properly with huge
> pages? The main benefit would be to keep the tear down of these types
> of allocations consistent for correctness reasons. The TLB flush
> minimizing differences are probably less impactful given the caching
> introduced here. At the very least though, we should have (or have
> already had) some WARN if people try to use it with huge pages.

I am not quite sure about the exact work needed here. Rick, would you
have time to enable VM_FLUSH_RESET_PERMS for huge pages? Given that the
merge window is coming soon, I guess we need the current workaround in 5.19.

> 
>> Signed-off-by: Song Liu <song@kernel.org>
>> ---
>> kernel/bpf/core.c | 12 +++++++-----
>> 1 file changed, 7 insertions(+), 5 deletions(-)
>> 
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index cacd8684c3c4..b64d91fcb0ba 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
>> 	void *ptr;
>> 
>> 	size = BPF_HPAGE_SIZE * num_online_nodes();
>> -	ptr = module_alloc(size);
>> +	ptr = module_alloc_huge(size);
> 
> This select_bpf_prog_pack_size() function always seemed weird - doing a
> big allocation and then immediately freeing. Can't it check a config
> for vmalloc huge page support?

Yes, it is weird. Checking a config is not enough here. We also need to
check vmap_allow_huge, which is controlled by the boot parameter
nohugeiomap. I haven't got a better solution for this.

Thanks,
Song

> 
>> 
>> 	/* Test whether we can get huge pages. If not just use
>> PAGE_SIZE
>> 	 * packs.
>> @@ -881,7 +881,7 @@ static struct bpf_prog_pack
>> *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
>> 		       GFP_KERNEL);
>> 	if (!pack)
>> 		return NULL;
>> -	pack->ptr = module_alloc(bpf_prog_pack_size);
>> +	pack->ptr = module_alloc_huge(bpf_prog_pack_size);
>> 	if (!pack->ptr) {
>> 		kfree(pack);
>> 		return NULL;
>> @@ -890,7 +890,6 @@ static struct bpf_prog_pack
>> *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
>> 	bitmap_zero(pack->bitmap, bpf_prog_pack_size /
>> BPF_PROG_CHUNK_SIZE);
>> 	list_add_tail(&pack->list, &pack_list);
>> 
>> -	set_vm_flush_reset_perms(pack->ptr);
>> 	set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size /
>> PAGE_SIZE);
>> 	set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size /
>> PAGE_SIZE);
>> 	return pack;
>> @@ -909,10 +908,9 @@ static void *bpf_prog_pack_alloc(u32 size,
>> bpf_jit_fill_hole_t bpf_fill_ill_insn
>> 
>> 	if (size > bpf_prog_pack_size) {
>> 		size = round_up(size, PAGE_SIZE);
>> -		ptr = module_alloc(size);
>> +		ptr = module_alloc_huge(size);
>> 		if (ptr) {
>> 			bpf_fill_ill_insns(ptr, size);
>> -			set_vm_flush_reset_perms(ptr);
>> 			set_memory_ro((unsigned long)ptr, size /
>> PAGE_SIZE);
>> 			set_memory_x((unsigned long)ptr, size /
>> PAGE_SIZE);
>> 		}
>> @@ -949,6 +947,8 @@ static void bpf_prog_pack_free(struct
>> bpf_binary_header *hdr)
>> 
>> 	mutex_lock(&pack_mutex);
>> 	if (hdr->size > bpf_prog_pack_size) {
>> +		set_memory_nx((unsigned long)hdr, hdr->size /
>> PAGE_SIZE);
>> +		set_memory_rw((unsigned long)hdr, hdr->size /
>> PAGE_SIZE);
>> 		module_memfree(hdr);
>> 		goto out;
>> 	}
>> @@ -975,6 +975,8 @@ static void bpf_prog_pack_free(struct
>> bpf_binary_header *hdr)
>> 	if (bitmap_find_next_zero_area(pack->bitmap,
>> bpf_prog_chunk_count(), 0,
>> 				       bpf_prog_chunk_count(), 0) == 0)
>> {
>> 		list_del(&pack->list);
>> +		set_memory_nx((unsigned long)pack->ptr,
>> bpf_prog_pack_size / PAGE_SIZE);
>> +		set_memory_rw((unsigned long)pack->ptr,
>> bpf_prog_pack_size / PAGE_SIZE);
>> 		module_memfree(pack->ptr);
>> 		kfree(pack);
>> 	}


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set
  2022-05-17 19:16   ` Edgecombe, Rick P
@ 2022-05-17 21:09     ` Song Liu
  0 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-05-17 21:09 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, song, bpf, daniel, Torvalds, Linus, peterz,
	Kernel Team, ast, mcgrof



> On May 17, 2022, at 12:16 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Sun, 2022-05-15 at 22:40 -0700, Song Liu wrote:
>> +static void text_poke_memset(void *dst, const void *src, size_t len)
>> +{
>> +       int c = *(int *)src;
> 
> It casts away the const unnecessarily. It could be *(const int *)src.

I will fix this in the next version. Or we can ask the maintainer to 
fix it when applying the patches. 

Thanks,
Song

> 
>> +
>> +       memset(dst, c, len);
>> +}
>> +



* Re: [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-05-17 21:08     ` Song Liu
@ 2022-05-17 23:58       ` Edgecombe, Rick P
  2022-05-18  6:25         ` Song Liu
                           ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Edgecombe, Rick P @ 2022-05-17 23:58 UTC (permalink / raw)
  To: songliubraving
  Cc: linux-kernel, daniel, peterz, ast, bpf, Torvalds, Linus,
	Kernel-team, song, mcgrof

On Tue, 2022-05-17 at 21:08 +0000, Song Liu wrote:
> > On May 17, 2022, at 12:15 PM, Edgecombe, Rick P <
> > rick.p.edgecombe@intel.com> wrote:
> > 
> > On Sun, 2022-05-15 at 22:40 -0700, Song Liu wrote:
> > > Use module_alloc_huge for bpf_prog_pack so that BPF programs sit
> > > on
> > > PMD_SIZE pages. This benefits system performance by reducing iTLB
> > > miss
> > > rate. Benchmark of a real web service workload shows this change
> > > gives
> > > another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
> > > (which improve system throughput by ~0.5%).
> > 
> > 0.7% sounds good as a whole. How sure are you of that +0.2%? Was
> > this a
> > big averaged test?
> 
> Yes, this was a test between two tiers with 10+ servers on each
> tier.  
> We took the average performance over a few hours of shadow workload. 

Awesome. Sounds great.

> 
> > 
> > > 
> > > Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and
> > > use
> > > set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
> > > VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]
> > > 
> > > [1] 
> > > 
https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
> > > Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > 
> > As I said before, I think this will work functionally. But I meant
> > it
> > as a quick fix when we were talking about patching this up to keep
> > it
> > enabled upstream.
> > 
> > So now, should we make VM_FLUSH_RESET_PERMS work properly with huge
> > pages? The main benefit would be to keep the tear down of these
> > types
> > of allocations consistent for correctness reasons. The TLB flush
> > minimizing differences are probably less impactful given the
> > caching
> > introduced here. At the very least though, we should have (or have
> > already had) some WARN if people try to use it with huge pages.
> 
> I am not quite sure about the exact work needed here. Rick, would you
> have time to enable VM_FLUSH_RESET_PERMS for huge pages? Given the
> merge window is coming soon, I guess we need the current workaround
> in 5.19.

I would have a hard time squeezing that in now. The vmalloc part is easy,
I think I already posted a diff. But first hibernate needs to be
changed to not care about direct map page sizes.

> 
> > 
> > > Signed-off-by: Song Liu <song@kernel.org>
> > > ---
> > > kernel/bpf/core.c | 12 +++++++-----
> > > 1 file changed, 7 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > index cacd8684c3c4..b64d91fcb0ba 100644
> > > --- a/kernel/bpf/core.c
> > > +++ b/kernel/bpf/core.c
> > > @@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
> > >       void *ptr;
> > > 
> > >       size = BPF_HPAGE_SIZE * num_online_nodes();
> > > -    ptr = module_alloc(size);
> > > +    ptr = module_alloc_huge(size);
> > 
> > This select_bpf_prog_pack_size() function always seemed weird -
> > doing a
> > big allocation and then immediately freeing. Can't it check a
> > config
> > for vmalloc huge page support?
> 
> Yes, it is weird. Checking a config is not enough here. We also need
> to 
> check vmap_allow_huge, which is controlled by boot parameter
> nohugeiomap. 
> I haven’t got a better solution for this. 

It's too weird. We should expose what's needed in vmalloc:
huge_vmalloc_supported() or something.

I'm also not clear why we wouldn't want to use the prog pack allocator
even if vmalloc huge pages was disabled. Doesn't it improve performance
even with small page sizes, per your benchmarks? What is the downside
to just always using it?


* Re: [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-05-17 23:58       ` Edgecombe, Rick P
@ 2022-05-18  6:25         ` Song Liu
  2022-05-18  6:34         ` Song Liu
  2022-05-19  6:42         ` Song Liu
  2 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-05-18  6:25 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, daniel, peterz, ast, bpf, Torvalds, Linus,
	Kernel Team, song, mcgrof



> On May 17, 2022, at 4:58 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Tue, 2022-05-17 at 21:08 +0000, Song Liu wrote:
>>> On May 17, 2022, at 12:15 PM, Edgecombe, Rick P <
>>> rick.p.edgecombe@intel.com> wrote:
>>> 
>>> On Sun, 2022-05-15 at 22:40 -0700, Song Liu wrote:
>>>> Use module_alloc_huge for bpf_prog_pack so that BPF programs sit
>>>> on
>>>> PMD_SIZE pages. This benefits system performance by reducing iTLB
>>>> miss
>>>> rate. Benchmark of a real web service workload shows this change
>>>> gives
>>>> another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
>>>> (which improve system throughput by ~0.5%).
>>> 
>>> 0.7% sounds good as a whole. How sure are you of that +0.2%? Was
>>> this a
>>> big averaged test?
>> 
>> Yes, this was a test between two tiers with 10+ servers on each
>> tier.  
>> We took the average performance over a few hours of shadow workload. 
> 
> Awesome. Sounds great.
> 
>> 
>>> 
>>>> 
>>>> Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and
>>>> use
>>>> set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
>>>> VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]
>>>> 
>>>> [1] 
>>>> 
> https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
>>>> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> 
>>> As I said before, I think this will work functionally. But I meant
>>> it
>>> as a quick fix when we were talking about patching this up to keep
>>> it
>>> enabled upstream.
>>> 
>>> So now, should we make VM_FLUSH_RESET_PERMS work properly with huge
>>> pages? The main benefit would be to keep the tear down of these
>>> types
>>> of allocations consistent for correctness reasons. The TLB flush
>>> minimizing differences are probably less impactful given the
>>> caching
>>> introduced here. At the very least though, we should have (or have
>>> already had) some WARN if people try to use it with huge pages.
>> 
>> I am not quite sure about the exact work needed here. Rick, would you
>> have time to enable VM_FLUSH_RESET_PERMS for huge pages? Given the
>> merge window is coming soon, I guess we need the current workaround
>> in 5.19.
> 
> I would have a hard time squeezing that in now. The vmalloc part is easy,
> I think I already posted a diff. But first hibernate needs to be
> changed to not care about direct map page sizes.

I guess I missed the diff, could you please send a link to it?

> 
>> 
>>> 
>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>> ---
>>>> kernel/bpf/core.c | 12 +++++++-----
>>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>> 
>>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>>> index cacd8684c3c4..b64d91fcb0ba 100644
>>>> --- a/kernel/bpf/core.c
>>>> +++ b/kernel/bpf/core.c
>>>> @@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
>>>>      void *ptr;
>>>> 
>>>>      size = BPF_HPAGE_SIZE * num_online_nodes();
>>>> -    ptr = module_alloc(size);
>>>> +    ptr = module_alloc_huge(size);
>>> 
>>> This select_bpf_prog_pack_size() function always seemed weird -
>>> doing a
>>> big allocation and then immediately freeing. Can't it check a
>>> config
>>> for vmalloc huge page support?
>> 
>> Yes, it is weird. Checking a config is not enough here. We also need
>> to 
>> check vmap_allow_huge, which is controlled by boot parameter
>> nohugeiomap. 
>> I haven’t got a better solution for this. 
> 
> It's too weird. We should expose what's needed in vmalloc:
> huge_vmalloc_supported() or something.

Yeah, this should work. I will get something like this in the next 
version.

> 
> I'm also not clear why we wouldn't want to use the prog pack allocator
> even if vmalloc huge pages was disabled. Doesn't it improve performance
> even with small page sizes, per your benchmarks? What is the downside
> to just always using it?

With the current version, when huge pages are disabled, the prog pack
allocator will use 4kB pages for each pack. We still get about a 0.5%
performance improvement with 4kB prog packs.

Thanks,
Song




* Re: [PATCH bpf-next 4/5] module: introduce module_alloc_huge
  2022-05-16  5:40 ` [PATCH bpf-next 4/5] module: introduce module_alloc_huge Song Liu
@ 2022-05-18  6:28   ` Song Liu
  0 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-05-18  6:28 UTC (permalink / raw)
  To: Song Liu
  Cc: Linux Kernel Mailing List, bpf, Alexei Starovoitov,
	Daniel Borkmann, Peter Zijlstra, Luis Chamberlain, Torvalds,
	Linus, Edgecombe, Rick P, Kernel Team, Stephen Rothwell

Hi Luis,

> On May 15, 2022, at 10:40 PM, Song Liu <song@kernel.org> wrote:
> 
> Introduce module_alloc_huge, which allocates huge page backed memory in
> module memory space. The primary user of this memory is bpf_prog_pack
> (multiple BPF programs sharing a huge page).
> 
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> Cc: Luis Chamberlain <mcgrof@kernel.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Song Liu <song@kernel.org>
> 
> ---
> Note: This conflicts with the module.c => module/ split in modules-next.
> Current patch is for module.c in the bpf-next tree. After the split,
> __weak module_alloc_huge() should be added to kernel/module/main.c.

Could you please share your feedback on this patch? I guess we need
to address the conflict with the module.c split during the merge window?

Thanks,
Song


> ---
> arch/x86/kernel/module.c     | 21 +++++++++++++++++++++
> include/linux/moduleloader.h |  5 +++++
> kernel/module.c              |  8 ++++++++
> 3 files changed, 34 insertions(+)
> 
> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> index b98ffcf4d250..63f6a16c70dc 100644
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -86,6 +86,27 @@ void *module_alloc(unsigned long size)
> 	return p;
> }
> 
> +void *module_alloc_huge(unsigned long size)
> +{
> +	gfp_t gfp_mask = GFP_KERNEL;
> +	void *p;
> +
> +	if (PAGE_ALIGN(size) > MODULES_LEN)
> +		return NULL;
> +
> +	p = __vmalloc_node_range(size, MODULE_ALIGN,
> +				 MODULES_VADDR + get_module_load_offset(),
> +				 MODULES_END, gfp_mask, PAGE_KERNEL,
> +				 VM_DEFER_KMEMLEAK | VM_ALLOW_HUGE_VMAP,
> +				 NUMA_NO_NODE, __builtin_return_address(0));
> +	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
> +		vfree(p);
> +		return NULL;
> +	}
> +
> +	return p;
> +}
> +
> #ifdef CONFIG_X86_32
> int apply_relocate(Elf32_Shdr *sechdrs,
> 		   const char *strtab,
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index 9e09d11ffe5b..d34743a88938 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -26,6 +26,11 @@ unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
>    sections.  Returns NULL on failure. */
> void *module_alloc(unsigned long size);
> 
> +/* Allocator used for allocating memory in module memory space. If size is
> + * greater than PMD_SIZE, allow using huge pages. Returns NULL on failure.
> + */
> +void *module_alloc_huge(unsigned long size);
> +
> /* Free memory returned from module_alloc. */
> void module_memfree(void *module_region);
> 
> diff --git a/kernel/module.c b/kernel/module.c
> index 6cea788fd965..2af20ac3209c 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -2839,6 +2839,14 @@ void * __weak module_alloc(unsigned long size)
> 			NUMA_NO_NODE, __builtin_return_address(0));
> }
> 
> +void * __weak module_alloc_huge(unsigned long size)
> +{
> +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> +				    GFP_KERNEL, PAGE_KERNEL_EXEC,
> +				    VM_FLUSH_RESET_PERMS | VM_ALLOW_HUGE_VMAP,
> +				    NUMA_NO_NODE, __builtin_return_address(0));
> +}
> +
> bool __weak module_init_section(const char *name)
> {
> 	return strstarts(name, ".init");
> -- 
> 2.30.2
> 



* Re: [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-05-17 23:58       ` Edgecombe, Rick P
  2022-05-18  6:25         ` Song Liu
@ 2022-05-18  6:34         ` Song Liu
  2022-05-18 16:49           ` Edgecombe, Rick P
  2022-05-19  6:42         ` Song Liu
  2 siblings, 1 reply; 25+ messages in thread
From: Song Liu @ 2022-05-18  6:34 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, daniel, peterz, ast, bpf, Torvalds, Linus,
	Kernel Team, song, mcgrof



> On May 17, 2022, at 4:58 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Tue, 2022-05-17 at 21:08 +0000, Song Liu wrote:
>>> On May 17, 2022, at 12:15 PM, Edgecombe, Rick P <
>>> rick.p.edgecombe@intel.com> wrote:
>>> 
>>> On Sun, 2022-05-15 at 22:40 -0700, Song Liu wrote:
>>>> Use module_alloc_huge for bpf_prog_pack so that BPF programs sit
>>>> on
>>>> PMD_SIZE pages. This benefits system performance by reducing iTLB
>>>> miss
>>>> rate. Benchmark of a real web service workload shows this change
>>>> gives
>>>> another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
>>>> (which improve system throughput by ~0.5%).
>>> 
>>> 0.7% sounds good as a whole. How sure are you of that +0.2%? Was
>>> this a
>>> big averaged test?
>> 
>> Yes, this was a test between two tiers with 10+ servers on each
>> tier.  
>> We took the average performance over a few hours of shadow workload. 
> 
> Awesome. Sounds great.
> 
>> 
>>> 
>>>> 
>>>> Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and
>>>> use
>>>> set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
>>>> VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]
>>>> 
>>>> [1] 
>>>> 
> https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
>>>> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> 
>>> As I said before, I think this will work functionally. But I meant
>>> it
>>> as a quick fix when we were talking about patching this up to keep
>>> it
>>> enabled upstream.
>>> 
>>> So now, should we make VM_FLUSH_RESET_PERMS work properly with huge
>>> pages? The main benefit would be to keep the tear down of these
>>> types
>>> of allocations consistent for correctness reasons. The TLB flush
>>> minimizing differences are probably less impactful given the
>>> caching
>>> introduced here. At the very least though, we should have (or have
>>> already had) some WARN if people try to use it with huge pages.
>> 
>> I am not quite sure about the exact work needed here. Rick, would you
>> have time to enable VM_FLUSH_RESET_PERMS for huge pages? Given the
>> merge window is coming soon, I guess we need the current workaround
>> in 5.19.
> 
> I would have a hard time squeezing that in now. The vmalloc part is easy,
> I think I already posted a diff. But first hibernate needs to be
> changed to not care about direct map page sizes.

I guess I missed the diff, could you please send a link to it?

> 
>> 
>>> 
>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>> ---
>>>> kernel/bpf/core.c | 12 +++++++-----
>>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>> 
>>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>>> index cacd8684c3c4..b64d91fcb0ba 100644
>>>> --- a/kernel/bpf/core.c
>>>> +++ b/kernel/bpf/core.c
>>>> @@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
>>>>      void *ptr;
>>>> 
>>>>      size = BPF_HPAGE_SIZE * num_online_nodes();
>>>> -    ptr = module_alloc(size);
>>>> +    ptr = module_alloc_huge(size);
>>> 
>>> This select_bpf_prog_pack_size() function always seemed weird -
>>> doing a
>>> big allocation and then immediately freeing. Can't it check a
>>> config
>>> for vmalloc huge page support?
>> 
>> Yes, it is weird. Checking a config is not enough here. We also need
>> to 
>> check vmap_allow_huge, which is controlled by boot parameter
>> nohugeiomap. 
>> I haven’t got a better solution for this. 
> 
> It's too weird. We should expose what's needed in vmalloc:
> huge_vmalloc_supported() or something.

Yeah, this should work. I will get something like this in the next 
version.

> 
> I'm also not clear why we wouldn't want to use the prog pack allocator
> even if vmalloc huge pages was disabled. Doesn't it improve performance
> even with small page sizes, per your benchmarks? What is the downside
> to just always using it?

With the current version, when huge pages are disabled, the prog pack
allocator will use 4kB pages for each pack. We still get about a 0.5%
performance improvement with 4kB prog packs.

Thanks,
Song




* Re: [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set
  2022-05-16  5:40 ` [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set Song Liu
  2022-05-17 19:16   ` Edgecombe, Rick P
@ 2022-05-18  6:58   ` Song Liu
  2022-05-18  7:45     ` Peter Zijlstra
  2022-05-18 17:09   ` Peter Zijlstra
  2 siblings, 1 reply; 25+ messages in thread
From: Song Liu @ 2022-05-18  6:58 UTC (permalink / raw)
  To: Song Liu, Peter Zijlstra
  Cc: Linux Kernel Mailing List, bpf, Alexei Starovoitov,
	Daniel Borkmann, Luis Chamberlain, Torvalds, Linus, Edgecombe,
	Rick P, Kernel Team

Hi Peter, 

> On May 15, 2022, at 10:40 PM, Song Liu <song@kernel.org> wrote:
> 
> Introduce a memset like API for text_poke. This will be used to fill the
> unused RX memory with illegal instructions.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Song Liu <song@kernel.org>

Could you please share your comments on this? 

Thanks,
Song


> ---
> arch/x86/include/asm/text-patching.h |  1 +
> arch/x86/kernel/alternative.c        | 70 ++++++++++++++++++++++++----
> 2 files changed, 61 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index d20ab0921480..1cc15528ce29 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -45,6 +45,7 @@ extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern void text_poke_sync(void);
> extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
> extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
> +extern void *text_poke_set(void *addr, int c, size_t len);
> extern int poke_int3_handler(struct pt_regs *regs);
> extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index d374cb3cf024..732814065389 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -994,7 +994,21 @@ static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
> __ro_after_init struct mm_struct *poking_mm;
> __ro_after_init unsigned long poking_addr;
> 
> -static void *__text_poke(void *addr, const void *opcode, size_t len)
> +static void text_poke_memcpy(void *dst, const void *src, size_t len)
> +{
> +	memcpy(dst, src, len);
> +}
> +
> +static void text_poke_memset(void *dst, const void *src, size_t len)
> +{
> +	int c = *(int *)src;
> +
> +	memset(dst, c, len);
> +}
> +
> +typedef void text_poke_f(void *dst, const void *src, size_t len);
> +
> +static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t len)
> {
> 	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
> 	struct page *pages[2] = {NULL};
> @@ -1059,7 +1073,7 @@ static void *__text_poke(void *addr, const void *opcode, size_t len)
> 	prev = use_temporary_mm(poking_mm);
> 
> 	kasan_disable_current();
> -	memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
> +	func((u8 *)poking_addr + offset_in_page(addr), src, len);
> 	kasan_enable_current();
> 
> 	/*
> @@ -1087,11 +1101,13 @@ static void *__text_poke(void *addr, const void *opcode, size_t len)
> 			   (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
> 			   PAGE_SHIFT, false);
> 
> -	/*
> -	 * If the text does not match what we just wrote then something is
> -	 * fundamentally screwy; there's nothing we can really do about that.
> -	 */
> -	BUG_ON(memcmp(addr, opcode, len));
> +	if (func == text_poke_memcpy) {
> +		/*
> +		 * If the text does not match what we just wrote then something is
> +		 * fundamentally screwy; there's nothing we can really do about that.
> +		 */
> +		BUG_ON(memcmp(addr, src, len));
> +	}
> 
> 	local_irq_restore(flags);
> 	pte_unmap_unlock(ptep, ptl);
> @@ -1118,7 +1134,7 @@ void *text_poke(void *addr, const void *opcode, size_t len)
> {
> 	lockdep_assert_held(&text_mutex);
> 
> -	return __text_poke(addr, opcode, len);
> +	return __text_poke(text_poke_memcpy, addr, opcode, len);
> }
> 
> /**
> @@ -1137,7 +1153,7 @@ void *text_poke(void *addr, const void *opcode, size_t len)
>  */
> void *text_poke_kgdb(void *addr, const void *opcode, size_t len)
> {
> -	return __text_poke(addr, opcode, len);
> +	return __text_poke(text_poke_memcpy, addr, opcode, len);
> }
> 
> /**
> @@ -1167,7 +1183,41 @@ void *text_poke_copy(void *addr, const void *opcode, size_t len)
> 
> 		s = min_t(size_t, PAGE_SIZE * 2 - offset_in_page(ptr), len - patched);
> 
> -		__text_poke((void *)ptr, opcode + patched, s);
> +		__text_poke(text_poke_memcpy, (void *)ptr, opcode + patched, s);
> +		patched += s;
> +	}
> +	mutex_unlock(&text_mutex);
> +	return addr;
> +}
> +
> +/**
> + * text_poke_set - memset into (an unused part of) RX memory
> + * @addr: address to modify
> + * @c: the byte to fill the area with
> + * @len: length to copy, could be more than 2x PAGE_SIZE
> + *
> + * Not safe against concurrent execution; useful for JITs to dump
> + * new code blocks into unused regions of RX memory. Can be used in
> + * conjunction with synchronize_rcu_tasks() to wait for existing
> + * execution to quiesce after having made sure no existing functions
> + * pointers are live.
> + */
> +void *text_poke_set(void *addr, int c, size_t len)
> +{
> +	unsigned long start = (unsigned long)addr;
> +	size_t patched = 0;
> +
> +	if (WARN_ON_ONCE(core_kernel_text(start)))
> +		return NULL;
> +
> +	mutex_lock(&text_mutex);
> +	while (patched < len) {
> +		unsigned long ptr = start + patched;
> +		size_t s;
> +
> +		s = min_t(size_t, PAGE_SIZE * 2 - offset_in_page(ptr), len - patched);
> +
> +		__text_poke(text_poke_memset, (void *)ptr, (void *)&c, s);
> 		patched += s;
> 	}
> 	mutex_unlock(&text_mutex);
> -- 
> 2.30.2
> 



* Re: [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set
  2022-05-18  6:58   ` Song Liu
@ 2022-05-18  7:45     ` Peter Zijlstra
  2022-05-18 15:32       ` Song Liu
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2022-05-18  7:45 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, Linux Kernel Mailing List, bpf, Alexei Starovoitov,
	Daniel Borkmann, Luis Chamberlain, Torvalds, Linus, Edgecombe,
	Rick P, Kernel Team

On Wed, May 18, 2022 at 06:58:46AM +0000, Song Liu wrote:
> Hi Peter, 
> 
> > On May 15, 2022, at 10:40 PM, Song Liu <song@kernel.org> wrote:
> > 
> > Introduce a memset like API for text_poke. This will be used to fill the
> > unused RX memory with illegal instructions.
> > 
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Song Liu <song@kernel.org>
> 
> Could you please share your comments on this? 

I wrote it; it must be good! What specifically do you want to know?


* Re: [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set
  2022-05-18  7:45     ` Peter Zijlstra
@ 2022-05-18 15:32       ` Song Liu
  0 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-05-18 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, Linux Kernel Mailing List, bpf, Alexei Starovoitov,
	Daniel Borkmann, Luis Chamberlain, Torvalds, Linus, Edgecombe,
	Rick P, Kernel Team



> On May 18, 2022, at 12:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, May 18, 2022 at 06:58:46AM +0000, Song Liu wrote:
>> Hi Peter, 
>> 
>>> On May 15, 2022, at 10:40 PM, Song Liu <song@kernel.org> wrote:
>>> 
>>> Introduce a memset like API for text_poke. This will be used to fill the
>>> unused RX memory with illegal instructions.
>>> 
>>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>>> Signed-off-by: Song Liu <song@kernel.org>
>> 
>> Could you please share your comments on this? 
> 
> I wrote it; it must be good! What specifically do you want to know?

Maybe just your Acked-by or Signed-off-by?

Thanks,
Song


* Re: [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-05-18  6:34         ` Song Liu
@ 2022-05-18 16:49           ` Edgecombe, Rick P
  2022-05-18 18:31             ` Song Liu
  0 siblings, 1 reply; 25+ messages in thread
From: Edgecombe, Rick P @ 2022-05-18 16:49 UTC (permalink / raw)
  To: songliubraving
  Cc: linux-kernel, daniel, peterz, ast, bpf, Torvalds, Linus,
	Kernel-team, song, mcgrof

On Wed, 2022-05-18 at 06:34 +0000, Song Liu wrote:
> > > I am not quite sure about the exact work needed here. Rick, would you
> > > have
> > > time to enable VM_FLUSH_RESET_PERMS for huge pages? Given the
> > > merge 
> > > window is coming soon, I guess we need the current workaround in
> > > 5.19. 
> > 
> > I would have a hard time squeezing that in now. The vmalloc part is
> > easy,
> > I think I already posted a diff. But first hibernate needs to be
> > changed to not care about direct map page sizes.
> 
> I guess I missed the diff, could you please send a link to it?


https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/

The remaining problem is that hibernate may encounter NP pages when
saving memory to disk. It resets them with CPA calls 4k at a time. So
if a page is NP, hibernate needs it to be already be 4k or it might
need to split. I think hibernate should just utilize a different
mapping to get at the page when it encounters this rare scenario. In
that diff I put some locking so that hibernate couldn't race with a
huge NP page, but then I thought we should just change hibernate.

> 
> > 
> > > 
> > > > 
> > > > > Signed-off-by: Song Liu <song@kernel.org>
> > > > > ---
> > > > > kernel/bpf/core.c | 12 +++++++-----
> > > > > 1 file changed, 7 insertions(+), 5 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > > > index cacd8684c3c4..b64d91fcb0ba 100644
> > > > > --- a/kernel/bpf/core.c
> > > > > +++ b/kernel/bpf/core.c
> > > > > @@ -857,7 +857,7 @@ static size_t
> > > > > select_bpf_prog_pack_size(void)
> > > > >       void *ptr;
> > > > > 
> > > > >       size = BPF_HPAGE_SIZE * num_online_nodes();
> > > > > -    ptr = module_alloc(size);
> > > > > +    ptr = module_alloc_huge(size);
> > > > 
> > > > This select_bpf_prog_pack_size() function always seemed weird -
> > > > doing a
> > > > big allocation and then immediately freeing. Can't it check a
> > > > config
> > > > for vmalloc huge page support?
> > > 
> > > Yes, it is weird. Checking a config is not enough here. We also
> > > need
> > > to 
> > > check vmap_allow_huge, which is controlled by boot parameter
> > > nohugeiomap. 
> > > I haven’t got a better solution for this. 
> > 
> > It's too weird. We should expose what's needed in vmalloc:
> > huge_vmalloc_supported() or something.
> 
> Yeah, this should work. I will get something like this in the next 
> version.
> 
> > 
> > I'm also not clear why we wouldn't want to use the prog pack
> > allocator
> > even if vmalloc huge pages was disabled. Doesn't it improve
> > performance
> > even with small page sizes, per your benchmarks? What is the
> > downside
> > to just always using it?
> 
> With the current version, when huge pages are disabled, the prog pack
> allocator will use 4kB pages for each pack. We still get about a 0.5%
> performance improvement with 4kB prog packs.

Oh, I thought you were comparing a 2MB sized, small page mapped
allocation to a 2MB sized, huge page mapped allocation.

It looks like the logic is to free a pack if it is empty, so then for
smaller packs you are more likely to let the pages go back to the page
allocator. Then future allocations would break more pages.

So I think that is not a fully apples to apples test of huge mapping
benefits. I'd be surprised if there really was no huge mapping benefit,
since its been seen with core kernel text. Did you notice if the direct
map breakage was different between the tests?



* Re: [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set
  2022-05-16  5:40 ` [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set Song Liu
  2022-05-17 19:16   ` Edgecombe, Rick P
  2022-05-18  6:58   ` Song Liu
@ 2022-05-18 17:09   ` Peter Zijlstra
  2022-05-18 18:34     ` Song Liu
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2022-05-18 17:09 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, bpf, ast, daniel, mcgrof, torvalds,
	rick.p.edgecombe, kernel-team

On Sun, May 15, 2022 at 10:40:48PM -0700, Song Liu wrote:
> Introduce a memset like API for text_poke. This will be used to fill the
> unused RX memory with illegal instructions.

FWIW, you're going to use it to set INT3 (0xCC); that's not an illegal
instruction. INTO (0xCE) would be an illegal instruction (in 64bit
mode).


> +	return addr;
> +}
> +
> +/**
> + * text_poke_set - memset into (an unused part of) RX memory
> + * @addr: address to modify
> + * @c: the byte to fill the area with
> + * @len: length to copy, could be more than 2x PAGE_SIZE
> + *
> + * Not safe against concurrent execution; useful for JITs to dump
> + * new code blocks into unused regions of RX memory. Can be used in
> + * conjunction with synchronize_rcu_tasks() to wait for existing
> + * execution to quiesce after having made sure no existing functions
> + * pointers are live.

That comment suffers from copy-pasta and needs an update because it
clearly isn't correct.

> + */

Other than that, seems fine.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>


* Re: [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-05-18 16:49           ` Edgecombe, Rick P
@ 2022-05-18 18:31             ` Song Liu
  0 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-05-18 18:31 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, daniel, peterz, ast, bpf, Torvalds, Linus,
	Kernel Team, song, mcgrof



> On May 18, 2022, at 9:49 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Wed, 2022-05-18 at 06:34 +0000, Song Liu wrote:
>>>> I am not quite sure the exact work needed here. Rick, would you
>>>> have
>>>> time to enable VM_FLUSH_RESET_PERMS for huge pages? Given the
>>>> merge 
>>>> window is coming soon, I guess we need current work around in
>>>> 5.19. 
>>> 
>>> I would have hard time squeezing that in now. The vmalloc part is
>>> easy,
>>> I think I already posted a diff. But first hibernate needs to be
>>> changed to not care about direct map page sizes.
>> 
>> I guess I missed the diff, could you please send a link to it?
> 
> 
> https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/
> 
> The remaining problem is that hibernate may encounter NP pages when
> saving memory to disk. It resets them with CPA calls 4k at a time. So
> if a page is NP, hibernate needs it to be already be 4k or it might
> need to split. I think hibernate should just utilize a different
> mapping to get at the page when it encounters this rare scenario. In
> that diff I put some locking so that hibernate couldn't race with a
> huge NP page, but then I thought we should just change hibernate.

I am not quite sure how to test the hibernate path. Given the merge
window is coming soon, how about we ship this patch in 5.19, and fix
VM_FLUSH_RESET_PERMS in a later release?

>>> I'm also not clear why we wouldn't want to use the prog pack
>>> allocator
>>> even if vmalloc huge pages was disabled. Doesn't it improve
>>> performance
>>> even with small page sizes, per your benchmarks? What is the
>>> downside
>>> to just always using it?
>> 
>> With current version, when huge page is disabled, the prog pack
>> allocator
>> will use 4kB pages for each pack. We still get about 0.5% performance
>> improvement with 4kB prog packs. 
> 
> Oh, I thought you were comparing a 2MB sized, small page mapped
> allocation to a 2MB sized, huge page mapped allocation.
> 
> It looks like the logic is to free a pack if it is empty, so then for
> smaller packs you are more likely to let the pages go back to the page
> allocator. Then future allocations would break more pages.

This is correct. This is the current version we have with 5.18-rc7. 

> 
> So I think that is not a fully apples to apples test of huge mapping
> benefits. I'd be surprised if there really was no huge mapping benefit,
> since it's been seen with core kernel text. Did you notice if the direct
> map breakage was different between the tests?

I didn’t check specifically, but it is expected that the 4kB prog pack
will cause more direct map breakage. 

Thanks,
Song




* Re: [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set
  2022-05-18 17:09   ` Peter Zijlstra
@ 2022-05-18 18:34     ` Song Liu
  2022-05-19  7:38       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Song Liu @ 2022-05-18 18:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, Linux Kernel Mailing List, bpf, Alexei Starovoitov,
	Daniel Borkmann, Luis Chamberlain, Torvalds, Linus, Edgecombe,
	Rick P, Kernel Team



> On May 18, 2022, at 10:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Sun, May 15, 2022 at 10:40:48PM -0700, Song Liu wrote:
>> Introduce a memset like API for text_poke. This will be used to fill the
>> unused RX memory with illegal instructions.
> 
> FWIW, you're going to use it to set INT3 (0xCC), that's not an illegal
> instruction. INTO (0xCE) would be an illegal instruction (in 64bit
> mode).

Hmm… we have been using INT3 as illegal/invalid/special instructions in 
the JIT. I guess they are equally good for this job?

> 
> 
>> +	return addr;
>> +}
>> +
>> +/**
>> + * text_poke_set - memset into (an unused part of) RX memory
>> + * @addr: address to modify
>> + * @c: the byte to fill the area with
>> + * @len: length to copy, could be more than 2x PAGE_SIZE
>> + *
>> + * Not safe against concurrent execution; useful for JITs to dump
>> + * new code blocks into unused regions of RX memory. Can be used in
>> + * conjunction with synchronize_rcu_tasks() to wait for existing
>> + * execution to quiesce after having made sure no existing functions
>> + * pointers are live.
> 
> That comment suffers from copy-pasta and needs an update because it
> clearly isn't correct.

Will fix in the next version. 

> 
>> + */
> 
> Other than that, seems fine.
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks,
Song


* Re: [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-05-17 23:58       ` Edgecombe, Rick P
  2022-05-18  6:25         ` Song Liu
  2022-05-18  6:34         ` Song Liu
@ 2022-05-19  6:42         ` Song Liu
  2022-05-19 16:56           ` Edgecombe, Rick P
  2 siblings, 1 reply; 25+ messages in thread
From: Song Liu @ 2022-05-19  6:42 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, daniel, peterz, ast, bpf, Torvalds, Linus,
	Kernel Team, song, mcgrof



> On May 17, 2022, at 4:58 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
>>> 
>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>> ---
>>>> kernel/bpf/core.c | 12 +++++++-----
>>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>> 
>>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>>> index cacd8684c3c4..b64d91fcb0ba 100644
>>>> --- a/kernel/bpf/core.c
>>>> +++ b/kernel/bpf/core.c
>>>> @@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
>>>>      void *ptr;
>>>> 
>>>>      size = BPF_HPAGE_SIZE * num_online_nodes();
>>>> -    ptr = module_alloc(size);
>>>> +    ptr = module_alloc_huge(size);
>>> 
>>> This select_bpf_prog_pack_size() function always seemed weird -
>>> doing a
>>> big allocation and then immediately freeing. Can't it check a
>>> config
>>> for vmalloc huge page support?
>> 
>> Yes, it is weird. Checking a config is not enough here. We also need
>> to 
>> check vmap_allow_huge, which is controlled by boot parameter
>> nohugeiomap. 
>> I haven’t got a better solution for this. 
> 
> It's too weird. We should expose whats needed in vmalloc.
> huge_vmalloc_supported() or something.

Thinking more on this. Even if huge pages are not supported, we can allocate
2MB worth of 4kB pages and keep using it. This would help reduce direct map
fragmentation. And the code would also be simpler.

Rick, I guess this is in line with some of your ideas?

Thanks,
Song

> 
> I'm also not clear why we wouldn't want to use the prog pack allocator
> even if vmalloc huge pages was disabled. Doesn't it improve performance
> even with small page sizes, per your benchmarks? What is the downside
> to just always using it?



* Re: [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set
  2022-05-18 18:34     ` Song Liu
@ 2022-05-19  7:38       ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2022-05-19  7:38 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, Linux Kernel Mailing List, bpf, Alexei Starovoitov,
	Daniel Borkmann, Luis Chamberlain, Torvalds, Linus, Edgecombe,
	Rick P, Kernel Team

On Wed, May 18, 2022 at 06:34:18PM +0000, Song Liu wrote:
> 
> 
> > On May 18, 2022, at 10:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Sun, May 15, 2022 at 10:40:48PM -0700, Song Liu wrote:
> >> Introduce a memset like API for text_poke. This will be used to fill the
> >> unused RX memory with illegal instructions.
> > 
> > FWIW, you're going to use it to set INT3 (0xCC), that's not an illegal
> > instruction. INTO (0xCE) would be an illegal instruction (in 64bit
> > mode).
> 
> Hmm… we have been using INT3 as illegal/invalid/special instructions in 
> the JIT. I guess they are equally good for this job?

INT3 is right, it's just not illegal. Terminology is everything :-)

INT3 is the breakpoint instruction, it raises #BP, an illegal
instruction would raise #UD. Different exception vectors and such.


* Re: [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-05-19  6:42         ` Song Liu
@ 2022-05-19 16:56           ` Edgecombe, Rick P
  2022-05-19 18:25             ` Song Liu
  0 siblings, 1 reply; 25+ messages in thread
From: Edgecombe, Rick P @ 2022-05-19 16:56 UTC (permalink / raw)
  To: songliubraving
  Cc: linux-kernel, daniel, peterz, ast, bpf, Torvalds, Linus,
	Kernel-team, song, mcgrof

On Thu, 2022-05-19 at 06:42 +0000, Song Liu wrote:
> Thinking more on this. Even huge page is not supported, we can
> allocate
> 2MB worth of 4kB pages and keep using it. This would help direct map
> fragmentation. And the code would also be simpler. 
> 
> Rick, I guess this is inline with some of your ideas?

Yea, that is what I was wondering. Potential benefits are just speculative
though. There is a memory overhead cost, so it's not free.

As for the other question of whether to fix VM_FLUSH_RESET_PERMS. If
there really is an intention to create a more general module_alloc()
replacement soon, then I think it is ok to sidestep it. An optimal
replacement might not need it and it could be removed in that case.
Let's at least add a WARN about it not working with huge pages though.

I also think the benchmarking so far is not sufficient to make the case
that huge page mappings help your workload since the direct map splits
were also different between the tests. I was expecting it to help
though. Others were the ones that asked for that, so just commenting my
analysis here.




* Re: [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack
  2022-05-19 16:56           ` Edgecombe, Rick P
@ 2022-05-19 18:25             ` Song Liu
  0 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-05-19 18:25 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, daniel, peterz, ast, bpf, Torvalds, Linus,
	Kernel Team, song, mcgrof



> On May 19, 2022, at 9:56 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Thu, 2022-05-19 at 06:42 +0000, Song Liu wrote:
>> Thinking more on this. Even huge page is not supported, we can
>> allocate
>> 2MB worth of 4kB pages and keep using it. This would help direct map
>> fragmentation. And the code would also be simpler. 
>> 
>> Rick, I guess this is inline with some of your ideas?
> 
> Yea, that is what I wondering. Potential benefits are just speculative
> though. There is a memory overhead cost, so it's not free.

Yeah, I had the same concern with memory overhead. The benefit should
not exceed 0.2% (when 2MB pages are supported), so I think we can use
4kB pages for now.

> 
> As for the other question of whether to fix VM_FLUSH_RESET_PERMS. If
> there really is an intention to create a more general module_alloc()
> replacement soon, then I think it is ok to side step it. An optimal
> replacement might not need it and it could be removed in that case.
> Let's at least add a WARN about it not working with huge pages though.

IIUC, it will take some effort to let kernel modules use a solution
like this. But there is some low-hanging fruit, like ftrace and BPF
trampolines. Maybe that's enough to promote a more general solution.

For the WARN, I guess we need something like this?

diff --git i/include/linux/vmalloc.h w/include/linux/vmalloc.h
index b159c2789961..5e0d0a60d9d5 100644
--- i/include/linux/vmalloc.h
+++ w/include/linux/vmalloc.h
@@ -238,6 +238,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
 {
        struct vm_struct *vm = find_vm_area(addr);

+       WARN_ON_ONCE(is_vm_area_hugepages(addr));
        if (vm)
                vm->flags |= VM_FLUSH_RESET_PERMS;
 }


Thanks,
Song

> 
> I also think the benchmarking so far is not sufficient to make the case
> that huge page mappings help your workload since the direct map splits
> were also different between the tests. I was expecting it to help
> though. Others were the ones that asked for that, so just commenting my
> analysis here.



end of thread, other threads:[~2022-05-19 18:25 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
2022-05-16  5:40 [PATCH bpf-next 0/5] bpf_prog_pack followup Song Liu
2022-05-16  5:40 ` [PATCH bpf-next 1/5] bpf: fill new bpf_prog_pack with illegal instructions Song Liu
2022-05-16  5:40 ` [PATCH bpf-next 2/5] x86/alternative: introduce text_poke_set Song Liu
2022-05-17 19:16   ` Edgecombe, Rick P
2022-05-17 21:09     ` Song Liu
2022-05-18  6:58   ` Song Liu
2022-05-18  7:45     ` Peter Zijlstra
2022-05-18 15:32       ` Song Liu
2022-05-18 17:09   ` Peter Zijlstra
2022-05-18 18:34     ` Song Liu
2022-05-19  7:38       ` Peter Zijlstra
2022-05-16  5:40 ` [PATCH bpf-next 3/5] bpf: introduce bpf_arch_text_invalidate for bpf_prog_pack Song Liu
2022-05-16  5:40 ` [PATCH bpf-next 4/5] module: introduce module_alloc_huge Song Liu
2022-05-18  6:28   ` Song Liu
2022-05-16  5:40 ` [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
2022-05-17 19:15   ` Edgecombe, Rick P
2022-05-17 21:08     ` Song Liu
2022-05-17 23:58       ` Edgecombe, Rick P
2022-05-18  6:25         ` Song Liu
2022-05-18  6:34         ` Song Liu
2022-05-18 16:49           ` Edgecombe, Rick P
2022-05-18 18:31             ` Song Liu
2022-05-19  6:42         ` Song Liu
2022-05-19 16:56           ` Edgecombe, Rick P
2022-05-19 18:25             ` Song Liu
