linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
@ 2022-03-30 22:56 Song Liu
  2022-03-30 22:56 ` [PATCH bpf 1/4] x86: disable HAVE_ARCH_HUGE_VMALLOC Song Liu
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Song Liu @ 2022-03-30 22:56 UTC (permalink / raw)
  To: linux-mm, bpf, netdev, x86
  Cc: ast, daniel, andrii, kernel-team, akpm, pmenzel,
	rick.p.edgecombe, Song Liu

We prematurely enabled HAVE_ARCH_HUGE_VMALLOC for x86_64, which could cause
issues [1], [2]. Fix this by with HAVE_ARCH_HUGE_VMALLOC_FLAG, which allows
user of __vmalloc_node_range to ask for huge pages with a new flag
VM_TRY_HUGE_VMAP. bpf_prog_pack is updated to use this __vmalloc_node_range
with VM_TRY_HUGE_VMAP.

[1] https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/
[2] https://lore.kernel.org/netdev/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/

Song Liu (4):
  x86: disable HAVE_ARCH_HUGE_VMALLOC
  vmalloc: introduce HAVE_ARCH_HUGE_VMALLOC_FLAG
  x86: select HAVE_ARCH_HUGE_VMALLOC_FLAG for X86_64
  bpf: use __vmalloc_node_range() with VM_TRY_HUGE_VMAP for
    bpf_prog_pack

 arch/Kconfig            |  9 +++++++++
 arch/x86/Kconfig        |  2 +-
 include/linux/vmalloc.h |  9 +++++++--
 kernel/bpf/core.c       | 21 ++++++++++++++++++---
 mm/vmalloc.c            | 28 +++++++++++++++++++---------
 5 files changed, 54 insertions(+), 15 deletions(-)

--
2.30.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH bpf 1/4] x86: disable HAVE_ARCH_HUGE_VMALLOC
  2022-03-30 22:56 [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack Song Liu
@ 2022-03-30 22:56 ` Song Liu
  2022-03-30 23:47   ` Thomas Gleixner
  2022-03-30 22:56 ` [PATCH bpf 2/4] vmalloc: introduce HAVE_ARCH_HUGE_VMALLOC_FLAG Song Liu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2022-03-30 22:56 UTC (permalink / raw)
  To: linux-mm, bpf, netdev, x86
  Cc: ast, daniel, andrii, kernel-team, akpm, pmenzel,
	rick.p.edgecombe, Song Liu

We cannot savely enable HAVE_ARCH_HUGE_VMALLOC for X86_64 yet. See [1]
and [2] for more details. Disable it for now.

[1] https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/
[2] https://lore.kernel.org/netdev/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/

Fixes: fac54e2bfb5b ("x86/Kconfig: Select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP")
Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ff45a27fc29c..4691d3aef681 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -161,7 +161,6 @@ config X86
 	select HAVE_ALIGNED_STRUCT_PAGE		if SLUB
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_ARCH_HUGE_VMAP		if X86_64 || X86_PAE
-	select HAVE_ARCH_HUGE_VMALLOC		if X86_64
 	select HAVE_ARCH_JUMP_LABEL
 	select HAVE_ARCH_JUMP_LABEL_RELATIVE
 	select HAVE_ARCH_KASAN			if X86_64
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH bpf 2/4] vmalloc: introduce HAVE_ARCH_HUGE_VMALLOC_FLAG
  2022-03-30 22:56 [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack Song Liu
  2022-03-30 22:56 ` [PATCH bpf 1/4] x86: disable HAVE_ARCH_HUGE_VMALLOC Song Liu
@ 2022-03-30 22:56 ` Song Liu
  2022-03-30 23:40   ` Edgecombe, Rick P
  2022-03-30 22:56 ` [PATCH bpf 3/4] x86: select HAVE_ARCH_HUGE_VMALLOC_FLAG for X86_64 Song Liu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2022-03-30 22:56 UTC (permalink / raw)
  To: linux-mm, bpf, netdev, x86
  Cc: ast, daniel, andrii, kernel-team, akpm, pmenzel,
	rick.p.edgecombe, Song Liu

With HAVE_ARCH_HUGE_VMALLOC_FLAG, users of __vmalloc_node_range() could
use VM_TRY_HUGE_VMAP to (try to) allocate PMD_SIZE pages for
size >= PMD_SIZE cases. Similar to HAVE_ARCH_HUGE_VMALLOC, the use can
disable huge page by specifying nohugeiomap in kernel command line.

The first user of VM_TRY_HUGE_VMAP will be bpf_prog_pack.

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/Kconfig            |  9 +++++++++
 include/linux/vmalloc.h |  9 +++++++--
 mm/vmalloc.c            | 28 +++++++++++++++++++---------
 3 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 33e06966f248..23b6e92aebaa 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -864,6 +864,15 @@ config HAVE_ARCH_HUGE_VMALLOC
 	depends on HAVE_ARCH_HUGE_VMAP
 	bool
 
+#
+# HAVE_ARCH_HUGE_VMALLOC_FLAG allows users of __vmalloc_node_range to allocate
+# huge page without HAVE_ARCH_HUGE_VMALLOC. To allocate huge pages, the user
+# need to call __vmalloc_node_range with VM_TRY_HUGE_VMAP.
+#
+config HAVE_ARCH_HUGE_VMALLOC_FLAG
+	depends on HAVE_ARCH_HUGE_VMAP
+	bool
+
 config ARCH_WANT_HUGE_PMD_SHARE
 	bool
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 3b1df7da402d..a48d0690b66f 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -35,6 +35,11 @@ struct notifier_block;		/* in notifier.h */
 #define VM_DEFER_KMEMLEAK	0
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG
+#define VM_TRY_HUGE_VMAP	0x00001000	/* Allow for huge pages on HAVE_ARCH_HUGE_VMALLOC_FLAG arch's */
+#else
+#define VM_TRY_HUGE_VMAP	0
+#endif
 /* bits [20..32] reserved for arch specific ioremap internals */
 
 /*
@@ -51,7 +56,7 @@ struct vm_struct {
 	unsigned long		size;
 	unsigned long		flags;
 	struct page		**pages;
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
 	unsigned int		page_order;
 #endif
 	unsigned int		nr_pages;
@@ -225,7 +230,7 @@ static inline bool is_vm_area_hugepages(const void *addr)
 	 * prevents that. This only indicates the size of the physical page
 	 * allocated in the vmalloc layer.
 	 */
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
 	return find_vm_area(addr)->page_order > 0;
 #else
 	return false;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e163372d3967..179200bce285 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -46,7 +46,7 @@
 #include "internal.h"
 #include "pgalloc-track.h"
 
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
 static unsigned int __ro_after_init ioremap_max_page_shift = BITS_PER_LONG - 1;
 
 static int __init set_nohugeiomap(char *str)
@@ -55,11 +55,11 @@ static int __init set_nohugeiomap(char *str)
 	return 0;
 }
 early_param("nohugeiomap", set_nohugeiomap);
-#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+#else /* CONFIG_HAVE_ARCH_HUGE_VMAP || CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG */
 static const unsigned int ioremap_max_page_shift = PAGE_SHIFT;
-#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
+#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP || CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG*/
 
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
 static bool __ro_after_init vmap_allow_huge = true;
 
 static int __init set_nohugevmalloc(char *str)
@@ -582,8 +582,9 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
 
 	WARN_ON(page_shift < PAGE_SHIFT);
 
-	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
-			page_shift == PAGE_SHIFT)
+	if ((!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) &&
+	     !IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG)) ||
+	    (page_shift == PAGE_SHIFT))
 		return vmap_small_pages_range_noflush(addr, end, prot, pages);
 
 	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
@@ -2252,7 +2253,7 @@ static struct vm_struct *vmlist __initdata;
 
 static inline unsigned int vm_area_page_order(struct vm_struct *vm)
 {
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
 	return vm->page_order;
 #else
 	return 0;
@@ -2261,7 +2262,7 @@ static inline unsigned int vm_area_page_order(struct vm_struct *vm)
 
 static inline void set_vm_area_page_order(struct vm_struct *vm, unsigned int order)
 {
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
 	vm->page_order = order;
 #else
 	BUG_ON(order != 0);
@@ -3056,6 +3057,15 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	return NULL;
 }
 
+static bool vmalloc_try_huge_page(unsigned long vm_flags)
+{
+	if (!vmap_allow_huge || (vm_flags & VM_NO_HUGE_VMAP))
+		return false;
+
+	/* VM_TRY_HUGE_VMAP only works for CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG */
+	return vm_flags & VM_TRY_HUGE_VMAP;
+}
+
 /**
  * __vmalloc_node_range - allocate virtually contiguous memory
  * @size:		  allocation size
@@ -3106,7 +3116,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		return NULL;
 	}
 
-	if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
+	if (vmalloc_try_huge_page(vm_flags)) {
 		unsigned long size_per_node;
 
 		/*
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH bpf 3/4] x86: select HAVE_ARCH_HUGE_VMALLOC_FLAG for X86_64
  2022-03-30 22:56 [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack Song Liu
  2022-03-30 22:56 ` [PATCH bpf 1/4] x86: disable HAVE_ARCH_HUGE_VMALLOC Song Liu
  2022-03-30 22:56 ` [PATCH bpf 2/4] vmalloc: introduce HAVE_ARCH_HUGE_VMALLOC_FLAG Song Liu
@ 2022-03-30 22:56 ` Song Liu
  2022-03-30 23:54   ` Thomas Gleixner
  2022-03-30 22:56 ` [PATCH bpf 4/4] bpf: use __vmalloc_node_range() with VM_TRY_HUGE_VMAP for bpf_prog_pack Song Liu
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2022-03-30 22:56 UTC (permalink / raw)
  To: linux-mm, bpf, netdev, x86
  Cc: ast, daniel, andrii, kernel-team, akpm, pmenzel,
	rick.p.edgecombe, Song Liu

As HAVE_ARCH_HUGE_VMALLOC is not ready for X86_64, enable
HAVE_ARCH_HUGE_VMALLOC_FLAG to allow bpf_prog_pack to allocate huge
pages.

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4691d3aef681..2195120c8ebb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -161,6 +161,7 @@ config X86
 	select HAVE_ALIGNED_STRUCT_PAGE		if SLUB
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_ARCH_HUGE_VMAP		if X86_64 || X86_PAE
+	select HAVE_ARCH_HUGE_VMALLOC_FLAG	if X86_64
 	select HAVE_ARCH_JUMP_LABEL
 	select HAVE_ARCH_JUMP_LABEL_RELATIVE
 	select HAVE_ARCH_KASAN			if X86_64
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH bpf 4/4] bpf: use __vmalloc_node_range() with VM_TRY_HUGE_VMAP for bpf_prog_pack
  2022-03-30 22:56 [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack Song Liu
                   ` (2 preceding siblings ...)
  2022-03-30 22:56 ` [PATCH bpf 3/4] x86: select HAVE_ARCH_HUGE_VMALLOC_FLAG for X86_64 Song Liu
@ 2022-03-30 22:56 ` Song Liu
  2022-03-31  0:00   ` Thomas Gleixner
  2022-03-31  0:04 ` [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG " Edgecombe, Rick P
  2022-03-31  5:37 ` Christoph Hellwig
  5 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2022-03-30 22:56 UTC (permalink / raw)
  To: linux-mm, bpf, netdev, x86
  Cc: ast, daniel, andrii, kernel-team, akpm, pmenzel,
	rick.p.edgecombe, Song Liu

We cannot yet savely enable HAVE_ARCH_HUGE_VMAP for all vmalloc in X86_64.
Let bpf_prog_pack to call __vmalloc_node_range() with VM_TRY_HUGE_VMAP
directly.

Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/core.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 13e9dbeeedf3..257c6457f256 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -851,13 +851,28 @@ static LIST_HEAD(pack_list);
 #define BPF_HPAGE_MASK PAGE_MASK
 #endif
 
+static void *bpf_prog_pack_vmalloc(unsigned long size)
+{
+#if defined(MODULES_VADDR)
+	unsigned long start = MODULES_VADDR;
+	unsigned long end = MODULES_END;
+#else
+	unsigned long start = VMALLOC_START;
+	unsigned long end = VMALLOC_END;
+#endif
+
+	return __vmalloc_node_range(size, PAGE_SIZE, start, end, GFP_KERNEL, PAGE_KERNEL,
+				    VM_DEFER_KMEMLEAK | VM_TRY_HUGE_VMAP,
+				    NUMA_NO_NODE, __builtin_return_address(0));
+}
+
 static size_t select_bpf_prog_pack_size(void)
 {
 	size_t size;
 	void *ptr;
 
 	size = BPF_HPAGE_SIZE * num_online_nodes();
-	ptr = module_alloc(size);
+	ptr = bpf_prog_pack_vmalloc(size);
 
 	/* Test whether we can get huge pages. If not just use PAGE_SIZE
 	 * packs.
@@ -881,7 +896,7 @@ static struct bpf_prog_pack *alloc_new_pack(void)
 		       GFP_KERNEL);
 	if (!pack)
 		return NULL;
-	pack->ptr = module_alloc(bpf_prog_pack_size);
+	pack->ptr = bpf_prog_pack_vmalloc(bpf_prog_pack_size);
 	if (!pack->ptr) {
 		kfree(pack);
 		return NULL;
@@ -970,7 +985,7 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 	if (bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0,
 				       bpf_prog_chunk_count(), 0) == 0) {
 		list_del(&pack->list);
-		module_memfree(pack->ptr);
+		vfree(pack->ptr);
 		kfree(pack);
 	}
 out:
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 2/4] vmalloc: introduce HAVE_ARCH_HUGE_VMALLOC_FLAG
  2022-03-30 22:56 ` [PATCH bpf 2/4] vmalloc: introduce HAVE_ARCH_HUGE_VMALLOC_FLAG Song Liu
@ 2022-03-30 23:40   ` Edgecombe, Rick P
  2022-03-31  0:26     ` Song Liu
  0 siblings, 1 reply; 23+ messages in thread
From: Edgecombe, Rick P @ 2022-03-30 23:40 UTC (permalink / raw)
  To: netdev, linux-mm, song, bpf, x86
  Cc: daniel, andrii, kernel-team, pmenzel, akpm, ast

On Wed, 2022-03-30 at 15:56 -0700, Song Liu wrote:
> With HAVE_ARCH_HUGE_VMALLOC_FLAG, users of __vmalloc_node_range()
> could
> use VM_TRY_HUGE_VMAP to (try to) allocate PMD_SIZE pages for
> size >= PMD_SIZE cases. Similar to HAVE_ARCH_HUGE_VMALLOC, the use
> can
> disable huge page by specifying nohugeiomap in kernel command line.
> 
> The first user of VM_TRY_HUGE_VMAP will be bpf_prog_pack.
> 
> Signed-off-by: Song Liu <song@kernel.org>
> ---
>  arch/Kconfig            |  9 +++++++++
>  include/linux/vmalloc.h |  9 +++++++--
>  mm/vmalloc.c            | 28 +++++++++++++++++++---------
>  3 files changed, 35 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 33e06966f248..23b6e92aebaa 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -864,6 +864,15 @@ config HAVE_ARCH_HUGE_VMALLOC
>  	depends on HAVE_ARCH_HUGE_VMAP
>  	bool
>  
> +#
> +# HAVE_ARCH_HUGE_VMALLOC_FLAG allows users of __vmalloc_node_range
> to allocate
> +# huge page without HAVE_ARCH_HUGE_VMALLOC. To allocate huge pages,
> the user
> +# need to call __vmalloc_node_range with VM_TRY_HUGE_VMAP.
> +#
> +config HAVE_ARCH_HUGE_VMALLOC_FLAG
> +	depends on HAVE_ARCH_HUGE_VMAP
> +	bool
> +
>  config ARCH_WANT_HUGE_PMD_SHARE
>  	bool
>  
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 3b1df7da402d..a48d0690b66f 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -35,6 +35,11 @@ struct notifier_block;		/* in
> notifier.h */
>  #define VM_DEFER_KMEMLEAK	0
>  #endif
>  
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG
> +#define VM_TRY_HUGE_VMAP	0x00001000	/* Allow for huge
> pages on HAVE_ARCH_HUGE_VMALLOC_FLAG arch's */
> +#else
> +#define VM_TRY_HUGE_VMAP	0
> +#endif
>  /* bits [20..32] reserved for arch specific ioremap internals */
>  
>  /*
> @@ -51,7 +56,7 @@ struct vm_struct {
>  	unsigned long		size;
>  	unsigned long		flags;
>  	struct page		**pages;
> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
>  	unsigned int		page_order;
>  #endif
>  	unsigned int		nr_pages;
> @@ -225,7 +230,7 @@ static inline bool is_vm_area_hugepages(const
> void *addr)
>  	 * prevents that. This only indicates the size of the physical
> page
>  	 * allocated in the vmalloc layer.
>  	 */
> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))

Since HAVE_ARCH_HUGE_VMALLOC_FLAG depends on HAVE_ARCH_HUGE_VMAP, I
don't think you need both here.

>  	return find_vm_area(addr)->page_order > 0;
>  #else
>  	return false;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index e163372d3967..179200bce285 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -46,7 +46,7 @@
>  #include "internal.h"
>  #include "pgalloc-track.h"
>  
> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))

Same as above.

>  static unsigned int __ro_after_init ioremap_max_page_shift =
> BITS_PER_LONG - 1;
>  
>  static int __init set_nohugeiomap(char *str)
> @@ -55,11 +55,11 @@ static int __init set_nohugeiomap(char *str)
>  	return 0;
>  }
>  early_param("nohugeiomap", set_nohugeiomap);
> -#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
> +#else /* CONFIG_HAVE_ARCH_HUGE_VMAP ||
> CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG */
>  static const unsigned int ioremap_max_page_shift = PAGE_SHIFT;
> -#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
> +#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP ||
> CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG*/

Same here, and for the rest below. I think having
HAVE_ARCH_HUGE_VMALLOC_FLAG depend on HAVE_ARCH_HUGE_VMAP like you did
is nice because you don't need to special logic for most of the huge
page parts. It should shrink this patch.

>  
> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || 
> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
>  static bool __ro_after_init vmap_allow_huge = true;
>  
>  static int __init set_nohugevmalloc(char *str)
> @@ -582,8 +582,9 @@ int vmap_pages_range_noflush(unsigned long addr,
> unsigned long end,
>  
>  	WARN_ON(page_shift < PAGE_SHIFT);
>  
> -	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> -			page_shift == PAGE_SHIFT)
> +	if ((!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) &&
> +	     !IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG)) ||
> +	    (page_shift == PAGE_SHIFT))
>  		return vmap_small_pages_range_noflush(addr, end, prot,
> pages);
>  
>  	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> @@ -2252,7 +2253,7 @@ static struct vm_struct *vmlist __initdata;
>  
>  static inline unsigned int vm_area_page_order(struct vm_struct *vm)
>  {
> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
>  	return vm->page_order;
>  #else
>  	return 0;
> @@ -2261,7 +2262,7 @@ static inline unsigned int
> vm_area_page_order(struct vm_struct *vm)
>  
>  static inline void set_vm_area_page_order(struct vm_struct *vm,
> unsigned int order)
>  {
> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
>  	vm->page_order = order;
>  #else
>  	BUG_ON(order != 0);
> @@ -3056,6 +3057,15 @@ static void *__vmalloc_area_node(struct
> vm_struct *area, gfp_t gfp_mask,
>  	return NULL;
>  }
>  
> +static bool vmalloc_try_huge_page(unsigned long vm_flags)
> +{
> +	if (!vmap_allow_huge || (vm_flags & VM_NO_HUGE_VMAP))
> +		return false;
> +
> +	/* VM_TRY_HUGE_VMAP only works for
> CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG */
> +	return vm_flags & VM_TRY_HUGE_VMAP;
> +}
> +

It won't return true in the case of just CONFIG_HAVE_ARCH_HUGE_VMALLOC
and vmap_allow_huge. If you have CONFIG_HAVE_ARCH_HUGE_VMALLOC, but not
CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG like powerpc, it should still try
huge pages like before.

>  /**
>   * __vmalloc_node_range - allocate virtually contiguous memory
>   * @size:		  allocation size
> @@ -3106,7 +3116,7 @@ void *__vmalloc_node_range(unsigned long size,
> unsigned long align,
>  		return NULL;
>  	}
>  
> -	if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
> +	if (vmalloc_try_huge_page(vm_flags)) {
>  		unsigned long size_per_node;
>  
>  		/*

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 1/4] x86: disable HAVE_ARCH_HUGE_VMALLOC
  2022-03-30 22:56 ` [PATCH bpf 1/4] x86: disable HAVE_ARCH_HUGE_VMALLOC Song Liu
@ 2022-03-30 23:47   ` Thomas Gleixner
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2022-03-30 23:47 UTC (permalink / raw)
  To: Song Liu, linux-mm, bpf, netdev, x86
  Cc: ast, daniel, andrii, kernel-team, akpm, pmenzel,
	rick.p.edgecombe, Song Liu

On Wed, Mar 30 2022 at 15:56, Song Liu wrote:
> We cannot savely enable HAVE_ARCH_HUGE_VMALLOC for X86_64 yet. See [1]
> and [2] for more details. Disable it for now.

This is not a proper changelog.

A changelog has to be self contained and provide all necessary
information. Links are there to provide supplementary information, but
are not the primary source of information for well documented reasons.

Aside of that, please read and follow:

      https://www.kernel.org/doc/html/latest/process/maintainer-tip.html

and update the CC list to comply with the general rules.

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 3/4] x86: select HAVE_ARCH_HUGE_VMALLOC_FLAG for X86_64
  2022-03-30 22:56 ` [PATCH bpf 3/4] x86: select HAVE_ARCH_HUGE_VMALLOC_FLAG for X86_64 Song Liu
@ 2022-03-30 23:54   ` Thomas Gleixner
  2022-03-31  0:30     ` Song Liu
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2022-03-30 23:54 UTC (permalink / raw)
  To: Song Liu, linux-mm, bpf, netdev, x86
  Cc: ast, daniel, andrii, kernel-team, akpm, pmenzel,
	rick.p.edgecombe, Song Liu

On Wed, Mar 30 2022 at 15:56, Song Liu wrote:
> As HAVE_ARCH_HUGE_VMALLOC is not ready for X86_64, enable
> HAVE_ARCH_HUGE_VMALLOC_FLAG to allow bpf_prog_pack to allocate huge
> pages.

Despite HAVE_ARCH_HUGE_VMALLOC being not ready for X86_64 enable it
nevertheless?

I assume you wanted to say something like this:

  The shortcomings of huge vmalloc allocations have been fixed in the
  memory management core code, so reenable HAVE_ARCH_HUGE_VMALLOC.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 4/4] bpf: use __vmalloc_node_range() with VM_TRY_HUGE_VMAP for bpf_prog_pack
  2022-03-30 22:56 ` [PATCH bpf 4/4] bpf: use __vmalloc_node_range() with VM_TRY_HUGE_VMAP for bpf_prog_pack Song Liu
@ 2022-03-31  0:00   ` Thomas Gleixner
  2022-03-31  0:31     ` Song Liu
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2022-03-31  0:00 UTC (permalink / raw)
  To: Song Liu, linux-mm, bpf, netdev, x86
  Cc: ast, daniel, andrii, kernel-team, akpm, pmenzel,
	rick.p.edgecombe, Song Liu

On Wed, Mar 30 2022 at 15:56, Song Liu wrote:
> We cannot yet savely enable HAVE_ARCH_HUGE_VMAP for all vmalloc in X86_64.
> Let bpf_prog_pack to call __vmalloc_node_range() with VM_TRY_HUGE_VMAP
> directly.

Again, this changelog lacks any form of reasoning and justification.

Aside of that there is absolutely nothing x86_64 specific in the patch.

You might know all the details behind this change today, but will you be
able to make sense of the above half a year from now?

Even if you can, then anyone else is left in the dark.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
  2022-03-30 22:56 [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack Song Liu
                   ` (3 preceding siblings ...)
  2022-03-30 22:56 ` [PATCH bpf 4/4] bpf: use __vmalloc_node_range() with VM_TRY_HUGE_VMAP for bpf_prog_pack Song Liu
@ 2022-03-31  0:04 ` Edgecombe, Rick P
  2022-03-31  0:46   ` Song Liu
  2022-03-31  5:37 ` Christoph Hellwig
  5 siblings, 1 reply; 23+ messages in thread
From: Edgecombe, Rick P @ 2022-03-31  0:04 UTC (permalink / raw)
  To: netdev, linux-mm, song, bpf, x86
  Cc: daniel, andrii, kernel-team, pmenzel, akpm, ast

On Wed, 2022-03-30 at 15:56 -0700, Song Liu wrote:
> [1] 
> https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/

The issues I brought up around VM_FLUSH_RESET_PERMS are not fixed in
this series. And I think the solution I proposed is kind of wonky with
respect to hibernate. So I think maybe hibernate should be fixed to not
impose restrictions on the direct map, so the wonkiness is not needed.
But then this "fixes" series becomes quite extensive.

I wonder, why not just push the patch 1 here, then re-enable this thing
when it is all properly fixed up. It looked like your code could handle
the allocation not actually getting large pages.

Another solution that would keep large pages but still need fixing up
later: Just don't use VM_FLUSH_RESET_PERMS for now. Call
set_memory_nx() and then set_memory_rw() on the module space address
before vfree(). This will clean up everything that's needed with
respect to direct map permissions. Have vmalloc warn if is sees
VM_FLUSH_RESET_PERMS and huge pages together.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 2/4] vmalloc: introduce HAVE_ARCH_HUGE_VMALLOC_FLAG
  2022-03-30 23:40   ` Edgecombe, Rick P
@ 2022-03-31  0:26     ` Song Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2022-03-31  0:26 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: netdev, linux-mm, song, bpf, x86, daniel, andrii, Kernel Team,
	pmenzel, akpm, ast



> On Mar 30, 2022, at 4:40 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Wed, 2022-03-30 at 15:56 -0700, Song Liu wrote:
>> With HAVE_ARCH_HUGE_VMALLOC_FLAG, users of __vmalloc_node_range()
>> could
>> use VM_TRY_HUGE_VMAP to (try to) allocate PMD_SIZE pages for
>> size >= PMD_SIZE cases. Similar to HAVE_ARCH_HUGE_VMALLOC, the use
>> can
>> disable huge page by specifying nohugeiomap in kernel command line.
>> 
>> The first user of VM_TRY_HUGE_VMAP will be bpf_prog_pack.
>> 
>> Signed-off-by: Song Liu <song@kernel.org>
>> ---
>> arch/Kconfig            |  9 +++++++++
>> include/linux/vmalloc.h |  9 +++++++--
>> mm/vmalloc.c            | 28 +++++++++++++++++++---------
>> 3 files changed, 35 insertions(+), 11 deletions(-)
>> 
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index 33e06966f248..23b6e92aebaa 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -864,6 +864,15 @@ config HAVE_ARCH_HUGE_VMALLOC
>> 	depends on HAVE_ARCH_HUGE_VMAP
>> 	bool
>> 
>> +#
>> +# HAVE_ARCH_HUGE_VMALLOC_FLAG allows users of __vmalloc_node_range
>> to allocate
>> +# huge page without HAVE_ARCH_HUGE_VMALLOC. To allocate huge pages,
>> the user
>> +# need to call __vmalloc_node_range with VM_TRY_HUGE_VMAP.
>> +#
>> +config HAVE_ARCH_HUGE_VMALLOC_FLAG
>> +	depends on HAVE_ARCH_HUGE_VMAP
>> +	bool
>> +
>> config ARCH_WANT_HUGE_PMD_SHARE
>> 	bool
>> 
>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>> index 3b1df7da402d..a48d0690b66f 100644
>> --- a/include/linux/vmalloc.h
>> +++ b/include/linux/vmalloc.h
>> @@ -35,6 +35,11 @@ struct notifier_block;		/* in
>> notifier.h */
>> #define VM_DEFER_KMEMLEAK	0
>> #endif
>> 
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG
>> +#define VM_TRY_HUGE_VMAP	0x00001000	/* Allow for huge
>> pages on HAVE_ARCH_HUGE_VMALLOC_FLAG arch's */
>> +#else
>> +#define VM_TRY_HUGE_VMAP	0
>> +#endif
>> /* bits [20..32] reserved for arch specific ioremap internals */
>> 
>> /*
>> @@ -51,7 +56,7 @@ struct vm_struct {
>> 	unsigned long		size;
>> 	unsigned long		flags;
>> 	struct page		**pages;
>> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
>> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
>> 	unsigned int		page_order;
>> #endif
>> 	unsigned int		nr_pages;
>> @@ -225,7 +230,7 @@ static inline bool is_vm_area_hugepages(const
>> void *addr)
>> 	 * prevents that. This only indicates the size of the physical
>> page
>> 	 * allocated in the vmalloc layer.
>> 	 */
>> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
>> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
> 
> Since HAVE_ARCH_HUGE_VMALLOC_FLAG depends on HAVE_ARCH_HUGE_VMAP, I
> don't think you need both here.

I think we still need this one (_VMALLOC || _VMALLOC_FLAG)? 
Note that this is not _VMAP || _VMALLOC_FLAG. 

> 
>> 	return find_vm_area(addr)->page_order > 0;
>> #else
>> 	return false;
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index e163372d3967..179200bce285 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -46,7 +46,7 @@
>> #include "internal.h"
>> #include "pgalloc-track.h"
>> 
>> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
> 
> Same as above.

And this one could be just CONFIG_HAVE_ARCH_HUGE_VMAP?

> 
>> static unsigned int __ro_after_init ioremap_max_page_shift =
>> BITS_PER_LONG - 1;
>> 
>> static int __init set_nohugeiomap(char *str)
>> @@ -55,11 +55,11 @@ static int __init set_nohugeiomap(char *str)
>> 	return 0;
>> }
>> early_param("nohugeiomap", set_nohugeiomap);
>> -#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
>> +#else /* CONFIG_HAVE_ARCH_HUGE_VMAP ||
>> CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG */
>> static const unsigned int ioremap_max_page_shift = PAGE_SHIFT;
>> -#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
>> +#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP ||
>> CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG*/
> 
> Same here, and for the rest below. I think having
> HAVE_ARCH_HUGE_VMALLOC_FLAG depend on HAVE_ARCH_HUGE_VMAP like you did
> is nice because you don't need to special logic for most of the huge
> page parts. It should shrink this patch.
> 
>> 
>> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
>> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || 
>> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
>> static bool __ro_after_init vmap_allow_huge = true;
>> 
>> static int __init set_nohugevmalloc(char *str)
>> @@ -582,8 +582,9 @@ int vmap_pages_range_noflush(unsigned long addr,
>> unsigned long end,
>> 
>> 	WARN_ON(page_shift < PAGE_SHIFT);
>> 
>> -	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>> -			page_shift == PAGE_SHIFT)
>> +	if ((!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) &&
>> +	     !IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG)) ||
>> +	    (page_shift == PAGE_SHIFT))
>> 		return vmap_small_pages_range_noflush(addr, end, prot,
>> pages);
>> 
>> 	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
>> @@ -2252,7 +2253,7 @@ static struct vm_struct *vmlist __initdata;
>> 
>> static inline unsigned int vm_area_page_order(struct vm_struct *vm)
>> {
>> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
>> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
>> 	return vm->page_order;
>> #else
>> 	return 0;
>> @@ -2261,7 +2262,7 @@ static inline unsigned int
>> vm_area_page_order(struct vm_struct *vm)
>> 
>> static inline void set_vm_area_page_order(struct vm_struct *vm,
>> unsigned int order)
>> {
>> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
>> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
>> 	vm->page_order = order;
>> #else
>> 	BUG_ON(order != 0);
>> @@ -3056,6 +3057,15 @@ static void *__vmalloc_area_node(struct
>> vm_struct *area, gfp_t gfp_mask,
>> 	return NULL;
>> }
>> 
>> +static bool vmalloc_try_huge_page(unsigned long vm_flags)
>> +{
>> +	if (!vmap_allow_huge || (vm_flags & VM_NO_HUGE_VMAP))
>> +		return false;
>> +
>> +	/* VM_TRY_HUGE_VMAP only works for
>> CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG */
>> +	return vm_flags & VM_TRY_HUGE_VMAP;
>> +}
>> +
> 
> It won't return true in the case of just CONFIG_HAVE_ARCH_HUGE_VMALLOC
> and vmap_allow_huge. If you have CONFIG_HAVE_ARCH_HUGE_VMALLOC, but not
> CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG like powerpc, it should still try
> huge pages like before.

Yeah, I missed this one.. Will fix in the next version. 

> 
>> /**
>>  * __vmalloc_node_range - allocate virtually contiguous memory
>>  * @size:		  allocation size
>> @@ -3106,7 +3116,7 @@ void *__vmalloc_node_range(unsigned long size,
>> unsigned long align,
>> 		return NULL;
>> 	}
>> 
>> -	if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
>> +	if (vmalloc_try_huge_page(vm_flags)) {
>> 		unsigned long size_per_node;
>> 
>> 		/*



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 3/4] x86: select HAVE_ARCH_HUGE_VMALLOC_FLAG for X86_64
  2022-03-30 23:54   ` Thomas Gleixner
@ 2022-03-31  0:30     ` Song Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2022-03-31  0:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Song Liu, linux-mm, bpf, netdev, x86, ast, daniel, andrii,
	Kernel Team, akpm, pmenzel, rick.p.edgecombe



> On Mar 30, 2022, at 4:54 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Wed, Mar 30 2022 at 15:56, Song Liu wrote:
>> As HAVE_ARCH_HUGE_VMALLOC is not ready for X86_64, enable
>> HAVE_ARCH_HUGE_VMALLOC_FLAG to allow bpf_prog_pack to allocate huge
>> pages.
> 
> Despite HAVE_ARCH_HUGE_VMALLOC being not ready for X86_64 enable it
> nevertheless?

These are two different flags:

HAVE_ARCH_HUGE_VMALLOC allows vmalloc to try to allocate huge pages for
size >= PMD_SIZE, unless it is disabled with nohugeiomap or VM_NO_HUGE_VMAP.

HAVE_ARCH_HUGE_VMALLOC_FLAG allows vmalloc to try to allocate huge pages
for size >= PMD_SIZE, only when the user specifies VM_TRY_HUGE_VMAP. 

Recommendations for better naming will be highly appreciated..

Song


> 
> I assume you wanted to say something like this:
> 
>  The shortcomings of huge vmalloc allocations have been fixed in the
>  memory management core code, so reenable HAVE_ARCH_HUGE_VMALLOC.
> 
> Thanks,
> 
>        tglx



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 4/4] bpf: use __vmalloc_node_range() with VM_TRY_HUGE_VMAP for bpf_prog_pack
  2022-03-31  0:00   ` Thomas Gleixner
@ 2022-03-31  0:31     ` Song Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2022-03-31  0:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Song Liu, Linux Memory Management List, bpf, Networking, X86 ML,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Kernel Team, Andrew Morton, Paul Menzel, rick.p.edgecombe



> On Mar 30, 2022, at 5:00 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Wed, Mar 30 2022 at 15:56, Song Liu wrote:
>> We cannot yet savely enable HAVE_ARCH_HUGE_VMAP for all vmalloc in X86_64.
>> Let bpf_prog_pack to call __vmalloc_node_range() with VM_TRY_HUGE_VMAP
>> directly.
> 
> Again, this changelog lacks any form of reasoning and justification.
> 
> Aside of that there is absolutely nothing x86_64 specific in the patch.
> 
> You might know all the details behind this change today, but will you be
> able to make sense of the above half a year from now?
> 
> Even if you can, then anyone else is left in the dark.
> 
> Thanks,
> 
>        tglx


I will provide more information in the next version.

Thanks,
Song


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
  2022-03-31  0:04 ` [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG " Edgecombe, Rick P
@ 2022-03-31  0:46   ` Song Liu
  2022-03-31 16:19     ` Edgecombe, Rick P
  0 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2022-03-31  0:46 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: netdev, linux-mm, song, bpf, x86, daniel, andrii, Kernel Team,
	pmenzel, akpm, ast



> On Mar 30, 2022, at 5:04 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Wed, 2022-03-30 at 15:56 -0700, Song Liu wrote:
>> [1] 
>> https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/
> 
> The issues I brought up around VM_FLUSH_RESET_PERMS are not fixed in
> this series. And I think the solution I proposed is kind of wonky with
> respect to hibernate. So I think maybe hibernate should be fixed to not
> impose restrictions on the direct map, so the wonkiness is not needed.
> But then this "fixes" series becomes quite extensive.
> 
> I wonder, why not just push the patch 1 here, then re-enable this thing
> when it is all properly fixed up. It looked like your code could handle
> the allocation not actually getting large pages.

Only shipping patch 1 should eliminate the issues. But that will also
reduce the benefit in iTLB efficiency (I don't know by how much yet.)

> 
> Another solution that would keep large pages but still need fixing up
> later: Just don't use VM_FLUSH_RESET_PERMS for now. Call
> set_memory_nx() and then set_memory_rw() on the module space address
> before vfree(). This will clean up everything that's needed with
> respect to direct map permissions. Have vmalloc warn if is sees
> VM_FLUSH_RESET_PERMS and huge pages together.

Do you mean we should remove set_vm_flush_reset_perms() from 
alloc_new_pack() and do set_memory_nx() and set_memory_rw() before
we call vfree() in bpf_prog_pack_free()? If this works, I would prefer
we go with this way. 

Thanks,
Song



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
  2022-03-30 22:56 [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack Song Liu
                   ` (4 preceding siblings ...)
  2022-03-31  0:04 ` [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG " Edgecombe, Rick P
@ 2022-03-31  5:37 ` Christoph Hellwig
  2022-03-31 23:59   ` Song Liu
  5 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2022-03-31  5:37 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-mm, bpf, netdev, x86, ast, daniel, andrii, kernel-team,
	akpm, pmenzel, rick.p.edgecombe

On Wed, Mar 30, 2022 at 03:56:38PM -0700, Song Liu wrote:
> We prematurely enabled HAVE_ARCH_HUGE_VMALLOC for x86_64, which could cause
> issues [1], [2].
>

Please fix the underlying issues instead of papering over them and
creating a huge maintainance burden for others.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
  2022-03-31  0:46   ` Song Liu
@ 2022-03-31 16:19     ` Edgecombe, Rick P
  0 siblings, 0 replies; 23+ messages in thread
From: Edgecombe, Rick P @ 2022-03-31 16:19 UTC (permalink / raw)
  To: songliubraving
  Cc: daniel, bpf, ast, Kernel-team, linux-mm, song, andrii, pmenzel,
	x86, akpm, netdev

On Thu, 2022-03-31 at 00:46 +0000, Song Liu wrote:
> > On Mar 30, 2022, at 5:04 PM, Edgecombe, Rick P <
> > rick.p.edgecombe@intel.com> wrote:
> > 
> > On Wed, 2022-03-30 at 15:56 -0700, Song Liu wrote:
> > > [1] 
> > > 
https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/
> > 
> > The issues I brought up around VM_FLUSH_RESET_PERMS are not fixed
> > in
> > this series. And I think the solution I proposed is kind of wonky
> > with
> > respect to hibernate. So I think maybe hibernate should be fixed to
> > not
> > impose restrictions on the direct map, so the wonkiness is not
> > needed.
> > But then this "fixes" series becomes quite extensive.
> > 
> > I wonder, why not just push the patch 1 here, then re-enable this
> > thing
> > when it is all properly fixed up. It looked like your code could
> > handle
> > the allocation not actually getting large pages.
> 
> Only shipping patch 1 should eliminate the issues. But that will also
> reduce the benefit in iTLB efficiency (I don't know by how much yet.)

Yea, it's just a matter of what order/timeline things get done in. This
change didn't get enough mm attention ahead of time. Now there are two
issues. One where the root cause is not fully clear and one that
properly needs a wider fix. Just thinking it could be nice to take some
time on it, rather than rush to finish what was already too rushed.

> 
> > 
> > Another solution that would keep large pages but still need fixing
> > up
> > later: Just don't use VM_FLUSH_RESET_PERMS for now. Call
> > set_memory_nx() and then set_memory_rw() on the module space
> > address
> > before vfree(). This will clean up everything that's needed with
> > respect to direct map permissions. Have vmalloc warn if is sees
> > VM_FLUSH_RESET_PERMS and huge pages together.
> 
> Do you mean we should remove set_vm_flush_reset_perms() from 
> alloc_new_pack() and do set_memory_nx() and set_memory_rw() before
> we call vfree() in bpf_prog_pack_free()? If this works, I would
> prefer
> we go with this way. 

I believe this would work functionally.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
  2022-03-31  5:37 ` Christoph Hellwig
@ 2022-03-31 23:59   ` Song Liu
  2022-04-01 22:22     ` Song Liu
  0 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2022-03-31 23:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Song Liu, Linux Memory Management List, bpf, Networking, X86 ML,
	Alexei Starovoitov, Daniel Borkmann, andrii, Kernel Team, akpm,
	pmenzel, rick.p.edgecombe

Hi Christoph, 

> On Mar 30, 2022, at 10:37 PM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Wed, Mar 30, 2022 at 03:56:38PM -0700, Song Liu wrote:
>> We prematurely enabled HAVE_ARCH_HUGE_VMALLOC for x86_64, which could cause
>> issues [1], [2].
>> 
> 
> Please fix the underlying issues instead of papering over them and
> creating a huge maintainance burden for others.

I agree that this set is papering over the issue. And I would like 
your recommendations here. 

The biggest problem to me is that we (or at least myself) don't know 
all the issues HAVE_ARCH_HUGE_VMALLOC will trigger on x86_64. Right 
now we have a bug report from Paul, and the warning from Rick, but
I am afraid there might be some other issues. 

How about we approach it like this:

Since it is still early in the release cycle (pre rc1), we can keep 
HAVE_ARCH_HUGE_VMALLOC on for x86_64 for now and try to fix all the 
reported issues and warnings. If things don't go very well, we can
turn HAVE_ARCH_HUGE_VMALLOC off after rc4 or rc5. 

Does this make sense?

Thanks,
Song




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
  2022-03-31 23:59   ` Song Liu
@ 2022-04-01 22:22     ` Song Liu
  2022-04-05  7:07       ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2022-04-01 22:22 UTC (permalink / raw)
  To: Christoph Hellwig, rick.p.edgecombe, Nicholas Piggin, Claudio Imbrenda
  Cc: Song Liu, Linux Memory Management List, bpf, Networking, X86 ML,
	Alexei Starovoitov, Daniel Borkmann, andrii, Kernel Team, akpm,
	pmenzel

+ Nicholas and Claudio,


> On Mar 31, 2022, at 4:59 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> Hi Christoph, 
> 
>> On Mar 30, 2022, at 10:37 PM, Christoph Hellwig <hch@infradead.org> wrote:
>> 
>> On Wed, Mar 30, 2022 at 03:56:38PM -0700, Song Liu wrote:
>>> We prematurely enabled HAVE_ARCH_HUGE_VMALLOC for x86_64, which could cause
>>> issues [1], [2].
>>> 
>> 
>> Please fix the underlying issues instead of papering over them and
>> creating a huge maintainance burden for others.

After reading the code a little more, I wonder what would be best strategy. 
IIUC, most of the kernel is not ready for huge page backed vmalloc memory.
For example, all the module_alloc cannot work with huge pages at the moment.
And the error Paul Menzel reported in drm_fb_helper.c will probably hit 
powerpc with 5.17 kernel as-is? (trace attached below) 

Right now, we have VM_NO_HUGE_VMAP to let a user to opt out of huge pages. 
However, given there are so many users of vmalloc, vzalloc, etc., we 
probably do need a flag for the user to opt-in? 

Does this make sense? Any recommendations are really appreciated. 

Thanks,
Song 




[    1.687983] BUG: Bad page state in process systemd-udevd  pfn:102e03
[    1.687992] fbcon: Taking over console
[    1.688007] page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x3 pfn:0x102e03
[    1.688011] head:(____ptrval____) order:9 compound_mapcount:0 compound_pincount:0
[    1.688013] flags: 0x2fffc000010000(head|node=0|zone=2|lastcpupid=0x3fff)
[    1.688018] raw: 002fffc000000000 ffffe815040b8001 ffffe815040b80c8 0000000000000000
[    1.688020] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[    1.688022] head: 002fffc000010000 0000000000000000 dead000000000122 0000000000000000
[    1.688023] head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[    1.688024] page dumped because: corrupted mapping in tail page
[    1.688025] Modules linked in: r8169(+) k10temp snd_pcm(+) xhci_hcd snd_timer realtek ohci_hcd ehci_pci(+) i2c_piix4 ehci_hcd radeon(+) snd sg drm_ttm_helper ttm soundcore coreboot_table acpi_cpufreq fuse ipv6 autofs4
[    1.688045] CPU: 1 PID: 151 Comm: systemd-udevd Not tainted 5.16.0-11615-gfac54e2bfb5b #319
[    1.688048] Hardware name: ASUS F2A85-M_PRO/F2A85-M_PRO, BIOS 4.16-337-gb87986e67b 03/25/2022
[    1.688050] Call Trace:
[    1.688051]  <TASK>
[    1.688053]  dump_stack_lvl+0x34/0x44
[    1.688059]  bad_page.cold+0x63/0x94
[    1.688063]  free_tail_pages_check+0xd1/0x110
[    1.688067]  ? _raw_spin_lock+0x13/0x30
[    1.688071]  free_pcp_prepare+0x251/0x2e0
[    1.688075]  free_unref_page+0x1d/0x110
[    1.688078]  __vunmap+0x28a/0x380
[    1.688082]  drm_fbdev_cleanup+0x5f/0xb0
[    1.688085]  drm_fbdev_fb_destroy+0x15/0x30
[    1.688087]  unregister_framebuffer+0x1d/0x30
[    1.688091]  drm_client_dev_unregister+0x69/0xe0
[    1.688095]  drm_dev_unregister+0x2e/0x80
[    1.688098]  drm_dev_unplug+0x21/0x40
[    1.688100]  simpledrm_remove+0x11/0x20
[    1.688103]  platform_remove+0x1f/0x40
[    1.688106]  __device_release_driver+0x17a/0x240
[    1.688109]  device_release_driver+0x24/0x30
[    1.688112]  bus_remove_device+0xd8/0x140
[    1.688115]  device_del+0x18b/0x3f0
[    1.688118]  ? _raw_spin_unlock_irqrestore+0x1b/0x30
[    1.688121]  ? try_to_wake_up+0x94/0x5b0
[    1.688124]  platform_device_del.part.0+0x13/0x70
[    1.688127]  platform_device_unregister+0x1c/0x30
[    1.688130]  drm_aperture_detach_drivers+0xa1/0xd0
[    1.688134]  drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x60
[    1.688137]  radeon_pci_probe+0x54/0xf0 [radeon]
[    1.688212]  local_pci_probe+0x45/0x80
[    1.688216]  ? pci_match_device+0xd7/0x130
[    1.688219]  pci_device_probe+0xc2/0x1d0
[    1.688223]  really_probe+0x1f5/0x3d0
[    1.688226]  __driver_probe_device+0xfe/0x180
[    1.688229]  driver_probe_device+0x1e/0x90
[    1.688232]  __driver_attach+0xc0/0x1c0
[    1.688235]  ? __device_attach_driver+0xe0/0xe0
[    1.688237]  ? __device_attach_driver+0xe0/0xe0
[    1.688239]  bus_for_each_dev+0x78/0xc0
[    1.688242]  bus_add_driver+0x149/0x1e0
[    1.688245]  driver_register+0x8f/0xe0
[    1.688248]  ? 0xffffffffc051d000
[    1.688250]  do_one_initcall+0x44/0x200
[    1.688254]  ? kmem_cache_alloc_trace+0x170/0x2c0
[    1.688257]  do_init_module+0x5c/0x260
[    1.688262]  __do_sys_finit_module+0xb4/0x120
[    1.688266]  __do_fast_syscall_32+0x6b/0xe0
[    1.688270]  do_fast_syscall_32+0x2f/0x70
[    1.688272]  entry_SYSCALL_compat_after_hwframe+0x45/0x4d
[    1.688275] RIP: 0023:0xf7e51549
[    1.688278] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 cd 0f 05 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
[    1.688281] RSP: 002b:00000000ffa1666c EFLAGS: 00200292 ORIG_RAX: 000000000000015e
[    1.688285] RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00000000f7e30e09
[    1.688287] RDX: 0000000000000000 RSI: 00000000f9a705d0 RDI: 00000000f9a6f6a0
[    1.688288] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[    1.688290] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    1.688291] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    1.688294]  </TASK>
[    1.688355] Disabling lock debugging due to kernel taint

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
  2022-04-01 22:22     ` Song Liu
@ 2022-04-05  7:07       ` Christoph Hellwig
  2022-04-05 23:54         ` Song Liu
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2022-04-05  7:07 UTC (permalink / raw)
  To: Song Liu
  Cc: Christoph Hellwig, rick.p.edgecombe, Nicholas Piggin,
	Claudio Imbrenda, Song Liu, Linux Memory Management List, bpf,
	Networking, X86 ML, Alexei Starovoitov, Daniel Borkmann, andrii,
	Kernel Team, akpm, pmenzel

On Fri, Apr 01, 2022 at 10:22:00PM +0000, Song Liu wrote:
> >> Please fix the underlying issues instead of papering over them and
> >> creating a huge maintainance burden for others.
> 
> After reading the code a little more, I wonder what would be best strategy. 
> IIUC, most of the kernel is not ready for huge page backed vmalloc memory.
> For example, all the module_alloc cannot work with huge pages at the moment.
> And the error Paul Menzel reported in drm_fb_helper.c will probably hit 
> powerpc with 5.17 kernel as-is? (trace attached below) 
> 
> Right now, we have VM_NO_HUGE_VMAP to let a user to opt out of huge pages. 
> However, given there are so many users of vmalloc, vzalloc, etc., we 
> probably do need a flag for the user to opt-in? 
> 
> Does this make sense? Any recommendations are really appreciated. 

I think there is multiple aspects here:

 - if we think that the kernel is not ready for hugepage backed vmalloc
   in general we need to disable it in powerpc for now.
 - if we think even in the longer run only some users can cope with
   hugepage backed vmalloc we need to turn it into an opt-in in
   general and not just for x86
 - there still to appear various unresolved underlying x86 specific
   issues that need to be fixed either way


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
  2022-04-05  7:07       ` Christoph Hellwig
@ 2022-04-05 23:54         ` Song Liu
  2022-04-07 19:57           ` Song Liu
  0 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2022-04-05 23:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: rick.p.edgecombe, Nicholas Piggin, Claudio Imbrenda, Song Liu,
	Linux Memory Management List, bpf, Networking, X86 ML,
	Alexei Starovoitov, Daniel Borkmann, andrii, Kernel Team, akpm,
	pmenzel



> On Apr 5, 2022, at 12:07 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Fri, Apr 01, 2022 at 10:22:00PM +0000, Song Liu wrote:
>>>> Please fix the underlying issues instead of papering over them and
>>>> creating a huge maintainance burden for others.
>> 
>> After reading the code a little more, I wonder what would be best strategy. 
>> IIUC, most of the kernel is not ready for huge page backed vmalloc memory.
>> For example, all the module_alloc cannot work with huge pages at the moment.
>> And the error Paul Menzel reported in drm_fb_helper.c will probably hit 
>> powerpc with 5.17 kernel as-is? (trace attached below) 
>> 
>> Right now, we have VM_NO_HUGE_VMAP to let a user to opt out of huge pages. 
>> However, given there are so many users of vmalloc, vzalloc, etc., we 
>> probably do need a flag for the user to opt-in? 
>> 
>> Does this make sense? Any recommendations are really appreciated. 
> 
> I think there is multiple aspects here:
> 
> - if we think that the kernel is not ready for hugepage backed vmalloc
>   in general we need to disable it in powerpc for now.

Nicholas and Claudio, 

What do you think about the status of hugepage backed vmalloc on powerpc? 
I found module_alloc and kvm_s390_pv_alloc_vm() opt-out of huge pages.
But I am not aware of users that benefit from huge pages (except vfs hash,
which was mentioned in 8abddd968a30). Does an opt-in flag (instead of 
current opt-out flag, VM_NO_HUGE_VMAP) make sense to you? 

Thanks,
Song

> - if we think even in the longer run only some users can cope with
>   hugepage backed vmalloc we need to turn it into an opt-in in
>   general and not just for x86
> - there still to appear various unresolved underlying x86 specific
>   issues that need to be fixed either way



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
  2022-04-05 23:54         ` Song Liu
@ 2022-04-07 19:57           ` Song Liu
  2022-04-08 10:08             ` Claudio Imbrenda
  0 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2022-04-07 19:57 UTC (permalink / raw)
  To: Christoph Hellwig, Claudio Imbrenda, Nicholas Piggin
  Cc: rick.p.edgecombe, Song Liu, Linux Memory Management List, bpf,
	Networking, X86 ML, Alexei Starovoitov, Daniel Borkmann, andrii,
	Kernel Team, akpm, pmenzel

Hi Nicholas and Claudio, 

> On Apr 5, 2022, at 4:54 PM, Song Liu <songliubraving@fb.com> wrote:
> 
>> On Apr 5, 2022, at 12:07 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> 
>> On Fri, Apr 01, 2022 at 10:22:00PM +0000, Song Liu wrote:
>>>>> Please fix the underlying issues instead of papering over them and
>>>>> creating a huge maintainance burden for others.
>>> 
>>> After reading the code a little more, I wonder what would be best strategy. 
>>> IIUC, most of the kernel is not ready for huge page backed vmalloc memory.
>>> For example, all the module_alloc cannot work with huge pages at the moment.
>>> And the error Paul Menzel reported in drm_fb_helper.c will probably hit 
>>> powerpc with 5.17 kernel as-is? (trace attached below) 
>>> 
>>> Right now, we have VM_NO_HUGE_VMAP to let a user to opt out of huge pages. 
>>> However, given there are so many users of vmalloc, vzalloc, etc., we 
>>> probably do need a flag for the user to opt-in? 
>>> 
>>> Does this make sense? Any recommendations are really appreciated. 
>> 
>> I think there is multiple aspects here:
>> 
>> - if we think that the kernel is not ready for hugepage backed vmalloc
>>  in general we need to disable it in powerpc for now.
> 
> Nicholas and Claudio, 
> 
> What do you think about the status of hugepage backed vmalloc on powerpc? 
> I found module_alloc and kvm_s390_pv_alloc_vm() opt-out of huge pages.
> But I am not aware of users that benefit from huge pages (except vfs hash,
> which was mentioned in 8abddd968a30). Does an opt-in flag (instead of 
> current opt-out flag, VM_NO_HUGE_VMAP) make sense to you? 

Could you please share your comments on this? Specifically, does it make 
sense to replace VM_NO_HUGE_VMAP with an opt-in flag? If we think current
opt-out flag is better approach, what would be the best practice to find 
all the cases to opt-out?

Thanks,
Song


> Thanks,
> Song
> 
>> - if we think even in the longer run only some users can cope with
>>  hugepage backed vmalloc we need to turn it into an opt-in in
>>  general and not just for x86
>> - there still to appear various unresolved underlying x86 specific
>>  issues that need to be fixed either way
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
  2022-04-07 19:57           ` Song Liu
@ 2022-04-08 10:08             ` Claudio Imbrenda
  2022-04-08 21:22               ` Song Liu
  0 siblings, 1 reply; 23+ messages in thread
From: Claudio Imbrenda @ 2022-04-08 10:08 UTC (permalink / raw)
  To: Song Liu
  Cc: Christoph Hellwig, Nicholas Piggin, rick.p.edgecombe, Song Liu,
	Linux Memory Management List, bpf, Networking, X86 ML,
	Alexei Starovoitov, Daniel Borkmann, andrii, Kernel Team, akpm,
	pmenzel

On Thu, 7 Apr 2022 19:57:25 +0000
Song Liu <songliubraving@fb.com> wrote:

> Hi Nicholas and Claudio, 
> 
> > On Apr 5, 2022, at 4:54 PM, Song Liu <songliubraving@fb.com> wrote:
> >   
> >> On Apr 5, 2022, at 12:07 AM, Christoph Hellwig <hch@infradead.org> wrote:
> >> 
> >> On Fri, Apr 01, 2022 at 10:22:00PM +0000, Song Liu wrote:  
> >>>>> Please fix the underlying issues instead of papering over them and
> >>>>> creating a huge maintainance burden for others.  
> >>> 
> >>> After reading the code a little more, I wonder what would be best strategy. 
> >>> IIUC, most of the kernel is not ready for huge page backed vmalloc memory.
> >>> For example, all the module_alloc cannot work with huge pages at the moment.
> >>> And the error Paul Menzel reported in drm_fb_helper.c will probably hit 
> >>> powerpc with 5.17 kernel as-is? (trace attached below) 
> >>> 
> >>> Right now, we have VM_NO_HUGE_VMAP to let a user to opt out of huge pages. 
> >>> However, given there are so many users of vmalloc, vzalloc, etc., we 
> >>> probably do need a flag for the user to opt-in? 
> >>> 
> >>> Does this make sense? Any recommendations are really appreciated.   
> >> 
> >> I think there is multiple aspects here:
> >> 
> >> - if we think that the kernel is not ready for hugepage backed vmalloc
> >>  in general we need to disable it in powerpc for now.  
> > 
> > Nicholas and Claudio, 
> > 
> > What do you think about the status of hugepage backed vmalloc on powerpc? 
> > I found module_alloc and kvm_s390_pv_alloc_vm() opt-out of huge pages.
> > But I am not aware of users that benefit from huge pages (except vfs hash,
> > which was mentioned in 8abddd968a30). Does an opt-in flag (instead of 
> > current opt-out flag, VM_NO_HUGE_VMAP) make sense to you?   
> 
> Could you please share your comments on this? Specifically, does it make 
> sense to replace VM_NO_HUGE_VMAP with an opt-in flag? If we think current
> opt-out flag is better approach, what would be the best practice to find 
> all the cases to opt-out?

An opt in flag would surely make sense, and it would be more backwards
compatible with existing code. That way each user can decide whether to
fix the code to allow for hugepages, if possible at all. For example,
the case you mentioned for s390 (kvm_s390_pv_alloc_vm) would not be
fixable, because of a hardware limitation (the whole area _must_ be
mapped with 4k pages)

If the consensus were to keep the current opt-put, then I guess each
user would have to check each usage of vmalloc and similar, and see if
anything breaks. To be honest, I think an opt-out would only be
possible after having the opt-in for a (long) while, when most users
would have fixed their code.

In short, I fully support opt-in.

> 
> Thanks,
> Song
> 
> 
> > Thanks,
> > Song
> >   
> >> - if we think even in the longer run only some users can cope with
> >>  hugepage backed vmalloc we need to turn it into an opt-in in
> >>  general and not just for x86
> >> - there still to appear various unresolved underlying x86 specific
> >>  issues that need to be fixed either way  
> >   
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
  2022-04-08 10:08             ` Claudio Imbrenda
@ 2022-04-08 21:22               ` Song Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2022-04-08 21:22 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: Christoph Hellwig, Nicholas Piggin, rick.p.edgecombe, Song Liu,
	Linux Memory Management List, bpf, Networking, X86 ML,
	Alexei Starovoitov, Daniel Borkmann, andrii, Kernel Team, akpm,
	pmenzel



> On Apr 8, 2022, at 3:08 AM, Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> 
> On Thu, 7 Apr 2022 19:57:25 +0000
> Song Liu <songliubraving@fb.com> wrote:
> 
>> Hi Nicholas and Claudio, 
>> 
>>> On Apr 5, 2022, at 4:54 PM, Song Liu <songliubraving@fb.com> wrote:
>>> 
>>>> On Apr 5, 2022, at 12:07 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>>> 
>>>> On Fri, Apr 01, 2022 at 10:22:00PM +0000, Song Liu wrote:  
>>>>>>> Please fix the underlying issues instead of papering over them and
>>>>>>> creating a huge maintainance burden for others.  
>>>>> 
>>>>> After reading the code a little more, I wonder what would be best strategy. 
>>>>> IIUC, most of the kernel is not ready for huge page backed vmalloc memory.
>>>>> For example, all the module_alloc cannot work with huge pages at the moment.
>>>>> And the error Paul Menzel reported in drm_fb_helper.c will probably hit 
>>>>> powerpc with 5.17 kernel as-is? (trace attached below) 
>>>>> 
>>>>> Right now, we have VM_NO_HUGE_VMAP to let a user to opt out of huge pages. 
>>>>> However, given there are so many users of vmalloc, vzalloc, etc., we 
>>>>> probably do need a flag for the user to opt-in? 
>>>>> 
>>>>> Does this make sense? Any recommendations are really appreciated.   
>>>> 
>>>> I think there is multiple aspects here:
>>>> 
>>>> - if we think that the kernel is not ready for hugepage backed vmalloc
>>>> in general we need to disable it in powerpc for now.  
>>> 
>>> Nicholas and Claudio, 
>>> 
>>> What do you think about the status of hugepage backed vmalloc on powerpc? 
>>> I found module_alloc and kvm_s390_pv_alloc_vm() opt-out of huge pages.
>>> But I am not aware of users that benefit from huge pages (except vfs hash,
>>> which was mentioned in 8abddd968a30). Does an opt-in flag (instead of 
>>> current opt-out flag, VM_NO_HUGE_VMAP) make sense to you?   
>> 
>> Could you please share your comments on this? Specifically, does it make 
>> sense to replace VM_NO_HUGE_VMAP with an opt-in flag? If we think current
>> opt-out flag is better approach, what would be the best practice to find 
>> all the cases to opt-out?
> 
> An opt in flag would surely make sense, and it would be more backwards
> compatible with existing code. That way each user can decide whether to
> fix the code to allow for hugepages, if possible at all. For example,
> the case you mentioned for s390 (kvm_s390_pv_alloc_vm) would not be
> fixable, because of a hardware limitation (the whole area _must_ be
> mapped with 4k pages)
> 
> If the consensus were to keep the current opt-put, then I guess each
> user would have to check each usage of vmalloc and similar, and see if
> anything breaks. To be honest, I think an opt-out would only be
> possible after having the opt-in for a (long) while, when most users
> would have fixed their code.
> 
> In short, I fully support opt-in.

Thanks Claudio!

I will prepare patches to replace VM_NO_HUGE_VMAP with an opt-in flag, 
and use the new flag in BPF. Please let me know any comments/suggestions
ont this direction. 

Song



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2022-04-08 21:24 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-30 22:56 [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack Song Liu
2022-03-30 22:56 ` [PATCH bpf 1/4] x86: disable HAVE_ARCH_HUGE_VMALLOC Song Liu
2022-03-30 23:47   ` Thomas Gleixner
2022-03-30 22:56 ` [PATCH bpf 2/4] vmalloc: introduce HAVE_ARCH_HUGE_VMALLOC_FLAG Song Liu
2022-03-30 23:40   ` Edgecombe, Rick P
2022-03-31  0:26     ` Song Liu
2022-03-30 22:56 ` [PATCH bpf 3/4] x86: select HAVE_ARCH_HUGE_VMALLOC_FLAG for X86_64 Song Liu
2022-03-30 23:54   ` Thomas Gleixner
2022-03-31  0:30     ` Song Liu
2022-03-30 22:56 ` [PATCH bpf 4/4] bpf: use __vmalloc_node_range() with VM_TRY_HUGE_VMAP for bpf_prog_pack Song Liu
2022-03-31  0:00   ` Thomas Gleixner
2022-03-31  0:31     ` Song Liu
2022-03-31  0:04 ` [PATCH bpf 0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG " Edgecombe, Rick P
2022-03-31  0:46   ` Song Liu
2022-03-31 16:19     ` Edgecombe, Rick P
2022-03-31  5:37 ` Christoph Hellwig
2022-03-31 23:59   ` Song Liu
2022-04-01 22:22     ` Song Liu
2022-04-05  7:07       ` Christoph Hellwig
2022-04-05 23:54         ` Song Liu
2022-04-07 19:57           ` Song Liu
2022-04-08 10:08             ` Claudio Imbrenda
2022-04-08 21:22               ` Song Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).