[RFC 00/12] module: avoid userspace pressure on unwanted allocations

linux-modules.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC 00/12] module: avoid userspace pressure on unwanted allocations
@ 2023-03-11  5:17 Luis Chamberlain
  2023-03-11  5:17 ` [RFC 01/12] module: use goto errors on check_modinfo() and layout_and_allocate() Luis Chamberlain
                   ` (12 more replies)
  0 siblings, 13 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

A long time ago we had some issues with userspace doing stupid stuff.
Well, it turns out even the kernel and do stupid stuff too, as we're
learning with the ACPI modules aliaes and that hammering tons of loads.

So add a bit of code which gets us a bit more in the defensive about
these situations.

To experiment, this also adds in-kernel alias support to see if this helps
with some larger systems.

This is all based on some old code which tried to add defensive
mechanisms the last of which was here and I had dropped the ball:

https://lore.kernel.org/all/20171208001540.23696-1-mcgrof@kernel.org/

I've only compile tested this for now. Will need to stress to test
with kmod tests 0008 and 0009 to see if there's any differences.
I'll have to re-test and re-gnuplot stuff there. But early feedback
is appreciated, hence the RFC.

David Hildenbrand had reported a while ago issues with userspace
doing insane things with allocations bringing a system down to
its knees. This is part of the motivation for this series.

I repeat, I only have compiled tested this so far.

A few Suggested-by there linger since Linus had suggested a few of
these ideas a long time ago and we just never picked them up.

Luis Chamberlain (12):
  module: use goto errors on check_modinfo() and layout_and_allocate()
  module: move get_modinfo() helpers all above
  module: rename next_string() to module_next_tag_pair()
  module: add a for_each_modinfo_entry()
  module: add debugging alias parsing support
  module: move early sanity checks into a helper
  module: move check_modinfo() early to early_mod_check()
  module: move finished_loading()
  module: extract patient module check into helper
  module: avoid allocation if module is already present and ready
  module: use list_add_tail_rcu() when adding module
  module: use aliases to find module on find_module_all()

 include/linux/module.h   |   4 +
 kernel/module/Kconfig    |  19 +++
 kernel/module/Makefile   |   1 +
 kernel/module/aliases.c  | 109 +++++++++++++
 kernel/module/internal.h |  25 +++
 kernel/module/main.c     | 324 +++++++++++++++++++++++----------------
 6 files changed, 346 insertions(+), 136 deletions(-)
 create mode 100644 kernel/module/aliases.c

-- 
2.39.1

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [RFC 01/12] module: use goto errors on check_modinfo() and layout_and_allocate()
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
@ 2023-03-11  5:17 ` Luis Chamberlain
  2023-03-11  5:17 ` [RFC 02/12] module: move get_modinfo() helpers all above Luis Chamberlain
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

Although both routines don't have much complex errors paths we
will expand on these routine in the future, setting up error lables
now for both will make subsequent changes easier to review. This
changes has no functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index b4759f1695b7..5f1473eb34e0 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1998,7 +1998,7 @@ static int check_modinfo(struct module *mod, struct load_info *info, int flags)
 
 	err = check_modinfo_livepatch(mod, info);
 	if (err)
-		return err;
+		goto err_out;
 
 	/* Set up license info based on the info section */
 	set_license(mod, get_modinfo(info, "license"));
@@ -2011,6 +2011,8 @@ static int check_modinfo(struct module *mod, struct load_info *info, int flags)
 	}
 
 	return 0;
+err_out:
+	return err;
 }
 
 static int find_module_sections(struct module *mod, struct load_info *info)
@@ -2288,12 +2290,12 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
 	err = module_frob_arch_sections(info->hdr, info->sechdrs,
 					info->secstrings, info->mod);
 	if (err < 0)
-		return ERR_PTR(err);
+		goto err_out;
 
 	err = module_enforce_rwx_sections(info->hdr, info->sechdrs,
 					  info->secstrings, info->mod);
 	if (err < 0)
-		return ERR_PTR(err);
+		goto err_out;
 
 	/* We will do a special allocation for per-cpu sections later. */
 	info->sechdrs[info->index.pcpu].sh_flags &= ~(unsigned long)SHF_ALLOC;
@@ -2327,12 +2329,14 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
 	/* Allocate and move to the final place */
 	err = move_module(info->mod, info);
 	if (err)
-		return ERR_PTR(err);
+		goto err_out;
 
 	/* Module has been copied to its final place now: return it. */
 	mod = (void *)info->sechdrs[info->index.mod].sh_addr;
 	kmemleak_load_module(mod, info);
 	return mod;
+err_out:
+	return ERR_PTR(err);
 }
 
 /* mod is no longer valid after this! */
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 02/12] module: move get_modinfo() helpers all above
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
  2023-03-11  5:17 ` [RFC 01/12] module: use goto errors on check_modinfo() and layout_and_allocate() Luis Chamberlain
@ 2023-03-11  5:17 ` Luis Chamberlain
  2023-03-11  5:17 ` [RFC 03/12] module: rename next_string() to module_next_tag_pair() Luis Chamberlain
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

Instead of forward declaring routines for get_modinfo() just move
everything up. This makes no functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 100 +++++++++++++++++++++----------------------
 1 file changed, 48 insertions(+), 52 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 5f1473eb34e0..6d6e6a6184b5 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1016,9 +1016,55 @@ int try_to_force_load(struct module *mod, const char *reason)
 #endif
 }
 
-static char *get_modinfo(const struct load_info *info, const char *tag);
+/* Parse tag=value strings from .modinfo section */
+static char *next_string(char *string, unsigned long *secsize)
+{
+	/* Skip non-zero chars */
+	while (string[0]) {
+		string++;
+		if ((*secsize)-- <= 1)
+			return NULL;
+	}
+
+	/* Skip any zero padding. */
+	while (!string[0]) {
+		string++;
+		if ((*secsize)-- <= 1)
+			return NULL;
+	}
+	return string;
+}
+
 static char *get_next_modinfo(const struct load_info *info, const char *tag,
-			      char *prev);
+			      char *prev)
+{
+	char *p;
+	unsigned int taglen = strlen(tag);
+	Elf_Shdr *infosec = &info->sechdrs[info->index.info];
+	unsigned long size = infosec->sh_size;
+
+	/*
+	 * get_modinfo() calls made before rewrite_section_headers()
+	 * must use sh_offset, as sh_addr isn't set!
+	 */
+	char *modinfo = (char *)info->hdr + infosec->sh_offset;
+
+	if (prev) {
+		size -= prev - modinfo;
+		modinfo = next_string(prev, &size);
+	}
+
+	for (p = modinfo; p; p = next_string(p, &size)) {
+		if (strncmp(p, tag, taglen) == 0 && p[taglen] == '=')
+			return p + taglen + 1;
+	}
+	return NULL;
+}
+
+static char *get_modinfo(const struct load_info *info, const char *tag)
+{
+	return get_next_modinfo(info, tag, NULL);
+}
 
 static int verify_namespace_is_imported(const struct load_info *info,
 					const struct kernel_symbol *sym,
@@ -1544,56 +1590,6 @@ static void set_license(struct module *mod, const char *license)
 	}
 }
 
-/* Parse tag=value strings from .modinfo section */
-static char *next_string(char *string, unsigned long *secsize)
-{
-	/* Skip non-zero chars */
-	while (string[0]) {
-		string++;
-		if ((*secsize)-- <= 1)
-			return NULL;
-	}
-
-	/* Skip any zero padding. */
-	while (!string[0]) {
-		string++;
-		if ((*secsize)-- <= 1)
-			return NULL;
-	}
-	return string;
-}
-
-static char *get_next_modinfo(const struct load_info *info, const char *tag,
-			      char *prev)
-{
-	char *p;
-	unsigned int taglen = strlen(tag);
-	Elf_Shdr *infosec = &info->sechdrs[info->index.info];
-	unsigned long size = infosec->sh_size;
-
-	/*
-	 * get_modinfo() calls made before rewrite_section_headers()
-	 * must use sh_offset, as sh_addr isn't set!
-	 */
-	char *modinfo = (char *)info->hdr + infosec->sh_offset;
-
-	if (prev) {
-		size -= prev - modinfo;
-		modinfo = next_string(prev, &size);
-	}
-
-	for (p = modinfo; p; p = next_string(p, &size)) {
-		if (strncmp(p, tag, taglen) == 0 && p[taglen] == '=')
-			return p + taglen + 1;
-	}
-	return NULL;
-}
-
-static char *get_modinfo(const struct load_info *info, const char *tag)
-{
-	return get_next_modinfo(info, tag, NULL);
-}
-
 static void setup_modinfo(struct module *mod, struct load_info *info)
 {
 	struct module_attribute *attr;
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 03/12] module: rename next_string() to module_next_tag_pair()
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
  2023-03-11  5:17 ` [RFC 01/12] module: use goto errors on check_modinfo() and layout_and_allocate() Luis Chamberlain
  2023-03-11  5:17 ` [RFC 02/12] module: move get_modinfo() helpers all above Luis Chamberlain
@ 2023-03-11  5:17 ` Luis Chamberlain
  2023-03-11  5:17 ` [RFC 04/12] module: add a for_each_modinfo_entry() Luis Chamberlain
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

This makes it clearer what it is doing. While at it,
make it available to other code other than main.c.
This will be used in the subsequent patch and make
the changes easier to read.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/internal.h | 2 ++
 kernel/module/main.c     | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index e3883b7d4840..1fa2328636ec 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -96,6 +96,8 @@ long module_get_offset_and_type(struct module *mod, enum mod_mem_type type,
 char *module_flags(struct module *mod, char *buf, bool show_state);
 size_t module_flags_taint(unsigned long taints, char *buf);
 
+char *module_next_tag_pair(char *string, unsigned long *secsize);
+
 static inline void module_assert_mutex_or_preempt(void)
 {
 #ifdef CONFIG_LOCKDEP
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 6d6e6a6184b5..50364707c0cd 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1017,7 +1017,7 @@ int try_to_force_load(struct module *mod, const char *reason)
 }
 
 /* Parse tag=value strings from .modinfo section */
-static char *next_string(char *string, unsigned long *secsize)
+char *module_next_tag_pair(char *string, unsigned long *secsize)
 {
 	/* Skip non-zero chars */
 	while (string[0]) {
@@ -1051,10 +1051,10 @@ static char *get_next_modinfo(const struct load_info *info, const char *tag,
 
 	if (prev) {
 		size -= prev - modinfo;
-		modinfo = next_string(prev, &size);
+		modinfo = module_next_tag_pair(prev, &size);
 	}
 
-	for (p = modinfo; p; p = next_string(p, &size)) {
+	for (p = modinfo; p; p = module_next_tag_pair(p, &size)) {
 		if (strncmp(p, tag, taglen) == 0 && p[taglen] == '=')
 			return p + taglen + 1;
 	}
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 04/12] module: add a for_each_modinfo_entry()
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (2 preceding siblings ...)
  2023-03-11  5:17 ` [RFC 03/12] module: rename next_string() to module_next_tag_pair() Luis Chamberlain
@ 2023-03-11  5:17 ` Luis Chamberlain
  2023-03-11  5:17 ` [RFC 05/12] module: add debugging alias parsing support Luis Chamberlain
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

Add a for_each_modinfo_entry() to make it easier to read and use.
This produces no functional changes but makes this code easiert
to read as we are used to with loops in the kernel and trims more
lines of code.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/internal.h | 3 +++
 kernel/module/main.c     | 5 +----
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 1fa2328636ec..6ae29bb8836f 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -98,6 +98,9 @@ size_t module_flags_taint(unsigned long taints, char *buf);
 
 char *module_next_tag_pair(char *string, unsigned long *secsize);
 
+#define for_each_modinfo_entry(entry, info, name) \
+	for (entry = get_modinfo(info, name); entry; entry = get_next_modinfo(info, name, entry))
+
 static inline void module_assert_mutex_or_preempt(void)
 {
 #ifdef CONFIG_LOCKDEP
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 50364707c0cd..3f7c8634cf06 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1075,12 +1075,9 @@ static int verify_namespace_is_imported(const struct load_info *info,
 
 	namespace = kernel_symbol_namespace(sym);
 	if (namespace && namespace[0]) {
-		imported_namespace = get_modinfo(info, "import_ns");
-		while (imported_namespace) {
+		for_each_modinfo_entry(imported_namespace, info, "import_ns") {
 			if (strcmp(namespace, imported_namespace) == 0)
 				return 0;
-			imported_namespace = get_next_modinfo(
-				info, "import_ns", imported_namespace);
 		}
 #ifdef CONFIG_MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS
 		pr_warn(
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 05/12] module: add debugging alias parsing support
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (3 preceding siblings ...)
  2023-03-11  5:17 ` [RFC 04/12] module: add a for_each_modinfo_entry() Luis Chamberlain
@ 2023-03-11  5:17 ` Luis Chamberlain
  2023-03-11  5:17 ` [RFC 06/12] module: move early sanity checks into a helper Luis Chamberlain
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

We don't have in-kernel alias parsing support as aliases are all
dealt with in userspace. There has simply been no need to have
support for processing them in kernel. We have done this under
the assumption that userspace just Does The Right Thing (TM) about
aliases and loading modules and that there is no real gain of
processing aliases in-kernel.

Obviously userspace can be buggy though, and it can lie to us. We
currently have no easy way to determine this. Parsing aliases is
an example debugging facility we can use to help with these sorts
of problems.

But there are some possible optimizations that may also be possible
and enabling support let's folks experiment with these posibilities.

We disable this by default but folks can also enable this to
experiment with features which may use aliases in-kernel. Folks
should not enable this on production kernels. It'll bloat your
loaded kernel modules a tiny bit by the size of the aliases that
exist for them once loaded.

You can debug aliase by adding to your dynamic debug:

GRUB_CMDLINE_LINUX_DEFAULT="dyndbg=\"func module_process_aliases +p;\" "

Upon boot for example here a few entries:

module ext4 num_aliases: 5
alias[0] = fs-ext4
alias[1] = ext3
alias[2] = fs-ext3
alias[3] = ext2
alias[4] = fs-ext2

module xfs num_aliases: 1
alias[0] = fs-xfs

module floppy num_aliases: 3
alias[0] = block-major-2-*
alias[1] = acpi*:PNP0700:*
alias[2] = pnp:dPNP0700*

module ata_piix num_aliases: 89
alias[0] = pci:v00008086d00008C81sv*sd*bc*sc*i*
alias[1] = pci:v00008086d00008C80sv*sd*bc*sc*i*
alias[2] = pci:v00008086d00008C89sv*sd*bc*sc*i*
alias[3] = pci:v00008086d00008C88sv*sd*bc*sc*i*
... etc ...

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/module.h   |  4 ++
 kernel/module/Kconfig    | 19 +++++++++
 kernel/module/Makefile   |  1 +
 kernel/module/aliases.c  | 92 ++++++++++++++++++++++++++++++++++++++++
 kernel/module/internal.h | 15 +++++++
 kernel/module/main.c     | 17 ++++++--
 6 files changed, 145 insertions(+), 3 deletions(-)
 create mode 100644 kernel/module/aliases.c

diff --git a/include/linux/module.h b/include/linux/module.h
index c3b357196470..aed1b43edb55 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -420,6 +420,10 @@ struct module {
 	const char *srcversion;
 	struct kobject *holders_dir;
 
+#ifdef CONFIG_MODULE_KERNEL_ALIAS
+	unsigned int num_aliases;
+	const char **aliases;
+#endif
 	/* Exported symbols */
 	const struct kernel_symbol *syms;
 	const s32 *crcs;
diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
index 424b3bc58f3f..e4ba335fa279 100644
--- a/kernel/module/Kconfig
+++ b/kernel/module/Kconfig
@@ -22,6 +22,25 @@ menuconfig MODULES
 
 if MODULES
 
+config MODULE_DEBUG
+	bool "Enable debugging information for modules"
+	default n
+	help
+	  Enables debugging of the module infrastructure. Say no unless you
+	  are debugging the module framework. Don't enable this on production.
+	  This is only for experimentation and debugging.
+
+config MODULE_KERNEL_ALIAS
+	bool "Enable in-kernel alias processing for modules"
+	default n
+	depends on MODULE_DEBUG
+	help
+	  The kernel has historically not processed aliases in-kernel since
+	  we expect userspace can do all the proper work for us. Enable this
+	  if you want to experiment processing aliases in-kernel. This will
+	  bloat your kernel modules's memory by the number of aliases each
+	  module has once loaded into the kernel.
+
 config MODULE_FORCE_LOAD
 	bool "Forced module loading"
 	default n
diff --git a/kernel/module/Makefile b/kernel/module/Makefile
index 948efea81e85..34f0db0a016b 100644
--- a/kernel/module/Makefile
+++ b/kernel/module/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SYSFS) += sysfs.o
 obj-$(CONFIG_KGDB_KDB) += kdb.o
 obj-$(CONFIG_MODVERSIONS) += version.o
 obj-$(CONFIG_MODULE_UNLOAD_TAINT_TRACKING) += tracking.o
+obj-$(CONFIG_MODULE_KERNEL_ALIAS) += aliases.o
diff --git a/kernel/module/aliases.c b/kernel/module/aliases.c
new file mode 100644
index 000000000000..2f30c9d4c765
--- /dev/null
+++ b/kernel/module/aliases.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Optional module in-kernel alias processing support.
+ *
+ * Copyright (C) 2023 Luis Chamberlain <mcgrof@kernel.org>
+ */
+
+#include <linux/module.h>
+#include "internal.h"
+
+void free_mod_aliases(struct module *mod)
+{
+	unsigned int i;
+
+	if (!mod->num_aliases)
+		return;
+
+	for (i=0; i < mod->num_aliases; i++) {
+		kfree(mod->aliases[i]);
+		mod->aliases[i] = NULL;
+	}
+
+	kfree(mod->aliases);
+	mod->aliases = NULL;
+}
+
+static int get_modinfo_tags(struct load_info *info,
+			    const char *tag,
+			    unsigned int *num_entries)
+{
+	char *p;
+	unsigned int taglen = strlen(tag);
+	Elf_Shdr *infosec = &info->sechdrs[info->index.info];
+	unsigned long size = infosec->sh_size;
+	const char *value;
+	unsigned int len, tags_size = 0;
+
+	for (p = (char *)infosec->sh_addr; p; p = module_next_tag_pair(p, &size)) {
+		if (strncmp(p, tag, taglen) == 0 && p[taglen] == '=') {
+			value = p + taglen + 1;
+			len = strlen(value);
+			if (len >=0 && len <= PAGE_SIZE) {
+				(*num_entries)++;
+				tags_size+=len;
+			}
+		}
+	}
+
+	return tags_size;
+}
+
+int module_process_aliases(struct module *mod, struct load_info *info)
+{
+	unsigned int size, i = 0, num_entries = 0;
+	char *alias;
+
+	size = get_modinfo_tags(info, "alias", &num_entries);
+	if (WARN_ON(!size))
+		return 0;
+
+	mod->aliases = kzalloc(num_entries * sizeof(char *), GFP_KERNEL);
+	if (!mod->aliases)
+		return -ENOMEM;
+
+	pr_debug("module %s num_aliases: %u\n", mod->name, num_entries);
+
+	for_each_modinfo_entry(alias, info, "alias") {
+		pr_debug("alias[%u] = %s\n", i, alias);
+		mod->aliases[i] = kasprintf(GFP_KERNEL, "%s", alias);
+		if (!mod->aliases[i])
+			goto err_free;
+		i++;
+	}
+
+	WARN_ON(i != num_entries);
+
+	mod->num_aliases = num_entries;
+
+	return 0;
+
+err_free:
+	while (i!=0) {
+		i--;
+		kfree(mod->aliases[i]);
+		mod->aliases[i] = NULL;
+	}
+
+	kfree(mod->aliases);
+	mod->aliases = NULL;
+
+	return -ENOMEM;
+}
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 6ae29bb8836f..40bb80ed21e2 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -96,6 +96,8 @@ long module_get_offset_and_type(struct module *mod, enum mod_mem_type type,
 char *module_flags(struct module *mod, char *buf, bool show_state);
 size_t module_flags_taint(unsigned long taints, char *buf);
 
+char *get_modinfo(const struct load_info *info, const char *tag);
+char *get_next_modinfo(const struct load_info *info, const char *tag, char *prev);
 char *module_next_tag_pair(char *string, unsigned long *secsize);
 
 #define for_each_modinfo_entry(entry, info, name) \
@@ -300,3 +302,16 @@ static inline int same_magic(const char *amagic, const char *bmagic, bool has_cr
 	return strcmp(amagic, bmagic) == 0;
 }
 #endif /* CONFIG_MODVERSIONS */
+
+#ifdef CONFIG_MODULE_KERNEL_ALIAS
+void free_mod_aliases(struct module *mod);
+int module_process_aliases(struct module *mod, struct load_info *info);
+#else
+static void free_mod_aliases(struct module *mod)
+{
+}
+static int module_process_aliases(struct module *mod, struct load_info *info)
+{
+	return 0;
+}
+#endif /* CONFIG_MODULE_KERNEL_ALIAS */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 3f7c8634cf06..16770942f33a 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2,6 +2,7 @@
 /*
  * Copyright (C) 2002 Richard Henderson
  * Copyright (C) 2001 Rusty Russell, 2002, 2010 Rusty Russell IBM.
+ * Copyright (C) 2023 Luis Chamberlain <mcgrof@kernel.org>
  */
 
 #define INCLUDE_VERMAGIC
@@ -1035,8 +1036,7 @@ char *module_next_tag_pair(char *string, unsigned long *secsize)
 	return string;
 }
 
-static char *get_next_modinfo(const struct load_info *info, const char *tag,
-			      char *prev)
+char *get_next_modinfo(const struct load_info *info, const char *tag, char *prev)
 {
 	char *p;
 	unsigned int taglen = strlen(tag);
@@ -1061,7 +1061,7 @@ static char *get_next_modinfo(const struct load_info *info, const char *tag,
 	return NULL;
 }
 
-static char *get_modinfo(const struct load_info *info, const char *tag)
+char *get_modinfo(const struct load_info *info, const char *tag)
 {
 	return get_next_modinfo(info, tag, NULL);
 }
@@ -1289,6 +1289,7 @@ static void free_module(struct module *mod)
 	module_arch_freeing_init(mod);
 	kfree(mod->args);
 	percpu_modfree(mod);
+	free_mod_aliases(mod);
 
 	free_mod_mem(mod);
 }
@@ -1989,6 +1990,12 @@ static int check_modinfo(struct module *mod, struct load_info *info, int flags)
 			"is unknown, you have been warned.\n", mod->name);
 	}
 
+	if (get_modinfo(info, "alias")) {
+		err = module_process_aliases(mod, info);
+		if (err)
+			goto err_out_skip_alloc;
+	}
+
 	err = check_modinfo_livepatch(mod, info);
 	if (err)
 		goto err_out;
@@ -2005,6 +2012,8 @@ static int check_modinfo(struct module *mod, struct load_info *info, int flags)
 
 	return 0;
 err_out:
+	free_mod_aliases(mod);
+err_out_skip_alloc:
 	return err;
 }
 
@@ -2329,6 +2338,7 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
 	kmemleak_load_module(mod, info);
 	return mod;
 err_out:
+	free_mod_aliases(mod);
 	return ERR_PTR(err);
 }
 
@@ -2890,6 +2900,7 @@ static int load_module(struct load_info *info, const char __user *uargs,
 	synchronize_rcu();
 	mutex_unlock(&module_mutex);
  free_module:
+	free_mod_aliases(mod);
 	/* Free lock-classes; relies on the preceding sync_rcu() */
 	for_class_mod_mem_type(type, core_data) {
 		lockdep_free_key_range(mod->mem[type].base,
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 06/12] module: move early sanity checks into a helper
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (4 preceding siblings ...)
  2023-03-11  5:17 ` [RFC 05/12] module: add debugging alias parsing support Luis Chamberlain
@ 2023-03-11  5:17 ` Luis Chamberlain
  2023-03-11  5:17 ` [RFC 07/12] module: move check_modinfo() early to early_mod_check() Luis Chamberlain
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

Move early sanity checkers for the module into a helper.
This let's us make it clear when we are working with the
local copy of the module prior to allocation.

This produces no functional changes, it just makes subsequent
changes easier to read.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 16770942f33a..32c92fb69c05 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2682,6 +2682,31 @@ static int unknown_module_param_cb(char *param, char *val, const char *modname,
 	return 0;
 }
 
+/* Module within temporary copy, this doesn't do any allocation  */
+static int early_mod_check(struct load_info *info, int flags)
+{
+	int err;
+
+	/*
+	 * Now that we know we have the correct module name, check
+	 * if it's blacklisted.
+	 */
+	if (blacklisted(info->name)) {
+		pr_err("Module %s is blacklisted\n", info->name);
+		return -EPERM;
+	}
+
+	err = rewrite_section_headers(info, flags);
+	if (err)
+		return err;
+
+	/* Check module struct version now, before we try to use module. */
+	if (!check_modstruct_version(info, info->mod))
+		return ENOEXEC;
+
+	return 0;
+}
+
 /*
  * Allocate and load the module: note that size of section 0 is always
  * zero, and we rely on this for optional sections.
@@ -2725,26 +2750,10 @@ static int load_module(struct load_info *info, const char __user *uargs,
 	if (err)
 		goto free_copy;
 
-	/*
-	 * Now that we know we have the correct module name, check
-	 * if it's blacklisted.
-	 */
-	if (blacklisted(info->name)) {
-		err = -EPERM;
-		pr_err("Module %s is blacklisted\n", info->name);
-		goto free_copy;
-	}
-
-	err = rewrite_section_headers(info, flags);
+	err = early_mod_check(info, flags);
 	if (err)
 		goto free_copy;
 
-	/* Check module struct version now, before we try to use module. */
-	if (!check_modstruct_version(info, info->mod)) {
-		err = -ENOEXEC;
-		goto free_copy;
-	}
-
 	/* Figure out module layout, and allocate all the memory. */
 	mod = layout_and_allocate(info, flags);
 	if (IS_ERR(mod)) {
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 07/12] module: move check_modinfo() early to early_mod_check()
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (5 preceding siblings ...)
  2023-03-11  5:17 ` [RFC 06/12] module: move early sanity checks into a helper Luis Chamberlain
@ 2023-03-11  5:17 ` Luis Chamberlain
  2023-03-11  5:17 ` [RFC 08/12] module: move finished_loading() Luis Chamberlain
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

This moves check_modinfo() to early_mod_check(). This
doesn't make any functional changes either, as check_modinfo()
was the first call on layout_and_allocate(), so we're just
moving it back one routine and at the end.

This let's us keep separate the checkers from the allocater.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 32c92fb69c05..e9c7eb827f0d 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2284,10 +2284,6 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
 	unsigned int ndx;
 	int err;
 
-	err = check_modinfo(info->mod, info, flags);
-	if (err)
-		return ERR_PTR(err);
-
 	/* Allow arches to frob section contents and sizes.  */
 	err = module_frob_arch_sections(info->hdr, info->sechdrs,
 					info->secstrings, info->mod);
@@ -2702,7 +2698,11 @@ static int early_mod_check(struct load_info *info, int flags)
 
 	/* Check module struct version now, before we try to use module. */
 	if (!check_modstruct_version(info, info->mod))
-		return ENOEXEC;
+		return -ENOEXEC;
+
+	err = check_modinfo(info->mod, info, flags);
+	if (err)
+		return err;
 
 	return 0;
 }
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 08/12] module: move finished_loading()
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (6 preceding siblings ...)
  2023-03-11  5:17 ` [RFC 07/12] module: move check_modinfo() early to early_mod_check() Luis Chamberlain
@ 2023-03-11  5:17 ` Luis Chamberlain
  2023-03-11  5:17 ` [RFC 09/12] module: extract patient module check into helper Luis Chamberlain
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

This has no functional change, just moves a routine earlier
as we'll make use of it next.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index e9c7eb827f0d..c3e5076c0436 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2370,27 +2370,6 @@ static int post_relocation(struct module *mod, const struct load_info *info)
 	return module_finalize(info->hdr, info->sechdrs, mod);
 }
 
-/* Is this module of this name done loading?  No locks held. */
-static bool finished_loading(const char *name)
-{
-	struct module *mod;
-	bool ret;
-
-	/*
-	 * The module_mutex should not be a heavily contended lock;
-	 * if we get the occasional sleep here, we'll go an extra iteration
-	 * in the wait_event_interruptible(), which is harmless.
-	 */
-	sched_annotate_sleep();
-	mutex_lock(&module_mutex);
-	mod = find_module_all(name, strlen(name), true);
-	ret = !mod || mod->state == MODULE_STATE_LIVE
-		|| mod->state == MODULE_STATE_GOING;
-	mutex_unlock(&module_mutex);
-
-	return ret;
-}
-
 /* Call module constructors. */
 static void do_mod_ctors(struct module *mod)
 {
@@ -2554,6 +2533,27 @@ static int may_init_module(void)
 	return 0;
 }
 
+/* Is this module of this name done loading?  No locks held. */
+static bool finished_loading(const char *name)
+{
+	struct module *mod;
+	bool ret;
+
+	/*
+	 * The module_mutex should not be a heavily contended lock;
+	 * if we get the occasional sleep here, we'll go an extra iteration
+	 * in the wait_event_interruptible(), which is harmless.
+	 */
+	sched_annotate_sleep();
+	mutex_lock(&module_mutex);
+	mod = find_module_all(name, strlen(name), true);
+	ret = !mod || mod->state == MODULE_STATE_LIVE
+		|| mod->state == MODULE_STATE_GOING;
+	mutex_unlock(&module_mutex);
+
+	return ret;
+}
+
 /*
  * We try to place it in the list now to make sure it's unique before
  * we dedicate too many resources.  In particular, temporary percpu
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 09/12] module: extract patient module check into helper
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (7 preceding siblings ...)
  2023-03-11  5:17 ` [RFC 08/12] module: move finished_loading() Luis Chamberlain
@ 2023-03-11  5:17 ` Luis Chamberlain
  2023-03-11  5:17 ` [RFC 10/12] module: avoid allocation if module is already present and ready Luis Chamberlain
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

The patient module check inside add_unformed_module() is large
enough as we need it. It is a bit hard to read too, so just
move it to a helper and do the inverse checks first to help
shift the code and make it easier to read. The new helper then
is module_patient_check_exists().

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 71 +++++++++++++++++++++++++-------------------
 1 file changed, 40 insertions(+), 31 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index c3e5076c0436..e24323e2c499 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2554,6 +2554,43 @@ static bool finished_loading(const char *name)
 	return ret;
 }
 
+/* Must be called with module_mutex held */
+static int module_patient_check_exists(const char *name)
+{
+	struct module *old;
+	int err = 0;
+
+	old = find_module_all(name, strlen(name), true);
+	if (old == NULL)
+		return 0;
+
+	if (old->state == MODULE_STATE_COMING
+	    || old->state == MODULE_STATE_UNFORMED) {
+		/* Wait in case it fails to load. */
+		mutex_unlock(&module_mutex);
+		err = wait_event_interruptible(module_wq,
+				       finished_loading(name));
+		if (err)
+			return err;
+
+		/* The module might have gone in the meantime. */
+		mutex_lock(&module_mutex);
+		old = find_module_all(name, strlen(name), true);
+	}
+
+	/*
+	 * We are here only when the same module was being loaded. Do
+	 * not try to load it again right now. It prevents long delays
+	 * caused by serialized module load failures. It might happen
+	 * when more devices of the same type trigger load of
+	 * a particular module.
+	 */
+	if (old && old->state == MODULE_STATE_LIVE)
+		return -EEXIST;
+	else
+		return -EBUSY;
+}
+
 /*
  * We try to place it in the list now to make sure it's unique before
  * we dedicate too many resources.  In particular, temporary percpu
@@ -2562,41 +2599,14 @@ static bool finished_loading(const char *name)
 static int add_unformed_module(struct module *mod)
 {
 	int err;
-	struct module *old;
 
 	mod->state = MODULE_STATE_UNFORMED;
 
 	mutex_lock(&module_mutex);
-	old = find_module_all(mod->name, strlen(mod->name), true);
-	if (old != NULL) {
-		if (old->state == MODULE_STATE_COMING
-		    || old->state == MODULE_STATE_UNFORMED) {
-			/* Wait in case it fails to load. */
-			mutex_unlock(&module_mutex);
-			err = wait_event_interruptible(module_wq,
-					       finished_loading(mod->name));
-			if (err)
-				goto out_unlocked;
-
-			/* The module might have gone in the meantime. */
-			mutex_lock(&module_mutex);
-			old = find_module_all(mod->name, strlen(mod->name),
-					      true);
-		}
-
-		/*
-		 * We are here only when the same module was being loaded. Do
-		 * not try to load it again right now. It prevents long delays
-		 * caused by serialized module load failures. It might happen
-		 * when more devices of the same type trigger load of
-		 * a particular module.
-		 */
-		if (old && old->state == MODULE_STATE_LIVE)
-			err = -EEXIST;
-		else
-			err = -EBUSY;
+	err = module_patient_check_exists(mod->name);
+	if (err)
 		goto out;
-	}
+
 	mod_update_bounds(mod);
 	list_add_rcu(&mod->list, &modules);
 	mod_tree_insert(mod);
@@ -2604,7 +2614,6 @@ static int add_unformed_module(struct module *mod)
 
 out:
 	mutex_unlock(&module_mutex);
-out_unlocked:
 	return err;
 }
 
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 10/12] module: avoid allocation if module is already present and ready
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (8 preceding siblings ...)
  2023-03-11  5:17 ` [RFC 09/12] module: extract patient module check into helper Luis Chamberlain
@ 2023-03-11  5:17 ` Luis Chamberlain
  2023-03-11  5:17 ` [RFC 11/12] module: use list_add_tail_rcu() when adding module Luis Chamberlain
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

load_module() will allocate a struct module before even checking
if the module is already loaded. This can create unecessary memory
pressure since we can easily just check if the module is already
present early with the copy of the module information from userspace
after we've validated it a bit.

This can only be an issue if a system is getting hammered with userspace
loading modules. Note that there are two ways to load modules, one is
auto-loading in-kernel and that pings back to userspace to just call
modprobe. Then userspace itself *is* supposed to check if a module
is present before loading it. But we're observing situations where
tons of the same module are in effect being loaded. In that situation
we can end up allocating memory for module requests which after
allocating memory will fail to load.

To avoid memory pressure for such stupid cases put a stop gap for them.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index e24323e2c499..909454f9616e 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2713,6 +2713,10 @@ static int early_mod_check(struct load_info *info, int flags)
 	if (err)
 		return err;

+	err = module_patient_check_exists(info->mod->name);
+	if (err)
+		return err;
+
 	return 0;
 }

-- 
2.39.1

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 11/12] module: use list_add_tail_rcu() when adding module
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (9 preceding siblings ...)
  2023-03-11  5:17 ` [RFC 10/12] module: avoid allocation if module is already present and ready Luis Chamberlain
@ 2023-03-11  5:17 ` Luis Chamberlain
  2023-03-11  5:17 ` [RFC 12/12] module: use aliases to find module on find_module_all() Luis Chamberlain
  2023-03-15 12:24 ` [RFC 00/12] module: avoid userspace pressure on unwanted allocations David Hildenbrand
  12 siblings, 0 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

Put a new module at the end of the list intead of making
new modules at the top of the list. find_module_all() start
the hunt using the first entry on the list, if we assume
that the modules which are first loaded are the most
frequently looked for modules this should provide a tiny
optimization.

This is theoretical, and could use more actual data by analzying
the impact of this patch on boot time a slew of systems using
systemd-analyze.

Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
---
 kernel/module/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 909454f9616e..bc9202b60d55 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2608,7 +2608,7 @@ static int add_unformed_module(struct module *mod)
 		goto out;
 
 	mod_update_bounds(mod);
-	list_add_rcu(&mod->list, &modules);
+	list_add_tail_rcu(&mod->list, &modules);
 	mod_tree_insert(mod);
 	err = 0;
 
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 12/12] module: use aliases to find module on find_module_all()
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (10 preceding siblings ...)
  2023-03-11  5:17 ` [RFC 11/12] module: use list_add_tail_rcu() when adding module Luis Chamberlain
@ 2023-03-11  5:17 ` Luis Chamberlain
  2023-03-15 14:43   ` Petr Pavlu
  2023-03-15 12:24 ` [RFC 00/12] module: avoid userspace pressure on unwanted allocations David Hildenbrand
  12 siblings, 1 reply; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-11  5:17 UTC (permalink / raw)
  To: linux-modules, linux-kernel, pmladek, david, petr.pavlu, prarit
  Cc: christophe.leroy, song, mcgrof, torvalds

Modules can have a series of aliases, but we don't currently use
them to check if a module is already loaded. Part of this is because
load_module() will stick to checking for already loaded modules using
the actual module name, not an alias. Its however desriable to also
check for aliases on find_module_all() for existing callers and future
callers. The curent gain to using aliases on find_module_all() will
simply be to be able to support unloading modules using the alias using
the delete_module() syscall.

You can debug this with dynamic debug:

GRUB_CMDLINE_LINUX_DEFAULT="dyndbg=\"func module_process_aliases +p; func module_name_match +p; \" "

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/aliases.c  | 17 +++++++++++++++++
 kernel/module/internal.h |  5 +++++
 kernel/module/main.c     | 24 +++++++++++++++++++++++-
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/kernel/module/aliases.c b/kernel/module/aliases.c
index 2f30c9d4c765..69518bc5169a 100644
--- a/kernel/module/aliases.c
+++ b/kernel/module/aliases.c
@@ -90,3 +90,20 @@ int module_process_aliases(struct module *mod, struct load_info *info)
 
 	return -ENOMEM;
 }
+
+bool module_name_match_aliases(struct module *mod, const char *name, size_t len)
+{
+	unsigned int i;
+	const char *alias;
+
+	for (i=0; i < mod->num_aliases; i++) {
+		alias = mod->aliases[i];
+		if (strlen(alias) == len && !memcmp(alias, name, len)) {
+			pr_debug("module %s alias matched: alias[%u] = %s\n",
+				 mod->name, i, alias);
+			return true;
+		}
+	}
+
+	return false;
+}
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 40bb80ed21e2..78aaad74f4ca 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -306,6 +306,7 @@ static inline int same_magic(const char *amagic, const char *bmagic, bool has_cr
 #ifdef CONFIG_MODULE_KERNEL_ALIAS
 void free_mod_aliases(struct module *mod);
 int module_process_aliases(struct module *mod, struct load_info *info);
+bool module_name_match_aliases(struct module *mod, const char *name, size_t len);
 #else
 static void free_mod_aliases(struct module *mod)
 {
@@ -314,4 +315,8 @@ static int module_process_aliases(struct module *mod, struct load_info *info)
 {
 	return 0;
 }
+static bool module_name_match_aliases(struct module *mod, const char *name, size_t len)
+{
+	return false;
+}
 #endif /* CONFIG_MODULE_KERNEL_ALIAS */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index bc9202b60d55..cf044329da3c 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -338,6 +338,28 @@ bool find_symbol(struct find_symbol_arg *fsa)
 	return false;
 }
 
+static bool module_name_match(struct module *mod, const char *name, size_t len)
+{
+	unsigned int i;
+	const char *alias;
+
+	if (strlen(mod->name) == len && !memcmp(mod->name, name, len))
+		return true;
+
+	return module_name_match_aliases(mod, name, len);
+
+	for (i=0; i < mod->num_aliases; i++) {
+		alias = mod->aliases[i];
+		if (strlen(alias) == len && !memcmp(alias, name, len)) {
+			pr_debug("module %s alias matched: alias[%u] = %s\n",
+				 mod->name, i, alias);
+			return true;
+		}
+	}
+
+	return false;
+}
+
 /*
  * Search for module by name: must hold module_mutex (or preempt disabled
  * for read-only access).
@@ -353,7 +375,7 @@ struct module *find_module_all(const char *name, size_t len,
 				lockdep_is_held(&module_mutex)) {
 		if (!even_unformed && mod->state == MODULE_STATE_UNFORMED)
 			continue;
-		if (strlen(mod->name) == len && !memcmp(mod->name, name, len))
+		if (module_name_match(mod, name, len))
 			return mod;
 	}
 	return NULL;
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (11 preceding siblings ...)
  2023-03-11  5:17 ` [RFC 12/12] module: use aliases to find module on find_module_all() Luis Chamberlain
@ 2023-03-15 12:24 ` David Hildenbrand
  2023-03-15 16:10   ` Luis Chamberlain
  12 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2023-03-15 12:24 UTC (permalink / raw)
  To: Luis Chamberlain, linux-modules, linux-kernel, pmladek,
	petr.pavlu, prarit
  Cc: christophe.leroy, song, torvalds

On 11.03.23 06:17, Luis Chamberlain wrote:
> A long time ago we had some issues with userspace doing stupid stuff.
> Well, it turns out even the kernel and do stupid stuff too, as we're
> learning with the ACPI modules aliaes and that hammering tons of loads.
> 
> So add a bit of code which gets us a bit more in the defensive about
> these situations.
> 
> To experiment, this also adds in-kernel alias support to see if this helps
> with some larger systems.
> 
> This is all based on some old code which tried to add defensive
> mechanisms the last of which was here and I had dropped the ball:
> 
> https://lore.kernel.org/all/20171208001540.23696-1-mcgrof@kernel.org/
> 
> I've only compile tested this for now. Will need to stress to test
> with kmod tests 0008 and 0009 to see if there's any differences.
> I'll have to re-test and re-gnuplot stuff there. But early feedback
> is appreciated, hence the RFC.
> 
> David Hildenbrand had reported a while ago issues with userspace
> doing insane things with allocations bringing a system down to
> its knees. This is part of the motivation for this series.


I'll try to grab a system where I can reproduce the issue and give your 
patches a churn.

Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 12/12] module: use aliases to find module on find_module_all()
  2023-03-11  5:17 ` [RFC 12/12] module: use aliases to find module on find_module_all() Luis Chamberlain
@ 2023-03-15 14:43   ` Petr Pavlu
  2023-03-15 16:12     ` Luis Chamberlain
  0 siblings, 1 reply; 45+ messages in thread
From: Petr Pavlu @ 2023-03-15 14:43 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-modules, linux-kernel, christophe.leroy, song, torvalds,
	pmladek, david, prarit

On 3/11/23 06:17, Luis Chamberlain wrote:
> Modules can have a series of aliases, but we don't currently use
> them to check if a module is already loaded. Part of this is because
> load_module() will stick to checking for already loaded modules using
> the actual module name, not an alias. Its however desriable to also
> check for aliases on find_module_all() for existing callers and future
> callers. The curent gain to using aliases on find_module_all() will
> simply be to be able to support unloading modules using the alias using
> the delete_module() syscall.

Different modules can have same aliases. Running
'sort modules.alias | cut -d' ' -f2 | uniq -dc' shows a list of them.
When a modprobe load of such an alias is requested, my reading is that this
new find_module_all() logic (if enabled) causes that only the first matched
module is inserted and others get recognized as duplicates, which doesn't look
right to me.

In general, I'm not sure that I understand motivation to keep track of these
aliases in the kernel. Do you have links to previous discussions that I could
perhaps read?

Thanks,
Petr

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-15 12:24 ` [RFC 00/12] module: avoid userspace pressure on unwanted allocations David Hildenbrand
@ 2023-03-15 16:10   ` Luis Chamberlain
  2023-03-15 16:41     ` David Hildenbrand
  0 siblings, 1 reply; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-15 16:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On Wed, Mar 15, 2023 at 01:24:41PM +0100, David Hildenbrand wrote:
> On 11.03.23 06:17, Luis Chamberlain wrote:
> > A long time ago we had some issues with userspace doing stupid stuff.
> > Well, it turns out even the kernel and do stupid stuff too, as we're
> > learning with the ACPI modules aliaes and that hammering tons of loads.
> > 
> > So add a bit of code which gets us a bit more in the defensive about
> > these situations.
> > 
> > To experiment, this also adds in-kernel alias support to see if this helps
> > with some larger systems.
> > 
> > This is all based on some old code which tried to add defensive
> > mechanisms the last of which was here and I had dropped the ball:
> > 
> > https://lore.kernel.org/all/20171208001540.23696-1-mcgrof@kernel.org/
> > 
> > I've only compile tested this for now. Will need to stress to test
> > with kmod tests 0008 and 0009 to see if there's any differences.
> > I'll have to re-test and re-gnuplot stuff there. But early feedback
> > is appreciated, hence the RFC.
> > 
> > David Hildenbrand had reported a while ago issues with userspace
> > doing insane things with allocations bringing a system down to
> > its knees. This is part of the motivation for this series.
> 
> 
> I'll try to grab a system where I can reproduce the issue and give your
> patches a churn.

Great, then please wait for v2 RFC as the first patch was missing an
obvious mutex grab / release, I already have some memory pressure data
that shows gains. Hope to post soon.

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 12/12] module: use aliases to find module on find_module_all()
  2023-03-15 14:43   ` Petr Pavlu
@ 2023-03-15 16:12     ` Luis Chamberlain
  0 siblings, 0 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-15 16:12 UTC (permalink / raw)
  To: Petr Pavlu
  Cc: linux-modules, linux-kernel, christophe.leroy, song, torvalds,
	pmladek, david, prarit

On Wed, Mar 15, 2023 at 03:43:54PM +0100, Petr Pavlu wrote:
> On 3/11/23 06:17, Luis Chamberlain wrote:
> > Modules can have a series of aliases, but we don't currently use
> > them to check if a module is already loaded. Part of this is because
> > load_module() will stick to checking for already loaded modules using
> > the actual module name, not an alias. Its however desriable to also
> > check for aliases on find_module_all() for existing callers and future
> > callers. The curent gain to using aliases on find_module_all() will
> > simply be to be able to support unloading modules using the alias using
> > the delete_module() syscall.
> 
> Different modules can have same aliases. Running
> 'sort modules.alias | cut -d' ' -f2 | uniq -dc' shows a list of them.

Ah then nevermind then thanks! I'll be sure to document this moving
forwward!

> When a modprobe load of such an alias is requested, my reading is that this
> new find_module_all() logic (if enabled) causes that only the first matched
> module is inserted and others get recognized as duplicates, which doesn't look
> right to me.
> 
> In general, I'm not sure that I understand motivation to keep track of these
> aliases in the kernel. Do you have links to previous discussions that I could
> perhaps read?

The idea was an old one and if it could help, but the above makes the
idea useless, and so the only value for these aliases in-kernel could be
for debugging, that's all.

Removing this will also simplify this patch series anyway.

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-15 16:10   ` Luis Chamberlain
@ 2023-03-15 16:41     ` David Hildenbrand
  2023-03-16 23:55       ` Luis Chamberlain
  0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2023-03-15 16:41 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On 15.03.23 17:10, Luis Chamberlain wrote:
> On Wed, Mar 15, 2023 at 01:24:41PM +0100, David Hildenbrand wrote:
>> On 11.03.23 06:17, Luis Chamberlain wrote:
>>> A long time ago we had some issues with userspace doing stupid stuff.
>>> Well, it turns out even the kernel and do stupid stuff too, as we're
>>> learning with the ACPI modules aliaes and that hammering tons of loads.
>>>
>>> So add a bit of code which gets us a bit more in the defensive about
>>> these situations.
>>>
>>> To experiment, this also adds in-kernel alias support to see if this helps
>>> with some larger systems.
>>>
>>> This is all based on some old code which tried to add defensive
>>> mechanisms the last of which was here and I had dropped the ball:
>>>
>>> https://lore.kernel.org/all/20171208001540.23696-1-mcgrof@kernel.org/
>>>
>>> I've only compile tested this for now. Will need to stress to test
>>> with kmod tests 0008 and 0009 to see if there's any differences.
>>> I'll have to re-test and re-gnuplot stuff there. But early feedback
>>> is appreciated, hence the RFC.
>>>
>>> David Hildenbrand had reported a while ago issues with userspace
>>> doing insane things with allocations bringing a system down to
>>> its knees. This is part of the motivation for this series.
>>
>>
>> I'll try to grab a system where I can reproduce the issue and give your
>> patches a churn.
> 
> Great, then please wait for v2 RFC as the first patch was missing an
> obvious mutex grab / release, I already have some memory pressure data
> that shows gains. Hope to post soon.

I expect to have a machine (with a crazy number of CPUs/devices) 
available in a couple of days (1-2), so no need to rush.

The original machine I was able to reproduce with is blocked for a 
little bit longer; so I hope the alternative I looked up will similarly 
trigger the issue easily.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-15 16:41     ` David Hildenbrand
@ 2023-03-16 23:55       ` Luis Chamberlain
  2023-03-16 23:56         ` Luis Chamberlain
  0 siblings, 1 reply; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-16 23:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> On 15.03.23 17:10, Luis Chamberlain wrote:
> > On Wed, Mar 15, 2023 at 01:24:41PM +0100, David Hildenbrand wrote:
> > > On 11.03.23 06:17, Luis Chamberlain wrote:
> > > > A long time ago we had some issues with userspace doing stupid stuff.
> > > > Well, it turns out even the kernel and do stupid stuff too, as we're
> > > > learning with the ACPI modules aliaes and that hammering tons of loads.
> > > > 
> > > > So add a bit of code which gets us a bit more in the defensive about
> > > > these situations.
> > > > 
> > > > To experiment, this also adds in-kernel alias support to see if this helps
> > > > with some larger systems.
> > > > 
> > > > This is all based on some old code which tried to add defensive
> > > > mechanisms the last of which was here and I had dropped the ball:
> > > > 
> > > > https://lore.kernel.org/all/20171208001540.23696-1-mcgrof@kernel.org/
> > > > 
> > > > I've only compile tested this for now. Will need to stress to test
> > > > with kmod tests 0008 and 0009 to see if there's any differences.
> > > > I'll have to re-test and re-gnuplot stuff there. But early feedback
> > > > is appreciated, hence the RFC.
> > > > 
> > > > David Hildenbrand had reported a while ago issues with userspace
> > > > doing insane things with allocations bringing a system down to
> > > > its knees. This is part of the motivation for this series.
> > > 
> > > 
> > > I'll try to grab a system where I can reproduce the issue and give your
> > > patches a churn.
> > 
> > Great, then please wait for v2 RFC as the first patch was missing an
> > obvious mutex grab / release, I already have some memory pressure data
> > that shows gains. Hope to post soon.
> 
> I expect to have a machine (with a crazy number of CPUs/devices) available
> in a couple of days (1-2), so no need to rush.
> 
> The original machine I was able to reproduce with is blocked for a little
> bit longer; so I hope the alternative I looked up will similarly trigger the
> issue easily.

OK give this a spin:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts

I'm seeing about ~86 MiB saving on the upper bound on memory usage
while hammering on kmod test 0008, and this is on a small system.

Probably won't help *that* much but am curious... if it helps somewhat.

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-16 23:55       ` Luis Chamberlain
@ 2023-03-16 23:56         ` Luis Chamberlain
  2023-03-18  0:11           ` Luis Chamberlain
  2023-03-20  9:37           ` David Hildenbrand
  0 siblings, 2 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-16 23:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> > I expect to have a machine (with a crazy number of CPUs/devices) available
> > in a couple of days (1-2), so no need to rush.
> > 
> > The original machine I was able to reproduce with is blocked for a little
> > bit longer; so I hope the alternative I looked up will similarly trigger the
> > issue easily.
> 
> OK give this a spin:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
> 
> I'm seeing about ~86 MiB saving on the upper bound on memory usage
> while hammering on kmod test 0008, and this is on a small system.
> 
> Probably won't help *that* much but am curious... if it helps somewhat.

How much cpu count BTW?

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-16 23:56         ` Luis Chamberlain
@ 2023-03-18  0:11           ` Luis Chamberlain
  2023-03-20  9:38             ` David Hildenbrand
  2023-03-20  9:37           ` David Hildenbrand
  1 sibling, 1 reply; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-18  0:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
> > On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> > > I expect to have a machine (with a crazy number of CPUs/devices) available
> > > in a couple of days (1-2), so no need to rush.
> > > 
> > > The original machine I was able to reproduce with is blocked for a little
> > > bit longer; so I hope the alternative I looked up will similarly trigger the
> > > issue easily.
> > 
> > OK give this a spin:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts

Today I am up to here:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts

The last patch really would have no justification yet at all unless it
does help your case.

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-16 23:56         ` Luis Chamberlain
  2023-03-18  0:11           ` Luis Chamberlain
@ 2023-03-20  9:37           ` David Hildenbrand
  1 sibling, 0 replies; 45+ messages in thread
From: David Hildenbrand @ 2023-03-20  9:37 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On 17.03.23 00:56, Luis Chamberlain wrote:
> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
>>> I expect to have a machine (with a crazy number of CPUs/devices) available
>>> in a couple of days (1-2), so no need to rush.
>>>
>>> The original machine I was able to reproduce with is blocked for a little
>>> bit longer; so I hope the alternative I looked up will similarly trigger the
>>> issue easily.
>>
>> OK give this a spin:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
>>
>> I'm seeing about ~86 MiB saving on the upper bound on memory usage
>> while hammering on kmod test 0008, and this is on a small system.
>>
>> Probably won't help *that* much but am curious... if it helps somewhat.
> 
> How much cpu count BTW?

220 physical ones and 440 virtual ones.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-18  0:11           ` Luis Chamberlain
@ 2023-03-20  9:38             ` David Hildenbrand
  2023-03-20 19:40               ` David Hildenbrand
  0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2023-03-20  9:38 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On 18.03.23 01:11, Luis Chamberlain wrote:
> On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
>> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>>> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
>>>> I expect to have a machine (with a crazy number of CPUs/devices) available
>>>> in a couple of days (1-2), so no need to rush.
>>>>
>>>> The original machine I was able to reproduce with is blocked for a little
>>>> bit longer; so I hope the alternative I looked up will similarly trigger the
>>>> issue easily.
>>>
>>> OK give this a spin:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
> 
> Today I am up to here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
> 
> The last patch really would have no justification yet at all unless it
> does help your case.

Still waiting on the system (the replacement system I was able to grab 
broke ...).

I'll let you know once I succeeded in reproducing + testing your fixes.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-20  9:38             ` David Hildenbrand
@ 2023-03-20 19:40               ` David Hildenbrand
  2023-03-20 21:09                 ` Luis Chamberlain
  0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2023-03-20 19:40 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On 20.03.23 10:38, David Hildenbrand wrote:
> On 18.03.23 01:11, Luis Chamberlain wrote:
>> On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
>>> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>>>> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
>>>>> I expect to have a machine (with a crazy number of CPUs/devices) available
>>>>> in a couple of days (1-2), so no need to rush.
>>>>>
>>>>> The original machine I was able to reproduce with is blocked for a little
>>>>> bit longer; so I hope the alternative I looked up will similarly trigger the
>>>>> issue easily.
>>>>
>>>> OK give this a spin:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
>>
>> Today I am up to here:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
>>
>> The last patch really would have no justification yet at all unless it
>> does help your case.
> 
> Still waiting on the system (the replacement system I was able to grab
> broke ...).
> 
> I'll let you know once I succeeded in reproducing + testing your fixes.

Okay, I have a system where I can reproduce.

Should I give

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts

from yesterday a churn?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-20 19:40               ` David Hildenbrand
@ 2023-03-20 21:09                 ` Luis Chamberlain
  2023-03-20 21:15                   ` David Hildenbrand
  0 siblings, 1 reply; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-20 21:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
> On 20.03.23 10:38, David Hildenbrand wrote:
> > On 18.03.23 01:11, Luis Chamberlain wrote:
> > > On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
> > > > On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
> > > > > On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> > > > > > I expect to have a machine (with a crazy number of CPUs/devices) available
> > > > > > in a couple of days (1-2), so no need to rush.
> > > > > > 
> > > > > > The original machine I was able to reproduce with is blocked for a little
> > > > > > bit longer; so I hope the alternative I looked up will similarly trigger the
> > > > > > issue easily.
> > > > > 
> > > > > OK give this a spin:
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
> > > 
> > > Today I am up to here:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
> > > 
> > > The last patch really would have no justification yet at all unless it
> > > does help your case.
> > 
> > Still waiting on the system (the replacement system I was able to grab
> > broke ...).
> > 
> > I'll let you know once I succeeded in reproducing + testing your fixes.
> 
> Okay, I have a system where I can reproduce.
> 
> Should I give
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
> 
> from yesterday a churn?

Yes please give that a run.

Please collect systemd-analyze given lack of any other tool to evaluate
any deltas. Can't think of anything else to gather other than seeing if
it booted.

If that boots works then try removing the last patch "module: add a
sanity check prior to allowing kernel module auto-loading" to see if
that last patch helped or was just noise. As it stands I'm not convinced
yet if it did help, if it *does* help we probably need to rethink some
finit_module() allocations things.

If you *still* can't boot the system, well, we should re-think
finit_module() allocation path a bit more too.

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-20 21:09                 ` Luis Chamberlain
@ 2023-03-20 21:15                   ` David Hildenbrand
  2023-03-20 21:23                     ` Luis Chamberlain
  0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2023-03-20 21:15 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On 20.03.23 22:09, Luis Chamberlain wrote:
> On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
>> On 20.03.23 10:38, David Hildenbrand wrote:
>>> On 18.03.23 01:11, Luis Chamberlain wrote:
>>>> On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
>>>>> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>>>>>> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
>>>>>>> I expect to have a machine (with a crazy number of CPUs/devices) available
>>>>>>> in a couple of days (1-2), so no need to rush.
>>>>>>>
>>>>>>> The original machine I was able to reproduce with is blocked for a little
>>>>>>> bit longer; so I hope the alternative I looked up will similarly trigger the
>>>>>>> issue easily.
>>>>>>
>>>>>> OK give this a spin:
>>>>>>
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
>>>>
>>>> Today I am up to here:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
>>>>
>>>> The last patch really would have no justification yet at all unless it
>>>> does help your case.
>>>
>>> Still waiting on the system (the replacement system I was able to grab
>>> broke ...).
>>>
>>> I'll let you know once I succeeded in reproducing + testing your fixes.
>>
>> Okay, I have a system where I can reproduce.
>>
>> Should I give
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
>>
>> from yesterday a churn?
> 
> Yes please give that a run.

Reproduced with v6.3.0-rc1 (on 1st try)

Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).

> 
> Please collect systemd-analyze given lack of any other tool to evaluate
> any deltas. Can't think of anything else to gather other than seeing if
> it booted.

Issue is that some services (kdump, tuned) seem to take sometimes ages 
on that system to start for some reason, and systemd-analyze refuses to 
do something reasonable while the system is still booting up.

I'll see if I can come up with some data.

> 
> If that boots works then try removing the last patch "module: add a
> sanity check prior to allowing kernel module auto-loading" to see if
> that last patch helped or was just noise. As it stands I'm not convinced
> yet if it did help, if it *does* help we probably need to rethink some
> finit_module() allocations things.

Okay, will try without the last patch tomorrow.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-20 21:15                   ` David Hildenbrand
@ 2023-03-20 21:23                     ` Luis Chamberlain
  2023-03-20 21:27                       ` Luis Chamberlain
  2023-03-21 15:11                       ` David Hildenbrand
  0 siblings, 2 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-20 21:23 UTC (permalink / raw)
  To: David Hildenbrand, Adam Manzanares
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
> On 20.03.23 22:09, Luis Chamberlain wrote:
> > On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
> > > On 20.03.23 10:38, David Hildenbrand wrote:
> > > > On 18.03.23 01:11, Luis Chamberlain wrote:
> > > > > On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
> > > > > > On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
> > > > > > > On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> > > > > > > > I expect to have a machine (with a crazy number of CPUs/devices) available
> > > > > > > > in a couple of days (1-2), so no need to rush.
> > > > > > > > 
> > > > > > > > The original machine I was able to reproduce with is blocked for a little
> > > > > > > > bit longer; so I hope the alternative I looked up will similarly trigger the
> > > > > > > > issue easily.
> > > > > > > 
> > > > > > > OK give this a spin:
> > > > > > > 
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
> > > > > 
> > > > > Today I am up to here:
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
> > > > > 
> > > > > The last patch really would have no justification yet at all unless it
> > > > > does help your case.
> > > > 
> > > > Still waiting on the system (the replacement system I was able to grab
> > > > broke ...).
> > > > 
> > > > I'll let you know once I succeeded in reproducing + testing your fixes.
> > > 
> > > Okay, I have a system where I can reproduce.
> > > 
> > > Should I give
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
> > > 
> > > from yesterday a churn?
> > 
> > Yes please give that a run.
> 
> Reproduced with v6.3.0-rc1 (on 1st try)

By reproduced, you mean it fails to boot?

> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).

Oh wow, so to clarify, it boots OK?

> > Please collect systemd-analyze given lack of any other tool to evaluate
> > any deltas. Can't think of anything else to gather other than seeing if
> > it booted.
> 
> Issue is that some services (kdump, tuned) seem to take sometimes ages on
> that system to start for some reason, 

How about disabling that?

> and systemd-analyze refuses to do
> something reasonable while the system is still booting up.

I see.

> I'll see if I can come up with some data.

Thanks!

> > If that boots works then try removing the last patch "module: add a
> > sanity check prior to allowing kernel module auto-loading" to see if
> > that last patch helped or was just noise. As it stands I'm not convinced
> > yet if it did help, if it *does* help we probably need to rethink some
> > finit_module() allocations things.
> 
> Okay, will try without the last patch tomorrow.

Thanks!!

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-20 21:23                     ` Luis Chamberlain
@ 2023-03-20 21:27                       ` Luis Chamberlain
  2023-03-21 19:32                         ` David Hildenbrand
  2023-03-21 15:11                       ` David Hildenbrand
  1 sibling, 1 reply; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-20 21:27 UTC (permalink / raw)
  To: David Hildenbrand, Adam Manzanares
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
> > Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
> 
> Oh wow, so to clarify, it boots OK?
> 

Now that we know that tree works, I'm curious also now if you can
confirm just re-ordering the patches still works (it should)

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust

And although it's *probably* just noise, but I'm very curious how much,
if any difference there is if you just revert "module: use
list_add_tail_rcu() when adding module".

The data on that commit log is pretty small as I have a low end system,
and I'm not yet done beating the hell out of a system with stress-ng,
but getting some data froma  pretty large system would be great.
Specially if this series seems to prove fixing boot on them.

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-20 21:23                     ` Luis Chamberlain
  2023-03-20 21:27                       ` Luis Chamberlain
@ 2023-03-21 15:11                       ` David Hildenbrand
  2023-03-21 16:52                         ` Luis Chamberlain
  1 sibling, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2023-03-21 15:11 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On 20.03.23 22:23, Luis Chamberlain wrote:
> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>> On 20.03.23 22:09, Luis Chamberlain wrote:
>>> On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
>>>> On 20.03.23 10:38, David Hildenbrand wrote:
>>>>> On 18.03.23 01:11, Luis Chamberlain wrote:
>>>>>> On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
>>>>>>> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>>>>>>>> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
>>>>>>>>> I expect to have a machine (with a crazy number of CPUs/devices) available
>>>>>>>>> in a couple of days (1-2), so no need to rush.
>>>>>>>>>
>>>>>>>>> The original machine I was able to reproduce with is blocked for a little
>>>>>>>>> bit longer; so I hope the alternative I looked up will similarly trigger the
>>>>>>>>> issue easily.
>>>>>>>>
>>>>>>>> OK give this a spin:
>>>>>>>>
>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
>>>>>>
>>>>>> Today I am up to here:
>>>>>>
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
>>>>>>
>>>>>> The last patch really would have no justification yet at all unless it
>>>>>> does help your case.
>>>>>
>>>>> Still waiting on the system (the replacement system I was able to grab
>>>>> broke ...).
>>>>>
>>>>> I'll let you know once I succeeded in reproducing + testing your fixes.
>>>>
>>>> Okay, I have a system where I can reproduce.
>>>>
>>>> Should I give
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
>>>>
>>>> from yesterday a churn?
>>>
>>> Yes please give that a run.
>>
>> Reproduced with v6.3.0-rc1 (on 1st try)
> 
> By reproduced, you mean it fails to boot?

It boots but we get vmap allocation warnings, because the ~440 CPUs 
manage to completely exhaust the module vmap area due to KASAN.

> 
>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
> 
> Oh wow, so to clarify, it boots OK?

It boots and I don't get the vmap allocation warnings.

> 
>>> Please collect systemd-analyze given lack of any other tool to evaluate
>>> any deltas. Can't think of anything else to gather other than seeing if
>>> it booted.
>>
>> Issue is that some services (kdump, tuned) seem to take sometimes ages on
>> that system to start for some reason,
> 
> How about disabling that?

It seems to be random services. On my debug kernel with KASAN everything 
is just super slow. I'll try to measure on a !debug kernel.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-21 15:11                       ` David Hildenbrand
@ 2023-03-21 16:52                         ` Luis Chamberlain
  2023-03-21 17:01                           ` David Hildenbrand
  0 siblings, 1 reply; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-21 16:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Adam Manzanares, linux-modules, linux-kernel, pmladek,
	petr.pavlu, prarit, christophe.leroy, song, torvalds

On Tue, Mar 21, 2023 at 04:11:27PM +0100, David Hildenbrand wrote:
> On 20.03.23 22:23, Luis Chamberlain wrote:
> > On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
> > > On 20.03.23 22:09, Luis Chamberlain wrote:
> > > > On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
> > > > > On 20.03.23 10:38, David Hildenbrand wrote:
> > > > > > On 18.03.23 01:11, Luis Chamberlain wrote:
> > > > > > > On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
> > > > > > > > On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
> > > > > > > > > On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> > > > > > > > > > I expect to have a machine (with a crazy number of CPUs/devices) available
> > > > > > > > > > in a couple of days (1-2), so no need to rush.
> > > > > > > > > > 
> > > > > > > > > > The original machine I was able to reproduce with is blocked for a little
> > > > > > > > > > bit longer; so I hope the alternative I looked up will similarly trigger the
> > > > > > > > > > issue easily.
> > > > > > > > > 
> > > > > > > > > OK give this a spin:
> > > > > > > > > 
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
> > > > > > > 
> > > > > > > Today I am up to here:
> > > > > > > 
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
> > > > > > > 
> > > > > > > The last patch really would have no justification yet at all unless it
> > > > > > > does help your case.
> > > > > > 
> > > > > > Still waiting on the system (the replacement system I was able to grab
> > > > > > broke ...).
> > > > > > 
> > > > > > I'll let you know once I succeeded in reproducing + testing your fixes.
> > > > > 
> > > > > Okay, I have a system where I can reproduce.
> > > > > 
> > > > > Should I give
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
> > > > > 
> > > > > from yesterday a churn?
> > > > 
> > > > Yes please give that a run.
> > > 
> > > Reproduced with v6.3.0-rc1 (on 1st try)
> > 
> > By reproduced, you mean it fails to boot?
> 
> It boots but we get vmap allocation warnings, because the ~440 CPUs manage
> to completely exhaust the module vmap area due to KASAN.

Thanks, can you post a trace?

> > > Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
> > 
> > Oh wow, so to clarify, it boots OK?
> 
> It boots and I don't get the vmap allocation warnings.

Wonderful!

Now to zero if all commits are required, and measure small value-add 
for each of them. That is the hard part and thanks for any time you
can help dedicate towards that.

There's really only 3 functional changes we need to measure:

In the 20230319-module-alloc-opts-adjust branch they are left towards
the end, in history, the top being the most recent commit:

600f1f769d06 module: add a sanity check prior to allowing kernel module auto-loading
680f2c1fff2d module: use list_add_tail_rcu() when adding module
6023a6c7d98c module: avoid allocation if module is already present and ready

My guess is the majority of the fix comes from the last commit, so
commit 6023a6c7d98c ("module: avoid allocation if module is already
present and ready").

The rcu one probably doesn't help much but I think it makes sense while
we're at it.

The last one is the one I'm less convinced makes sense. But *if* it does
help, it means such finit_module() situations / stresses are much larger
than we ever anticipated.

> > > > Please collect systemd-analyze given lack of any other tool to evaluate
> > > > any deltas. Can't think of anything else to gather other than seeing if
> > > > it booted.
> > > 
> > > Issue is that some services (kdump, tuned) seem to take sometimes ages on
> > > that system to start for some reason,
> > 
> > How about disabling that?
> 
> It seems to be random services. On my debug kernel with KASAN everything is
> just super slow. I'll try to measure on a !debug kernel.

Got it thanks!

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-21 16:52                         ` Luis Chamberlain
@ 2023-03-21 17:01                           ` David Hildenbrand
  0 siblings, 0 replies; 45+ messages in thread
From: David Hildenbrand @ 2023-03-21 17:01 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Adam Manzanares, linux-modules, linux-kernel, pmladek,
	petr.pavlu, prarit, christophe.leroy, song, torvalds

On 21.03.23 17:52, Luis Chamberlain wrote:
> On Tue, Mar 21, 2023 at 04:11:27PM +0100, David Hildenbrand wrote:
>> On 20.03.23 22:23, Luis Chamberlain wrote:
>>> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>>>> On 20.03.23 22:09, Luis Chamberlain wrote:
>>>>> On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
>>>>>> On 20.03.23 10:38, David Hildenbrand wrote:
>>>>>>> On 18.03.23 01:11, Luis Chamberlain wrote:
>>>>>>>> On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
>>>>>>>>> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>>>>>>>>>> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
>>>>>>>>>>> I expect to have a machine (with a crazy number of CPUs/devices) available
>>>>>>>>>>> in a couple of days (1-2), so no need to rush.
>>>>>>>>>>>
>>>>>>>>>>> The original machine I was able to reproduce with is blocked for a little
>>>>>>>>>>> bit longer; so I hope the alternative I looked up will similarly trigger the
>>>>>>>>>>> issue easily.
>>>>>>>>>>
>>>>>>>>>> OK give this a spin:
>>>>>>>>>>
>>>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
>>>>>>>>
>>>>>>>> Today I am up to here:
>>>>>>>>
>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
>>>>>>>>
>>>>>>>> The last patch really would have no justification yet at all unless it
>>>>>>>> does help your case.
>>>>>>>
>>>>>>> Still waiting on the system (the replacement system I was able to grab
>>>>>>> broke ...).
>>>>>>>
>>>>>>> I'll let you know once I succeeded in reproducing + testing your fixes.
>>>>>>
>>>>>> Okay, I have a system where I can reproduce.
>>>>>>
>>>>>> Should I give
>>>>>>
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
>>>>>>
>>>>>> from yesterday a churn?
>>>>>
>>>>> Yes please give that a run.
>>>>
>>>> Reproduced with v6.3.0-rc1 (on 1st try)
>>>
>>> By reproduced, you mean it fails to boot?
>>
>> It boots but we get vmap allocation warnings, because the ~440 CPUs manage
>> to completely exhaust the module vmap area due to KASAN.
> 
> Thanks, can you post a trace?

See

https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

where I also describe the issue in more detail.

> 
>>>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
>>>
>>> Oh wow, so to clarify, it boots OK?
>>
>> It boots and I don't get the vmap allocation warnings.
> 
> Wonderful!

I'm still compiling/testing the debug kernels 
(20230319-module-alloc-opts-adjust +  single revert of the rcu patch). I 
should have some systemd-analyze results for !debug kernels later.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-20 21:27                       ` Luis Chamberlain
@ 2023-03-21 19:32                         ` David Hildenbrand
  2023-03-24  9:27                           ` David Hildenbrand
  0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2023-03-21 19:32 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On 20.03.23 22:27, Luis Chamberlain wrote:
> On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
>> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
>>
>> Oh wow, so to clarify, it boots OK?
>>
> 
> Now that we know that tree works, I'm curious also now if you can
> confirm just re-ordering the patches still works (it should)
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust
> 

So far no vmap errors booting the debug/kasan kernel (2 tries).

> And although it's *probably* just noise, but I'm very curious how much,
> if any difference there is if you just revert "module: use
> list_add_tail_rcu() when adding module".

Dito, no vmap errors booting the debug/kasan kernel (2 tries).

> 
> The data on that commit log is pretty small as I have a low end system,
> and I'm not yet done beating the hell out of a system with stress-ng,
> but getting some data froma  pretty large system would be great.
> Specially if this series seems to prove fixing boot on them.

2x booting RHEL9.1 on a !debug kernel. Something went wrong with kdump in 2 runs (think I
had to delete the kdump initrd to make space for another kernel), but we can just mostly
ignore that. I wanted to rerun with kdump disabled completely, but I'm out of time for today.


1) v6.3-rc1:

#1

Startup finished in 25.354s (kernel) + 7.662s (initrd) + 1min 8.805s (userspace) = 1min 41.822s
multi-user.target reached after 29.186s in userspace

47.178s kdump.service
14.898s tuned.service
11.394s chrony-wait.service
  7.486s systemd-udev-settle.service
  7.334s NetworkManager-wait-online.service
  2.908s initrd-switch-root.service
  2.451s smartd.service
  2.316s dracut-initqueue.service
  2.057s polkit.service
  1.290s NetworkManager.service
  1.046s cups.service
  ...

#2

Startup finished in 25.375s (kernel) + 7.497s (initrd) + 29.428s (userspace) = 1min 2.301s
multi-user.target reached after 29.392s in userspace

14.552s tuned.service
  9.410s chrony-wait.service
  8.126s systemd-udev-settle.service
  7.502s NetworkManager-wait-online.service
  2.871s initrd-switch-root.service
  2.401s kdump.service
  2.297s polkit.service
  2.116s dracut-initqueue.service
  2.027s smartd.service
  1.262s NetworkManager.service
  1.102s cups.service
  1.011s sshd.service
  ...


2) 20230319-module-alloc-opts-adjust + revert of list_add_tail_rcu():

#1

Startup finished in 25.441s (kernel) + 7.285s (initrd) + 1min 7.644s (userspace) = 1min 40.371s
multi-user.target reached after 27.213s in userspace

47.232s kdump.service
14.235s tuned.service
  8.220s chrony-wait.service
  7.444s NetworkManager-wait-online.service
  5.986s systemd-udev-settle.service
  2.881s initrd-switch-root.service
  2.236s smartd.service
  1.899s dracut-initqueue.service
  1.812s polkit.service
  1.554s NetworkManager.service
  1.140s ModemManager.service
  ...

#2

Startup finished in 25.377s (kernel) + 7.271s (initrd) + 28.247s (userspace) = 1min 897ms
multi-user.target reached after 28.210s in userspace

15.435s tuned.service
11.365s chrony-wait.service
  7.512s NetworkManager-wait-online.service
  5.962s systemd-udev-settle.service
  2.889s initrd-switch-root.service
  2.846s smartd.service
  2.819s kdump.service
  2.228s polkit.service
  1.872s dracut-initqueue.service
  1.312s NetworkManager.service
  1.152s ModemManager.service
  1.011s sshd.service
  ...

3) 20230319-module-alloc-opts-adjust:


#1

Startup finished in 25.320s (kernel) + 7.192s (initrd) + 1min 6.511s (userspace) = 1min 39.024s
multi-user.target reached after 28.205s in userspace

46.307s kdump.service
14.199s tuned.service
13.358s chrony-wait.service
  7.468s NetworkManager-wait-online.service
  6.329s systemd-udev-settle.service
  2.910s initrd-switch-root.service
  2.752s smartd.service
  2.142s polkit.service
  1.909s dracut-initqueue.service
  1.210s NetworkManager.service
  1.041s ModemManager.service
   925ms sshd.service
  ...

#2

Startup finished in 25.368s (kernel) + 7.303s (initrd) + 1min 6.897s (userspace) = 1min 39.569s
multi-user.target reached after 27.326s in userspace

45.626s kdump.service
14.707s tuned.service
13.246s chrony-wait.service
  7.428s NetworkManager-wait-online.service
  6.507s systemd-udev-settle.service
  3.038s initrd-switch-root.service
  3.001s smartd.service
  2.057s polkit.service
  1.746s dracut-initqueue.service
  1.221s NetworkManager.service
  ...


I think we primarily only care about systemd-udev-settle.service.

That is fastest without the rcu patch (~6s), compared to with the rcu
patch (~6.5s) and with stock (~7.5s -- 8s).

Looks like dracut-initqueue also might be a bit faster with your changes, but
maybe it's mostly noise (would have to do more runs).

So maybe drop that rcu patch? But of course, there could be other scenarios where it's
helpful ...


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-21 19:32                         ` David Hildenbrand
@ 2023-03-24  9:27                           ` David Hildenbrand
  2023-03-24 17:54                             ` Luis Chamberlain
  0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2023-03-24  9:27 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds

On 21.03.23 20:32, David Hildenbrand wrote:
> On 20.03.23 22:27, Luis Chamberlain wrote:
>> On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
>>> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>>>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
>>>
>>> Oh wow, so to clarify, it boots OK?
>>>
>>
>> Now that we know that tree works, I'm curious also now if you can
>> confirm just re-ordering the patches still works (it should)
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust
>>
> 
> So far no vmap errors booting the debug/kasan kernel (2 tries).
> 
>> And although it's *probably* just noise, but I'm very curious how much,
>> if any difference there is if you just revert "module: use
>> list_add_tail_rcu() when adding module".
> 
> Dito, no vmap errors booting the debug/kasan kernel (2 tries).
> 
>>
>> The data on that commit log is pretty small as I have a low end system,
>> and I'm not yet done beating the hell out of a system with stress-ng,
>> but getting some data froma  pretty large system would be great.
>> Specially if this series seems to prove fixing boot on them.
> 
> 2x booting RHEL9.1 on a !debug kernel. Something went wrong with kdump in 2 runs (think I
> had to delete the kdump initrd to make space for another kernel), but we can just mostly
> ignore that. I wanted to rerun with kdump disabled completely, but I'm out of time for today.
> 
> 
> 1) v6.3-rc1:
> 
> #1
> 
> Startup finished in 25.354s (kernel) + 7.662s (initrd) + 1min 8.805s (userspace) = 1min 41.822s
> multi-user.target reached after 29.186s in userspace
> 
> 47.178s kdump.service
> 14.898s tuned.service
> 11.394s chrony-wait.service
>    7.486s systemd-udev-settle.service
>    7.334s NetworkManager-wait-online.service
>    2.908s initrd-switch-root.service
>    2.451s smartd.service
>    2.316s dracut-initqueue.service
>    2.057s polkit.service
>    1.290s NetworkManager.service
>    1.046s cups.service
>    ...
> 
> #2
> 
> Startup finished in 25.375s (kernel) + 7.497s (initrd) + 29.428s (userspace) = 1min 2.301s
> multi-user.target reached after 29.392s in userspace
> 
> 14.552s tuned.service
>    9.410s chrony-wait.service
>    8.126s systemd-udev-settle.service
>    7.502s NetworkManager-wait-online.service
>    2.871s initrd-switch-root.service
>    2.401s kdump.service
>    2.297s polkit.service
>    2.116s dracut-initqueue.service
>    2.027s smartd.service
>    1.262s NetworkManager.service
>    1.102s cups.service
>    1.011s sshd.service
>    ...
> 
> 
> 2) 20230319-module-alloc-opts-adjust + revert of list_add_tail_rcu():
> 
> #1
> 
> Startup finished in 25.441s (kernel) + 7.285s (initrd) + 1min 7.644s (userspace) = 1min 40.371s
> multi-user.target reached after 27.213s in userspace
> 
> 47.232s kdump.service
> 14.235s tuned.service
>    8.220s chrony-wait.service
>    7.444s NetworkManager-wait-online.service
>    5.986s systemd-udev-settle.service
>    2.881s initrd-switch-root.service
>    2.236s smartd.service
>    1.899s dracut-initqueue.service
>    1.812s polkit.service
>    1.554s NetworkManager.service
>    1.140s ModemManager.service
>    ...
> 
> #2
> 
> Startup finished in 25.377s (kernel) + 7.271s (initrd) + 28.247s (userspace) = 1min 897ms
> multi-user.target reached after 28.210s in userspace
> 
> 15.435s tuned.service
> 11.365s chrony-wait.service
>    7.512s NetworkManager-wait-online.service
>    5.962s systemd-udev-settle.service
>    2.889s initrd-switch-root.service
>    2.846s smartd.service
>    2.819s kdump.service
>    2.228s polkit.service
>    1.872s dracut-initqueue.service
>    1.312s NetworkManager.service
>    1.152s ModemManager.service
>    1.011s sshd.service
>    ...
> 
> 3) 20230319-module-alloc-opts-adjust:
> 
> 
> #1
> 
> Startup finished in 25.320s (kernel) + 7.192s (initrd) + 1min 6.511s (userspace) = 1min 39.024s
> multi-user.target reached after 28.205s in userspace
> 
> 46.307s kdump.service
> 14.199s tuned.service
> 13.358s chrony-wait.service
>    7.468s NetworkManager-wait-online.service
>    6.329s systemd-udev-settle.service
>    2.910s initrd-switch-root.service
>    2.752s smartd.service
>    2.142s polkit.service
>    1.909s dracut-initqueue.service
>    1.210s NetworkManager.service
>    1.041s ModemManager.service
>     925ms sshd.service
>    ...
> 
> #2
> 
> Startup finished in 25.368s (kernel) + 7.303s (initrd) + 1min 6.897s (userspace) = 1min 39.569s
> multi-user.target reached after 27.326s in userspace
> 
> 45.626s kdump.service
> 14.707s tuned.service
> 13.246s chrony-wait.service
>    7.428s NetworkManager-wait-online.service
>    6.507s systemd-udev-settle.service
>    3.038s initrd-switch-root.service
>    3.001s smartd.service
>    2.057s polkit.service
>    1.746s dracut-initqueue.service
>    1.221s NetworkManager.service
>    ...
> 
> 
> I think we primarily only care about systemd-udev-settle.service.
> 
> That is fastest without the rcu patch (~6s), compared to with the rcu
> patch (~6.5s) and with stock (~7.5s -- 8s).
> 
> Looks like dracut-initqueue also might be a bit faster with your changes, but
> maybe it's mostly noise (would have to do more runs).
> 
> So maybe drop that rcu patch? But of course, there could be other scenarios where it's
> helpful ...

Are there any other things you would like me to measure/test? I'll have 
to hand back that test machine soonish.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-24  9:27                           ` David Hildenbrand
@ 2023-03-24 17:54                             ` Luis Chamberlain
  2023-03-24 19:11                               ` Linus Torvalds
  2023-03-28  3:44                               ` David Hildenbrand
  0 siblings, 2 replies; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-24 17:54 UTC (permalink / raw)
  To: David Hildenbrand, Kees Cook
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds, dave, fan.ni, vincent.fu,
	a.manzanares, colin.i.king

On Fri, Mar 24, 2023 at 10:27:14AM +0100, David Hildenbrand wrote:
> On 21.03.23 20:32, David Hildenbrand wrote:
> > On 20.03.23 22:27, Luis Chamberlain wrote:
> > > On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
> > > > On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
> > > > > Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
> > > > 
> > > > Oh wow, so to clarify, it boots OK?
> > > > 
> > > 
> > > Now that we know that tree works, I'm curious also now if you can
> > > confirm just re-ordering the patches still works (it should)
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust
> > > 
> > 
> > So far no vmap errors booting the debug/kasan kernel (2 tries).

<-- snip -->

> > I think we primarily only care about systemd-udev-settle.service.
> > 
> > That is fastest without the rcu patch (~6s), compared to with the rcu
> > patch (~6.5s) and with stock (~7.5s -- 8s).
> > 
> > Looks like dracut-initqueue also might be a bit faster with your changes, but
> > maybe it's mostly noise (would have to do more runs).
> > 
> > So maybe drop that rcu patch? But of course, there could be other scenarios where it's
> > helpful ...

Yes I confirm the RCU patch does not help at all now also using
stress-ng.

> Are there any other things you would like me to measure/test? I'll have to
> hand back that test machine soonish.

Yes please test the below. Perhaps its not the final form we want, but
it *does* fix OOM'ing when thrashing with stress-ng now with the module
option and even with 100 threads brings down max memory consumption by
259 MiB. The reason is that we also vmalloc during each finit_read_file()
for each module as well way before we even do layout_and_allocate(), and
so obviously if we fix the module path but not this path this will eventually
catch up with us as. I'm not at all happy with the current approach given
ideally we'd bump the counter when the user is done with the file, but we
don't yet have any tracking of that for users, they just vfree the memory
itself. And so this is just trying to catch heavy immediate abuse on the
caller to fend off abuse of vmalloc uses in a lightway manner.

There's gotta be a better way to do this, but its just an idea I have so far.
If we *want* to keep tabs until the user is done, we have to just modify
most users of these APIs and intrudce our own free. I don't think we're
in a rush to fix this so maybe that's the better approach.

And so I've managed to reproduce the issues you found now with my new stress-ng
module stressor as well.

https://github.com/ColinIanKing/stress-ng.git

Even though you have 400 CPUs with stress-ng we can likely reproduce it
with (use a module not loaded on your system):

./stress-ng --module 100 --module-name xfs

Without the patch below using 400 threads still OOMs easily due to the
kread issue. Max threads allowed are 8192.

From ec5099b15bb74f154319034547184e81e4ad8ba0 Mon Sep 17 00:00:00 2001
From: Luis Chamberlain <mcgrof@kernel.org>
Date: Fri, 24 Mar 2023 10:35:41 -0700
Subject: [PATCH] fs/kernel_read_file: add a clutch

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/kernel_read_file.c | 52 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 5d826274570c..2d55abe73b21 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -1,10 +1,52 @@
 // SPDX-License-Identifier: GPL-2.0-only
+
+#define pr_fmt(fmt) "kread: " fmt
+
 #include <linux/fs.h>
 #include <linux/fs_struct.h>
 #include <linux/kernel_read_file.h>
 #include <linux/security.h>
 #include <linux/vmalloc.h>
 
+/*
+ * This clutch ensures we only allow a certain number concurrent threads at a
+ * time allocating space concurrently and they must all finish within the
+ * timeout specified.  Anything more we know we're thrashing.
+ */
+#define MAX_KREAD_CONCURRENT 20
+static atomic_t kread_concurrent_max = ATOMIC_INIT(MAX_KREAD_CONCURRENT);
+static DECLARE_WAIT_QUEUE_HEAD(kread_wq);
+
+/*
+ * How many seconds to wait for *all*  MAX_KREAD_CONCURRENT threads running
+ * at the same time without returning.
+ */
+#define MAX_KREAD_ALL_BUSY_TIMEOUT 1
+
+static int kernel_read_check_concurrent(void)
+{
+	int ret;
+
+	if (atomic_dec_if_positive(&kread_concurrent_max) < 0) {
+		pr_warn_ratelimited("kread_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
+				    atomic_read(&kread_concurrent_max),
+				    MAX_KREAD_CONCURRENT);
+		ret = wait_event_killable_timeout(kread_wq,
+						  atomic_dec_if_positive(&kread_concurrent_max) >= 0,
+						  MAX_KREAD_ALL_BUSY_TIMEOUT * HZ);
+		if (!ret) {
+			pr_warn_ratelimited("reading cannot be processed, kernel busy with %d threads reading files now for more than %d seconds",
+					    MAX_KREAD_CONCURRENT, MAX_KREAD_ALL_BUSY_TIMEOUT);
+			return -ETIME;
+		} else if (ret == -ERESTARTSYS) {
+			pr_warn_ratelimited("sigkill sent for kernel read, giving up");
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * kernel_read_file() - read file contents into a kernel buffer
  *
@@ -68,10 +110,14 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 		goto out;
 	}
 
+	ret = kernel_read_check_concurrent();
+	if (ret)
+		goto out;
+
 	whole_file = (offset == 0 && i_size <= buf_size);
 	ret = security_kernel_read_file(file, id, whole_file);
 	if (ret)
-		goto out;
+		goto out_allow_new_read;
 
 	if (file_size)
 		*file_size = i_size;
@@ -117,7 +163,9 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 			*buf = NULL;
 		}
 	}
-
+out_allow_new_read:
+	atomic_inc(&kread_concurrent_max);
+	wake_up(&kread_wq);
 out:
 	allow_write_access(file);
 	return ret == 0 ? copied : ret;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-24 17:54                             ` Luis Chamberlain
@ 2023-03-24 19:11                               ` Linus Torvalds
  2023-03-24 19:59                                 ` Luis Chamberlain
  2023-03-28  3:44                               ` David Hildenbrand
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2023-03-24 19:11 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: David Hildenbrand, Kees Cook, linux-modules, linux-kernel,
	pmladek, petr.pavlu, prarit, christophe.leroy, song, dave,
	fan.ni, vincent.fu, a.manzanares, colin.i.king

On Fri, Mar 24, 2023 at 10:54 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> +/*
> + * This clutch ensures we only allow a certain number concurrent threads at a

kludge, not clutch.

And it's much worse than a kludge. It's just wrong and disgusting.

> +               pr_warn_ratelimited("kread_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
> +                                   atomic_read(&kread_concurrent_max),
> +                                   MAX_KREAD_CONCURRENT);

This is also wrong, since it's not kernel_read_file() that is the
problem, but whatever broken caller.

Yeah, yeah, in practice it's presumably always just finit_module()
doing kernel_read_file_from_fd(), but it's still *completely* wrong to
just say "function X is throttling" when "X" isn't the problem, and
doesn't tell what the _real_ problem is.

I really think this all needs some core fixing at the module layer,
not these kinds of horrific hacks.

          Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-24 19:11                               ` Linus Torvalds
@ 2023-03-24 19:59                                 ` Luis Chamberlain
  2023-03-24 20:28                                   ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-24 19:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Kees Cook, linux-modules, linux-kernel,
	pmladek, petr.pavlu, prarit, christophe.leroy, song, dave,
	fan.ni, vincent.fu, a.manzanares, colin.i.king

On Fri, Mar 24, 2023 at 12:11:07PM -0700, Linus Torvalds wrote:
> On Fri, Mar 24, 2023 at 10:54 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > +/*
> > + * This clutch ensures we only allow a certain number concurrent threads at a
> 
> kludge, not clutch.
> 
> And it's much worse than a kludge. It's just wrong and disgusting.

I wasn't happy with it either...

> > +               pr_warn_ratelimited("kread_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
> > +                                   atomic_read(&kread_concurrent_max),
> > +                                   MAX_KREAD_CONCURRENT);
> 
> This is also wrong, since it's not kernel_read_file() that is the
> problem, but whatever broken caller.
> 
> Yeah, yeah, in practice it's presumably always just finit_module()
> doing kernel_read_file_from_fd(), but it's still *completely* wrong to
> just say "function X is throttling" when "X" isn't the problem, and
> doesn't tell what the _real_ problem is.

True.

> I really think this all needs some core fixing at the module layer,
> not these kinds of horrific hacks.

On the modules side of things we can be super defensive on the second
vmalloc allocation defensive [0] but other than this the initial kread
also needs care too.

To address the kread abuse within finit_module we could just move the
kludge to the modules side of things until each free happens as in the
below alternative. That just means any easy user interfacing call with
kernel_read*() would likely have to be as careful. Untested below.

[0] https://lkml.kernel.org/r/20230319214926.1794108-4-mcgrof@kernel.org

  Luis

From 3c3f7e597ab35b4482ccb4064bb897eefa449071 Mon Sep 17 00:00:00 2001
From: Luis Chamberlain <mcgrof@kernel.org>
Date: Fri, 24 Mar 2023 12:51:44 -0700
Subject: [PATCH] module: kludge

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 45 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 145e15f19576..a96de989532a 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -62,6 +62,16 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/module.h>
 
+#define MAX_INITMOD_CONCURRENT 50
+static atomic_t initmod_concurrent_max = ATOMIC_INIT(MAX_INITMOD_CONCURRENT);
+static DECLARE_WAIT_QUEUE_HEAD(initmod_wq);
+
+/*
+ * How much time to wait for *all*  MAX_INITMOD_CONCURRENT threads running
+ * at the same time without returning.
+ */
+#define MAX_INITMOD_ALL_BUSY_TIMEOUT 5
+
 /*
  * Mutex protects:
  * 1) List of modules (also safely readable with preempt_disable),
@@ -3015,6 +3025,30 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
 	return load_module(&info, uargs, 0);
 }
 
+static int module_kread_concurrent(void)
+{
+	int err;
+
+	if (atomic_dec_if_positive(&initmod_concurrent_max) < 0) {
+		pr_warn_ratelimited("finit_module: initkmod_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
+				    atomic_read(&initmod_concurrent_max),
+				    MAX_INITMOD_CONCURRENT);
+		err = wait_event_killable_timeout(initmod_wq,
+						  atomic_dec_if_positive(&initmod_concurrent_max) >= 0,
+						  MAX_INITMOD_ALL_BUSY_TIMEOUT * HZ);
+		if (!err) {
+			pr_warn_ratelimited("finit_module: loading module cannot be processed, kernel busy with %d threads loading modules now for more than %d seconds",
+					    MAX_INITMOD_CONCURRENT, MAX_INITMOD_ALL_BUSY_TIMEOUT);
+			return -ETIME;
+		} else if (err == -ERESTARTSYS) {
+			pr_warn_ratelimited("finit_module: sigkill sent for load_module giving up");
+			return err;
+		}
+	}
+
+	return 0;
+}
+
 SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 {
 	struct load_info info = { };
@@ -3033,6 +3067,10 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 		      |MODULE_INIT_COMPRESSED_FILE))
 		return -EINVAL;
 
+	err = module_kread_concurrent();
+	if (err)
+		return err;
+
 	len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
 				       READING_MODULE);
 	if (len < 0)
@@ -3048,7 +3086,12 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 		info.len = len;
 	}
 
-	return load_module(&info, uargs, flags);
+	err = load_module(&info, uargs, flags);
+
+	atomic_inc(&initmod_concurrent_max);
+	wake_up(&initmod_wq);
+
+	return err;
 }
 
 /* Keep in sync with MODULE_FLAGS_BUF_SIZE !!! */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-24 19:59                                 ` Luis Chamberlain
@ 2023-03-24 20:28                                   ` Linus Torvalds
  2023-03-24 21:14                                     ` Luis Chamberlain
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2023-03-24 20:28 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: David Hildenbrand, Kees Cook, linux-modules, linux-kernel,
	pmladek, petr.pavlu, prarit, christophe.leroy, song, dave,
	fan.ni, vincent.fu, a.manzanares, colin.i.king

On Fri, Mar 24, 2023 at 1:00 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On the modules side of things we can be super defensive on the second
> vmalloc allocation defensive [0] but other than this the initial kread
> also needs care too.

Please don't re-implement semaphores. They are a *classic* concurrency limiter.

In fact, probably *THE* classic one.

So just do something like this instead:

   --- a/kernel/module/main.c
   +++ b/kernel/module/main.c
   @@ -2937,6 +2937,11 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
        return load_module(&info, uargs, 0);
    }

   +#define CONCURRENCY_LIMITER(name, n) \
   +    struct semaphore name = __SEMAPHORE_INITIALIZER(name, n)
   +
   +static CONCURRENCY_LIMITER(module_loading_concurrency, 50);
   +
    SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs,
int, flags)
    {
        struct load_info info = { };
   @@ -2955,8 +2960,12 @@ SYSCALL_DEFINE3(finit_module, int, fd, const
char __user *, uargs, int, flags)
                      |MODULE_INIT_COMPRESSED_FILE))
                return -EINVAL;

   +    err = down_killable(&module_loading_concurrency);
   +    if (err)
   +            return err;
        len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
                                       READING_MODULE);
   +    up(&module_loading_concurrency);
        if (len < 0)
                return len;

NOTE! Entirely untested. Surprise surprise.

I'm a tiny bit worried about deadlocks here, so somebody needs to make
sure that the kernel_read_file_from_fd() case cannot possibly in turn
cause a "request_module()" recursion.

We most definitely have had module recursion before, but I *think*
it's always just in the module init function (ie one module requests
another when ithe mod->init() function is called).

I think by the time we have opened the module file descriptors and are
just reading the data, we should be ok and the above would never
deadlock, but I want people to at least think about it.

Of course, with that "max 50 concurrent loaders" limit, maybe it's
never an issue anyway. Even if you get a recursion a few deep, at most
you just wait for somebody else to get out of their loaders. Unless
*everybody* ends up waiting on some progress.

              Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-24 20:28                                   ` Linus Torvalds
@ 2023-03-24 21:14                                     ` Luis Chamberlain
  2023-03-24 23:27                                       ` Luis Chamberlain
  0 siblings, 1 reply; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-24 21:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Kees Cook, linux-modules, linux-kernel,
	pmladek, petr.pavlu, prarit, christophe.leroy, song, dave,
	fan.ni, vincent.fu, a.manzanares, colin.i.king

On Fri, Mar 24, 2023 at 01:28:51PM -0700, Linus Torvalds wrote:
> On Fri, Mar 24, 2023 at 1:00 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On the modules side of things we can be super defensive on the second
> > vmalloc allocation defensive [0] but other than this the initial kread
> > also needs care too.
> 
> Please don't re-implement semaphores. They are a *classic* concurrency limiter.
> 
> In fact, probably *THE* classic one.
> 
> So just do something like this instead:
> 
>    --- a/kernel/module/main.c
>    +++ b/kernel/module/main.c
>    @@ -2937,6 +2937,11 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
>         return load_module(&info, uargs, 0);
>     }
> 
>    +#define CONCURRENCY_LIMITER(name, n) \
>    +    struct semaphore name = __SEMAPHORE_INITIALIZER(name, n)
>    +
>    +static CONCURRENCY_LIMITER(module_loading_concurrency, 50);
>    +
>     SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs,
> int, flags)
>     {
>         struct load_info info = { };
>    @@ -2955,8 +2960,12 @@ SYSCALL_DEFINE3(finit_module, int, fd, const
> char __user *, uargs, int, flags)
>                       |MODULE_INIT_COMPRESSED_FILE))
>                 return -EINVAL;
> 
>    +    err = down_killable(&module_loading_concurrency);
>    +    if (err)
>    +            return err;
>         len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
>                                        READING_MODULE);
>    +    up(&module_loading_concurrency);
>         if (len < 0)
>                 return len;
> 
> NOTE! Entirely untested. Surprise surprise.

I'll give it a good wack thanks.

But it still begs the question if *other* vmalloc user-interfacing
places need similar practices. It's not just system calls that use it
willy nilly but anything that could in the end use it. Surely a lot of
"issues" could only happen in an insane pathological use case, but it's
a generic thing to keep in mind in the end.

> I'm a tiny bit worried about deadlocks here, so somebody needs to make
> sure that the kernel_read_file_from_fd() case cannot possibly in turn
> cause a "request_module()" recursion.

Automount on a path where the module lies in a path for a modue not
loaded yet triggering a kernel module autoload is the only thing
I can think of that could cause that, but that just calls userspace
modprobe. And I think that'd be an insane setup.

> We most definitely have had module recursion before, but I *think*
> it's always just in the module init function (ie one module requests
> another when ithe mod->init() function is called).

Since you up() right after the kernel_read_file_from_fd() we would not
be racing module inits with this. If however we place the up() after
the load_module() then that does incur that recurssion possibilty.

> I think by the time we have opened the module file descriptors and are
> just reading the data, we should be ok and the above would never
> deadlock, but I want people to at least think about it.
> 
> Of course, with that "max 50 concurrent loaders" limit, maybe it's
> never an issue anyway. Even if you get a recursion a few deep, at most
> you just wait for somebody else to get out of their loaders. Unless
> *everybody* ends up waiting on some progress.

Yeah, these certainly are pathalogical corner cases. I'm more interested
in solving doing something sane for 1000 CPUs or 2000 CPUs for now, even
if the issue was the kernel (not userspace) to blame (and even if those
use cases are being fixed now like the queued up linux-next ACPI
CPU frequency modules per CPU).

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-24 21:14                                     ` Luis Chamberlain
@ 2023-03-24 23:27                                       ` Luis Chamberlain
  2023-03-24 23:41                                         ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-24 23:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Kees Cook, linux-modules, linux-kernel,
	pmladek, petr.pavlu, prarit, christophe.leroy, song, dave,
	fan.ni, vincent.fu, a.manzanares, colin.i.king

On Fri, Mar 24, 2023 at 02:14:11PM -0700, Luis Chamberlain wrote:
> On Fri, Mar 24, 2023 at 01:28:51PM -0700, Linus Torvalds wrote:
> >    +static CONCURRENCY_LIMITER(module_loading_concurrency, 50);
> I'll give it a good wack thanks.

Using a value of 20 above matches what I to get up to 100 stress-ng
module ops without any OOM (which seems to mimick a 400 CPU environment):

echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs

I'll go with that and evaluate max memory usage as well, but this is
still inviting us to consider other users interfacing areas with an
implicated vmalloc and the scale of this as we grow number of CPU
counts.

A magic values of 20 is completely empirical.

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-24 23:27                                       ` Luis Chamberlain
@ 2023-03-24 23:41                                         ` Linus Torvalds
  0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2023-03-24 23:41 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: David Hildenbrand, Kees Cook, linux-modules, linux-kernel,
	pmladek, petr.pavlu, prarit, christophe.leroy, song, dave,
	fan.ni, vincent.fu, a.manzanares, colin.i.king

On Fri, Mar 24, 2023 at 4:27 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> A magic values of 20 is completely empirical.

I suspect a magic value is fine in this context. I do not believe that
"hundreds of concurrent module loads" is a valid real-life load. It
sounds more like a "systemd does horrible things at startup".

20 sounds plenty big enough to me.

But I guess this is one of those "only real life will tell".

              Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-24 17:54                             ` Luis Chamberlain
  2023-03-24 19:11                               ` Linus Torvalds
@ 2023-03-28  3:44                               ` David Hildenbrand
  2023-03-28  6:16                                 ` Luis Chamberlain
  1 sibling, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2023-03-28  3:44 UTC (permalink / raw)
  To: Luis Chamberlain, Kees Cook
  Cc: linux-modules, linux-kernel, pmladek, petr.pavlu, prarit,
	christophe.leroy, song, torvalds, dave, fan.ni, vincent.fu,
	a.manzanares, colin.i.king

On 24.03.23 18:54, Luis Chamberlain wrote:
> On Fri, Mar 24, 2023 at 10:27:14AM +0100, David Hildenbrand wrote:
>> On 21.03.23 20:32, David Hildenbrand wrote:
>>> On 20.03.23 22:27, Luis Chamberlain wrote:
>>>> On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
>>>>> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>>>>>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
>>>>>
>>>>> Oh wow, so to clarify, it boots OK?
>>>>>
>>>>
>>>> Now that we know that tree works, I'm curious also now if you can
>>>> confirm just re-ordering the patches still works (it should)
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust
>>>>
>>>
>>> So far no vmap errors booting the debug/kasan kernel (2 tries).
> 
> <-- snip -->
> 
>>> I think we primarily only care about systemd-udev-settle.service.
>>>
>>> That is fastest without the rcu patch (~6s), compared to with the rcu
>>> patch (~6.5s) and with stock (~7.5s -- 8s).
>>>
>>> Looks like dracut-initqueue also might be a bit faster with your changes, but
>>> maybe it's mostly noise (would have to do more runs).
>>>
>>> So maybe drop that rcu patch? But of course, there could be other scenarios where it's
>>> helpful ...
> 
> Yes I confirm the RCU patch does not help at all now also using
> stress-ng.
> 
>> Are there any other things you would like me to measure/test? I'll have to
>> hand back that test machine soonish.
> 
> Yes please test the below. Perhaps its not the final form we want, but
> it *does* fix OOM'ing when thrashing with stress-ng now with the module
> option and even with 100 threads brings down max memory consumption by
> 259 MiB. The reason is that we also vmalloc during each finit_read_file()
> for each module as well way before we even do layout_and_allocate(), and
> so obviously if we fix the module path but not this path this will eventually
> catch up with us as. I'm not at all happy with the current approach given
> ideally we'd bump the counter when the user is done with the file, but we
> don't yet have any tracking of that for users, they just vfree the memory
> itself. And so this is just trying to catch heavy immediate abuse on the
> caller to fend off abuse of vmalloc uses in a lightway manner.

Understood. (I'm planning on review one I have time some spare cycles)

> 
> There's gotta be a better way to do this, but its just an idea I have so far.
> If we *want* to keep tabs until the user is done, we have to just modify
> most users of these APIs and intrudce our own free. I don't think we're
> in a rush to fix this so maybe that's the better approach.
> 
> And so I've managed to reproduce the issues you found now with my new stress-ng
> module stressor as well.

Nice!

> 
> https://github.com/ColinIanKing/stress-ng.git
> 
> Even though you have 400 CPUs with stress-ng we can likely reproduce it
> with (use a module not loaded on your system):
> 
> ./stress-ng --module 100 --module-name xfs

I'll give that a churn on that machine with the updated patch ...

> 
> Without the patch below using 400 threads still OOMs easily due to the
> kread issue. Max threads allowed are 8192.
> 

... do you have an updated patch/branch that includes the feedback from 
Linus so I can give it a churn tomorrow?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-28  3:44                               ` David Hildenbrand
@ 2023-03-28  6:16                                 ` Luis Chamberlain
  2023-03-28 21:02                                   ` David Hildenbrand
  0 siblings, 1 reply; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-28  6:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kees Cook, linux-modules, linux-kernel, pmladek, petr.pavlu,
	prarit, christophe.leroy, song, torvalds, dave, fan.ni,
	vincent.fu, a.manzanares, colin.i.king

On Tue, Mar 28, 2023 at 05:44:40AM +0200, David Hildenbrand wrote:
> ... do you have an updated patch/branch that includes the feedback from
> Linus so I can give it a churn tomorrow?

Yeah sure:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230327-module-alloc-opts

The commit log needs updateing to reflect the results I just collected:

With the alloc patch ("module: avoid allocation if module is already
present and ready") I see 145 MiB in memory difference in comparison
to its last patch, "module: extract patient module check into helper".
So I think that's a clear keeper and should help large CPU count boots.

The patch "module: add concurrency limiter" which puts the concurency
delimiter on the kread only saves about 2 MiB with 100 stress-ng ops,
which seems to be what I needed to reproduce your 400 CPU count original
issue.

The program used to reproduce is stress-ng with the new module option:

echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs

To see how much max memory I use, I just use:

free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log

Run the test in another window, CTRL-C the test when above
finishes after 40 seconds and then:

sort -n -r foo.log  | head -1

If you have xfs loaded already you probably wanna pick module just as big
that you don't have loaded. You must have dependencies loaded already as
it doesn't call modprobe, it just finit_module's the module.

The last patch "modules/kmod: replace implementation with a sempahore"
just takes Linus' simplification suggestion to replace kmod's solution.
I tested it with:

tools/testing/selftests/kmod/kmod.sh -t 0008

This stress tests the kmod autoloading. Using time, the timing is
similar, perhaps mildly slower, but I like the simplification. I'm still
not happy with the "module: add concurrency limiter", specially it
doesn't seem to really buy us much, 2 MiB seems like within noise. 
But stress-ng clearly shows it is a possible source of issue as we
grow the ops. So it may make sense to just use the idea to replace the
delimiter for kmod for now, unless you see different results.

The stress-ng module test can at least be used now to replicate insane setups
at bootup.

Let me know if you get other useful results.

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-28  6:16                                 ` Luis Chamberlain
@ 2023-03-28 21:02                                   ` David Hildenbrand
  2023-03-29  5:31                                     ` Luis Chamberlain
  0 siblings, 1 reply; 45+ messages in thread
From: David Hildenbrand @ 2023-03-28 21:02 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Kees Cook, linux-modules, linux-kernel, pmladek, petr.pavlu,
	prarit, christophe.leroy, song, torvalds, dave, fan.ni,
	vincent.fu, a.manzanares, colin.i.king

On 28.03.23 08:16, Luis Chamberlain wrote:
> On Tue, Mar 28, 2023 at 05:44:40AM +0200, David Hildenbrand wrote:
>> ... do you have an updated patch/branch that includes the feedback from
>> Linus so I can give it a churn tomorrow?
> 
> Yeah sure:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230327-module-alloc-opts
> 

I gave that one a go and get for system bootup:

#1:

13.761s tuned.service
12.261s chrony-wait.service
  7.386s NetworkManager-wait-online.service
  5.227s systemd-udev-settle.service
  2.893s initrd-switch-root.service
  2.148s polkit.service
  2.137s smartd.service
  1.893s dracut-initqueue.service
  1.290s NetworkManager.service
  1.032s cups.service


#2

13.881s tuned.service
  9.255s chrony-wait.service
  7.404s NetworkManager-wait-online.service
  5.826s systemd-udev-settle.service
  2.859s initrd-switch-root.service
  2.847s smartd.service
  2.172s polkit.service
  1.884s dracut-initqueue.service
  1.371s NetworkManager.service
  1.119s ModemManager.service


So we're a bit faster (0.2 -- 0.7s) than the original version without 
the rcu patch (~6s).


> The commit log needs updateing to reflect the results I just collected:
> 
> With the alloc patch ("module: avoid allocation if module is already
> present and ready") I see 145 MiB in memory difference in comparison
> to its last patch, "module: extract patient module check into helper".
> So I think that's a clear keeper and should help large CPU count boots.
> 
> The patch "module: add concurrency limiter" which puts the concurency
> delimiter on the kread only saves about 2 MiB with 100 stress-ng ops,
> which seems to be what I needed to reproduce your 400 CPU count original
> issue.
> 
> The program used to reproduce is stress-ng with the new module option:
> 
> echo 0 > /proc/sys/vm/oom_dump_tasks
> ./stress-ng --module 100 --module-name xfs

Above command fills for me with nfs (but also ext4) the kernel log with:

...
[  883.036035] nfs: Unknown symbol xdr_reserve_space (err -2)
[  883.042221] nfs: Unknown symbol rpc_init_wait_queue (err -2)
[  883.048549] nfs: Unknown symbol put_rpccred (err -2)
[  883.054104] nfs: Unknown symbol __fscache_invalidate (err -2)
[  883.060540] nfs: Unknown symbol __fscache_use_cookie (err -2)
[  883.066969] nfs: Unknown symbol rpc_clnt_xprt_switch_has_addr (err -2)
[  883.074264] nfs: Unknown symbol __fscache_begin_write_operation (err -2)
[  883.081743] nfs: Unknown symbol nlmclnt_init (err -2)
[  883.087396] nfs: Unknown symbol nlmclnt_done (err -2)
[  883.093074] nfs: Unknown symbol nfs_debug (err -2)
[  883.098429] nfs: Unknown symbol rpc_wait_for_completion_task (err -2)
[  883.105640] nfs: Unknown symbol __fscache_acquire_cookie (err -2)
[  883.163764] nfs: Unknown symbol rpc_put_task (err -2)
[  883.169461] nfs: Unknown symbol __fscache_acquire_volume (err -2)
[  883.176297] nfs: Unknown symbol rpc_proc_register (err -2)
[  883.182430] nfs: Unknown symbol rpc_shutdown_client (err -2)
[  883.188765] nfs: Unknown symbol rpc_clnt_show_stats (err -2)
[  883.195097] nfs: Unknown symbol __fscache_begin_read_operation (err -2)
...


I do *not* get these errors on manual morprobe/rmmod. BUG in concurrent 
handling or just side-effect of the concurrent loading?

> 
> To see how much max memory I use, I just use:
> 
> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log
> 
> Run the test in another window, CTRL-C the test when above
> finishes after 40 seconds and then:
> 
> sort -n -r foo.log  | head -1

[root@lenovo-sr950-01 fs]# sort -n -r foo.log  | head -1
14254024
[root@lenovo-sr950-01 fs]# sort -n -r foo.log  | tail -1
12862528

So 1391496 (KiB I assume, so 1.3 GiB !?) difference compared to before 
the test (I first start capturing and then run stress-ng).


> 
> If you have xfs loaded already you probably wanna pick module just as big
> that you don't have loaded. You must have dependencies loaded already as
> it doesn't call modprobe, it just finit_module's the module.


My setup already has xfs in use. nfs and ext4 are a bit smaller, but 
still big.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-28 21:02                                   ` David Hildenbrand
@ 2023-03-29  5:31                                     ` Luis Chamberlain
  2023-03-30  4:42                                       ` David Hildenbrand
  0 siblings, 1 reply; 45+ messages in thread
From: Luis Chamberlain @ 2023-03-29  5:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kees Cook, linux-modules, linux-kernel, pmladek, petr.pavlu,
	prarit, christophe.leroy, song, torvalds, dave, fan.ni,
	vincent.fu, a.manzanares, colin.i.king

On Tue, Mar 28, 2023 at 11:02:49PM +0200, David Hildenbrand wrote:
> 
> So we're a bit faster (0.2 -- 0.7s) than the original version without the
> rcu patch (~6s).

Groovy.

> > The commit log needs updateing to reflect the results I just collected:
> > 
> > With the alloc patch ("module: avoid allocation if module is already
> > present and ready") I see 145 MiB in memory difference in comparison
> > to its last patch, "module: extract patient module check into helper".
> > So I think that's a clear keeper and should help large CPU count boots.
> > 
> > The patch "module: add concurrency limiter" which puts the concurency
> > delimiter on the kread only saves about 2 MiB with 100 stress-ng ops,
> > which seems to be what I needed to reproduce your 400 CPU count original
> > issue.
> > 
> > The program used to reproduce is stress-ng with the new module option:
> > 
> > echo 0 > /proc/sys/vm/oom_dump_tasks
> > ./stress-ng --module 100 --module-name xfs
> 
> Above command fills for me with nfs (but also ext4) the kernel log with:
> 
> ...
> [  883.036035] nfs: Unknown symbol xdr_reserve_space (err -2)
> [  883.042221] nfs: Unknown symbol rpc_init_wait_queue (err -2)
> [  883.048549] nfs: Unknown symbol put_rpccred (err -2)
> [  883.054104] nfs: Unknown symbol __fscache_invalidate (err -2)
> [  883.060540] nfs: Unknown symbol __fscache_use_cookie (err -2)
> [  883.066969] nfs: Unknown symbol rpc_clnt_xprt_switch_has_addr (err -2)
> [  883.074264] nfs: Unknown symbol __fscache_begin_write_operation (err -2)
> [  883.081743] nfs: Unknown symbol nlmclnt_init (err -2)
> [  883.087396] nfs: Unknown symbol nlmclnt_done (err -2)
> [  883.093074] nfs: Unknown symbol nfs_debug (err -2)
> [  883.098429] nfs: Unknown symbol rpc_wait_for_completion_task (err -2)
> [  883.105640] nfs: Unknown symbol __fscache_acquire_cookie (err -2)
> [  883.163764] nfs: Unknown symbol rpc_put_task (err -2)
> [  883.169461] nfs: Unknown symbol __fscache_acquire_volume (err -2)
> [  883.176297] nfs: Unknown symbol rpc_proc_register (err -2)
> [  883.182430] nfs: Unknown symbol rpc_shutdown_client (err -2)
> [  883.188765] nfs: Unknown symbol rpc_clnt_show_stats (err -2)
> [  883.195097] nfs: Unknown symbol __fscache_begin_read_operation (err -2)
> ...
> 
> 
> I do *not* get these errors on manual morprobe/rmmod. BUG in concurrent
> handling or just side-effect of the concurrent loading?

It is just because modprobe deals with module dependencies, stress-ng
modprobe doesn't, it just calls finit_module() and expects dependencies
to be loaded first.

> > To see how much max memory I use, I just use:
> > 
> > free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log
> > 
> > Run the test in another window, CTRL-C the test when above
> > finishes after 40 seconds and then:
> > 
> > sort -n -r foo.log  | head -1
> 
> [root@lenovo-sr950-01 fs]# sort -n -r foo.log  | head -1
> 14254024
> [root@lenovo-sr950-01 fs]# sort -n -r foo.log  | tail -1
> 12862528
> 
> So 1391496 (KiB I assume, so 1.3 GiB !?)

That is sadly correct.

> difference compared to before the
> test (I first start capturing and then run stress-ng).

That's a good chunk :-D

On a recent thread I do a full analysis of average module sizes [0], you
can use that to also get an average size estimate of your currently
loaded modules. Do a bit of math and that could give you the average
memory pressure on real vmap allocations. For a synthentic test as with
stress-ng modules we'd be doing twice the memory for a success load, and
only one time allocation due to kread for a failed allocation, since we
always allocate memory even for bogus modules on the kread side.

My xfs.ko is 42M on my test guest system, with 100 concurrent threads
doing two allocations each that puts the expected vmap allocation at
around:

42*2*100
8400
8400/1024
8.20312500000000000000

8.2 GiB

So on the worst case that's what we'd see. Since I saw seeing the
patches helped overall around 145 MiB in difference let's see what
that means. Let me re-test:

Idle memory: (note free -k uses kibibytes, KiB.

root@kmod ~ # free -k -s 1 -c 3 | grep Mem | awk '{print $3}' | sort -n -r  | head -1
416452

So while idle, the system is using up 416452 KiB, so 406 MiB.

While running stress-ng with 100 module ops

free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > 2023-03-28-kmodsem-stress-ng-v1.txt
echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs
root@kmod ~ # sort -n -r 2023-03-28-kmodsem-stress-ng-v1.txt | head -1
4886492

4886492/1024
4771.96484375000000000000
4886492/1024/1024
4.66012191772460937500

So during the stress test we see memory highest usage was about
4771.96 MiB or ~4.66 GiB.

What's the difference between idle and after the stress test:

4886492 - 416452
4470040
4470040/1024
4365.27343750000000000000
4470040/1024/1024
4.26296234130859375000

So about 4365.27 MiB or 4.26 GiB.

Now, the upper limit for the test should have been a delta of 8.2 GiB
and we get about 4.26, so it means we're rejecting more than half of
the requests. Why half not just half? I only see 40 successful loads
of XFS during a 40 second window:

  dmesg -c > /dev/null
  # run tests
  dmesg -c | grep XFS | wc -l

At 100 ops all running finit_module for 40 seconds those successful
loads only should have consumed about 40 * 2 * size of XFS (42 MiB):
3360 MiB or 3.28 GiB. But at any point in time only one module could
be loaded at a time, so in the *worst* consumption point a XFS in this
test should only be consuming 2*size_of_XFS(42 MiB) so 84 MiB.

4365.27 - 84
4281.27

So about 4281.27 MiB (4.18 GiB) are consumed by the 98.07% of the
failed module loads due to the kread*() calls from finit_module().

And get this... *if* you use module compression that also uses vmap()
*after* the kernel_read*() call which uses vmalloc(). *At least* in
that case we immediately free the buffer for the compressed module,
but *still* -- that is 3 possible vmap space allocations for every
module!

It'd be hard, but not impossible to collect stats for failed
finit_modules(). *That* could be indicative of areas in the kernel
we need to brush up on to stop doing stupid things, like we learned
about for the CPU frequency scaling modules. As I wrote this paragraph
I realized -- that this *is*  what we really have wanted all along to
help us debug these stupid things, so we can slowly learn where to
point fingers at to help optimize things. Altough I recently did some
tool scraping to collect stats and *wished* for this from userspace [0],
it wasn't hard to just a debug option for this to help us debug these
failures. So I added support for that and sent finally a new patch
series.

[0] https://lkml.kernel.org/r/CAB=NE6UTC4VkNM57GGJ3XkG_PWLkMfXv2e2=yQJhtM6Fc-uMsQ@mail.gmail.com

> > If you have xfs loaded already you probably wanna pick module just as big
> > that you don't have loaded. You must have dependencies loaded already as
> > it doesn't call modprobe, it just finit_module's the module.
> 
> My setup already has xfs in use. nfs and ext4 are a bit smaller, but still
> big.

Sure.

  Luis

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
  2023-03-29  5:31                                     ` Luis Chamberlain
@ 2023-03-30  4:42                                       ` David Hildenbrand
  0 siblings, 0 replies; 45+ messages in thread
From: David Hildenbrand @ 2023-03-30  4:42 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Kees Cook, linux-modules, linux-kernel, pmladek, petr.pavlu,
	prarit, christophe.leroy, song, torvalds, dave, fan.ni,
	vincent.fu, a.manzanares, colin.i.king

On 29.03.23 07:31, Luis Chamberlain wrote:
> On Tue, Mar 28, 2023 at 11:02:49PM +0200, David Hildenbrand wrote:
>>
>> So we're a bit faster (0.2 -- 0.7s) than the original version without the
>> rcu patch (~6s).
> 
> Groovy.

Just to clarify, it was up to 0.7s faster (I phrased it in a confusing way).

> 
>>> The commit log needs updateing to reflect the results I just collected:
>>>
>>> With the alloc patch ("module: avoid allocation if module is already
>>> present and ready") I see 145 MiB in memory difference in comparison
>>> to its last patch, "module: extract patient module check into helper".
>>> So I think that's a clear keeper and should help large CPU count boots.
>>>
>>> The patch "module: add concurrency limiter" which puts the concurency
>>> delimiter on the kread only saves about 2 MiB with 100 stress-ng ops,
>>> which seems to be what I needed to reproduce your 400 CPU count original
>>> issue.
>>>
>>> The program used to reproduce is stress-ng with the new module option:
>>>
>>> echo 0 > /proc/sys/vm/oom_dump_tasks
>>> ./stress-ng --module 100 --module-name xfs
>>
>> Above command fills for me with nfs (but also ext4) the kernel log with:
>>
>> ...
>> [  883.036035] nfs: Unknown symbol xdr_reserve_space (err -2)
>> [  883.042221] nfs: Unknown symbol rpc_init_wait_queue (err -2)
>> [  883.048549] nfs: Unknown symbol put_rpccred (err -2)
>> [  883.054104] nfs: Unknown symbol __fscache_invalidate (err -2)
>> [  883.060540] nfs: Unknown symbol __fscache_use_cookie (err -2)
>> [  883.066969] nfs: Unknown symbol rpc_clnt_xprt_switch_has_addr (err -2)
>> [  883.074264] nfs: Unknown symbol __fscache_begin_write_operation (err -2)
>> [  883.081743] nfs: Unknown symbol nlmclnt_init (err -2)
>> [  883.087396] nfs: Unknown symbol nlmclnt_done (err -2)
>> [  883.093074] nfs: Unknown symbol nfs_debug (err -2)
>> [  883.098429] nfs: Unknown symbol rpc_wait_for_completion_task (err -2)
>> [  883.105640] nfs: Unknown symbol __fscache_acquire_cookie (err -2)
>> [  883.163764] nfs: Unknown symbol rpc_put_task (err -2)
>> [  883.169461] nfs: Unknown symbol __fscache_acquire_volume (err -2)
>> [  883.176297] nfs: Unknown symbol rpc_proc_register (err -2)
>> [  883.182430] nfs: Unknown symbol rpc_shutdown_client (err -2)
>> [  883.188765] nfs: Unknown symbol rpc_clnt_show_stats (err -2)
>> [  883.195097] nfs: Unknown symbol __fscache_begin_read_operation (err -2)
>> ...
>>
>>
>> I do *not* get these errors on manual morprobe/rmmod. BUG in concurrent
>> handling or just side-effect of the concurrent loading?
> 
> It is just because modprobe deals with module dependencies, stress-ng
> modprobe doesn't, it just calls finit_module() and expects dependencies
> to be loaded first.

Oh, right. That makes sense.

> 
>>> To see how much max memory I use, I just use:
>>>
>>> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log
>>>
>>> Run the test in another window, CTRL-C the test when above
>>> finishes after 40 seconds and then:
>>>
>>> sort -n -r foo.log  | head -1
>>
>> [root@lenovo-sr950-01 fs]# sort -n -r foo.log  | head -1
>> 14254024
>> [root@lenovo-sr950-01 fs]# sort -n -r foo.log  | tail -1
>> 12862528
>>
>> So 1391496 (KiB I assume, so 1.3 GiB !?)
> 
> That is sadly correct.
> 
>> difference compared to before the
>> test (I first start capturing and then run stress-ng).
> 
> That's a good chunk :-D
> 
> On a recent thread I do a full analysis of average module sizes [0], you
> can use that to also get an average size estimate of your currently
> loaded modules. Do a bit of math and that could give you the average
> memory pressure on real vmap allocations. For a synthentic test as with
> stress-ng modules we'd be doing twice the memory for a success load, and
> only one time allocation due to kread for a failed allocation, since we
> always allocate memory even for bogus modules on the kread side.
> 
> My xfs.ko is 42M on my test guest system, with 100 concurrent threads
> doing two allocations each that puts the expected vmap allocation at
> around:
> 
> 42*2*100
> 8400
> 8400/1024
> 8.20312500000000000000
> 
> 8.2 GiB
> 
> So on the worst case that's what we'd see. Since I saw seeing the
> patches helped overall around 145 MiB in difference let's see what
> that means. Let me re-test:
> 
> Idle memory: (note free -k uses kibibytes, KiB.
> 
> root@kmod ~ # free -k -s 1 -c 3 | grep Mem | awk '{print $3}' | sort -n -r  | head -1
> 416452
> 
> So while idle, the system is using up 416452 KiB, so 406 MiB.
> 
> While running stress-ng with 100 module ops
> 
> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > 2023-03-28-kmodsem-stress-ng-v1.txt
> echo 0 > /proc/sys/vm/oom_dump_tasks
> ./stress-ng --module 100 --module-name xfs
> root@kmod ~ # sort -n -r 2023-03-28-kmodsem-stress-ng-v1.txt | head -1
> 4886492
> 
> 4886492/1024
> 4771.96484375000000000000
> 4886492/1024/1024
> 4.66012191772460937500
> 
> So during the stress test we see memory highest usage was about
> 4771.96 MiB or ~4.66 GiB.
> 
> What's the difference between idle and after the stress test:
> 
> 4886492 - 416452
> 4470040
> 4470040/1024
> 4365.27343750000000000000
> 4470040/1024/1024
> 4.26296234130859375000
> 
> So about 4365.27 MiB or 4.26 GiB.
> 
> Now, the upper limit for the test should have been a delta of 8.2 GiB
> and we get about 4.26, so it means we're rejecting more than half of
> the requests. Why half not just half? I only see 40 successful loads
> of XFS during a 40 second window:
> 
>    dmesg -c > /dev/null
>    # run tests
>    dmesg -c | grep XFS | wc -l
> 
> At 100 ops all running finit_module for 40 seconds those successful
> loads only should have consumed about 40 * 2 * size of XFS (42 MiB):
> 3360 MiB or 3.28 GiB. But at any point in time only one module could
> be loaded at a time, so in the *worst* consumption point a XFS in this
> test should only be consuming 2*size_of_XFS(42 MiB) so 84 MiB.
> 
> 4365.27 - 84
> 4281.27
> 
> So about 4281.27 MiB (4.18 GiB) are consumed by the 98.07% of the
> failed module loads due to the kread*() calls from finit_module().
> 
> And get this... *if* you use module compression that also uses vmap()
> *after* the kernel_read*() call which uses vmalloc(). *At least* in
> that case we immediately free the buffer for the compressed module,
> but *still* -- that is 3 possible vmap space allocations for every
> module!

:/

> 
> It'd be hard, but not impossible to collect stats for failed
> finit_modules(). *That* could be indicative of areas in the kernel
> we need to brush up on to stop doing stupid things, like we learned
> about for the CPU frequency scaling modules. As I wrote this paragraph
> I realized -- that this *is*  what we really have wanted all along to
> help us debug these stupid things, so we can slowly learn where to
> point fingers at to help optimize things. Altough I recently did some
> tool scraping to collect stats and *wished* for this from userspace [0],
> it wasn't hard to just a debug option for this to help us debug these
> failures. So I added support for that and sent finally a new patch
> series.

Makes sense to me.


I didn't have time to look at the code yet (currently traveling), I hope 
that the kasan vmap issues are at least history.

I'll do another test with debug kernels using your latest version on 
that machine before I have to hand it back.

Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2023-03-30  4:44 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-11  5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
2023-03-11  5:17 ` [RFC 01/12] module: use goto errors on check_modinfo() and layout_and_allocate() Luis Chamberlain
2023-03-11  5:17 ` [RFC 02/12] module: move get_modinfo() helpers all above Luis Chamberlain
2023-03-11  5:17 ` [RFC 03/12] module: rename next_string() to module_next_tag_pair() Luis Chamberlain
2023-03-11  5:17 ` [RFC 04/12] module: add a for_each_modinfo_entry() Luis Chamberlain
2023-03-11  5:17 ` [RFC 05/12] module: add debugging alias parsing support Luis Chamberlain
2023-03-11  5:17 ` [RFC 06/12] module: move early sanity checks into a helper Luis Chamberlain
2023-03-11  5:17 ` [RFC 07/12] module: move check_modinfo() early to early_mod_check() Luis Chamberlain
2023-03-11  5:17 ` [RFC 08/12] module: move finished_loading() Luis Chamberlain
2023-03-11  5:17 ` [RFC 09/12] module: extract patient module check into helper Luis Chamberlain
2023-03-11  5:17 ` [RFC 10/12] module: avoid allocation if module is already present and ready Luis Chamberlain
2023-03-11  5:17 ` [RFC 11/12] module: use list_add_tail_rcu() when adding module Luis Chamberlain
2023-03-11  5:17 ` [RFC 12/12] module: use aliases to find module on find_module_all() Luis Chamberlain
2023-03-15 14:43   ` Petr Pavlu
2023-03-15 16:12     ` Luis Chamberlain
2023-03-15 12:24 ` [RFC 00/12] module: avoid userspace pressure on unwanted allocations David Hildenbrand
2023-03-15 16:10   ` Luis Chamberlain
2023-03-15 16:41     ` David Hildenbrand
2023-03-16 23:55       ` Luis Chamberlain
2023-03-16 23:56         ` Luis Chamberlain
2023-03-18  0:11           ` Luis Chamberlain
2023-03-20  9:38             ` David Hildenbrand
2023-03-20 19:40               ` David Hildenbrand
2023-03-20 21:09                 ` Luis Chamberlain
2023-03-20 21:15                   ` David Hildenbrand
2023-03-20 21:23                     ` Luis Chamberlain
2023-03-20 21:27                       ` Luis Chamberlain
2023-03-21 19:32                         ` David Hildenbrand
2023-03-24  9:27                           ` David Hildenbrand
2023-03-24 17:54                             ` Luis Chamberlain
2023-03-24 19:11                               ` Linus Torvalds
2023-03-24 19:59                                 ` Luis Chamberlain
2023-03-24 20:28                                   ` Linus Torvalds
2023-03-24 21:14                                     ` Luis Chamberlain
2023-03-24 23:27                                       ` Luis Chamberlain
2023-03-24 23:41                                         ` Linus Torvalds
2023-03-28  3:44                               ` David Hildenbrand
2023-03-28  6:16                                 ` Luis Chamberlain
2023-03-28 21:02                                   ` David Hildenbrand
2023-03-29  5:31                                     ` Luis Chamberlain
2023-03-30  4:42                                       ` David Hildenbrand
2023-03-21 15:11                       ` David Hildenbrand
2023-03-21 16:52                         ` Luis Chamberlain
2023-03-21 17:01                           ` David Hildenbrand
2023-03-20  9:37           ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).