patches.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] module: avoid all memory pressure due to duplicates
@ 2023-05-24 21:36 Luis Chamberlain
  2023-05-24 21:36 ` [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection Luis Chamberlain
  2023-05-24 21:36 ` [PATCH 2/2] module: add support to avoid duplicates early on load Luis Chamberlain
  0 siblings, 2 replies; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-24 21:36 UTC (permalink / raw)
  To: david, tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, petr.pavlu, prarit, torvalds, lennart
  Cc: gregkh, rafael, song, lucas.de.marchi, lucas.demarchi,
	christophe.leroy, peterz, rppt, dave, willy, vbabka, mhocko,
	dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe, yujie.liu, mcgrof

We now know that duplicate kernel module loading can incur considerable
vmalloc memory pressure on boot. Module loading uses vmalloc space 3 times
in the worst case. Detecting duplicates early before processing the module
name was the last mile we had to go and we recently were debating if userspace
should fix this or the kernel should fix this. We decided to punt the problem
to userspace.

At LSFMM 2023 in Vancouver this year I talked to Lennart Poettering about this
and described the issue with udev. Fixing this in userspace would be complex,
and it was preferable if we could resolve this in-kernel. And indeed it is
possible to resolve it relatively easily in-kernel with just the inode. So
let's do that and be done with the issue.

Patch #2 has the details on the savings on a 255 CPU system.

With this we get 0 bytes wasted incurred due to duplicates.

Luis Chamberlain (2):
  fs/kernel_read_file: add support for duplicate detection
  module: add support to avoid duplicates early on load

 fs/Kconfig                       |   3 +
 fs/kernel_read_file.c            | 124 +++++++++++++++++++++++++++++++
 include/linux/kernel_read_file.h |  14 ++++
 include/linux/module.h           |   1 +
 kernel/module/Kconfig            |  20 +++++
 kernel/module/internal.h         |   1 +
 kernel/module/main.c             |  19 +++--
 7 files changed, 175 insertions(+), 7 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection
  2023-05-24 21:36 [PATCH 0/2] module: avoid all memory pressure due to duplicates Luis Chamberlain
@ 2023-05-24 21:36 ` Luis Chamberlain
  2023-05-24 21:52   ` Linus Torvalds
  2023-05-24 21:36 ` [PATCH 2/2] module: add support to avoid duplicates early on load Luis Chamberlain
  1 sibling, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-24 21:36 UTC (permalink / raw)
  To: david, tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, petr.pavlu, prarit, torvalds, lennart
  Cc: gregkh, rafael, song, lucas.de.marchi, lucas.demarchi,
	christophe.leroy, peterz, rppt, dave, willy, vbabka, mhocko,
	dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe, yujie.liu, mcgrof

Add support for a new call which allows to detect duplicate requests
for each inode passed. This enables users to avoid having to incur
a whole vmalloc() for duplicates inodes with kernel_read_file().
Support to avoid duplicates is desirable in-kernel since changing
existing userspace or kernel users to account for these duplicates
would otherwise be difficult to implement, and the measured impact
of amount of wasted memory due to duplicates with vmalloc is observed
to be high.

This currently goes disabled because no user exists yet which wants
this enabled. Kernel code which needs this enabled should select the
new CONFIG_KREAD_UNIQ, otherwise the API falls back to the existing
kernel_read_file_from_fd().

If we later need to have some code enable CONFIG_KREAD_UNIQ but some
not we can have the feature be enabled per enum kernel_read_file_id id.
For now this should cover the main future use case with modules and
allow easily to disable / enable this feature with just one future
kconfig option.

Contrary to kernel_read_file_from_fd() users of thew new API will
use kread_uniq_fd(), keep track of the inode internally, and once
done use kread_uniq_fd_free() once the inode is no longer used.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/Kconfig                       |   3 +
 fs/kernel_read_file.c            | 124 +++++++++++++++++++++++++++++++
 include/linux/kernel_read_file.h |  14 ++++
 3 files changed, 141 insertions(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index cc07a0cd3172..0a78657b00d5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -18,6 +18,9 @@ config VALIDATE_FS_PARSER
 config FS_IOMAP
 	bool
 
+config KREAD_UNIQ
+	bool
+
 # old blockdev_direct_IO implementation.  Use iomap for new code instead
 config LEGACY_DIRECT_IO
 	bool
diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 5d826274570c..a066e2f239e8 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -1,9 +1,12 @@
 // SPDX-License-Identifier: GPL-2.0-only
+#define pr_fmt(fmt) "kread: " fmt
+
 #include <linux/fs.h>
 #include <linux/fs_struct.h>
 #include <linux/kernel_read_file.h>
 #include <linux/security.h>
 #include <linux/vmalloc.h>
+#include <linux/fdtable.h>
 
 /**
  * kernel_read_file() - read file contents into a kernel buffer
@@ -187,3 +190,124 @@ ssize_t kernel_read_file_from_fd(int fd, loff_t offset, void **buf,
 	return ret;
 }
 EXPORT_SYMBOL_GPL(kernel_read_file_from_fd);
+
+#ifdef CONFIG_KREAD_UNIQ
+static DEFINE_MUTEX(kread_dup_mutex);
+static LIST_HEAD(kread_dup_reqs);
+
+struct kread_dup_req {
+	struct list_head list;
+	unsigned long i_ino;
+};
+
+static struct kread_dup_req *kread_dup_request_lookup(unsigned long i_ino)
+{
+	struct kread_dup_req *kread_req;
+
+	list_for_each_entry_rcu(kread_req, &kread_dup_reqs, list,
+				lockdep_is_held(&kread_dup_mutex)) {
+		if (kread_req->i_ino == i_ino)
+			return kread_req;
+        }
+
+	return NULL;
+}
+
+static struct kread_dup_req *kread_dup_request_new(char *name, unsigned long i_ino)
+{
+	struct kread_dup_req *kread_req, *new_kread_req;
+
+	/*
+	 * Pre-allocate the entry in case we have to use it later
+	 * to avoid contention with the mutex.
+	 */
+	new_kread_req = kzalloc(sizeof(*new_kread_req), GFP_KERNEL);
+	if (!new_kread_req)
+		return false;
+
+	new_kread_req->i_ino = i_ino;
+
+	kread_req = kread_dup_request_lookup(i_ino);
+	if (!kread_req) {
+		/*
+		 * There was no duplicate, just add the request so we can
+		 * keep tab on duplicates later.
+		 */
+		pr_debug("accepted request for i_ino %lu for: %s\n", i_ino, name);
+		return new_kread_req;
+	}
+
+	/* We are dealing with a duplicate request now */
+
+	kfree(new_kread_req);
+
+	pr_debug("duplicate request on i_ino %lu for: %s\n", i_ino, name);
+
+	return NULL;
+}
+
+ssize_t kread_uniq_fd(int fd, loff_t offset, void **buf, unsigned long *i_ino,
+		      size_t buf_size, size_t *file_size, enum kernel_read_file_id id)
+{
+	struct fd f = fdget(fd);
+	ssize_t ret = -EBADF;
+	char *name, *path;
+	struct kread_dup_req *req;
+
+	if (!f.file || !(f.file->f_mode & FMODE_READ))
+		goto out;
+
+	path = kzalloc(PATH_MAX, GFP_KERNEL);
+	if (!path)
+		return -ENOMEM;
+
+	name = file_path(f.file, path, PATH_MAX);
+	if (IS_ERR(name)) {
+		ret = PTR_ERR(name);
+		goto out_mem;
+	}
+
+	*i_ino = file_inode(f.file)->i_ino;
+
+	mutex_lock(&kread_dup_mutex);
+	req = kread_dup_request_new(name, *i_ino);
+	if (!req) {
+		mutex_unlock(&kread_dup_mutex);
+		ret = -EBUSY;
+		goto out_mem;
+	}
+
+	ret = kernel_read_file(f.file, offset, buf, buf_size, file_size, id);
+	if (ret < 0)
+		kfree(req);
+	else
+		list_add_rcu(&req->list, &kread_dup_reqs);
+	mutex_unlock(&kread_dup_mutex);
+out_mem:
+	kfree(path);
+out:
+	fdput(f);
+	return ret;
+}
+
+void kread_uniq_fd_free(unsigned long i_ino)
+{
+	struct kread_dup_req *kread_req;
+
+	if (!i_ino)
+		return;
+
+	mutex_lock(&kread_dup_mutex);
+
+	kread_req = kread_dup_request_lookup(i_ino);
+	if (!kread_req)
+		goto out;
+
+	list_del_rcu(&kread_req->list);
+	synchronize_rcu();
+
+out:
+	mutex_unlock(&kread_dup_mutex);
+	kfree(kread_req);
+}
+#endif /* CONFIG_KREAD_UNIQ */
diff --git a/include/linux/kernel_read_file.h b/include/linux/kernel_read_file.h
index 90451e2e12bd..884985b0dc88 100644
--- a/include/linux/kernel_read_file.h
+++ b/include/linux/kernel_read_file.h
@@ -51,5 +51,19 @@ ssize_t kernel_read_file_from_fd(int fd, loff_t offset,
 				 void **buf, size_t buf_size,
 				 size_t *file_size,
 				 enum kernel_read_file_id id);
+#ifdef CONFIG_KREAD_UNIQ
+ssize_t kread_uniq_fd(int fd, loff_t offset, void **buf, unsigned long *i_ino,
+		      size_t buf_size, size_t *file_size, enum kernel_read_file_id id);
+void kread_uniq_fd_free(unsigned long i_ino);
+#else
+static inline ssize_t kread_uniq_fd(int fd, loff_t offset, void **buf, unsigned long *i_ino,
+				    size_t buf_size, size_t *file_size, enum kernel_read_file_id id)
+{
+	return kernel_read_file_from_fd(fd, offset, buf, buf_size, file_size, id);
+}
+static inline void kread_uniq_fd_free(unsigned long i_ino)
+{
+}
+#endif /* CONFIG_KREAD_UNIQ */
 
 #endif /* _LINUX_KERNEL_READ_FILE_H */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-24 21:36 [PATCH 0/2] module: avoid all memory pressure due to duplicates Luis Chamberlain
  2023-05-24 21:36 ` [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection Luis Chamberlain
@ 2023-05-24 21:36 ` Luis Chamberlain
  2023-05-25 11:40   ` Petr Pavlu
  1 sibling, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-24 21:36 UTC (permalink / raw)
  To: david, tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, petr.pavlu, prarit, torvalds, lennart
  Cc: gregkh, rafael, song, lucas.de.marchi, lucas.demarchi,
	christophe.leroy, peterz, rppt, dave, willy, vbabka, mhocko,
	dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe, yujie.liu, mcgrof

Add support to use the new kread_uniq_fd() to avoid duplicate kernel
reads on modules. At the cost of about ~945 bytes to your kernel size,
enabling this on a 255 CPU x86_64 qemu guest this saves about ~1.8 GiB
of memory during boot which would otherwise be free'd, and reduces boot
time by about ~11 seconds.

Userspace loads modules through finit_module(), this in turn will
use vmalloc space up to 3 times:

  a) The kernel_read_file() call
  b) Optional module decompression
  c) Our final copy of the module

Commit 064f4536d139 ("module: avoid allocation if module is already
present and ready") shows a graph of the amount of vmalloc space
observed allocated but freed for duplicate module request which end
up in the trash bin. Since there is a linear relationship with the
number of CPUs eventually this will bite us and you end up not being
able to boot. That commit put a stop gap for c) but to avoid the
vmalloc() space wasted on a) and b) we need to detect duplicates
earlier.

We could just have userspace fix this, but as reviewed at LSFMM 2023
this year in Vancouver, fixing this in userspace can be complex and we
also can't know when userpace is fixed. Fixing this in kernel turned
out to be easy with the inode and with a simple kconfig option we can
let users / distros decide if this full stop gap is worthy to enable.

With this enabled I'm now able to see 0 bytes wasted on vmalloc space
due to duplicates.

Before:
 # sudo systemd-analyze
 Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s
 graphical.target reached after 44.178s in userspace.

 # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats
 Virtual mem wasted bytes       1949006968

So ~1.8 GiB... of vmalloc space wasted during boot.

After:

 # sudo systemd-analyze
 Startup finished in 29.883s (kernel) + 45.817s (userspace) = 1min 15.700s
 graphical.target reached after 45.682s in userspace.

 # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats
 Virtual mem wasted bytes       0

Suggested-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/module.h   |  1 +
 kernel/module/Kconfig    | 20 ++++++++++++++++++++
 kernel/module/internal.h |  1 +
 kernel/module/main.c     | 19 ++++++++++++-------
 4 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index 9e56763dff81..afc44df96278 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -557,6 +557,7 @@ struct module {
 	unsigned int printk_index_size;
 	struct pi_entry **printk_index_start;
 #endif
+	unsigned long i_ino;
 
 #ifdef CONFIG_MODULE_UNLOAD
 	/* What modules depend on me? */
diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
index 33a2e991f608..85a6c7c5ddc0 100644
--- a/kernel/module/Kconfig
+++ b/kernel/module/Kconfig
@@ -157,6 +157,26 @@ config MODULE_UNLOAD_TAINT_TRACKING
 	  page (see bad_page()), the aforementioned details are also
 	  shown. If unsure, say N.
 
+config MODULE_KREAD_UNIQ
+	bool "Avoid duplicate module kernel reads"
+	select KREAD_UNIQ
+	help
+	  Enable this option to avoid vmalloc() space for duplicate module
+	  requests early before we can even check for the module name. This
+	  is useful to avoid duplicate module requests which userspace or kernel
+	  can issue. On some architectures such as x86_64 there is only 128 MiB
+	  of virtual memory space and since in the worst case we can end up
+	  allocating up to 3 times the module size in vmalloc space, avoiding
+	  duplicates can save virtual memory on boot.
+
+	  Enabling this will incrase your kernel by about 945 bytes, but can
+	  save considerable memory during bootup which would otherwise be freed
+	  and this in turn can help speed up kernel boot time.
+
+	  Disable this if you have enabled CONFIG_MODULE_STATS and have
+	  verified you see no duplicates or virtual memory being freed on
+	  bootup.
+
 config MODVERSIONS
 	bool "Module versioning support"
 	help
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index dc7b0160c480..7ea7f177f907 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -67,6 +67,7 @@ struct load_info {
 	unsigned int max_pages;
 	unsigned int used_pages;
 #endif
+	unsigned long i_ino;
 	struct {
 		unsigned int sym, str, mod, vers, info, pcpu;
 	} index;
diff --git a/kernel/module/main.c b/kernel/module/main.c
index ea7d0c7f3e60..e35900ee616a 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1283,6 +1283,7 @@ static void free_module(struct module *mod)
 	kfree(mod->args);
 	percpu_modfree(mod);
 
+	kread_uniq_fd_free(mod->i_ino);
 	free_mod_mem(mod);
 }
 
@@ -1964,12 +1965,14 @@ static int copy_module_from_user(const void __user *umod, unsigned long len,
 	return err;
 }
 
-static void free_copy(struct load_info *info, int flags)
+static void free_copy(struct load_info *info, int flags, bool error)
 {
 	if (flags & MODULE_INIT_COMPRESSED_FILE)
 		module_decompress_cleanup(info);
 	else
 		vfree(info->hdr);
+	if (error)
+		kread_uniq_fd_free(info->i_ino);
 }
 
 static int rewrite_section_headers(struct load_info *info, int flags)
@@ -2965,7 +2968,7 @@ static int load_module(struct load_info *info, const char __user *uargs,
 	}
 
 	/* Get rid of temporary copy. */
-	free_copy(info, flags);
+	free_copy(info, flags, false);
 
 	/* Done! */
 	trace_module_load(mod);
@@ -3023,7 +3026,7 @@ static int load_module(struct load_info *info, const char __user *uargs,
 	 */
 	if (!module_allocated)
 		mod_stat_bump_becoming(info, flags);
-	free_copy(info, flags);
+	free_copy(info, flags, true);
 	return err;
 }
 
@@ -3068,11 +3071,12 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 		      |MODULE_INIT_COMPRESSED_FILE))
 		return -EINVAL;
 
-	len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
-				       READING_MODULE);
+	len = kread_uniq_fd(fd, 0, &buf, &info.i_ino, INT_MAX, NULL, READING_MODULE);
 	if (len < 0) {
-		mod_stat_inc(&failed_kreads);
-		mod_stat_add_long(len, &invalid_kread_bytes);
+		if (len != -EBUSY) {
+			mod_stat_inc(&failed_kreads);
+			mod_stat_add_long(len, &invalid_kread_bytes);
+		}
 		return len;
 	}
 
@@ -3082,6 +3086,7 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 		if (err) {
 			mod_stat_inc(&failed_decompress);
 			mod_stat_add_long(len, &invalid_decompress_bytes);
+			kread_uniq_fd_free(info.i_ino);
 			return err;
 		}
 	} else {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection
  2023-05-24 21:36 ` [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection Luis Chamberlain
@ 2023-05-24 21:52   ` Linus Torvalds
  2023-05-24 21:56     ` Linus Torvalds
                       ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Linus Torvalds @ 2023-05-24 21:52 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: david, tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, petr.pavlu, prarit, lennart, gregkh, rafael, song,
	lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt,
	dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu

On Wed, May 24, 2023 at 2:36 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> Add support for a new call which allows to detect duplicate requests
> for each inode passed.

No.

This is all disgusting.

Stop adding horrific code for some made-up load that isn't real.

           Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection
  2023-05-24 21:52   ` Linus Torvalds
@ 2023-05-24 21:56     ` Linus Torvalds
  2023-05-24 22:07       ` Luis Chamberlain
  2023-05-25  4:00     ` Linus Torvalds
  2023-05-25  7:01     ` Christian Brauner
  2 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2023-05-24 21:56 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: david, tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, petr.pavlu, prarit, lennart, gregkh, rafael, song,
	lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt,
	dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu

On Wed, May 24, 2023 at 2:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Stop adding horrific code for some made-up load that isn't real.

Even if you trigger some "worst case 3x memory use", that is
_temporary_, and will be free'd in the end.

The patches to "fix" this are worse than the disease.

                  Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection
  2023-05-24 21:56     ` Linus Torvalds
@ 2023-05-24 22:07       ` Luis Chamberlain
  0 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-24 22:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: david, tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, petr.pavlu, prarit, lennart, gregkh, rafael, song,
	lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt,
	dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu

On Wed, May 24, 2023 at 02:56:47PM -0700, Linus Torvalds wrote:
> On Wed, May 24, 2023 at 2:52 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Stop adding horrific code for some made-up load that isn't real.
> 
> Even if you trigger some "worst case 3x memory use", that is
> _temporary_, and will be free'd in the end.
> 
> The patches to "fix" this are worse than the disease.

Fine with me, we can punt back and wait and hope udev fixes this.
No one can tell me I didn't try. Now let's hope userspace will try
an alternative.

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection
  2023-05-24 21:52   ` Linus Torvalds
  2023-05-24 21:56     ` Linus Torvalds
@ 2023-05-25  4:00     ` Linus Torvalds
  2023-05-25 18:08       ` Luis Chamberlain
  2023-05-25  7:01     ` Christian Brauner
  2 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2023-05-25  4:00 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: david, tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, petr.pavlu, prarit, lennart, gregkh, rafael, song,
	lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt,
	dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu

[-- Attachment #1: Type: text/plain, Size: 2033 bytes --]

On Wed, May 24, 2023 at 2:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This is all disgusting.

Bringing back the original thread, because I just sent an alternate
patch to Luis to test.

That one is also disgusting, but for different reasons: it needs some
polish if it works. It's a very simple patch, in that it just extends
our existing i_writecount and ETXTBSY logic to also have a "exclusive"
mode, and says that we do the module file reading in that exclusive
mode (so if/when udev in its incompetence tries to load the same
module X number of times at the same time, only one will read at a
time).

The disgusting part is mainly the hacky test for "id ==
READING_MODULE", and it would probably be better with some kind of
"exclusive flag" field for general use, but right now READING_MODULE
is basically that one user.

Luis having explained _why_ we'd want this (and honestly, it took a
couple of tries), I can only say that udev is horribly broken, and
this most definitely should be fixed in user mode. udev randomly
loading the same module multiple times just because it gets confused
is not ok.

Any udev developer that goes "we can't fix it in user space" should be
ashamed of themselves. Really? Just randomly doing the same thing in
parallel and expecting the kernel to sort out your mess? What a crock.

But this *might* mitigate that udev horror. And not introduce any new
kernel-side horror, just a slight extension of our existing writer
exclusion logic to allow "full exclusive access".

(Note: it's not actually excluding other purely regular readers - but
it *is* excluding not just writers, but also other "special readers"
that want to exclude writers)

I'd like to point out that this patch really is completely untested.
It built for me, but that's all the testing it has gotten. It's
_small_. Tiny, even. But that "id == READING_MODULE" thing really is
pretty disgusting and I feel this needs more thought.

                         Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1304 bytes --]

 fs/kernel_read_file.c | 6 +++++-
 include/linux/fs.h    | 6 ++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 5d826274570c..ff3e894f8cd4 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -48,7 +48,11 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 	if (!S_ISREG(file_inode(file)->i_mode))
 		return -EINVAL;
 
-	ret = deny_write_access(file);
+	/* Module reading wants *exclusive* access to the file */
+	if (id == READING_MODULE)
+		ret = exclusive_deny_write_access(file);
+	else
+		ret = deny_write_access(file);
 	if (ret)
 		return ret;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21a981680856..722b42a77d51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2566,6 +2566,12 @@ static inline int deny_write_access(struct file *file)
 	struct inode *inode = file_inode(file);
 	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
 }
+static inline int exclusive_deny_write_access(struct file *file)
+{
+	int old = 0;
+	struct inode *inode = file_inode(file);
+	return atomic_try_cmpxchg(&inode->i_writecount, &old, -1) ? 0 : -ETXTBSY;
+}
 static inline void put_write_access(struct inode * inode)
 {
 	atomic_dec(&inode->i_writecount);

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection
  2023-05-24 21:52   ` Linus Torvalds
  2023-05-24 21:56     ` Linus Torvalds
  2023-05-25  4:00     ` Linus Torvalds
@ 2023-05-25  7:01     ` Christian Brauner
  2 siblings, 0 replies; 53+ messages in thread
From: Christian Brauner @ 2023-05-25  7:01 UTC (permalink / raw)
  To: Linus Torvalds, Luis Chamberlain
  Cc: david, tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, petr.pavlu, prarit, lennart, gregkh, rafael, song,
	lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt,
	dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu,
	linux-fsdevel, Alexander Viro

On Wed, May 24, 2023 at 02:52:50PM -0700, Linus Torvalds wrote:
> On Wed, May 24, 2023 at 2:36 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > Add support for a new call which allows to detect duplicate requests
> > for each inode passed.
> 
> No.
> 
> This is all disgusting.
> 
> Stop adding horrific code for some made-up load that isn't real.

Also, this series adds non-trivial amount of code to fs/ without ever
having Cced fsdevel. As I told the bpf folks already if fs/ code is
touched fsdevel should absolutely be Cced, please.

It's literally also the first thing that get_maintainers.pl spews out.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-24 21:36 ` [PATCH 2/2] module: add support to avoid duplicates early on load Luis Chamberlain
@ 2023-05-25 11:40   ` Petr Pavlu
  2023-05-25 16:07     ` Linus Torvalds
  2023-05-25 16:54     ` Lucas De Marchi
  0 siblings, 2 replies; 53+ messages in thread
From: Petr Pavlu @ 2023-05-25 11:40 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: gregkh, rafael, song, lucas.de.marchi, lucas.demarchi,
	christophe.leroy, peterz, rppt, dave, willy, vbabka, mhocko,
	dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe, yujie.liu, david, tglx, hch, patches,
	linux-modules, linux-mm, linux-kernel, pmladek, prarit, torvalds,
	lennart

On 5/24/23 23:36, Luis Chamberlain wrote:
> Add support to use the new kread_uniq_fd() to avoid duplicate kernel
> reads on modules. At the cost of about ~945 bytes to your kernel size,
> enabling this on a 255 CPU x86_64 qemu guest this saves about ~1.8 GiB
> of memory during boot which would otherwise be free'd, and reduces boot
> time by about ~11 seconds.
> 
> Userspace loads modules through finit_module(), this in turn will
> use vmalloc space up to 3 times:
> 
>   a) The kernel_read_file() call
>   b) Optional module decompression
>   c) Our final copy of the module
> 
> Commit 064f4536d139 ("module: avoid allocation if module is already
> present and ready") shows a graph of the amount of vmalloc space
> observed allocated but freed for duplicate module request which end
> up in the trash bin. Since there is a linear relationship with the
> number of CPUs eventually this will bite us and you end up not being
> able to boot. That commit put a stop gap for c) but to avoid the
> vmalloc() space wasted on a) and b) we need to detect duplicates
> earlier.
> 
> We could just have userspace fix this, but as reviewed at LSFMM 2023
> this year in Vancouver, fixing this in userspace can be complex and we
> also can't know when userpace is fixed. Fixing this in kernel turned
> out to be easy with the inode and with a simple kconfig option we can
> let users / distros decide if this full stop gap is worthy to enable.

kmod normally uses finit_module() only if a module is not compressed,
otherwise it decompresses it first and then invokes init_module().

Looking at Fedora and openSUSE Tumbleweed, they compress kernel modules
with xz and zstd, respectively. They also have their kernels built
without any CONFIG_MODULE_COMPRESS_{GZIP,XZ,ZSTD} options.

It means that these and similarly organized distributions end up using
init_module(), and adding complexity to optimize finit_module() wouldn't
actually help in their case.

-- Petr

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-25 11:40   ` Petr Pavlu
@ 2023-05-25 16:07     ` Linus Torvalds
  2023-05-25 16:42       ` Greg KH
                         ` (2 more replies)
  2023-05-25 16:54     ` Lucas De Marchi
  1 sibling, 3 replies; 53+ messages in thread
From: Linus Torvalds @ 2023-05-25 16:07 UTC (permalink / raw)
  To: Petr Pavlu
  Cc: Luis Chamberlain, gregkh, rafael, song, lucas.de.marchi,
	lucas.demarchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Thu, May 25, 2023 at 4:40 AM Petr Pavlu <petr.pavlu@suse.com> wrote:
>
> kmod normally uses finit_module() only if a module is not compressed,
> otherwise it decompresses it first and then invokes init_module().

Note that it would probably be good to teach Fedora and SuSE to use
the kernel-side decompression, if only because we have it and would
like to try to avoid using the old "load contents from user memory".

Mainly because it allows security modules to actively check for
tampering (ie things like verity etc). Long-term, it would be good to
just deprecate the old init_module() entirely.

But yes:

> It means that these and similarly organized distributions end up using
> init_module(), and adding complexity to optimize finit_module() wouldn't
> actually help in their case.

Yeah, I think the real bug is absolutely in udev, and trying to load
the same module hundreds of times is very very wrong. So I think the
"mitigate it in the kernel" is at most a quick hack to fix user-space
brokenness.

And I don't think 1/2 is acceptable as that "quick hack". Not at all.
It also seems fundamentally buggy, as it uses purely the inode number
as the file identity, which means that it does bad things across
filesystem limits.

That said, I posted an alternate patch that I think _is_ valid as that
quick hack. I don't love it, but it sure is simpler (and avoids the
i_ino bug):

    https://lore.kernel.org/lkml/CAHk-=wgKu=tJf1bm_dtme4Hde4zTB=_7EdgR8avsDRK4_jD+uA@mail.gmail.com/

that patch hasn't seen any testing, and for all I know it won't even
boot because of some thinko, but I think it would be acceptable as a
workaround if it does work.

But no, it's not some kind of "fix" for the bug, and yes, using
init_module() rather than finit_module() will circumvent the quick
hack. The true fix would be for udev to do proper handling of its data
structures instead of randomly spraying duplicate module loading
events.

I don't know why udev does what it does. From what Luis told me,
apparently it's just forking stuff and keeping all its data structures
in memory, and has no actual consistency or locking or memory of what
it has done. Luis pointed me at

    https://lore.kernel.org/all/23bd0ce6-ef78-1cd8-1f21-0e706a00424a@suse.com/T/#u

for some udev background.

It's been about a decade since I looked at udev sources, and none of
this encourages me to take a second look, so all of the above may be
me misunderstanding just exactly what the udev problem is. But for
that 'finit' case, we *could* try that simple hack of mine.

I say "hack", but the patch really is pretty simple, and the concept
of "exclusive special access" certainly is not some hack in itself.
It's just not anything we've ever done before. So the hackishness from
that exclusive_deny_write_access() thing in my patch is mainly that it
shouldn't be needed at all (and that the exclusivity should probably
be set some other way).

Comments welcome.

                  Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-25 16:07     ` Linus Torvalds
@ 2023-05-25 16:42       ` Greg KH
  2023-05-25 18:22         ` Luis Chamberlain
  2023-05-25 17:52       ` Linus Torvalds
  2023-05-25 18:45       ` Lucas De Marchi
  2 siblings, 1 reply; 53+ messages in thread
From: Greg KH @ 2023-05-25 16:42 UTC (permalink / raw)
  To: Linus Torvalds, Petr Pavlu, Luis Chamberlain
  Cc: rafael, song, lucas.de.marchi, lucas.demarchi, christophe.leroy,
	peterz, rppt, dave, willy, vbabka, mhocko, dave.hansen,
	colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe, yujie.liu, david, tglx, hch, patches,
	linux-modules, linux-mm, linux-kernel, pmladek, prarit, lennart

On Thu, May 25, 2023 at 09:07:23AM -0700, Linus Torvalds wrote:
> > It means that these and similarly organized distributions end up using
> > init_module(), and adding complexity to optimize finit_module() wouldn't
> > actually help in their case.
> 
> Yeah, I think the real bug is absolutely in udev, and trying to load
> the same module hundreds of times is very very wrong. So I think the
> "mitigate it in the kernel" is at most a quick hack to fix user-space
> brokenness.

I totally agree.  I also agree that this doesn't really seem to be any
sort of "bug" in that no memory leaks, and when userspace calms down,
all goes back to normal.  So hacks in the vfs layer for this is not
good, let's not paper over userspace code that we have control over with
kernel changes.

Luis, I asked last time what modules are being asked by the kernel to be
loaded thousands of times at boot and can't seem to find an answer
anywhere, did I miss that?  This should be very easy to handle in
userspace if systems need it, so that begs the questions, what types of
systems need this?  We have handled booting with tens of thousands of
devices attached for decades now with no reports of boot/udev/kmod
issues before, what has recently changed to cause issues?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-25 11:40   ` Petr Pavlu
  2023-05-25 16:07     ` Linus Torvalds
@ 2023-05-25 16:54     ` Lucas De Marchi
  1 sibling, 0 replies; 53+ messages in thread
From: Lucas De Marchi @ 2023-05-25 16:54 UTC (permalink / raw)
  To: Petr Pavlu
  Cc: Luis Chamberlain, gregkh, rafael, song, lucas.de.marchi,
	christophe.leroy, peterz, rppt, dave, willy, vbabka, mhocko,
	dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe, yujie.liu, david, tglx, hch, patches,
	linux-modules, linux-mm, linux-kernel, pmladek, prarit, torvalds,
	lennart

On Thu, May 25, 2023 at 01:40:32PM +0200, Petr Pavlu wrote:
>On 5/24/23 23:36, Luis Chamberlain wrote:
>> Add support to use the new kread_uniq_fd() to avoid duplicate kernel
>> reads on modules. At the cost of about ~945 bytes to your kernel size,
>> enabling this on a 255 CPU x86_64 qemu guest this saves about ~1.8 GiB
>> of memory during boot which would otherwise be free'd, and reduces boot
>> time by about ~11 seconds.
>>
>> Userspace loads modules through finit_module(), this in turn will
>> use vmalloc space up to 3 times:
>>
>>   a) The kernel_read_file() call
>>   b) Optional module decompression
>>   c) Our final copy of the module
>>
>> Commit 064f4536d139 ("module: avoid allocation if module is already
>> present and ready") shows a graph of the amount of vmalloc space
>> observed allocated but freed for duplicate module request which end
>> up in the trash bin. Since there is a linear relationship with the
>> number of CPUs eventually this will bite us and you end up not being
>> able to boot. That commit put a stop gap for c) but to avoid the
>> vmalloc() space wasted on a) and b) we need to detect duplicates
>> earlier.
>>
>> We could just have userspace fix this, but as reviewed at LSFMM 2023
>> this year in Vancouver, fixing this in userspace can be complex and we
>> also can't know when userpace is fixed. Fixing this in kernel turned
>> out to be easy with the inode and with a simple kconfig option we can
>> let users / distros decide if this full stop gap is worthy to enable.
>
>kmod normally uses finit_module() only if a module is not compressed,
>otherwise it decompresses it first and then invokes init_module().

that is for historical reasons, because the kernel didn't support to
uncompress the module by itself.

>
>Looking at Fedora and openSUSE Tumbleweed, they compress kernel modules
>with xz and zstd, respectively. They also have their kernels built
>without any CONFIG_MODULE_COMPRESS_{GZIP,XZ,ZSTD} options.
>
>It means that these and similarly organized distributions end up using
>init_module(), and adding complexity to optimize finit_module() wouldn't
>actually help in their case.

true, but the change in kmod should be trivial now that the kernel has
the proper support in place and the algorithms support match the ones
kmod has.  I will take a look at switching the logic around to just pass
the fd to the kernel so it can also deduplicate the requests.

thanks for the reminder,

Lucas De Marchi

>
>-- Petr

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-25 16:07     ` Linus Torvalds
  2023-05-25 16:42       ` Greg KH
@ 2023-05-25 17:52       ` Linus Torvalds
  2023-05-25 18:45       ` Lucas De Marchi
  2 siblings, 0 replies; 53+ messages in thread
From: Linus Torvalds @ 2023-05-25 17:52 UTC (permalink / raw)
  To: Petr Pavlu
  Cc: Luis Chamberlain, gregkh, rafael, song, lucas.de.marchi,
	lucas.demarchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Thu, May 25, 2023 at 9:07 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> That said, I posted an alternate patch that I think _is_ valid as that
> quick hack. I don't love it, but it sure is simpler (and avoids the
> i_ino bug):
>
>     https://lore.kernel.org/lkml/CAHk-=wgKu=tJf1bm_dtme4Hde4zTB=_7EdgR8avsDRK4_jD+uA@mail.gmail.com/
>
> that patch hasn't seen any testing, and for all I know it won't even
> boot because of some thinko, but I think it would be acceptable as a
> workaround if it does work.

Well, it boots here, so it must be perfect.

That said, I didn't add any debugging code, and I didn't test it on
any odd setups, and I've never had any problems before. So I don't
actually know if the patch *does* anything.

But it did boot..

               Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection
  2023-05-25  4:00     ` Linus Torvalds
@ 2023-05-25 18:08       ` Luis Chamberlain
  2023-05-25 18:35         ` Luis Chamberlain
  2023-05-25 18:50         ` Linus Torvalds
  0 siblings, 2 replies; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-25 18:08 UTC (permalink / raw)
  To: Linus Torvalds, Linux FS Devel, hch, brauner, david
  Cc: tglx, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, lennart, gregkh, rafael, song,
	lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt,
	dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu

+ fsdevel please review,

On Wed, May 24, 2023 at 09:00:02PM -0700, Linus Torvalds wrote:
> On Wed, May 24, 2023 at 2:52 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > This is all disgusting.
> 
> Bringing back the original thread, because I just sent an alternate
> patch to Luis to test.
> 
> That one is also disgusting, but for different reasons: it needs some
> polish if it works. It's a very simple patch, in that it just extends
> our existing i_writecount and ETXTBSY logic to also have a "exclusive"
> mode, and says that we do the module file reading in that exclusive
> mode (so if/when udev in its incompetence tries to load the same
> module X number of times at the same time, only one will read at a
> time).

Indeed, this is the sort of gem I was hoping we could acomplish.

> The disgusting part is mainly the hacky test for "id ==
> READING_MODULE", and it would probably be better with some kind of
> "exclusive flag" field for general use, but right now READING_MODULE
> is basically that one user.
> 
> Luis having explained _why_ we'd want this (and honestly, it took a
> couple of tries), I can only say that udev is horribly broken, and
> this most definitely should be fixed in user mode. udev randomly
> loading the same module multiple times just because it gets confused
> is not ok.

At this point it would be good for for someone on the udev camp to at
least to *try*. If the problem is the fork on udev, maybe the shmem
could be used to share the module context to prevent duplicates.

> Any udev developer that goes "we can't fix it in user space" should be
> ashamed of themselves. Really? Just randomly doing the same thing in
> parallel and expecting the kernel to sort out your mess? What a crock.
> 
> But this *might* mitigate that udev horror. And not introduce any new
> kernel-side horror, just a slight extension of our existing writer
> exclusion logic to allow "full exclusive access".

Yes, that expresses what is needed well and is simple enough.

> (Note: it's not actually excluding other purely regular readers - but
> it *is* excluding not just writers, but also other "special readers"
> that want to exclude writers)
> 
> I'd like to point out that this patch really is completely untested.
> It built for me, but that's all the testing it has gotten. It's
> _small_. Tiny, even. But that "id == READING_MODULE" thing really is
> pretty disgusting and I feel this needs more thought.

>  fs/kernel_read_file.c | 6 +++++-
>  include/linux/fs.h    | 6 ++++++
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
> index 5d826274570c..ff3e894f8cd4 100644
> --- a/fs/kernel_read_file.c
> +++ b/fs/kernel_read_file.c
> @@ -48,7 +48,11 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
>  	if (!S_ISREG(file_inode(file)->i_mode))
>  		return -EINVAL;
>  
> -	ret = deny_write_access(file);
> +	/* Module reading wants *exclusive* access to the file */
> +	if (id == READING_MODULE)
> +		ret = exclusive_deny_write_access(file);
> +	else
> +		ret = deny_write_access(file);
>  	if (ret)
>  		return ret;
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 21a981680856..722b42a77d51 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2566,6 +2566,12 @@ static inline int deny_write_access(struct file *file)
>  	struct inode *inode = file_inode(file);
>  	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
>  }
> +static inline int exclusive_deny_write_access(struct file *file)
> +{
> +	int old = 0;
> +	struct inode *inode = file_inode(file);
> +	return atomic_try_cmpxchg(&inode->i_writecount, &old, -1) ? 0 : -ETXTBSY;
> +}
>  static inline void put_write_access(struct inode * inode)
>  {
>  	atomic_dec(&inode->i_writecount);

Certainly on the track where I wish we could go. Now this goes tested.
On 255 cores:

Before:                                                                          
                                                                                 
vagrant@kmod ~ $ sudo systemd-analyze                                            
Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s        
graphical.target reached after 44.178s in userspace.  

root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats    
 Virtual mem wasted bytes       1949006968 

                                                                                 
; 1949006968/1024/1024/1024                                                      
        ~1.81515418738126754761                                                  
                                                                                 
So ~1.8 GiB... of vmalloc space wasted during boot.

After:

systemd-analyze 
Startup finished in 24.438s (kernel) + 41.278s (userspace) = 1min 5.717s 
graphical.target reached after 41.154s in userspace.

root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats
 Virtual mem wasted bytes       354413398

So still 337.99 MiB of vmalloc space wasted during boot due to
duplicates. The reason is the exclusive_deny_write_access() must be
kept during the life of the module otherwise as soon as it is done
others can still race to load after and fail later after it sees the
module is already loaded. It sounds crazy to think that such races
exist because that means userspace didn't see the module loaded yet
and still tried finit_module() but the stats reveal that's the
case.

So with two other hunks added (2nd and 4th), this now matches parity with
my patch, not suggesting this is right, just demonstrating how this
could be resolved with this. We could also just have a helper which lets
the module code allow_write_access() at the end of its use of the fd
(failure to load or module is removed).

diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 5d826274570c..ff5b338a288b 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -48,7 +48,11 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 	if (!S_ISREG(file_inode(file)->i_mode))
 		return -EINVAL;
 
-	ret = deny_write_access(file);
+	/* Module reading wants *exclusive* access to the file */
+	if (id == READING_MODULE)
+		ret = exclusive_deny_write_access(file);
+	else
+		ret = deny_write_access(file);
 	if (ret)
 		return ret;
 
@@ -119,7 +123,8 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 	}
 
 out:
-	allow_write_access(file);
+	if (id != READING_MODULE)
+		allow_write_access(file);
 	return ret == 0 ? copied : ret;
 }
 EXPORT_SYMBOL_GPL(kernel_read_file);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21a981680856..722b42a77d51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2566,6 +2566,12 @@ static inline int deny_write_access(struct file *file)
 	struct inode *inode = file_inode(file);
 	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
 }
+static inline int exclusive_deny_write_access(struct file *file)
+{
+	int old = 0;
+	struct inode *inode = file_inode(file);
+	return atomic_try_cmpxchg(&inode->i_writecount, &old, -1) ? 0 : -ETXTBSY;
+}
 static inline void put_write_access(struct inode * inode)
 {
 	atomic_dec(&inode->i_writecount);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 044aa2c9e3cb..88aaada929b1 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -3078,8 +3079,10 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 	len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
 				       READING_MODULE);
 	if (len < 0) {
-		mod_stat_inc(&failed_kreads);
-		mod_stat_add_long(len, &invalid_kread_bytes);
+		if (len != -ETXTBSY) {
+			mod_stat_inc(&failed_kreads);
+			mod_stat_add_long(len, &invalid_kread_bytes);
+		}
 		return len;
 	}
 

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-25 16:42       ` Greg KH
@ 2023-05-25 18:22         ` Luis Chamberlain
  0 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-25 18:22 UTC (permalink / raw)
  To: Greg KH, Linux FS Devel
  Cc: Linus Torvalds, Petr Pavlu, rafael, song, lucas.de.marchi,
	lucas.demarchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Thu, May 25, 2023 at 05:42:10PM +0100, Greg KH wrote:
> Luis, I asked last time what modules are being asked by the kernel to be
> loaded thousands of times at boot and can't seem to find an answer
> anywhere, did I miss that?

Yes you missed it, I had explained it:

https://lore.kernel.org/all/ZEGopJ8VAYnE7LQ2@bombadil.infradead.org/

"My best assessment of the situation is that each CPU in udev ends up
triggering a load of duplicate set of modules, not just one, but *a
lot*. Not sure what heuristics udev uses to load a set of modules per
CPU."

Petr Pavlu then finishes the assessment:

https://lore.kernel.org/all/23bd0ce6-ef78-1cd8-1f21-0e706a00424a@suse.com/

But let me quote it, so it is not missed:

"My understanding is that udev workers are forked. An initial kmod
context is created by the main udevd process but no sharing happens
after the fork.  It means that the mentioned memory pool logic doesn't
really kick in.

Multiple parallel load requests come from multiple udev workers, for
instance, each handling an udev event for one CPU device and making the
exactly same requests as all others are doing at the same time.

The optimization idea would be to recognize these duplicate requests at
the udevd/kmod level and converge them."

> This should be very easy to handle in
> userspace if systems need it, so that begs the questions, what types of
> systems need this? 

I had explained, this has existed for a long time.

> We have handled booting with tens of thousands of
> devices attached for decades now with no reports of boot/udev/kmod
> issues before, what has recently changed to cause issues?

Doesn't mean this didn't happen before, just because memory is freed due
to duplicates does not mean that the memory pressure induced by them is
not stupid. It is stupid, but hasn't come up as a possible real issue
nowadays where systems require more vmalloc space used during boot with
new features. I had explained also the context where this came from.
David Hildenbrand had reported failure to boot on many CPUs.  If you
induce more vmap memory pressure on boot with multiple CPUs eventually
you can't boot. Enabling KASAN will make this worse today.

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection
  2023-05-25 18:08       ` Luis Chamberlain
@ 2023-05-25 18:35         ` Luis Chamberlain
  2023-05-25 18:50         ` Linus Torvalds
  1 sibling, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-25 18:35 UTC (permalink / raw)
  To: Linus Torvalds, Linux FS Devel, hch, brauner, david
  Cc: tglx, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, lennart, gregkh, rafael, song,
	lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt,
	dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu

On Thu, May 25, 2023 at 11:08:13AM -0700, Luis Chamberlain wrote:
> + fsdevel please review,

> So with two other hunks added (2nd and 4th), this now matches parity with
> my patch, not suggesting this is right, just demonstrating how this
> could be resolved with this. We could also just have a helper which lets
> the module code allow_write_access() at the end of its use of the fd
> (failure to load or module is removed).

This even fixes the pathological case with stress-ng for finit_module:

./stress-ng --module 8192 --module-name xfs

(stress-ng assumes you have all dependencies already loaded and
 the module is not loaded, it uses finit_module() directly)

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-25 16:07     ` Linus Torvalds
  2023-05-25 16:42       ` Greg KH
  2023-05-25 17:52       ` Linus Torvalds
@ 2023-05-25 18:45       ` Lucas De Marchi
  2023-05-25 21:12         ` Linus Torvalds
  2 siblings, 1 reply; 53+ messages in thread
From: Lucas De Marchi @ 2023-05-25 18:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Petr Pavlu, Luis Chamberlain, gregkh, rafael, song,
	lucas.de.marchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Thu, May 25, 2023 at 09:07:23AM -0700, Linus Torvalds wrote:
>On Thu, May 25, 2023 at 4:40 AM Petr Pavlu <petr.pavlu@suse.com> wrote:
>>
>> kmod normally uses finit_module() only if a module is not compressed,
>> otherwise it decompresses it first and then invokes init_module().
>
>Note that it would probably be good to teach Fedora and SuSE to use
>the kernel-side decompression, if only because we have it and would
>like to try to avoid using the old "load contents from user memory".
>
>Mainly because it allows security modules to actively check for
>tampering (ie things like verity etc). Long-term, it would be good to
>just deprecate the old init_module() entirely.

Right... I was trying to remember why that wasn't done yet since I
thought it was. The in-kernel decompression is much more recent than
the finit_module. Commit b1ae6dc41eaa ("module: add in-kernel support for decompressing")
was actually the one allowing to decompress on the kernel side
and  commit 169a58ad824d ("module/decompress: Support zstd in-kernel decompression")
brought the algo support on the kernel and userspace side to parity.

I will teach kmod to take the proper path considering the in-kernel
decompression availability.

>
>But yes:
>
>> It means that these and similarly organized distributions end up using
>> init_module(), and adding complexity to optimize finit_module() wouldn't
>> actually help in their case.
>
>Yeah, I think the real bug is absolutely in udev, and trying to load
>the same module hundreds of times is very very wrong. So I think the
>"mitigate it in the kernel" is at most a quick hack to fix user-space
>brokenness.
>
>And I don't think 1/2 is acceptable as that "quick hack". Not at all.
>It also seems fundamentally buggy, as it uses purely the inode number
>as the file identity, which means that it does bad things across
>filesystem limits.
>
>That said, I posted an alternate patch that I think _is_ valid as that
>quick hack. I don't love it, but it sure is simpler (and avoids the
>i_ino bug):
>
>    https://lore.kernel.org/lkml/CAHk-=wgKu=tJf1bm_dtme4Hde4zTB=_7EdgR8avsDRK4_jD+uA@mail.gmail.com/
>
>that patch hasn't seen any testing, and for all I know it won't even
>boot because of some thinko, but I think it would be acceptable as a
>workaround if it does work.
>
>But no, it's not some kind of "fix" for the bug, and yes, using
>init_module() rather than finit_module() will circumvent the quick
>hack. The true fix would be for udev to do proper handling of its data
>structures instead of randomly spraying duplicate module loading
>events.
>
>I don't know why udev does what it does. From what Luis told me,
>apparently it's just forking stuff and keeping all its data structures
>in memory, and has no actual consistency or locking or memory of what
>it has done. Luis pointed me at

It's a long time I don't touch that udev code, but my understanding
is that it first creates the kmod context and then starts to fork workers
(up to a limit) as the events arrive and there are no idle workers available.
At this point each of them have a separate kmod context derived from the
initial context. I was told the workers are needed because
a) they must be resilient to crashing without catastrophic consequences and
b) the kernel floods udev with thousands of netlink events during boot.
c) unrelated netlink events can't wait a module to be loaded, for example.

If the above is true (need confirmation from udev devs), then what could
be done on the userspace side would be:

1) do the modalias lookup first, before delegating the module load part
    to the workers. That will translate the modalias to the module name,
2) hand over to the worker the module loading part by name, not by alias,
    iff there isn't one being done for that already by other workers -
    workers need to share some state with the main process.

With this the dedup can happen based on the *module name*. I was told
a dedup based on the aliases is not effective as there are slight
changes on the modaliases being sent on boot leading to the same module.


>    https://lore.kernel.org/all/23bd0ce6-ef78-1cd8-1f21-0e706a00424a@suse.com/T/#u
>
>for some udev background.

the synchronization point in the kernel side rather than on userspace
used to be cheap and the race window smaller.  About the race: libkmod
already checks if there's a module being loaded before actually loading
it, however there is a race between the initstate file being created by
the kernel side and new requests arriving to load the same module.

>
>It's been about a decade since I looked at udev sources, and none of
>this encourages me to take a second look, so all of the above may be
>me misunderstanding just exactly what the udev problem is. But for
>that 'finit' case, we *could* try that simple hack of mine.
>
>I say "hack", but the patch really is pretty simple, and the concept
>of "exclusive special access" certainly is not some hack in itself.
>It's just not anything we've ever done before. So the hackishness from
>that exclusive_deny_write_access() thing in my patch is mainly that it
>shouldn't be needed at all (and that the exclusivity should probably
>be set some other way).
>
>Comments welcome.

Thinking only on the finit_module case and given libkmod will be
changed to prefer that path, it's not clear if it's preferred
to dedup on module name (userspace) or inode (kernel). Also worth
mentioning that both of them only protect against the window of calling
finit_module() and having a initstate file created by the kernel: if the
file exists in the coming or live states, libkmod will already do the
shortcut.

Are you willig to merge (a possibly improved version of) your patch
or the userspace change is still something that
would be desired?  Doing that on the kernel has the small advantage
that it also synchronizes requests from sources other than udev,
but I don't think we would have many to justify.

Lucas De Marchi

>
>                  Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection
  2023-05-25 18:08       ` Luis Chamberlain
  2023-05-25 18:35         ` Luis Chamberlain
@ 2023-05-25 18:50         ` Linus Torvalds
  2023-05-25 19:32           ` Luis Chamberlain
  1 sibling, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2023-05-25 18:50 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Linux FS Devel, hch, brauner, david, tglx, patches,
	linux-modules, linux-mm, linux-kernel, pmladek, petr.pavlu,
	prarit, lennart, gregkh, rafael, song, lucas.de.marchi,
	lucas.demarchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu

On Thu, May 25, 2023 at 11:08 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> Certainly on the track where I wish we could go. Now this goes tested.
> On 255 cores:
>
> Before:
>
> vagrant@kmod ~ $ sudo systemd-analyze
> Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s
> graphical.target reached after 44.178s in userspace.
>
> root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats
>  Virtual mem wasted bytes       1949006968
>
>
> ; 1949006968/1024/1024/1024
>         ~1.81515418738126754761
>
> So ~1.8 GiB... of vmalloc space wasted during boot.
>
> After:
>
> systemd-analyze
> Startup finished in 24.438s (kernel) + 41.278s (userspace) = 1min 5.717s
> graphical.target reached after 41.154s in userspace.
>
> root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats
>  Virtual mem wasted bytes       354413398
>
> So still 337.99 MiB of vmalloc space wasted during boot due to
> duplicates.

Ok. I think this will count as 'good enough for mitigation purposes'

> The reason is the exclusive_deny_write_access() must be
> kept during the life of the module otherwise as soon as it is done
> others can still race to load

Yes. The exclusion only applies while the file is actively being read.

> So with two other hunks added (2nd and 4th), this now matches parity with
> my patch, not suggesting this is right,

Yeah, we can't do that, because user space may quite validly want to
write the file afterwards.

Or, in fact, unload the module and re-load it.

So the "exclusion" really needs to be purely temporary.

That said, I considered moving the exclusion to module/main.c itself,
rather than the reading part. That wouild get rid of the hacky "id ==
READING_MODULE", and put the exclusion in the place that actually
wants it.

And that would allow us to at least extend that temporary exlusion a
bit - we could keep it until the module has actually been loaded and
inited.

So it would probably improve on those numbers a bit more, but you'd
still have the fundamental race where *serial* duplicates end up
always wasting CPU effort and temporary vmalloc space.

                   Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection
  2023-05-25 18:50         ` Linus Torvalds
@ 2023-05-25 19:32           ` Luis Chamberlain
  0 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-25 19:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux FS Devel, hch, brauner, david, tglx, patches,
	linux-modules, linux-mm, linux-kernel, pmladek, petr.pavlu,
	prarit, lennart, gregkh, rafael, song, lucas.de.marchi,
	lucas.demarchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu

On Thu, May 25, 2023 at 11:50 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> So it would probably improve on those numbers a bit more, but you'd
> still have the fundamental race where *serial* duplicates end up
> always wasting CPU effort and temporary vmalloc space.

The known failed boots are with KASAN with a large number of CPUs, so
the value in
the mitigation would be to help those boot until userspace fixes it
and we have enough
time for propagation. But since it is not a full proof solution, it
may seem like an odd thing
to have in place later and this being lost as odd tribal knowledge.
I'd be in favor of only
applying the mitigation if we really are chasing userspace to fix
this, and we'd be OK
in later removing it after userspace gets this fixed / propagated.

If we're going to have userspace fix this, who is volunteering?

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-25 18:45       ` Lucas De Marchi
@ 2023-05-25 21:12         ` Linus Torvalds
  2023-05-25 22:02           ` Luis Chamberlain
  0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2023-05-25 21:12 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: Petr Pavlu, Luis Chamberlain, gregkh, rafael, song,
	lucas.de.marchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

[-- Attachment #1: Type: text/plain, Size: 2097 bytes --]

On Thu, May 25, 2023 at 11:45 AM Lucas De Marchi
<lucas.demarchi@intel.com> wrote:
>
> Are you willig to merge (a possibly improved version of) your patch
> or the userspace change is still something that would be desired?

I think a user space change should still be something that people
should look at, particularly as the kernel side patch I'm willing to
accept doesn't catch the "completely serial" cases, only the "trying
to load at the same time that the same module is literally busy being
loaded".

But I've cleaned up my patch a bit, and while the cleaned-up version
is rather larger as a patch (mainly because of just also re-organizing
the finit_module() code to do all the 'struct file' prep), I'm
actually pretty happy with this attached patch conceptually.

In this form, it actually "makes sense" to me, rather than being just
clearly a workaround.  Also, unlike the previous patch, this doesn't
actually make any changes to the basic kernel_read_file() set of
functions, it's all done by the module loading code itself.

Luis, would you mind testing this version on your load? It still won't
actually handle the purely serial case, so there *will* be those
spurious double module reads from different CPU's just doing the
things serially, but the exclusive file access region has been
extended to not just cover the actual file content reading, but to
cover the whole "turn it into a a real module" part too.

Also, this does *not* update some of the comments in the module
loading. I changed finit_module to use "kernel_read_file()" instead of
"kernel_read_file_from_fd()", since it actually now has to look up the
file descriptor anyway. But the comments still talk about that
"from_fd" thing.

Anyway, this is back to "ENTIRELY UNTESTED" territory, in that I've
compiled this, but haven't booted it. The changes look obvious, but
hey, mistakes happen.

And the commit message is just a place-holder. Obviously. I won't sign
off on this or write more of a commit message until it has had some
real testing.

                  Linus

[-- Attachment #2: 0001-Attempt-at-workaround-for-horrible-udev-module-loadi.patch --]
[-- Type: text/x-patch, Size: 3708 bytes --]

From e3223bfbdd5455f0b11337a591fad3a1816b9d09 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 25 May 2023 09:32:25 -0700
Subject: [PATCH] Attempt at workaround for horrible udev module loading
 behavior

---
 include/linux/fs.h   |  6 ++++
 kernel/module/main.c | 72 ++++++++++++++++++++++++++++++--------------
 2 files changed, 56 insertions(+), 22 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21a981680856..722b42a77d51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2566,6 +2566,12 @@ static inline int deny_write_access(struct file *file)
 	struct inode *inode = file_inode(file);
 	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
 }
+static inline int exclusive_deny_write_access(struct file *file)
+{
+	int old = 0;
+	struct inode *inode = file_inode(file);
+	return atomic_try_cmpxchg(&inode->i_writecount, &old, -1) ? 0 : -ETXTBSY;
+}
 static inline void put_write_access(struct inode * inode)
 {
 	atomic_dec(&inode->i_writecount);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 044aa2c9e3cb..b4c7e925fdb0 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -3057,11 +3057,53 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
 	return load_module(&info, uargs, 0);
 }
 
-SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
+static int file_init_module(struct file *file, const char __user * uargs, int flags)
 {
 	struct load_info info = { };
 	void *buf = NULL;
 	int len;
+
+	len = kernel_read_file(file, 0, &buf, INT_MAX, NULL,
+				       READING_MODULE);
+	if (len < 0) {
+		mod_stat_inc(&failed_kreads);
+		mod_stat_add_long(len, &invalid_kread_bytes);
+		return len;
+	}
+
+	if (flags & MODULE_INIT_COMPRESSED_FILE) {
+		int err = module_decompress(&info, buf, len);
+		vfree(buf); /* compressed data is no longer needed */
+		if (err) {
+			mod_stat_inc(&failed_decompress);
+			mod_stat_add_long(len, &invalid_decompress_bytes);
+			return err;
+		}
+	} else {
+		info.hdr = buf;
+		info.len = len;
+	}
+
+	return load_module(&info, uargs, flags);
+}
+
+/*
+ * kernel_read_file() will already deny write access, but module
+ * loading wants _exclusive_ access to the file, so we do that
+ * here, along with basic sanity checks.
+ */
+static int prepare_file_for_module_load(struct file *file)
+{
+	if (!file || !(file->f_mode & FMODE_READ))
+		return -EBADF;
+	if (!S_ISREG(file_inode(file)->i_mode))
+		return -EINVAL;
+	return exclusive_deny_write_access(file);
+}
+
+SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
+{
+	struct fd f;
 	int err;
 
 	err = may_init_module();
@@ -3075,28 +3117,14 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 		      |MODULE_INIT_COMPRESSED_FILE))
 		return -EINVAL;
 
-	len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
-				       READING_MODULE);
-	if (len < 0) {
-		mod_stat_inc(&failed_kreads);
-		mod_stat_add_long(len, &invalid_kread_bytes);
-		return len;
+	f = fdget(fd);
+	err = prepare_file_for_module_load(f.file);
+	if (!err) {
+		err = file_init_module(f.file, uargs, flags);
+		allow_write_access(f.file);
 	}
-
-	if (flags & MODULE_INIT_COMPRESSED_FILE) {
-		err = module_decompress(&info, buf, len);
-		vfree(buf); /* compressed data is no longer needed */
-		if (err) {
-			mod_stat_inc(&failed_decompress);
-			mod_stat_add_long(len, &invalid_decompress_bytes);
-			return err;
-		}
-	} else {
-		info.hdr = buf;
-		info.len = len;
-	}
-
-	return load_module(&info, uargs, flags);
+	fdput(f);
+	return err;
 }
 
 /* Keep in sync with MODULE_FLAGS_BUF_SIZE !!! */
-- 
2.41.0.rc1.4.gb4dae75050


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-25 21:12         ` Linus Torvalds
@ 2023-05-25 22:02           ` Luis Chamberlain
  2023-05-26  1:39             ` Linus Torvalds
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-25 22:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lucas De Marchi, Petr Pavlu, gregkh, rafael, song,
	lucas.de.marchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Thu, May 25, 2023 at 02:12:49PM -0700, Linus Torvalds wrote:
> On Thu, May 25, 2023 at 11:45 AM Lucas De Marchi
> <lucas.demarchi@intel.com> wrote:
> >
> > Are you willig to merge (a possibly improved version of) your patch
> > or the userspace change is still something that would be desired?
> 
> I think a user space change should still be something that people
> should look at, particularly as the kernel side patch I'm willing to
> accept doesn't catch the "completely serial" cases, only the "trying
> to load at the same time that the same module is literally busy being
> loaded".
> 
> But I've cleaned up my patch a bit, and while the cleaned-up version
> is rather larger as a patch (mainly because of just also re-organizing
> the finit_module() code to do all the 'struct file' prep), I'm
> actually pretty happy with this attached patch conceptually.
> 
> In this form, it actually "makes sense" to me, rather than being just
> clearly a workaround.  Also, unlike the previous patch, this doesn't
> actually make any changes to the basic kernel_read_file() set of
> functions, it's all done by the module loading code itself.
> 
> Luis, would you mind testing this version on your load? It still won't
> actually handle the purely serial case, so there *will* be those
> spurious double module reads from different CPU's just doing the
> things serially, but the exclusive file access region has been
> extended to not just cover the actual file content reading, but to
> cover the whole "turn it into a a real module" part too.
> 
> Also, this does *not* update some of the comments in the module
> loading. I changed finit_module to use "kernel_read_file()" instead of
> "kernel_read_file_from_fd()", since it actually now has to look up the
> file descriptor anyway. But the comments still talk about that
> "from_fd" thing.
> 
> Anyway, this is back to "ENTIRELY UNTESTED" territory, in that I've
> compiled this, but haven't booted it. The changes look obvious, but
> hey, mistakes happen.
> 
> And the commit message is just a place-holder. Obviously. I won't sign
> off on this or write more of a commit message until it has had some
> real testing.

With 255 vcpus:

Before:

vagrant@kmod ~ $ sudo systemd-analyze                                            
Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s        
graphical.target reached after 44.178s in userspace.  

root@kmod ~ # grep "Virtual mem wasted bytes"
/sys/kernel/debug/modules/stats    
 Virtual mem wasted bytes       1949006968                                       

So ~1.8 GiB.

After:

root@kmod ~ # systemd-analyze 
Startup finished in 35.872s (kernel) + 41.715s (userspace) = 1min 17.588s 
graphical.target reached after 41.594s in userspace.

root@kmod ~ # cat /sys/kernel/debug/modules/stats
         Mods ever loaded       66
     Mods failed on kread       0
Mods failed on decompress       0
  Mods failed on becoming       1
      Mods failed on load       0
        Total module size       11268096
      Total mod text size       4149248
       Failed kread bytes       0
  Failed decompress bytes       0
    Failed becoming bytes       474688
        Failed kmod bytes       0
 Virtual mem wasted bytes       474688
         Average mod size       170729
    Average mod text size       62868
  Avg fail becoming bytes       474688
Duplicate failed modules:
              Module-name        How-many-times                    Reason
                   cryptd                     1                  Becoming

root@kmod ~ # du -b /lib/modules/6.3.0-next-20230505+/kernel/crypto/cryptd.ko
475409 /lib/modules/6.3.0-next-20230505+/kernel/crypto/cryptd.ko

So yeah definitely a pretty good improvement. Sometimes the system boots
without any duplicates at all, for some reason Vs the previous attempt.

Tested-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-25 22:02           ` Luis Chamberlain
@ 2023-05-26  1:39             ` Linus Torvalds
  2023-05-29  8:58               ` Johan Hovold
  0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2023-05-26  1:39 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Lucas De Marchi, Petr Pavlu, gregkh, rafael, song,
	lucas.de.marchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Thu, May 25, 2023 at 3:02 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> So yeah definitely a pretty good improvement. Sometimes the system boots
> without any duplicates at all, for some reason Vs the previous attempt.
>
> Tested-by: Luis Chamberlain <mcgrof@kernel.org>

Ok, I decided to just move it from my experimental tree to my main tree.

I think I used about three times the time and effort (and lines of
text) on writing the commit message compared to what I did on the
patch itself.

I tried to lay out the background and the implications of the change -
it may be pretty darn simple, but it does have some subtle issues.

Anyway: I've committed it to my tree. This is not necessarily the best
time to do that, but let's get this behind us, and in particular,
let's get it out and into wider testing asap.

If it causes any problems what-so-ever, I'll just revert it very
aggressively (unless the problem is trivially and obviously fixable).
It is, after all, not a fix for a _kernel_ bug per se, and whil eI
think the patch is very benign, it does change user-visible behavior.
Very intentionally so, but still..

                         Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-26  1:39             ` Linus Torvalds
@ 2023-05-29  8:58               ` Johan Hovold
  2023-05-29 11:00                 ` Linus Torvalds
  0 siblings, 1 reply; 53+ messages in thread
From: Johan Hovold @ 2023-05-29  8:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luis Chamberlain, Lucas De Marchi, Petr Pavlu, gregkh, rafael,
	song, lucas.de.marchi, christophe.leroy, peterz, rppt, dave,
	willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Thu, May 25, 2023 at 06:39:52PM -0700, Linus Torvalds wrote:

> Ok, I decided to just move it from my experimental tree to my main tree.
> 
> I think I used about three times the time and effort (and lines of
> text) on writing the commit message compared to what I did on the
> patch itself.
> 
> I tried to lay out the background and the implications of the change -
> it may be pretty darn simple, but it does have some subtle issues.
> 
> Anyway: I've committed it to my tree. This is not necessarily the best
> time to do that, but let's get this behind us, and in particular,
> let's get it out and into wider testing asap.
> 
> If it causes any problems what-so-ever, I'll just revert it very
> aggressively (unless the problem is trivially and obviously fixable).
> It is, after all, not a fix for a _kernel_ bug per se, and whil eI
> think the patch is very benign, it does change user-visible behavior.
> Very intentionally so, but still..

This change breaks module loading during boot on the Lenovo Thinkpad
X13s (aarch64).

Specifically it results in indefinite probe deferral of the display and
USB (ethernet) which makes it a pain to debug. Typing in the dark to
acquire some logs reveals that other modules are missing as well.

Fortunately commit 9828ed3f695a ("module: error out early on concurrent
load of the same module file") stood out when skimming the changes that
went into -rc4, and reverting it make all the expected modules be loaded
again.

I have not tried to figure out exactly why things break, but it does
seem like this one should be reverted.

Johan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-29  8:58               ` Johan Hovold
@ 2023-05-29 11:00                 ` Linus Torvalds
  2023-05-29 12:44                   ` Johan Hovold
  0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2023-05-29 11:00 UTC (permalink / raw)
  To: Johan Hovold
  Cc: Luis Chamberlain, Lucas De Marchi, Petr Pavlu, gregkh, rafael,
	song, lucas.de.marchi, christophe.leroy, peterz, rppt, dave,
	willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Mon, May 29, 2023 at 4:58 AM Johan Hovold <johan@kernel.org> wrote:
>
> I have not tried to figure out exactly why things break, but it does
> seem like this one should be reverted.

Yes, I have done so.

However, can I ask you to just verify that it was purely the exclusive
open part, and it wasn't that I messed up something else. IOW, can you
replace the

        return exclusive_deny_write_access(file);

in prepare_file_for_module_load() with just a "return 0", and remove the

                allow_write_access(f.file);

line in finit_module()?

That's obviously _instead_ of the revert that I already pushed out,
just to verify that "yup, it's that part, not something silly
elsewhere"

I do wonder what it is that is different in your setup, and maybe you
could also enable the

        pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);

in finit_module() while you are at it? Since you'd be editing that
file anyway for the test, just change the pr_debug() to a printk() and
then do

    dmesg | grep finit_module

to see what it all results in (on a working kernel, of course).

                    Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-29 11:00                 ` Linus Torvalds
@ 2023-05-29 12:44                   ` Johan Hovold
  2023-05-29 15:18                     ` Johan Hovold
  2023-05-29 17:47                     ` Linus Torvalds
  0 siblings, 2 replies; 53+ messages in thread
From: Johan Hovold @ 2023-05-29 12:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luis Chamberlain, Lucas De Marchi, Petr Pavlu, gregkh, rafael,
	song, lucas.de.marchi, christophe.leroy, peterz, rppt, dave,
	willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Mon, May 29, 2023 at 07:00:05AM -0400, Linus Torvalds wrote:

> However, can I ask you to just verify that it was purely the exclusive
> open part, and it wasn't that I messed up something else. IOW, can you
> replace the
> 
>         return exclusive_deny_write_access(file);
> 
> in prepare_file_for_module_load() with just a "return 0", and remove the
> 
>                 allow_write_access(f.file);
> 
> line in finit_module()?
> 
> That's obviously _instead_ of the revert that I already pushed out,
> just to verify that "yup, it's that part, not something silly
> elsewhere"

Yes, those two changes are enough to make the problem go away.

> I do wonder what it is that is different in your setup, and maybe you
> could also enable the
> 
>         pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);

Below is the corresponding output with a working kernel: 174 requests
for the 131 modules that end up being loaded (without the revert there
is only around 110 modules loaded).

There is some probe deferral and async probing going on during normal
boot which may be part of the explanation.

Johan

[    0.669112] finit_module: fd=3, uargs=00000000b461506c, flags=0
[    0.674144] finit_module: fd=4, uargs=00000000b461506c, flags=0
[    0.676783] finit_module: fd=5, uargs=00000000b461506c, flags=0
[    0.678920] finit_module: fd=3, uargs=00000000b461506c, flags=0
[    0.837967] finit_module: fd=5, uargs=0000000000157d9f, flags=0
[    0.839414] finit_module: fd=3, uargs=00000000b461506c, flags=0
[    0.844129] finit_module: fd=4, uargs=00000000b461506c, flags=0
[    0.845016] finit_module: fd=3, uargs=00000000b461506c, flags=0
[    0.849132] finit_module: fd=3, uargs=00000000b461506c, flags=0
[    0.849460] finit_module: fd=4, uargs=00000000b461506c, flags=0
[    3.345004] finit_module: fd=4, uargs=00000000e3e6c6d2, flags=0
[    3.364302] finit_module: fd=4, uargs=0000000095136ea7, flags=0
[    3.371928] finit_module: fd=5, uargs=0000000095136ea7, flags=0
[    4.099183] finit_module: fd=3, uargs=00000000ce2d6f3e, flags=0
[    4.103156] finit_module: fd=3, uargs=000000004e3e14c2, flags=0
[    4.713558] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.715608] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.717620] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.717910] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.719517] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.725862] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.726730] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.727018] finit_module: fd=14, uargs=00000000262da138, flags=0
[    4.730525] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.749602] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.749675] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.749678] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.774117] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.795307] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.797327] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.798405] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.799140] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.800850] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.807306] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.807313] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.807321] finit_module: fd=16, uargs=00000000262da138, flags=0
[    4.807394] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.807463] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.807525] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.807530] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.807590] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.811469] finit_module: fd=0, uargs=0000000080fab15b, flags=0
[    4.845851] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.845875] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.846282] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.846363] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.846363] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.846669] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.846994] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.847005] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.847194] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.847356] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.847448] finit_module: fd=17, uargs=00000000262da138, flags=0
[    4.847556] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.847651] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.848175] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.850005] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.850485] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.866031] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.866032] finit_module: fd=16, uargs=00000000262da138, flags=0
[    4.866381] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.866711] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.867757] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.868360] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.886043] finit_module: fd=16, uargs=00000000262da138, flags=0
[    4.886046] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.886046] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.886060] finit_module: fd=17, uargs=00000000262da138, flags=0
[    4.886114] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.886140] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.886326] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.886716] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.887210] finit_module: fd=16, uargs=00000000262da138, flags=0
[    4.887451] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.887811] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.887963] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.892066] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.896048] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.896070] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.896092] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.896157] finit_module: fd=18, uargs=00000000262da138, flags=0
[    4.896193] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.896211] finit_module: fd=18, uargs=00000000262da138, flags=0
[    4.896737] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.896751] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.897174] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.897343] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.897388] finit_module: fd=19, uargs=00000000262da138, flags=0
[    4.897555] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.897592] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.899657] finit_module: fd=16, uargs=00000000262da138, flags=0
[    4.899973] finit_module: fd=6, uargs=00000000262da138, flags=0
[    4.900316] finit_module: fd=17, uargs=00000000262da138, flags=0
[    4.901188] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.901668] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.901708] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.902030] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.903964] finit_module: fd=14, uargs=00000000262da138, flags=0
[    4.905243] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.907083] finit_module: fd=17, uargs=00000000262da138, flags=0
[    4.907480] finit_module: fd=16, uargs=00000000262da138, flags=0
[    4.907519] finit_module: fd=16, uargs=00000000262da138, flags=0
[    4.909481] finit_module: fd=13, uargs=00000000262da138, flags=0
[    4.911705] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.912056] finit_module: fd=18, uargs=00000000262da138, flags=0
[    4.912079] finit_module: fd=15, uargs=00000000262da138, flags=0
[    4.915340] finit_module: fd=19, uargs=00000000262da138, flags=0
[    4.933199] finit_module: fd=16, uargs=00000000262da138, flags=0
[    4.933585] finit_module: fd=18, uargs=00000000262da138, flags=0
[    4.935169] finit_module: fd=20, uargs=00000000262da138, flags=0
[    4.956021] finit_module: fd=19, uargs=00000000262da138, flags=0
[    4.956797] finit_module: fd=16, uargs=00000000262da138, flags=0
[    4.959865] finit_module: fd=17, uargs=00000000262da138, flags=0
[    4.964171] finit_module: fd=17, uargs=00000000262da138, flags=0
[    4.977073] finit_module: fd=18, uargs=00000000262da138, flags=0
[    4.980167] finit_module: fd=18, uargs=00000000262da138, flags=0
[    5.043379] finit_module: fd=6, uargs=00000000262da138, flags=0
[    5.053709] finit_module: fd=6, uargs=00000000262da138, flags=0
[    5.131232] finit_module: fd=15, uargs=00000000262da138, flags=0
[    5.140785] finit_module: fd=19, uargs=00000000262da138, flags=0
[    5.186244] finit_module: fd=18, uargs=00000000262da138, flags=0
[    5.186247] finit_module: fd=6, uargs=00000000262da138, flags=0
[    5.186252] finit_module: fd=6, uargs=00000000262da138, flags=0
[    5.186451] finit_module: fd=15, uargs=00000000262da138, flags=0
[    5.186507] finit_module: fd=13, uargs=00000000262da138, flags=0
[    5.187345] finit_module: fd=6, uargs=00000000262da138, flags=0
[    5.190282] finit_module: fd=15, uargs=00000000262da138, flags=0
[    5.195744] finit_module: fd=13, uargs=00000000262da138, flags=0
[    5.198242] finit_module: fd=16, uargs=00000000262da138, flags=0
[    5.198271] finit_module: fd=20, uargs=00000000262da138, flags=0
[    5.222394] finit_module: fd=6, uargs=00000000262da138, flags=0
[    5.222395] finit_module: fd=6, uargs=00000000262da138, flags=0
[    5.222407] finit_module: fd=16, uargs=00000000262da138, flags=0
[    5.222430] finit_module: fd=21, uargs=00000000262da138, flags=0
[    5.222432] finit_module: fd=16, uargs=00000000262da138, flags=0
[    5.222443] finit_module: fd=15, uargs=00000000262da138, flags=0
[    5.229650] finit_module: fd=22, uargs=00000000262da138, flags=0
[    5.257981] finit_module: fd=6, uargs=00000000262da138, flags=0
[    5.313560] finit_module: fd=6, uargs=00000000262da138, flags=0
[    6.144316] finit_module: fd=6, uargs=00000000262da138, flags=0
[    6.178956] finit_module: fd=6, uargs=00000000262da138, flags=0
[    6.178961] finit_module: fd=6, uargs=00000000262da138, flags=0
[    6.182057] finit_module: fd=13, uargs=00000000262da138, flags=0
[    6.182067] finit_module: fd=13, uargs=00000000262da138, flags=0
[    6.243708] finit_module: fd=0, uargs=00000000f9e4f67e, flags=0
[    6.249397] finit_module: fd=0, uargs=000000001b26db10, flags=0
[    6.249904] finit_module: fd=1, uargs=000000001b26db10, flags=0
[    6.250626] finit_module: fd=2, uargs=000000001b26db10, flags=0
[    6.251515] finit_module: fd=3, uargs=000000001b26db10, flags=0
[    6.254112] finit_module: fd=4, uargs=000000001b26db10, flags=0
[    6.255129] finit_module: fd=6, uargs=00000000262da138, flags=0
[    6.255504] finit_module: fd=6, uargs=00000000262da138, flags=0
[    6.259256] finit_module: fd=0, uargs=00000000e1b6cfe4, flags=0
[    6.264136] finit_module: fd=0, uargs=000000004070418f, flags=0
[    6.265227] finit_module: fd=1, uargs=000000004070418f, flags=0
[    6.270175] finit_module: fd=0, uargs=0000000092757077, flags=0
[    6.271230] finit_module: fd=1, uargs=0000000092757077, flags=0
[    6.322960] finit_module: fd=3, uargs=00000000fb904223, flags=0
[    6.373125] finit_module: fd=0, uargs=00000000e2cdc73f, flags=0
[    6.380061] finit_module: fd=0, uargs=000000000591e4e9, flags=0
[    6.392296] finit_module: fd=0, uargs=0000000088d2796a, flags=0
[    6.464595] finit_module: fd=0, uargs=000000001aa8b42e, flags=0
[    6.479839] finit_module: fd=0, uargs=00000000de50a030, flags=0
[    6.488790] finit_module: fd=0, uargs=00000000cbcb6a65, flags=0
[    6.492573] finit_module: fd=1, uargs=00000000cbcb6a65, flags=0
[    6.514903] finit_module: fd=0, uargs=000000006f393376, flags=0
[    7.989970] finit_module: fd=0, uargs=00000000c4594f52, flags=0
[   11.492886] finit_module: fd=13, uargs=00000000262da138, flags=0
[   11.639532] finit_module: fd=6, uargs=00000000262da138, flags=0
[   11.640048] finit_module: fd=6, uargs=00000000262da138, flags=0
[   11.640997] finit_module: fd=15, uargs=00000000262da138, flags=0
[   11.641049] finit_module: fd=6, uargs=00000000262da138, flags=0
[   11.775051] finit_module: fd=6, uargs=00000000262da138, flags=0
[   11.776806] finit_module: fd=6, uargs=00000000262da138, flags=0

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-29 12:44                   ` Johan Hovold
@ 2023-05-29 15:18                     ` Johan Hovold
  2023-05-30  1:55                       ` Linus Torvalds
  2023-05-29 17:47                     ` Linus Torvalds
  1 sibling, 1 reply; 53+ messages in thread
From: Johan Hovold @ 2023-05-29 15:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luis Chamberlain, Lucas De Marchi, Petr Pavlu, gregkh, rafael,
	song, lucas.de.marchi, christophe.leroy, peterz, rppt, dave,
	willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

> On Mon, May 29, 2023 at 07:00:05AM -0400, Linus Torvalds wrote:
> > I do wonder what it is that is different in your setup

I took a closer look at some of the modules that failed to load and
noticed a pattern in that they have dependencies that are needed by more
than one device.

If attempts to load the drivers for two such devices are made in
parallel, only one of them may be successful in loading the shared
dependency which means that module loading is now aborted for the other.

I took a quick look at the kmod code

	https://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git/tree/libkmod/libkmod-module.c#n1305

and it does seem like this could happen if we start returning errors
when a module is already in the process of being loaded.

Johan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-29 12:44                   ` Johan Hovold
  2023-05-29 15:18                     ` Johan Hovold
@ 2023-05-29 17:47                     ` Linus Torvalds
  2023-05-30 10:01                       ` Johan Hovold
  1 sibling, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2023-05-29 17:47 UTC (permalink / raw)
  To: Johan Hovold
  Cc: Luis Chamberlain, Lucas De Marchi, Petr Pavlu, gregkh, rafael,
	song, lucas.de.marchi, christophe.leroy, peterz, rppt, dave,
	willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Mon, May 29, 2023 at 8:44 AM Johan Hovold <johan@kernel.org> wrote:
>
> Yes, those two changes are enough to make the problem go away.

Ok, good. Expected, but just verifying that it wasn't some silly
incidental thinko.

> > I do wonder what it is that is different in your setup, and maybe you
> > could also enable the
> >
> >         pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);
>
> Below is the corresponding output with a working kernel: 174 requests
> for the 131 modules that end up being loaded (without the revert there
> is only around 110 modules loaded).

Ok, your setup doesn't sound *too* different from mine. I have 176
kernel modules on my laptop right now, and that exclusive open
obviously worked fine for me.

But it could easily be some random small difference just from
different hardware, so...

And yeah, that dmesg output is useless, I didn't think of the fact
that it only prints out the file descriptor, not the actual path to
the file. In fact, without that change in place, the module code never
actually looks at the file and leaves it all to
kernel_read_file_from_fd().

With my change, it woul dhave been trivial to use "%pD" and point it
at the file pointer instead, and get the dentry name that way, but
never mind.  I think you're entirely right that it's probably due to a
shared dependency module, and I just didn't happen to trigger that
case.

Sadly, the whole idea was to figure out the exclusion so early that we
don't have the module data structures lookup up yet, so there's no
really obvious thing to serialize the load on.

I'll have to think about this more. Serializing on a per-inode lock
would seem to be the simplest thing, but they are all for IO, and we
can't just take them over the read.

                     Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-29 15:18                     ` Johan Hovold
@ 2023-05-30  1:55                       ` Linus Torvalds
  2023-05-30  9:40                         ` Johan Hovold
                                           ` (3 more replies)
  0 siblings, 4 replies; 53+ messages in thread
From: Linus Torvalds @ 2023-05-30  1:55 UTC (permalink / raw)
  To: Johan Hovold
  Cc: Luis Chamberlain, Lucas De Marchi, Petr Pavlu, gregkh, rafael,
	song, lucas.de.marchi, christophe.leroy, peterz, rppt, dave,
	willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

[-- Attachment #1: Type: text/plain, Size: 2044 bytes --]

On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
>
> I took a closer look at some of the modules that failed to load and
> noticed a pattern in that they have dependencies that are needed by more
> than one device.

Ok, this is a "maybe something like this" RFC series of two patches -
one trivial one to re-organize things a bit so that we can then do the
real one which uses a filter based on the inode pointer to return an
"idempotent return value" for module loads that share the same inode.

It's entirely untested, and since I'm on the road I'm going to not
really be able to test it. It compiles for me, and the code looks
fairly straightforward, but it's probably buggy.

It's very loosely based on Luis' attempt,  but it
 (a) is internal to module loading
 (b) uses a reliable cookie
 (c) doesn't leave the cookie around randomly for later
 (d) has seen absolutely no testing

Put another way: if somebody wants to play with this, please treat it
as a starting point, not the final thing. You might need to debug
things, and fix silly mistakes.

The idea is to just have a simple hash list of currently executing
module loads, protected by a trivial spinlock. Every module loader
adds itself to the right hash list, and if they were the *first* one
(ie no other pending module loads for that inode), will actually do
the module load.

Everybody who *isn't* the first one will just wait for completion and
return the same error code that the first one returned.

This is technically bogus. The first one might fail due to arguments.
So the cookie shouldn't be just the inode, it should be the inode and
a hash of the arguments or something like that. But it is what it is,
and apart from possible show-stopper bugs this is no worse than the
failed "exclusive write deny" attempt. IOW - maybe worth trying?

And if *that* didn't sell people on this patch series, I don't know
what will. I should be in marketing! Two drink minimums, here I come!

               Linus

[-- Attachment #2: 0001-module-split-up-finit_module-into-init_module_from_f.patch --]
[-- Type: text/x-patch, Size: 2888 bytes --]

From b7d19af1b2a3ca9f789df9c04147576fca7c5b8f Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 29 May 2023 20:55:13 -0400
Subject: [PATCH 1/2] module: split up 'finit_module()' into
 init_module_from_file() helper

This will simplify the next step, where we can then key off the inode to
do one idempotent module load.

Let's do the obvious re-organization in one step, and then the new code
in another.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 kernel/module/main.c | 42 +++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 044aa2c9e3cb..427bffa2844f 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -3057,26 +3057,16 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
 	return load_module(&info, uargs, 0);
 }
 
-SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
+static int init_module_from_file(struct file *f, const char __user * uargs, int flags)
 {
 	struct load_info info = { };
 	void *buf = NULL;
 	int len;
-	int err;
 
-	err = may_init_module();
-	if (err)
-		return err;
+	if (!f || !(f->f_mode & FMODE_READ))
+		return -EBADF;
 
-	pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);
-
-	if (flags & ~(MODULE_INIT_IGNORE_MODVERSIONS
-		      |MODULE_INIT_IGNORE_VERMAGIC
-		      |MODULE_INIT_COMPRESSED_FILE))
-		return -EINVAL;
-
-	len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
-				       READING_MODULE);
+	len = kernel_read_file(f, 0, &buf, INT_MAX, NULL, READING_MODULE);
 	if (len < 0) {
 		mod_stat_inc(&failed_kreads);
 		mod_stat_add_long(len, &invalid_kread_bytes);
@@ -3084,7 +3074,7 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 	}
 
 	if (flags & MODULE_INIT_COMPRESSED_FILE) {
-		err = module_decompress(&info, buf, len);
+		int err = module_decompress(&info, buf, len);
 		vfree(buf); /* compressed data is no longer needed */
 		if (err) {
 			mod_stat_inc(&failed_decompress);
@@ -3099,6 +3089,28 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 	return load_module(&info, uargs, flags);
 }
 
+SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
+{
+	int err;
+	struct fd f;
+
+	err = may_init_module();
+	if (err)
+		return err;
+
+	pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);
+
+	if (flags & ~(MODULE_INIT_IGNORE_MODVERSIONS
+		      |MODULE_INIT_IGNORE_VERMAGIC
+		      |MODULE_INIT_COMPRESSED_FILE))
+		return -EINVAL;
+
+	f = fdget(fd);
+	err = init_module_from_file(f.file, uargs, flags);
+	fdput(f);
+	return err;
+}
+
 /* Keep in sync with MODULE_FLAGS_BUF_SIZE !!! */
 char *module_flags(struct module *mod, char *buf, bool show_state)
 {
-- 
2.40.0.rc1.2.gd15644fe02


[-- Attachment #3: 0002-modules-catch-concurrent-module-loads-take-two.patch --]
[-- Type: text/x-patch, Size: 3300 bytes --]

From c844099ab4d3032424b4bf8720e761fbccf88ea5 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 29 May 2023 21:39:51 -0400
Subject: [PATCH 2/2] modules: catch concurrent module loads, take two

This treats concurrent module loads from a file as "idempotent" in the
inode, ie if one module load is ongoing, we don't start a new one, but
instead just expect the first one to complete and return the same return
value as it did.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 kernel/module/main.c | 73 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 71 insertions(+), 2 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 427bffa2844f..82b0dcc1fe77 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -3057,15 +3057,82 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
 	return load_module(&info, uargs, 0);
 }
 
+struct idempotent {
+	const void *cookie;
+	struct hlist_node entry;
+	struct completion complete;
+	int ret;
+};
+
+#define IDEM_HASH_BITS 8
+static struct hlist_head idem_hash[1 << IDEM_HASH_BITS];
+static struct spinlock idem_lock;
+
+static bool idempotent(struct idempotent *u, const void *cookie)
+{
+	int hash = hash_ptr(cookie, IDEM_HASH_BITS);
+	struct hlist_head *head = idem_hash + hash;
+	struct idempotent *existing;
+	bool first;
+
+	u->ret = 0;
+	u->cookie = cookie;
+	init_completion(&u->complete);
+
+	spin_lock(&idem_lock);
+	first = true;
+	hlist_for_each_entry(existing, head, entry) {
+		if (existing->cookie != cookie)
+			continue;
+		first = false;
+		break;
+	}
+	hlist_add_head(&u->entry, idem_hash+hash);
+	spin_unlock(&idem_lock);
+
+	return !first;
+}
+
+/*
+ * We were the first one with 'cookie' on the list, and we ended
+ * up completing the operation. We now need to walk the list,
+ * remove everybody - which includes ourselfs - fill in the return
+ * value, and then complete the operation.
+ */
+static void idempotent_complete(struct idempotent *u, int ret)
+{
+	const void *cookie = u->cookie;
+	int hash = hash_ptr(cookie, IDEM_HASH_BITS);
+	struct hlist_head *head = idem_hash + hash;
+	struct hlist_node *next;
+	struct idempotent *pos;
+
+	spin_lock(&idem_lock);
+	hlist_for_each_entry_safe(pos, next, head, entry) {
+		if (pos->cookie != cookie)
+			continue;
+		hlist_del(&pos->entry);
+		pos->ret = ret;
+		complete(&pos->complete);
+	}
+	spin_unlock(&idem_lock);
+}
+
 static int init_module_from_file(struct file *f, const char __user * uargs, int flags)
 {
+	struct idempotent idem;
 	struct load_info info = { };
 	void *buf = NULL;
-	int len;
+	int len, ret;
 
 	if (!f || !(f->f_mode & FMODE_READ))
 		return -EBADF;
 
+	if (idempotent(&idem, file_inode(f))) {
+		wait_for_completion(&idem.complete);
+		return idem.ret;
+	}
+
 	len = kernel_read_file(f, 0, &buf, INT_MAX, NULL, READING_MODULE);
 	if (len < 0) {
 		mod_stat_inc(&failed_kreads);
@@ -3086,7 +3153,9 @@ static int init_module_from_file(struct file *f, const char __user * uargs, int
 		info.len = len;
 	}
 
-	return load_module(&info, uargs, flags);
+	ret = load_module(&info, uargs, flags);
+	idempotent_complete(&idem, ret);
+	return ret;
 }
 
 SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
-- 
2.40.0.rc1.2.gd15644fe02


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-30  1:55                       ` Linus Torvalds
@ 2023-05-30  9:40                         ` Johan Hovold
  2023-06-05 12:25                           ` Johan Hovold
  2023-05-30 16:22                         ` Luis Chamberlain
                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 53+ messages in thread
From: Johan Hovold @ 2023-05-30  9:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luis Chamberlain, Lucas De Marchi, Petr Pavlu, gregkh, rafael,
	song, lucas.de.marchi, christophe.leroy, peterz, rppt, dave,
	willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
> On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
> >
> > I took a closer look at some of the modules that failed to load and
> > noticed a pattern in that they have dependencies that are needed by more
> > than one device.
> 
> Ok, this is a "maybe something like this" RFC series of two patches -
> one trivial one to re-organize things a bit so that we can then do the
> real one which uses a filter based on the inode pointer to return an
> "idempotent return value" for module loads that share the same inode.
> 
> It's entirely untested, and since I'm on the road I'm going to not
> really be able to test it. It compiles for me, and the code looks
> fairly straightforward, but it's probably buggy.
> 
> It's very loosely based on Luis' attempt,  but it
>  (a) is internal to module loading
>  (b) uses a reliable cookie
>  (c) doesn't leave the cookie around randomly for later
>  (d) has seen absolutely no testing
> 
> Put another way: if somebody wants to play with this, please treat it
> as a starting point, not the final thing. You might need to debug
> things, and fix silly mistakes.

With the missing spinlock initialisation fixed:

-static struct spinlock idem_lock;
+static DEFINE_SPINLOCK(idem_lock);

this passes basic smoke testing and allows the X13s to boot.

It does not seem to have any significant impact on boot time, but it
avoids some of the unnecessary load attempts as intended:

Before:

         Mods ever loaded       131
     Mods failed on kread       0
Mods failed on decompress       0
  Mods failed on becoming       24
      Mods failed on load       14
        Total module size       12587008
      Total mod text size       5058560
       Failed kread bytes       0
  Failed decompress bytes       0
    Failed becoming bytes       2437992
        Failed kmod bytes       1858992
 Virtual mem wasted bytes       4296984
         Average mod size       96085
    Average mod text size       38615
  Avg fail becoming bytes       101583
  Average fail load bytes       132786

After:

         Mods ever loaded       131
     Mods failed on kread       0
Mods failed on decompress       0
  Mods failed on becoming       4
      Mods failed on load       0
        Total module size       12587008
      Total mod text size       5058560
       Failed kread bytes       0
  Failed decompress bytes       0
    Failed becoming bytes       109776
        Failed kmod bytes       0
 Virtual mem wasted bytes       109776
         Average mod size       96085
    Average mod text size       38615
  Avg fail becoming bytes       27444

Johan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-29 17:47                     ` Linus Torvalds
@ 2023-05-30 10:01                       ` Johan Hovold
  0 siblings, 0 replies; 53+ messages in thread
From: Johan Hovold @ 2023-05-30 10:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luis Chamberlain, Lucas De Marchi, Petr Pavlu, gregkh, rafael,
	song, lucas.de.marchi, christophe.leroy, peterz, rppt, dave,
	willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Mon, May 29, 2023 at 01:47:28PM -0400, Linus Torvalds wrote:
> On Mon, May 29, 2023 at 8:44 AM Johan Hovold <johan@kernel.org> wrote:

> > > I do wonder what it is that is different in your setup, and maybe you
> > > could also enable the
> > >
> > >         pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);
> >
> > Below is the corresponding output with a working kernel: 174 requests
> > for the 131 modules that end up being loaded (without the revert there
> > is only around 110 modules loaded).

> And yeah, that dmesg output is useless, I didn't think of the fact
> that it only prints out the file descriptor, not the actual path to
> the file. In fact, without that change in place, the module code never
> actually looks at the file and leaves it all to
> kernel_read_file_from_fd().

Yeah, I added a printk with the module name to load_module() to make
some sense of it.
 
> With my change, it woul dhave been trivial to use "%pD" and point it
> at the file pointer instead, and get the dentry name that way, but
> never mind.  I think you're entirely right that it's probably due to a
> shared dependency module, and I just didn't happen to trigger that
> case.

For completeness, here's a corresponding log from when running with your
RFC. Those duplicate requests that now wait for loading to complete
would have failed, and that explains why some modules like
qcom-spmi-adc5 and qcom-spmi-adc-tm5 that both depend on
qcom-vadc-common would in turn fail to load.

Johan

[    0.633127] init_module_from_file - i2c-core.ko
[    0.638320] init_module_from_file - i2c-hid.ko
[    0.640654] init_module_from_file - i2c-hid-of.ko
[    0.642572] init_module_from_file - i2c-qcom-geni.ko
[    0.826911] init_module_from_file - hid-multitouch.ko
[    0.827861] init_module_from_file - nvme-core.ko
[    0.833011] init_module_from_file - nvme.ko
[    0.835558] init_module_from_file - phy-qcom-qmp-pcie.ko
[    0.841050] init_module_from_file - crc8.ko
[    0.841371] init_module_from_file - pcie-qcom.ko
[    3.390182] init_module_from_file - ipv6.ko
[    3.402261] init_module_from_file - x_tables.ko
[    3.406573] init_module_from_file - ip_tables.ko
[    4.180345] init_module_from_file - dm-mod.ko
[    4.186611] init_module_from_file - drm.ko
[    4.793481] init_module_from_file - pwm_bl.ko
[    4.798935] init_module_from_file - soundwire-bus.ko
[    4.802551] init_module_from_file - qmi_helpers.ko
[    4.805664] init_module_from_file - socinfo.ko
[    4.814729] init_module_from_file - pdr_interface.ko
[    4.830809] init_module_from_file - soundcore.ko
[    4.832227] init_module_from_file - qcom-wdt.ko
[    4.832242] init_module_from_file - qcom-rng.ko
[    4.832311] init_module_from_file - icc-osm-l3.ko
[    4.867903] init_module_from_file - pmic_glink.ko
[    4.867904] init_module_from_file - snd.ko
[    4.868045] init_module_from_file - mdt_loader.ko
[    4.890274] init_module_from_file - snd.ko
[    4.893962] init_module_from_file - snd.ko
[    4.894620] init_module_from_file - snd.ko
[    4.896162] init_module_from_file - snd.ko
[    4.896797] init_module_from_file - snd.ko
[    4.896907] init_module_from_file - snd.ko
[    4.898087] init_module_from_file - snd.ko
[    4.907548] init_module_from_file - qcom_sysmon.ko
[    4.907560] init_module_from_file - qcom_sysmon.ko
[    4.918690] init_module_from_file - snd.ko - waited, ret = 0
[    4.918692] init_module_from_file - snd.ko - waited, ret = 0
[    4.918694] init_module_from_file - snd.ko - waited, ret = 0
[    4.918695] init_module_from_file - snd.ko - waited, ret = 0
[    4.918699] init_module_from_file - snd.ko - waited, ret = 0
[    4.918700] init_module_from_file - snd.ko - waited, ret = 0
[    4.920849] init_module_from_file - snd.ko - waited, ret = 0
[    4.937139] init_module_from_file - qrtr.ko
[    4.937163] init_module_from_file - icc-bwmon.ko
[    4.937185] init_module_from_file - icc-bwmon.ko
[    4.938603] init_module_from_file - qcom_sysmon.ko - waited, ret = 0
[    4.939866] init_module_from_file - snd-timer.ko
[    4.939877] init_module_from_file - snd-timer.ko
[    4.939885] init_module_from_file - snd-timer.ko
[    4.939897] init_module_from_file - snd-timer.ko
[    4.939905] init_module_from_file - snd-timer.ko
[    4.939914] init_module_from_file - snd-timer.ko
[    4.939982] init_module_from_file - snd-timer.ko
[    4.940050] init_module_from_file - snd-timer.ko
[    4.943465] init_module_from_file - pinctrl-lpass-lpi.ko
[    4.943493] init_module_from_file - phy-qcom-snps-femto-v2.ko
[    4.943512] init_module_from_file - phy-qcom-snps-femto-v2.ko
[    4.944176] init_module_from_file - qcom_q6v5.ko
[    4.944362] init_module_from_file - qcom_q6v5.ko
[    4.946140] init_module_from_file - qcom_q6v5.ko - waited, ret = 0
[    4.946758] init_module_from_file - snd-timer.ko - waited, ret = 0
[    4.946762] init_module_from_file - snd-timer.ko - waited, ret = 0
[    4.946919] init_module_from_file - snd-timer.ko - waited, ret = 0
[    4.947484] init_module_from_file - snd-timer.ko - waited, ret = 0
[    4.947827] init_module_from_file - snd-timer.ko - waited, ret = 0
[    4.948101] init_module_from_file - snd-timer.ko - waited, ret = 0
[    4.948163] init_module_from_file - snd-timer.ko - waited, ret = 0
[    4.948483] init_module_from_file - phy-qcom-snps-femto-v2.ko - waited, ret = 0
[    4.950639] init_module_from_file - icc-bwmon.ko - waited, ret = 0
[    4.951178] init_module_from_file - snd-pcm.ko
[    4.951279] init_module_from_file - snd-pcm.ko
[    4.951616] init_module_from_file - qcom_glink_smem.ko
[    4.952049] init_module_from_file - snd-pcm.ko
[    4.952122] init_module_from_file - snd-pcm.ko
[    4.952132] init_module_from_file - snd-pcm.ko
[    4.952185] init_module_from_file - snd-pcm.ko
[    4.952231] init_module_from_file - pinctrl-sc8280xp-lpass-lpi.ko
[    4.952238] init_module_from_file - snd-pcm.ko
[    4.952326] init_module_from_file - qcom_glink_smem.ko
[    4.957870] init_module_from_file - snd-pcm.ko
[    4.971513] init_module_from_file - typec.ko
[    4.971529] init_module_from_file - typec.ko
[    4.971544] init_module_from_file - phy-qcom-edp.ko
[    4.971619] init_module_from_file - typec.ko
[    4.971806] init_module_from_file - snd-pcm.ko
[    4.971961] init_module_from_file - qcom_common.ko
[    4.972583] init_module_from_file - qcom_common.ko
[    4.972676] init_module_from_file - rfkill.ko
[    4.972898] init_module_from_file - qcom_stats.ko
[    4.973031] init_module_from_file - typec.ko
[    4.973917] init_module_from_file - snd-pcm.ko - waited, ret = 0
[    4.974024] init_module_from_file - snd-pcm.ko - waited, ret = 0
[    4.974090] init_module_from_file - qcom_common.ko - waited, ret = 0
[    4.974134] init_module_from_file - snd-pcm.ko - waited, ret = 0
[    4.974205] init_module_from_file - snd-pcm.ko - waited, ret = 0
[    4.974265] init_module_from_file - snd-pcm.ko - waited, ret = 0
[    4.974307] init_module_from_file - snd-pcm.ko - waited, ret = 0
[    4.974385] init_module_from_file - snd-pcm.ko - waited, ret = 0
[    4.974436] init_module_from_file - snd-pcm.ko - waited, ret = 0
[    4.979168] init_module_from_file - qcom_pil_info.ko
[    4.979352] init_module_from_file - snd-compress.ko
[    4.979399] init_module_from_file - qcom_pil_info.ko
[    4.979668] init_module_from_file - snd-compress.ko
[    4.979912] init_module_from_file - snd-compress.ko
[    4.980022] init_module_from_file - rtc-pm8xxx.ko
[    4.980107] init_module_from_file - snd-compress.ko
[    4.980216] init_module_from_file - sysimgblt.ko
[    4.980490] init_module_from_file - nvmem_qcom-spmi-sdam.ko
[    4.980593] init_module_from_file - snd-compress.ko
[    4.981109] init_module_from_file - snd-compress.ko
[    4.981136] init_module_from_file - qcom_pil_info.ko - waited, ret = 0
[    4.981285] init_module_from_file - snd-compress.ko
[    4.981378] init_module_from_file - typec.ko - waited, ret = 0
[    4.981382] init_module_from_file - typec.ko - waited, ret = 0
[    4.981402] init_module_from_file - snd-compress.ko - waited, ret = 0
[    4.981404] init_module_from_file - snd-compress.ko - waited, ret = 0
[    4.981412] init_module_from_file - typec.ko - waited, ret = 0
[    4.981453] init_module_from_file - snd-compress.ko - waited, ret = 0
[    4.981482] init_module_from_file - snd-compress.ko - waited, ret = 0
[    4.981516] init_module_from_file - snd-compress.ko - waited, ret = 0
[    4.981620] init_module_from_file - snd-compress.ko
[    4.982970] init_module_from_file - snd-compress.ko
[    4.985063] init_module_from_file - qcom_q6v5_pas.ko
[    4.985132] init_module_from_file - qcom_q6v5_pas.ko
[    4.985200] init_module_from_file - snd-soc-core.ko
[    4.985222] init_module_from_file - qcom-vadc-common.ko
[    4.985254] init_module_from_file - sysfillrect.ko
[    4.985654] init_module_from_file - snd-soc-core.ko
[    4.985697] init_module_from_file - sysfillrect.ko
[    4.985862] init_module_from_file - snd-soc-core.ko
[    4.985883] init_module_from_file - ecc.ko
[    4.985950] init_module_from_file - reboot-mode.ko
[    4.985982] init_module_from_file - snd-soc-core.ko
[    4.986159] init_module_from_file - snd-soc-core.ko
[    4.986304] init_module_from_file - snd-soc-core.ko
[    4.986829] init_module_from_file - qcom-vadc-common.ko
[    4.987106] init_module_from_file - snd-soc-core.ko
[    4.987546] init_module_from_file - qcom-vadc-common.ko - waited, ret = 0
[    4.987889] init_module_from_file - snd-soc-core.ko
[    4.988198] init_module_from_file - snd-soc-core.ko
[    4.989151] init_module_from_file - industrialio.ko
[    4.989220] init_module_from_file - industrialio.ko
[    4.989588] init_module_from_file - gpio-sbu-mux.ko
[    4.989651] init_module_from_file - gpio-sbu-mux.ko
[    4.990037] init_module_from_file - qcom-pon.ko
[    4.991354] init_module_from_file - mhi.ko
[    4.992889] init_module_from_file - industrialio.ko - waited, ret = 0
[    4.993473] init_module_from_file - cfg80211.ko
[    4.994053] init_module_from_file - snd-soc-core.ko - waited, ret = 0
[    4.994073] init_module_from_file - snd-soc-core.ko - waited, ret = 0
[    4.994103] init_module_from_file - snd-soc-core.ko - waited, ret = 0
[    4.994107] init_module_from_file - snd-soc-core.ko - waited, ret = 0
[    4.994129] init_module_from_file - snd-soc-core.ko - waited, ret = 0
[    4.994147] init_module_from_file - snd-soc-core.ko - waited, ret = 0
[    4.994166] init_module_from_file - sysfillrect.ko
[    4.994176] init_module_from_file - snd-soc-core.ko - waited, ret = 0
[    4.994206] init_module_from_file - snd-soc-core.ko - waited, ret = 0
[    4.994319] init_module_from_file - qcom-spmi-adc5.ko
[    4.994621] init_module_from_file - ecdh_generic.ko
[    4.994975] init_module_from_file - snd-soc-lpass-macro-common.ko
[    4.994999] init_module_from_file - qcom-spmi-adc-tm5.ko
[    4.996797] init_module_from_file - gpio-sbu-mux.ko - waited, ret = 0
[    4.997487] init_module_from_file - snd-soc-qcom-sdw.ko
[    4.997511] init_module_from_file - snd-soc-lpass-macro-common.ko
[    4.997546] init_module_from_file - snd-soc-wcd-mbhc.ko
[    4.997573] init_module_from_file - sysfillrect.ko - waited, ret = 0
[    4.997754] init_module_from_file - syscopyarea.ko
[    4.997762] init_module_from_file - syscopyarea.ko
[    4.999053] init_module_from_file - sysfillrect.ko - waited, ret = 0
[    4.999247] init_module_from_file - snd-soc-lpass-macro-common.ko
[    4.999482] init_module_from_file - snd-soc-lpass-wsa-macro.ko
[    4.999587] init_module_from_file - soundwire-qcom.ko
[    5.000338] init_module_from_file - soundwire-qcom.ko
[    5.000745] init_module_from_file - soundwire-qcom.ko
[    5.001008] init_module_from_file - syscopyarea.ko
[    5.015734] init_module_from_file - bluetooth.ko
[    5.016020] init_module_from_file - regmap-sdw.ko
[    5.016434] init_module_from_file - snd-soc-qcom-common.ko
[    5.017416] init_module_from_file - syscopyarea.ko - waited, ret = 0
[    5.018912] init_module_from_file - snd-soc-lpass-macro-common.ko - waited, ret = 0
[    5.018952] init_module_from_file - syscopyarea.ko - waited, ret = 0
[    5.019196] init_module_from_file - snd-soc-lpass-macro-common.ko - waited, ret = 0
[    5.020220] init_module_from_file - drm_kms_helper.ko
[    5.020230] init_module_from_file - snd-soc-lpass-rx-macro.ko
[    5.023262] init_module_from_file - drm_kms_helper.ko
[    5.023366] init_module_from_file - snd-soc-wcd938x-sdw.ko
[    5.023705] init_module_from_file - snd-soc-lpass-va-macro.ko
[    5.024133] init_module_from_file - drm_kms_helper.ko
[    5.024223] init_module_from_file - snd-soc-sc8280xp.ko
[    5.024814] init_module_from_file - snd-soc-lpass-tx-macro.ko
[    5.027886] init_module_from_file - drm_kms_helper.ko - waited, ret = 0
[    5.027937] init_module_from_file - drm_kms_helper.ko - waited, ret = 0
[    5.028550] init_module_from_file - phy-qcom-qmp-combo.ko
[    5.028828] init_module_from_file - snd-soc-wcd938x.ko
[    5.029106] init_module_from_file - drm_dp_aux_bus.ko
[    5.030265] init_module_from_file - phy-qcom-qmp-combo.ko
[    5.037773] init_module_from_file - qcom_q6v5_pas.ko - waited, ret = 0
[    5.038034] init_module_from_file - soundwire-qcom.ko - waited, ret = 0
[    5.038548] init_module_from_file - libarc4.ko
[    5.038551] init_module_from_file - soundwire-qcom.ko - waited, ret = 0
[    5.042176] init_module_from_file - btqca.ko
[    5.044433] init_module_from_file - mac80211.ko
[    5.044480] init_module_from_file - drm_display_helper.ko
[    5.046672] init_module_from_file - qcom-spmi-temp-alarm.ko
[    5.048163] init_module_from_file - hci_uart.ko
[    5.056843] init_module_from_file - gpu-sched.ko
[    5.073356] init_module_from_file - led-class-multicolor.ko
[    5.087906] init_module_from_file - qcom-spmi-temp-alarm.ko
[    5.106010] init_module_from_file - ath11k.ko
[    5.106084] init_module_from_file - msm.ko
[    5.125345] init_module_from_file - leds-qcom-lpg.ko
[    5.150715] init_module_from_file - ath11k_pci.ko
[    5.157163] init_module_from_file - qcom-spmi-temp-alarm.ko - waited, ret = 0
[    5.157894] init_module_from_file - qcom_battmgr.ko
[    5.158102] init_module_from_file - phy-qcom-qmp-combo.ko - waited, ret = 0
[    5.159204] init_module_from_file - pmic_glink_altmode.ko
[    5.258217] init_module_from_file - qrtr-smd.ko
[    5.258238] init_module_from_file - qrtr-smd.ko
[    5.258294] init_module_from_file - rpmsg_char.ko
[    5.259157] init_module_from_file - apr.ko
[    5.259160] init_module_from_file - rpmsg_char.ko
[    5.276634] init_module_from_file - qrtr-smd.ko - waited, ret = 0
[    5.277698] init_module_from_file - fastrpc.ko
[    5.278293] init_module_from_file - rpmsg_char.ko - waited, ret = 0
[    5.278926] init_module_from_file - rpmsg_ctrl.ko
[    5.279642] init_module_from_file - rpmsg_ctrl.ko
[    5.281399] init_module_from_file - rpmsg_ctrl.ko - waited, ret = 0
[    5.330747] init_module_from_file - snd-soc-hdmi-codec.ko
[    5.330826] init_module_from_file - snd-soc-hdmi-codec.ko
[    5.333130] init_module_from_file - snd-soc-hdmi-codec.ko
[    5.333201] init_module_from_file - snd-soc-hdmi-codec.ko - waited, ret = 0
[    5.333214] init_module_from_file - snd-soc-hdmi-codec.ko - waited, ret = 0
[    5.396803] init_module_from_file - panel-edp.ko
[    6.101509] init_module_from_file - qrtr-mhi.ko
[    6.196359] init_module_from_file - iptable_filter.ko
[    6.203333] init_module_from_file - nf_defrag_ipv4.ko
[    6.203903] init_module_from_file - nf_defrag_ipv6.ko
[    6.204628] init_module_from_file - libcrc32c.ko
[    6.204727] init_module_from_file - snd-q6apm.ko
[    6.204985] init_module_from_file - snd-q6apm.ko
[    6.205007] init_module_from_file - nf_conntrack.ko
[    6.207137] init_module_from_file - xt_conntrack.ko
[    6.211214] init_module_from_file - xt_tcpudp.ko
[    6.215535] init_module_from_file - nf_reject_ipv4.ko
[    6.216071] init_module_from_file - ipt_REJECT.ko
[    6.222201] init_module_from_file - nf_log_syslog.ko
[    6.223050] init_module_from_file - xt_LOG.ko
[    6.241544] init_module_from_file - ip6_tables.ko
[    6.286511] init_module_from_file - af_alg.ko
[    6.293272] init_module_from_file - algif_hash.ko
[    6.304773] init_module_from_file - md5.ko
[    6.333540] init_module_from_file - mii.ko
[    6.333550] init_module_from_file - mii.ko
[    6.334785] init_module_from_file - mii.ko - waited, ret = 0
[    6.335170] init_module_from_file - r8152.ko
[    6.335282] init_module_from_file - r8152.ko
[    6.370285] init_module_from_file - algif_skcipher.ko
[    6.384424] init_module_from_file - ecb.ko
[    6.392643] init_module_from_file - libdes.ko
[    6.397046] init_module_from_file - des_generic.ko
[    6.416039] init_module_from_file - cbc.ko
[    6.643702] init_module_from_file - r8152.ko - waited, ret = 0
[    7.940666] init_module_from_file - michael_mic.ko
[   11.233648] init_module_from_file - snd-q6apm.ko - waited, ret = 0
[   11.234264] init_module_from_file - q6prm.ko
[   11.303802] init_module_from_file - q6apm-dai.ko
[   11.304498] init_module_from_file - snd-q6dsp-common.ko
[   11.305476] init_module_from_file - q6apm-lpass-dais.ko
[   11.309858] init_module_from_file - q6prm-clocks.ko
[   11.446919] init_module_from_file - snd-soc-wsa883x.ko
[   11.446953] init_module_from_file - snd-soc-wsa883x.ko
[   11.456469] init_module_from_file - snd-soc-wsa883x.ko - waited, ret = 0

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-30  1:55                       ` Linus Torvalds
  2023-05-30  9:40                         ` Johan Hovold
@ 2023-05-30 16:22                         ` Luis Chamberlain
  2023-05-30 17:16                           ` Lucas De Marchi
                                             ` (2 more replies)
  2023-05-30 22:45                         ` Dan Williams
  2023-06-04 14:26                         ` Rudi Heitbaum
  3 siblings, 3 replies; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-30 16:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johan Hovold, Lucas De Marchi, Petr Pavlu, gregkh, rafael, song,
	lucas.de.marchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
> On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
> >
> > I took a closer look at some of the modules that failed to load and
> > noticed a pattern in that they have dependencies that are needed by more
> > than one device.
> 
> Ok, this is a "maybe something like this" RFC series of two patches -
> one trivial one to re-organize things a bit so that we can then do the
> real one which uses a filter based on the inode pointer to return an
> "idempotent return value" for module loads that share the same inode.
> 
> It's entirely untested, and since I'm on the road I'm going to not
> really be able to test it. It compiles for me, and the code looks
> fairly straightforward, but it's probably buggy.
> 
> It's very loosely based on Luis' attempt,  but it
>  (a) is internal to module loading
>  (b) uses a reliable cookie
>  (c) doesn't leave the cookie around randomly for later
>  (d) has seen absolutely no testing
> 
> Put another way: if somebody wants to play with this, please treat it
> as a starting point, not the final thing. You might need to debug
> things, and fix silly mistakes.
> 
> The idea is to just have a simple hash list of currently executing
> module loads, protected by a trivial spinlock. Every module loader
> adds itself to the right hash list, and if they were the *first* one
> (ie no other pending module loads for that inode), will actually do
> the module load.
> 
> Everybody who *isn't* the first one will just wait for completion and
> return the same error code that the first one returned.

That's also a hell much more snazzier MODULE_DEBUG_AUTOLOAD_DUPS if we
ever wanted to do something similar there if we wanted to also
join request_module() calls, instead of it hiding under debug.

> This is technically bogus. The first one might fail due to arguments.

For boot it's fine, as I can't think of boot wanting to support trying
to load a module with different arguments but who knows. But I can't
see it sensible to issue concurrent multiple requests for modules
with different arguments without waiting in userspace for the first
to fail.

Even post-boot, doing that sounds rather insane, but it would certainly
be a compromise and should probably be clearly documented. I think just
a comment acknolwedging that corner case seems sensible.

Because we won't be able to get the arguments until we process the
module, so it would be too late for this optimization on kread. So it is
why I had also stuck to the original feature being in kread, as then it
provides a uniq kread call and the caller is aware of it. But indeed I
had not considered the effects of arguments.

Lucas, any thoughts from modules kmod userspace perspective into
supporting anyone likely issuing concurrent modules requests with
differing arguments?

> So the cookie shouldn't be just the inode, it should be the inode and
> a hash of the arguments or something like that.

Personally I think it's a fine optimization without the arguments.

> But it is what it is,
> and apart from possible show-stopper bugs this is no worse than the
> failed "exclusive write deny" attempt. IOW - maybe worth trying?

The only thing I can think of is allowing threads other than the
first one to complete before the one that actually loaded the
module. I thought about this race for module auto-loading, see
the comment in kmod_dup_request_announce(), so that just
further delays the completion to other thread with a stupid
queue_work(). That seems more important for module auto-loading
duplicates than for boot finit_module() duplicates. But not sure
if odering matters in the end due to a preemtible kernel and maybe
that concern is hysteria.

> And if *that* didn't sell people on this patch series, I don't know
> what will. I should be in marketing! Two drink minimums, here I come!

Sold:

on 255 vcpus 0 duplicates found with this setup:

root@kmod ~ # cat /sys/kernel/debug/modules/stats
         Mods ever loaded       66
     Mods failed on kread       0
Mods failed on decompress       0
  Mods failed on becoming       0
      Mods failed on load       0
        Total module size       11268096
      Total mod text size       4149248
       Failed kread bytes       0
  Failed decompress bytes       0
    Failed becoming bytes       0
        Failed kmod bytes       0
 Virtual mem wasted bytes       0
         Average mod size       170729
    Average mod text size       62868

So:

Tested-by: Luis Chamberlain <mcgrof@kernel.org>

In terms of bootup timing:

Before:
Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s        
graphical.target reached after 44.178s in userspace.                             
                                                                                 
After:                                                                           
Startup finished in 23.995s (kernel) + 40.350s (userspace) = 1min 4.345s         
graphical.target reached after 40.226s in userspace. 

So other than the brain farts above, I think this is pretty sexy.

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-30 16:22                         ` Luis Chamberlain
@ 2023-05-30 17:16                           ` Lucas De Marchi
  2023-05-30 19:41                             ` Luis Chamberlain
  2023-05-31  0:31                           ` Luis Chamberlain
  2023-05-31  7:51                           ` David Hildenbrand
  2 siblings, 1 reply; 53+ messages in thread
From: Lucas De Marchi @ 2023-05-30 17:16 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Linus Torvalds, Johan Hovold, Petr Pavlu, gregkh, rafael, song,
	lucas.de.marchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Tue, May 30, 2023 at 09:22:14AM -0700, Luis Chamberlain wrote:
>On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
>> On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
>> >
>> > I took a closer look at some of the modules that failed to load and
>> > noticed a pattern in that they have dependencies that are needed by more
>> > than one device.
>>
>> Ok, this is a "maybe something like this" RFC series of two patches -
>> one trivial one to re-organize things a bit so that we can then do the
>> real one which uses a filter based on the inode pointer to return an
>> "idempotent return value" for module loads that share the same inode.
>>
>> It's entirely untested, and since I'm on the road I'm going to not
>> really be able to test it. It compiles for me, and the code looks
>> fairly straightforward, but it's probably buggy.
>>
>> It's very loosely based on Luis' attempt,  but it
>>  (a) is internal to module loading
>>  (b) uses a reliable cookie
>>  (c) doesn't leave the cookie around randomly for later
>>  (d) has seen absolutely no testing
>>
>> Put another way: if somebody wants to play with this, please treat it
>> as a starting point, not the final thing. You might need to debug
>> things, and fix silly mistakes.
>>
>> The idea is to just have a simple hash list of currently executing
>> module loads, protected by a trivial spinlock. Every module loader
>> adds itself to the right hash list, and if they were the *first* one
>> (ie no other pending module loads for that inode), will actually do
>> the module load.
>>
>> Everybody who *isn't* the first one will just wait for completion and
>> return the same error code that the first one returned.
>
>That's also a hell much more snazzier MODULE_DEBUG_AUTOLOAD_DUPS if we
>ever wanted to do something similar there if we wanted to also
>join request_module() calls, instead of it hiding under debug.
>
>> This is technically bogus. The first one might fail due to arguments.
>
>For boot it's fine, as I can't think of boot wanting to support trying
>to load a module with different arguments but who knows. But I can't
>see it sensible to issue concurrent multiple requests for modules
>with different arguments without waiting in userspace for the first
>to fail.
>
>Even post-boot, doing that sounds rather insane, but it would certainly
>be a compromise and should probably be clearly documented. I think just
>a comment acknolwedging that corner case seems sensible.
>
>Because we won't be able to get the arguments until we process the
>module, so it would be too late for this optimization on kread. So it is
>why I had also stuck to the original feature being in kread, as then it
>provides a uniq kread call and the caller is aware of it. But indeed I
>had not considered the effects of arguments.
>
>Lucas, any thoughts from modules kmod userspace perspective into
>supporting anyone likely issuing concurrent modules requests with
>differing arguments?

Changing module params like that without first explicitly removing the
module was never supported by kmod or module-init-tools (I'm not digging
the history before 2.6 kernel)

During boot, note there is already a shortcut
if we have the sysfs node already in the "live" state or if the module is
built-in. In that case we will return success or -EEXIST (if the
KMOD_PROBE_IGNORE_LOADED flag was passed). To be 100% equivalent when
covering the window between the first finit_module() and the sysfs node
being created, then we could add a similar flag to finit_module() and
return -EEXIST. Note however that likbmod will already clear the error
when no flag is passed, which is the normal case during boot.

The only scenario I can think of during boot in which the module params
could change would be when a buggy initrd is created, i.e.
/etc/modprobed.d/*.conf is in the rootfs but absent from initrd.

So returning the same error code seems good to me.

thanks
Lucas De Marchi

>
>> So the cookie shouldn't be just the inode, it should be the inode and
>> a hash of the arguments or something like that.
>
>Personally I think it's a fine optimization without the arguments.
>
>> But it is what it is,
>> and apart from possible show-stopper bugs this is no worse than the
>> failed "exclusive write deny" attempt. IOW - maybe worth trying?
>
>The only thing I can think of is allowing threads other than the
>first one to complete before the one that actually loaded the
>module. I thought about this race for module auto-loading, see
>the comment in kmod_dup_request_announce(), so that just
>further delays the completion to other thread with a stupid
>queue_work(). That seems more important for module auto-loading
>duplicates than for boot finit_module() duplicates. But not sure
>if odering matters in the end due to a preemtible kernel and maybe
>that concern is hysteria.
>
>> And if *that* didn't sell people on this patch series, I don't know
>> what will. I should be in marketing! Two drink minimums, here I come!
>
>Sold:
>
>on 255 vcpus 0 duplicates found with this setup:
>
>root@kmod ~ # cat /sys/kernel/debug/modules/stats
>         Mods ever loaded       66
>     Mods failed on kread       0
>Mods failed on decompress       0
>  Mods failed on becoming       0
>      Mods failed on load       0
>        Total module size       11268096
>      Total mod text size       4149248
>       Failed kread bytes       0
>  Failed decompress bytes       0
>    Failed becoming bytes       0
>        Failed kmod bytes       0
> Virtual mem wasted bytes       0
>         Average mod size       170729
>    Average mod text size       62868
>
>So:
>
>Tested-by: Luis Chamberlain <mcgrof@kernel.org>
>
>In terms of bootup timing:
>
>Before:
>Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s
>graphical.target reached after 44.178s in userspace.
>
>After:
>Startup finished in 23.995s (kernel) + 40.350s (userspace) = 1min 4.345s
>graphical.target reached after 40.226s in userspace.
>
>So other than the brain farts above, I think this is pretty sexy.
>
>  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-30 17:16                           ` Lucas De Marchi
@ 2023-05-30 19:41                             ` Luis Chamberlain
  2023-05-30 22:17                               ` Linus Torvalds
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-30 19:41 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: Linus Torvalds, Johan Hovold, Petr Pavlu, gregkh, rafael, song,
	lucas.de.marchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Tue, May 30, 2023 at 10:16:22AM -0700, Lucas De Marchi wrote:
> On Tue, May 30, 2023 at 09:22:14AM -0700, Luis Chamberlain wrote:
> > Lucas, any thoughts from modules kmod userspace perspective into
> > supporting anyone likely issuing concurrent modules requests with
> > differing arguments?
> 
> Changing module params like that without first explicitly removing the
> module was never supported by kmod or module-init-tools (I'm not digging
> the history before 2.6 kernel)

That's good enough.

> During boot, note there is already a shortcut
> if we have the sysfs node already in the "live" state or if the module is
> built-in. In that case we will return success or -EEXIST (if the
> KMOD_PROBE_IGNORE_LOADED flag was passed).

Linus' code would make duplicates wait and share the same return value,
ie, no new odd error code is returned. Or are you suggesting -EEXIST
should be returned to duplicates if the module succeeded to load
instead of 0 ?

> The only scenario I can think of during boot in which the module params
> could change would be when a buggy initrd is created, i.e.
> /etc/modprobed.d/*.conf is in the rootfs but absent from initrd.

This helps thanks.

> So returning the same error code seems good to me.

OK thanks! So just to confirm, it seems fine to return the same error
code if duplicates wait, or do you prefer for some reason for there to
be an exception and return -EEXIST if the module did succeed in the
duplicate case?

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-30 19:41                             ` Luis Chamberlain
@ 2023-05-30 22:17                               ` Linus Torvalds
  2023-05-31  5:30                                 ` Lucas De Marchi
  0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2023-05-30 22:17 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Lucas De Marchi, Johan Hovold, Petr Pavlu, gregkh, rafael, song,
	lucas.de.marchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Tue, May 30, 2023 at 3:41 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> OK thanks! So just to confirm, it seems fine to return the same error
> code if duplicates wait, or do you prefer for some reason for there to
> be an exception and return -EEXIST if the module did succeed in the
> duplicate case?

I think either should be fine, since either was possible before.

By definition, these are module loads being done in parallel, and so
any of them "could" have been the first, and returned success before.

And by extension, any of them could have been not first, and returned
-EEXIST if somebody else loaded the same module first.

So that "somebody else did a load" code:

        if (idempotent(&idem, file_inode(f))) {
                wait_for_completion(&idem.complete);
                return idem.ret;
        }

could certainly have made the return value be something like

        return idem.ret ? : -EEXIST;

instead of that "return idem.ret".

But it does seem simpler - and more in line with the conceptual
"loading the same module is an idempotent operation" of the patch -
to just always return the success value to all of them.

After all, they all did in some sense succeed to get that module
loaded, even if it was a communal effort, and some threads did more
than others...

As mentioned, I don't think it can matter either way, since any of the
callers might as well have been the successful one, and they would
basically have to act the same way regardless (ie "somebody else
succeeded" and "you succeeded" are basically equivalent return
values). If the module was a prerequisite for another module being
loaded, either -EEXIST or 0 _is_ a success case.

             Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-30  1:55                       ` Linus Torvalds
  2023-05-30  9:40                         ` Johan Hovold
  2023-05-30 16:22                         ` Luis Chamberlain
@ 2023-05-30 22:45                         ` Dan Williams
  2023-06-04 14:26                         ` Rudi Heitbaum
  3 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2023-05-30 22:45 UTC (permalink / raw)
  To: Linus Torvalds, Johan Hovold
  Cc: Luis Chamberlain, Lucas De Marchi, Petr Pavlu, gregkh, rafael,
	song, lucas.de.marchi, christophe.leroy, peterz, rppt, dave,
	willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

Linus Torvalds wrote:
> On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
> >
> > I took a closer look at some of the modules that failed to load and
> > noticed a pattern in that they have dependencies that are needed by more
> > than one device.
> 
> Ok, this is a "maybe something like this" RFC series of two patches -
> one trivial one to re-organize things a bit so that we can then do the
> real one which uses a filter based on the inode pointer to return an
> "idempotent return value" for module loads that share the same inode.
> 
> It's entirely untested, and since I'm on the road I'm going to not
> really be able to test it. It compiles for me, and the code looks
> fairly straightforward, but it's probably buggy.

At least for me, these 2 patches, plus Johan's spinlock fixup, are:

Tested-by: Dan Williams <dan.j.williams@intel.com>

...on the test that was previously broken on plain v6.4-rc4. I have
these cherry picked on top of v6.4-rc4:

 Revert "module: error out early on concurrent load of the same module file"
 module: split up 'finit_module()' into init_module_from_file() helper
 modules: catch concurrent module loads, take two

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-30 16:22                         ` Luis Chamberlain
  2023-05-30 17:16                           ` Lucas De Marchi
@ 2023-05-31  0:31                           ` Luis Chamberlain
  2023-05-31  7:51                           ` David Hildenbrand
  2 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-31  0:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johan Hovold, Lucas De Marchi, Petr Pavlu, gregkh, rafael, song,
	lucas.de.marchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Tue, May 30, 2023 at 09:22:14AM -0700, Luis Chamberlain wrote:
> The only thing I can think of is allowing threads other than the
> first one to complete before the one that actually loaded the
> module. I thought about this race for module auto-loading, see
> the comment in kmod_dup_request_announce(), so that just
> further delays the completion to other thread with a stupid
> queue_work(). That seems more important for module auto-loading
> duplicates than for boot finit_module() duplicates. But not sure
> if odering matters in the end due to a preemtible kernel and maybe
> that concern is hysteria.

I think I'm OK to accept this ordering concern as hysteria for now.

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-30 22:17                               ` Linus Torvalds
@ 2023-05-31  5:30                                 ` Lucas De Marchi
  0 siblings, 0 replies; 53+ messages in thread
From: Lucas De Marchi @ 2023-05-31  5:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luis Chamberlain, Johan Hovold, Petr Pavlu, gregkh, rafael, song,
	lucas.de.marchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Tue, May 30, 2023 at 06:17:11PM -0400, Linus Torvalds wrote:
>On Tue, May 30, 2023 at 3:41 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>>
>> OK thanks! So just to confirm, it seems fine to return the same error
>> code if duplicates wait, or do you prefer for some reason for there to
>> be an exception and return -EEXIST if the module did succeed in the
>> duplicate case?
>
>I think either should be fine, since either was possible before.
>
>By definition, these are module loads being done in parallel, and so
>any of them "could" have been the first, and returned success before.
>
>And by extension, any of them could have been not first, and returned
>-EEXIST if somebody else loaded the same module first.
>
>So that "somebody else did a load" code:
>
>        if (idempotent(&idem, file_inode(f))) {
>                wait_for_completion(&idem.complete);
>                return idem.ret;
>        }
>
>could certainly have made the return value be something like
>
>        return idem.ret ? : -EEXIST;

yes, this is what I had in mind.

>
>instead of that "return idem.ret".
>
>But it does seem simpler - and more in line with the conceptual
>"loading the same module is an idempotent operation" of the patch -
>to just always return the success value to all of them.
>
>After all, they all did in some sense succeed to get that module
>loaded, even if it was a communal effort, and some threads did more
>than others...
>
>As mentioned, I don't think it can matter either way, since any of the
>callers might as well have been the successful one, and they would
>basically have to act the same way regardless (ie "somebody else
>succeeded" and "you succeeded" are basically equivalent return

agreed, it will just be a slightly different behavior if finit_module()
is called twice and the first call is already in the process of
initializing the module, i.e. complete_formation() was already called,
putting the module in the MODULE_STATE_COMING state, as per
kernel/module/main.c:add_unformed_module():

	/*
	 * We are here only when the same module was being loaded. Do
	 * not try to load it again right now. It prevents long delays
	 * caused by serialized module load failures. It might happen
	 * when more devices of the same type trigger load of
	 * a particular module.
	 */
	if (old && old->state == MODULE_STATE_LIVE)
		err = -EEXIST;
	else
		err = -EBUSY;
	goto out;

in userspace we already deal with that in a special way and should be
compatible with returning 0 for all practical purposes.

thanks
Lucas De Marchi

>values). If the module was a prerequisite for another module being
>loaded, either -EEXIST or 0 _is_ a success case.
>
>             Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-30 16:22                         ` Luis Chamberlain
  2023-05-30 17:16                           ` Lucas De Marchi
  2023-05-31  0:31                           ` Luis Chamberlain
@ 2023-05-31  7:51                           ` David Hildenbrand
  2023-05-31 16:57                             ` Luis Chamberlain
  2 siblings, 1 reply; 53+ messages in thread
From: David Hildenbrand @ 2023-05-31  7:51 UTC (permalink / raw)
  To: Luis Chamberlain, Linus Torvalds
  Cc: Johan Hovold, Lucas De Marchi, Petr Pavlu, gregkh, rafael, song,
	lucas.de.marchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, tglx, hch,
	patches, linux-modules, linux-mm, linux-kernel, pmladek, prarit,
	lennart

On 30.05.23 18:22, Luis Chamberlain wrote:
> On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
>> On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
>>>
>>> I took a closer look at some of the modules that failed to load and
>>> noticed a pattern in that they have dependencies that are needed by more
>>> than one device.
>>
>> Ok, this is a "maybe something like this" RFC series of two patches -
>> one trivial one to re-organize things a bit so that we can then do the
>> real one which uses a filter based on the inode pointer to return an
>> "idempotent return value" for module loads that share the same inode.
>>
>> It's entirely untested, and since I'm on the road I'm going to not
>> really be able to test it. It compiles for me, and the code looks
>> fairly straightforward, but it's probably buggy.
>>
>> It's very loosely based on Luis' attempt,  but it
>>   (a) is internal to module loading
>>   (b) uses a reliable cookie
>>   (c) doesn't leave the cookie around randomly for later
>>   (d) has seen absolutely no testing
>>
>> Put another way: if somebody wants to play with this, please treat it
>> as a starting point, not the final thing. You might need to debug
>> things, and fix silly mistakes.
>>
>> The idea is to just have a simple hash list of currently executing
>> module loads, protected by a trivial spinlock. Every module loader
>> adds itself to the right hash list, and if they were the *first* one
>> (ie no other pending module loads for that inode), will actually do
>> the module load.
>>
>> Everybody who *isn't* the first one will just wait for completion and
>> return the same error code that the first one returned.
> 
> That's also a hell much more snazzier MODULE_DEBUG_AUTOLOAD_DUPS if we
> ever wanted to do something similar there if we wanted to also
> join request_module() calls, instead of it hiding under debug.
> 
>> This is technically bogus. The first one might fail due to arguments.
> 
> For boot it's fine, as I can't think of boot wanting to support trying
> to load a module with different arguments but who knows. But I can't
> see it sensible to issue concurrent multiple requests for modules
> with different arguments without waiting in userspace for the first
> to fail.
> 
> Even post-boot, doing that sounds rather insane, but it would certainly
> be a compromise and should probably be clearly documented. I think just
> a comment acknolwedging that corner case seems sensible.
> 
> Because we won't be able to get the arguments until we process the
> module, so it would be too late for this optimization on kread. So it is
> why I had also stuck to the original feature being in kread, as then it
> provides a uniq kread call and the caller is aware of it. But indeed I
> had not considered the effects of arguments.
> 
> Lucas, any thoughts from modules kmod userspace perspective into
> supporting anyone likely issuing concurrent modules requests with
> differing arguments?
> 
>> So the cookie shouldn't be just the inode, it should be the inode and
>> a hash of the arguments or something like that.
> 
> Personally I think it's a fine optimization without the arguments.
> 
>> But it is what it is,
>> and apart from possible show-stopper bugs this is no worse than the
>> failed "exclusive write deny" attempt. IOW - maybe worth trying?
> 
> The only thing I can think of is allowing threads other than the
> first one to complete before the one that actually loaded the
> module. I thought about this race for module auto-loading, see
> the comment in kmod_dup_request_announce(), so that just
> further delays the completion to other thread with a stupid
> queue_work(). That seems more important for module auto-loading
> duplicates than for boot finit_module() duplicates. But not sure
> if odering matters in the end due to a preemtible kernel and maybe
> that concern is hysteria.
> 
>> And if *that* didn't sell people on this patch series, I don't know
>> what will. I should be in marketing! Two drink minimums, here I come!
> 
> Sold:
> 
> on 255 vcpus 0 duplicates found with this setup:
> 
> root@kmod ~ # cat /sys/kernel/debug/modules/stats
>           Mods ever loaded       66
>       Mods failed on kread       0
> Mods failed on decompress       0
>    Mods failed on becoming       0
>        Mods failed on load       0
>          Total module size       11268096
>        Total mod text size       4149248
>         Failed kread bytes       0
>    Failed decompress bytes       0
>      Failed becoming bytes       0
>          Failed kmod bytes       0
>   Virtual mem wasted bytes       0
>           Average mod size       170729
>      Average mod text size       62868
> 
> So:
> 
> Tested-by: Luis Chamberlain <mcgrof@kernel.org>
> 
> In terms of bootup timing:
> 
> Before:
> Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s
> graphical.target reached after 44.178s in userspace.
>                                                                                   
> After:
> Startup finished in 23.995s (kernel) + 40.350s (userspace) = 1min 4.345s
> graphical.target reached after 40.226s in userspace.

I'll try grabbing the system where we saw the KASAN-related issues [1] 
and give it a churn with and without the two patches. Might take a bit 
(~1 day), unfortunately.

[1] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-31  7:51                           ` David Hildenbrand
@ 2023-05-31 16:57                             ` Luis Chamberlain
  2023-06-02 15:19                               ` David Hildenbrand
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2023-05-31 16:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, Johan Hovold, Lucas De Marchi, Petr Pavlu,
	gregkh, rafael, song, lucas.de.marchi, christophe.leroy, peterz,
	rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Wed, May 31, 2023 at 09:51:41AM +0200, David Hildenbrand wrote:
> On 30.05.23 18:22, Luis Chamberlain wrote:
> > On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
> > > On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
> > > > 
> > > > I took a closer look at some of the modules that failed to load and
> > > > noticed a pattern in that they have dependencies that are needed by more
> > > > than one device.
> > > 
> > > Ok, this is a "maybe something like this" RFC series of two patches -
> > > one trivial one to re-organize things a bit so that we can then do the
> > > real one which uses a filter based on the inode pointer to return an
> > > "idempotent return value" for module loads that share the same inode.
> > > 
> > > It's entirely untested, and since I'm on the road I'm going to not
> > > really be able to test it. It compiles for me, and the code looks
> > > fairly straightforward, but it's probably buggy.
> > > 
> > > It's very loosely based on Luis' attempt,  but it
> > >   (a) is internal to module loading
> > >   (b) uses a reliable cookie
> > >   (c) doesn't leave the cookie around randomly for later
> > >   (d) has seen absolutely no testing
> > > 
> > > Put another way: if somebody wants to play with this, please treat it
> > > as a starting point, not the final thing. You might need to debug
> > > things, and fix silly mistakes.
> > > 
> > > The idea is to just have a simple hash list of currently executing
> > > module loads, protected by a trivial spinlock. Every module loader
> > > adds itself to the right hash list, and if they were the *first* one
> > > (ie no other pending module loads for that inode), will actually do
> > > the module load.
> > > 
> > > Everybody who *isn't* the first one will just wait for completion and
> > > return the same error code that the first one returned.
> > 
> > That's also a hell much more snazzier MODULE_DEBUG_AUTOLOAD_DUPS if we
> > ever wanted to do something similar there if we wanted to also
> > join request_module() calls, instead of it hiding under debug.
> > 
> > > This is technically bogus. The first one might fail due to arguments.
> > 
> > For boot it's fine, as I can't think of boot wanting to support trying
> > to load a module with different arguments but who knows. But I can't
> > see it sensible to issue concurrent multiple requests for modules
> > with different arguments without waiting in userspace for the first
> > to fail.
> > 
> > Even post-boot, doing that sounds rather insane, but it would certainly
> > be a compromise and should probably be clearly documented. I think just
> > a comment acknolwedging that corner case seems sensible.
> > 
> > Because we won't be able to get the arguments until we process the
> > module, so it would be too late for this optimization on kread. So it is
> > why I had also stuck to the original feature being in kread, as then it
> > provides a uniq kread call and the caller is aware of it. But indeed I
> > had not considered the effects of arguments.
> > 
> > Lucas, any thoughts from modules kmod userspace perspective into
> > supporting anyone likely issuing concurrent modules requests with
> > differing arguments?
> > 
> > > So the cookie shouldn't be just the inode, it should be the inode and
> > > a hash of the arguments or something like that.
> > 
> > Personally I think it's a fine optimization without the arguments.
> > 
> > > But it is what it is,
> > > and apart from possible show-stopper bugs this is no worse than the
> > > failed "exclusive write deny" attempt. IOW - maybe worth trying?
> > 
> > The only thing I can think of is allowing threads other than the
> > first one to complete before the one that actually loaded the
> > module. I thought about this race for module auto-loading, see
> > the comment in kmod_dup_request_announce(), so that just
> > further delays the completion to other thread with a stupid
> > queue_work(). That seems more important for module auto-loading
> > duplicates than for boot finit_module() duplicates. But not sure
> > if odering matters in the end due to a preemtible kernel and maybe
> > that concern is hysteria.
> > 
> > > And if *that* didn't sell people on this patch series, I don't know
> > > what will. I should be in marketing! Two drink minimums, here I come!
> > 
> > Sold:
> > 
> > on 255 vcpus 0 duplicates found with this setup:
> > 
> > root@kmod ~ # cat /sys/kernel/debug/modules/stats
> >           Mods ever loaded       66
> >       Mods failed on kread       0
> > Mods failed on decompress       0
> >    Mods failed on becoming       0
> >        Mods failed on load       0
> >          Total module size       11268096
> >        Total mod text size       4149248
> >         Failed kread bytes       0
> >    Failed decompress bytes       0
> >      Failed becoming bytes       0
> >          Failed kmod bytes       0
> >   Virtual mem wasted bytes       0
> >           Average mod size       170729
> >      Average mod text size       62868
> > 
> > So:
> > 
> > Tested-by: Luis Chamberlain <mcgrof@kernel.org>
> > 
> > In terms of bootup timing:
> > 
> > Before:
> > Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s
> > graphical.target reached after 44.178s in userspace.
> > After:
> > Startup finished in 23.995s (kernel) + 40.350s (userspace) = 1min 4.345s
> > graphical.target reached after 40.226s in userspace.
> 
> I'll try grabbing the system where we saw the KASAN-related issues [1] and
> give it a churn with and without the two patches. Might take a bit (~1 day),
> unfortunately.
> 
> [1] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

Great, don't forget:

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 82b0dcc1fe77..222015093eeb 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -3066,7 +3066,7 @@ struct idempotent {
 
 #define IDEM_HASH_BITS 8
 static struct hlist_head idem_hash[1 << IDEM_HASH_BITS];
-static struct spinlock idem_lock;
+static DEFINE_SPINLOCK(idem_lock);
 
 static bool idempotent(struct idempotent *u, const void *cookie)
 {

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-31 16:57                             ` Luis Chamberlain
@ 2023-06-02 15:19                               ` David Hildenbrand
  2023-06-02 16:04                                 ` Luis Chamberlain
  2023-06-02 16:06                                 ` Linus Torvalds
  0 siblings, 2 replies; 53+ messages in thread
From: David Hildenbrand @ 2023-06-02 15:19 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Linus Torvalds, Johan Hovold, Lucas De Marchi, Petr Pavlu,
	gregkh, rafael, song, lucas.de.marchi, christophe.leroy, peterz,
	rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On 31.05.23 18:57, Luis Chamberlain wrote:
> On Wed, May 31, 2023 at 09:51:41AM +0200, David Hildenbrand wrote:
>> On 30.05.23 18:22, Luis Chamberlain wrote:
>>> On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
>>>> On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
>>>>>
>>>>> I took a closer look at some of the modules that failed to load and
>>>>> noticed a pattern in that they have dependencies that are needed by more
>>>>> than one device.
>>>>
>>>> Ok, this is a "maybe something like this" RFC series of two patches -
>>>> one trivial one to re-organize things a bit so that we can then do the
>>>> real one which uses a filter based on the inode pointer to return an
>>>> "idempotent return value" for module loads that share the same inode.
>>>>
>>>> It's entirely untested, and since I'm on the road I'm going to not
>>>> really be able to test it. It compiles for me, and the code looks
>>>> fairly straightforward, but it's probably buggy.
>>>>
>>>> It's very loosely based on Luis' attempt,  but it
>>>>    (a) is internal to module loading
>>>>    (b) uses a reliable cookie
>>>>    (c) doesn't leave the cookie around randomly for later
>>>>    (d) has seen absolutely no testing
>>>>
>>>> Put another way: if somebody wants to play with this, please treat it
>>>> as a starting point, not the final thing. You might need to debug
>>>> things, and fix silly mistakes.
>>>>
>>>> The idea is to just have a simple hash list of currently executing
>>>> module loads, protected by a trivial spinlock. Every module loader
>>>> adds itself to the right hash list, and if they were the *first* one
>>>> (ie no other pending module loads for that inode), will actually do
>>>> the module load.
>>>>
>>>> Everybody who *isn't* the first one will just wait for completion and
>>>> return the same error code that the first one returned.
>>>
>>> That's also a hell much more snazzier MODULE_DEBUG_AUTOLOAD_DUPS if we
>>> ever wanted to do something similar there if we wanted to also
>>> join request_module() calls, instead of it hiding under debug.
>>>
>>>> This is technically bogus. The first one might fail due to arguments.
>>>
>>> For boot it's fine, as I can't think of boot wanting to support trying
>>> to load a module with different arguments but who knows. But I can't
>>> see it sensible to issue concurrent multiple requests for modules
>>> with different arguments without waiting in userspace for the first
>>> to fail.
>>>
>>> Even post-boot, doing that sounds rather insane, but it would certainly
>>> be a compromise and should probably be clearly documented. I think just
>>> a comment acknolwedging that corner case seems sensible.
>>>
>>> Because we won't be able to get the arguments until we process the
>>> module, so it would be too late for this optimization on kread. So it is
>>> why I had also stuck to the original feature being in kread, as then it
>>> provides a uniq kread call and the caller is aware of it. But indeed I
>>> had not considered the effects of arguments.
>>>
>>> Lucas, any thoughts from modules kmod userspace perspective into
>>> supporting anyone likely issuing concurrent modules requests with
>>> differing arguments?
>>>
>>>> So the cookie shouldn't be just the inode, it should be the inode and
>>>> a hash of the arguments or something like that.
>>>
>>> Personally I think it's a fine optimization without the arguments.
>>>
>>>> But it is what it is,
>>>> and apart from possible show-stopper bugs this is no worse than the
>>>> failed "exclusive write deny" attempt. IOW - maybe worth trying?
>>>
>>> The only thing I can think of is allowing threads other than the
>>> first one to complete before the one that actually loaded the
>>> module. I thought about this race for module auto-loading, see
>>> the comment in kmod_dup_request_announce(), so that just
>>> further delays the completion to other thread with a stupid
>>> queue_work(). That seems more important for module auto-loading
>>> duplicates than for boot finit_module() duplicates. But not sure
>>> if odering matters in the end due to a preemtible kernel and maybe
>>> that concern is hysteria.
>>>
>>>> And if *that* didn't sell people on this patch series, I don't know
>>>> what will. I should be in marketing! Two drink minimums, here I come!
>>>
>>> Sold:
>>>
>>> on 255 vcpus 0 duplicates found with this setup:
>>>
>>> root@kmod ~ # cat /sys/kernel/debug/modules/stats
>>>            Mods ever loaded       66
>>>        Mods failed on kread       0
>>> Mods failed on decompress       0
>>>     Mods failed on becoming       0
>>>         Mods failed on load       0
>>>           Total module size       11268096
>>>         Total mod text size       4149248
>>>          Failed kread bytes       0
>>>     Failed decompress bytes       0
>>>       Failed becoming bytes       0
>>>           Failed kmod bytes       0
>>>    Virtual mem wasted bytes       0
>>>            Average mod size       170729
>>>       Average mod text size       62868
>>>
>>> So:
>>>
>>> Tested-by: Luis Chamberlain <mcgrof@kernel.org>
>>>
>>> In terms of bootup timing:
>>>
>>> Before:
>>> Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s
>>> graphical.target reached after 44.178s in userspace.
>>> After:
>>> Startup finished in 23.995s (kernel) + 40.350s (userspace) = 1min 4.345s
>>> graphical.target reached after 40.226s in userspace.
>>
>> I'll try grabbing the system where we saw the KASAN-related issues [1] and
>> give it a churn with and without the two patches. Might take a bit (~1 day),
>> unfortunately.
>>
>> [1] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com
> 
> Great, don't forget:
> 
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 82b0dcc1fe77..222015093eeb 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -3066,7 +3066,7 @@ struct idempotent {
>   
>   #define IDEM_HASH_BITS 8
>   static struct hlist_head idem_hash[1 << IDEM_HASH_BITS];
> -static struct spinlock idem_lock;
> +static DEFINE_SPINLOCK(idem_lock);
>   
>   static bool idempotent(struct idempotent *u, const void *cookie)
>   {
> 

Finally was able to run it on that ThinkSystem SR950 with 8 populated
sockets -> 224 cores and 448 logical CPUs.

The KASAN vmap issues on that system were already no longer reproducible with your
(Luis) previous work that's already in master.

I tested a !debug and debug config (both based on corresponding RHEL9 configs), comparing
929ed21dfdb6 ("master") with 929ed21dfdb6 + Linus' patches ("patched").


Unfortunately, boot times vary a lot, and I did not figure out how to reduce
the noise. I captured the "systemd-analyze blame" output as well.


1) !debug config (not enabling KASAN)

a) master

Startup finished in 32.225s (kernel) + 7.399s (initrd) + 20.378s (userspace) = 1min 3ms
multi-user.target reached after 20.352s in userspace.
Startup finished in 43.734s (kernel) + 7.288s (initrd) + 19.827s (userspace) = 1min 10.851s
multi-user.target reached after 19.800s in userspace.
Startup finished in 50.514s (kernel) + 7.171s (initrd) + 24.757s (userspace) = 1min 22.443s
multi-user.target reached after 24.734s in userspace.
Startup finished in 26.722s (kernel) + 7.249s (initrd) + 23.923s (userspace) = 57.895s
multi-user.target reached after 23.892s in userspace.

b) patched

Startup finished in 36.318s (kernel) + 7.177s (initrd) + 21.383s (userspace) = 1min 4.879s
multi-user.target reached after 21.355s in userspace.
Startup finished in 36.318s (kernel) + 7.177s (initrd) + 21.383s (userspace) = 1min 4.879s
multi-user.target reached after 21.355s in userspace.
Startup finished in 1min 34.678s (kernel) + 7.239s (initrd) + 24.066s (userspace) = 2min 5.985s
multi-user.target reached after 24.040s in userspace.
Startup finished in 25.879s (kernel) + 7.144s (initrd) + 29.665s (userspace) = 1min 2.689s
multi-user.target reached after 29.637s in userspace.



2) debug config (enabling KASAN)

a) master

Startup finished in 2min 12.695s (kernel) + 25.058s (initrd) + 1min 13.012s (userspace) = 3min 50.765s
multi-user.target reached after 1min 12.903s in userspace.
Startup finished in 1min 45.400s (kernel) + 24.294s (initrd) + 1min 8.910s (userspace) = 3min 18.606s
multi-user.target reached after 1min 8.786s in userspace.
Startup finished in 2min 4.857s (kernel) + 24.715s (initrd) + 1min 5.088s (userspace) = 3min 34.660s
multi-user.target reached after 1min 4.967s in userspace.
Startup finished in 3min 20.400s (kernel) + 24.703s (initrd) + 1min 5.469s (userspace) = 4min 50.573s
multi-user.target reached after 1min 5.344s in userspace.

b) patched

Startup finished in 2min 5.250s (kernel) + 25.049s (initrd) + 1min 1.961s (userspace) = 3min 32.262s
multi-user.target reached after 1min 1.844s in userspace.
Startup finished in 1min 52.524s (kernel) + 24.897s (initrd) + 1min 5.062s (userspace) = 3min 22.484s
multi-user.target reached after 1min 4.916s in userspace.
Startup finished in 9min 36.817s (kernel) + 24.859s (initrd) + 1min 18.657s (userspace) = 11min 20.335s
multi-user.target reached after 1min 18.455s in userspace.
Startup finished in 30min 20.715s (kernel) + 24.722s (initrd) + 1min 7.039s (userspace) = 31min 52.476s
multi-user.target reached after 1min 6.907s in userspace.


What concerns me a bit, is that on the patched kernel we seem to hit more cases where
boot takes much longer (in both kernel configs).

I'll do some more runs/investigation to see if this is reproducible or just some system oddity.

Staring just at the udev settle time (systemd-analyze blame), it's very similar between both kernel
versions.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-06-02 15:19                               ` David Hildenbrand
@ 2023-06-02 16:04                                 ` Luis Chamberlain
  2023-06-05 11:26                                   ` David Hildenbrand
  2023-06-02 16:06                                 ` Linus Torvalds
  1 sibling, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2023-06-02 16:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, Johan Hovold, Lucas De Marchi, Petr Pavlu,
	gregkh, rafael, song, lucas.de.marchi, christophe.leroy, peterz,
	rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Fri, Jun 02, 2023 at 05:19:58PM +0200, David Hildenbrand wrote:
> On 31.05.23 18:57, Luis Chamberlain wrote:
> > On Wed, May 31, 2023 at 09:51:41AM +0200, David Hildenbrand wrote:
> > > On 30.05.23 18:22, Luis Chamberlain wrote:
> > > > On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
> > > > > On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
> > > > > > 
> > > > > > I took a closer look at some of the modules that failed to load and
> > > > > > noticed a pattern in that they have dependencies that are needed by more
> > > > > > than one device.
> > > > > 
> > > > > Ok, this is a "maybe something like this" RFC series of two patches -
> > > > > one trivial one to re-organize things a bit so that we can then do the
> > > > > real one which uses a filter based on the inode pointer to return an
> > > > > "idempotent return value" for module loads that share the same inode.
> > > > > 
> > > > > It's entirely untested, and since I'm on the road I'm going to not
> > > > > really be able to test it. It compiles for me, and the code looks
> > > > > fairly straightforward, but it's probably buggy.
> > > > > 
> > > > > It's very loosely based on Luis' attempt,  but it
> > > > >    (a) is internal to module loading
> > > > >    (b) uses a reliable cookie
> > > > >    (c) doesn't leave the cookie around randomly for later
> > > > >    (d) has seen absolutely no testing
> > > > > 
> > > > > Put another way: if somebody wants to play with this, please treat it
> > > > > as a starting point, not the final thing. You might need to debug
> > > > > things, and fix silly mistakes.
> > > > > 
> > > > > The idea is to just have a simple hash list of currently executing
> > > > > module loads, protected by a trivial spinlock. Every module loader
> > > > > adds itself to the right hash list, and if they were the *first* one
> > > > > (ie no other pending module loads for that inode), will actually do
> > > > > the module load.
> > > > > 
> > > > > Everybody who *isn't* the first one will just wait for completion and
> > > > > return the same error code that the first one returned.
> > > > 
> > > > That's also a hell much more snazzier MODULE_DEBUG_AUTOLOAD_DUPS if we
> > > > ever wanted to do something similar there if we wanted to also
> > > > join request_module() calls, instead of it hiding under debug.
> > > > 
> > > > > This is technically bogus. The first one might fail due to arguments.
> > > > 
> > > > For boot it's fine, as I can't think of boot wanting to support trying
> > > > to load a module with different arguments but who knows. But I can't
> > > > see it sensible to issue concurrent multiple requests for modules
> > > > with different arguments without waiting in userspace for the first
> > > > to fail.
> > > > 
> > > > Even post-boot, doing that sounds rather insane, but it would certainly
> > > > be a compromise and should probably be clearly documented. I think just
> > > > a comment acknolwedging that corner case seems sensible.
> > > > 
> > > > Because we won't be able to get the arguments until we process the
> > > > module, so it would be too late for this optimization on kread. So it is
> > > > why I had also stuck to the original feature being in kread, as then it
> > > > provides a uniq kread call and the caller is aware of it. But indeed I
> > > > had not considered the effects of arguments.
> > > > 
> > > > Lucas, any thoughts from modules kmod userspace perspective into
> > > > supporting anyone likely issuing concurrent modules requests with
> > > > differing arguments?
> > > > 
> > > > > So the cookie shouldn't be just the inode, it should be the inode and
> > > > > a hash of the arguments or something like that.
> > > > 
> > > > Personally I think it's a fine optimization without the arguments.
> > > > 
> > > > > But it is what it is,
> > > > > and apart from possible show-stopper bugs this is no worse than the
> > > > > failed "exclusive write deny" attempt. IOW - maybe worth trying?
> > > > 
> > > > The only thing I can think of is allowing threads other than the
> > > > first one to complete before the one that actually loaded the
> > > > module. I thought about this race for module auto-loading, see
> > > > the comment in kmod_dup_request_announce(), so that just
> > > > further delays the completion to other thread with a stupid
> > > > queue_work(). That seems more important for module auto-loading
> > > > duplicates than for boot finit_module() duplicates. But not sure
> > > > if odering matters in the end due to a preemtible kernel and maybe
> > > > that concern is hysteria.
> > > > 
> > > > > And if *that* didn't sell people on this patch series, I don't know
> > > > > what will. I should be in marketing! Two drink minimums, here I come!
> > > > 
> > > > Sold:
> > > > 
> > > > on 255 vcpus 0 duplicates found with this setup:
> > > > 
> > > > root@kmod ~ # cat /sys/kernel/debug/modules/stats
> > > >            Mods ever loaded       66
> > > >        Mods failed on kread       0
> > > > Mods failed on decompress       0
> > > >     Mods failed on becoming       0
> > > >         Mods failed on load       0
> > > >           Total module size       11268096
> > > >         Total mod text size       4149248
> > > >          Failed kread bytes       0
> > > >     Failed decompress bytes       0
> > > >       Failed becoming bytes       0
> > > >           Failed kmod bytes       0
> > > >    Virtual mem wasted bytes       0
> > > >            Average mod size       170729
> > > >       Average mod text size       62868
> > > > 
> > > > So:
> > > > 
> > > > Tested-by: Luis Chamberlain <mcgrof@kernel.org>
> > > > 
> > > > In terms of bootup timing:
> > > > 
> > > > Before:
> > > > Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s
> > > > graphical.target reached after 44.178s in userspace.
> > > > After:
> > > > Startup finished in 23.995s (kernel) + 40.350s (userspace) = 1min 4.345s
> > > > graphical.target reached after 40.226s in userspace.
> > > 
> > > I'll try grabbing the system where we saw the KASAN-related issues [1] and
> > > give it a churn with and without the two patches. Might take a bit (~1 day),
> > > unfortunately.
> > > 
> > > [1] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com
> > 
> > Great, don't forget:
> > 
> > diff --git a/kernel/module/main.c b/kernel/module/main.c
> > index 82b0dcc1fe77..222015093eeb 100644
> > --- a/kernel/module/main.c
> > +++ b/kernel/module/main.c
> > @@ -3066,7 +3066,7 @@ struct idempotent {
> >   #define IDEM_HASH_BITS 8
> >   static struct hlist_head idem_hash[1 << IDEM_HASH_BITS];
> > -static struct spinlock idem_lock;
> > +static DEFINE_SPINLOCK(idem_lock);
> >   static bool idempotent(struct idempotent *u, const void *cookie)
> >   {
> > 
> 
> Finally was able to run it on that ThinkSystem SR950 with 8 populated
> sockets -> 224 cores and 448 logical CPUs.
> 
> The KASAN vmap issues on that system were already no longer reproducible with your
> (Luis) previous work that's already in master.
> 
> I tested a !debug and debug config (both based on corresponding RHEL9 configs), comparing
> 929ed21dfdb6 ("master") with 929ed21dfdb6 + Linus' patches ("patched").
> 
> 
> Unfortunately, boot times vary a lot, and I did not figure out how to reduce
> the noise. I captured the "systemd-analyze blame" output as well.
> 
> 
> 1) !debug config (not enabling KASAN)
> 
> a) master
> 
> Startup finished in 32.225s (kernel) + 7.399s (initrd) + 20.378s (userspace) = 1min 3ms
> multi-user.target reached after 20.352s in userspace.
> Startup finished in 43.734s (kernel) + 7.288s (initrd) + 19.827s (userspace) = 1min 10.851s
> multi-user.target reached after 19.800s in userspace.
> Startup finished in 50.514s (kernel) + 7.171s (initrd) + 24.757s (userspace) = 1min 22.443s
> multi-user.target reached after 24.734s in userspace.
> Startup finished in 26.722s (kernel) + 7.249s (initrd) + 23.923s (userspace) = 57.895s
> multi-user.target reached after 23.892s in userspace.
> 
> b) patched
> 
> Startup finished in 36.318s (kernel) + 7.177s (initrd) + 21.383s (userspace) = 1min 4.879s
> multi-user.target reached after 21.355s in userspace.
> Startup finished in 36.318s (kernel) + 7.177s (initrd) + 21.383s (userspace) = 1min 4.879s
> multi-user.target reached after 21.355s in userspace.
> Startup finished in 1min 34.678s (kernel) + 7.239s (initrd) + 24.066s (userspace) = 2min 5.985s
> multi-user.target reached after 24.040s in userspace.
> Startup finished in 25.879s (kernel) + 7.144s (initrd) + 29.665s (userspace) = 1min 2.689s
> multi-user.target reached after 29.637s in userspace.
> 
> 
> 
> 2) debug config (enabling KASAN)
> 
> a) master
> 
> Startup finished in 2min 12.695s (kernel) + 25.058s (initrd) + 1min 13.012s (userspace) = 3min 50.765s
> multi-user.target reached after 1min 12.903s in userspace.
> Startup finished in 1min 45.400s (kernel) + 24.294s (initrd) + 1min 8.910s (userspace) = 3min 18.606s
> multi-user.target reached after 1min 8.786s in userspace.
> Startup finished in 2min 4.857s (kernel) + 24.715s (initrd) + 1min 5.088s (userspace) = 3min 34.660s
> multi-user.target reached after 1min 4.967s in userspace.
> Startup finished in 3min 20.400s (kernel) + 24.703s (initrd) + 1min 5.469s (userspace) = 4min 50.573s
> multi-user.target reached after 1min 5.344s in userspace.
> 
> b) patched
> 
> Startup finished in 2min 5.250s (kernel) + 25.049s (initrd) + 1min 1.961s (userspace) = 3min 32.262s
> multi-user.target reached after 1min 1.844s in userspace.
> Startup finished in 1min 52.524s (kernel) + 24.897s (initrd) + 1min 5.062s (userspace) = 3min 22.484s
> multi-user.target reached after 1min 4.916s in userspace.
> Startup finished in 9min 36.817s (kernel) + 24.859s (initrd) + 1min 18.657s (userspace) = 11min 20.335s
> multi-user.target reached after 1min 18.455s in userspace.
> Startup finished in 30min 20.715s (kernel) + 24.722s (initrd) + 1min 7.039s (userspace) = 31min 52.476s
> multi-user.target reached after 1min 6.907s in userspace.
> 
> 
> What concerns me a bit, is that on the patched kernel we seem to hit more cases where
> boot takes much longer (in both kernel configs).
> 
> I'll do some more runs/investigation to see if this is reproducible or just some system oddity.
> 
> Staring just at the udev settle time (systemd-analyze blame), it's very similar between both kernel
> versions.

Thanks for these tests, having MODULE_STATS enabled +

cat /sys/kernel/debug/modules/stats

would be of huge value, both on master and + patched.

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-06-02 15:19                               ` David Hildenbrand
  2023-06-02 16:04                                 ` Luis Chamberlain
@ 2023-06-02 16:06                                 ` Linus Torvalds
  2023-06-02 16:37                                   ` David Hildenbrand
  1 sibling, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2023-06-02 16:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Luis Chamberlain, Johan Hovold, Lucas De Marchi, Petr Pavlu,
	gregkh, rafael, song, lucas.de.marchi, christophe.leroy, peterz,
	rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Fri, Jun 2, 2023 at 11:20 AM David Hildenbrand <david@redhat.com> wrote:
>
> What concerns me a bit, is that on the patched kernel we seem to hit more cases where
> boot takes much longer (in both kernel configs).

So it potentially serializes the loads to the same file more, but in
the process uses much less memory (since the ones waiting will not
have done any of the "load file contents and uncompress them"). So
it's a bit of a trade-off.

We could complicate things a bit, and let other callers return -EEXIST
a bit earlier, but I'm not convinced it really matters.

Honestly, taking too long because user space does something stupid and
wrong is not a kernel bug. Not booting because we use too much memory
- that's problematic. But booting slowly because udev does several
thousand unnecessary module loads is entirely on udev.

                 Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-06-02 16:06                                 ` Linus Torvalds
@ 2023-06-02 16:37                                   ` David Hildenbrand
  0 siblings, 0 replies; 53+ messages in thread
From: David Hildenbrand @ 2023-06-02 16:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luis Chamberlain, Johan Hovold, Lucas De Marchi, Petr Pavlu,
	gregkh, rafael, song, lucas.de.marchi, christophe.leroy, peterz,
	rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On 02.06.23 18:06, Linus Torvalds wrote:
> On Fri, Jun 2, 2023 at 11:20 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> What concerns me a bit, is that on the patched kernel we seem to hit more cases where
>> boot takes much longer (in both kernel configs).
> 
> So it potentially serializes the loads to the same file more, but in
> the process uses much less memory (since the ones waiting will not
> have done any of the "load file contents and uncompress them"). So
> it's a bit of a trade-off.

I have the feeling that -- on this system -- it's some inaccurate 
accounting of firmware+loader times to the kernel startup time. Combined 
with some other noise. Especially the firmware loading time seems to be 
fairly randomized.

I guess what we care about regarding module loading is the 
initrd+userspace loading times, and they are fairly stable. But we 
mostly care about udev.

So let's look only at "systemd-udev" services:

1) !debug

a) master

5.672s systemd-udev-settle.service
  505ms systemd-udev-trigger.service
  272ms systemd-udevd.service
5.418s systemd-udev-settle.service
  487ms systemd-udev-trigger.service
  258ms systemd-udevd.service
5.707s systemd-udev-settle.service
  527ms systemd-udev-trigger.service
  273ms systemd-udevd.service
6.250s systemd-udev-settle.service
  455ms systemd-udev-trigger.service
  283ms systemd-udevd.service


b) patched

4.652s systemd-udev-settle.service
  461ms systemd-udev-trigger.service
  302ms systemd-udevd.service
4.652s systemd-udev-settle.service
  461ms systemd-udev-trigger.service
  302ms systemd-udevd.service
4.634s systemd-udev-settle.service
  444ms systemd-udev-trigger.service
  296ms systemd-udevd.service
4.745s systemd-udev-settle.service
  444ms systemd-udev-trigger.service
  273ms systemd-udevd.service


2) debug

a) master

32.806s systemd-udev-settle.service
  9.584s systemd-udev-trigger.service
   471ms systemd-udevd.service
29.901s systemd-udev-settle.service
  8.914s systemd-udev-trigger.service
   400ms systemd-udevd.service
28.640s systemd-udev-settle.service
  9.260s systemd-udev-trigger.service
   477ms systemd-udevd.service
29.498s systemd-udev-settle.service
  9.073s systemd-udev-trigger.service
   444ms systemd-udevd.service


b) patched

28.765s systemd-udev-settle.service
  8.898s systemd-udev-trigger.service
   400ms systemd-udevd.service
28.292s systemd-udev-settle.service
  8.903s systemd-udev-trigger.service
   401ms systemd-udevd.service
34.588s systemd-udev-settle.service
  8.959s systemd-udev-trigger.service
   455ms systemd-udevd.service
28.641s systemd-udev-settle.service
  8.953s systemd-udev-trigger.service
   389ms systemd-udevd.service



So except some noise, in the general case the patched version seems to 
be faster just looking at systemd-udev.

> 
> We could complicate things a bit, and let other callers return -EEXIST
> a bit earlier, but I'm not convinced it really matters.

Looking at the numbers, agreed.

> 
> Honestly, taking too long because user space does something stupid and
> wrong is not a kernel bug. Not booting because we use too much memory
> - that's problematic. But booting slowly because udev does several
> thousand unnecessary module loads is entirely on udev.

Yes.


I'll do some more experiments, but from what I can tell

Tested-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-30  1:55                       ` Linus Torvalds
                                           ` (2 preceding siblings ...)
  2023-05-30 22:45                         ` Dan Williams
@ 2023-06-04 14:26                         ` Rudi Heitbaum
  3 siblings, 0 replies; 53+ messages in thread
From: Rudi Heitbaum @ 2023-06-04 14:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johan Hovold, Luis Chamberlain, Lucas De Marchi, Petr Pavlu,
	gregkh, rafael, song, lucas.de.marchi, christophe.leroy, peterz,
	rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu,
	david, tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart, imre.deak

On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
> On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
> >
> > I took a closer look at some of the modules that failed to load and
> > noticed a pattern in that they have dependencies that are needed by more
> > than one device.
> 
> Ok, this is a "maybe something like this" RFC series of two patches -
> one trivial one to re-organize things a bit so that we can then do the
> real one which uses a filter based on the inode pointer to return an
> "idempotent return value" for module loads that share the same inode.
> 
> It's entirely untested, and since I'm on the road I'm going to not
> really be able to test it. It compiles for me, and the code looks
> fairly straightforward, but it's probably buggy.

The updated patches fix the issue experienced with 6.4-rc4 where the Intel
NUC12 booted fine, but on media playback resulted in the video stuttering,
then freezing - and that audio was not coming though.

# diff modules-rc4 modules-rc4-patched
< snd_hda_codec_hdmi     77824  0
< snd_hda_codec_realtek   159744  0
---
> snd_hda_codec_hdmi     77824  1
> snd_hda_codec_realtek   159744  1

 Revert "module: error out early on concurrent load of the same module file"
 module: split up 'finit_module()' into init_module_from_file() helper
 modules: catch concurrent module loads, take two
 Johan's fix:
  -static struct spinlock idem_lock;
  +static DEFINE_SPINLOCK(idem_lock);

Tested-by: Rudi Heitbaum <rudi@heitbaum..com>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-06-02 16:04                                 ` Luis Chamberlain
@ 2023-06-05 11:26                                   ` David Hildenbrand
  2023-06-05 15:17                                     ` Luis Chamberlain
  0 siblings, 1 reply; 53+ messages in thread
From: David Hildenbrand @ 2023-06-05 11:26 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Linus Torvalds, Johan Hovold, Lucas De Marchi, Petr Pavlu,
	gregkh, rafael, song, lucas.de.marchi, christophe.leroy, peterz,
	rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On 02.06.23 18:04, Luis Chamberlain wrote:
> On Fri, Jun 02, 2023 at 05:19:58PM +0200, David Hildenbrand wrote:
>> On 31.05.23 18:57, Luis Chamberlain wrote:
>>> On Wed, May 31, 2023 at 09:51:41AM +0200, David Hildenbrand wrote:
>>>> On 30.05.23 18:22, Luis Chamberlain wrote:
>>>>> On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
>>>>>> On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
>>>>>>>
>>>>>>> I took a closer look at some of the modules that failed to load and
>>>>>>> noticed a pattern in that they have dependencies that are needed by more
>>>>>>> than one device.
>>>>>>
>>>>>> Ok, this is a "maybe something like this" RFC series of two patches -
>>>>>> one trivial one to re-organize things a bit so that we can then do the
>>>>>> real one which uses a filter based on the inode pointer to return an
>>>>>> "idempotent return value" for module loads that share the same inode.
>>>>>>
>>>>>> It's entirely untested, and since I'm on the road I'm going to not
>>>>>> really be able to test it. It compiles for me, and the code looks
>>>>>> fairly straightforward, but it's probably buggy.
>>>>>>
>>>>>> It's very loosely based on Luis' attempt,  but it
>>>>>>     (a) is internal to module loading
>>>>>>     (b) uses a reliable cookie
>>>>>>     (c) doesn't leave the cookie around randomly for later
>>>>>>     (d) has seen absolutely no testing
>>>>>>
>>>>>> Put another way: if somebody wants to play with this, please treat it
>>>>>> as a starting point, not the final thing. You might need to debug
>>>>>> things, and fix silly mistakes.
>>>>>>
>>>>>> The idea is to just have a simple hash list of currently executing
>>>>>> module loads, protected by a trivial spinlock. Every module loader
>>>>>> adds itself to the right hash list, and if they were the *first* one
>>>>>> (ie no other pending module loads for that inode), will actually do
>>>>>> the module load.
>>>>>>
>>>>>> Everybody who *isn't* the first one will just wait for completion and
>>>>>> return the same error code that the first one returned.
>>>>>
>>>>> That's also a hell much more snazzier MODULE_DEBUG_AUTOLOAD_DUPS if we
>>>>> ever wanted to do something similar there if we wanted to also
>>>>> join request_module() calls, instead of it hiding under debug.
>>>>>
>>>>>> This is technically bogus. The first one might fail due to arguments.
>>>>>
>>>>> For boot it's fine, as I can't think of boot wanting to support trying
>>>>> to load a module with different arguments but who knows. But I can't
>>>>> see it sensible to issue concurrent multiple requests for modules
>>>>> with different arguments without waiting in userspace for the first
>>>>> to fail.
>>>>>
>>>>> Even post-boot, doing that sounds rather insane, but it would certainly
>>>>> be a compromise and should probably be clearly documented. I think just
>>>>> a comment acknolwedging that corner case seems sensible.
>>>>>
>>>>> Because we won't be able to get the arguments until we process the
>>>>> module, so it would be too late for this optimization on kread. So it is
>>>>> why I had also stuck to the original feature being in kread, as then it
>>>>> provides a uniq kread call and the caller is aware of it. But indeed I
>>>>> had not considered the effects of arguments.
>>>>>
>>>>> Lucas, any thoughts from modules kmod userspace perspective into
>>>>> supporting anyone likely issuing concurrent modules requests with
>>>>> differing arguments?
>>>>>
>>>>>> So the cookie shouldn't be just the inode, it should be the inode and
>>>>>> a hash of the arguments or something like that.
>>>>>
>>>>> Personally I think it's a fine optimization without the arguments.
>>>>>
>>>>>> But it is what it is,
>>>>>> and apart from possible show-stopper bugs this is no worse than the
>>>>>> failed "exclusive write deny" attempt. IOW - maybe worth trying?
>>>>>
>>>>> The only thing I can think of is allowing threads other than the
>>>>> first one to complete before the one that actually loaded the
>>>>> module. I thought about this race for module auto-loading, see
>>>>> the comment in kmod_dup_request_announce(), so that just
>>>>> further delays the completion to other thread with a stupid
>>>>> queue_work(). That seems more important for module auto-loading
>>>>> duplicates than for boot finit_module() duplicates. But not sure
>>>>> if odering matters in the end due to a preemtible kernel and maybe
>>>>> that concern is hysteria.
>>>>>
>>>>>> And if *that* didn't sell people on this patch series, I don't know
>>>>>> what will. I should be in marketing! Two drink minimums, here I come!
>>>>>
>>>>> Sold:
>>>>>
>>>>> on 255 vcpus 0 duplicates found with this setup:
>>>>>
>>>>> root@kmod ~ # cat /sys/kernel/debug/modules/stats
>>>>>             Mods ever loaded       66
>>>>>         Mods failed on kread       0
>>>>> Mods failed on decompress       0
>>>>>      Mods failed on becoming       0
>>>>>          Mods failed on load       0
>>>>>            Total module size       11268096
>>>>>          Total mod text size       4149248
>>>>>           Failed kread bytes       0
>>>>>      Failed decompress bytes       0
>>>>>        Failed becoming bytes       0
>>>>>            Failed kmod bytes       0
>>>>>     Virtual mem wasted bytes       0
>>>>>             Average mod size       170729
>>>>>        Average mod text size       62868
>>>>>
>>>>> So:
>>>>>
>>>>> Tested-by: Luis Chamberlain <mcgrof@kernel.org>
>>>>>
>>>>> In terms of bootup timing:
>>>>>
>>>>> Before:
>>>>> Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s
>>>>> graphical.target reached after 44.178s in userspace.
>>>>> After:
>>>>> Startup finished in 23.995s (kernel) + 40.350s (userspace) = 1min 4.345s
>>>>> graphical.target reached after 40.226s in userspace.
>>>>
>>>> I'll try grabbing the system where we saw the KASAN-related issues [1] and
>>>> give it a churn with and without the two patches. Might take a bit (~1 day),
>>>> unfortunately.
>>>>
>>>> [1] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com
>>>
>>> Great, don't forget:
>>>
>>> diff --git a/kernel/module/main.c b/kernel/module/main.c
>>> index 82b0dcc1fe77..222015093eeb 100644
>>> --- a/kernel/module/main.c
>>> +++ b/kernel/module/main.c
>>> @@ -3066,7 +3066,7 @@ struct idempotent {
>>>    #define IDEM_HASH_BITS 8
>>>    static struct hlist_head idem_hash[1 << IDEM_HASH_BITS];
>>> -static struct spinlock idem_lock;
>>> +static DEFINE_SPINLOCK(idem_lock);
>>>    static bool idempotent(struct idempotent *u, const void *cookie)
>>>    {
>>>
>>
>> Finally was able to run it on that ThinkSystem SR950 with 8 populated
>> sockets -> 224 cores and 448 logical CPUs.
>>
>> The KASAN vmap issues on that system were already no longer reproducible with your
>> (Luis) previous work that's already in master.
>>
>> I tested a !debug and debug config (both based on corresponding RHEL9 configs), comparing
>> 929ed21dfdb6 ("master") with 929ed21dfdb6 + Linus' patches ("patched").
>>
>>
>> Unfortunately, boot times vary a lot, and I did not figure out how to reduce
>> the noise. I captured the "systemd-analyze blame" output as well.
>>
>>
>> 1) !debug config (not enabling KASAN)
>>
>> a) master
>>
>> Startup finished in 32.225s (kernel) + 7.399s (initrd) + 20.378s (userspace) = 1min 3ms
>> multi-user.target reached after 20.352s in userspace.
>> Startup finished in 43.734s (kernel) + 7.288s (initrd) + 19.827s (userspace) = 1min 10.851s
>> multi-user.target reached after 19.800s in userspace.
>> Startup finished in 50.514s (kernel) + 7.171s (initrd) + 24.757s (userspace) = 1min 22.443s
>> multi-user.target reached after 24.734s in userspace.
>> Startup finished in 26.722s (kernel) + 7.249s (initrd) + 23.923s (userspace) = 57.895s
>> multi-user.target reached after 23.892s in userspace.
>>
>> b) patched
>>
>> Startup finished in 36.318s (kernel) + 7.177s (initrd) + 21.383s (userspace) = 1min 4.879s
>> multi-user.target reached after 21.355s in userspace.
>> Startup finished in 36.318s (kernel) + 7.177s (initrd) + 21.383s (userspace) = 1min 4.879s
>> multi-user.target reached after 21.355s in userspace.
>> Startup finished in 1min 34.678s (kernel) + 7.239s (initrd) + 24.066s (userspace) = 2min 5.985s
>> multi-user.target reached after 24.040s in userspace.
>> Startup finished in 25.879s (kernel) + 7.144s (initrd) + 29.665s (userspace) = 1min 2.689s
>> multi-user.target reached after 29.637s in userspace.
>>
>>
>>
>> 2) debug config (enabling KASAN)
>>
>> a) master
>>
>> Startup finished in 2min 12.695s (kernel) + 25.058s (initrd) + 1min 13.012s (userspace) = 3min 50.765s
>> multi-user.target reached after 1min 12.903s in userspace.
>> Startup finished in 1min 45.400s (kernel) + 24.294s (initrd) + 1min 8.910s (userspace) = 3min 18.606s
>> multi-user.target reached after 1min 8.786s in userspace.
>> Startup finished in 2min 4.857s (kernel) + 24.715s (initrd) + 1min 5.088s (userspace) = 3min 34.660s
>> multi-user.target reached after 1min 4.967s in userspace.
>> Startup finished in 3min 20.400s (kernel) + 24.703s (initrd) + 1min 5.469s (userspace) = 4min 50.573s
>> multi-user.target reached after 1min 5.344s in userspace.
>>
>> b) patched
>>
>> Startup finished in 2min 5.250s (kernel) + 25.049s (initrd) + 1min 1.961s (userspace) = 3min 32.262s
>> multi-user.target reached after 1min 1.844s in userspace.
>> Startup finished in 1min 52.524s (kernel) + 24.897s (initrd) + 1min 5.062s (userspace) = 3min 22.484s
>> multi-user.target reached after 1min 4.916s in userspace.
>> Startup finished in 9min 36.817s (kernel) + 24.859s (initrd) + 1min 18.657s (userspace) = 11min 20.335s
>> multi-user.target reached after 1min 18.455s in userspace.
>> Startup finished in 30min 20.715s (kernel) + 24.722s (initrd) + 1min 7.039s (userspace) = 31min 52.476s
>> multi-user.target reached after 1min 6.907s in userspace.
>>
>>
>> What concerns me a bit, is that on the patched kernel we seem to hit more cases where
>> boot takes much longer (in both kernel configs).
>>
>> I'll do some more runs/investigation to see if this is reproducible or just some system oddity.
>>
>> Staring just at the udev settle time (systemd-analyze blame), it's very similar between both kernel
>> versions.
> 
> Thanks for these tests, having MODULE_STATS enabled +
> 
> cat /sys/kernel/debug/modules/stats
> 
> would be of huge value, both on master and + patched.

Sure! Sorry for the slight delay, had to recompile the kernels.

I only did a single run on each kernel, should be good enough for the purpose here.


1) !debug config (not enabling KASAN)

a) master

# cat /sys/kernel/debug/modules/stats
          Mods ever loaded       75
      Mods failed on kread       0
Mods failed on decompress       0
   Mods failed on becoming       8052
       Mods failed on load       531
         Total module size       16343040
       Total mod text size       4767744
        Failed kread bytes       0
   Failed decompress bytes       0
     Failed becoming bytes       19185958488
         Failed kmod bytes       1172591936
  Virtual mem wasted bytes       20358550424
          Average mod size       217908
     Average mod text size       63570
   Avg fail becoming bytes       2382757
   Average fail load bytes       2208272
Duplicate failed modules:
               Module-name        How-many-times                    Reason
         intel_rapl_common                   447                  Becoming
    intel_uncore_frequency                   447                  Becoming
intel_uncore_frequency_common               445           Becoming & Load
          isst_if_mbox_msr                   447                  Becoming
            isst_if_common                   447           Becoming & Load
                  skx_edac                   447           Becoming & Load
                      nfit                   447                  Becoming
                 libnvdimm                   447                  Becoming
      x86_pkg_temp_thermal                   447                  Becoming
          intel_powerclamp                   447                  Becoming
                  coretemp                   447           Becoming & Load
                 kvm_intel                   447                  Becoming
                       kvm                   447           Becoming & Load
                     irdma                     3                      Load
                       ice                     3                  Becoming
                 irqbypass                   447           Becoming & Load
                      rapl                   393           Becoming & Load
                      gnss                     2           Becoming & Load
                   ioatdma                    63           Becoming & Load
                 ib_uverbs                     3           Becoming & Load
              intel_cstate                   447           Becoming & Load
                   ipmi_si                     1                      Load
              intel_uncore                   447           Becoming & Load
                   ib_core                     3           Becoming & Load
              ipmi_devintf                     1                      Load
           ipmi_msghandler                     1                  Becoming
                       dca                    63                  Becoming
                      ahci                     1                  Becoming
                      i40e                     3                  Becoming
                   libahci                     1                  Becoming
          crct10dif_pclmul                   195                  Becoming
              crc32_pclmul                   262                  Becoming
              crc32c_intel                   126           Becoming & Load
                    libata                     1                      Load
       ghash_clmulni_intel                   297           Becoming & Load
                      fjes                    11                  Becoming


b) patched

# cat /sys/kernel/debug/modules/stats
          Mods ever loaded       75
      Mods failed on kread       0
Mods failed on decompress       0
   Mods failed on becoming       0
       Mods failed on load       0
         Total module size       16343040
       Total mod text size       4767744
        Failed kread bytes       0
   Failed decompress bytes       0
     Failed becoming bytes       0
         Failed kmod bytes       0
  Virtual mem wasted bytes       0
          Average mod size       217908
     Average mod text size       63570



2) debug config (enabling KASAN)

a) master

# cat /sys/kernel/debug/modules/stats
          Mods ever loaded       75
      Mods failed on kread       0
Mods failed on decompress       0
   Mods failed on becoming       3173
       Mods failed on load       923
         Total module size       32288768
       Total mod text size       14819328
        Failed kread bytes       0
   Failed decompress bytes       0
     Failed becoming bytes       4279728952
         Failed kmod bytes       2174133088
  Virtual mem wasted bytes       6453862040
          Average mod size       430517
     Average mod text size       197592
   Avg fail becoming bytes       1348796
   Average fail load bytes       2355508
Duplicate failed modules:
               Module-name        How-many-times                    Reason
                   ioatdma                    63           Becoming & Load
         intel_rapl_common                   313                  Becoming
    intel_uncore_frequency                   136           Becoming & Load
intel_uncore_frequency_common                38                  Becoming
          isst_if_mbox_msr                   443           Becoming & Load
            isst_if_common                   210           Becoming & Load
                  skx_edac                   257           Becoming & Load
                      nfit                   180           Becoming & Load
                 libnvdimm                   294           Becoming & Load
      x86_pkg_temp_thermal                   321           Becoming & Load
          intel_powerclamp                   247           Becoming & Load
                  coretemp                    55           Becoming & Load
                 kvm_intel                     2                  Becoming
                       kvm                     3           Becoming & Load
                 irqbypass                     3                  Becoming
                      rapl                     1                  Becoming
              intel_cstate                   444                      Load
              intel_uncore                   447           Becoming & Load
                 ib_uverbs                     1                      Load
                   ib_core                     3           Becoming & Load
                   ipmi_si                     1                  Becoming
                       dca                    20           Becoming & Load
                        sg                     1                  Becoming
                      i40e                     3           Becoming & Load
                      ahci                     1                  Becoming
          crct10dif_pclmul                   160           Becoming & Load
              crc32_pclmul                   151           Becoming & Load
              crc32c_intel                   146           Becoming & Load
       ghash_clmulni_intel                   140                  Becoming
                    libata                     1                      Load
                      fjes                    11           Becoming & Load



b) patched

# cat /sys/kernel/debug/modules/stats
          Mods ever loaded       75
      Mods failed on kread       0
Mods failed on decompress       0
   Mods failed on becoming       19
       Mods failed on load       0
         Total module size       32288768
       Total mod text size       14819328
        Failed kread bytes       0
   Failed decompress bytes       0
     Failed becoming bytes       6441296
         Failed kmod bytes       0
  Virtual mem wasted bytes       6441296
          Average mod size       430517
     Average mod text size       197592
   Avg fail becoming bytes       339016
Duplicate failed modules:
               Module-name        How-many-times                    Reason
         intel_rapl_common                     2                  Becoming
intel_uncore_frequency_common                 5                  Becoming
            isst_if_common                     3                  Becoming
          intel_powerclamp                     2                  Becoming
                 irqbypass                     5                  Becoming
          crct10dif_pclmul                     1                  Becoming
              crc32c_intel                     1                  Becoming



So, with these (helpful) stats, the improvement is obvious (and explains the ~1s
improvement I saw staring at the startup times of the udev services).

There are still some failed module loads with the debug config (only in the
becoming state), I did not dive deeply into the actual code changes (-EBUSY),
just spelling it out so we can decide if this is to be expected or some corner
case that shouldn't be happening.


Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-05-30  9:40                         ` Johan Hovold
@ 2023-06-05 12:25                           ` Johan Hovold
  0 siblings, 0 replies; 53+ messages in thread
From: Johan Hovold @ 2023-06-05 12:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luis Chamberlain, Lucas De Marchi, Petr Pavlu, gregkh, rafael,
	song, lucas.de.marchi, christophe.leroy, peterz, rppt, dave,
	willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Tue, May 30, 2023 at 11:40:35AM +0200, Johan Hovold wrote:
> On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
> > On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
> > >
> > > I took a closer look at some of the modules that failed to load and
> > > noticed a pattern in that they have dependencies that are needed by more
> > > than one device.
> > 
> > Ok, this is a "maybe something like this" RFC series of two patches -
> > one trivial one to re-organize things a bit so that we can then do the
> > real one which uses a filter based on the inode pointer to return an
> > "idempotent return value" for module loads that share the same inode.
> > 
> > It's entirely untested, and since I'm on the road I'm going to not
> > really be able to test it. It compiles for me, and the code looks
> > fairly straightforward, but it's probably buggy.
> > 
> > It's very loosely based on Luis' attempt,  but it
> >  (a) is internal to module loading
> >  (b) uses a reliable cookie
> >  (c) doesn't leave the cookie around randomly for later
> >  (d) has seen absolutely no testing
> > 
> > Put another way: if somebody wants to play with this, please treat it
> > as a starting point, not the final thing. You might need to debug
> > things, and fix silly mistakes.
> 
> With the missing spinlock initialisation fixed:
> 
> -static struct spinlock idem_lock;
> +static DEFINE_SPINLOCK(idem_lock);
> 
> this passes basic smoke testing and allows the X13s to boot.
> 
> It does not seem to have any significant impact on boot time, but it
> avoids some of the unnecessary load attempts as intended:

Took another look at code which modulo some nits like missing spaces
around a + operator and an "ourselfs" typo looks good to me. So with the
spinlock initialisation fixed:

Reviewed-by: Johan Hovold <johan@kernel.org>
Tested-by: Johan Hovold <johan@kernel.org>

Johan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-06-05 11:26                                   ` David Hildenbrand
@ 2023-06-05 15:17                                     ` Luis Chamberlain
  2023-06-05 15:28                                       ` Luis Chamberlain
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2023-06-05 15:17 UTC (permalink / raw)
  To: David Hildenbrand, song
  Cc: Linus Torvalds, Johan Hovold, Lucas De Marchi, Petr Pavlu,
	gregkh, rafael, lucas.de.marchi, christophe.leroy, peterz, rppt,
	dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Mon, Jun 05, 2023 at 01:26:00PM +0200, David Hildenbrand wrote:
> I only did a single run on each kernel, should be good enough for the purpose here.
> 
> 
> 1) !debug config (not enabling KASAN)
> 
> a) master
> 
> # cat /sys/kernel/debug/modules/stats
>  Virtual mem wasted bytes       20358550424
18 GiB

>          Average mod size       217908
212 KiB

>     Average mod text size       63570
62 KiB

> b) patched
> 
> # cat /sys/kernel/debug/modules/stats
>  Virtual mem wasted bytes       0

> 2) debug config (enabling KASAN)
> 
> a) master
> 
> # cat /sys/kernel/debug/modules/stats
>  Virtual mem wasted bytes       6453862040
6 GiB

>          Average mod size       430517
420 KiB, so ballpark kasan pretty much doubles module size.

>     Average mod text size       197592
192 KiB, and is reflected on module .text too, in fact .text more than doubles.

It would have otherwise been difficult to get some of these stats, so
thanks! I make note of .text just because of the recent development work
going on for a new module_alloc(). About 14 MiB required to house a big
iron kasan enabled module .text, whereas about half is required for !kasan.

> b) patched
> 
> # cat /sys/kernel/debug/modules/stats
>  Virtual mem wasted bytes       6441296

We've gone down from ~6 GiB to ~6 MiB.

> So, with these (helpful) stats,

Extremely useful, yes thanks.

> the improvement is obvious (and explains the ~1s
> improvement I saw staring at the startup times of the udev services).
> 
> There are still some failed module loads with the debug config (only in the
> becoming state), I did not dive deeply into the actual code changes (-EBUSY),

That's fine, Linus' patch does not keep the lock for the entire life
of the module, it releases it right away after we're done with the
kernel_read(), and because of this, there is a small race in between
a thread the kernel_read() finishing and the module then being processed
into the modules linked list. During that time, if a new module with the
same name comes in, we'll have to allow it since the lock was released.
Those extra modules end up lingering to wait for the first one that made
it to the modules linked list.

I don't think we need to worry about 6 MiB, this patch alone should
suffice for a long time until userspace gets its act together and fixes
this properly. Fixing userspace should reduce some latencies as well on
bootup so someone who cares about bootup speeds on high end systems
could likely be encouraged to fix that.

> just spelling it out so we can decide if this is to be expected or some corner
> case that shouldn't be happening.

It is expected, in fact the fact that the heuristic works so well,
without keeping the lock forever, and therefore keeping the code changes
to a minimum is quite an amazing.

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-06-05 15:17                                     ` Luis Chamberlain
@ 2023-06-05 15:28                                       ` Luis Chamberlain
  2023-06-28 18:52                                         ` Luis Chamberlain
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2023-06-05 15:28 UTC (permalink / raw)
  To: David Hildenbrand, song
  Cc: Linus Torvalds, Johan Hovold, Lucas De Marchi, Petr Pavlu,
	gregkh, rafael, lucas.de.marchi, christophe.leroy, peterz, rppt,
	dave, willy, vbabka, mhocko, dave.hansen, colin.i.king,
	jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu,
	tglx, hch, patches, linux-modules, linux-mm, linux-kernel,
	pmladek, prarit, lennart

On Mon, Jun 05, 2023 at 08:17:42AM -0700, Luis Chamberlain wrote:
> We've gone down from ~6 GiB to ~6 MiB.

And just to also highlight, that was just for for the KASAN enabled case, and
for !KASAN we went from ~18 GiB to 0.

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-06-05 15:28                                       ` Luis Chamberlain
@ 2023-06-28 18:52                                         ` Luis Chamberlain
  2023-06-28 20:14                                           ` Linus Torvalds
  0 siblings, 1 reply; 53+ messages in thread
From: Luis Chamberlain @ 2023-06-28 18:52 UTC (permalink / raw)
  To: Linus Torvalds, David Hildenbrand, song
  Cc: Johan Hovold, Lucas De Marchi, Petr Pavlu, gregkh, rafael,
	lucas.de.marchi, christophe.leroy, peterz, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, tglx, hch,
	patches, linux-modules, linux-mm, linux-kernel, pmladek, prarit,
	lennart

On Mon, Jun 05, 2023 at 08:28:41AM -0700, Luis Chamberlain wrote:
> On Mon, Jun 05, 2023 at 08:17:42AM -0700, Luis Chamberlain wrote:
> > We've gone down from ~6 GiB to ~6 MiB.
> 
> And just to also highlight, that was just for for the KASAN enabled case, and
> for !KASAN we went from ~18 GiB to 0.

Linus, were you thinking of including these patches in for v6.5-rc1?

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-06-28 18:52                                         ` Luis Chamberlain
@ 2023-06-28 20:14                                           ` Linus Torvalds
  2023-06-28 22:07                                             ` Linus Torvalds
  0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2023-06-28 20:14 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: David Hildenbrand, song, Johan Hovold, Lucas De Marchi,
	Petr Pavlu, gregkh, rafael, lucas.de.marchi, christophe.leroy,
	peterz, rppt, dave, willy, vbabka, mhocko, dave.hansen,
	colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe, yujie.liu, tglx, hch, patches, linux-modules,
	linux-mm, linux-kernel, pmladek, prarit, lennart

On Wed, 28 Jun 2023 at 11:52, Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> Linus, were you thinking of including these patches in for v6.5-rc1?

Heh, I don't even have them in my tree any more, I was assuming that
if they were good they'd be sent back to me...

I guess I can go fish them out from my old emails, but I actually was
expecting to just get them from you.

I do actually maintain my own branches for *some* things, and merge
them myself, but they tend to be for areas that I feel I'm
co-maintaining (ie notably vfs and mm that I still feel I'm involved
in).

In other areas, I may send out patches, but I don't feel like I'm a
maintainer, so I then think that "the real maintainer can decide if
these patches are good or not".

And I would very much hope that people don't take said patches _just_
because they come from me. They should be judged on their own merits,
and then occasionally people can mutter "Christ, what drugs is Linus
on _today_?" and just throw my patches in the garbage.

             Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-06-28 20:14                                           ` Linus Torvalds
@ 2023-06-28 22:07                                             ` Linus Torvalds
  2023-06-28 23:17                                               ` Linus Torvalds
  0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2023-06-28 22:07 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: David Hildenbrand, song, Johan Hovold, Lucas De Marchi,
	Petr Pavlu, gregkh, rafael, lucas.de.marchi, christophe.leroy,
	peterz, rppt, dave, willy, vbabka, mhocko, dave.hansen,
	colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe, yujie.liu, tglx, hch, patches, linux-modules,
	linux-mm, linux-kernel, pmladek, prarit, lennart

On Wed, 28 Jun 2023 at 13:14, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> In other areas, I may send out patches, but I don't feel like I'm a
> maintainer, so I then think that "the real maintainer can decide if
> these patches are good or not".

Anyway, since I clearly didn't communicate clearly that I was throwing
the patch over to you, let me go and find it and just apply it myself.

           Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-06-28 22:07                                             ` Linus Torvalds
@ 2023-06-28 23:17                                               ` Linus Torvalds
  2023-06-29  0:18                                                 ` Luis Chamberlain
  0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2023-06-28 23:17 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: David Hildenbrand, song, Johan Hovold, Lucas De Marchi,
	Petr Pavlu, gregkh, rafael, lucas.de.marchi, christophe.leroy,
	peterz, rppt, dave, willy, vbabka, mhocko, dave.hansen,
	colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe, yujie.liu, tglx, hch, patches, linux-modules,
	linux-mm, linux-kernel, pmladek, prarit, lennart

On Wed, 28 Jun 2023 at 15:07, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, since I clearly didn't communicate clearly that I was throwing
> the patch over to you, let me go and find it and just apply it myself.

Ok, I spent some time writing a better commit message for that change,
tried to make sure I got everybody's comments and Tested-by's sorted
out etc, and committed the end result.

I also did the actual module pull and your sysctl pull.

Can you please check that it all looks like you expected?

              Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/2] module: add support to avoid duplicates early on load
  2023-06-28 23:17                                               ` Linus Torvalds
@ 2023-06-29  0:18                                                 ` Luis Chamberlain
  0 siblings, 0 replies; 53+ messages in thread
From: Luis Chamberlain @ 2023-06-29  0:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, song, Johan Hovold, Lucas De Marchi,
	Petr Pavlu, gregkh, rafael, lucas.de.marchi, christophe.leroy,
	peterz, rppt, dave, willy, vbabka, mhocko, dave.hansen,
	colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe, yujie.liu, tglx, hch, patches, linux-modules,
	linux-mm, linux-kernel, pmladek, prarit, lennart

On Wed, Jun 28, 2023 at 04:17:10PM -0700, Linus Torvalds wrote:
> On Wed, 28 Jun 2023 at 15:07, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Anyway, since I clearly didn't communicate clearly that I was throwing
> > the patch over to you, let me go and find it and just apply it myself.
> 
> Ok, I spent some time writing a better commit message for that change,
> tried to make sure I got everybody's comments and Tested-by's sorted
> out etc, and committed the end result.
> 
> I also did the actual module pull and your sysctl pull.
> 
> Can you please check that it all looks like you expected?

Looks good to me, thanks for doing that. It certainly was not clear
to me that you had expected me to merge your patch just because you
had previously merged your own and later reverted it. But now I know
in the future I may just pick up your patches, instead of expecting
you to shortcut it, thanks!

  Luis

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2023-06-29  0:19 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-24 21:36 [PATCH 0/2] module: avoid all memory pressure due to duplicates Luis Chamberlain
2023-05-24 21:36 ` [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection Luis Chamberlain
2023-05-24 21:52   ` Linus Torvalds
2023-05-24 21:56     ` Linus Torvalds
2023-05-24 22:07       ` Luis Chamberlain
2023-05-25  4:00     ` Linus Torvalds
2023-05-25 18:08       ` Luis Chamberlain
2023-05-25 18:35         ` Luis Chamberlain
2023-05-25 18:50         ` Linus Torvalds
2023-05-25 19:32           ` Luis Chamberlain
2023-05-25  7:01     ` Christian Brauner
2023-05-24 21:36 ` [PATCH 2/2] module: add support to avoid duplicates early on load Luis Chamberlain
2023-05-25 11:40   ` Petr Pavlu
2023-05-25 16:07     ` Linus Torvalds
2023-05-25 16:42       ` Greg KH
2023-05-25 18:22         ` Luis Chamberlain
2023-05-25 17:52       ` Linus Torvalds
2023-05-25 18:45       ` Lucas De Marchi
2023-05-25 21:12         ` Linus Torvalds
2023-05-25 22:02           ` Luis Chamberlain
2023-05-26  1:39             ` Linus Torvalds
2023-05-29  8:58               ` Johan Hovold
2023-05-29 11:00                 ` Linus Torvalds
2023-05-29 12:44                   ` Johan Hovold
2023-05-29 15:18                     ` Johan Hovold
2023-05-30  1:55                       ` Linus Torvalds
2023-05-30  9:40                         ` Johan Hovold
2023-06-05 12:25                           ` Johan Hovold
2023-05-30 16:22                         ` Luis Chamberlain
2023-05-30 17:16                           ` Lucas De Marchi
2023-05-30 19:41                             ` Luis Chamberlain
2023-05-30 22:17                               ` Linus Torvalds
2023-05-31  5:30                                 ` Lucas De Marchi
2023-05-31  0:31                           ` Luis Chamberlain
2023-05-31  7:51                           ` David Hildenbrand
2023-05-31 16:57                             ` Luis Chamberlain
2023-06-02 15:19                               ` David Hildenbrand
2023-06-02 16:04                                 ` Luis Chamberlain
2023-06-05 11:26                                   ` David Hildenbrand
2023-06-05 15:17                                     ` Luis Chamberlain
2023-06-05 15:28                                       ` Luis Chamberlain
2023-06-28 18:52                                         ` Luis Chamberlain
2023-06-28 20:14                                           ` Linus Torvalds
2023-06-28 22:07                                             ` Linus Torvalds
2023-06-28 23:17                                               ` Linus Torvalds
2023-06-29  0:18                                                 ` Luis Chamberlain
2023-06-02 16:06                                 ` Linus Torvalds
2023-06-02 16:37                                   ` David Hildenbrand
2023-05-30 22:45                         ` Dan Williams
2023-06-04 14:26                         ` Rudi Heitbaum
2023-05-29 17:47                     ` Linus Torvalds
2023-05-30 10:01                       ` Johan Hovold
2023-05-25 16:54     ` Lucas De Marchi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).