All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/2] module: fix virtual memory wasted on finit_module()
@ 2023-04-14  5:28 Luis Chamberlain
  2023-04-14  5:28 ` [RFC 1/2] module: add debugging auto-load duplicate module support Luis Chamberlain
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Luis Chamberlain @ 2023-04-14  5:28 UTC (permalink / raw)
  To: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, dave, willy, vbabka,
	mhocko, dave.hansen, colin.i.king, jim.cromie, catalin.marinas,
	jbaron, rick.p.edgecombe, mcgrof

The graph from my v3 patch series [0] which tries to resolve the virtual
memory lost bytes due to duplicates says it all:

         +----------------------------------------------------------------------------+
    14GB |-+          +            +            +           +           *+          +-|
         |                                                          ****              |
         |                                                       ***                  |
         |                                                     **                     |
    12GB |-+                                                 **                     +-|
         |                                                 **                         |
         |                                               **                           |
         |                                             **                             |
         |                                           **                               |
    10GB |-+                                       **                               +-|
         |                                       **                                   |
         |                                     **                                     |
         |                                   **                                       |
     8GB |-+                               **                                       +-|
waste    |                               **                             ###           |
         |                             **                           ####              |
         |                           **                      #######                  |
     6GB |-+                     ****                    ####                       +-|
         |                      *                    ####                             |
         |                     *                 ####                                 |
         |                *****              ####                                     |
     4GB |-+            **               ####                                       +-|
         |            **             ####                                             |
         |          **           ####                                                 |
         |        **         ####                                                     |
     2GB |-+    **      #####                                                       +-|
         |     *    ####                                                              |
         |    * ####                                                   Before ******* |
         |  **##      +            +            +           +           After ####### |
         +----------------------------------------------------------------------------+
         0            50          100          150         200          250          300
                                          CPUs count

So we really need to debug to see WTF, because really, WTF. The first
patch tries to answer the question if the issue is module auto-loading
being abused and that causing the issues. The patch proves that the
answer is no, but it does also help us find *a few* requests which can
get a bit of love to avoid duplicates. My system at least found one. So
it adds a debugging facility to let you do that.

As I was writing the commit log for my first patch series [0] I was noting
that this is it... and the obvious conclusion is that the culprit is udev
issuing requests per CPU for tons of modules. I didn't feel comfortable in
writing that this is it and we can't really do anything before really
trying hard. So I gave it a good 'ol college try. At first I wondered if
we could use file descriptor hints to just exlude users early on boot
before SYSTEM_RUNNING. I couldn't find much, but if there are some ways
to do that -- then the last patch can be simplified to do just that.
The second patch proves essentially that we can just send -EBUSY to
duplicate requests, at least for duplicate module loads and the world
doesn't fall apart. It *would* solve the issue. The patch however
borrows tons of the code from the first, and if we're realy going to
rely on something like that we may as well share. But I'm hopeful that
perhaps there are some jucier file descriptor tricks we can use to
just make a file mutually exlusivive and introduce a new kread which
lets finit_module() use that. The saving grace is that at least all
finit_module() calls *wait*, contray to request_module() calls and so
the solution can be much simpler.

The end result is 0 wasted virtual memory bytes.

Any ideas how not to make patch 2 suck as-is ?

Yes -- we can also go fix udev, or libkmod, and that's what should be
done. However, it seems silly to not fix if the fix is as trivial as
patch 2 demonstrates.

If you want to test / muck with all this you can use my branch
20230413-module-alloc-opts [1]:

[0] https://lkml.kernel.org/r/20230414050836.1984746-1-mcgrof@kernel.org
[1] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230413-module-alloc-opts

Luis Chamberlain (2):
  module: add debugging auto-load duplicate module support
  kread: avoid duplicates

 fs/kernel_read_file.c    | 150 +++++++++++++++++++++++++
 kernel/module/Kconfig    |  40 +++++++
 kernel/module/Makefile   |   1 +
 kernel/module/dups.c     | 234 +++++++++++++++++++++++++++++++++++++++
 kernel/module/internal.h |  15 +++
 kernel/module/kmod.c     |  23 +++-
 kernel/module/main.c     |   6 +-
 7 files changed, 463 insertions(+), 6 deletions(-)
 create mode 100644 kernel/module/dups.c

-- 
2.39.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC 1/2] module: add debugging auto-load duplicate module support
  2023-04-14  5:28 [RFC 0/2] module: fix virtual memory wasted on finit_module() Luis Chamberlain
@ 2023-04-14  5:28 ` Luis Chamberlain
  2023-04-14  5:28 ` [RFC 2/2] kread: avoid duplicates Luis Chamberlain
  2023-04-14 17:25 ` [RFC 0/2] module: fix virtual memory wasted on finit_module() Luis Chamberlain
  2 siblings, 0 replies; 15+ messages in thread
From: Luis Chamberlain @ 2023-04-14  5:28 UTC (permalink / raw)
  To: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, dave, willy, vbabka,
	mhocko, dave.hansen, colin.i.king, jim.cromie, catalin.marinas,
	jbaron, rick.p.edgecombe, mcgrof

This adds debugging support to the kernel module auto-loader to
easily detect and deal with duplicate module requests. To aid
with possible bootup failure issues it will supress the waste
in virtual memory when races happen before userspace loads a
module and the kernel is still issuing requests for the same
module.

Folks debugging virtual memory abuse on bootup can and should
enable this to see what WARN()s come on, to see if module
auto-loading is to blame for their woes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/Kconfig    |  40 +++++++
 kernel/module/Makefile   |   1 +
 kernel/module/dups.c     | 234 +++++++++++++++++++++++++++++++++++++++
 kernel/module/internal.h |  15 +++
 kernel/module/kmod.c     |  23 +++-
 5 files changed, 309 insertions(+), 4 deletions(-)
 create mode 100644 kernel/module/dups.c

diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
index ca277b945a67..ac7d23679799 100644
--- a/kernel/module/Kconfig
+++ b/kernel/module/Kconfig
@@ -57,6 +57,46 @@ config MODULE_STATS
 
 	  If unsure, say N.
 
+config MODULE_AUTOLOAD_SUPRESS_DUPS
+	bool "Debug duplicate modules with auto-loading"
+	help
+	  Module autoloading allows in-kernel code to request modules through
+	  the *request_module*() API calls. This in turn just calls userspace
+	  modprobe. Although modprobe checks to see if a module is already
+	  loaded before trying to load a module there is a small time window in
+	  which multiple duplicate requests can end up in userspace and multiple
+	  modprobe calls race calling finit_module() around the same time for
+	  duplicate modules. The finit_module() system call can consume in the
+	  worst case more than twice the respective module size in virtual
+	  memory for each duplicate module requests. Although duplicate module
+	  requests are non-fatal virtual memory is a limited resource and each
+	  duplicate module request ends up just wasting virtual memory.
+
+	  This debugging facility will create WARN() splats for duplicate module
+	  requests to help identify if module auto-loading is the culprit to your
+	  woes. Since virtual memory abuse caused by duplicate module requests
+	  could render a system unusable this functionality will also suppresses
+	  the waste in virtual memory caused by duplicate requests by sharing
+	  races in requests for the same module to a single unified request.
+	  Once a non-wait request_module() call completes a module should be
+	  loaded and modprobe should simply not allow new finit_module() calls.
+
+	  Enable this functionality to try to debug virtual memory abuse during
+	  boot on systems and identify if the abuse was due to module
+	  auto-loading.
+
+	  If the first module request used request_module_nowait() we cannot
+	  use that as the anchor to wait for duplicate module requests, since
+	  users of request_module() do want a proper return value. If a call
+	  for the same module happened earlier with request_module() though,
+	  then a duplicate request_module_nowait() would be detected.
+
+	  You want to enable this if you want to debug and see if duplicate
+	  module auto-loading might be causing virtual memory abuse during
+	  bootup. A kernel trace will be provided for each duplicate request.
+
+	  Disable this if you are on production.
+
 endif # MODULE_DEBUG
 
 config MODULE_FORCE_LOAD
diff --git a/kernel/module/Makefile b/kernel/module/Makefile
index 52340bce497e..e8b121ac39cf 100644
--- a/kernel/module/Makefile
+++ b/kernel/module/Makefile
@@ -10,6 +10,7 @@ KCOV_INSTRUMENT_module.o := n
 obj-y += main.o
 obj-y += strict_rwx.o
 obj-y += kmod.o
+obj-$(CONFIG_MODULE_AUTOLOAD_SUPRESS_DUPS) += dups.o
 obj-$(CONFIG_MODULE_DECOMPRESS) += decompress.o
 obj-$(CONFIG_MODULE_SIG) += signing.o
 obj-$(CONFIG_LIVEPATCH) += livepatch.o
diff --git a/kernel/module/dups.c b/kernel/module/dups.c
new file mode 100644
index 000000000000..903ab7c7e8f4
--- /dev/null
+++ b/kernel/module/dups.c
@@ -0,0 +1,234 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * kmod dups - the kernel module autoloader duplicate suppressor
+ *
+ * Copyright (C) 2023 Luis Chamberlain <mcgrof@kernel.org>
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/sched/task.h>
+#include <linux/binfmts.h>
+#include <linux/syscalls.h>
+#include <linux/unistd.h>
+#include <linux/kmod.h>
+#include <linux/slab.h>
+#include <linux/completion.h>
+#include <linux/cred.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/workqueue.h>
+#include <linux/security.h>
+#include <linux/mount.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/resource.h>
+#include <linux/notifier.h>
+#include <linux/suspend.h>
+#include <linux/rwsem.h>
+#include <linux/ptrace.h>
+#include <linux/async.h>
+#include <linux/uaccess.h>
+
+DEFINE_MUTEX(kmod_dup_mutex);
+static LIST_HEAD(dup_kmod_reqs);
+
+struct kmod_dup_req {
+	struct list_head list;
+	char name[MODULE_NAME_LEN];
+	struct completion first_req_done;
+	struct work_struct complete_work;
+	struct delayed_work delete_work;
+	int dup_ret;
+};
+
+static struct kmod_dup_req *kmod_dup_request_lookup(char *module_name)
+{
+	struct kmod_dup_req *kmod_req;
+
+	list_for_each_entry_rcu(kmod_req, &dup_kmod_reqs, list,
+				lockdep_is_held(&kmod_dup_mutex)) {
+		if (strlen(kmod_req->name) == strlen(module_name) &&
+		    !memcmp(kmod_req->name, module_name, strlen(module_name))) {
+			return kmod_req;
+                }
+        }
+
+	return NULL;
+}
+
+static void kmod_dup_request_delete(struct work_struct *work)
+{
+	struct kmod_dup_req *kmod_req;
+	kmod_req = container_of(to_delayed_work(work), struct kmod_dup_req, delete_work);
+
+	/*
+	 * The typical situation is a module successully loaded. In that
+	 * situation the module will be present already in userspace. If
+	 * new requests come in after that, userspace will already know the
+	 * module is loaded so will just return 0 right away. There is still
+	 * a small chance right after we delete this entry new request_module()
+	 * calls may happen after that, they can happen. These heuristics
+	 * are to protect finit_module() abuse for auto-loading, if modules
+	 * are still tryign to auto-load even if a module is already loaded,
+	 * that's on them, and those inneficiencies should not be fixed by
+	 * kmod. The inneficies there are a call to modprobe and modprobe
+	 * just returning 0.
+	 */
+	mutex_lock(&kmod_dup_mutex);
+	list_del_rcu(&kmod_req->list);
+	synchronize_rcu();
+	mutex_unlock(&kmod_dup_mutex);
+	kfree(kmod_req);
+}
+
+static void kmod_dup_request_complete(struct work_struct *work)
+{
+	struct kmod_dup_req *kmod_req;
+
+	kmod_req = container_of(work, struct kmod_dup_req, complete_work);
+
+	/*
+	 * This will ensure that the kernel will let all the waiters get
+	 * informed its time to check the return value. It's time to
+	 * go home.
+	 */
+	complete_all(&kmod_req->first_req_done);
+
+	/*
+	 * Now that we have allowed prior request_module() calls to go on
+	 * with life, let's schedule deleting this entry. We don't have
+	 * to do it right away, but we *eventually* want to do it so to not
+	 * let this linger forever as this is just a boot optimization for
+	 * possible abuses of vmalloc() incurred by finit_module() thrashing.
+	 */
+	queue_delayed_work(system_wq, &kmod_req->delete_work, 60 * HZ);
+}
+
+bool kmod_dup_request_exists_wait(char *module_name, bool wait, int *dup_ret)
+{
+	struct kmod_dup_req *kmod_req, *new_kmod_req;
+	int ret;
+
+	/*
+	 * Pre-allocate the entry in case we have to use it later
+	 * to avoid contention with the mutex.
+	 */
+	new_kmod_req = kzalloc(sizeof(*new_kmod_req), GFP_KERNEL);
+	if (!new_kmod_req)
+		return false;
+
+	memcpy(new_kmod_req->name, module_name, strlen(module_name));
+	INIT_WORK(&new_kmod_req->complete_work, kmod_dup_request_complete);
+	INIT_DELAYED_WORK(&new_kmod_req->delete_work, kmod_dup_request_delete);
+	init_completion(&new_kmod_req->first_req_done);
+
+	mutex_lock(&kmod_dup_mutex);
+
+	kmod_req = kmod_dup_request_lookup(module_name);
+	if (!kmod_req) {
+		/*
+		 * If the first request that came through for a module
+		 * was with request_module_nowait() we cannot wait for it
+		 * and share its return value with other users which may
+		 * have used request_module() and need a proper return value
+		 * so just skip using them as an anchor.
+		 *
+		 * If a prior request to this one came through with
+		 * request_module() though, then a request_module_nowait()
+		 * would benefit from duplicate detection.
+		 */
+		if (!wait) {
+			kfree(new_kmod_req);
+			pr_warn("New request_module_nowait() for %s -- cannot track duplicates for this request\n", module_name);
+			mutex_unlock(&kmod_dup_mutex);
+			return false;
+		}
+
+		/*
+		 * There was no duplicate, just add the request so we can
+		 * keep tab on duplicates later.
+		 */
+		pr_info("New request_module() for %s\n", module_name);
+		list_add_rcu(&new_kmod_req->list, &dup_kmod_reqs);
+		mutex_unlock(&kmod_dup_mutex);
+		return false;
+	}
+	mutex_unlock(&kmod_dup_mutex);
+
+	/* We are dealing with a duplicate request now */
+
+	kfree(new_kmod_req);
+
+	/*
+	 * To fix these try to use try_then_request_module() instead as that
+	 * will check if the component you are looking for is present or not.
+	 * You could also just queue a single request to load the module once,
+	 * instead of having each and everything you need try to request for
+	 * the module.
+	 *
+	 * Duplicate request_module() calls  can cause quite a bit of wasted
+	 * vmalloc() space when racing with userspace.
+	 */
+	WARN(1, "module-autoload: duplicate request for module %s\n", module_name);
+
+	if (!wait) {
+		/*
+		 * If request_module_nowait() was used then the user just
+		 * wanted to issue the request and if another module request
+		 * was already its way with the same name we don't care for
+		 * the return value either. Let duplicate request_module_nowait()
+		 * calls bail out right away.
+		 */
+		*dup_ret = 0;
+		return true;
+	}
+
+	/*
+	 * If a duplicate request_module() was used they *may* care for
+	 * the return value, so we have no other option but to wait for
+	 * the first caller to complete. If the first caller used
+	 * the request_module_nowait() call, subsquent callers will
+	 * deal with the comprmise of getting a successful call with this
+	 * optimization enabled ...
+	 */
+	ret = wait_for_completion_state(&kmod_req->first_req_done,
+					TASK_UNINTERRUPTIBLE | TASK_KILLABLE);
+	if (ret) {
+		*dup_ret = ret;
+		return true;
+	}
+
+	/* Now the duplicate request has the same exact return value as the first request */
+	*dup_ret = kmod_req->dup_ret;
+
+	return true;
+}
+
+void kmod_dup_request_announce(char *module_name, int ret)
+{
+	struct kmod_dup_req *kmod_req;
+
+	mutex_lock(&kmod_dup_mutex);
+
+	kmod_req = kmod_dup_request_lookup(module_name);
+	if (!kmod_req)
+		goto out;
+
+	kmod_req->dup_ret = ret;
+
+	/*
+	 * If we complete() here we may allow duplicate threads
+	 * to continue before the first one that submitted the
+	 * request. We're in no rush also, given that each and
+	 * every bounce back to userspace is slow we avoid that
+	 * with a slight delay here. So queueue up the completion
+	 * and let duplicates suffer, just wait a tad bit longer.
+	 * There is no rush. But we also don't want to hold the
+	 * caller up forever or introduce any boot delays.
+	 */
+	queue_work(system_wq, &kmod_req->complete_work);
+
+out:
+	mutex_unlock(&kmod_dup_mutex);
+}
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 9d97a59a9127..67d750549a44 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -3,6 +3,7 @@
  *
  * Copyright (C) 2012 Red Hat, Inc. All Rights Reserved.
  * Written by David Howells (dhowells@redhat.com)
+ * Copyright (C) 2023 Luis Chamberlain <mcgrof@kernel.org>
  */
 
 #include <linux/elf.h>
@@ -217,6 +218,20 @@ static inline void mod_stat_bump_becoming(struct load_info *info, int flags)
 
 #endif /* CONFIG_MODULE_STATS */
 
+#ifdef CONFIG_MODULE_AUTOLOAD_SUPRESS_DUPS
+bool kmod_dup_request_exists_wait(char *module_name, bool wait, int *dup_ret);
+void kmod_dup_request_announce(char *module_name, int ret);
+#else
+static inline bool kmod_dup_request_exists_wait(char *module_name, bool wait, int *dup_ret)
+{
+	return false;
+}
+
+static inline void kmod_dup_request_announce(char *module_name, int ret)
+{
+}
+#endif
+
 #ifdef CONFIG_MODULE_UNLOAD_TAINT_TRACKING
 struct mod_unload_taint {
 	struct list_head list;
diff --git a/kernel/module/kmod.c b/kernel/module/kmod.c
index 5899083436a3..0800d9891692 100644
--- a/kernel/module/kmod.c
+++ b/kernel/module/kmod.c
@@ -1,6 +1,9 @@
 /*
  * kmod - the kernel module loader
+ *
+ * Copyright (C) 2023 Luis Chamberlain <mcgrof@kernel.org>
  */
+
 #include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/sched/task.h>
@@ -27,6 +30,7 @@
 #include <linux/uaccess.h>
 
 #include <trace/events/module.h>
+#include "internal.h"
 
 /*
  * Assuming:
@@ -65,7 +69,7 @@ static void free_modprobe_argv(struct subprocess_info *info)
 	kfree(info->argv);
 }
 
-static int call_modprobe(char *module_name, int wait)
+static int call_modprobe(char *orig_module_name, int wait)
 {
 	struct subprocess_info *info;
 	static char *envp[] = {
@@ -74,12 +78,14 @@ static int call_modprobe(char *module_name, int wait)
 		"PATH=/sbin:/usr/sbin:/bin:/usr/bin",
 		NULL
 	};
+	char *module_name;
+	int ret;
 
 	char **argv = kmalloc(sizeof(char *[5]), GFP_KERNEL);
 	if (!argv)
 		goto out;
 
-	module_name = kstrdup(module_name, GFP_KERNEL);
+	module_name = kstrdup(orig_module_name, GFP_KERNEL);
 	if (!module_name)
 		goto free_argv;
 
@@ -94,13 +100,16 @@ static int call_modprobe(char *module_name, int wait)
 	if (!info)
 		goto free_module_name;
 
-	return call_usermodehelper_exec(info, wait | UMH_KILLABLE);
+	ret = call_usermodehelper_exec(info, wait | UMH_KILLABLE);
+	kmod_dup_request_announce(orig_module_name, ret);
+	return ret;
 
 free_module_name:
 	kfree(module_name);
 free_argv:
 	kfree(argv);
 out:
+	kmod_dup_request_announce(orig_module_name, -ENOMEM);
 	return -ENOMEM;
 }
 
@@ -124,7 +133,7 @@ int __request_module(bool wait, const char *fmt, ...)
 {
 	va_list args;
 	char module_name[MODULE_NAME_LEN];
-	int ret;
+	int ret, dup_ret;
 
 	/*
 	 * We don't allow synchronous module loading from async.  Module
@@ -156,8 +165,14 @@ int __request_module(bool wait, const char *fmt, ...)
 
 	trace_module_request(module_name, wait, _RET_IP_);
 
+	if (kmod_dup_request_exists_wait(module_name, wait, &dup_ret)) {
+		ret = dup_ret;
+		goto out;
+	}
+
 	ret = call_modprobe(module_name, wait ? UMH_WAIT_PROC : UMH_WAIT_EXEC);
 
+out:
 	up(&kmod_concurrent_max);
 
 	return ret;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC 2/2] kread: avoid duplicates
  2023-04-14  5:28 [RFC 0/2] module: fix virtual memory wasted on finit_module() Luis Chamberlain
  2023-04-14  5:28 ` [RFC 1/2] module: add debugging auto-load duplicate module support Luis Chamberlain
@ 2023-04-14  5:28 ` Luis Chamberlain
  2023-04-14  6:35   ` Greg KH
  2023-04-16  6:04   ` Christoph Hellwig
  2023-04-14 17:25 ` [RFC 0/2] module: fix virtual memory wasted on finit_module() Luis Chamberlain
  2 siblings, 2 replies; 15+ messages in thread
From: Luis Chamberlain @ 2023-04-14  5:28 UTC (permalink / raw)
  To: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, dave, willy, vbabka,
	mhocko, dave.hansen, colin.i.king, jim.cromie, catalin.marinas,
	jbaron, rick.p.edgecombe, mcgrof

With this we run into 0 wasted virtual memory bytes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/kernel_read_file.c | 150 ++++++++++++++++++++++++++++++++++++++++++
 kernel/module/main.c  |   6 +-
 2 files changed, 154 insertions(+), 2 deletions(-)

diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 5d826274570c..209c56764442 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -4,6 +4,7 @@
 #include <linux/kernel_read_file.h>
 #include <linux/security.h>
 #include <linux/vmalloc.h>
+#include <linux/fdtable.h>
 
 /**
  * kernel_read_file() - read file contents into a kernel buffer
@@ -171,17 +172,166 @@ ssize_t kernel_read_file_from_path_initns(const char *path, loff_t offset,
 }
 EXPORT_SYMBOL_GPL(kernel_read_file_from_path_initns);
 
+DEFINE_MUTEX(kread_dup_mutex);
+static LIST_HEAD(kread_dup_reqs);
+
+struct kread_dup_req {
+	struct list_head list;
+	char name[PATH_MAX];
+	struct completion first_req_done;
+	struct work_struct complete_work;
+	struct delayed_work delete_work;
+	int dup_ret;
+};
+
+static struct kread_dup_req *kread_dup_request_lookup(char *name)
+{
+	struct kread_dup_req *kread_req;
+
+	list_for_each_entry_rcu(kread_req, &kread_dup_reqs, list,
+				lockdep_is_held(&kread_dup_mutex)) {
+		if (strlen(kread_req->name) == strlen(name) &&
+		    !memcmp(kread_req->name, name, strlen(name))) {
+			return kread_req;
+                }
+        }
+
+	return NULL;
+}
+
+static void kread_dup_request_delete(struct work_struct *work)
+{
+	struct kread_dup_req *kread_req;
+	kread_req = container_of(to_delayed_work(work), struct kread_dup_req, delete_work);
+
+	mutex_lock(&kread_dup_mutex);
+	list_del_rcu(&kread_req->list);
+	synchronize_rcu();
+	mutex_unlock(&kread_dup_mutex);
+	kfree(kread_req);
+}
+
+static void kread_dup_request_complete(struct work_struct *work)
+{
+	struct kread_dup_req *kread_req;
+
+	kread_req = container_of(work, struct kread_dup_req, complete_work);
+
+	complete_all(&kread_req->first_req_done);
+	queue_delayed_work(system_wq, &kread_req->delete_work, 60 * HZ);
+}
+
+static bool kread_dup_request_exists_wait(char *name, int *dup_ret)
+{
+	struct kread_dup_req *kread_req, *new_kread_req;
+	int ret;
+
+	/*
+	 * Pre-allocate the entry in case we have to use it later
+	 * to avoid contention with the mutex.
+	 */
+	new_kread_req = kzalloc(sizeof(*new_kread_req), GFP_KERNEL);
+	if (!new_kread_req)
+		return false;
+
+	memcpy(new_kread_req->name, name, strlen(name));
+	INIT_WORK(&new_kread_req->complete_work, kread_dup_request_complete);
+	INIT_DELAYED_WORK(&new_kread_req->delete_work, kread_dup_request_delete);
+	init_completion(&new_kread_req->first_req_done);
+
+	mutex_lock(&kread_dup_mutex);
+
+	kread_req = kread_dup_request_lookup(name);
+	if (!kread_req) {
+		/*
+		 * There was no duplicate, just add the request so we can
+		 * keep tab on duplicates later.
+		 */
+		//pr_info("New kread request for %s\n", name);
+		list_add_rcu(&new_kread_req->list, &kread_dup_reqs);
+		mutex_unlock(&kread_dup_mutex);
+		return false;
+	}
+	mutex_unlock(&kread_dup_mutex);
+
+	/* We are dealing with a duplicate request now */
+
+	kfree(new_kread_req);
+
+	//pr_warn("kread: duplicate request for file %s\n", name);
+
+	ret = wait_for_completion_state(&kread_req->first_req_done,
+					TASK_UNINTERRUPTIBLE | TASK_KILLABLE);
+	if (ret) {
+		*dup_ret = ret;
+		return true;
+	}
+
+	/* breath */
+	schedule_timeout(2*HZ);
+
+	*dup_ret = kread_req->dup_ret;
+
+	return true;
+}
+
+void kread_dup_request_announce(char *name, int ret)
+{
+	struct kread_dup_req *kread_req;
+
+	mutex_lock(&kread_dup_mutex);
+
+	kread_req = kread_dup_request_lookup(name);
+	if (!kread_req)
+		goto out;
+
+	kread_req->dup_ret = ret;
+
+	/*
+	 * If we complete() here we may allow duplicate threads
+	 * to continue before the first one that submitted the
+	 * request. We're in no rush but avoid boot delays caused
+	 * by these threads waiting too long.
+	 */
+	queue_work(system_wq, &kread_req->complete_work);
+
+out:
+	mutex_unlock(&kread_dup_mutex);
+}
+
 ssize_t kernel_read_file_from_fd(int fd, loff_t offset, void **buf,
 				 size_t buf_size, size_t *file_size,
 				 enum kernel_read_file_id id)
 {
 	struct fd f = fdget(fd);
 	ssize_t ret = -EBADF;
+	char *name, *path;
+	int dup_ret;
 
 	if (!f.file || !(f.file->f_mode & FMODE_READ))
 		goto out;
 
+	path = kzalloc(PATH_MAX, GFP_KERNEL);
+	if (!path)
+		return -ENOMEM;
+
+	name = file_path(f.file, path, PATH_MAX);
+	if (IS_ERR(name)) {
+		ret = PTR_ERR(name);
+		goto out_mem;
+	}
+
+	if (kread_dup_request_exists_wait(name, &dup_ret)) {
+		ret = -EBUSY;
+		goto out_mem;
+	}
+
 	ret = kernel_read_file(f.file, offset, buf, buf_size, file_size, id);
+
+	kread_dup_request_announce(name, ret);
+
+out_mem:
+	kfree(path);
 out:
 	fdput(f);
 	return ret;
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 1ed373145278..e99419b4d85c 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -3080,8 +3080,10 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 	len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
 				       READING_MODULE);
 	if (len < 0) {
-		mod_stat_inc(&failed_kreads);
-		mod_stat_add_long(len, &invalid_kread_bytes);
+		if (len != -EBUSY) {
+			mod_stat_inc(&failed_kreads);
+			mod_stat_add_long(len, &invalid_kread_bytes);
+		}
 		return len;
 	}
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC 2/2] kread: avoid duplicates
  2023-04-14  5:28 ` [RFC 2/2] kread: avoid duplicates Luis Chamberlain
@ 2023-04-14  6:35   ` Greg KH
  2023-04-14 16:35     ` Luis Chamberlain
  2023-04-16  6:04   ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Greg KH @ 2023-04-14  6:35 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, rafael, christophe.leroy, tglx,
	peterz, song, rppt, dave, willy, vbabka, mhocko, dave.hansen,
	colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe

On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> With this we run into 0 wasted virtual memory bytes.

This changelog does not make any sense at all, sorry.  What are you
doing here and why?

> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  fs/kernel_read_file.c | 150 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/module/main.c  |   6 +-
>  2 files changed, 154 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
> index 5d826274570c..209c56764442 100644
> --- a/fs/kernel_read_file.c
> +++ b/fs/kernel_read_file.c
> @@ -4,6 +4,7 @@
>  #include <linux/kernel_read_file.h>
>  #include <linux/security.h>
>  #include <linux/vmalloc.h>
> +#include <linux/fdtable.h>
>  
>  /**
>   * kernel_read_file() - read file contents into a kernel buffer
> @@ -171,17 +172,166 @@ ssize_t kernel_read_file_from_path_initns(const char *path, loff_t offset,
>  }
>  EXPORT_SYMBOL_GPL(kernel_read_file_from_path_initns);
>  
> +DEFINE_MUTEX(kread_dup_mutex);

static?

I stopped reading here :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 2/2] kread: avoid duplicates
  2023-04-14  6:35   ` Greg KH
@ 2023-04-14 16:35     ` Luis Chamberlain
  0 siblings, 0 replies; 15+ messages in thread
From: Luis Chamberlain @ 2023-04-14 16:35 UTC (permalink / raw)
  To: Greg KH
  Cc: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, rafael, christophe.leroy, tglx,
	peterz, song, rppt, dave, willy, vbabka, mhocko, dave.hansen,
	colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe

On Fri, Apr 14, 2023 at 08:35:53AM +0200, Greg KH wrote:
> On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> > With this we run into 0 wasted virtual memory bytes.
> 
> This changelog does not make any sense at all, sorry.  What are you
> doing here and why?

It's an RFC and the cover letter described the motivation looking for
ideas for an alternative, and it is the reason I was so terse on the
commit log.

  Luis

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 0/2] module: fix virtual memory wasted on finit_module()
  2023-04-14  5:28 [RFC 0/2] module: fix virtual memory wasted on finit_module() Luis Chamberlain
  2023-04-14  5:28 ` [RFC 1/2] module: add debugging auto-load duplicate module support Luis Chamberlain
  2023-04-14  5:28 ` [RFC 2/2] kread: avoid duplicates Luis Chamberlain
@ 2023-04-14 17:25 ` Luis Chamberlain
  2 siblings, 0 replies; 15+ messages in thread
From: Luis Chamberlain @ 2023-04-14 17:25 UTC (permalink / raw)
  To: Kees Cook, Christoph Hellwig, david, patches, linux-modules,
	linux-mm, linux-kernel, pmladek, petr.pavlu, prarit, torvalds,
	gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, dave, willy, vbabka,
	mhocko, dave.hansen, colin.i.king, jim.cromie, catalin.marinas,
	jbaron, rick.p.edgecombe

On Thu, Apr 13, 2023 at 10:28:38PM -0700, Luis Chamberlain wrote:
> At first I wondered if
> we could use file descriptor hints to just exlude users early on boot
> before SYSTEM_RUNNING. I couldn't find much, but if there are some ways
> to do that -- then the last patch can be simplified to do just that.
> The second patch proves essentially that we can just send -EBUSY to
> duplicate requests, at least for duplicate module loads and the world
> doesn't fall apart. It *would* solve the issue. The patch however
> borrows tons of the code from the first, and if we're realy going to
> rely on something like that we may as well share. But I'm hopeful that
> perhaps there are some jucier file descriptor tricks we can use to
> just make a file mutually exlusivive and introduce a new kread which
> lets finit_module() use that. The saving grace is that at least all
> finit_module() calls *wait*, contray to request_module() calls and so
> the solution can be much simpler.
> 
> The end result is 0 wasted virtual memory bytes.
> 
> Any ideas how not to make patch 2 suck as-is ?

The more I think about it, a file descriptor based approach would
have to loop over all tasks as we're not sure if userspace would
use the same thread and just fork requests or what. One would
then have to loop over all tasks and look for files that match the
same name. This approach is domain specific to internal kernel reads
and I am thinking now it is more appropriate.

Since this is a bootup issue for races with module loading what we
could do is just have say kernel_read_file_from_fd_excl() which
would just call a shared __kernel_read_file_from_fd(..., bool excl, ...)
and the kernel_read_file_from_fd() could just set that excl to false
while the new one sets it to true. Then finit_module() could just use
that.

  Luis

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 2/2] kread: avoid duplicates
  2023-04-14  5:28 ` [RFC 2/2] kread: avoid duplicates Luis Chamberlain
  2023-04-14  6:35   ` Greg KH
@ 2023-04-16  6:04   ` Christoph Hellwig
  2023-04-16  6:41     ` Luis Chamberlain
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2023-04-16  6:04 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael, christophe.leroy,
	tglx, peterz, song, rppt, dave, willy, vbabka, mhocko,
	dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe

On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> With this we run into 0 wasted virtual memory bytes.

Avoid what duplicates?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 2/2] kread: avoid duplicates
  2023-04-16  6:04   ` Christoph Hellwig
@ 2023-04-16  6:41     ` Luis Chamberlain
  2023-04-16 12:50       ` Greg KH
  2023-04-17 17:33       ` Edgecombe, Rick P
  0 siblings, 2 replies; 15+ messages in thread
From: Luis Chamberlain @ 2023-04-16  6:41 UTC (permalink / raw)
  To: Christoph Hellwig, Kees Cook
  Cc: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael, christophe.leroy,
	tglx, peterz, song, rppt, dave, willy, vbabka, mhocko,
	dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron,
	rick.p.edgecombe

On Sat, Apr 15, 2023 at 11:04:12PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> > With this we run into 0 wasted virtual memory bytes.
> 
> Avoid what duplicates?

David Hildenbrand had reported that with over 400 CPUs vmap space
runs out and it seems it was related to module loading. I took a
look and confirmed it. Module loading ends up requiring in the
worst case 3 vmalloc allocations, so typically at least twice
the size of the module size and in the worst case just add
the decompressed module size:

a) initial kernel_read*() call
b) optional module decompression
c) the actual module data copy we will keep

Duplicate module requests that come from userspace end up being thrown
in the trash bin, as only one module will be allocated.  Although there
are checks for a module prior to requesting a module udev still doesn't
do the best of a job to avoid that and so we end up with tons of
duplicate module requests. We're talking about gigabytes of vmalloc
bytes just lost because of this for large systems and megabytes for
average systems. So for example with just 255 CPUs we can loose about
13.58 GiB, and for 8 CPUs about 226.53 MiB.

I have patches to curtail 1/2 of that space by doing a check in kernel
before we do the allocation in c) if the module is already present. For
a) it is harder because userspace just passes a file descriptor. But
since we can get the file path without the vmalloc this RFC suggest
maybe we can add a new kernel_read*() for module loading where it makes
sense to have only one read happen at a time.

  Luis

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 2/2] kread: avoid duplicates
  2023-04-16  6:41     ` Luis Chamberlain
@ 2023-04-16 12:50       ` Greg KH
  2023-04-16 18:46         ` Luis Chamberlain
  2023-04-17 17:33       ` Edgecombe, Rick P
  1 sibling, 1 reply; 15+ messages in thread
From: Greg KH @ 2023-04-16 12:50 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Kees Cook, david, patches, linux-modules,
	linux-mm, linux-kernel, pmladek, petr.pavlu, prarit, torvalds,
	rafael, christophe.leroy, tglx, peterz, song, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe

On Sat, Apr 15, 2023 at 11:41:28PM -0700, Luis Chamberlain wrote:
> On Sat, Apr 15, 2023 at 11:04:12PM -0700, Christoph Hellwig wrote:
> > On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> > > With this we run into 0 wasted virtual memory bytes.
> > 
> > Avoid what duplicates?
> 
> David Hildenbrand had reported that with over 400 CPUs vmap space
> runs out and it seems it was related to module loading. I took a
> look and confirmed it. Module loading ends up requiring in the
> worst case 3 vmalloc allocations, so typically at least twice
> the size of the module size and in the worst case just add
> the decompressed module size:
> 
> a) initial kernel_read*() call
> b) optional module decompression
> c) the actual module data copy we will keep
> 
> Duplicate module requests that come from userspace end up being thrown
> in the trash bin, as only one module will be allocated.  Although there
> are checks for a module prior to requesting a module udev still doesn't
> do the best of a job to avoid that and so we end up with tons of
> duplicate module requests. We're talking about gigabytes of vmalloc
> bytes just lost because of this for large systems and megabytes for
> average systems. So for example with just 255 CPUs we can loose about
> 13.58 GiB, and for 8 CPUs about 226.53 MiB.

How does the memory get "lost"?  Shouldn't it be properly freed when the
duplicate module load fails?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 2/2] kread: avoid duplicates
  2023-04-16 12:50       ` Greg KH
@ 2023-04-16 18:46         ` Luis Chamberlain
  2023-04-17  6:05           ` Greg KH
  0 siblings, 1 reply; 15+ messages in thread
From: Luis Chamberlain @ 2023-04-16 18:46 UTC (permalink / raw)
  To: Greg KH
  Cc: Christoph Hellwig, Kees Cook, david, patches, linux-modules,
	linux-mm, linux-kernel, pmladek, petr.pavlu, prarit, torvalds,
	rafael, christophe.leroy, tglx, peterz, song, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe

On Sun, Apr 16, 2023 at 02:50:01PM +0200, Greg KH wrote:
> On Sat, Apr 15, 2023 at 11:41:28PM -0700, Luis Chamberlain wrote:
> > On Sat, Apr 15, 2023 at 11:04:12PM -0700, Christoph Hellwig wrote:
> > > On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> > > > With this we run into 0 wasted virtual memory bytes.
> > > 
> > > Avoid what duplicates?
> > 
> > David Hildenbrand had reported that with over 400 CPUs vmap space
> > runs out and it seems it was related to module loading. I took a
> > look and confirmed it. Module loading ends up requiring in the
> > worst case 3 vmalloc allocations, so typically at least twice
> > the size of the module size and in the worst case just add
> > the decompressed module size:
> > 
> > a) initial kernel_read*() call
> > b) optional module decompression
> > c) the actual module data copy we will keep
> > 
> > Duplicate module requests that come from userspace end up being thrown
> > in the trash bin, as only one module will be allocated.  Although there
> > are checks for a module prior to requesting a module udev still doesn't
> > do the best of a job to avoid that and so we end up with tons of
> > duplicate module requests. We're talking about gigabytes of vmalloc
> > bytes just lost because of this for large systems and megabytes for
> > average systems. So for example with just 255 CPUs we can loose about
> > 13.58 GiB, and for 8 CPUs about 226.53 MiB.
> 
> How does the memory get "lost"?  Shouldn't it be properly freed when the
> duplicate module load fails?

Yes memory gets freed, but since virtual memory space can be limitted it
also means you can end up eventually getting to the point -ENOMEMs will
happen as you have more CPUS and you cannot use virtual memory for other
things during kernel bootup and bootup fails. This is apparently
exacerbated with KASAN enabled.

  Luis

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 2/2] kread: avoid duplicates
  2023-04-16 18:46         ` Luis Chamberlain
@ 2023-04-17  6:05           ` Greg KH
  2023-04-17 22:05             ` Luis Chamberlain
  0 siblings, 1 reply; 15+ messages in thread
From: Greg KH @ 2023-04-17  6:05 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Kees Cook, david, patches, linux-modules,
	linux-mm, linux-kernel, pmladek, petr.pavlu, prarit, torvalds,
	rafael, christophe.leroy, tglx, peterz, song, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe

On Sun, Apr 16, 2023 at 11:46:44AM -0700, Luis Chamberlain wrote:
> On Sun, Apr 16, 2023 at 02:50:01PM +0200, Greg KH wrote:
> > On Sat, Apr 15, 2023 at 11:41:28PM -0700, Luis Chamberlain wrote:
> > > On Sat, Apr 15, 2023 at 11:04:12PM -0700, Christoph Hellwig wrote:
> > > > On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> > > > > With this we run into 0 wasted virtual memory bytes.
> > > > 
> > > > Avoid what duplicates?
> > > 
> > > David Hildenbrand had reported that with over 400 CPUs vmap space
> > > runs out and it seems it was related to module loading. I took a
> > > look and confirmed it. Module loading ends up requiring in the
> > > worst case 3 vmalloc allocations, so typically at least twice
> > > the size of the module size and in the worst case just add
> > > the decompressed module size:
> > > 
> > > a) initial kernel_read*() call
> > > b) optional module decompression
> > > c) the actual module data copy we will keep
> > > 
> > > Duplicate module requests that come from userspace end up being thrown
> > > in the trash bin, as only one module will be allocated.  Although there
> > > are checks for a module prior to requesting a module udev still doesn't
> > > do the best of a job to avoid that and so we end up with tons of
> > > duplicate module requests. We're talking about gigabytes of vmalloc
> > > bytes just lost because of this for large systems and megabytes for
> > > average systems. So for example with just 255 CPUs we can loose about
> > > 13.58 GiB, and for 8 CPUs about 226.53 MiB.
> > 
> > How does the memory get "lost"?  Shouldn't it be properly freed when the
> > duplicate module load fails?
> 
> Yes memory gets freed, but since virtual memory space can be limitted it
> also means you can end up eventually getting to the point -ENOMEMs will
> happen as you have more CPUS and you cannot use virtual memory for other
> things during kernel bootup and bootup fails. This is apparently
> exacerbated with KASAN enabled.

Then why not just rate-limit the module loader in userspace on such
large systems if that's an issue?  No kernel changes needed to do that.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 2/2] kread: avoid duplicates
  2023-04-16  6:41     ` Luis Chamberlain
  2023-04-16 12:50       ` Greg KH
@ 2023-04-17 17:33       ` Edgecombe, Rick P
  2023-04-17 22:08         ` Luis Chamberlain
  1 sibling, 1 reply; 15+ messages in thread
From: Edgecombe, Rick P @ 2023-04-17 17:33 UTC (permalink / raw)
  To: keescook, mcgrof, hch
  Cc: prarit, rppt, catalin.marinas, Torvalds, Linus, willy, song,
	patches, pmladek, david, colin.i.king, linux-kernel, dave.hansen,
	jim.cromie, vbabka, christophe.leroy, linux-mm, tglx, jbaron,
	peterz, linux-modules, gregkh, petr.pavlu, rafael, Hocko, Michal,
	dave

On Sat, 2023-04-15 at 23:41 -0700, Luis Chamberlain wrote:
> On Sat, Apr 15, 2023 at 11:04:12PM -0700, Christoph Hellwig wrote:
> > On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> > > With this we run into 0 wasted virtual memory bytes.
> > 
> > Avoid what duplicates?
> 
> David Hildenbrand had reported that with over 400 CPUs vmap space
> runs out and it seems it was related to module loading. I took a
> look and confirmed it. Module loading ends up requiring in the
> worst case 3 vmalloc allocations, so typically at least twice
> the size of the module size and in the worst case just add
> the decompressed module size:
> 
> a) initial kernel_read*() call
> b) optional module decompression
> c) the actual module data copy we will keep
> 
> Duplicate module requests that come from userspace end up being
> thrown
> in the trash bin, as only one module will be allocated.  Although
> there
> are checks for a module prior to requesting a module udev still
> doesn't
> do the best of a job to avoid that and so we end up with tons of
> duplicate module requests. We're talking about gigabytes of vmalloc
> bytes just lost because of this for large systems and megabytes for
> average systems. So for example with just 255 CPUs we can loose about
> 13.58 GiB, and for 8 CPUs about 226.53 MiB.
> 
> I have patches to curtail 1/2 of that space by doing a check in
> kernel
> before we do the allocation in c) if the module is already present.
> For
> a) it is harder because userspace just passes a file descriptor. But
> since we can get the file path without the vmalloc this RFC suggest
> maybe we can add a new kernel_read*() for module loading where it
> makes
> sense to have only one read happen at a time.

I'm wondering how difficult it would be to just try to remove the
vmallocs in (a) and (b) and operate on a list of pages.

So the operations before module_patient_check_exists() are now:
1. decompressing (vmalloc)
2. signature check (vmalloc)
3. elf_validity_cache_copy()
4. early_mod_check() -> module_patient_check_exists()

Right? Then after that a bunch of arch code and other code outside of
modules operates on the vmalloc, so this other code would take a large
amount of changes to switch to a list of pages.

But did you consider teaching just 1-3 to operate on a list of pages?
And then move module_patient_check_exists() a little up in the order?
After module_patient_check_exists() you could vmap() the pages and hand
it off to the existing code located all over the place.

Then you can catch the duplicate requests before any vmalloc happens.
It also takes (a) and (b) down to one vmalloc even in the normal case,
as a side benefit. The changes to the signature check part might be
tricky though.

Sorry if this idea is off, I've got a little confused as this series
split into all these offshoots series.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 2/2] kread: avoid duplicates
  2023-04-17  6:05           ` Greg KH
@ 2023-04-17 22:05             ` Luis Chamberlain
  0 siblings, 0 replies; 15+ messages in thread
From: Luis Chamberlain @ 2023-04-17 22:05 UTC (permalink / raw)
  To: Greg KH
  Cc: Christoph Hellwig, Kees Cook, david, patches, linux-modules,
	linux-mm, linux-kernel, pmladek, petr.pavlu, prarit, torvalds,
	rafael, christophe.leroy, tglx, peterz, song, rppt, dave, willy,
	vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie,
	catalin.marinas, jbaron, rick.p.edgecombe

On Mon, Apr 17, 2023 at 08:05:31AM +0200, Greg KH wrote:
> On Sun, Apr 16, 2023 at 11:46:44AM -0700, Luis Chamberlain wrote:
> > On Sun, Apr 16, 2023 at 02:50:01PM +0200, Greg KH wrote:
> > > On Sat, Apr 15, 2023 at 11:41:28PM -0700, Luis Chamberlain wrote:
> > > > On Sat, Apr 15, 2023 at 11:04:12PM -0700, Christoph Hellwig wrote:
> > > > > On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> > > > > > With this we run into 0 wasted virtual memory bytes.
> > > > > 
> > > > > Avoid what duplicates?
> > > > 
> > > > David Hildenbrand had reported that with over 400 CPUs vmap space
> > > > runs out and it seems it was related to module loading. I took a
> > > > look and confirmed it. Module loading ends up requiring in the
> > > > worst case 3 vmalloc allocations, so typically at least twice
> > > > the size of the module size and in the worst case just add
> > > > the decompressed module size:
> > > > 
> > > > a) initial kernel_read*() call
> > > > b) optional module decompression
> > > > c) the actual module data copy we will keep
> > > > 
> > > > Duplicate module requests that come from userspace end up being thrown
> > > > in the trash bin, as only one module will be allocated.  Although there
> > > > are checks for a module prior to requesting a module udev still doesn't
> > > > do the best of a job to avoid that and so we end up with tons of
> > > > duplicate module requests. We're talking about gigabytes of vmalloc
> > > > bytes just lost because of this for large systems and megabytes for
> > > > average systems. So for example with just 255 CPUs we can loose about
> > > > 13.58 GiB, and for 8 CPUs about 226.53 MiB.
> > > 
> > > How does the memory get "lost"?  Shouldn't it be properly freed when the
> > > duplicate module load fails?
> > 
> > Yes memory gets freed, but since virtual memory space can be limitted it
> > also means you can end up eventually getting to the point -ENOMEMs will
> > happen as you have more CPUS and you cannot use virtual memory for other
> > things during kernel bootup and bootup fails. This is apparently
> > exacerbated with KASAN enabled.
> 
> Then why not just rate-limit the module loader in userspace on such
> large systems if that's an issue?  No kernel changes needed to do that.

We can certainly just take a stance punt this as a userspace problem. I thought
it would be good to see what a kernel style of workaround would look like for
us to evluate.

  Luis

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 2/2] kread: avoid duplicates
  2023-04-17 17:33       ` Edgecombe, Rick P
@ 2023-04-17 22:08         ` Luis Chamberlain
  2023-04-18 18:46           ` Luis Chamberlain
  0 siblings, 1 reply; 15+ messages in thread
From: Luis Chamberlain @ 2023-04-17 22:08 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: keescook, hch, prarit, rppt, catalin.marinas, Torvalds, Linus,
	willy, song, patches, pmladek, david, colin.i.king, linux-kernel,
	dave.hansen, jim.cromie, vbabka, christophe.leroy, linux-mm,
	tglx, jbaron, peterz, linux-modules, gregkh, petr.pavlu, rafael,
	Hocko, Michal, dave

On Mon, Apr 17, 2023 at 05:33:49PM +0000, Edgecombe, Rick P wrote:
> On Sat, 2023-04-15 at 23:41 -0700, Luis Chamberlain wrote:
> > On Sat, Apr 15, 2023 at 11:04:12PM -0700, Christoph Hellwig wrote:
> > > On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> > > > With this we run into 0 wasted virtual memory bytes.
> > > 
> > > Avoid what duplicates?
> > 
> > David Hildenbrand had reported that with over 400 CPUs vmap space
> > runs out and it seems it was related to module loading. I took a
> > look and confirmed it. Module loading ends up requiring in the
> > worst case 3 vmalloc allocations, so typically at least twice
> > the size of the module size and in the worst case just add
> > the decompressed module size:
> > 
> > a) initial kernel_read*() call
> > b) optional module decompression
> > c) the actual module data copy we will keep
> > 
> > Duplicate module requests that come from userspace end up being
> > thrown
> > in the trash bin, as only one module will be allocated.  Although
> > there
> > are checks for a module prior to requesting a module udev still
> > doesn't
> > do the best of a job to avoid that and so we end up with tons of
> > duplicate module requests. We're talking about gigabytes of vmalloc
> > bytes just lost because of this for large systems and megabytes for
> > average systems. So for example with just 255 CPUs we can loose about
> > 13.58 GiB, and for 8 CPUs about 226.53 MiB.
> > 
> > I have patches to curtail 1/2 of that space by doing a check in
> > kernel
> > before we do the allocation in c) if the module is already present.
> > For
> > a) it is harder because userspace just passes a file descriptor. But
> > since we can get the file path without the vmalloc this RFC suggest
> > maybe we can add a new kernel_read*() for module loading where it
> > makes
> > sense to have only one read happen at a time.
> 
> I'm wondering how difficult it would be to just try to remove the
> vmallocs in (a) and (b) and operate on a list of pages.

Yes I think it's worth long term to do that, if possible with seq reads.

  Luis

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 2/2] kread: avoid duplicates
  2023-04-17 22:08         ` Luis Chamberlain
@ 2023-04-18 18:46           ` Luis Chamberlain
  0 siblings, 0 replies; 15+ messages in thread
From: Luis Chamberlain @ 2023-04-18 18:46 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: keescook, hch, prarit, rppt, catalin.marinas, Torvalds, Linus,
	willy, song, patches, pmladek, david, colin.i.king, linux-kernel,
	dave.hansen, jim.cromie, vbabka, christophe.leroy, linux-mm,
	tglx, jbaron, peterz, linux-modules, gregkh, petr.pavlu, rafael,
	Hocko, Michal, dave

On Mon, Apr 17, 2023 at 03:08:34PM -0700, Luis Chamberlain wrote:
> On Mon, Apr 17, 2023 at 05:33:49PM +0000, Edgecombe, Rick P wrote:
> > On Sat, 2023-04-15 at 23:41 -0700, Luis Chamberlain wrote:
> > > On Sat, Apr 15, 2023 at 11:04:12PM -0700, Christoph Hellwig wrote:
> > > > On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> > > > > With this we run into 0 wasted virtual memory bytes.
> > > > 
> > > > Avoid what duplicates?
> > > 
> > > David Hildenbrand had reported that with over 400 CPUs vmap space
> > > runs out and it seems it was related to module loading. I took a
> > > look and confirmed it. Module loading ends up requiring in the
> > > worst case 3 vmalloc allocations, so typically at least twice
> > > the size of the module size and in the worst case just add
> > > the decompressed module size:
> > > 
> > > a) initial kernel_read*() call
> > > b) optional module decompression
> > > c) the actual module data copy we will keep
> > > 
> > > Duplicate module requests that come from userspace end up being
> > > thrown
> > > in the trash bin, as only one module will be allocated.  Although
> > > there
> > > are checks for a module prior to requesting a module udev still
> > > doesn't
> > > do the best of a job to avoid that and so we end up with tons of
> > > duplicate module requests. We're talking about gigabytes of vmalloc
> > > bytes just lost because of this for large systems and megabytes for
> > > average systems. So for example with just 255 CPUs we can loose about
> > > 13.58 GiB, and for 8 CPUs about 226.53 MiB.
> > > 
> > > I have patches to curtail 1/2 of that space by doing a check in
> > > kernel
> > > before we do the allocation in c) if the module is already present.
> > > For
> > > a) it is harder because userspace just passes a file descriptor. But
> > > since we can get the file path without the vmalloc this RFC suggest
> > > maybe we can add a new kernel_read*() for module loading where it
> > > makes
> > > sense to have only one read happen at a time.
> > 
> > I'm wondering how difficult it would be to just try to remove the
> > vmallocs in (a) and (b) and operate on a list of pages.
> 
> Yes I think it's worth long term to do that, if possible with seq reads.

OK here's what I suggest we do then:

I'll resubmit the first patch which allows us to prove / disprove if
module-autoloading is the culprit. With that in place folks can debug
their setup and verify how udev is to blame.

I'll drop the second kernel_read*() patch / effort and punt this as a
userspace problem as this is also not extremely pressing.

Long term should evaluate how we can avoid vmalloc for the kread and
module decompression.

If this really becomes a pressing issue we can revisit if we want an in
kernel solution, but at this point that likely would be systems with
over 400-500 CPUs with KASAN enabled. Without KASAN the issue should
eventually trigger if you're enablig modules but its hard to say at what
point you'd hit this issue.

  Luis

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-04-18 18:46 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-14  5:28 [RFC 0/2] module: fix virtual memory wasted on finit_module() Luis Chamberlain
2023-04-14  5:28 ` [RFC 1/2] module: add debugging auto-load duplicate module support Luis Chamberlain
2023-04-14  5:28 ` [RFC 2/2] kread: avoid duplicates Luis Chamberlain
2023-04-14  6:35   ` Greg KH
2023-04-14 16:35     ` Luis Chamberlain
2023-04-16  6:04   ` Christoph Hellwig
2023-04-16  6:41     ` Luis Chamberlain
2023-04-16 12:50       ` Greg KH
2023-04-16 18:46         ` Luis Chamberlain
2023-04-17  6:05           ` Greg KH
2023-04-17 22:05             ` Luis Chamberlain
2023-04-17 17:33       ` Edgecombe, Rick P
2023-04-17 22:08         ` Luis Chamberlain
2023-04-18 18:46           ` Luis Chamberlain
2023-04-14 17:25 ` [RFC 0/2] module: fix virtual memory wasted on finit_module() Luis Chamberlain

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.