linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] module: avoid userspace pressure on unwanted allocations
@ 2023-03-29  5:31 Luis Chamberlain
  2023-03-29  5:31 ` [PATCH 1/7] module: move finished_loading() Luis Chamberlain
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-29  5:31 UTC (permalink / raw)
  To: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, willy, vbabka,
	mhocko, dave.hansen, mcgrof

This patch set addresses a fix to the vmap allocation presure issues which
David Hildenbrand had reported last year in October. While at it,
I've simplified the kmod concurrency delimiter using Linus' suggestion,
and added debugfs stats to help us keep sane in doing analysis for memory
pressure issues on the finit_module() side of things. That should *also*
help do an empirical evaluation of module .text sizes *actually* present
on systems, given userspace makes it a bit tricky to get that right.

All this would not have been possible without stress-ng and Colin Ian King's
help to getting a modules ops in shape so to reproduce a situation only
reported so far on a system with over 400 CPUs.

I *think* the degugfs stats *should* probably be used to help identify
areas where we perhaps need *more work* to try to mitigate vmalloc()
space, as in the worst case we can end up using vmap space 3 times for
a single module, two just as big as the module, and if you are enabling
compression one with the compressed module size. That's significant memory
pressure on vmalloc() / vmap() space.

If you'd like to give this a spin this is available on my branch based on
modules-next 20230328-module-alloc-opts [2].

[0] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com
[1] https://github.com/ColinIanKing/stress-ng
[2] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230328-module-alloc-opts

Luis Chamberlain (7):
  module: move finished_loading()
  module: extract patient module check into helper
  module: avoid allocation if module is already present and ready
  sempahore: add a helper for a concurrency limiter
  modules/kmod: replace implementation with a sempahore
  debugfs: add debugfs_create_atomic64_t for atomic64_t
  module: add debug stats to help identify memory pressure

 fs/debugfs/file.c         |  36 +++++++
 include/linux/debugfs.h   |   2 +
 include/linux/semaphore.h |   3 +
 kernel/module/Kconfig     |  32 ++++++
 kernel/module/Makefile    |   4 +
 kernel/module/debug.c     |  16 +++
 kernel/module/internal.h  |  35 +++++++
 kernel/module/kmod.c      |  26 ++---
 kernel/module/main.c      | 164 ++++++++++++++++++++-----------
 kernel/module/stats.c     | 200 ++++++++++++++++++++++++++++++++++++++
 kernel/module/tracking.c  |   7 +-
 11 files changed, 445 insertions(+), 80 deletions(-)
 create mode 100644 kernel/module/debug.c
 create mode 100644 kernel/module/stats.c

-- 
2.39.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/7] module: move finished_loading()
  2023-03-29  5:31 [PATCH 0/7] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
@ 2023-03-29  5:31 ` Luis Chamberlain
  2023-03-29  5:31 ` [PATCH 2/7] module: extract patient module check into helper Luis Chamberlain
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-29  5:31 UTC (permalink / raw)
  To: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, willy, vbabka,
	mhocko, dave.hansen, mcgrof

This has no functional change, just moves a routine earlier
as we'll make use of it next.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 5cc21083af04..af58e63e5daf 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2442,27 +2442,6 @@ static int post_relocation(struct module *mod, const struct load_info *info)
 	return module_finalize(info->hdr, info->sechdrs, mod);
 }
 
-/* Is this module of this name done loading?  No locks held. */
-static bool finished_loading(const char *name)
-{
-	struct module *mod;
-	bool ret;
-
-	/*
-	 * The module_mutex should not be a heavily contended lock;
-	 * if we get the occasional sleep here, we'll go an extra iteration
-	 * in the wait_event_interruptible(), which is harmless.
-	 */
-	sched_annotate_sleep();
-	mutex_lock(&module_mutex);
-	mod = find_module_all(name, strlen(name), true);
-	ret = !mod || mod->state == MODULE_STATE_LIVE
-		|| mod->state == MODULE_STATE_GOING;
-	mutex_unlock(&module_mutex);
-
-	return ret;
-}
-
 /* Call module constructors. */
 static void do_mod_ctors(struct module *mod)
 {
@@ -2626,6 +2605,27 @@ static int may_init_module(void)
 	return 0;
 }
 
+/* Is this module of this name done loading?  No locks held. */
+static bool finished_loading(const char *name)
+{
+	struct module *mod;
+	bool ret;
+
+	/*
+	 * The module_mutex should not be a heavily contended lock;
+	 * if we get the occasional sleep here, we'll go an extra iteration
+	 * in the wait_event_interruptible(), which is harmless.
+	 */
+	sched_annotate_sleep();
+	mutex_lock(&module_mutex);
+	mod = find_module_all(name, strlen(name), true);
+	ret = !mod || mod->state == MODULE_STATE_LIVE
+		|| mod->state == MODULE_STATE_GOING;
+	mutex_unlock(&module_mutex);
+
+	return ret;
+}
+
 /*
  * We try to place it in the list now to make sure it's unique before
  * we dedicate too many resources.  In particular, temporary percpu
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/7] module: extract patient module check into helper
  2023-03-29  5:31 [PATCH 0/7] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
  2023-03-29  5:31 ` [PATCH 1/7] module: move finished_loading() Luis Chamberlain
@ 2023-03-29  5:31 ` Luis Chamberlain
  2023-03-29  5:31 ` [PATCH 3/7] module: avoid allocation if module is already present and ready Luis Chamberlain
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-29  5:31 UTC (permalink / raw)
  To: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, willy, vbabka,
	mhocko, dave.hansen, mcgrof

The patient module check inside add_unformed_module() is large
enough as we need it. It is a bit hard to read too, so just
move it to a helper and do the inverse checks first to help
shift the code and make it easier to read. The new helper then
is module_patient_check_exists().

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 71 +++++++++++++++++++++++++-------------------
 1 file changed, 40 insertions(+), 31 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index af58e63e5daf..77c2e7a60f2e 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2626,6 +2626,43 @@ static bool finished_loading(const char *name)
 	return ret;
 }
 
+/* Must be called with module_mutex held */
+static int module_patient_check_exists(const char *name)
+{
+	struct module *old;
+	int err = 0;
+
+	old = find_module_all(name, strlen(name), true);
+	if (old == NULL)
+		return 0;
+
+	if (old->state == MODULE_STATE_COMING
+	    || old->state == MODULE_STATE_UNFORMED) {
+		/* Wait in case it fails to load. */
+		mutex_unlock(&module_mutex);
+		err = wait_event_interruptible(module_wq,
+				       finished_loading(name));
+		if (err)
+			return err;
+
+		/* The module might have gone in the meantime. */
+		mutex_lock(&module_mutex);
+		old = find_module_all(name, strlen(name), true);
+	}
+
+	/*
+	 * We are here only when the same module was being loaded. Do
+	 * not try to load it again right now. It prevents long delays
+	 * caused by serialized module load failures. It might happen
+	 * when more devices of the same type trigger load of
+	 * a particular module.
+	 */
+	if (old && old->state == MODULE_STATE_LIVE)
+		return -EEXIST;
+	else
+		return -EBUSY;
+}
+
 /*
  * We try to place it in the list now to make sure it's unique before
  * we dedicate too many resources.  In particular, temporary percpu
@@ -2634,41 +2671,14 @@ static bool finished_loading(const char *name)
 static int add_unformed_module(struct module *mod)
 {
 	int err;
-	struct module *old;
 
 	mod->state = MODULE_STATE_UNFORMED;
 
 	mutex_lock(&module_mutex);
-	old = find_module_all(mod->name, strlen(mod->name), true);
-	if (old != NULL) {
-		if (old->state == MODULE_STATE_COMING
-		    || old->state == MODULE_STATE_UNFORMED) {
-			/* Wait in case it fails to load. */
-			mutex_unlock(&module_mutex);
-			err = wait_event_interruptible(module_wq,
-					       finished_loading(mod->name));
-			if (err)
-				goto out_unlocked;
-
-			/* The module might have gone in the meantime. */
-			mutex_lock(&module_mutex);
-			old = find_module_all(mod->name, strlen(mod->name),
-					      true);
-		}
-
-		/*
-		 * We are here only when the same module was being loaded. Do
-		 * not try to load it again right now. It prevents long delays
-		 * caused by serialized module load failures. It might happen
-		 * when more devices of the same type trigger load of
-		 * a particular module.
-		 */
-		if (old && old->state == MODULE_STATE_LIVE)
-			err = -EEXIST;
-		else
-			err = -EBUSY;
+	err = module_patient_check_exists(mod->name);
+	if (err)
 		goto out;
-	}
+
 	mod_update_bounds(mod);
 	list_add_rcu(&mod->list, &modules);
 	mod_tree_insert(mod);
@@ -2676,7 +2686,6 @@ static int add_unformed_module(struct module *mod)
 
 out:
 	mutex_unlock(&module_mutex);
-out_unlocked:
 	return err;
 }
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/7] module: avoid allocation if module is already present and ready
  2023-03-29  5:31 [PATCH 0/7] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
  2023-03-29  5:31 ` [PATCH 1/7] module: move finished_loading() Luis Chamberlain
  2023-03-29  5:31 ` [PATCH 2/7] module: extract patient module check into helper Luis Chamberlain
@ 2023-03-29  5:31 ` Luis Chamberlain
  2023-03-29  5:31 ` [PATCH 4/7] sempahore: add a helper for a concurrency limiter Luis Chamberlain
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-29  5:31 UTC (permalink / raw)
  To: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, willy, vbabka,
	mhocko, dave.hansen, mcgrof

load_module() will allocate a struct module before even checking
if the module is already loaded. This can create unecessary memory
pressure since we can easily just check if the module is already
present early with the copy of the module information from userspace
after we've validated it a bit.

This can only be an issue if a system is getting hammered with userspace
loading modules. Note that there are two ways to load modules, one is
kernel moduile auto-loading (request_module() calls in-kernel) and the
other is modprobe calls from userspace. The auto-loading is in-kernel, that
pings back to userspace to just call modprobe. We already have a way to
restrict the amount of concurrent kernel auto-loads in a given time, however
that does not stop a system from issuing tons of system calls to load a
module and for the races to exist. Userspace itself *is* supposed to check
if a module is present before loading it. But we're observing situations
where tons of the same module are in effect being loaded. Although some of
these are acknolwedged as in-kernel bugs such as the ACPI frequency
modules, issues for which we already have fixes merged or are working
towards, but we can also help a bit more in the modules side to avoid
those dramatic situations. All that is just memory being allocated to then
be thrown away.

To avoid memory pressure for such stupid cases put a stop gap for them.
We now check for the module being present *before* allocation, and then
right after we are going to add it to the system.

On a 8vcpu 8 GiB RAM system using kdevops and testing against selftests
kmod.sh -t 0008 I see a saving in the *highest* side of memory
consumption of up to ~ 84 MiB with the Linux kernel selftests kmod
test 0008. With the new stress-ng module test I see a 145 MiB difference
in max memory consumption with 100 ops. The stress-ng module ops tests can be
pretty pathalogical -- it is not realistic, however it was used to
finally successfully reproduce issues which are only reported to happen on
system with over 400 CPUs [0] by just usign 100 ops on a 8vcpu 8 GiB RAM
system.

This can be observed and visualized below. The time it takes to run the
test is also not affected.

The kmod tests 0008:

The gnuplot is set to a range from 400000 KiB (390 Mib) - 580000 (566 Mib)
given the tests peak around that range.

cat kmod.plot
set term dumb
set output fileout
set yrange [400000:580000]
plot filein with linespoints title "Memory usage (KiB)"

Before:
root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-before.txt ^C
root@kmod ~ # sort -n -r log-0008-before.txt | head -1
528732

So ~516.33 MiB

After:

root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-after.txt ^C

root@kmod ~ # sort -n -r log-0008-after.txt | head -1
442516

So ~432.14 MiB

That's about 84 ~MiB in savings in the worst case. The graphs:

root@kmod ~ # gnuplot -e "filein='log-0008-before.txt'; fileout='graph-0008-before.txt'" kmod.plot
root@kmod ~ # gnuplot -e "filein='log-0008-after.txt';  fileout='graph-0008-after.txt'"  kmod.plot

root@kmod ~ # cat graph-0008-before.txt

  580000 +-----------------------------------------------------------------+
         |       +        +       +       +       +        +       +       |
  560000 |-+                                    Memory usage (KiB) ***A***-|
         |                                                                 |
  540000 |-+                                                             +-|
         |                                                                 |
         |        *A     *AA*AA*A*AA          *A*AA    A*A*A *AA*A*AA*A  A |
  520000 |-+A*A*AA  *AA*A           *A*AA*A*AA     *A*A     A          *A+-|
         |*A                                                               |
  500000 |-+                                                             +-|
         |                                                                 |
  480000 |-+                                                             +-|
         |                                                                 |
  460000 |-+                                                             +-|
         |                                                                 |
         |                                                                 |
  440000 |-+                                                             +-|
         |                                                                 |
  420000 |-+                                                             +-|
         |       +        +       +       +       +        +       +       |
  400000 +-----------------------------------------------------------------+
         0       5        10      15      20      25       30      35      40

root@kmod ~ # cat graph-0008-after.txt

  580000 +-----------------------------------------------------------------+
         |       +        +       +       +       +        +       +       |
  560000 |-+                                    Memory usage (KiB) ***A***-|
         |                                                                 |
  540000 |-+                                                             +-|
         |                                                                 |
         |                                                                 |
  520000 |-+                                                             +-|
         |                                                                 |
  500000 |-+                                                             +-|
         |                                                                 |
  480000 |-+                                                             +-|
         |                                                                 |
  460000 |-+                                                             +-|
         |                                                                 |
         |          *A              *A*A                                   |
  440000 |-+A*A*AA*A  A       A*A*AA    A*A*AA*A*AA*A*AA*A*AA*AA*A*AA*A*AA-|
         |*A           *A*AA*A                                             |
  420000 |-+                                                             +-|
         |       +        +       +       +       +        +       +       |
  400000 +-----------------------------------------------------------------+
         0       5        10      15      20      25       30      35      40

The stress-ng module tests:

This is used to run the test to try to reproduce the vmap issues
reported by David:

  echo 0 > /proc/sys/vm/oom_dump_tasks
  ./stress-ng --module 100 --module-name xfs

Prior to this commit:
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > baseline-stress-ng.txt
root@kmod ~ # sort -n -r baseline-stress-ng.txt | head -1
5046456

After this commit:
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > after-stress-ng.txt
root@kmod ~ # sort -n -r after-stress-ng.txt | head -1
4896972

5046456 - 4896972
149484
149484/1024
145.98046875000000000000

So this commit using stress-ng reveals saving about 145 MiB in memory
using 100 ops from stress-ng which reproduced the vmap issue reported.

cat kmod.plot
set term dumb
set output fileout
set yrange [4700000:5070000]
plot filein with linespoints title "Memory usage (KiB)"

root@kmod ~ # gnuplot -e "filein='baseline-stress-ng.txt'; fileout='graph-stress-ng-before.txt'"  kmod-simple-stress-ng.plot
root@kmod ~ # gnuplot -e "filein='after-stress-ng.txt'; fileout='graph-stress-ng-after.txt'"  kmod-simple-stress-ng.plot

root@kmod ~ # cat graph-stress-ng-before.txt

           +---------------------------------------------------------------+
  5.05e+06 |-+     + A     +       +       +       +       +       +     +-|
           |         *                          Memory usage (KiB) ***A*** |
           |         *                             A                       |
     5e+06 |-+      **                            **                     +-|
           |        **                            * *    A                 |
  4.95e+06 |-+      * *                          A  *   A*               +-|
           |        * *      A       A           *  *  *  *             A  |
           |       *  *     * *     * *        *A   *  *  *      A      *  |
   4.9e+06 |-+     *  *     * A*A   * A*AA*A  A      *A    **A   **A*A  *+-|
           |       A  A*A  A    *  A       *  *      A     A *  A    * **  |
           |      *      **      **         * *              *  *    * * * |
  4.85e+06 |-+   A       A       A          **               *  *     ** *-|
           |     *                           *               * *      ** * |
           |     *                           A               * *      *  * |
   4.8e+06 |-+   *                                           * *      A  A-|
           |     *                                           * *           |
  4.75e+06 |-+  *                                            * *         +-|
           |    *                                            **            |
           |    *  +       +       +       +       +       + **    +       |
   4.7e+06 +---------------------------------------------------------------+
           0       5       10      15      20      25      30      35      40

root@kmod ~ # cat graph-stress-ng-after.txt

           +---------------------------------------------------------------+
  5.05e+06 |-+     +       +       +       +       +       +       +     +-|
           |                                    Memory usage (KiB) ***A*** |
           |                                                               |
     5e+06 |-+                                                           +-|
           |                                                               |
  4.95e+06 |-+                                                           +-|
           |                                                               |
           |                                                               |
   4.9e+06 |-+                                      *AA                  +-|
           |  A*AA*A*A  A  A*AA*AA*A*AA*A  A  A  A*A   *AA*A*A  A  A*AA*AA |
           |  *      * **  *            *  *  ** *            ***  *       |
  4.85e+06 |-+*       ***  *            * * * ***             A *  *     +-|
           |  *       A *  *             ** * * A               *  *       |
           |  *         *  *             *  **                  *  *       |
   4.8e+06 |-+*         *  *             A   *                  *  *     +-|
           | *          * *                  A                  * *        |
  4.75e+06 |-*          * *                                     * *      +-|
           | *          * *                                     * *        |
           | *     +    * *+       +       +       +       +    * *+       |
   4.7e+06 +---------------------------------------------------------------+
           0       5       10      15      20      25      30      35      40

[0] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

Reported-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 77c2e7a60f2e..145e15f19576 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2785,7 +2785,11 @@ static int early_mod_check(struct load_info *info, int flags)
 	if (err)
 		return err;
 
-	return 0;
+	mutex_lock(&module_mutex);
+	err = module_patient_check_exists(info->mod->name);
+	mutex_unlock(&module_mutex);
+
+	return err;
 }
 
 /*
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-29  5:31 [PATCH 0/7] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (2 preceding siblings ...)
  2023-03-29  5:31 ` [PATCH 3/7] module: avoid allocation if module is already present and ready Luis Chamberlain
@ 2023-03-29  5:31 ` Luis Chamberlain
  2023-03-29  7:21   ` Peter Zijlstra
  2023-03-29  5:31 ` [PATCH 5/7] modules/kmod: replace implementation with a sempahore Luis Chamberlain
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-29  5:31 UTC (permalink / raw)
  To: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, willy, vbabka,
	mhocko, dave.hansen, mcgrof

While I looked at re-using the old kernel/kmod.c (now kernel/module/kmod.c)
concurrency delimiter methodology for another place in the kernel Linus
noted that this could be simply replaced with a sempahore [0].

So add that so we we don't re-invent the wheel and make it obvious to use.

[0] https://lore.kernel.org/all/CAHk-=whkj6=wyi201JXkw9iT_eTUTsSx+Yb9d4OgmZFjDJA18g@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/semaphore.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/semaphore.h b/include/linux/semaphore.h
index 6694d0019a68..2ecdffdb9814 100644
--- a/include/linux/semaphore.h
+++ b/include/linux/semaphore.h
@@ -28,6 +28,9 @@ struct semaphore {
 #define DEFINE_SEMAPHORE(name)	\
 	struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
 
+#define CONCURRENCY_LIMITER(name, n) \
+	struct semaphore name = __SEMAPHORE_INITIALIZER(name, n)
+
 static inline void sema_init(struct semaphore *sem, int val)
 {
 	static struct lock_class_key __key;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/7] modules/kmod: replace implementation with a sempahore
  2023-03-29  5:31 [PATCH 0/7] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (3 preceding siblings ...)
  2023-03-29  5:31 ` [PATCH 4/7] sempahore: add a helper for a concurrency limiter Luis Chamberlain
@ 2023-03-29  5:31 ` Luis Chamberlain
  2023-03-29  5:31 ` [PATCH 6/7] debugfs: add debugfs_create_atomic64_t for atomic64_t Luis Chamberlain
  2023-03-29  5:31 ` [PATCH 7/7] module: add debug stats to help identify memory pressure Luis Chamberlain
  6 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-29  5:31 UTC (permalink / raw)
  To: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, willy, vbabka,
	mhocko, dave.hansen, mcgrof

Simplfy the concurrency delimiter we user for kmod with the semaphore.
I had used the kmod strategy to try to implement a similar concurrency
delimiter for the kernel_read*() calls from the finit_module() path
so to reduce vmalloc() memory pressure. That effort didn't provid yet
conclusive results, but one thing that did became clear is we can use
the suggested alternative solution with semaphores which Linus hinted
at instead of using the atomic / wait strategy.

I've stress tested this with kmod test 0008:

time /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008

And I get only a *slight* delay. That delay however is small, a few
seconds for a full test loop run that runs 150 times, for about ~30-40
seconds. The small delay is worth the simplfication IMHO.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/kmod.c | 26 +++++++-------------------
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/kernel/module/kmod.c b/kernel/module/kmod.c
index b717134ebe17..fd7c461387f8 100644
--- a/kernel/module/kmod.c
+++ b/kernel/module/kmod.c
@@ -40,8 +40,7 @@
  * effect. Systems like these are very unlikely if modules are enabled.
  */
 #define MAX_KMOD_CONCURRENT 50
-static atomic_t kmod_concurrent_max = ATOMIC_INIT(MAX_KMOD_CONCURRENT);
-static DECLARE_WAIT_QUEUE_HEAD(kmod_wq);
+static CONCURRENCY_LIMITER(kmod_concurrent_max, MAX_KMOD_CONCURRENT);
 
 /*
  * This is a restriction on having *all* MAX_KMOD_CONCURRENT threads
@@ -148,29 +147,18 @@ int __request_module(bool wait, const char *fmt, ...)
 	if (ret)
 		return ret;
 
-	if (atomic_dec_if_positive(&kmod_concurrent_max) < 0) {
-		pr_warn_ratelimited("request_module: kmod_concurrent_max (%u) close to 0 (max_modprobes: %u), for module %s, throttling...",
-				    atomic_read(&kmod_concurrent_max),
-				    MAX_KMOD_CONCURRENT, module_name);
-		ret = wait_event_killable_timeout(kmod_wq,
-						  atomic_dec_if_positive(&kmod_concurrent_max) >= 0,
-						  MAX_KMOD_ALL_BUSY_TIMEOUT * HZ);
-		if (!ret) {
-			pr_warn_ratelimited("request_module: modprobe %s cannot be processed, kmod busy with %d threads for more than %d seconds now",
-					    module_name, MAX_KMOD_CONCURRENT, MAX_KMOD_ALL_BUSY_TIMEOUT);
-			return -ETIME;
-		} else if (ret == -ERESTARTSYS) {
-			pr_warn_ratelimited("request_module: sigkill sent for modprobe %s, giving up", module_name);
-			return ret;
-		}
+	ret = down_timeout(&kmod_concurrent_max, MAX_KMOD_ALL_BUSY_TIMEOUT);
+	if (ret) {
+		pr_warn_ratelimited("request_module: modprobe %s cannot be processed, kmod busy with %d threads for more than %d seconds now",
+				    module_name, MAX_KMOD_CONCURRENT, MAX_KMOD_ALL_BUSY_TIMEOUT);
+		return ret;
 	}
 
 	trace_module_request(module_name, wait, _RET_IP_);
 
 	ret = call_modprobe(module_name, wait ? UMH_WAIT_PROC : UMH_WAIT_EXEC);
 
-	atomic_inc(&kmod_concurrent_max);
-	wake_up(&kmod_wq);
+	up(&kmod_concurrent_max);
 
 	return ret;
 }
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/7] debugfs: add debugfs_create_atomic64_t for atomic64_t
  2023-03-29  5:31 [PATCH 0/7] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (4 preceding siblings ...)
  2023-03-29  5:31 ` [PATCH 5/7] modules/kmod: replace implementation with a sempahore Luis Chamberlain
@ 2023-03-29  5:31 ` Luis Chamberlain
  2023-03-29  5:46   ` Greg KH
  2023-03-29  5:31 ` [PATCH 7/7] module: add debug stats to help identify memory pressure Luis Chamberlain
  6 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-29  5:31 UTC (permalink / raw)
  To: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, willy, vbabka,
	mhocko, dave.hansen, mcgrof

Sometimes you want to add debugfs entries for atomic counters which
can be pretty large using atomic64_t. Add support for these.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/debugfs/file.c       | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/debugfs.h |  2 ++
 2 files changed, 38 insertions(+)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 1f971c880dde..76d923503861 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -780,6 +780,42 @@ void debugfs_create_atomic_t(const char *name, umode_t mode,
 }
 EXPORT_SYMBOL_GPL(debugfs_create_atomic_t);
 
+static int debugfs_atomic64_t_set(void *data, u64 val)
+{
+	atomic64_set((atomic64_t *)data, val);
+	return 0;
+}
+static int debugfs_atomic64_t_get(void *data, u64 *val)
+{
+	*val = atomic64_read((atomic64_t *)data);
+	return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE_SIGNED(fops_atomic64_t, debugfs_atomic64_t_get,
+			debugfs_atomic64_t_set, "%lld\n");
+DEFINE_DEBUGFS_ATTRIBUTE_SIGNED(fops_atomic64_t_ro, debugfs_atomic64_t_get, NULL,
+			"%lld\n");
+DEFINE_DEBUGFS_ATTRIBUTE_SIGNED(fops_atomic64_t_wo, NULL, debugfs_atomic64_t_set,
+			"%lld\n");
+
+/**
+ * debugfs_create_atomic64_t - create a debugfs file that is used to read and
+ * write an atomic64_t value
+ * @name: a pointer to a string containing the name of the file to create.
+ * @mode: the permission that the file should have
+ * @parent: a pointer to the parent dentry for this file.  This should be a
+ *          directory dentry if set.  If this parameter is %NULL, then the
+ *          file will be created in the root of the debugfs filesystem.
+ * @value: a pointer to the variable that the file should read to and write
+ *         from.
+ */
+void debugfs_create_atomic64_t(const char *name, umode_t mode,
+			     struct dentry *parent, atomic64_t *value)
+{
+	debugfs_create_mode_unsafe(name, mode, parent, value, &fops_atomic64_t,
+				   &fops_atomic64_t_ro, &fops_atomic64_t_wo);
+}
+EXPORT_SYMBOL_GPL(debugfs_create_atomic64_t);
+
 ssize_t debugfs_read_file_bool(struct file *file, char __user *user_buf,
 			       size_t count, loff_t *ppos)
 {
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index ea2d919fd9c7..f5cc613a545e 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -136,6 +136,8 @@ void debugfs_create_size_t(const char *name, umode_t mode,
 			   struct dentry *parent, size_t *value);
 void debugfs_create_atomic_t(const char *name, umode_t mode,
 			     struct dentry *parent, atomic_t *value);
+void debugfs_create_atomic64_t(const char *name, umode_t mode,
+			       struct dentry *parent, atomic64_t *value);
 void debugfs_create_bool(const char *name, umode_t mode, struct dentry *parent,
 			 bool *value);
 void debugfs_create_str(const char *name, umode_t mode,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 7/7] module: add debug stats to help identify memory pressure
  2023-03-29  5:31 [PATCH 0/7] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
                   ` (5 preceding siblings ...)
  2023-03-29  5:31 ` [PATCH 6/7] debugfs: add debugfs_create_atomic64_t for atomic64_t Luis Chamberlain
@ 2023-03-29  5:31 ` Luis Chamberlain
  2023-03-29  5:46   ` Greg KH
  6 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-29  5:31 UTC (permalink / raw)
  To: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, willy, vbabka,
	mhocko, dave.hansen, mcgrof

Loading modules with finit_module() can end up using vmalloc(), vmap()
and vmalloc() again, for a total of up to 3 separate allocations in the
worse case for a single module. We always kernel_read*() the module,
that's a vmalloc(). Then vmap() is used for the module decompression,
and if so the last read buffer is freed as we use the now decompressed
module buffer to stuff data into our copy module. The last one is
specific to architectures but pretty much that's generally a series
of vmalloc() for different ELF sections...

Evaluation with new stress-ng module support [1] with just 100 ops
us proving that you can end up using GiBs of data easily even if
we are trying to be very careful not to load modules which are already
loaded. 100 ops seems to resemble the sort of pressure a system with
about 400 CPUs can create on modules. Although those issues for so
many concurrent loads per CPU is silly and are being fixed, we lack
proper tooling to help diagnose easily what happened, when it happened
and what likely are the culprits -- userspace or kernel module
autoloading.

Provide an initial set of stats for debugfs which let us easily scrape
post-boot information about failed loads. This sort of information can
be used on production worklaods to try to optimize *avoiding* redundant
memory pressure using finit_module().

Screen shot:

root@kmod ~ # cat /sys/kernel/debug/modules/stats
           Modules loaded       67
        Total module size       11464704
      Total mod text size       4194304
       Failed kread bytes       890064
        Failed kmod bytes       890064
      Invalid kread bytes       890064
 Invalid decompress bytes       0
        Invalid mod bytes       890064
         Average mod size       171115
    Average mod text size       62602
Failed modules:
                kvm_intel
                      kvm
                irqbypass
         crct10dif_pclmul
      ghash_clmulni_intel
             sha512_ssse3
           sha512_generic
              aesni_intel
              crypto_simd
                   cryptd
                    evdev
                serio_raw
               virtio_pci
                     nvme
                nvme_core
    virtio_pci_legacy_dev
                   t10_pi
           crc64_rocksoft
    virtio_pci_modern_dev
             crc32_pclmul
                   virtio
             crc32c_intel
              virtio_ring
                    crc64

[0] https://github.com/ColinIanKing/stress-ng.git
[1] echo 0 > /proc/sys/vm/oom_dump_tasks
    ./stress-ng --module 100 --module-name xfs

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/Kconfig    |  32 +++++++
 kernel/module/Makefile   |   4 +
 kernel/module/debug.c    |  16 ++++
 kernel/module/internal.h |  35 +++++++
 kernel/module/main.c     |  45 ++++++++-
 kernel/module/stats.c    | 200 +++++++++++++++++++++++++++++++++++++++
 kernel/module/tracking.c |   7 +-
 7 files changed, 331 insertions(+), 8 deletions(-)
 create mode 100644 kernel/module/debug.c
 create mode 100644 kernel/module/stats.c

diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
index 424b3bc58f3f..fbf7b92cb3d0 100644
--- a/kernel/module/Kconfig
+++ b/kernel/module/Kconfig
@@ -22,6 +22,12 @@ menuconfig MODULES
 
 if MODULES
 
+config MODULE_DEBUG
+	bool "Module debugging"
+	default n
+	help
+	  Allows to help debug module functionality.
+
 config MODULE_FORCE_LOAD
 	bool "Forced module loading"
 	default n
@@ -48,6 +54,8 @@ config MODULE_FORCE_UNLOAD
 	  rmmod).  This is mainly for kernel developers and desperate users.
 	  If unsure, say N.
 
+if MODULE_DEBUG
+
 config MODULE_UNLOAD_TAINT_TRACKING
 	bool "Tainted module unload tracking"
 	depends on MODULE_UNLOAD
@@ -59,6 +67,30 @@ config MODULE_UNLOAD_TAINT_TRACKING
 	  page (see bad_page()), the aforementioned details are also
 	  shown. If unsure, say N.
 
+config MODULE_STATS
+	bool "Module statistics"
+	depends on DEBUG_FS
+	default n
+	help
+	  This option allows you to maintain a record of module statistics.
+	  For example each all modules size, average size, text size, and
+	  failed modules and the size for each of those. For failed
+	  modules we keep track of module which failed due to either the
+	  existing module taking too long to load or that module already
+	  was loaded.
+
+	  You should enable this if you are debugging production loads
+	  and want to see if userspace or the kernel is doing stupid things
+	  with loading modules when it shouldn't or if you want to help
+	  optimize userspace / kernel space module autoloading schemes.
+	  You might want to do this because failed modules tend to use
+	  use up significan amount of memory, and so you'd be doing everyone
+	  a favor in avoiding these failure proactively.
+
+	  If unsure, say N.
+
+endif # MODULE_DEBUG
+
 config MODVERSIONS
 	bool "Module versioning support"
 	help
diff --git a/kernel/module/Makefile b/kernel/module/Makefile
index 5b1d26b53b8d..fe97047f3807 100644
--- a/kernel/module/Makefile
+++ b/kernel/module/Makefile
@@ -20,4 +20,8 @@ obj-$(CONFIG_PROC_FS) += procfs.o
 obj-$(CONFIG_SYSFS) += sysfs.o
 obj-$(CONFIG_KGDB_KDB) += kdb.o
 obj-$(CONFIG_MODVERSIONS) += version.o
+
+# Link order matters here, keep debug.o first.
+obj-$(CONFIG_MODULE_DEBUG) += debug.o
+obj-$(CONFIG_MODULE_STATS) += stats.o
 obj-$(CONFIG_MODULE_UNLOAD_TAINT_TRACKING) += tracking.o
diff --git a/kernel/module/debug.c b/kernel/module/debug.c
new file mode 100644
index 000000000000..ef580d70b751
--- /dev/null
+++ b/kernel/module/debug.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2023 Luis Chamberlain <mcgrof@kernel.org>
+ */
+
+#include <linux/debugfs.h>
+#include "internal.h"
+
+struct dentry *mod_debugfs_root;
+
+static int module_debugfs_init(void)
+{
+	mod_debugfs_root = debugfs_create_dir("modules", NULL);
+	return 0;
+}
+module_init(module_debugfs_init);
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 6ae29bb8836f..a645cb3fafc7 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -143,6 +143,41 @@ static inline bool set_livepatch_module(struct module *mod)
 #endif
 }
 
+#ifdef CONFIG_MODULE_STATS
+
+#define mod_stat_add64(count, var) atomic64_add(count, var)
+#define mod_stat_inc(name) atomic_inc(name)
+
+extern atomic64_t total_mod_size;
+extern atomic64_t total_text_size;
+extern atomic64_t invalid_kread_bytes;
+extern atomic64_t invalid_decompress_bytes;
+extern atomic64_t invalid_mod_becoming_bytes;
+extern atomic64_t invalid_mod_bytes;
+
+extern atomic_t modcount;
+extern atomic_t failed_kreads;
+extern atomic_t failed_decompress;
+extern atomic_t failed_load_modules;
+struct mod_fail_load {
+	struct list_head list;
+	char name[MODULE_NAME_LEN];
+	atomic_t count;
+};
+
+int try_add_failed_module(const char *name);
+
+#else
+
+#define mod_stat_inc64(name)
+#define mod_stat_inc(name) atomic_inc(name)
+
+static inline int try_add_failed_module(const char *name)
+{
+	return 0;
+}
+#endif /* CONFIG_MODULE_STATS */
+
 #ifdef CONFIG_MODULE_UNLOAD_TAINT_TRACKING
 struct mod_unload_taint {
 	struct list_head list;
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 145e15f19576..8b851042b7f9 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2495,6 +2495,16 @@ static noinline int do_init_module(struct module *mod)
 {
 	int ret = 0;
 	struct mod_initfree *freeinit;
+	unsigned int text_size = 0, total_size = 0;
+
+	for_each_mod_mem_type(type) {
+		const struct module_memory *mod_mem = &mod->mem[type];
+		if (mod_mem->size) {
+			total_size += mod_mem->size;
+			if (type == MOD_TEXT || type == MOD_INIT_TEXT)
+				text_size += mod->mem[type].size;
+		}
+	}
 
 	freeinit = kmalloc(sizeof(*freeinit), GFP_KERNEL);
 	if (!freeinit) {
@@ -2556,6 +2566,7 @@ static noinline int do_init_module(struct module *mod)
 		mod->mem[type].base = NULL;
 		mod->mem[type].size = 0;
 	}
+
 #ifdef CONFIG_DEBUG_INFO_BTF_MODULES
 	/* .BTF is not SHF_ALLOC and will get removed, so sanitize pointer */
 	mod->btf_data = NULL;
@@ -2579,6 +2590,11 @@ static noinline int do_init_module(struct module *mod)
 	mutex_unlock(&module_mutex);
 	wake_up_all(&module_wq);
 
+	mod_stat_add64(text_size, &total_text_size);
+	mod_stat_add64(total_size, &total_mod_size);
+
+	mod_stat_inc(&modcount);
+
 	return 0;
 
 fail_free_freeinit:
@@ -2594,6 +2610,7 @@ static noinline int do_init_module(struct module *mod)
 	ftrace_release_mod(mod);
 	free_module(mod);
 	wake_up_all(&module_wq);
+
 	return ret;
 }
 
@@ -2650,6 +2667,9 @@ static int module_patient_check_exists(const char *name)
 		old = find_module_all(name, strlen(name), true);
 	}
 
+	if (try_add_failed_module(name))
+		pr_warn("Could not add fail-tracking for module: %s\n", name);
+
 	/*
 	 * We are here only when the same module was being loaded. Do
 	 * not try to load it again right now. It prevents long delays
@@ -2800,6 +2820,7 @@ static int load_module(struct load_info *info, const char __user *uargs,
 		       int flags)
 {
 	struct module *mod;
+	bool module_allocated = false;
 	long err = 0;
 	char *after_dashes;
 
@@ -2839,6 +2860,8 @@ static int load_module(struct load_info *info, const char __user *uargs,
 		goto free_copy;
 	}
 
+	module_allocated = true;
+
 	audit_log_kern_module(mod->name);
 
 	/* Reserve our place in the list. */
@@ -2983,6 +3006,7 @@ static int load_module(struct load_info *info, const char __user *uargs,
 	synchronize_rcu();
 	mutex_unlock(&module_mutex);
  free_module:
+	mod_stat_add64(info->len, &invalid_mod_bytes);
 	/* Free lock-classes; relies on the preceding sync_rcu() */
 	for_class_mod_mem_type(type, core_data) {
 		lockdep_free_key_range(mod->mem[type].base,
@@ -2991,6 +3015,13 @@ static int load_module(struct load_info *info, const char __user *uargs,
 
 	module_deallocate(mod, info);
  free_copy:
+	/*
+	 * The info->len is always set. We distinguish between
+	 * failures once the proper module was allocated and
+	 * before that.
+	 */
+	if (!module_allocated)
+		mod_stat_add64(info->len, &invalid_mod_becoming_bytes);
 	free_copy(info, flags);
 	return err;
 }
@@ -3009,8 +3040,10 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
 	       umod, len, uargs);
 
 	err = copy_module_from_user(umod, len, &info);
-	if (err)
+	if (err) {
+		mod_stat_inc(&failed_kreads);
 		return err;
+	}
 
 	return load_module(&info, uargs, 0);
 }
@@ -3035,14 +3068,20 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 
 	len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
 				       READING_MODULE);
-	if (len < 0)
+	if (len < 0) {
+		mod_stat_inc(&failed_kreads);
+		mod_stat_add64(len, &invalid_kread_bytes);
 		return len;
+	}
 
 	if (flags & MODULE_INIT_COMPRESSED_FILE) {
 		err = module_decompress(&info, buf, len);
 		vfree(buf); /* compressed data is no longer needed */
-		if (err)
+		if (err) {
+			mod_stat_inc(&failed_decompress);
+			mod_stat_add64(len, &invalid_decompress_bytes);
 			return err;
+		}
 	} else {
 		info.hdr = buf;
 		info.len = len;
diff --git a/kernel/module/stats.c b/kernel/module/stats.c
new file mode 100644
index 000000000000..bbf0ddb8b589
--- /dev/null
+++ b/kernel/module/stats.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Let's keep keep tabs on a few important module stats, useful
+ * for debugging production loads and interactions between userspace
+ * and kernelspace for loading modules.
+ *
+ * Copyright (C) 2023 Luis Chamberlain <mcgrof@kernel.org>
+ */
+
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/debugfs.h>
+#include <linux/rculist.h>
+
+#include "internal.h"
+
+extern struct dentry *mod_debugfs_root;
+
+/*
+ * Tracks modules which failed to be loaded as they were being processed.
+ * These require modulesed consumed vmalloc space for all finit_module()
+ * calls as kernel_read*() is used. Then if compression is used vmap()
+ * is used to allocate space for the decompressed version of what userspace
+ * has on the filesystem, we vfree() the compressed data which kerne_read*()
+ * fetched for us. Finally, a final module is allocated as well which we
+ * use to keep around, and that *can* use vmalloc() too.
+ *
+ * In the worst case, when module compression is used then we use the vmap
+ * space three times.
+ *
+ * We really should strive to get this list to be empty. This not being empty
+ * is a reflection of us needing to do more work to ensure either the kernel
+ * or usersapce does not do unnecessary calls to load modules which it should
+ * know are already loaded or on its way to be loaded.
+ */
+static LIST_HEAD(failed_modules);
+
+/* Total bytes used by all modules we've dealt with on this system */
+atomic64_t total_mod_size;
+
+/* Total .text section sizes we've dealt with on this system */
+atomic64_t total_text_size;
+
+/* Failures happen on the initial kernel_read_*() call. They use vmalloc() */
+atomic64_t invalid_kread_bytes;
+
+/* Failures happen on the module decompression path, these use use vmap(). */
+atomic64_t invalid_decompress_bytes;
+
+/*
+ * The invalid_mod_becoming_bytes only keeps tabs of failures in between kread
+ * success and right before we allocate the module to process it. These
+ * can be failures due to:
+ *
+ *  o module_sig_check() - module signature checks
+ *  o elf_validity_cache_copy() - ELF does not add up
+ *  o early_mod_check():
+ *  	- blacklist
+ *  	- failed to rewrite section headers
+ *  	- verion magic
+ *  	- live patch requirements didn't check out
+ *  	- the module was detected as being already present, this
+ *  	  first check avoids a new vmalloc for the full size of
+ *  	  the module.
+ */
+atomic64_t invalid_mod_becoming_bytes;
+
+/*
+ * These are failures after we did all the sanity checks of the
+ * module userspace passed to us. This can still fail if we detect
+ * the module is loaded, we do this check after we allocate space
+ * for the module.
+ */
+atomic64_t invalid_mod_bytes;
+
+/* How many modules we've loaded in our kernel life time */
+atomic_t modcount;
+
+/* How many modules failed due to failed kernel_read*() */
+atomic_t failed_kreads;
+
+/* How many failed decompression attempts we've had */
+atomic_t failed_decompress;
+
+/* How many modules failed once we've allocated space for our module */
+atomic_t failed_load_modules;
+
+int try_add_failed_module(const char *name)
+{
+	struct mod_fail_load *mod_fail;
+
+	list_for_each_entry_rcu(mod_fail, &failed_modules, list,
+				lockdep_is_held(&module_mutex)) {
+                if (!strcmp(mod_fail->name, name)) {
+                        atomic_inc(&mod_fail->count);
+                        goto out;
+                }
+        }
+
+	mod_fail = kmalloc(sizeof(*mod_fail), GFP_KERNEL);
+	if (!mod_fail)
+		return -ENOMEM;
+	strscpy(mod_fail->name, name, MODULE_NAME_LEN);
+        atomic_inc(&mod_fail->count);
+        list_add_rcu(&mod_fail->list, &failed_modules);
+out:
+	return 0;
+}
+
+static ssize_t read_file_mod_stats(struct file *file, char __user *user_buf,
+				   size_t count, loff_t *ppos)
+{
+	struct mod_fail_load *mod_fail;
+	unsigned int len;
+	const unsigned int size = 1024;
+	char *buf;
+	u32 live_mod_count, fkreads, floads;
+	u64 total_size, text_size, ikread_bytes, idecomp_bytes, imod_bytes;
+
+	live_mod_count = atomic_read(&modcount);
+	fkreads = atomic_read(&failed_kreads);
+	floads = atomic_read(&failed_load_modules);
+
+	total_size = atomic64_read(&total_mod_size);
+	text_size = atomic64_read(&total_text_size);
+	ikread_bytes = atomic64_read(&invalid_mod_bytes);
+	idecomp_bytes = atomic64_read(&invalid_decompress_bytes);
+	imod_bytes = atomic64_read(&invalid_mod_bytes);
+
+	buf = kzalloc(size, GFP_KERNEL);
+	if (buf == NULL)
+		return -ENOMEM;
+
+	len += scnprintf(buf + len, size - len, "%25s\t%u\n", "Modules loaded", live_mod_count);
+	len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Total module size", total_size);
+	len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Total mod text size", text_size);
+
+	/*
+	 * Failed kmod bytes do not contain any failed kreads bytes as those
+	 * failures would happen earlier on kread. Failed kread bytes are wasted
+	 * vmalloc() space allocations and are indicative of invalid modules.
+	 */
+	len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Failed kread bytes", ikread_bytes);
+
+	/*
+	 * Failed kmod bytes are modules which for whatever reason fail to load
+	 * on the load_module() effort. They are good signs the kernel or userspace
+	 * is doing something stupid or that could be improved.
+	 */
+	len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Failed kmod bytes", imod_bytes);
+
+	len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Invalid kread bytes", ikread_bytes);
+	len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Invalid decompress bytes", idecomp_bytes);
+	len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Invalid mod bytes", imod_bytes);
+
+	if (live_mod_count && total_size) {
+		len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Average mod size",
+				 DIV_ROUND_UP(total_size, live_mod_count));
+	}
+
+	if (live_mod_count && text_size) {
+		len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Average mod text size",
+				 DIV_ROUND_UP(text_size, live_mod_count));
+	}
+
+	if (list_empty(&failed_modules))
+		goto out;
+
+	len += scnprintf(buf + len, size - len, "Failed modules:\n");
+	list_for_each_entry_rcu(mod_fail, &failed_modules, list)
+		len += scnprintf(buf + len, size - len, "%25s\n", mod_fail->name);
+out:
+        return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static const struct file_operations fops_mod_stats = {
+	.read = read_file_mod_stats,
+	.open = simple_open,
+	.owner = THIS_MODULE,
+	.llseek = default_llseek,
+};
+
+static int __init module_stats_init(void)
+{
+	debugfs_create_atomic64_t("total_mod_size", 0400, mod_debugfs_root, &total_mod_size);
+	debugfs_create_atomic64_t("total_text_size", 0400, mod_debugfs_root, &total_text_size);
+	debugfs_create_atomic64_t("invalid_kread_bytes", 0400, mod_debugfs_root, &invalid_kread_bytes);
+	debugfs_create_atomic64_t("invalid_decompress_bytes", 0400, mod_debugfs_root, &invalid_decompress_bytes);
+	debugfs_create_atomic64_t("invalid_mod_bytes", 0400, mod_debugfs_root, &invalid_mod_bytes);
+	debugfs_create_atomic_t("modcount", 0400, mod_debugfs_root, &modcount);
+	debugfs_create_atomic_t("failed_kreads", 0400, mod_debugfs_root, &failed_kreads);
+	debugfs_create_atomic_t("failed_load_modules", 0400, mod_debugfs_root, &failed_load_modules);
+	debugfs_create_file("stats", 0400, mod_debugfs_root, mod_debugfs_root, &fops_mod_stats);
+
+	return 0;
+}
+module_init(module_stats_init);
diff --git a/kernel/module/tracking.c b/kernel/module/tracking.c
index 26d812e07615..cbeb9330db9b 100644
--- a/kernel/module/tracking.c
+++ b/kernel/module/tracking.c
@@ -15,6 +15,7 @@
 #include "internal.h"
 
 static LIST_HEAD(unloaded_tainted_modules);
+extern struct dentry *mod_debugfs_root
 
 int try_add_tainted_module(struct module *mod)
 {
@@ -120,12 +121,8 @@ static const struct file_operations unloaded_tainted_modules_fops = {
 
 static int __init unloaded_tainted_modules_init(void)
 {
-	struct dentry *dir;
-
-	dir = debugfs_create_dir("modules", NULL);
-	debugfs_create_file("unloaded_tainted", 0444, dir, NULL,
+	debugfs_create_file("unloaded_tainted", 0444, mod_debugfs_root, NULL,
 			    &unloaded_tainted_modules_fops);
-
 	return 0;
 }
 module_init(unloaded_tainted_modules_init);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] module: add debug stats to help identify memory pressure
  2023-03-29  5:31 ` [PATCH 7/7] module: add debug stats to help identify memory pressure Luis Chamberlain
@ 2023-03-29  5:46   ` Greg KH
  2023-03-29  6:04     ` Luis Chamberlain
  0 siblings, 1 reply; 26+ messages in thread
From: Greg KH @ 2023-03-29  5:46 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, rafael, christophe.leroy, tglx,
	peterz, song, rppt, willy, vbabka, mhocko, dave.hansen

On Tue, Mar 28, 2023 at 10:31:49PM -0700, Luis Chamberlain wrote:
> Loading modules with finit_module() can end up using vmalloc(), vmap()
> and vmalloc() again, for a total of up to 3 separate allocations in the
> worse case for a single module. We always kernel_read*() the module,
> that's a vmalloc(). Then vmap() is used for the module decompression,
> and if so the last read buffer is freed as we use the now decompressed
> module buffer to stuff data into our copy module. The last one is
> specific to architectures but pretty much that's generally a series
> of vmalloc() for different ELF sections...
> 
> Evaluation with new stress-ng module support [1] with just 100 ops
> us proving that you can end up using GiBs of data easily even if
> we are trying to be very careful not to load modules which are already
> loaded. 100 ops seems to resemble the sort of pressure a system with
> about 400 CPUs can create on modules. Although those issues for so
> many concurrent loads per CPU is silly and are being fixed, we lack
> proper tooling to help diagnose easily what happened, when it happened
> and what likely are the culprits -- userspace or kernel module
> autoloading.
> 
> Provide an initial set of stats for debugfs which let us easily scrape
> post-boot information about failed loads. This sort of information can
> be used on production worklaods to try to optimize *avoiding* redundant
> memory pressure using finit_module().
> 
> Screen shot:
> 
> root@kmod ~ # cat /sys/kernel/debug/modules/stats
>            Modules loaded       67

Is this "loaded now", or "ever successfully loaded"?  As in a
modprobe/rmmod/modprobe would bump this by 2, right?

Anyway, minor comment on the code, this looks overall great to me:


>         Total module size       11464704
>       Total mod text size       4194304
>        Failed kread bytes       890064
>         Failed kmod bytes       890064
>       Invalid kread bytes       890064
>  Invalid decompress bytes       0
>         Invalid mod bytes       890064
>          Average mod size       171115
>     Average mod text size       62602
> Failed modules:
>                 kvm_intel
>                       kvm
>                 irqbypass
>          crct10dif_pclmul
>       ghash_clmulni_intel
>              sha512_ssse3
>            sha512_generic
>               aesni_intel
>               crypto_simd
>                    cryptd
>                     evdev
>                 serio_raw
>                virtio_pci
>                      nvme
>                 nvme_core
>     virtio_pci_legacy_dev
>                    t10_pi
>            crc64_rocksoft
>     virtio_pci_modern_dev
>              crc32_pclmul
>                    virtio
>              crc32c_intel
>               virtio_ring
>                     crc64
> 
> [0] https://github.com/ColinIanKing/stress-ng.git
> [1] echo 0 > /proc/sys/vm/oom_dump_tasks
>     ./stress-ng --module 100 --module-name xfs
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  kernel/module/Kconfig    |  32 +++++++
>  kernel/module/Makefile   |   4 +
>  kernel/module/debug.c    |  16 ++++
>  kernel/module/internal.h |  35 +++++++
>  kernel/module/main.c     |  45 ++++++++-
>  kernel/module/stats.c    | 200 +++++++++++++++++++++++++++++++++++++++
>  kernel/module/tracking.c |   7 +-
>  7 files changed, 331 insertions(+), 8 deletions(-)
>  create mode 100644 kernel/module/debug.c
>  create mode 100644 kernel/module/stats.c
> 
> diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
> index 424b3bc58f3f..fbf7b92cb3d0 100644
> --- a/kernel/module/Kconfig
> +++ b/kernel/module/Kconfig
> @@ -22,6 +22,12 @@ menuconfig MODULES
>  
>  if MODULES
>  
> +config MODULE_DEBUG
> +	bool "Module debugging"
> +	default n

default is always "n", no need to set that specifically.

> +	help
> +	  Allows to help debug module functionality.

Might want to make this a bit more verbose.

>  config MODULE_FORCE_LOAD
>  	bool "Forced module loading"
>  	default n
> @@ -48,6 +54,8 @@ config MODULE_FORCE_UNLOAD
>  	  rmmod).  This is mainly for kernel developers and desperate users.
>  	  If unsure, say N.
>  
> +if MODULE_DEBUG
> +
>  config MODULE_UNLOAD_TAINT_TRACKING
>  	bool "Tainted module unload tracking"
>  	depends on MODULE_UNLOAD
> @@ -59,6 +67,30 @@ config MODULE_UNLOAD_TAINT_TRACKING
>  	  page (see bad_page()), the aforementioned details are also
>  	  shown. If unsure, say N.
>  
> +config MODULE_STATS
> +	bool "Module statistics"
> +	depends on DEBUG_FS
> +	default n

Again, already default.


> +	help
> +	  This option allows you to maintain a record of module statistics.
> +	  For example each all modules size, average size, text size, and
> +	  failed modules and the size for each of those. For failed
> +	  modules we keep track of module which failed due to either the
> +	  existing module taking too long to load or that module already
> +	  was loaded.
> +
> +	  You should enable this if you are debugging production loads
> +	  and want to see if userspace or the kernel is doing stupid things
> +	  with loading modules when it shouldn't or if you want to help
> +	  optimize userspace / kernel space module autoloading schemes.
> +	  You might want to do this because failed modules tend to use
> +	  use up significan amount of memory, and so you'd be doing everyone
> +	  a favor in avoiding these failure proactively.
> +
> +	  If unsure, say N.
> +
> +endif # MODULE_DEBUG
> +
>  config MODVERSIONS
>  	bool "Module versioning support"
>  	help
> diff --git a/kernel/module/Makefile b/kernel/module/Makefile
> index 5b1d26b53b8d..fe97047f3807 100644
> --- a/kernel/module/Makefile
> +++ b/kernel/module/Makefile
> @@ -20,4 +20,8 @@ obj-$(CONFIG_PROC_FS) += procfs.o
>  obj-$(CONFIG_SYSFS) += sysfs.o
>  obj-$(CONFIG_KGDB_KDB) += kdb.o
>  obj-$(CONFIG_MODVERSIONS) += version.o
> +
> +# Link order matters here, keep debug.o first.
> +obj-$(CONFIG_MODULE_DEBUG) += debug.o
> +obj-$(CONFIG_MODULE_STATS) += stats.o
>  obj-$(CONFIG_MODULE_UNLOAD_TAINT_TRACKING) += tracking.o
> diff --git a/kernel/module/debug.c b/kernel/module/debug.c
> new file mode 100644
> index 000000000000..ef580d70b751
> --- /dev/null
> +++ b/kernel/module/debug.c
> @@ -0,0 +1,16 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2023 Luis Chamberlain <mcgrof@kernel.org>
> + */
> +
> +#include <linux/debugfs.h>
> +#include "internal.h"
> +
> +struct dentry *mod_debugfs_root;
> +
> +static int module_debugfs_init(void)
> +{
> +	mod_debugfs_root = debugfs_create_dir("modules", NULL);
> +	return 0;
> +}
> +module_init(module_debugfs_init);

Why is this a whole separate file?

And as MODULE_DEBUG does not reference debugfs, yet it still creates the
debugfs directory here?  Yes, it will just not do anything if debugfs is
not enabled, but then why is this file here?

> diff --git a/kernel/module/internal.h b/kernel/module/internal.h
> index 6ae29bb8836f..a645cb3fafc7 100644
> --- a/kernel/module/internal.h
> +++ b/kernel/module/internal.h
> @@ -143,6 +143,41 @@ static inline bool set_livepatch_module(struct module *mod)
>  #endif
>  }
>  
> +#ifdef CONFIG_MODULE_STATS
> +
> +#define mod_stat_add64(count, var) atomic64_add(count, var)
> +#define mod_stat_inc(name) atomic_inc(name)

Ok, but:

> +
> +extern atomic64_t total_mod_size;
> +extern atomic64_t total_text_size;
> +extern atomic64_t invalid_kread_bytes;
> +extern atomic64_t invalid_decompress_bytes;
> +extern atomic64_t invalid_mod_becoming_bytes;
> +extern atomic64_t invalid_mod_bytes;
> +
> +extern atomic_t modcount;
> +extern atomic_t failed_kreads;
> +extern atomic_t failed_decompress;
> +extern atomic_t failed_load_modules;
> +struct mod_fail_load {
> +	struct list_head list;
> +	char name[MODULE_NAME_LEN];
> +	atomic_t count;
> +};
> +
> +int try_add_failed_module(const char *name);
> +
> +#else
> +
> +#define mod_stat_inc64(name)
> +#define mod_stat_inc(name) atomic_inc(name)

Why do you still increment the variable here if the option is not
enabled?

Also, didn't we have some sort of "we want to use an atomic variable as
statistics" type somewhere in the kernel?  Or did that never get
accepted?

And do all of these really need to be atomic variables?  Don't you have
locks for some of this to not need the atomic-ness of them?  I guess it
doesn't matter much as this isn't that fast of a code-path.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/7] debugfs: add debugfs_create_atomic64_t for atomic64_t
  2023-03-29  5:31 ` [PATCH 6/7] debugfs: add debugfs_create_atomic64_t for atomic64_t Luis Chamberlain
@ 2023-03-29  5:46   ` Greg KH
  0 siblings, 0 replies; 26+ messages in thread
From: Greg KH @ 2023-03-29  5:46 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, rafael, christophe.leroy, tglx,
	peterz, song, rppt, willy, vbabka, mhocko, dave.hansen

On Tue, Mar 28, 2023 at 10:31:48PM -0700, Luis Chamberlain wrote:
> Sometimes you want to add debugfs entries for atomic counters which
> can be pretty large using atomic64_t. Add support for these.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  fs/debugfs/file.c       | 36 ++++++++++++++++++++++++++++++++++++
>  include/linux/debugfs.h |  2 ++
>  2 files changed, 38 insertions(+)

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] module: add debug stats to help identify memory pressure
  2023-03-29  5:46   ` Greg KH
@ 2023-03-29  6:04     ` Luis Chamberlain
  0 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-29  6:04 UTC (permalink / raw)
  To: Greg KH
  Cc: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, rafael, christophe.leroy, tglx,
	peterz, song, rppt, willy, vbabka, mhocko, dave.hansen

On Wed, Mar 29, 2023 at 07:46:22AM +0200, Greg KH wrote:
> On Tue, Mar 28, 2023 at 10:31:49PM -0700, Luis Chamberlain wrote:
> > Loading modules with finit_module() can end up using vmalloc(), vmap()
> > and vmalloc() again, for a total of up to 3 separate allocations in the
> > worse case for a single module. We always kernel_read*() the module,
> > that's a vmalloc(). Then vmap() is used for the module decompression,
> > and if so the last read buffer is freed as we use the now decompressed
> > module buffer to stuff data into our copy module. The last one is
> > specific to architectures but pretty much that's generally a series
> > of vmalloc() for different ELF sections...
> > 
> > Evaluation with new stress-ng module support [1] with just 100 ops
> > us proving that you can end up using GiBs of data easily even if
> > we are trying to be very careful not to load modules which are already
> > loaded. 100 ops seems to resemble the sort of pressure a system with
> > about 400 CPUs can create on modules. Although those issues for so
> > many concurrent loads per CPU is silly and are being fixed, we lack
> > proper tooling to help diagnose easily what happened, when it happened
> > and what likely are the culprits -- userspace or kernel module
> > autoloading.
> > 
> > Provide an initial set of stats for debugfs which let us easily scrape
> > post-boot information about failed loads. This sort of information can
> > be used on production worklaods to try to optimize *avoiding* redundant
> > memory pressure using finit_module().
> > 
> > Screen shot:
> > 
> > root@kmod ~ # cat /sys/kernel/debug/modules/stats
> >            Modules loaded       67
> 
> Is this "loaded now", or "ever successfully loaded"?  As in a
> modprobe/rmmod/modprobe would bump this by 2, right?

Ah, the later, so "how modules have I ever loaded". Maybe

Modules ever loaded

?

Will fix the nits, thanks!

> > diff --git a/kernel/module/debug.c b/kernel/module/debug.c
> 
> Why is this a whole separate file?

It's just a style preference, no real hard reason other than
module.c was huge before and now its split up. I find that
easier to review / manage. Certainly overkill for such as
simple thing but if its debug I think I rather see that
then some ifdef eyesore. But that's just preference.

> And as MODULE_DEBUG does not reference debugfs,

That should be fixed thanks.

> > diff --git a/kernel/module/internal.h b/kernel/module/internal.h
> > index 6ae29bb8836f..a645cb3fafc7 100644
> > --- a/kernel/module/internal.h
> > +++ b/kernel/module/internal.h
> > @@ -143,6 +143,41 @@ static inline bool set_livepatch_module(struct module *mod)
> >  #endif
> >  }
> >  
> > +#ifdef CONFIG_MODULE_STATS
> > +
> > +#define mod_stat_add64(count, var) atomic64_add(count, var)
> > +#define mod_stat_inc(name) atomic_inc(name)
> 
> Ok, but:
> 
> > +#define mod_stat_inc(name) atomic_inc(name)
> 
> Why do you still increment the variable here if the option is not
> enabled?

Whoops, will fix!

> Also, didn't we have some sort of "we want to use an atomic variable as
> statistics" type somewhere in the kernel? 

I didn't get the memo, nor do I recall, so it's not on my radar.

> Or did that never get accepted?

Not sure.

> And do all of these really need to be atomic variables?  Don't you have
> locks for some of this to not need the atomic-ness of them?  I guess it
> doesn't matter much as this isn't that fast of a code-path.

That was actually intentional, as this only *grows* I just care its not 0
so to help divide by the total number of modules to get average module
length and average module .text length. I used atomics and made it only
grow precisely to not have to lock anywhere.

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-29  5:31 ` [PATCH 4/7] sempahore: add a helper for a concurrency limiter Luis Chamberlain
@ 2023-03-29  7:21   ` Peter Zijlstra
  2023-03-29  7:51     ` Luis Chamberlain
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2023-03-29  7:21 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael, christophe.leroy,
	tglx, song, rppt, willy, vbabka, mhocko, dave.hansen

On Tue, Mar 28, 2023 at 10:31:46PM -0700, Luis Chamberlain wrote:
> While I looked at re-using the old kernel/kmod.c (now kernel/module/kmod.c)
> concurrency delimiter methodology for another place in the kernel Linus
> noted that this could be simply replaced with a sempahore [0].
> 
> So add that so we we don't re-invent the wheel and make it obvious to use.
> 
> [0] https://lore.kernel.org/all/CAHk-=whkj6=wyi201JXkw9iT_eTUTsSx+Yb9d4OgmZFjDJA18g@mail.gmail.com/
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  include/linux/semaphore.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/linux/semaphore.h b/include/linux/semaphore.h
> index 6694d0019a68..2ecdffdb9814 100644
> --- a/include/linux/semaphore.h
> +++ b/include/linux/semaphore.h
> @@ -28,6 +28,9 @@ struct semaphore {
>  #define DEFINE_SEMAPHORE(name)	\
>  	struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
>  
> +#define CONCURRENCY_LIMITER(name, n) \
> +	struct semaphore name = __SEMAPHORE_INITIALIZER(name, n)
> +

Why should this live in semaphore.h?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-29  7:21   ` Peter Zijlstra
@ 2023-03-29  7:51     ` Luis Chamberlain
  2023-03-29  9:19       ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-29  7:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael, christophe.leroy,
	tglx, song, rppt, willy, vbabka, mhocko, dave.hansen

On Wed, Mar 29, 2023 at 09:21:12AM +0200, Peter Zijlstra wrote:
> On Tue, Mar 28, 2023 at 10:31:46PM -0700, Luis Chamberlain wrote:
> > While I looked at re-using the old kernel/kmod.c (now kernel/module/kmod.c)
> > concurrency delimiter methodology for another place in the kernel Linus
> > noted that this could be simply replaced with a sempahore [0].
> > 
> > So add that so we we don't re-invent the wheel and make it obvious to use.
> > 
> > [0] https://lore.kernel.org/all/CAHk-=whkj6=wyi201JXkw9iT_eTUTsSx+Yb9d4OgmZFjDJA18g@mail.gmail.com/
> > 
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  include/linux/semaphore.h | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/include/linux/semaphore.h b/include/linux/semaphore.h
> > index 6694d0019a68..2ecdffdb9814 100644
> > --- a/include/linux/semaphore.h
> > +++ b/include/linux/semaphore.h
> > @@ -28,6 +28,9 @@ struct semaphore {
> >  #define DEFINE_SEMAPHORE(name)	\
> >  	struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
> >  
> > +#define CONCURRENCY_LIMITER(name, n) \
> > +	struct semaphore name = __SEMAPHORE_INITIALIZER(name, n)
> > +
> 
> Why should this live in semaphore.h?

I have no preference, but sharing seems to have been better. Do you
have any recommendations?

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-29  7:51     ` Luis Chamberlain
@ 2023-03-29  9:19       ` Peter Zijlstra
  2023-03-29  9:49         ` Luis Chamberlain
  2023-03-29 16:50         ` Linus Torvalds
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2023-03-29  9:19 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael, christophe.leroy,
	tglx, song, rppt, willy, vbabka, mhocko, dave.hansen

On Wed, Mar 29, 2023 at 12:51:48AM -0700, Luis Chamberlain wrote:
> On Wed, Mar 29, 2023 at 09:21:12AM +0200, Peter Zijlstra wrote:
> > On Tue, Mar 28, 2023 at 10:31:46PM -0700, Luis Chamberlain wrote:
> > > While I looked at re-using the old kernel/kmod.c (now kernel/module/kmod.c)
> > > concurrency delimiter methodology for another place in the kernel Linus
> > > noted that this could be simply replaced with a sempahore [0].
> > > 
> > > So add that so we we don't re-invent the wheel and make it obvious to use.
> > > 
> > > [0] https://lore.kernel.org/all/CAHk-=whkj6=wyi201JXkw9iT_eTUTsSx+Yb9d4OgmZFjDJA18g@mail.gmail.com/
> > > 
> > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > > ---
> > >  include/linux/semaphore.h | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/include/linux/semaphore.h b/include/linux/semaphore.h
> > > index 6694d0019a68..2ecdffdb9814 100644
> > > --- a/include/linux/semaphore.h
> > > +++ b/include/linux/semaphore.h
> > > @@ -28,6 +28,9 @@ struct semaphore {
> > >  #define DEFINE_SEMAPHORE(name)	\
> > >  	struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
> > >  
> > > +#define CONCURRENCY_LIMITER(name, n) \
> > > +	struct semaphore name = __SEMAPHORE_INITIALIZER(name, n)
> > > +
> > 
> > Why should this live in semaphore.h?
> 
> I have no preference, but sharing seems to have been better. Do you
> have any recommendations?

Call is DEFINE_SEMAPHORE_N() ?

Arguably DEFINE_SEMAPHORE() should have the argument, as binary
semaphores are a special case, but then we gotta go and fix up all
users.

/me git-greps a little.. Hmm, not too bad.

How's this?

---
 arch/mips/cavium-octeon/setup.c                               | 2 +-
 arch/x86/kernel/cpu/intel.c                                   | 2 +-
 drivers/firmware/efi/runtime-wrappers.c                       | 2 +-
 drivers/firmware/efi/vars.c                                   | 2 +-
 drivers/macintosh/adb.c                                       | 2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c              | 2 +-
 drivers/platform/x86/intel/ifs/sysfs.c                        | 2 +-
 drivers/scsi/esas2r/esas2r_ioctl.c                            | 2 +-
 drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c | 2 +-
 include/linux/semaphore.h                                     | 7 +++++--
 kernel/printk/printk.c                                        | 2 +-
 net/rxrpc/call_object.c                                       | 6 ++----
 12 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/arch/mips/cavium-octeon/setup.c b/arch/mips/cavium-octeon/setup.c
index a71727f7a608..794b681433a7 100644
--- a/arch/mips/cavium-octeon/setup.c
+++ b/arch/mips/cavium-octeon/setup.c
@@ -72,7 +72,7 @@ extern void pci_console_init(const char *arg);
 static unsigned long long max_memory = ULLONG_MAX;
 static unsigned long long reserve_low_mem;
 
-DEFINE_SEMAPHORE(octeon_bootbus_sem);
+DEFINE_BINARY_SEMAPHORE(octeon_bootbus_sem);
 EXPORT_SYMBOL(octeon_bootbus_sem);
 
 static struct octeon_boot_descriptor *octeon_boot_desc_ptr;
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 291d4167fab8..c1ace4d46c35 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1177,7 +1177,7 @@ static const struct {
 static struct ratelimit_state bld_ratelimit;
 
 static unsigned int sysctl_sld_mitigate = 1;
-static DEFINE_SEMAPHORE(buslock_sem);
+static DEFINE_BINARY_SEMAPHORE(buslock_sem);
 
 #ifdef CONFIG_PROC_SYSCTL
 static struct ctl_table sld_sysctls[] = {
diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index 1fba4e09cdcf..1139ad6429e7 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -158,7 +158,7 @@ void efi_call_virt_check_flags(unsigned long flags, const char *call)
  * none of the remaining functions are actually ever called at runtime.
  * So let's just use a single lock to serialize all Runtime Services calls.
  */
-static DEFINE_SEMAPHORE(efi_runtime_lock);
+static DEFINE_BINARY_SEMAPHORE(efi_runtime_lock);
 
 /*
  * Expose the EFI runtime lock to the UV platform
diff --git a/drivers/firmware/efi/vars.c b/drivers/firmware/efi/vars.c
index bd75b87f5fc1..09647e133c5a 100644
--- a/drivers/firmware/efi/vars.c
+++ b/drivers/firmware/efi/vars.c
@@ -21,7 +21,7 @@
 /* Private pointer to registered efivars */
 static struct efivars *__efivars;
 
-static DEFINE_SEMAPHORE(efivars_lock);
+static DEFINE_BINARY_SEMAPHORE(efivars_lock);
 
 static efi_status_t check_var_size(bool nonblocking, u32 attributes,
 				   unsigned long size)
diff --git a/drivers/macintosh/adb.c b/drivers/macintosh/adb.c
index 23bd0c77ac1a..c70b18b9f97d 100644
--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -80,7 +80,7 @@ static struct adb_driver *adb_controller;
 BLOCKING_NOTIFIER_HEAD(adb_client_list);
 static int adb_got_sleep;
 static int adb_inited;
-static DEFINE_SEMAPHORE(adb_probe_mutex);
+static DEFINE_BINARY_SEMAPHORE(adb_probe_mutex);
 static int sleepy_trackpad;
 static int autopoll_devs;
 int __adb_probe_sync;
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 5d1e4fe335aa..0cade1fd266f 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -298,7 +298,7 @@ const u32 dmae_reg_go_c[] = {
 
 /* Global resources for unloading a previously loaded device */
 #define BNX2X_PREV_WAIT_NEEDED 1
-static DEFINE_SEMAPHORE(bnx2x_prev_sem);
+static DEFINE_BINARY_SEMAPHORE(bnx2x_prev_sem);
 static LIST_HEAD(bnx2x_prev_list);
 
 /* Forward declaration */
diff --git a/drivers/platform/x86/intel/ifs/sysfs.c b/drivers/platform/x86/intel/ifs/sysfs.c
index ee636a76b083..ee6782cad44f 100644
--- a/drivers/platform/x86/intel/ifs/sysfs.c
+++ b/drivers/platform/x86/intel/ifs/sysfs.c
@@ -13,7 +13,7 @@
  * Protects against simultaneous tests on multiple cores, or
  * reloading can file while a test is in progress
  */
-static DEFINE_SEMAPHORE(ifs_sem);
+static DEFINE_BINARY_SEMAPHORE(ifs_sem);
 
 /*
  * The sysfs interface to check additional details of last test
diff --git a/drivers/scsi/esas2r/esas2r_ioctl.c b/drivers/scsi/esas2r/esas2r_ioctl.c
index e003d923acbf..ca40d32181a7 100644
--- a/drivers/scsi/esas2r/esas2r_ioctl.c
+++ b/drivers/scsi/esas2r/esas2r_ioctl.c
@@ -56,7 +56,7 @@ dma_addr_t esas2r_buffered_ioctl_addr;
 u32 esas2r_buffered_ioctl_size;
 struct pci_dev *esas2r_buffered_ioctl_pcid;
 
-static DEFINE_SEMAPHORE(buffered_ioctl_semaphore);
+static DEFINE_BINARY_SEMAPHORE(buffered_ioctl_semaphore);
 typedef int (*BUFFERED_IOCTL_CALLBACK)(struct esas2r_adapter *,
 				       struct esas2r_request *,
 				       struct esas2r_sg_context *,
diff --git a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c
index cddcd3c596c9..c7093f367f27 100644
--- a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c
+++ b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c
@@ -149,7 +149,7 @@ static char *g_fragments_base;
 static char *g_free_fragments;
 static struct semaphore g_free_fragments_sema;
 
-static DEFINE_SEMAPHORE(g_free_fragments_mutex);
+static DEFINE_BINARY_SEMAPHORE(g_free_fragments_mutex);
 
 static int
 vchiq_blocking_bulk_transfer(struct vchiq_instance *instance, unsigned int handle, void *data,
diff --git a/include/linux/semaphore.h b/include/linux/semaphore.h
index 6694d0019a68..18894bffb3f4 100644
--- a/include/linux/semaphore.h
+++ b/include/linux/semaphore.h
@@ -25,8 +25,11 @@ struct semaphore {
 	.wait_list	= LIST_HEAD_INIT((name).wait_list),		\
 }
 
-#define DEFINE_SEMAPHORE(name)	\
-	struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
+#define DEFINE_SEMAPHORE(_name, _n)	\
+	struct semaphore _name = __SEMAPHORE_INITIALIZER(_name, _n)
+
+#define DEFINE_BINARY_SEMAPHORE(_name)	\
+	DEFINE_SEMAPHORE(_name, 1)
 
 static inline void sema_init(struct semaphore *sem, int val)
 {
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index fd0c9f913940..a3aa21d26de2 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -89,7 +89,7 @@ static DEFINE_MUTEX(console_mutex);
  * console_sem protects updates to console->seq and console_suspended,
  * and also provides serialization for console printing.
  */
-static DEFINE_SEMAPHORE(console_sem);
+static DEFINE_BINARY_SEMAPHORE(console_sem);
 HLIST_HEAD(console_list);
 EXPORT_SYMBOL_GPL(console_list);
 DEFINE_STATIC_SRCU(console_srcu);
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index e9f1f49d18c2..3e5cc70884dd 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -40,10 +40,8 @@ const char *const rxrpc_call_completions[NR__RXRPC_CALL_COMPLETIONS] = {
 
 struct kmem_cache *rxrpc_call_jar;
 
-static struct semaphore rxrpc_call_limiter =
-	__SEMAPHORE_INITIALIZER(rxrpc_call_limiter, 1000);
-static struct semaphore rxrpc_kernel_call_limiter =
-	__SEMAPHORE_INITIALIZER(rxrpc_kernel_call_limiter, 1000);
+static DEFINE_SEMAPHORE(rxrpc_call_limiter, 1000);
+static DEFINE_SEMAPHORE(rxrpc_kernel_call_limiter, 1000);
 
 void rxrpc_poke_call(struct rxrpc_call *call, enum rxrpc_call_poke_trace what)
 {

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-29  9:19       ` Peter Zijlstra
@ 2023-03-29  9:49         ` Luis Chamberlain
  2023-03-29 10:14           ` Peter Zijlstra
  2023-03-29 16:50         ` Linus Torvalds
  1 sibling, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-29  9:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael, christophe.leroy,
	tglx, song, rppt, willy, vbabka, mhocko, dave.hansen

On Wed, Mar 29, 2023 at 11:19:35AM +0200, Peter Zijlstra wrote:
> On Wed, Mar 29, 2023 at 12:51:48AM -0700, Luis Chamberlain wrote:
> > On Wed, Mar 29, 2023 at 09:21:12AM +0200, Peter Zijlstra wrote:
> > > On Tue, Mar 28, 2023 at 10:31:46PM -0700, Luis Chamberlain wrote:
> > > > While I looked at re-using the old kernel/kmod.c (now kernel/module/kmod.c)
> > > > concurrency delimiter methodology for another place in the kernel Linus
> > > > noted that this could be simply replaced with a sempahore [0].
> > > > 
> > > > So add that so we we don't re-invent the wheel and make it obvious to use.
> > > > 
> > > > [0] https://lore.kernel.org/all/CAHk-=whkj6=wyi201JXkw9iT_eTUTsSx+Yb9d4OgmZFjDJA18g@mail.gmail.com/
> > > > 
> > > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > > > ---
> > > >  include/linux/semaphore.h | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/include/linux/semaphore.h b/include/linux/semaphore.h
> > > > index 6694d0019a68..2ecdffdb9814 100644
> > > > --- a/include/linux/semaphore.h
> > > > +++ b/include/linux/semaphore.h
> > > > @@ -28,6 +28,9 @@ struct semaphore {
> > > >  #define DEFINE_SEMAPHORE(name)	\
> > > >  	struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
> > > >  
> > > > +#define CONCURRENCY_LIMITER(name, n) \
> > > > +	struct semaphore name = __SEMAPHORE_INITIALIZER(name, n)
> > > > +
> > > 
> > > Why should this live in semaphore.h?
> > 
> > I have no preference, but sharing seems to have been better. Do you
> > have any recommendations?
> 
> Call is DEFINE_SEMAPHORE_N() ?
> 
> Arguably DEFINE_SEMAPHORE() should have the argument, as binary
> semaphores are a special case, but then we gotta go and fix up all
> users.
> 
> /me git-greps a little.. Hmm, not too bad.
> 
> How's this?

Seems OK to me. Either way works. Should I carry a patch from you for this
series?

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-29  9:49         ` Luis Chamberlain
@ 2023-03-29 10:14           ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2023-03-29 10:14 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael, christophe.leroy,
	tglx, song, rppt, willy, vbabka, mhocko, dave.hansen

On Wed, Mar 29, 2023 at 02:49:43AM -0700, Luis Chamberlain wrote:
> Seems OK to me. Either way works. Should I carry a patch from you for this
> series?

Sure, here goes (still not even compile tested, but what can go wrong,
right :-)

---
Subject: semaphore: Change DEFINE_SEMAPHORE() to take a number argument
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Mar 29 12:06:08 CEST 2023

Fundamentally semaphores are a counted primitive, but
DEFINE_SEMAPHORE() does not expose this and explicitly creates a
binary semaphore.

Provide DEFINE_BINARY_SEMAPHORE() for this case and change
DEFINE_SEMAPHORE() to take a number argument and use that in the few
places that open-coded it using __SEMAPHORE_INITIALIZER().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/mips/cavium-octeon/setup.c                               |    2 +-
 arch/x86/kernel/cpu/intel.c                                   |    2 +-
 drivers/firmware/efi/runtime-wrappers.c                       |    2 +-
 drivers/firmware/efi/vars.c                                   |    2 +-
 drivers/macintosh/adb.c                                       |    2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c              |    2 +-
 drivers/platform/x86/intel/ifs/sysfs.c                        |    2 +-
 drivers/scsi/esas2r/esas2r_ioctl.c                            |    2 +-
 drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c |    2 +-
 include/linux/semaphore.h                                     |    7 +++++--
 kernel/printk/printk.c                                        |    2 +-
 net/rxrpc/call_object.c                                       |    6 ++----
 12 files changed, 17 insertions(+), 16 deletions(-)

--- a/arch/mips/cavium-octeon/setup.c
+++ b/arch/mips/cavium-octeon/setup.c
@@ -72,7 +72,7 @@ extern void pci_console_init(const char
 static unsigned long long max_memory = ULLONG_MAX;
 static unsigned long long reserve_low_mem;
 
-DEFINE_SEMAPHORE(octeon_bootbus_sem);
+DEFINE_BINARY_SEMAPHORE(octeon_bootbus_sem);
 EXPORT_SYMBOL(octeon_bootbus_sem);
 
 static struct octeon_boot_descriptor *octeon_boot_desc_ptr;
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1177,7 +1177,7 @@ static const struct {
 static struct ratelimit_state bld_ratelimit;
 
 static unsigned int sysctl_sld_mitigate = 1;
-static DEFINE_SEMAPHORE(buslock_sem);
+static DEFINE_BINARY_SEMAPHORE(buslock_sem);
 
 #ifdef CONFIG_PROC_SYSCTL
 static struct ctl_table sld_sysctls[] = {
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -158,7 +158,7 @@ void efi_call_virt_check_flags(unsigned
  * none of the remaining functions are actually ever called at runtime.
  * So let's just use a single lock to serialize all Runtime Services calls.
  */
-static DEFINE_SEMAPHORE(efi_runtime_lock);
+static DEFINE_BINARY_SEMAPHORE(efi_runtime_lock);
 
 /*
  * Expose the EFI runtime lock to the UV platform
--- a/drivers/firmware/efi/vars.c
+++ b/drivers/firmware/efi/vars.c
@@ -21,7 +21,7 @@
 /* Private pointer to registered efivars */
 static struct efivars *__efivars;
 
-static DEFINE_SEMAPHORE(efivars_lock);
+static DEFINE_BINARY_SEMAPHORE(efivars_lock);
 
 static efi_status_t check_var_size(bool nonblocking, u32 attributes,
 				   unsigned long size)
--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -80,7 +80,7 @@ static struct adb_driver *adb_controller
 BLOCKING_NOTIFIER_HEAD(adb_client_list);
 static int adb_got_sleep;
 static int adb_inited;
-static DEFINE_SEMAPHORE(adb_probe_mutex);
+static DEFINE_BINARY_SEMAPHORE(adb_probe_mutex);
 static int sleepy_trackpad;
 static int autopoll_devs;
 int __adb_probe_sync;
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -298,7 +298,7 @@ const u32 dmae_reg_go_c[] = {
 
 /* Global resources for unloading a previously loaded device */
 #define BNX2X_PREV_WAIT_NEEDED 1
-static DEFINE_SEMAPHORE(bnx2x_prev_sem);
+static DEFINE_BINARY_SEMAPHORE(bnx2x_prev_sem);
 static LIST_HEAD(bnx2x_prev_list);
 
 /* Forward declaration */
--- a/drivers/platform/x86/intel/ifs/sysfs.c
+++ b/drivers/platform/x86/intel/ifs/sysfs.c
@@ -13,7 +13,7 @@
  * Protects against simultaneous tests on multiple cores, or
  * reloading can file while a test is in progress
  */
-static DEFINE_SEMAPHORE(ifs_sem);
+static DEFINE_BINARY_SEMAPHORE(ifs_sem);
 
 /*
  * The sysfs interface to check additional details of last test
--- a/drivers/scsi/esas2r/esas2r_ioctl.c
+++ b/drivers/scsi/esas2r/esas2r_ioctl.c
@@ -56,7 +56,7 @@ dma_addr_t esas2r_buffered_ioctl_addr;
 u32 esas2r_buffered_ioctl_size;
 struct pci_dev *esas2r_buffered_ioctl_pcid;
 
-static DEFINE_SEMAPHORE(buffered_ioctl_semaphore);
+static DEFINE_BINARY_SEMAPHORE(buffered_ioctl_semaphore);
 typedef int (*BUFFERED_IOCTL_CALLBACK)(struct esas2r_adapter *,
 				       struct esas2r_request *,
 				       struct esas2r_sg_context *,
--- a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c
+++ b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c
@@ -149,7 +149,7 @@ static char *g_fragments_base;
 static char *g_free_fragments;
 static struct semaphore g_free_fragments_sema;
 
-static DEFINE_SEMAPHORE(g_free_fragments_mutex);
+static DEFINE_BINARY_SEMAPHORE(g_free_fragments_mutex);
 
 static int
 vchiq_blocking_bulk_transfer(struct vchiq_instance *instance, unsigned int handle, void *data,
--- a/include/linux/semaphore.h
+++ b/include/linux/semaphore.h
@@ -25,8 +25,11 @@ struct semaphore {
 	.wait_list	= LIST_HEAD_INIT((name).wait_list),		\
 }
 
-#define DEFINE_SEMAPHORE(name)	\
-	struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
+#define DEFINE_SEMAPHORE(_name, _n)	\
+	struct semaphore _name = __SEMAPHORE_INITIALIZER(_name, _n)
+
+#define DEFINE_BINARY_SEMAPHORE(_name)	\
+	DEFINE_SEMAPHORE(_name, 1)
 
 static inline void sema_init(struct semaphore *sem, int val)
 {
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -89,7 +89,7 @@ static DEFINE_MUTEX(console_mutex);
  * console_sem protects updates to console->seq and console_suspended,
  * and also provides serialization for console printing.
  */
-static DEFINE_SEMAPHORE(console_sem);
+static DEFINE_BINARY_SEMAPHORE(console_sem);
 HLIST_HEAD(console_list);
 EXPORT_SYMBOL_GPL(console_list);
 DEFINE_STATIC_SRCU(console_srcu);
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -40,10 +40,8 @@ const char *const rxrpc_call_completions
 
 struct kmem_cache *rxrpc_call_jar;
 
-static struct semaphore rxrpc_call_limiter =
-	__SEMAPHORE_INITIALIZER(rxrpc_call_limiter, 1000);
-static struct semaphore rxrpc_kernel_call_limiter =
-	__SEMAPHORE_INITIALIZER(rxrpc_kernel_call_limiter, 1000);
+static DEFINE_SEMAPHORE(rxrpc_call_limiter, 1000);
+static DEFINE_SEMAPHORE(rxrpc_kernel_call_limiter, 1000);
 
 void rxrpc_poke_call(struct rxrpc_call *call, enum rxrpc_call_poke_trace what)
 {

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-29  9:19       ` Peter Zijlstra
  2023-03-29  9:49         ` Luis Chamberlain
@ 2023-03-29 16:50         ` Linus Torvalds
  2023-03-30 11:56           ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2023-03-29 16:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Luis Chamberlain, david, patches, linux-modules, linux-mm,
	linux-kernel, pmladek, petr.pavlu, prarit, gregkh, rafael,
	christophe.leroy, tglx, song, rppt, willy, vbabka, mhocko,
	dave.hansen

On Wed, Mar 29, 2023 at 2:19 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Arguably DEFINE_SEMAPHORE() should have the argument, as binary
> semaphores are a special case, but then we gotta go and fix up all
> users.

Using semaphores for just pure mutual exclusion used to be *the* most
common use of it, which is why we didn't have an argument.

Then we got the mutexes, and now semaphores are almost entirely a legacy thing.

I think we should just make DEFINE_SEMAPHORE() take the number, and
people who want a mutex should either put in the "1", or they should
just use a mutex.

> /me git-greps a little.. Hmm, not too bad.
>
> How's this?

I'd actually prefer to not have that DEFINE_BINARY_SEMAPHORE() at all.
It really shouldn't exist in this day and age.

It's not even less typing, ie

    static DEFINE_SEMAPHORE(efivars_lock, 1);

is actually shorter than

    static DEFINE_BINARY_SEMAPHORE(efivars_lock);

And what you actually *want* is

    static DEFINE_MUTEX(efivars_lock);

and converting the up/down to mutex_unlock/mutex_lock.

So let's just make it clear that the only reason to use semaphores
these days is for counting semaphores, and just make
DEFINE_SEMAPHORE() take the number.

Strangely, sema_init() already does that, but I guess that's because
some people really *do* use semaphores for concurrency control (ie I
see things like

        sema_init(&dc->in_flight, 64);

which is clearly using a semaphore in that very traditional way).

So ack on your patch, but don't bother with DEFINE_BINARY_SEMAPHORE().

               Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-29 16:50         ` Linus Torvalds
@ 2023-03-30 11:56           ` Peter Zijlstra
  2023-03-30 16:23             ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2023-03-30 11:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luis Chamberlain, david, patches, linux-modules, linux-mm,
	linux-kernel, pmladek, petr.pavlu, prarit, gregkh, rafael,
	christophe.leroy, tglx, song, rppt, willy, vbabka, mhocko,
	dave.hansen

On Wed, Mar 29, 2023 at 09:50:39AM -0700, Linus Torvalds wrote:
> So ack on your patch, but don't bother with DEFINE_BINARY_SEMAPHORE().

Sure thing; still completely untested...

---
Subject: Change DEFINE_SEMAPHORE() to take a number argument
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed, 29 Mar 2023 12:14:42 +0200

Fundamentally semaphores are a counted primitive, but
DEFINE_SEMAPHORE() does not expose this and explicitly creates a
binary semaphore.

Change DEFINE_SEMAPHORE() to take a number argument and use that in the
few places that open-coded it using __SEMAPHORE_INITIALIZER().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/mips/cavium-octeon/setup.c                               |    2 +-
 arch/x86/kernel/cpu/intel.c                                   |    2 +-
 drivers/firmware/efi/runtime-wrappers.c                       |    2 +-
 drivers/firmware/efi/vars.c                                   |    2 +-
 drivers/macintosh/adb.c                                       |    2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c              |    2 +-
 drivers/platform/x86/intel/ifs/sysfs.c                        |    2 +-
 drivers/scsi/esas2r/esas2r_ioctl.c                            |    2 +-
 drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c |    2 +-
 include/linux/semaphore.h                                     |    4 ++--
 kernel/printk/printk.c                                        |    2 +-
 net/rxrpc/call_object.c                                       |    6 ++----
 12 files changed, 14 insertions(+), 16 deletions(-)

--- a/arch/mips/cavium-octeon/setup.c
+++ b/arch/mips/cavium-octeon/setup.c
@@ -72,7 +72,7 @@ extern void pci_console_init(const char
 static unsigned long long max_memory = ULLONG_MAX;
 static unsigned long long reserve_low_mem;
 
-DEFINE_SEMAPHORE(octeon_bootbus_sem);
+DEFINE_SEMAPHORE(octeon_bootbus_sem, 1);
 EXPORT_SYMBOL(octeon_bootbus_sem);
 
 static struct octeon_boot_descriptor *octeon_boot_desc_ptr;
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1177,7 +1177,7 @@ static const struct {
 static struct ratelimit_state bld_ratelimit;
 
 static unsigned int sysctl_sld_mitigate = 1;
-static DEFINE_SEMAPHORE(buslock_sem);
+static DEFINE_SEMAPHORE(buslock_sem, 1);
 
 #ifdef CONFIG_PROC_SYSCTL
 static struct ctl_table sld_sysctls[] = {
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -158,7 +158,7 @@ void efi_call_virt_check_flags(unsigned
  * none of the remaining functions are actually ever called at runtime.
  * So let's just use a single lock to serialize all Runtime Services calls.
  */
-static DEFINE_SEMAPHORE(efi_runtime_lock);
+static DEFINE_SEMAPHORE(efi_runtime_lock, 1);
 
 /*
  * Expose the EFI runtime lock to the UV platform
--- a/drivers/firmware/efi/vars.c
+++ b/drivers/firmware/efi/vars.c
@@ -21,7 +21,7 @@
 /* Private pointer to registered efivars */
 static struct efivars *__efivars;
 
-static DEFINE_SEMAPHORE(efivars_lock);
+static DEFINE_SEMAPHORE(efivars_lock, 1);
 
 static efi_status_t check_var_size(bool nonblocking, u32 attributes,
 				   unsigned long size)
--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -80,7 +80,7 @@ static struct adb_driver *adb_controller
 BLOCKING_NOTIFIER_HEAD(adb_client_list);
 static int adb_got_sleep;
 static int adb_inited;
-static DEFINE_SEMAPHORE(adb_probe_mutex);
+static DEFINE_SEMAPHORE(adb_probe_mutex, 1);
 static int sleepy_trackpad;
 static int autopoll_devs;
 int __adb_probe_sync;
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -298,7 +298,7 @@ const u32 dmae_reg_go_c[] = {
 
 /* Global resources for unloading a previously loaded device */
 #define BNX2X_PREV_WAIT_NEEDED 1
-static DEFINE_SEMAPHORE(bnx2x_prev_sem);
+static DEFINE_SEMAPHORE(bnx2x_prev_sem, 1);
 static LIST_HEAD(bnx2x_prev_list);
 
 /* Forward declaration */
--- a/drivers/platform/x86/intel/ifs/sysfs.c
+++ b/drivers/platform/x86/intel/ifs/sysfs.c
@@ -13,7 +13,7 @@
  * Protects against simultaneous tests on multiple cores, or
  * reloading can file while a test is in progress
  */
-static DEFINE_SEMAPHORE(ifs_sem);
+static DEFINE_SEMAPHORE(ifs_sem, 1);
 
 /*
  * The sysfs interface to check additional details of last test
--- a/drivers/scsi/esas2r/esas2r_ioctl.c
+++ b/drivers/scsi/esas2r/esas2r_ioctl.c
@@ -56,7 +56,7 @@ dma_addr_t esas2r_buffered_ioctl_addr;
 u32 esas2r_buffered_ioctl_size;
 struct pci_dev *esas2r_buffered_ioctl_pcid;
 
-static DEFINE_SEMAPHORE(buffered_ioctl_semaphore);
+static DEFINE_SEMAPHORE(buffered_ioctl_semaphore, 1);
 typedef int (*BUFFERED_IOCTL_CALLBACK)(struct esas2r_adapter *,
 				       struct esas2r_request *,
 				       struct esas2r_sg_context *,
--- a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c
+++ b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c
@@ -149,7 +149,7 @@ static char *g_fragments_base;
 static char *g_free_fragments;
 static struct semaphore g_free_fragments_sema;
 
-static DEFINE_SEMAPHORE(g_free_fragments_mutex);
+static DEFINE_SEMAPHORE(g_free_fragments_mutex, 1);
 
 static int
 vchiq_blocking_bulk_transfer(struct vchiq_instance *instance, unsigned int handle, void *data,
--- a/include/linux/semaphore.h
+++ b/include/linux/semaphore.h
@@ -25,8 +25,8 @@ struct semaphore {
 	.wait_list	= LIST_HEAD_INIT((name).wait_list),		\
 }
 
-#define DEFINE_SEMAPHORE(name)	\
-	struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
+#define DEFINE_SEMAPHORE(_name, _n)	\
+	struct semaphore _name = __SEMAPHORE_INITIALIZER(_name, _n)
 
 static inline void sema_init(struct semaphore *sem, int val)
 {
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -89,7 +89,7 @@ static DEFINE_MUTEX(console_mutex);
  * console_sem protects updates to console->seq and console_suspended,
  * and also provides serialization for console printing.
  */
-static DEFINE_SEMAPHORE(console_sem);
+static DEFINE_SEMAPHORE(console_sem, 1);
 HLIST_HEAD(console_list);
 EXPORT_SYMBOL_GPL(console_list);
 DEFINE_STATIC_SRCU(console_srcu);
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -40,10 +40,8 @@ const char *const rxrpc_call_completions
 
 struct kmem_cache *rxrpc_call_jar;
 
-static struct semaphore rxrpc_call_limiter =
-	__SEMAPHORE_INITIALIZER(rxrpc_call_limiter, 1000);
-static struct semaphore rxrpc_kernel_call_limiter =
-	__SEMAPHORE_INITIALIZER(rxrpc_kernel_call_limiter, 1000);
+static DEFINE_SEMAPHORE(rxrpc_call_limiter, 1000);
+static DEFINE_SEMAPHORE(rxrpc_kernel_call_limiter, 1000);
 
 void rxrpc_poke_call(struct rxrpc_call *call, enum rxrpc_call_poke_trace what)
 {

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-30 11:56           ` Peter Zijlstra
@ 2023-03-30 16:23             ` Linus Torvalds
  2023-03-31  3:42               ` Sergey Senozhatsky
  2023-03-31  3:45               ` Luis Chamberlain
  0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2023-03-30 16:23 UTC (permalink / raw)
  To: Peter Zijlstra, Petr Mladek, Sergey Senozhatsky
  Cc: Luis Chamberlain, david, patches, linux-modules, linux-mm,
	linux-kernel, petr.pavlu, prarit, gregkh, rafael,
	christophe.leroy, tglx, song, rppt, willy, vbabka, mhocko,
	dave.hansen

On Thu, Mar 30, 2023 at 4:56 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Sure thing; still completely untested...

Seems obvious enough.

Looking at the people who use a semaphore as a mutex, one core user stands out:

>  kernel/printk/printk.c                                        |    2 +-

.. and I'm not entirely sure why that uses a semaphore. It may be
*entirely* legacy, and should just be changed to be a mutex.

But it may also be that the 'console_sem' has some subtle reason why
it wants to be a semaphore, and why it then plays games with lockdep
(which doesn't support counting semaphores) and does things like

  #define down_console_sem() do { \
        down(&console_sem);\
        mutex_acquire(&console_lock_dep_map, 0, 0, _RET_IP_);\
  } while (0)

Anyway, I think your patch is obviously safe as-is, and I think it's
long overdue to make it clear that the only real reason to use
semaphores rather than mutexes is if you do need the counting thing.

Of course, there is the thing about lockdep, and also about how
semaphores these days have no architecture-specific parts, so if
anybody wants to play deep games with their locking, that may be a
reason for using them.

Although we also do have some other issues - I think down_trylock() is
ok in irq contexts, but mutex_trylock() is not. Maybe that's why
printk uses semaphores? I forget.

                Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-30 16:23             ` Linus Torvalds
@ 2023-03-31  3:42               ` Sergey Senozhatsky
  2023-03-31  8:05                 ` Petr Mladek
  2023-03-31  3:45               ` Luis Chamberlain
  1 sibling, 1 reply; 26+ messages in thread
From: Sergey Senozhatsky @ 2023-03-31  3:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Petr Mladek, Sergey Senozhatsky,
	Luis Chamberlain, david, patches, linux-modules, linux-mm,
	linux-kernel, petr.pavlu, prarit, gregkh, rafael,
	christophe.leroy, tglx, song, rppt, willy, vbabka, mhocko,
	dave.hansen

On (23/03/30 09:23), Linus Torvalds wrote:
> Although we also do have some other issues - I think down_trylock() is
> ok in irq contexts, but mutex_trylock() is not. Maybe that's why
> printk uses semaphores? I forget.

Yes, correct. IIRC we also cannot safely call mutex_unlock() from IRQ
context because it takes some internal mutex spin_lock in a non-IRQ-safe
manner. Semaphore is OK in this regard, both semaphore try_lock() and
unlock() can be called from IRQ.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-30 16:23             ` Linus Torvalds
  2023-03-31  3:42               ` Sergey Senozhatsky
@ 2023-03-31  3:45               ` Luis Chamberlain
  2023-03-31  4:06                 ` Matthew Wilcox
  2023-03-31  4:11                 ` Sergey Senozhatsky
  1 sibling, 2 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-31  3:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Petr Mladek, Sergey Senozhatsky, david, patches,
	linux-modules, linux-mm, linux-kernel, petr.pavlu, prarit,
	gregkh, rafael, christophe.leroy, tglx, song, rppt, willy,
	vbabka, mhocko, dave.hansen

On Thu, Mar 30, 2023 at 09:23:54AM -0700, Linus Torvalds wrote:
> On Thu, Mar 30, 2023 at 4:56 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Sure thing; still completely untested...
> 
> Seems obvious enough.
> 
> Looking at the people who use a semaphore as a mutex, one core user stands out:
> 
> >  kernel/printk/printk.c                                        |    2 +-
> 
> .. and I'm not entirely sure why that uses a semaphore. It may be
> *entirely* legacy, and should just be changed to be a mutex.
> 
> But it may also be that the 'console_sem' has some subtle reason why
> it wants to be a semaphore, and why it then plays games with lockdep
> (which doesn't support counting semaphores) and does things like
> 
>   #define down_console_sem() do { \
>         down(&console_sem);\
>         mutex_acquire(&console_lock_dep_map, 0, 0, _RET_IP_);\
>   } while (0)
> 
> Anyway, I think your patch is obviously safe as-is, and I think it's
> long overdue to make it clear that the only real reason to use
> semaphores rather than mutexes is if you do need the counting thing.
> 
> Of course, there is the thing about lockdep, and also about how
> semaphores these days have no architecture-specific parts, so if
> anybody wants to play deep games with their locking, that may be a
> reason for using them.
> 
> Although we also do have some other issues - I think down_trylock() is
> ok in irq contexts, but mutex_trylock() is not. Maybe that's why
> printk uses semaphores? I forget.

It sounds like perhaps we should take our time with the conversion and
let each component review / decide and get things tested just in case.
kmod can just open code the __SEMAPHORE_INITIALIZER(name, n) for now
and once we get all conversions we replace it with the counter one
in semaphore.h.

I'll drop this and the kmod stuff from this series as separate effort,
no point to hold up the other stuff part of this patchset.

The coccinelle approach to this is as follows, it goes build tested
with allmodconfig and boot tested on x86_64 without issues.

You'd use:

make coccicheck MODE=patch SPFLAGS="--in-place --no-show-diff" COCCI=./sema-binary-to-mutex.cocci

$ cat sema-binary-to-mutex.cocci

virtual patch

@ binary_sema_found depends on patch @
identifier sema;
declarer name DEFINE_SEMAPHORE;
declarer name DEFINE_MUTEX;
@@

-DEFINE_SEMAPHORE(sema);
+DEFINE_MUTEX(sema);

@ sema_down_replace_with_mutex_lock depends on binary_sema_found @
identifier binary_sema_found.sema;
@@

-down(&sema)
+mutex_lock(&sema)

@ sema_up_replace_with_mutex_unlock depends on binary_sema_found @
identifier binary_sema_found.sema;
@@

-up(&sema)
+mutex_unlock(&sema)

@ sema_mutex_lock_interruptible depends on binary_sema_found @
identifier binary_sema_found.sema;
@@

-down_interruptible(&sema)
+mutex_lock_interruptible(&sema)

@ sema_mutex_trylock depends on binary_sema_found @
identifier binary_sema_found.sema;
@@

-down_trylock(&sema)
+!mutex_trylock(&sema)

This produces the following patch, I'll send each separately next.

 arch/mips/cavium-octeon/setup.c               |  2 +-
 arch/x86/kernel/cpu/intel.c                   |  6 +-
 drivers/firmware/efi/runtime-wrappers.c       | 58 +++++++++----------
 drivers/firmware/efi/vars.c                   | 20 +++----
 drivers/macintosh/adb.c                       | 18 +++---
 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c  | 28 ++++-----
 drivers/platform/x86/intel/ifs/sysfs.c        | 10 ++--
 drivers/scsi/esas2r/esas2r_ioctl.c            |  6 +-
 .../interface/vchiq_arm/vchiq_arm.c           | 10 ++--
 kernel/printk/printk.c                        |  8 +--
 10 files changed, 83 insertions(+), 83 deletions(-)

diff --git a/arch/mips/cavium-octeon/setup.c b/arch/mips/cavium-octeon/setup.c
index a71727f7a608..30c286c5ac9c 100644
--- a/arch/mips/cavium-octeon/setup.c
+++ b/arch/mips/cavium-octeon/setup.c
@@ -72,7 +72,7 @@ extern void pci_console_init(const char *arg);
 static unsigned long long max_memory = ULLONG_MAX;
 static unsigned long long reserve_low_mem;
 
-DEFINE_SEMAPHORE(octeon_bootbus_sem);
+DEFINE_MUTEX(octeon_bootbus_sem);
 EXPORT_SYMBOL(octeon_bootbus_sem);
 
 static struct octeon_boot_descriptor *octeon_boot_desc_ptr;
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 291d4167fab8..00c9fcd90e1a 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1177,7 +1177,7 @@ static const struct {
 static struct ratelimit_state bld_ratelimit;
 
 static unsigned int sysctl_sld_mitigate = 1;
-static DEFINE_SEMAPHORE(buslock_sem);
+static DEFINE_MUTEX(buslock_sem);
 
 #ifdef CONFIG_PROC_SYSCTL
 static struct ctl_table sld_sysctls[] = {
@@ -1315,7 +1315,7 @@ static void split_lock_init(void)
 static void __split_lock_reenable_unlock(struct work_struct *work)
 {
 	sld_update_msr(true);
-	up(&buslock_sem);
+	mutex_unlock(&buslock_sem);
 }
 
 static DECLARE_DELAYED_WORK(sl_reenable_unlock, __split_lock_reenable_unlock);
@@ -1364,7 +1364,7 @@ static void split_lock_warn(unsigned long ip)
 		 * Misery factor #2:
 		 * only allow one buslocked disabled core at a time.
 		 */
-		if (down_interruptible(&buslock_sem) == -EINTR)
+		if (mutex_lock_interruptible(&buslock_sem) == -EINTR)
 			return;
 		work = &sl_reenable_unlock;
 	} else {
diff --git a/drivers/firmware/efi/runtime-wrappers.c b/drivers/firmware/efi/runtime-wrappers.c
index 1fba4e09cdcf..4826e2a2f193 100644
--- a/drivers/firmware/efi/runtime-wrappers.c
+++ b/drivers/firmware/efi/runtime-wrappers.c
@@ -158,7 +158,7 @@ void efi_call_virt_check_flags(unsigned long flags, const char *call)
  * none of the remaining functions are actually ever called at runtime.
  * So let's just use a single lock to serialize all Runtime Services calls.
  */
-static DEFINE_SEMAPHORE(efi_runtime_lock);
+static DEFINE_MUTEX(efi_runtime_lock);
 
 /*
  * Expose the EFI runtime lock to the UV platform
@@ -254,10 +254,10 @@ static efi_status_t virt_efi_get_time(efi_time_t *tm, efi_time_cap_t *tc)
 {
 	efi_status_t status;
 
-	if (down_interruptible(&efi_runtime_lock))
+	if (mutex_lock_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
 	status = efi_queue_work(EFI_GET_TIME, tm, tc, NULL, NULL, NULL);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
@@ -265,10 +265,10 @@ static efi_status_t virt_efi_set_time(efi_time_t *tm)
 {
 	efi_status_t status;
 
-	if (down_interruptible(&efi_runtime_lock))
+	if (mutex_lock_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
 	status = efi_queue_work(EFI_SET_TIME, tm, NULL, NULL, NULL, NULL);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
@@ -278,11 +278,11 @@ static efi_status_t virt_efi_get_wakeup_time(efi_bool_t *enabled,
 {
 	efi_status_t status;
 
-	if (down_interruptible(&efi_runtime_lock))
+	if (mutex_lock_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
 	status = efi_queue_work(EFI_GET_WAKEUP_TIME, enabled, pending, tm, NULL,
 				NULL);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
@@ -290,11 +290,11 @@ static efi_status_t virt_efi_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
 {
 	efi_status_t status;
 
-	if (down_interruptible(&efi_runtime_lock))
+	if (mutex_lock_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
 	status = efi_queue_work(EFI_SET_WAKEUP_TIME, &enabled, tm, NULL, NULL,
 				NULL);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
@@ -306,11 +306,11 @@ static efi_status_t virt_efi_get_variable(efi_char16_t *name,
 {
 	efi_status_t status;
 
-	if (down_interruptible(&efi_runtime_lock))
+	if (mutex_lock_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
 	status = efi_queue_work(EFI_GET_VARIABLE, name, vendor, attr, data_size,
 				data);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
@@ -320,11 +320,11 @@ static efi_status_t virt_efi_get_next_variable(unsigned long *name_size,
 {
 	efi_status_t status;
 
-	if (down_interruptible(&efi_runtime_lock))
+	if (mutex_lock_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
 	status = efi_queue_work(EFI_GET_NEXT_VARIABLE, name_size, name, vendor,
 				NULL, NULL);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
@@ -336,11 +336,11 @@ static efi_status_t virt_efi_set_variable(efi_char16_t *name,
 {
 	efi_status_t status;
 
-	if (down_interruptible(&efi_runtime_lock))
+	if (mutex_lock_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
 	status = efi_queue_work(EFI_SET_VARIABLE, name, vendor, &attr, &data_size,
 				data);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
@@ -351,12 +351,12 @@ virt_efi_set_variable_nonblocking(efi_char16_t *name, efi_guid_t *vendor,
 {
 	efi_status_t status;
 
-	if (down_trylock(&efi_runtime_lock))
+	if (!mutex_trylock(&efi_runtime_lock))
 		return EFI_NOT_READY;
 
 	status = efi_call_virt(set_variable, name, vendor, attr, data_size,
 			       data);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
@@ -371,11 +371,11 @@ static efi_status_t virt_efi_query_variable_info(u32 attr,
 	if (efi.runtime_version < EFI_2_00_SYSTEM_TABLE_REVISION)
 		return EFI_UNSUPPORTED;
 
-	if (down_interruptible(&efi_runtime_lock))
+	if (mutex_lock_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
 	status = efi_queue_work(EFI_QUERY_VARIABLE_INFO, &attr, storage_space,
 				remaining_space, max_variable_size, NULL);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
@@ -390,12 +390,12 @@ virt_efi_query_variable_info_nonblocking(u32 attr,
 	if (efi.runtime_version < EFI_2_00_SYSTEM_TABLE_REVISION)
 		return EFI_UNSUPPORTED;
 
-	if (down_trylock(&efi_runtime_lock))
+	if (!mutex_trylock(&efi_runtime_lock))
 		return EFI_NOT_READY;
 
 	status = efi_call_virt(query_variable_info, attr, storage_space,
 			       remaining_space, max_variable_size);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
@@ -403,11 +403,11 @@ static efi_status_t virt_efi_get_next_high_mono_count(u32 *count)
 {
 	efi_status_t status;
 
-	if (down_interruptible(&efi_runtime_lock))
+	if (mutex_lock_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
 	status = efi_queue_work(EFI_GET_NEXT_HIGH_MONO_COUNT, count, NULL, NULL,
 				NULL, NULL);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
@@ -416,14 +416,14 @@ static void virt_efi_reset_system(int reset_type,
 				  unsigned long data_size,
 				  efi_char16_t *data)
 {
-	if (down_trylock(&efi_runtime_lock)) {
+	if (!mutex_trylock(&efi_runtime_lock)) {
 		pr_warn("failed to invoke the reset_system() runtime service:\n"
 			"could not get exclusive access to the firmware\n");
 		return;
 	}
 	efi_rts_work.efi_rts_id = EFI_RESET_SYSTEM;
 	__efi_call_virt(reset_system, reset_type, status, data_size, data);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 }
 
 static efi_status_t virt_efi_update_capsule(efi_capsule_header_t **capsules,
@@ -435,11 +435,11 @@ static efi_status_t virt_efi_update_capsule(efi_capsule_header_t **capsules,
 	if (efi.runtime_version < EFI_2_00_SYSTEM_TABLE_REVISION)
 		return EFI_UNSUPPORTED;
 
-	if (down_interruptible(&efi_runtime_lock))
+	if (mutex_lock_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
 	status = efi_queue_work(EFI_UPDATE_CAPSULE, capsules, &count, &sg_list,
 				NULL, NULL);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
@@ -453,11 +453,11 @@ static efi_status_t virt_efi_query_capsule_caps(efi_capsule_header_t **capsules,
 	if (efi.runtime_version < EFI_2_00_SYSTEM_TABLE_REVISION)
 		return EFI_UNSUPPORTED;
 
-	if (down_interruptible(&efi_runtime_lock))
+	if (mutex_lock_interruptible(&efi_runtime_lock))
 		return EFI_ABORTED;
 	status = efi_queue_work(EFI_QUERY_CAPSULE_CAPS, capsules, &count,
 				max_size, reset_type, NULL);
-	up(&efi_runtime_lock);
+	mutex_unlock(&efi_runtime_lock);
 	return status;
 }
 
diff --git a/drivers/firmware/efi/vars.c b/drivers/firmware/efi/vars.c
index bd75b87f5fc1..4a7e4cb8a04a 100644
--- a/drivers/firmware/efi/vars.c
+++ b/drivers/firmware/efi/vars.c
@@ -21,7 +21,7 @@
 /* Private pointer to registered efivars */
 static struct efivars *__efivars;
 
-static DEFINE_SEMAPHORE(efivars_lock);
+static DEFINE_MUTEX(efivars_lock);
 
 static efi_status_t check_var_size(bool nonblocking, u32 attributes,
 				   unsigned long size)
@@ -64,7 +64,7 @@ int efivars_register(struct efivars *efivars,
 {
 	int rv;
 
-	if (down_interruptible(&efivars_lock))
+	if (mutex_lock_interruptible(&efivars_lock))
 		return -EINTR;
 
 	if (__efivars) {
@@ -80,7 +80,7 @@ int efivars_register(struct efivars *efivars,
 	pr_info("Registered efivars operations\n");
 	rv = 0;
 out:
-	up(&efivars_lock);
+	mutex_unlock(&efivars_lock);
 
 	return rv;
 }
@@ -97,7 +97,7 @@ int efivars_unregister(struct efivars *efivars)
 {
 	int rv;
 
-	if (down_interruptible(&efivars_lock))
+	if (mutex_lock_interruptible(&efivars_lock))
 		return -EINTR;
 
 	if (!__efivars) {
@@ -116,7 +116,7 @@ int efivars_unregister(struct efivars *efivars)
 
 	rv = 0;
 out:
-	up(&efivars_lock);
+	mutex_unlock(&efivars_lock);
 	return rv;
 }
 EXPORT_SYMBOL_GPL(efivars_unregister);
@@ -133,10 +133,10 @@ EXPORT_SYMBOL_GPL(efivar_supports_writes);
  */
 int efivar_lock(void)
 {
-	if (down_interruptible(&efivars_lock))
+	if (mutex_lock_interruptible(&efivars_lock))
 		return -EINTR;
 	if (!__efivars->ops) {
-		up(&efivars_lock);
+		mutex_unlock(&efivars_lock);
 		return -ENODEV;
 	}
 	return 0;
@@ -149,10 +149,10 @@ EXPORT_SYMBOL_NS_GPL(efivar_lock, EFIVAR);
  */
 int efivar_trylock(void)
 {
-	if (down_trylock(&efivars_lock))
+	if (!mutex_trylock(&efivars_lock))
 		 return -EBUSY;
 	if (!__efivars->ops) {
-		up(&efivars_lock);
+		mutex_unlock(&efivars_lock);
 		return -ENODEV;
 	}
 	return 0;
@@ -164,7 +164,7 @@ EXPORT_SYMBOL_NS_GPL(efivar_trylock, EFIVAR);
  */
 void efivar_unlock(void)
 {
-	up(&efivars_lock);
+	mutex_unlock(&efivars_lock);
 }
 EXPORT_SYMBOL_NS_GPL(efivar_unlock, EFIVAR);
 
diff --git a/drivers/macintosh/adb.c b/drivers/macintosh/adb.c
index 23bd0c77ac1a..8553da83ac77 100644
--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -80,7 +80,7 @@ static struct adb_driver *adb_controller;
 BLOCKING_NOTIFIER_HEAD(adb_client_list);
 static int adb_got_sleep;
 static int adb_inited;
-static DEFINE_SEMAPHORE(adb_probe_mutex);
+static DEFINE_MUTEX(adb_probe_mutex);
 static int sleepy_trackpad;
 static int autopoll_devs;
 int __adb_probe_sync;
@@ -228,7 +228,7 @@ adb_probe_task(void *x)
 	do_adb_reset_bus();
 	pr_debug("adb: finished probe task...\n");
 
-	up(&adb_probe_mutex);
+	mutex_unlock(&adb_probe_mutex);
 
 	return 0;
 }
@@ -249,7 +249,7 @@ adb_reset_bus(void)
 		return 0;
 	}
 
-	down(&adb_probe_mutex);
+	mutex_lock(&adb_probe_mutex);
 	schedule_work(&adb_reset_work);
 	return 0;
 }
@@ -262,7 +262,7 @@ static int __adb_suspend(struct platform_device *dev, pm_message_t state)
 {
 	adb_got_sleep = 1;
 	/* We need to get a lock on the probe thread */
-	down(&adb_probe_mutex);
+	mutex_lock(&adb_probe_mutex);
 	/* Stop autopoll */
 	if (adb_controller->autopoll)
 		adb_controller->autopoll(0);
@@ -292,7 +292,7 @@ static int adb_poweroff(struct device *dev)
 static int __adb_resume(struct platform_device *dev)
 {
 	adb_got_sleep = 0;
-	up(&adb_probe_mutex);
+	mutex_unlock(&adb_probe_mutex);
 	adb_reset_bus();
 
 	return 0;
@@ -798,7 +798,7 @@ static ssize_t adb_write(struct file *file, const char __user *buf,
 	atomic_inc(&state->n_pending);
 
 	/* If a probe is in progress or we are sleeping, wait for it to complete */
-	down(&adb_probe_mutex);
+	mutex_lock(&adb_probe_mutex);
 
 	/* Queries are special requests sent to the ADB driver itself */
 	if (req->data[0] == ADB_QUERY) {
@@ -806,14 +806,14 @@ static ssize_t adb_write(struct file *file, const char __user *buf,
 			ret = do_adb_query(req);
 		else
 			ret = -EINVAL;
-		up(&adb_probe_mutex);
+		mutex_unlock(&adb_probe_mutex);
 	}
 	/* Special case for ADB_BUSRESET request, all others are sent to
 	   the controller */
 	else if ((req->data[0] == ADB_PACKET) && (count > 1)
 		&& (req->data[1] == ADB_BUSRESET)) {
 		ret = do_adb_reset_bus();
-		up(&adb_probe_mutex);
+		mutex_unlock(&adb_probe_mutex);
 		atomic_dec(&state->n_pending);
 		if (ret == 0)
 			ret = count;
@@ -824,7 +824,7 @@ static ssize_t adb_write(struct file *file, const char __user *buf,
 			ret = adb_controller->send_request(req, 0);
 		else
 			ret = -ENXIO;
-		up(&adb_probe_mutex);
+		mutex_unlock(&adb_probe_mutex);
 	}
 
 	if (ret != 0) {
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 5d1e4fe335aa..f90763aa93b8 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -298,7 +298,7 @@ const u32 dmae_reg_go_c[] = {
 
 /* Global resources for unloading a previously loaded device */
 #define BNX2X_PREV_WAIT_NEEDED 1
-static DEFINE_SEMAPHORE(bnx2x_prev_sem);
+static DEFINE_MUTEX(bnx2x_prev_sem);
 static LIST_HEAD(bnx2x_prev_list);
 
 /* Forward declaration */
@@ -10569,7 +10569,7 @@ static int bnx2x_prev_path_mark_eeh(struct bnx2x *bp)
 	struct bnx2x_prev_path_list *tmp_list;
 	int rc;
 
-	rc = down_interruptible(&bnx2x_prev_sem);
+	rc = mutex_lock_interruptible(&bnx2x_prev_sem);
 	if (rc) {
 		BNX2X_ERR("Received %d when tried to take lock\n", rc);
 		return rc;
@@ -10584,7 +10584,7 @@ static int bnx2x_prev_path_mark_eeh(struct bnx2x *bp)
 			  BP_PATH(bp));
 	}
 
-	up(&bnx2x_prev_sem);
+	mutex_unlock(&bnx2x_prev_sem);
 
 	return rc;
 }
@@ -10594,7 +10594,7 @@ static bool bnx2x_prev_is_path_marked(struct bnx2x *bp)
 	struct bnx2x_prev_path_list *tmp_list;
 	bool rc = false;
 
-	if (down_trylock(&bnx2x_prev_sem))
+	if (!mutex_trylock(&bnx2x_prev_sem))
 		return false;
 
 	tmp_list = bnx2x_prev_path_get_entry(bp);
@@ -10609,7 +10609,7 @@ static bool bnx2x_prev_is_path_marked(struct bnx2x *bp)
 		}
 	}
 
-	up(&bnx2x_prev_sem);
+	mutex_unlock(&bnx2x_prev_sem);
 
 	return rc;
 }
@@ -10619,12 +10619,12 @@ bool bnx2x_port_after_undi(struct bnx2x *bp)
 	struct bnx2x_prev_path_list *entry;
 	bool val;
 
-	down(&bnx2x_prev_sem);
+	mutex_lock(&bnx2x_prev_sem);
 
 	entry = bnx2x_prev_path_get_entry(bp);
 	val = !!(entry && (entry->undi & (1 << BP_PORT(bp))));
 
-	up(&bnx2x_prev_sem);
+	mutex_unlock(&bnx2x_prev_sem);
 
 	return val;
 }
@@ -10634,7 +10634,7 @@ static int bnx2x_prev_mark_path(struct bnx2x *bp, bool after_undi)
 	struct bnx2x_prev_path_list *tmp_list;
 	int rc;
 
-	rc = down_interruptible(&bnx2x_prev_sem);
+	rc = mutex_lock_interruptible(&bnx2x_prev_sem);
 	if (rc) {
 		BNX2X_ERR("Received %d when tried to take lock\n", rc);
 		return rc;
@@ -10650,10 +10650,10 @@ static int bnx2x_prev_mark_path(struct bnx2x *bp, bool after_undi)
 			   BP_PATH(bp));
 			tmp_list->aer = 0;
 		}
-		up(&bnx2x_prev_sem);
+		mutex_unlock(&bnx2x_prev_sem);
 		return 0;
 	}
-	up(&bnx2x_prev_sem);
+	mutex_unlock(&bnx2x_prev_sem);
 
 	/* Create an entry for this path and add it */
 	tmp_list = kmalloc(sizeof(struct bnx2x_prev_path_list), GFP_KERNEL);
@@ -10668,7 +10668,7 @@ static int bnx2x_prev_mark_path(struct bnx2x *bp, bool after_undi)
 	tmp_list->aer = 0;
 	tmp_list->undi = after_undi ? (1 << BP_PORT(bp)) : 0;
 
-	rc = down_interruptible(&bnx2x_prev_sem);
+	rc = mutex_lock_interruptible(&bnx2x_prev_sem);
 	if (rc) {
 		BNX2X_ERR("Received %d when tried to take lock\n", rc);
 		kfree(tmp_list);
@@ -10676,7 +10676,7 @@ static int bnx2x_prev_mark_path(struct bnx2x *bp, bool after_undi)
 		DP(NETIF_MSG_HW, "Marked path [%d] - finished previous unload\n",
 		   BP_PATH(bp));
 		list_add(&tmp_list->list, &bnx2x_prev_list);
-		up(&bnx2x_prev_sem);
+		mutex_unlock(&bnx2x_prev_sem);
 	}
 
 	return rc;
@@ -10893,7 +10893,7 @@ static int bnx2x_prev_unload(struct bnx2x *bp)
 			break;
 		}
 
-		rc = down_interruptible(&bnx2x_prev_sem);
+		rc = mutex_lock_interruptible(&bnx2x_prev_sem);
 		if (rc) {
 			BNX2X_ERR("Cannot check for AER; Received %d when tried to take lock\n",
 				  rc);
@@ -10901,7 +10901,7 @@ static int bnx2x_prev_unload(struct bnx2x *bp)
 			/* If Path is marked by EEH, ignore unload status */
 			aer = !!(bnx2x_prev_path_get_entry(bp) &&
 				 bnx2x_prev_path_get_entry(bp)->aer);
-			up(&bnx2x_prev_sem);
+			mutex_unlock(&bnx2x_prev_sem);
 		}
 
 		if (fw == FW_MSG_CODE_DRV_UNLOAD_COMMON || aer) {
diff --git a/drivers/platform/x86/intel/ifs/sysfs.c b/drivers/platform/x86/intel/ifs/sysfs.c
index ee636a76b083..34001b2885cd 100644
--- a/drivers/platform/x86/intel/ifs/sysfs.c
+++ b/drivers/platform/x86/intel/ifs/sysfs.c
@@ -13,7 +13,7 @@
  * Protects against simultaneous tests on multiple cores, or
  * reloading can file while a test is in progress
  */
-static DEFINE_SEMAPHORE(ifs_sem);
+static DEFINE_MUTEX(ifs_sem);
 
 /*
  * The sysfs interface to check additional details of last test
@@ -72,7 +72,7 @@ static ssize_t run_test_store(struct device *dev,
 	if (rc < 0 || cpu >= nr_cpu_ids)
 		return -EINVAL;
 
-	if (down_interruptible(&ifs_sem))
+	if (mutex_lock_interruptible(&ifs_sem))
 		return -EINTR;
 
 	if (!ifsd->loaded)
@@ -80,7 +80,7 @@ static ssize_t run_test_store(struct device *dev,
 	else
 		rc = do_core_test(cpu, dev);
 
-	up(&ifs_sem);
+	mutex_unlock(&ifs_sem);
 
 	return rc ? rc : count;
 }
@@ -99,14 +99,14 @@ static ssize_t current_batch_store(struct device *dev,
 	if (rc < 0 || cur_batch > 0xff)
 		return -EINVAL;
 
-	if (down_interruptible(&ifs_sem))
+	if (mutex_lock_interruptible(&ifs_sem))
 		return -EINTR;
 
 	ifsd->cur_batch = cur_batch;
 
 	rc = ifs_load_firmware(dev);
 
-	up(&ifs_sem);
+	mutex_unlock(&ifs_sem);
 
 	return (rc == 0) ? count : rc;
 }
diff --git a/drivers/scsi/esas2r/esas2r_ioctl.c b/drivers/scsi/esas2r/esas2r_ioctl.c
index e003d923acbf..b935fc076fc8 100644
--- a/drivers/scsi/esas2r/esas2r_ioctl.c
+++ b/drivers/scsi/esas2r/esas2r_ioctl.c
@@ -56,7 +56,7 @@ dma_addr_t esas2r_buffered_ioctl_addr;
 u32 esas2r_buffered_ioctl_size;
 struct pci_dev *esas2r_buffered_ioctl_pcid;
 
-static DEFINE_SEMAPHORE(buffered_ioctl_semaphore);
+static DEFINE_MUTEX(buffered_ioctl_semaphore);
 typedef int (*BUFFERED_IOCTL_CALLBACK)(struct esas2r_adapter *,
 				       struct esas2r_request *,
 				       struct esas2r_sg_context *,
@@ -209,7 +209,7 @@ static u8 handle_buffered_ioctl(struct esas2r_buffered_ioctl *bi)
 	struct esas2r_sg_context sgc;
 	u8 result = IOCTL_SUCCESS;
 
-	if (down_interruptible(&buffered_ioctl_semaphore))
+	if (mutex_lock_interruptible(&buffered_ioctl_semaphore))
 		return IOCTL_OUT_OF_RESOURCES;
 
 	/* allocate a buffer or use the existing buffer. */
@@ -285,7 +285,7 @@ static u8 handle_buffered_ioctl(struct esas2r_buffered_ioctl *bi)
 	if (result == IOCTL_SUCCESS)
 		memcpy(bi->ioctl, esas2r_buffered_ioctl, bi->length);
 
-	up(&buffered_ioctl_semaphore);
+	mutex_unlock(&buffered_ioctl_semaphore);
 	return result;
 }
 
diff --git a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c
index cddcd3c596c9..dde211f08145 100644
--- a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c
+++ b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c
@@ -149,7 +149,7 @@ static char *g_fragments_base;
 static char *g_free_fragments;
 static struct semaphore g_free_fragments_sema;
 
-static DEFINE_SEMAPHORE(g_free_fragments_mutex);
+static DEFINE_MUTEX(g_free_fragments_mutex);
 
 static int
 vchiq_blocking_bulk_transfer(struct vchiq_instance *instance, unsigned int handle, void *data,
@@ -383,11 +383,11 @@ create_pagelist(struct vchiq_instance *instance, char *buf, char __user *ubuf,
 
 		WARN_ON(!g_free_fragments);
 
-		down(&g_free_fragments_mutex);
+		mutex_lock(&g_free_fragments_mutex);
 		fragments = g_free_fragments;
 		WARN_ON(!fragments);
 		g_free_fragments = *(char **)g_free_fragments;
-		up(&g_free_fragments_mutex);
+		mutex_unlock(&g_free_fragments_mutex);
 		pagelist->type = PAGELIST_READ_WITH_FRAGMENTS +
 			(fragments - g_fragments_base) / g_fragments_size;
 	}
@@ -443,10 +443,10 @@ free_pagelist(struct vchiq_instance *instance, struct vchiq_pagelist_info *pagel
 				fragments + g_cache_line_size,
 				tail_bytes);
 
-		down(&g_free_fragments_mutex);
+		mutex_lock(&g_free_fragments_mutex);
 		*(char **)fragments = g_free_fragments;
 		g_free_fragments = fragments;
-		up(&g_free_fragments_mutex);
+		mutex_unlock(&g_free_fragments_mutex);
 		up(&g_free_fragments_sema);
 	}
 
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index fd0c9f913940..a896461427a0 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -89,7 +89,7 @@ static DEFINE_MUTEX(console_mutex);
  * console_sem protects updates to console->seq and console_suspended,
  * and also provides serialization for console printing.
  */
-static DEFINE_SEMAPHORE(console_sem);
+static DEFINE_MUTEX(console_sem);
 HLIST_HEAD(console_list);
 EXPORT_SYMBOL_GPL(console_list);
 DEFINE_STATIC_SRCU(console_srcu);
@@ -309,7 +309,7 @@ EXPORT_SYMBOL(console_srcu_read_unlock);
  * macros instead of functions so that _RET_IP_ contains useful information.
  */
 #define down_console_sem() do { \
-	down(&console_sem);\
+	mutex_lock(&console_sem);\
 	mutex_acquire(&console_lock_dep_map, 0, 0, _RET_IP_);\
 } while (0)
 
@@ -324,7 +324,7 @@ static int __down_trylock_console_sem(unsigned long ip)
 	 * deadlock in printk()->down_trylock_console_sem() otherwise.
 	 */
 	printk_safe_enter_irqsave(flags);
-	lock_failed = down_trylock(&console_sem);
+	lock_failed = !mutex_trylock(&console_sem);
 	printk_safe_exit_irqrestore(flags);
 
 	if (lock_failed)
@@ -341,7 +341,7 @@ static void __up_console_sem(unsigned long ip)
 	mutex_release(&console_lock_dep_map, ip);
 
 	printk_safe_enter_irqsave(flags);
-	up(&console_sem);
+	mutex_unlock(&console_sem);
 	printk_safe_exit_irqrestore(flags);
 }
 #define up_console_sem() __up_console_sem(_RET_IP_)
-- 
2.39.2

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-31  3:45               ` Luis Chamberlain
@ 2023-03-31  4:06                 ` Matthew Wilcox
  2023-03-31  4:13                   ` Luis Chamberlain
  2023-03-31  4:11                 ` Sergey Senozhatsky
  1 sibling, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2023-03-31  4:06 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Linus Torvalds, Peter Zijlstra, Petr Mladek, Sergey Senozhatsky,
	david, patches, linux-modules, linux-mm, linux-kernel,
	petr.pavlu, prarit, gregkh, rafael, christophe.leroy, tglx, song,
	rppt, vbabka, mhocko, dave.hansen

On Thu, Mar 30, 2023 at 08:45:52PM -0700, Luis Chamberlain wrote:
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 291d4167fab8..00c9fcd90e1a 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -1177,7 +1177,7 @@ static const struct {
>  static struct ratelimit_state bld_ratelimit;
>  
>  static unsigned int sysctl_sld_mitigate = 1;
> -static DEFINE_SEMAPHORE(buslock_sem);
> +static DEFINE_MUTEX(buslock_sem);
>  
>  #ifdef CONFIG_PROC_SYSCTL
>  static struct ctl_table sld_sysctls[] = {
> @@ -1315,7 +1315,7 @@ static void split_lock_init(void)
>  static void __split_lock_reenable_unlock(struct work_struct *work)
>  {
>  	sld_update_msr(true);
> -	up(&buslock_sem);
> +	mutex_unlock(&buslock_sem);
>  }
>  
>  static DECLARE_DELAYED_WORK(sl_reenable_unlock, __split_lock_reenable_unlock);

^^^ clearly unsafe.  __split_lock_reenable_unlock() is called as a
delayed_work(), ie not in the context of the mutex locker.  lockdep
will freak out at this.

> @@ -351,12 +351,12 @@ virt_efi_set_variable_nonblocking(efi_char16_t *name, efi_guid_t *vendor,
>  {
>  	efi_status_t status;
>  
> -	if (down_trylock(&efi_runtime_lock))
> +	if (!mutex_trylock(&efi_runtime_lock))
>  		return EFI_NOT_READY;

looks to me like this can be called while we're oopsing.  if that's in
non-process context, lockdep will get angry.

> @@ -149,10 +149,10 @@ EXPORT_SYMBOL_NS_GPL(efivar_lock, EFIVAR);
>   */
>  int efivar_trylock(void)
>  {
> -	if (down_trylock(&efivars_lock))
> +	if (!mutex_trylock(&efivars_lock))

also can be called from oops context.

> @@ -228,7 +228,7 @@ adb_probe_task(void *x)
>  	do_adb_reset_bus();
>  	pr_debug("adb: finished probe task...\n");
>  
> -	up(&adb_probe_mutex);
> +	mutex_unlock(&adb_probe_mutex);

adb_probe_task() can be called from a different context than the lock
holder.

> @@ -10594,7 +10594,7 @@ static bool bnx2x_prev_is_path_marked(struct bnx2x *bp)
>  	struct bnx2x_prev_path_list *tmp_list;
>  	bool rc = false;
>  
> -	if (down_trylock(&bnx2x_prev_sem))
> +	if (!mutex_trylock(&bnx2x_prev_sem))

bet you this can be called from interrupt context.

this really isn't something to use coccinelle for.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-31  3:45               ` Luis Chamberlain
  2023-03-31  4:06                 ` Matthew Wilcox
@ 2023-03-31  4:11                 ` Sergey Senozhatsky
  1 sibling, 0 replies; 26+ messages in thread
From: Sergey Senozhatsky @ 2023-03-31  4:11 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Linus Torvalds, Peter Zijlstra, Petr Mladek, Sergey Senozhatsky,
	david, patches, linux-modules, linux-mm, linux-kernel,
	petr.pavlu, prarit, gregkh, rafael, christophe.leroy, tglx, song,
	rppt, willy, vbabka, mhocko, dave.hansen

On (23/03/30 20:45), Luis Chamberlain wrote:
[..]
> -static DEFINE_SEMAPHORE(console_sem);
> +static DEFINE_MUTEX(console_sem);
>  HLIST_HEAD(console_list);
>  EXPORT_SYMBOL_GPL(console_list);
>  DEFINE_STATIC_SRCU(console_srcu);
> @@ -309,7 +309,7 @@ EXPORT_SYMBOL(console_srcu_read_unlock);
>   * macros instead of functions so that _RET_IP_ contains useful information.
>   */
>  #define down_console_sem() do { \
> -	down(&console_sem);\
> +	mutex_lock(&console_sem);\
>  	mutex_acquire(&console_lock_dep_map, 0, 0, _RET_IP_);\
>  } while (0)
>  
> @@ -324,7 +324,7 @@ static int __down_trylock_console_sem(unsigned long ip)
>  	 * deadlock in printk()->down_trylock_console_sem() otherwise.
>  	 */
>  	printk_safe_enter_irqsave(flags);
> -	lock_failed = down_trylock(&console_sem);
> +	lock_failed = !mutex_trylock(&console_sem);
>  	printk_safe_exit_irqrestore(flags);
>  
>  	if (lock_failed)
> @@ -341,7 +341,7 @@ static void __up_console_sem(unsigned long ip)
>  	mutex_release(&console_lock_dep_map, ip);
>  
>  	printk_safe_enter_irqsave(flags);
> -	up(&console_sem);
> +	mutex_unlock(&console_sem);

mutex_unlock() does not like when its called from IRQ, so this is not
going to work very well.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-31  4:06                 ` Matthew Wilcox
@ 2023-03-31  4:13                   ` Luis Chamberlain
  0 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-31  4:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Peter Zijlstra, Petr Mladek, Sergey Senozhatsky,
	david, patches, linux-modules, linux-mm, linux-kernel,
	petr.pavlu, prarit, gregkh, rafael, christophe.leroy, tglx, song,
	rppt, vbabka, mhocko, dave.hansen

On Fri, Mar 31, 2023 at 05:06:17AM +0100, Matthew Wilcox wrote:
> On Thu, Mar 30, 2023 at 08:45:52PM -0700, Luis Chamberlain wrote:
> > diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> > index 291d4167fab8..00c9fcd90e1a 100644
> > --- a/arch/x86/kernel/cpu/intel.c
> > +++ b/arch/x86/kernel/cpu/intel.c
> > @@ -1177,7 +1177,7 @@ static const struct {
> >  static struct ratelimit_state bld_ratelimit;
> >  
> >  static unsigned int sysctl_sld_mitigate = 1;
> > -static DEFINE_SEMAPHORE(buslock_sem);
> > +static DEFINE_MUTEX(buslock_sem);
> >  
> >  #ifdef CONFIG_PROC_SYSCTL
> >  static struct ctl_table sld_sysctls[] = {
> > @@ -1315,7 +1315,7 @@ static void split_lock_init(void)
> >  static void __split_lock_reenable_unlock(struct work_struct *work)
> >  {
> >  	sld_update_msr(true);
> > -	up(&buslock_sem);
> > +	mutex_unlock(&buslock_sem);
> >  }
> >  
> >  static DECLARE_DELAYED_WORK(sl_reenable_unlock, __split_lock_reenable_unlock);
> 
> ^^^ clearly unsafe.  __split_lock_reenable_unlock() is called as a
> delayed_work(), ie not in the context of the mutex locker.  lockdep
> will freak out at this.
> 
> > @@ -351,12 +351,12 @@ virt_efi_set_variable_nonblocking(efi_char16_t *name, efi_guid_t *vendor,
> >  {
> >  	efi_status_t status;
> >  
> > -	if (down_trylock(&efi_runtime_lock))
> > +	if (!mutex_trylock(&efi_runtime_lock))
> >  		return EFI_NOT_READY;
> 
> looks to me like this can be called while we're oopsing.  if that's in
> non-process context, lockdep will get angry.
> 
> > @@ -149,10 +149,10 @@ EXPORT_SYMBOL_NS_GPL(efivar_lock, EFIVAR);
> >   */
> >  int efivar_trylock(void)
> >  {
> > -	if (down_trylock(&efivars_lock))
> > +	if (!mutex_trylock(&efivars_lock))
> 
> also can be called from oops context.
> 
> > @@ -228,7 +228,7 @@ adb_probe_task(void *x)
> >  	do_adb_reset_bus();
> >  	pr_debug("adb: finished probe task...\n");
> >  
> > -	up(&adb_probe_mutex);
> > +	mutex_unlock(&adb_probe_mutex);
> 
> adb_probe_task() can be called from a different context than the lock
> holder.
> 
> > @@ -10594,7 +10594,7 @@ static bool bnx2x_prev_is_path_marked(struct bnx2x *bp)
> >  	struct bnx2x_prev_path_list *tmp_list;
> >  	bool rc = false;
> >  
> > -	if (down_trylock(&bnx2x_prev_sem))
> > +	if (!mutex_trylock(&bnx2x_prev_sem))
> 
> bet you this can be called from interrupt context.
> 
> this really isn't something to use coccinelle for.

Coccinelle gives you what *would* happen, its up to us to review
if the conversion is correct. Thanks for the feedback, seems like
we're not going there for some users, contrary to what we expected.

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] sempahore: add a helper for a concurrency limiter
  2023-03-31  3:42               ` Sergey Senozhatsky
@ 2023-03-31  8:05                 ` Petr Mladek
  0 siblings, 0 replies; 26+ messages in thread
From: Petr Mladek @ 2023-03-31  8:05 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Linus Torvalds, Peter Zijlstra, Luis Chamberlain, david, patches,
	linux-modules, linux-mm, linux-kernel, petr.pavlu, prarit,
	gregkh, rafael, christophe.leroy, tglx, song, rppt, willy,
	vbabka, mhocko, dave.hansen

On Fri 2023-03-31 12:42:09, Sergey Senozhatsky wrote:
> On (23/03/30 09:23), Linus Torvalds wrote:
> > Although we also do have some other issues - I think down_trylock() is
> > ok in irq contexts, but mutex_trylock() is not. Maybe that's why
> > printk uses semaphores? I forget.
> 
> Yes, correct. IIRC we also cannot safely call mutex_unlock() from IRQ
> context because it takes some internal mutex spin_lock in a non-IRQ-safe
> manner. Semaphore is OK in this regard, both semaphore try_lock() and
> unlock() can be called from IRQ.

One more reason is that mutex must be released in the same context
that took it. And printk() tries to pass console_sem() to another context.

It was added by the commit dbdda842fe96 ("printk: Add console owner
and waiter logic to load balance console writes"). It was relatively
effective in reducing the risk of soft lockups.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 0/7] module: avoid userspace pressure on unwanted allocations
@ 2023-03-29  5:29 Luis Chamberlain
  0 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-03-29  5:29 UTC (permalink / raw)
  To: david, patches, linux-modules, linux-mm, linux-kernel, pmladek,
	petr.pavlu, prarit, torvalds, gregkh, rafael
  Cc: christophe.leroy, tglx, peterz, song, rppt, willy, vbabka,
	mhocko, dave.hansen, mcgrof

This patch set addresses a fix to the vmap allocation presure issues which
David Hildenbrand had reported last year in October. While at it,
I've simplified the kmod concurrency delimiter using Linus' suggestion,
and added debugfs stats to help us keep sane in doing analysis for memory
pressure issues on the finit_module() side of things. That should *also*
help do an empirical evaluation of module .text sizes *actually* present
on systems, given userspace makes it a bit tricky to get that right.

All this would not have been possible without stress-ng and Colin Ian King's
help to getting a modules ops in shape so to reproduce a situation only
reported so far on a system with over 400 CPUs.

I *think* the degugfs stats *should* probably be used to help identify
areas where we perhaps need *more work* to try to mitigate vmalloc()
space, as in the worst case we can end up using vmap space 3 times for
a single module, two just as big as the module, and if you are enabling
compression one with the compressed module size. That's significant memory
pressure on vmalloc() / vmap() space.

If you'd like to give this a spin this is available on my branch based on
modules-next 20230328-module-alloc-opts [2].

[0] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com
[1] https://github.com/ColinIanKing/stress-ng
[2] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230328-module-alloc-opts

Luis Chamberlain (7):
  module: move finished_loading()
  module: extract patient module check into helper
  module: avoid allocation if module is already present and ready
  sempahore: add a helper for a concurrency limiter
  modules/kmod: replace implementation with a sempahore
  debugfs: add debugfs_create_atomic64_t for atomic64_t
  module: add debug stats to help identify memory pressure

 fs/debugfs/file.c         |  36 +++++++
 include/linux/debugfs.h   |   2 +
 include/linux/semaphore.h |   3 +
 kernel/module/Kconfig     |  32 ++++++
 kernel/module/Makefile    |   4 +
 kernel/module/debug.c     |  16 +++
 kernel/module/internal.h  |  35 +++++++
 kernel/module/kmod.c      |  26 ++---
 kernel/module/main.c      | 164 ++++++++++++++++++++-----------
 kernel/module/stats.c     | 200 ++++++++++++++++++++++++++++++++++++++
 kernel/module/tracking.c  |   7 +-
 11 files changed, 445 insertions(+), 80 deletions(-)
 create mode 100644 kernel/module/debug.c
 create mode 100644 kernel/module/stats.c

-- 
2.39.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2023-03-31  8:06 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-29  5:31 [PATCH 0/7] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
2023-03-29  5:31 ` [PATCH 1/7] module: move finished_loading() Luis Chamberlain
2023-03-29  5:31 ` [PATCH 2/7] module: extract patient module check into helper Luis Chamberlain
2023-03-29  5:31 ` [PATCH 3/7] module: avoid allocation if module is already present and ready Luis Chamberlain
2023-03-29  5:31 ` [PATCH 4/7] sempahore: add a helper for a concurrency limiter Luis Chamberlain
2023-03-29  7:21   ` Peter Zijlstra
2023-03-29  7:51     ` Luis Chamberlain
2023-03-29  9:19       ` Peter Zijlstra
2023-03-29  9:49         ` Luis Chamberlain
2023-03-29 10:14           ` Peter Zijlstra
2023-03-29 16:50         ` Linus Torvalds
2023-03-30 11:56           ` Peter Zijlstra
2023-03-30 16:23             ` Linus Torvalds
2023-03-31  3:42               ` Sergey Senozhatsky
2023-03-31  8:05                 ` Petr Mladek
2023-03-31  3:45               ` Luis Chamberlain
2023-03-31  4:06                 ` Matthew Wilcox
2023-03-31  4:13                   ` Luis Chamberlain
2023-03-31  4:11                 ` Sergey Senozhatsky
2023-03-29  5:31 ` [PATCH 5/7] modules/kmod: replace implementation with a sempahore Luis Chamberlain
2023-03-29  5:31 ` [PATCH 6/7] debugfs: add debugfs_create_atomic64_t for atomic64_t Luis Chamberlain
2023-03-29  5:46   ` Greg KH
2023-03-29  5:31 ` [PATCH 7/7] module: add debug stats to help identify memory pressure Luis Chamberlain
2023-03-29  5:46   ` Greg KH
2023-03-29  6:04     ` Luis Chamberlain
  -- strict thread matches above, loose matches on Subject: below --
2023-03-29  5:29 [PATCH 0/7] module: avoid userspace pressure on unwanted allocations Luis Chamberlain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).