All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 01/12] LICENSES: Add the copyleft-next-0.3.1 license
       [not found] <20210918050430.3671227-1-mcgrof@kernel.org>
@ 2021-09-18  5:04   ` Luis Chamberlain
  2021-09-18  5:04   ` Luis Chamberlain
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams, senozhatsky,
	hch, joe, hkallweit1, axboe, jpoimboe, tglx, keescook, rostedt,
	peterz, linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, linux-kernel, copyleft-next,
	Goldwyn Rodrigues, Kuno Woudt, Richard Fontana, Ciaran Farrell,
	Christopher De Nicolo, Jonathan Corbet, Thorsten Leemhuis

Add the full text of the copyleft-next-0.3.1 license to the kernel
tree as well as the required tags for reference and tooling.
The license text was copied directly from the copyleft-next project's
git tree [0].

Discussion of using copyleft-next-0.3.1 on Linux started since June,
2016 [1]. In the end Linus' preference was to have drivers use
MODULE_LICENSE("GPL") to make it clear that the GPL applies when it
comes to Linux [2]. Additionally, even though copyleft-next-0.3.1 has
been found to be to be GPLv2 compatible by three attorneys at SUSE and
Redhat [3], to err on the side of caution we simply recommend to
always use the "OR" language for this license [4].

Even though it has been a goal of the project to be GPL-v2 compatible
to be certain in 2016 I asked for a clarification about what makes
copyleft-next GPLv2 compatible and also asked for a summary of
benefits. This prompted some small minor changes to make compatibility
even further clear and as of copyleft 0.3.1 compatibility should
be crystal clear [5].

The summary of why copyleft-next 0.3.1 is compatible with GPLv2
is explained as follows:

  Like GPLv2, copyleft-next requires distribution of derivative works
  ("Derived Works" in copyleft-next 0.3.x) to be under the same license.
  Ordinarily this would make the two licenses incompatible. However,
  copyleft-next 0.3.1 says: "If the Derived Work includes material
  licensed under the GPL, You may instead license the Derived Work under
  the GPL." "GPL" is defined to include GPLv2.

In practice this means copyleft-next code in Linux may be licensed
under the GPL2, however there are additional obvious gains for
bringing contributions from Linux outbound where copyleft-next is
preferred. A summary of benefits why projects outside of Linux might
prefer to use copyleft-next >= 0.3.1 over GPLv2:

o It is much shorter and simpler
o It has an explicit patent license grant, unlike GPLv2
o Its notice preservation conditions are clearer
o More free software/open source licenses are compatible
  with it (via section 4)
o The source code requirement triggered by binary distribution
  is much simpler in a procedural sense
o Recipients potentially have a contract claim against distributors
  who are noncompliant with the source code requirement
o There is a built-in inbound=outbound policy for upstream
  contributions (cf. Apache License 2.0 section 5)
o There are disincentives to engage in the controversial practice
  of copyleft/ proprietary dual-licensing
o In 15 years copyleft expires, which can be advantageous
  for legacy code
o There are explicit disincentives to bringing patent infringement
  claims accusing the licensed work of infringement (see 10b)
o There is a cure period for licensees who are not compliant
  with the license (there is no cure opportunity in GPLv2)
o copyleft-next has a 'built-in or-later' provision

The first driver submission to Linux under this dual strategy was
lib/test_sysctl.c through commit 9308f2f9e7f05 ("test_sysctl: add
dedicated proc sysctl test driver") merged in July 2017. Shortly after
that I also added test_kmod through commit d9c6a72d6fa29 ("kmod: add
test driver to stress test the module loader") in the same month. These
two drivers went in just a few months before the SPDX license practice
kicked in. In 2018 Kuno Woudt went through the process to get SPDX
identifiers for copyleft-next [6] [7]. Although there are SPDX tags
for copyleft-next-0.3.0, we only document use in Linux starting from
copyleft-next-0.3.1 which makes GPLv2 compatibility crystal clear.

This patch will let us update the two Linux selftest drivers in
subsequent patches with their respective SPDX license identifiers and
let us remove repetitive license boiler plate.

[0] https://github.com/copyleft-next/copyleft-next/blob/master/Releases/copyleft-next-0.3.1
[1] https://lore.kernel.org/lkml/1465929311-13509-1-git-send-email-mcgrof@kernel.org/
[2] https://lore.kernel.org/lkml/CA+55aFyhxcvD+q7tp+-yrSFDKfR0mOHgyEAe=f_94aKLsOu0Og@mail.gmail.com/
[3] https://lore.kernel.org/lkml/20170516232702.GL17314@wotan.suse.de/
[4] https://lkml.kernel.org/r/1495234558.7848.122.camel@linux.intel.com
[5] https://lists.fedorahosted.org/archives/list/copyleft-next@lists.fedorahosted.org/thread/JTGV56DDADWGKU7ZKTZA4DLXTGTLNJ57/#SQMDIKBRAVDOCT4UVNOOCRGBN2UJIKHZ
[6] https://spdx.org/licenses/copyleft-next-0.3.0.html
[7] https://spdx.org/licenses/copyleft-next-0.3.1.html

Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Kuno Woudt <kuno@frob.nl>
Cc: Richard Fontana <fontana@sharpeleven.org>
Cc: copyleft-next@lists.fedorahosted.org
Cc: Ciaran Farrell <Ciaran.Farrell@suse.com>
Cc: Christopher De Nicolo <Christopher.DeNicolo@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thorsten Leemhuis <linux@leemhuis.info>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 LICENSES/dual/copyleft-next-0.3.1 | 237 ++++++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)
 create mode 100644 LICENSES/dual/copyleft-next-0.3.1

diff --git a/LICENSES/dual/copyleft-next-0.3.1 b/LICENSES/dual/copyleft-next-0.3.1
new file mode 100644
index 000000000000..086bcb74b478
--- /dev/null
+++ b/LICENSES/dual/copyleft-next-0.3.1
@@ -0,0 +1,237 @@
+Valid-License-Identifier: copyleft-next-0.3.1
+SPDX-URL: https://spdx.org/licenses/copyleft-next-0.3.1
+Usage-Guide:
+  This license can be used in code, it has been found to be GPLv2 compatible
+  by attorneys at Redhat and SUSE, however to air on the side of caution,
+  it's best to only use it together with a GPL2 compatible license using "OR".
+  To use the copyleft-next-0.3.1 license put the following SPDX tag/value
+  pair into a comment according to the placement guidelines in the
+  licensing rules documentation:
+    SPDX-License-Identifier: GPL-2.0 OR copyleft-next-0.3.1
+    SPDX-License-Identifier: GPL-2.0-only OR copyleft-next 0.3.1
+    SPDX-License-Identifier: GPL-2.0+ OR copyleft-next-0.3.1
+    SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
+License-Text:
+
+=======================================================================
+
+                      copyleft-next 0.3.1 ("this License")
+                            Release date: 2016-04-29
+
+1. License Grants; No Trademark License
+
+   Subject to the terms of this License, I grant You:
+
+   a) A non-exclusive, worldwide, perpetual, royalty-free, irrevocable
+      copyright license, to reproduce, Distribute, prepare derivative works
+      of, publicly perform and publicly display My Work.
+
+   b) A non-exclusive, worldwide, perpetual, royalty-free, irrevocable
+      patent license under Licensed Patents to make, have made, use, sell,
+      offer for sale, and import Covered Works.
+
+   This License does not grant any rights in My name, trademarks, service
+   marks, or logos.
+
+2. Distribution: General Conditions
+
+   You may Distribute Covered Works, provided that You (i) inform
+   recipients how they can obtain a copy of this License; (ii) satisfy the
+   applicable conditions of sections 3 through 6; and (iii) preserve all
+   Legal Notices contained in My Work (to the extent they remain
+   pertinent). "Legal Notices" means copyright notices, license notices,
+   license texts, and author attributions, but does not include logos,
+   other graphical images, trademarks or trademark legends.
+
+3. Conditions for Distributing Derived Works; Outbound GPL Compatibility
+
+   If You Distribute a Derived Work, You must license the entire Derived
+   Work as a whole under this License, with prominent notice of such
+   licensing. This condition may not be avoided through such means as
+   separate Distribution of portions of the Derived Work.
+
+   If the Derived Work includes material licensed under the GPL, You may
+   instead license the Derived Work under the GPL.
+   
+4. Condition Against Further Restrictions; Inbound License Compatibility
+
+   When Distributing a Covered Work, You may not impose further
+   restrictions on the exercise of rights in the Covered Work granted under
+   this License. This condition is not excused merely because such
+   restrictions result from Your compliance with conditions or obligations
+   extrinsic to this License (such as a court order or an agreement with a
+   third party).
+
+   However, You may Distribute a Covered Work incorporating material
+   governed by a license that is both OSI-Approved and FSF-Free as of the
+   release date of this License, provided that compliance with such
+   other license would not conflict with any conditions stated in other
+   sections of this License.
+
+5. Conditions for Distributing Object Code
+
+   You may Distribute an Object Code form of a Covered Work, provided that
+   you accompany the Object Code with a URL through which the Corresponding
+   Source is made available, at no charge, by some standard or customary
+   means of providing network access to source code.
+
+   If you Distribute the Object Code in a physical product or tangible
+   storage medium ("Product"), the Corresponding Source must be available
+   through such URL for two years from the date of Your most recent
+   Distribution of the Object Code in the Product. However, if the Product
+   itself contains or is accompanied by the Corresponding Source (made
+   available in a customarily accessible manner), You need not also comply
+   with the first paragraph of this section.
+
+   Each direct and indirect recipient of the Covered Work from You is an
+   intended third-party beneficiary of this License solely as to this
+   section 5, with the right to enforce its terms.
+
+6. Symmetrical Licensing Condition for Upstream Contributions
+
+   If You Distribute a work to Me specifically for inclusion in or
+   modification of a Covered Work (a "Patch"), and no explicit licensing
+   terms apply to the Patch, You license the Patch under this License, to
+   the extent of Your copyright in the Patch. This condition does not
+   negate the other conditions of this License, if applicable to the Patch.
+
+7. Nullification of Copyleft/Proprietary Dual Licensing
+
+   If I offer to license, for a fee, a Covered Work under terms other than
+   a license that is OSI-Approved or FSF-Free as of the release date of this
+   License or a numbered version of copyleft-next released by the
+   Copyleft-Next Project, then the license I grant You under section 1 is no
+   longer subject to the conditions in sections 3 through 5.
+
+8. Copyleft Sunset
+
+   The conditions in sections 3 through 5 no longer apply once fifteen
+   years have elapsed from the date of My first Distribution of My Work
+   under this License.
+
+9. Pass-Through
+
+   When You Distribute a Covered Work, the recipient automatically receives
+   a license to My Work from Me, subject to the terms of this License.
+
+10. Termination
+
+    Your license grants under section 1 are automatically terminated if You
+
+    a) fail to comply with the conditions of this License, unless You cure
+       such noncompliance within thirty days after becoming aware of it, or
+
+    b) initiate a patent infringement litigation claim (excluding
+       declaratory judgment actions, counterclaims, and cross-claims)
+       alleging that any part of My Work directly or indirectly infringes
+       any patent.
+
+    Termination of Your license grants extends to all copies of Covered
+    Works You subsequently obtain. Termination does not terminate the
+    rights of those who have received copies or rights from You subject to
+    this License.
+
+    To the extent permission to make copies of a Covered Work is necessary
+    merely for running it, such permission is not terminable.
+
+11. Later License Versions
+
+    The Copyleft-Next Project may release new versions of copyleft-next,
+    designated by a distinguishing version number ("Later Versions").
+    Unless I explicitly remove the option of Distributing Covered Works
+    under Later Versions, You may Distribute Covered Works under any Later
+    Version.
+
+** 12. No Warranty                                                       **
+**                                                                       **
+**     My Work is provided "as-is", without warranty. You bear the risk  **
+**     of using it. To the extent permitted by applicable law, each      **
+**     Distributor of My Work excludes the implied warranties of title,  **
+**     merchantability, fitness for a particular purpose and             **
+**     non-infringement.                                                 **
+
+** 13. Limitation of Liability                                           **
+**                                                                       **
+**     To the extent permitted by applicable law, in no event will any   **
+**     Distributor of My Work be liable to You for any damages           **
+**     whatsoever, whether direct, indirect, special, incidental, or     **
+**     consequential damages, whether arising under contract, tort       **
+**     (including negligence), or otherwise, even where the Distributor  **
+**     knew or should have known about the possibility of such damages.  **
+
+14. Severability
+
+    The invalidity or unenforceability of any provision of this License
+    does not affect the validity or enforceability of the remainder of
+    this License. Such provision is to be reformed to the minimum extent
+    necessary to make it valid and enforceable.
+
+15. Definitions
+
+    "Copyleft-Next Project" means the project that maintains the source
+    code repository at <https://github.com/copyleft-next/copyleft-next.git/>
+    as of the release date of this License.
+
+    "Corresponding Source" of a Covered Work in Object Code form means (i)
+    the Source Code form of the Covered Work; (ii) all scripts,
+    instructions and similar information that are reasonably necessary for
+    a skilled developer to generate such Object Code from the Source Code
+    provided under (i); and (iii) a list clearly identifying all Separate
+    Works (other than those provided in compliance with (ii)) that were
+    specifically used in building and (if applicable) installing the
+    Covered Work (for example, a specified proprietary compiler including
+    its version number). Corresponding Source must be machine-readable.
+
+    "Covered Work" means My Work or a Derived Work.
+
+    "Derived Work" means a work of authorship that copies from, modifies,
+    adapts, is based on, is a derivative work of, transforms, translates or
+    contains all or part of My Work, such that copyright permission is
+    required. The following are not Derived Works: (i) Mere Aggregation;
+    (ii) a mere reproduction of My Work; and (iii) if My Work fails to
+    explicitly state an expectation otherwise, a work that merely makes
+    reference to My Work.
+
+    "Distribute" means to distribute, transfer or make a copy available to
+    someone else, such that copyright permission is required.
+
+    "Distributor" means Me and anyone else who Distributes a Covered Work.
+
+    "FSF-Free" means classified as 'free' by the Free Software Foundation.
+
+    "GPL" means a version of the GNU General Public License or the GNU
+    Affero General Public License.
+
+    "I"/"Me"/"My" refers to the individual or legal entity that places My
+    Work under this License. "You"/"Your" refers to the individual or legal
+    entity exercising rights in My Work under this License. A legal entity
+    includes each entity that controls, is controlled by, or is under
+    common control with such legal entity. "Control" means (a) the power to
+    direct the actions of such legal entity, whether by contract or
+    otherwise, or (b) ownership of more than fifty percent of the
+    outstanding shares or beneficial ownership of such legal entity.
+
+    "Licensed Patents" means all patent claims licensable royalty-free by
+    Me, now or in the future, that are necessarily infringed by making,
+    using, or selling My Work, and excludes claims that would be infringed
+    only as a consequence of further modification of My Work.
+
+    "Mere Aggregation" means an aggregation of a Covered Work with a
+    Separate Work.
+
+    "My Work" means the particular work of authorship I license to You
+    under this License.
+
+    "Object Code" means any form of a work that is not Source Code.
+
+    "OSI-Approved" means approved as 'Open Source' by the Open Source
+    Initiative.
+
+    "Separate Work" means a work that is separate from and independent of a
+    particular Covered Work and is not by its nature an extension or
+    enhancement of the Covered Work, and/or a runtime library, standard
+    library or similar component that is used to generate an Object Code
+    form of a Covered Work.
+
+    "Source Code" means the preferred form of a work for making
+    modifications to it.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 01/12] LICENSES: Add the copyleft-next-0.3.1 license
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams

Add the full text of the copyleft-next-0.3.1 license to the kernel
tree as well as the required tags for reference and tooling.
The license text was copied directly from the copyleft-next project's
git tree [0].

Discussion of using copyleft-next-0.3.1 on Linux started since June,
2016 [1]. In the end Linus' preference was to have drivers use
MODULE_LICENSE("GPL") to make it clear that the GPL applies when it
comes to Linux [2]. Additionally, even though copyleft-next-0.3.1 has
been found to be to be GPLv2 compatible by three attorneys at SUSE and
Redhat [3], to err on the side of caution we simply recommend to
always use the "OR" language for this license [4].

Even though it has been a goal of the project to be GPL-v2 compatible
to be certain in 2016 I asked for a clarification about what makes
copyleft-next GPLv2 compatible and also asked for a summary of
benefits. This prompted some small minor changes to make compatibility
even further clear and as of copyleft 0.3.1 compatibility should
be crystal clear [5].

The summary of why copyleft-next 0.3.1 is compatible with GPLv2
is explained as follows:

  Like GPLv2, copyleft-next requires distribution of derivative works
  ("Derived Works" in copyleft-next 0.3.x) to be under the same license.
  Ordinarily this would make the two licenses incompatible. However,
  copyleft-next 0.3.1 says: "If the Derived Work includes material
  licensed under the GPL, You may instead license the Derived Work under
  the GPL." "GPL" is defined to include GPLv2.

In practice this means copyleft-next code in Linux may be licensed
under the GPL2, however there are additional obvious gains for
bringing contributions from Linux outbound where copyleft-next is
preferred. A summary of benefits why projects outside of Linux might
prefer to use copyleft-next >= 0.3.1 over GPLv2:

o It is much shorter and simpler
o It has an explicit patent license grant, unlike GPLv2
o Its notice preservation conditions are clearer
o More free software/open source licenses are compatible
  with it (via section 4)
o The source code requirement triggered by binary distribution
  is much simpler in a procedural sense
o Recipients potentially have a contract claim against distributors
  who are noncompliant with the source code requirement
o There is a built-in inbound=outbound policy for upstream
  contributions (cf. Apache License 2.0 section 5)
o There are disincentives to engage in the controversial practice
  of copyleft/ proprietary dual-licensing
o In 15 years copyleft expires, which can be advantageous
  for legacy code
o There are explicit disincentives to bringing patent infringement
  claims accusing the licensed work of infringement (see 10b)
o There is a cure period for licensees who are not compliant
  with the license (there is no cure opportunity in GPLv2)
o copyleft-next has a 'built-in or-later' provision

The first driver submission to Linux under this dual strategy was
lib/test_sysctl.c through commit 9308f2f9e7f05 ("test_sysctl: add
dedicated proc sysctl test driver") merged in July 2017. Shortly after
that I also added test_kmod through commit d9c6a72d6fa29 ("kmod: add
test driver to stress test the module loader") in the same month. These
two drivers went in just a few months before the SPDX license practice
kicked in. In 2018 Kuno Woudt went through the process to get SPDX
identifiers for copyleft-next [6] [7]. Although there are SPDX tags
for copyleft-next-0.3.0, we only document use in Linux starting from
copyleft-next-0.3.1 which makes GPLv2 compatibility crystal clear.

This patch will let us update the two Linux selftest drivers in
subsequent patches with their respective SPDX license identifiers and
let us remove repetitive license boiler plate.

[0] https://github.com/copyleft-next/copyleft-next/blob/master/Releases/copyleft-next-0.3.1
[1] https://lore.kernel.org/lkml/1465929311-13509-1-git-send-email-mcgrof@kernel.org/
[2] https://lore.kernel.org/lkml/CA+55aFyhxcvD+q7tp+-yrSFDKfR0mOHgyEAe=f_94aKLsOu0Og@mail.gmail.com/
[3] https://lore.kernel.org/lkml/20170516232702.GL17314@wotan.suse.de/
[4] https://lkml.kernel.org/r/1495234558.7848.122.camel@linux.intel.com
[5] https://lists.fedorahosted.org/archives/list/copyleft-next@lists.fedorahosted.org/thread/JTGV56DDADWGKU7ZKTZA4DLXTGTLNJ57/#SQMDIKBRAVDOCT4UVNOOCRGBN2UJIKHZ
[6] https://spdx.org/licenses/copyleft-next-0.3.0.html
[7] https://spdx.org/licenses/copyleft-next-0.3.1.html

Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Kuno Woudt <kuno@frob.nl>
Cc: Richard Fontana <fontana@sharpeleven.org>
Cc: copyleft-next@lists.fedorahosted.org
Cc: Ciaran Farrell <Ciaran.Farrell@suse.com>
Cc: Christopher De Nicolo <Christopher.DeNicolo@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thorsten Leemhuis <linux@leemhuis.info>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 LICENSES/dual/copyleft-next-0.3.1 | 237 ++++++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)
 create mode 100644 LICENSES/dual/copyleft-next-0.3.1

diff --git a/LICENSES/dual/copyleft-next-0.3.1 b/LICENSES/dual/copyleft-next-0.3.1
new file mode 100644
index 000000000000..086bcb74b478
--- /dev/null
+++ b/LICENSES/dual/copyleft-next-0.3.1
@@ -0,0 +1,237 @@
+Valid-License-Identifier: copyleft-next-0.3.1
+SPDX-URL: https://spdx.org/licenses/copyleft-next-0.3.1
+Usage-Guide:
+  This license can be used in code, it has been found to be GPLv2 compatible
+  by attorneys at Redhat and SUSE, however to air on the side of caution,
+  it's best to only use it together with a GPL2 compatible license using "OR".
+  To use the copyleft-next-0.3.1 license put the following SPDX tag/value
+  pair into a comment according to the placement guidelines in the
+  licensing rules documentation:
+    SPDX-License-Identifier: GPL-2.0 OR copyleft-next-0.3.1
+    SPDX-License-Identifier: GPL-2.0-only OR copyleft-next 0.3.1
+    SPDX-License-Identifier: GPL-2.0+ OR copyleft-next-0.3.1
+    SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
+License-Text:
+
+=======================================================================
+
+                      copyleft-next 0.3.1 ("this License")
+                            Release date: 2016-04-29
+
+1. License Grants; No Trademark License
+
+   Subject to the terms of this License, I grant You:
+
+   a) A non-exclusive, worldwide, perpetual, royalty-free, irrevocable
+      copyright license, to reproduce, Distribute, prepare derivative works
+      of, publicly perform and publicly display My Work.
+
+   b) A non-exclusive, worldwide, perpetual, royalty-free, irrevocable
+      patent license under Licensed Patents to make, have made, use, sell,
+      offer for sale, and import Covered Works.
+
+   This License does not grant any rights in My name, trademarks, service
+   marks, or logos.
+
+2. Distribution: General Conditions
+
+   You may Distribute Covered Works, provided that You (i) inform
+   recipients how they can obtain a copy of this License; (ii) satisfy the
+   applicable conditions of sections 3 through 6; and (iii) preserve all
+   Legal Notices contained in My Work (to the extent they remain
+   pertinent). "Legal Notices" means copyright notices, license notices,
+   license texts, and author attributions, but does not include logos,
+   other graphical images, trademarks or trademark legends.
+
+3. Conditions for Distributing Derived Works; Outbound GPL Compatibility
+
+   If You Distribute a Derived Work, You must license the entire Derived
+   Work as a whole under this License, with prominent notice of such
+   licensing. This condition may not be avoided through such means as
+   separate Distribution of portions of the Derived Work.
+
+   If the Derived Work includes material licensed under the GPL, You may
+   instead license the Derived Work under the GPL.
+   
+4. Condition Against Further Restrictions; Inbound License Compatibility
+
+   When Distributing a Covered Work, You may not impose further
+   restrictions on the exercise of rights in the Covered Work granted under
+   this License. This condition is not excused merely because such
+   restrictions result from Your compliance with conditions or obligations
+   extrinsic to this License (such as a court order or an agreement with a
+   third party).
+
+   However, You may Distribute a Covered Work incorporating material
+   governed by a license that is both OSI-Approved and FSF-Free as of the
+   release date of this License, provided that compliance with such
+   other license would not conflict with any conditions stated in other
+   sections of this License.
+
+5. Conditions for Distributing Object Code
+
+   You may Distribute an Object Code form of a Covered Work, provided that
+   you accompany the Object Code with a URL through which the Corresponding
+   Source is made available, at no charge, by some standard or customary
+   means of providing network access to source code.
+
+   If you Distribute the Object Code in a physical product or tangible
+   storage medium ("Product"), the Corresponding Source must be available
+   through such URL for two years from the date of Your most recent
+   Distribution of the Object Code in the Product. However, if the Product
+   itself contains or is accompanied by the Corresponding Source (made
+   available in a customarily accessible manner), You need not also comply
+   with the first paragraph of this section.
+
+   Each direct and indirect recipient of the Covered Work from You is an
+   intended third-party beneficiary of this License solely as to this
+   section 5, with the right to enforce its terms.
+
+6. Symmetrical Licensing Condition for Upstream Contributions
+
+   If You Distribute a work to Me specifically for inclusion in or
+   modification of a Covered Work (a "Patch"), and no explicit licensing
+   terms apply to the Patch, You license the Patch under this License, to
+   the extent of Your copyright in the Patch. This condition does not
+   negate the other conditions of this License, if applicable to the Patch.
+
+7. Nullification of Copyleft/Proprietary Dual Licensing
+
+   If I offer to license, for a fee, a Covered Work under terms other than
+   a license that is OSI-Approved or FSF-Free as of the release date of this
+   License or a numbered version of copyleft-next released by the
+   Copyleft-Next Project, then the license I grant You under section 1 is no
+   longer subject to the conditions in sections 3 through 5.
+
+8. Copyleft Sunset
+
+   The conditions in sections 3 through 5 no longer apply once fifteen
+   years have elapsed from the date of My first Distribution of My Work
+   under this License.
+
+9. Pass-Through
+
+   When You Distribute a Covered Work, the recipient automatically receives
+   a license to My Work from Me, subject to the terms of this License.
+
+10. Termination
+
+    Your license grants under section 1 are automatically terminated if You
+
+    a) fail to comply with the conditions of this License, unless You cure
+       such noncompliance within thirty days after becoming aware of it, or
+
+    b) initiate a patent infringement litigation claim (excluding
+       declaratory judgment actions, counterclaims, and cross-claims)
+       alleging that any part of My Work directly or indirectly infringes
+       any patent.
+
+    Termination of Your license grants extends to all copies of Covered
+    Works You subsequently obtain. Termination does not terminate the
+    rights of those who have received copies or rights from You subject to
+    this License.
+
+    To the extent permission to make copies of a Covered Work is necessary
+    merely for running it, such permission is not terminable.
+
+11. Later License Versions
+
+    The Copyleft-Next Project may release new versions of copyleft-next,
+    designated by a distinguishing version number ("Later Versions").
+    Unless I explicitly remove the option of Distributing Covered Works
+    under Later Versions, You may Distribute Covered Works under any Later
+    Version.
+
+** 12. No Warranty                                                       **
+**                                                                       **
+**     My Work is provided "as-is", without warranty. You bear the risk  **
+**     of using it. To the extent permitted by applicable law, each      **
+**     Distributor of My Work excludes the implied warranties of title,  **
+**     merchantability, fitness for a particular purpose and             **
+**     non-infringement.                                                 **
+
+** 13. Limitation of Liability                                           **
+**                                                                       **
+**     To the extent permitted by applicable law, in no event will any   **
+**     Distributor of My Work be liable to You for any damages           **
+**     whatsoever, whether direct, indirect, special, incidental, or     **
+**     consequential damages, whether arising under contract, tort       **
+**     (including negligence), or otherwise, even where the Distributor  **
+**     knew or should have known about the possibility of such damages.  **
+
+14. Severability
+
+    The invalidity or unenforceability of any provision of this License
+    does not affect the validity or enforceability of the remainder of
+    this License. Such provision is to be reformed to the minimum extent
+    necessary to make it valid and enforceable.
+
+15. Definitions
+
+    "Copyleft-Next Project" means the project that maintains the source
+    code repository at <https://github.com/copyleft-next/copyleft-next.git/>
+    as of the release date of this License.
+
+    "Corresponding Source" of a Covered Work in Object Code form means (i)
+    the Source Code form of the Covered Work; (ii) all scripts,
+    instructions and similar information that are reasonably necessary for
+    a skilled developer to generate such Object Code from the Source Code
+    provided under (i); and (iii) a list clearly identifying all Separate
+    Works (other than those provided in compliance with (ii)) that were
+    specifically used in building and (if applicable) installing the
+    Covered Work (for example, a specified proprietary compiler including
+    its version number). Corresponding Source must be machine-readable.
+
+    "Covered Work" means My Work or a Derived Work.
+
+    "Derived Work" means a work of authorship that copies from, modifies,
+    adapts, is based on, is a derivative work of, transforms, translates or
+    contains all or part of My Work, such that copyright permission is
+    required. The following are not Derived Works: (i) Mere Aggregation;
+    (ii) a mere reproduction of My Work; and (iii) if My Work fails to
+    explicitly state an expectation otherwise, a work that merely makes
+    reference to My Work.
+
+    "Distribute" means to distribute, transfer or make a copy available to
+    someone else, such that copyright permission is required.
+
+    "Distributor" means Me and anyone else who Distributes a Covered Work.
+
+    "FSF-Free" means classified as 'free' by the Free Software Foundation.
+
+    "GPL" means a version of the GNU General Public License or the GNU
+    Affero General Public License.
+
+    "I"/"Me"/"My" refers to the individual or legal entity that places My
+    Work under this License. "You"/"Your" refers to the individual or legal
+    entity exercising rights in My Work under this License. A legal entity
+    includes each entity that controls, is controlled by, or is under
+    common control with such legal entity. "Control" means (a) the power to
+    direct the actions of such legal entity, whether by contract or
+    otherwise, or (b) ownership of more than fifty percent of the
+    outstanding shares or beneficial ownership of such legal entity.
+
+    "Licensed Patents" means all patent claims licensable royalty-free by
+    Me, now or in the future, that are necessarily infringed by making,
+    using, or selling My Work, and excludes claims that would be infringed
+    only as a consequence of further modification of My Work.
+
+    "Mere Aggregation" means an aggregation of a Covered Work with a
+    Separate Work.
+
+    "My Work" means the particular work of authorship I license to You
+    under this License.
+
+    "Object Code" means any form of a work that is not Source Code.
+
+    "OSI-Approved" means approved as 'Open Source' by the Open Source
+    Initiative.
+
+    "Separate Work" means a work that is separate from and independent of a
+    particular Covered Work and is not by its nature an extension or
+    enhancement of the Covered Work, and/or a runtime library, standard
+    library or similar component that is used to generate an Object Code
+    form of a Covered Work.
+
+    "Source Code" means the preferred form of a work for making
+    modifications to it.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 02/12] testing: use the copyleft-next-0.3.1 SPDX tag
       [not found] <20210918050430.3671227-1-mcgrof@kernel.org>
@ 2021-09-18  5:04   ` Luis Chamberlain
  2021-09-18  5:04   ` Luis Chamberlain
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams, senozhatsky,
	hch, joe, hkallweit1, axboe, jpoimboe, tglx, keescook, rostedt,
	peterz, linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, linux-kernel, copyleft-next,
	Goldwyn Rodrigues, Kuno Woudt, Richard Fontana, Ciaran Farrell,
	Christopher De Nicolo, Jonathan Corbet, Thorsten Leemhuis

Two selftests drivers exist under the copyleft-next license.
These drivers were added prior to SPDX practice taking full swing
in the kernel. Now that we have an SPDX tag for copylef-next-0.3.1
documented, embrace it and remove the boiler plate.

Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Kuno Woudt <kuno@frob.nl>
Cc: Richard Fontana <fontana@sharpeleven.org>
Cc: copyleft-next@lists.fedorahosted.org
Cc: Ciaran Farrell <Ciaran.Farrell@suse.com>
Cc: Christopher De Nicolo <Christopher.DeNicolo@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thorsten Leemhuis <linux@leemhuis.info>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 lib/test_kmod.c                          | 12 +-----------
 lib/test_sysctl.c                        | 12 +-----------
 tools/testing/selftests/kmod/kmod.sh     | 13 +------------
 tools/testing/selftests/sysctl/sysctl.sh | 12 +-----------
 4 files changed, 4 insertions(+), 45 deletions(-)

diff --git a/lib/test_kmod.c b/lib/test_kmod.c
index ce1589391413..d62afd89dc63 100644
--- a/lib/test_kmod.c
+++ b/lib/test_kmod.c
@@ -1,18 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 /*
  * kmod stress test driver
  *
  * Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or at your option any
- * later version; or, when distributed separately from the Linux kernel or
- * when incorporated into other software packages, subject to the following
- * license:
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of copyleft-next (version 0.3.1 or later) as published
- * at http://copyleft-next.org/.
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
diff --git a/lib/test_sysctl.c b/lib/test_sysctl.c
index 3750323973f4..9e5bd10a930a 100644
--- a/lib/test_sysctl.c
+++ b/lib/test_sysctl.c
@@ -1,18 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 /*
  * proc sysctl test driver
  *
  * Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or at your option any
- * later version; or, when distributed separately from the Linux kernel or
- * when incorporated into other software packages, subject to the following
- * license:
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of copyleft-next (version 0.3.1 or later) as published
- * at http://copyleft-next.org/.
  */
 
 /*
diff --git a/tools/testing/selftests/kmod/kmod.sh b/tools/testing/selftests/kmod/kmod.sh
index afd42387e8b2..7189715d7960 100755
--- a/tools/testing/selftests/kmod/kmod.sh
+++ b/tools/testing/selftests/kmod/kmod.sh
@@ -1,18 +1,7 @@
 #!/bin/bash
-#
+# SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 # Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
 #
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of the GNU General Public License as published by the Free
-# Software Foundation; either version 2 of the License, or at your option any
-# later version; or, when distributed separately from the Linux kernel or
-# when incorporated into other software packages, subject to the following
-# license:
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of copyleft-next (version 0.3.1 or later) as published
-# at http://copyleft-next.org/.
-
 # This is a stress test script for kmod, the kernel module loader. It uses
 # test_kmod which exposes a series of knobs for the API for us so we can
 # tweak each test in userspace rather than in kernelspace.
diff --git a/tools/testing/selftests/sysctl/sysctl.sh b/tools/testing/selftests/sysctl/sysctl.sh
index 19515dcb7d04..2046c603a4d4 100755
--- a/tools/testing/selftests/sysctl/sysctl.sh
+++ b/tools/testing/selftests/sysctl/sysctl.sh
@@ -1,16 +1,6 @@
 #!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 # Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of the GNU General Public License as published by the Free
-# Software Foundation; either version 2 of the License, or at your option any
-# later version; or, when distributed separately from the Linux kernel or
-# when incorporated into other software packages, subject to the following
-# license:
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of copyleft-next (version 0.3.1 or later) as published
-# at http://copyleft-next.org/.
 
 # This performs a series tests against the proc sysctl interface.
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 02/12] testing: use the copyleft-next-0.3.1 SPDX tag
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams

Two selftests drivers exist under the copyleft-next license.
These drivers were added prior to SPDX practice taking full swing
in the kernel. Now that we have an SPDX tag for copylef-next-0.3.1
documented, embrace it and remove the boiler plate.

Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Kuno Woudt <kuno@frob.nl>
Cc: Richard Fontana <fontana@sharpeleven.org>
Cc: copyleft-next@lists.fedorahosted.org
Cc: Ciaran Farrell <Ciaran.Farrell@suse.com>
Cc: Christopher De Nicolo <Christopher.DeNicolo@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thorsten Leemhuis <linux@leemhuis.info>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 lib/test_kmod.c                          | 12 +-----------
 lib/test_sysctl.c                        | 12 +-----------
 tools/testing/selftests/kmod/kmod.sh     | 13 +------------
 tools/testing/selftests/sysctl/sysctl.sh | 12 +-----------
 4 files changed, 4 insertions(+), 45 deletions(-)

diff --git a/lib/test_kmod.c b/lib/test_kmod.c
index ce1589391413..d62afd89dc63 100644
--- a/lib/test_kmod.c
+++ b/lib/test_kmod.c
@@ -1,18 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 /*
  * kmod stress test driver
  *
  * Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or at your option any
- * later version; or, when distributed separately from the Linux kernel or
- * when incorporated into other software packages, subject to the following
- * license:
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of copyleft-next (version 0.3.1 or later) as published
- * at http://copyleft-next.org/.
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
diff --git a/lib/test_sysctl.c b/lib/test_sysctl.c
index 3750323973f4..9e5bd10a930a 100644
--- a/lib/test_sysctl.c
+++ b/lib/test_sysctl.c
@@ -1,18 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 /*
  * proc sysctl test driver
  *
  * Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or at your option any
- * later version; or, when distributed separately from the Linux kernel or
- * when incorporated into other software packages, subject to the following
- * license:
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of copyleft-next (version 0.3.1 or later) as published
- * at http://copyleft-next.org/.
  */
 
 /*
diff --git a/tools/testing/selftests/kmod/kmod.sh b/tools/testing/selftests/kmod/kmod.sh
index afd42387e8b2..7189715d7960 100755
--- a/tools/testing/selftests/kmod/kmod.sh
+++ b/tools/testing/selftests/kmod/kmod.sh
@@ -1,18 +1,7 @@
 #!/bin/bash
-#
+# SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 # Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
 #
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of the GNU General Public License as published by the Free
-# Software Foundation; either version 2 of the License, or at your option any
-# later version; or, when distributed separately from the Linux kernel or
-# when incorporated into other software packages, subject to the following
-# license:
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of copyleft-next (version 0.3.1 or later) as published
-# at http://copyleft-next.org/.
-
 # This is a stress test script for kmod, the kernel module loader. It uses
 # test_kmod which exposes a series of knobs for the API for us so we can
 # tweak each test in userspace rather than in kernelspace.
diff --git a/tools/testing/selftests/sysctl/sysctl.sh b/tools/testing/selftests/sysctl/sysctl.sh
index 19515dcb7d04..2046c603a4d4 100755
--- a/tools/testing/selftests/sysctl/sysctl.sh
+++ b/tools/testing/selftests/sysctl/sysctl.sh
@@ -1,16 +1,6 @@
 #!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
 # Copyright (C) 2017 Luis R. Rodriguez <mcgrof@kernel.org>
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of the GNU General Public License as published by the Free
-# Software Foundation; either version 2 of the License, or at your option any
-# later version; or, when distributed separately from the Linux kernel or
-# when incorporated into other software packages, subject to the following
-# license:
-#
-# This program is free software; you can redistribute it and/or modify it
-# under the terms of copyleft-next (version 0.3.1 or later) as published
-# at http://copyleft-next.org/.
 
 # This performs a series tests against the proc sysctl interface.
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 03/12] selftests: add tests_sysfs module
       [not found] <20210918050430.3671227-1-mcgrof@kernel.org>
@ 2021-09-18  5:04   ` Luis Chamberlain
  2021-09-18  5:04   ` Luis Chamberlain
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams, senozhatsky,
	hch, joe, hkallweit1, axboe, jpoimboe, tglx, keescook, rostedt,
	peterz, linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, linux-kernel, copyleft-next

This adds a new selftest module which can be used to test sysfs, which
would otherwise require using an existing driver. This lets us muck
with a template driver to test breaking things without affecting
system behaviour or requiring the dependencies of a real device
driver.

A series of 28 tests are added. Support for using two device types are
supported:

  * misc
  * block

Contrary to sysctls, sysfs requires a full write to happen at once, and
so we reduce the digit tests to single writes. Two main sysfs knobs are
provided for testing reading/storing, one which doesn't inclur any
delays and another which can incur programmed delays. What locks are
held, if any, are configurable, at module load time, or through dynamic
configuration at run time.

Since sysfs is a technically filesystem, but a pseudo one, which
requires a kernel user, our test_sysfs module and respective test script
embraces fstests format for tests in the kernel ring bufffer. Likewise,
a scraper for kernel crashes is provided which matches what fstests does
as well.

Two tests are kept disabled as they currently cause a deadlock, and so
this provides a mechanism to easily show proof and demo how the deadlock
can happen:

Demos the deadlock with a device specific lock
./tools/testing/selftests/sysfs/sysfs.sh -t 0027

Demos the deadlock with rtnl_lock()
./tools/testing/selftests/sysfs/sysfs.sh -t 0028

Two separate solutions to the deadlock issue have been proposed,
and so now its a matter of either documenting this limitation or
eventually adopting a generic fix.

This selftests will shortly be expanded upon with more tests which
require further kernel changes in order to provide better test
coverage.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 MAINTAINERS                            |    7 +
 lib/Kconfig.debug                      |   12 +
 lib/Makefile                           |    1 +
 lib/test_sysfs.c                       |  943 ++++++++++++++++++
 tools/testing/selftests/sysfs/Makefile |   12 +
 tools/testing/selftests/sysfs/config   |    2 +
 tools/testing/selftests/sysfs/sysfs.sh | 1208 ++++++++++++++++++++++++
 7 files changed, 2185 insertions(+)
 create mode 100644 lib/test_sysfs.c
 create mode 100644 tools/testing/selftests/sysfs/Makefile
 create mode 100644 tools/testing/selftests/sysfs/config
 create mode 100755 tools/testing/selftests/sysfs/sysfs.sh

diff --git a/MAINTAINERS b/MAINTAINERS
index 15a5fd8323f7..28a34384f541 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18166,6 +18166,13 @@ L:	linux-mmc@vger.kernel.org
 S:	Maintained
 F:	drivers/mmc/host/sdhci-pci-dwc-mshc.c
 
+SYSFS TEST DRIVER
+M:	Luis Chamberlain <mcgrof@kernel.org>
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+F:	lib/test_sysfs.c
+F:	tools/testing/selftests/sysfs/
+
 SYSTEM CONFIGURATION (SYSCON)
 M:	Lee Jones <lee.jones@linaro.org>
 M:	Arnd Bergmann <arnd@arndb.de>
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index dbff322207ad..ae19bf1a21b8 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2343,6 +2343,18 @@ config TEST_SYSCTL
 
 	  If unsure, say N.
 
+config TEST_SYSFS
+	tristate "sysfs test driver"
+	depends on SYSFS
+	depends on NET
+	depends on BLOCK
+	help
+	  This builds the "test_sysfs" module. This driver enables to test the
+	  sysfs file system safely without affecting production knobs which
+	  might alter system functionality.
+
+	  If unsure, say N.
+
 config BITFIELD_KUNIT
 	tristate "KUnit test bitfield functions at runtime"
 	depends on KUNIT
diff --git a/lib/Makefile b/lib/Makefile
index 2cfd33917ad5..5143d65f90d6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_TEST_FIRMWARE) += test_firmware.o
 obj-$(CONFIG_TEST_BITOPS) += test_bitops.o
 CFLAGS_test_bitops.o += -Werror
 obj-$(CONFIG_TEST_SYSCTL) += test_sysctl.o
+obj-$(CONFIG_TEST_SYSFS) += test_sysfs.o
 obj-$(CONFIG_TEST_HASH) += test_hash.o test_siphash.o
 obj-$(CONFIG_TEST_IDA) += test_ida.o
 obj-$(CONFIG_KASAN_KUNIT_TEST) += test_kasan.o
diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
new file mode 100644
index 000000000000..273fc3f39740
--- /dev/null
+++ b/lib/test_sysfs.c
@@ -0,0 +1,943 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
+/*
+ * sysfs test driver
+ *
+ * Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or at your option any
+ * later version; or, when distributed separately from the Linux kernel or
+ * when incorporated into other software packages, subject to the following
+ * license:
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of copyleft-next (version 0.3.1 or later) as published
+ * at http://copyleft-next.org/.
+ */
+
+/*
+ * This module allows us to add race conditions which we can test for
+ * against the sysfs filesystem.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/async.h>
+#include <linux/delay.h>
+#include <linux/vmalloc.h>
+#include <linux/debugfs.h>
+#include <linux/rtnetlink.h>
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+
+static bool enable_lock;
+module_param(enable_lock, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_lock,
+		 "enable locking on reads / stores from the start");
+
+static bool enable_lock_on_rmmod;
+module_param(enable_lock_on_rmmod, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_lock_on_rmmod,
+		 "enable locking on rmmod");
+
+static bool use_rtnl_lock;
+module_param(use_rtnl_lock, bool_enable_only, 0644);
+MODULE_PARM_DESC(use_rtnl_lock,
+		 "use an rtnl_lock instead of the device mutex_lock");
+
+static unsigned int write_delay_msec_y = 500;
+module_param_named(write_delay_msec_y, write_delay_msec_y, uint, 0644);
+MODULE_PARM_DESC(write_delay_msec_y, "msec write delay for writes to y");
+
+static unsigned int test_devtype;
+module_param_named(devtype, test_devtype, uint, 0644);
+MODULE_PARM_DESC(devtype, "device type to register");
+
+static bool enable_busy_alloc;
+module_param(enable_busy_alloc, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_busy_alloc, "do a fake allocation during writes");
+
+static bool enable_debugfs;
+module_param(enable_debugfs, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_debugfs, "enable a few debugfs files");
+
+static bool enable_verbose_writes;
+module_param(enable_verbose_writes, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_debugfs, "enable stores to print verbose information");
+
+static unsigned int delay_rmmod_ms;
+module_param_named(delay_rmmod_ms, delay_rmmod_ms, uint, 0644);
+MODULE_PARM_DESC(delay_rmmod_ms, "if set how many ms to delay rmmod before device deletion");
+
+static bool enable_verbose_rmmod;
+module_param(enable_verbose_rmmod, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod");
+
+static int sysfs_test_major;
+
+/**
+ * test_config - used for configuring how the sysfs test device will behave
+ *
+ * @enable_lock: if enabled a lock will be used when reading/storing variables
+ * @enable_lock_on_rmmod: if enabled a lock will be used when reading/storing
+ *	sysfs attributes, but it will also be used to lock on rmmod. This is
+ *	useful to test for a deadlock.
+ * @use_rtnl_lock: if enabled instead of configuration specific mutex, we'll
+ *	use the rtnl_lock. If your test case is modifying this on the fly
+ *	while doing other stores / reads, things will break as a lock can be
+ *	left contending. Best is that tests use this knob serially, without
+ *	allowing userspace to modify other knobs while this one changes.
+ * @write_delay_msec_y: the amount of delay to use when writing to y
+ * @enable_busy_alloc: if enabled we'll do a large allocation between
+ *	writes. We immediately free right away. We also schedule to give the
+ *	kernel some time to re-use any memory we don't need. This is intened
+ *	to mimic typical driver behaviour.
+ */
+struct test_config {
+	bool enable_lock;
+	bool enable_lock_on_rmmod;
+	bool use_rtnl_lock;
+	unsigned int write_delay_msec_y;
+	bool enable_busy_alloc;
+};
+
+/**
+ * enum sysfs_test_devtype - sysfs device type
+ * @TESTDEV_TYPE_MISC: misc device type
+ * @TESTDEV_TYPE_BLOCK: use a block device for the sysfs test device.
+ */
+enum sysfs_test_devtype {
+	TESTDEV_TYPE_MISC = 0,
+	TESTDEV_TYPE_BLOCK,
+};
+
+/**
+ * sysfs_test_device - test device to help test sysfs
+ *
+ * @devtype: the type of device to use
+ * @config: configuration for the test
+ * @config_mutex: protects configuration of test
+ * @misc_dev: we use a misc device under the hood
+ * @disk: represents a disk when used as a block device
+ * @dev: pointer to misc_dev's own struct device
+ * @dev_idx: unique ID for test device
+ * @x: variable we can use to test read / store
+ * @y: slow variable we can use to test read / store
+ */
+struct sysfs_test_device {
+	enum sysfs_test_devtype devtype;
+	struct test_config config;
+	struct mutex config_mutex;
+	struct miscdevice misc_dev;
+	struct gendisk *disk;
+	struct device *dev;
+	int dev_idx;
+	int x;
+	int y;
+};
+
+static struct sysfs_test_device *first_test_dev;
+
+static struct miscdevice *dev_to_misc_dev(struct device *dev)
+{
+	return dev_get_drvdata(dev);
+}
+
+static struct sysfs_test_device *misc_dev_to_test_dev(struct miscdevice *misc_dev)
+{
+	return container_of(misc_dev, struct sysfs_test_device, misc_dev);
+}
+
+static struct sysfs_test_device *devblock_to_test_dev(struct device *dev)
+{
+	return (struct sysfs_test_device *)dev_to_disk(dev)->private_data;
+}
+
+static struct sysfs_test_device *devmisc_to_testdev(struct device *dev)
+{
+	struct miscdevice *misc_dev;
+
+	misc_dev = dev_to_misc_dev(dev);
+	return misc_dev_to_test_dev(misc_dev);
+}
+
+static struct sysfs_test_device *dev_to_test_dev(struct device *dev)
+{
+	if (test_devtype == TESTDEV_TYPE_MISC)
+		return devmisc_to_testdev(dev);
+	else if (test_devtype == TESTDEV_TYPE_BLOCK)
+		return devblock_to_test_dev(dev);
+	return NULL;
+}
+
+static void test_dev_config_lock(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config;
+
+	config = &test_dev->config;
+	if (config->enable_lock) {
+		if (config->use_rtnl_lock)
+			rtnl_lock();
+		else
+			mutex_lock(&test_dev->config_mutex);
+	}
+}
+
+static void test_dev_config_unlock(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config;
+
+	config = &test_dev->config;
+	if (config->enable_lock) {
+		if (config->use_rtnl_lock)
+			rtnl_unlock();
+		else
+			mutex_unlock(&test_dev->config_mutex);
+	}
+}
+
+static void test_dev_config_lock_rmmod(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config;
+
+	config = &test_dev->config;
+	if (config->enable_lock_on_rmmod)
+		test_dev_config_lock(test_dev);
+}
+
+static void test_dev_config_unlock_rmmod(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config;
+
+	config = &test_dev->config;
+	if (config->enable_lock_on_rmmod)
+		test_dev_config_unlock(test_dev);
+}
+
+static void free_test_dev_sysfs(struct sysfs_test_device *test_dev)
+{
+	if (test_dev) {
+		kfree_const(test_dev->misc_dev.name);
+		test_dev->misc_dev.name = NULL;
+		kfree(test_dev);
+		test_dev = NULL;
+	}
+}
+
+static void test_sysfs_reset_vals(struct sysfs_test_device *test_dev)
+{
+	test_dev->x = 3;
+	test_dev->y = 4;
+}
+
+static ssize_t config_show(struct device *dev,
+			   struct device_attribute *attr,
+			   char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+	int len = 0;
+
+	test_dev_config_lock(test_dev);
+
+	len += snprintf(buf, PAGE_SIZE,
+			"Configuration for: %s\n",
+			dev_name(dev));
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"x:\t%d\n",
+			test_dev->x);
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"y:\t%d\n",
+			test_dev->y);
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_lock:\t%s\n",
+			config->enable_lock ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_lock_on_rmmmod:\t%s\n",
+			config->enable_lock_on_rmmod ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"use_rtnl_lock:\t%s\n",
+			config->use_rtnl_lock ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"write_delay_msec_y:\t%d\n",
+			config->write_delay_msec_y);
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_busy_alloc:\t%s\n",
+			config->enable_busy_alloc ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_debugfs:\t%s\n",
+			enable_debugfs ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_verbose_writes:\t%s\n",
+			enable_verbose_writes ? "true" : "false");
+
+	test_dev_config_unlock(test_dev);
+
+	return len;
+}
+static DEVICE_ATTR_RO(config);
+
+static ssize_t reset_store(struct device *dev,
+			   struct device_attribute *attr,
+			   const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+
+	config = &test_dev->config;
+
+	/*
+	 * We compromise and simplify this condition and do not use a lock
+	 * here as the lock type can change.
+	 */
+	config->enable_lock = false;
+	config->enable_lock_on_rmmod = false;
+	config->use_rtnl_lock = false;
+	config->enable_busy_alloc = false;
+	test_sysfs_reset_vals(test_dev);
+
+	dev_info(dev, "reset\n");
+
+	return count;
+}
+static DEVICE_ATTR_WO(reset);
+
+static void test_dev_busy_alloc(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config;
+	char *ignore;
+
+	config = &test_dev->config;
+	if (!config->enable_busy_alloc)
+		return;
+
+	ignore = kzalloc(sizeof(struct sysfs_test_device) * 10, GFP_KERNEL);
+	kfree(ignore);
+
+	schedule();
+}
+
+static ssize_t test_dev_x_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	int ret;
+
+	test_dev_busy_alloc(test_dev);
+	test_dev_config_lock(test_dev);
+
+	ret = kstrtoint(buf, 10, &test_dev->x);
+	if (ret)
+		count = ret;
+
+	if (enable_verbose_writes)
+		dev_info(test_dev->dev, "wrote x = %d\n", test_dev->x);
+
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t test_dev_x_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	int ret;
+
+	test_dev_config_lock(test_dev);
+	ret = snprintf(buf, PAGE_SIZE, "%d\n", test_dev->x);
+	test_dev_config_unlock(test_dev);
+
+	return ret;
+}
+static DEVICE_ATTR_RW(test_dev_x);
+
+static ssize_t test_dev_y_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int y;
+	int ret;
+
+	test_dev_busy_alloc(test_dev);
+	test_dev_config_lock(test_dev);
+
+	config = &test_dev->config;
+
+	ret = kstrtoint(buf, 10, &y);
+	if (ret)
+		count = ret;
+
+	msleep(config->write_delay_msec_y);
+	test_dev->y = test_dev->x + y + 7;
+
+	if (enable_verbose_writes)
+		dev_info(test_dev->dev, "wrote y = %d\n", test_dev->y);
+
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t test_dev_y_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	int ret;
+
+	test_dev_config_lock(test_dev);
+	ret = snprintf(buf, PAGE_SIZE, "%d\n", test_dev->y);
+	test_dev_config_unlock(test_dev);
+
+	return ret;
+}
+static DEVICE_ATTR_RW(test_dev_y);
+
+static ssize_t config_enable_lock_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int ret;
+	int val;
+
+	config = &test_dev->config;
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	/*
+	 * We compromise for simplicty and do not lock when changing
+	 * locking configuration, with the assumption userspace tests
+	 * will know this.
+	 */
+	if (val)
+		config->enable_lock = true;
+	else
+		config->enable_lock = false;
+
+	return count;
+}
+
+static ssize_t config_enable_lock_show(struct device *dev,
+				       struct device_attribute *attr,
+				       char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	ssize_t ret;
+
+	config = &test_dev->config;
+
+	test_dev_config_lock(test_dev);
+	ret = snprintf(buf, PAGE_SIZE, "%d\n", config->enable_lock);
+	test_dev_config_unlock(test_dev);
+
+	return ret;
+}
+static DEVICE_ATTR_RW(config_enable_lock);
+
+static ssize_t config_enable_lock_on_rmmod_store(struct device *dev,
+						 struct device_attribute *attr,
+						 const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int ret;
+	int val;
+
+	config = &test_dev->config;
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	test_dev_config_lock(test_dev);
+	if (val)
+		config->enable_lock_on_rmmod = true;
+	else
+		config->enable_lock_on_rmmod = false;
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t config_enable_lock_on_rmmod_show(struct device *dev,
+						struct device_attribute *attr,
+						char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	ssize_t ret;
+
+	config = &test_dev->config;
+
+	test_dev_config_lock(test_dev);
+	ret = snprintf(buf, PAGE_SIZE, "%d\n", config->enable_lock_on_rmmod);
+	test_dev_config_unlock(test_dev);
+
+	return ret;
+}
+static DEVICE_ATTR_RW(config_enable_lock_on_rmmod);
+
+static ssize_t config_use_rtnl_lock_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int ret;
+	int val;
+
+	config = &test_dev->config;
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	/*
+	 * We compromise and simplify this condition and do not use a lock
+	 * here as the lock type can change.
+	 */
+	if (val)
+		config->use_rtnl_lock = true;
+	else
+		config->use_rtnl_lock = false;
+
+	return count;
+}
+
+static ssize_t config_use_rtnl_lock_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+
+	config = &test_dev->config;
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", config->use_rtnl_lock);
+}
+static DEVICE_ATTR_RW(config_use_rtnl_lock);
+
+static ssize_t config_write_delay_msec_y_store(struct device *dev,
+					       struct device_attribute *attr,
+					       const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int ret;
+	int val;
+
+	config = &test_dev->config;
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	test_dev_config_lock(test_dev);
+	config->write_delay_msec_y = val;
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t config_write_delay_msec_y_show(struct device *dev,
+					      struct device_attribute *attr,
+					      char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+
+	config = &test_dev->config;
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", config->write_delay_msec_y);
+}
+static DEVICE_ATTR_RW(config_write_delay_msec_y);
+
+static ssize_t config_enable_busy_alloc_store(struct device *dev,
+					      struct device_attribute *attr,
+					      const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int ret;
+	int val;
+
+	config = &test_dev->config;
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	test_dev_config_lock(test_dev);
+	config->enable_busy_alloc = val;
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t config_enable_busy_alloc_show(struct device *dev,
+					     struct device_attribute *attr,
+					     char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+
+	config = &test_dev->config;
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", config->enable_busy_alloc);
+}
+static DEVICE_ATTR_RW(config_enable_busy_alloc);
+
+#define TEST_SYSFS_DEV_ATTR(name)		(&dev_attr_##name.attr)
+
+static struct attribute *test_dev_attrs[] = {
+	/* Generic driver knobs go here */
+	TEST_SYSFS_DEV_ATTR(config),
+	TEST_SYSFS_DEV_ATTR(reset),
+
+	/* These are used to test sysfs */
+	TEST_SYSFS_DEV_ATTR(test_dev_x),
+	TEST_SYSFS_DEV_ATTR(test_dev_y),
+
+	/*
+	 * These are configuration knobs to modify how we test sysfs when
+	 * doing reads / stores.
+	 */
+	TEST_SYSFS_DEV_ATTR(config_enable_lock),
+	TEST_SYSFS_DEV_ATTR(config_enable_lock_on_rmmod),
+	TEST_SYSFS_DEV_ATTR(config_use_rtnl_lock),
+	TEST_SYSFS_DEV_ATTR(config_write_delay_msec_y),
+	TEST_SYSFS_DEV_ATTR(config_enable_busy_alloc),
+
+	NULL,
+};
+
+ATTRIBUTE_GROUPS(test_dev);
+
+static int sysfs_test_dev_alloc_miscdev(struct sysfs_test_device *test_dev)
+{
+	struct miscdevice *misc_dev;
+
+	misc_dev = &test_dev->misc_dev;
+	misc_dev->minor = MISC_DYNAMIC_MINOR;
+	misc_dev->name = kasprintf(GFP_KERNEL, "test_sysfs%d", test_dev->dev_idx);
+	if (!misc_dev->name) {
+		pr_err("Cannot alloc misc_dev->name\n");
+		return -ENOMEM;
+	}
+	misc_dev->groups = test_dev_groups;
+
+	return 0;
+}
+
+static int testdev_open(struct block_device *bdev, fmode_t mode)
+{
+	return -EINVAL;
+}
+
+static blk_qc_t testdev_submit_bio(struct bio *bio)
+{
+	return BLK_QC_T_NONE;
+}
+
+static void testdev_slot_free_notify(struct block_device *bdev,
+				     unsigned long index)
+{
+}
+
+static int testdev_rw_page(struct block_device *bdev, sector_t sector,
+			   struct page *page, unsigned int op)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct block_device_operations sysfs_testdev_ops = {
+	.open = testdev_open,
+	.submit_bio = testdev_submit_bio,
+	.swap_slot_free_notify = testdev_slot_free_notify,
+	.rw_page = testdev_rw_page,
+	.owner = THIS_MODULE
+};
+
+static int sysfs_test_dev_alloc_blockdev(struct sysfs_test_device *test_dev)
+{
+	int ret = -ENOMEM;
+
+	test_dev->disk = blk_alloc_disk(NUMA_NO_NODE);
+	if (!test_dev->disk) {
+		pr_err("Error allocating disk structure for device %d\n",
+		       test_dev->dev_idx);
+		goto out;
+	}
+
+	test_dev->disk->major = sysfs_test_major;
+	test_dev->disk->first_minor = test_dev->dev_idx + 1;
+	test_dev->disk->fops = &sysfs_testdev_ops;
+	test_dev->disk->private_data = test_dev;
+	snprintf(test_dev->disk->disk_name, 16, "test_sysfs%d",
+		 test_dev->dev_idx);
+	set_capacity(test_dev->disk, 0);
+	blk_queue_flag_set(QUEUE_FLAG_NONROT, test_dev->disk->queue);
+	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, test_dev->disk->queue);
+	blk_queue_physical_block_size(test_dev->disk->queue, PAGE_SIZE);
+	blk_queue_max_discard_sectors(test_dev->disk->queue, UINT_MAX);
+	blk_queue_flag_set(QUEUE_FLAG_DISCARD, test_dev->disk->queue);
+
+	return 0;
+out:
+	return ret;
+}
+
+static struct sysfs_test_device *alloc_test_dev_sysfs(int idx)
+{
+	struct sysfs_test_device *test_dev;
+	int ret;
+
+	switch (test_devtype) {
+	case TESTDEV_TYPE_MISC:
+	       fallthrough;
+	case TESTDEV_TYPE_BLOCK:
+		break;
+	default:
+		return NULL;
+	}
+
+	test_dev = kzalloc(sizeof(struct sysfs_test_device), GFP_KERNEL);
+	if (!test_dev)
+		goto err_out;
+
+	mutex_init(&test_dev->config_mutex);
+	test_dev->dev_idx = idx;
+	test_dev->devtype = test_devtype;
+
+	if (test_dev->devtype == TESTDEV_TYPE_MISC) {
+		ret = sysfs_test_dev_alloc_miscdev(test_dev);
+		if (ret)
+			goto err_out_free;
+	} else if (test_dev->devtype == TESTDEV_TYPE_BLOCK) {
+		ret = sysfs_test_dev_alloc_blockdev(test_dev);
+		if (ret)
+			goto err_out_free;
+	}
+	return test_dev;
+
+err_out_free:
+	kfree(test_dev);
+	test_dev = NULL;
+err_out:
+	return NULL;
+}
+
+static int register_test_dev_sysfs_misc(struct sysfs_test_device *test_dev)
+{
+	int ret;
+
+	ret = misc_register(&test_dev->misc_dev);
+	if (ret)
+		return ret;
+
+	test_dev->dev = test_dev->misc_dev.this_device;
+
+	return 0;
+}
+
+static int register_test_dev_sysfs_block(struct sysfs_test_device *test_dev)
+{
+	device_add_disk(NULL, test_dev->disk, test_dev_groups);
+	test_dev->dev = disk_to_dev(test_dev->disk);
+
+	return 0;
+}
+
+static struct sysfs_test_device *register_test_dev_sysfs(void)
+{
+	struct sysfs_test_device *test_dev = NULL;
+	int ret;
+
+	test_dev = alloc_test_dev_sysfs(0);
+	if (!test_dev)
+		goto out;
+
+	if (test_dev->devtype == TESTDEV_TYPE_MISC) {
+		ret = register_test_dev_sysfs_misc(test_dev);
+		if (ret) {
+			pr_err("could not register misc device: %d\n", ret);
+			goto out_free_dev;
+		}
+	} else if (test_dev->devtype == TESTDEV_TYPE_BLOCK) {
+		ret = register_test_dev_sysfs_block(test_dev);
+		if (ret) {
+			pr_err("could not register block device: %d\n", ret);
+			goto out_free_dev;
+		}
+	}
+
+	dev_info(test_dev->dev, "interface ready\n");
+
+out:
+	return test_dev;
+out_free_dev:
+	free_test_dev_sysfs(test_dev);
+	return NULL;
+}
+
+static struct sysfs_test_device *register_test_dev_set_config(void)
+{
+	struct sysfs_test_device *test_dev;
+	struct test_config *config;
+
+	test_dev = register_test_dev_sysfs();
+	if (!test_dev)
+		return NULL;
+
+	config = &test_dev->config;
+
+	if (enable_lock)
+		config->enable_lock = true;
+	if (enable_lock_on_rmmod)
+		config->enable_lock_on_rmmod = true;
+	if (use_rtnl_lock)
+		config->use_rtnl_lock = true;
+	if (enable_busy_alloc)
+		config->enable_busy_alloc = true;
+
+	config->write_delay_msec_y = write_delay_msec_y;
+	test_sysfs_reset_vals(test_dev);
+
+	return test_dev;
+}
+
+static void unregister_test_dev_sysfs_misc(struct sysfs_test_device *test_dev)
+{
+	misc_deregister(&test_dev->misc_dev);
+}
+
+static void unregister_test_dev_sysfs_block(struct sysfs_test_device *test_dev)
+{
+	del_gendisk(test_dev->disk);
+	blk_cleanup_disk(test_dev->disk);
+}
+
+static void unregister_test_dev_sysfs(struct sysfs_test_device *test_dev)
+{
+	test_dev_config_lock_rmmod(test_dev);
+
+	dev_info(test_dev->dev, "removing interface\n");
+
+	if (test_dev->devtype == TESTDEV_TYPE_MISC)
+		unregister_test_dev_sysfs_misc(test_dev);
+	else if (test_dev->devtype == TESTDEV_TYPE_BLOCK)
+		unregister_test_dev_sysfs_block(test_dev);
+
+	test_dev_config_unlock_rmmod(test_dev);
+
+	free_test_dev_sysfs(test_dev);
+}
+
+static struct dentry *debugfs_dir;
+
+/* When read represents how many times we have reset the first_test_dev */
+static u8 reset_first_test_dev;
+
+static ssize_t read_reset_first_test_dev(struct file *file,
+					 char __user *user_buf,
+					 size_t count, loff_t *ppos)
+{
+	ssize_t len;
+	char buf[32];
+
+	reset_first_test_dev++;
+	len = sprintf(buf, "%d\n", reset_first_test_dev);
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t write_reset_first_test_dev(struct file *file,
+					  const char __user *user_buf,
+					  size_t count, loff_t *ppos)
+{
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	if (!first_test_dev) {
+		module_put(THIS_MODULE);
+		return -ENODEV;
+	}
+
+	dev_info(first_test_dev->dev, "going to reset first interface ...\n");
+
+	unregister_test_dev_sysfs(first_test_dev);
+	first_test_dev = register_test_dev_set_config();
+
+	dev_info(first_test_dev->dev, "first interface reset complete\n");
+
+	module_put(THIS_MODULE);
+
+	return count;
+}
+
+static const struct file_operations fops_reset_first_test_dev = {
+	.read = read_reset_first_test_dev,
+	.write = write_reset_first_test_dev,
+	.open = simple_open,
+	.owner = THIS_MODULE,
+	.llseek = default_llseek,
+};
+
+static int __init test_sysfs_init(void)
+{
+	first_test_dev = register_test_dev_set_config();
+	if (!first_test_dev)
+		return -ENOMEM;
+
+	if (!enable_debugfs)
+		return 0;
+
+	debugfs_dir = debugfs_create_dir("test_sysfs", NULL);
+	if (!debugfs_dir) {
+		unregister_test_dev_sysfs(first_test_dev);
+		return -ENOMEM;
+	}
+
+	debugfs_create_file("reset_first_test_dev", 0600, debugfs_dir,
+			    NULL, &fops_reset_first_test_dev);
+	return 0;
+}
+module_init(test_sysfs_init);
+
+static void __exit test_sysfs_exit(void)
+{
+	if (enable_debugfs)
+		debugfs_remove(debugfs_dir);
+	if (delay_rmmod_ms)
+		msleep(delay_rmmod_ms);
+	unregister_test_dev_sysfs(first_test_dev);
+	if (enable_verbose_rmmod)
+		pr_info("unregister_test_dev_sysfs() completed\n");
+	first_test_dev = NULL;
+}
+module_exit(test_sysfs_exit);
+
+MODULE_AUTHOR("Luis Chamberlain <mcgrof@kernel.org>");
+MODULE_LICENSE("GPL");
diff --git a/tools/testing/selftests/sysfs/Makefile b/tools/testing/selftests/sysfs/Makefile
new file mode 100644
index 000000000000..fde99caa2338
--- /dev/null
+++ b/tools/testing/selftests/sysfs/Makefile
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0-only
+# Makefile for sysfs selftests.
+
+# No binaries, but make sure arg-less "make" doesn't trigger "run_tests".
+all:
+
+TEST_PROGS := sysfs.sh
+
+include ../lib.mk
+
+# Nothing to clean up.
+clean:
diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config
new file mode 100644
index 000000000000..9196f452ecd5
--- /dev/null
+++ b/tools/testing/selftests/sysfs/config
@@ -0,0 +1,2 @@
+CONFIG_SYSFS=m
+CONFIG_TEST_SYSFS=m
diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
new file mode 100755
index 000000000000..b3f4c2236c7f
--- /dev/null
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -0,0 +1,1208 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
+#
+# This program is free software; you can redistribute it and/or modify it
+# under the terms of the GNU General Public License as published by the Free
+# Software Foundation; either version 2 of the License, or at your option any
+# later version; or, when distributed separately from the Linux kernel or
+# when incorporated into other software packages, subject to the following
+# license:
+#
+# This program is free software; you can redistribute it and/or modify it
+# under the terms of copyleft-next (version 0.3.1 or later) as published
+# at http://copyleft-next.org/.
+
+# This performs a series tests against the sysfs filesystem.
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+TEST_NAME="sysfs"
+TEST_DRIVER="test_${TEST_NAME}"
+TEST_DIR=$(dirname $0)
+TEST_FILE=$(mktemp)
+
+# This represents
+#
+# TEST_ID:TEST_COUNT:ENABLED:TARGET
+#
+# TEST_ID: is the test id number
+# TEST_COUNT: number of times we should run the test
+# ENABLED: 1 if enabled, 0 otherwise
+# TARGET: test target file required on the test_sysfs module
+#
+# Once these are enabled please leave them as-is. Write your own test,
+# we have tons of space.
+ALL_TESTS="0001:3:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0002:3:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0003:3:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0004:3:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0005:1:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0006:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0007:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0008:1:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0009:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0010:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0011:1:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0012:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0013:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0014:3:1:test_dev_x:block" # block equivalent set
+ALL_TESTS="$ALL_TESTS 0015:3:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0016:3:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0017:3:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0018:1:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0019:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0020:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0021:1:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0022:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0023:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0024:1:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
+ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+
+allow_user_defaults()
+{
+	if [ -z $DIR ]; then
+		case $TEST_DEV_TYPE in
+		misc)
+			DIR="/sys/devices/virtual/misc/${TEST_DRIVER}0"
+			;;
+		block)
+			DIR="/sys/devices/virtual/block/${TEST_DRIVER}0"
+			;;
+		*)
+			DIR="/sys/devices/virtual/misc/${TEST_DRIVER}0"
+			;;
+		esac
+	fi
+	case $TEST_DEV_TYPE in
+		misc)
+			MODPROBE_TESTDEV_TYPE=""
+			;;
+		block)
+			MODPROBE_TESTDEV_TYPE="devtype=1"
+			;;
+		*)
+			MODPROBE_TESTDEV_TYPE=""
+			;;
+	esac
+	if [ -z $SYSFS_DEBUGFS_DIR ]; then
+		SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs"
+	fi
+	if [ -z $PAGE_SIZE ]; then
+		PAGE_SIZE=$(getconf PAGESIZE)
+	fi
+	if [ -z $MAX_DIGITS ]; then
+		MAX_DIGITS=$(($PAGE_SIZE/8))
+	fi
+	if [ -z $INT_MAX ]; then
+		INT_MAX=$(getconf INT_MAX)
+	fi
+	if [ -z $UINT_MAX ]; then
+		UINT_MAX=$(getconf UINT_MAX)
+	fi
+}
+
+test_reqs()
+{
+	uid=$(id -u)
+	if [ $uid -ne 0 ]; then
+		echo $msg must be run as root >&2
+		exit $ksft_skip
+	fi
+
+	if ! which modprobe 2> /dev/null > /dev/null; then
+		echo "$0: You need modprobe installed" >&2
+		exit $ksft_skip
+	fi
+	if ! which getconf 2> /dev/null > /dev/null; then
+		echo "$0: You need getconf installed"
+		exit $ksft_skip
+	fi
+	if ! which diff 2> /dev/null > /dev/null; then
+		echo "$0: You need diff installed"
+		exit $ksft_skip
+	fi
+	if ! which perl 2> /dev/null > /dev/null; then
+		echo "$0: You need perl installed"
+		exit $ksft_skip
+	fi
+}
+
+call_modprobe()
+{
+	modprobe $TEST_DRIVER $MODPROBE_TESTDEV_TYPE $FIRST_MODPROBE_ARGS $MODPROBE_ARGS
+	return $?
+}
+
+modprobe_reset()
+{
+	modprobe -q -r $TEST_DRIVER
+	call_modprobe
+	return $?
+}
+
+modprobe_reset_enable_debugfs()
+{
+	FIRST_MODPROBE_ARGS="enable_debugfs=1"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
+modprobe_reset_enable_lock_on_rmmod()
+{
+	FIRST_MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 enable_verbose_writes=1"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
+modprobe_reset_enable_rtnl_lock_on_rmmod()
+{
+	FIRST_MODPROBE_ARGS="enable_lock=1 use_rtnl_lock=1 enable_lock_on_rmmod=1"
+	FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_writes=1"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
+load_req_mod()
+{
+	modprobe_reset
+	if [ ! -d $DIR ]; then
+		if ! modprobe -q -n $TEST_DRIVER; then
+			echo "$0: module $TEST_DRIVER not found [SKIP]"
+			echo "You must set CONFIG_TEST_SYSFS=m in your kernel" >&2
+			exit $ksft_skip
+		fi
+		call_modprobe
+		if [ $? -ne 0 ]; then
+			echo "$0: modprobe $TEST_DRIVER failed."
+			exit
+		fi
+	fi
+}
+
+config_reset()
+{
+	if ! echo -n "1" >"$DIR"/reset; then
+		echo "$0: reset should have worked" >&2
+		exit 1
+	fi
+}
+
+debugfs_reset_first_test_dev_ignore_errors()
+{
+	echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev
+}
+
+set_orig()
+{
+	if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then
+		if [ -f ${TARGET} ]; then
+			echo "${ORIG}" > "${TARGET}"
+		fi
+	fi
+}
+
+set_test()
+{
+	echo "${TEST_STR}" > "${TARGET}"
+}
+
+set_test_ignore_errors()
+{
+	echo "${TEST_STR}" > "${TARGET}" 2> /dev/null
+}
+
+verify()
+{
+	local seen
+	seen=$(cat "$1")
+	target_short=$(basename $TARGET)
+	case $target_short in
+	test_dev_x)
+		if [ "${seen}" != "${TEST_STR}" ]; then
+			return 1
+		fi
+		;;
+	test_dev_y)
+		DIRNAME=$(dirname $1)
+		EXPECTED_RESULT=""
+		# If our target was the test file then what we write to it
+		# is the same as what that we expect when we read from it.
+		# When we write to test_dev_y directly though we expect
+		# a computed value which is driver specific.
+		if [[ "$DIRNAME" == "/tmp" ]]; then
+			let EXPECTED_RESULT="${TEST_STR}"
+		else
+			x=$(cat ${DIR}/test_dev_x)
+			let EXPECTED_RESULT="$x+${TEST_STR}+7"
+		fi
+
+		if [[ "${seen}" != "${EXPECTED_RESULT}" ]]; then
+			return 1
+		fi
+		;;
+	*)
+		echo "Unsupported target type update test script: $target_short"
+		exit 1
+	esac
+	return 0
+}
+
+verify_diff_w()
+{
+	echo "$TEST_STR" | diff -q -w -u - $1 > /dev/null
+	return $?
+}
+
+test_rc()
+{
+	if [[ $rc != 0 ]]; then
+		echo "Failed test, return value: $rc" >&2
+		exit $rc
+	fi
+}
+
+test_finish()
+{
+	set_orig
+	rm -f "${TEST_FILE}"
+
+	if [ ! -z ${old_strict} ]; then
+		echo ${old_strict} > ${WRITES_STRICT}
+	fi
+	exit $rc
+}
+
+# kernfs requires us to write everything we want in one shot because
+# There is no easy way for us to know if userspace is only doing a partial
+# write, so we don't support them. We expect the entire buffer to come on
+# the first write.  If you're writing a value, first read the file,
+# modify only the value you're changing, then write entire buffer back.
+# Since we are only testing digits we just full single writes and old stuff.
+# For more details, refer to kernfs_fop_write_iter().
+run_numerictests_single_write()
+{
+	echo "== Testing sysfs behavior against ${TARGET} =="
+
+	rc=0
+
+	echo -n "Writing test file ... "
+	echo "${TEST_STR}" > "${TEST_FILE}"
+	if ! verify "${TEST_FILE}"; then
+		echo "FAIL" >&2
+		exit 1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Checking the sysfs file is not set to test value ... "
+	if verify "${TARGET}"; then
+		echo "FAIL" >&2
+		exit 1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Writing to sysfs file from shell ... "
+	set_test
+	if ! verify "${TARGET}"; then
+		echo "FAIL" >&2
+		exit 1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Resetting sysfs file to original value ... "
+	set_orig
+	if verify "${TARGET}"; then
+		echo "FAIL" >&2
+		exit 1
+	else
+		echo "ok"
+	fi
+
+	# Now that we've validated the sanity of "set_test" and "set_orig",
+	# we can use those functions to set starting states before running
+	# specific behavioral tests.
+
+	echo -n "Writing to the entire sysfs file in a single write ... "
+	set_orig
+	dd if="${TEST_FILE}" of="${TARGET}" bs=4096 2>/dev/null
+	if ! verify "${TARGET}"; then
+		echo "FAIL" >&2
+		rc=1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Writing to the sysfs file with multiple long writes ... "
+	set_orig
+	(perl -e 'print "A" x 50;'; echo "${TEST_STR}") | \
+		dd of="${TARGET}" bs=50 2>/dev/null
+	if verify "${TARGET}"; then
+		echo "FAIL" >&2
+		rc=1
+	else
+		echo "ok"
+	fi
+	test_rc
+}
+
+reset_vals()
+{
+	echo -n 3 > $DIR/test_dev_x
+	echo -n 4 > $DIR/test_dev_x
+}
+
+check_failure()
+{
+	echo -n "Testing that $1 fails as expected..."
+	reset_vals
+	TEST_STR="$1"
+	orig="$(cat $TARGET)"
+	echo -n "$TEST_STR" > $TARGET 2> /dev/null
+
+	# write should fail and $TARGET should retain its original value
+	if [ $? = 0 ] || [ "$(cat $TARGET)" != "$orig" ]; then
+		echo "FAIL" >&2
+		rc=1
+	else
+		echo "ok"
+	fi
+	test_rc
+}
+
+load_modreqs()
+{
+	export TEST_DEV_TYPE=$(get_test_type $1)
+	unset DIR
+	allow_user_defaults
+	load_req_mod
+}
+
+target_exists()
+{
+	TARGET="${DIR}/$1"
+	TEST_ID="$2"
+
+	if [ ! -f ${TARGET} ] ; then
+		echo "Target for test $TEST_ID: $TARGET does not exist, skipping test ..."
+		return 0
+	fi
+	return 1
+}
+
+config_enable_lock()
+{
+	if ! echo -n 1 > $DIR/config_enable_lock; then
+		echo "$0: Unable to enable locks" >&2
+		exit 1
+	fi
+}
+
+config_write_delay_msec_y()
+{
+	if ! echo -n $1 > $DIR/config_write_delay_msec_y ; then
+		echo "$0: Unable to set write_delay_msec_y to $1" >&2
+		exit 1
+	fi
+}
+
+# Default filter for dmesg scanning.
+# Ignore lockdep complaining about its own bugginess when scanning dmesg
+# output, because we shouldn't be failing filesystem tests on account of
+# lockdep.
+_check_dmesg_filter()
+{
+	egrep -v -e "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low" \
+		-e "BUG: MAX_STACK_TRACE_ENTRIES too low"
+}
+
+check_dmesg()
+{
+	# filter out intentional WARNINGs or Oopses
+	local filter=${1:-_check_dmesg_filter}
+
+	_dmesg_since_test_start | $filter >$seqres.dmesg
+	egrep -q -e "kernel BUG at" \
+	     -e "WARNING:" \
+	     -e "\bBUG:" \
+	     -e "Oops:" \
+	     -e "possible recursive locking detected" \
+	     -e "Internal error" \
+	     -e "(INFO|ERR): suspicious RCU usage" \
+	     -e "INFO: possible circular locking dependency detected" \
+	     -e "general protection fault:" \
+	     -e "BUG .* remaining" \
+	     -e "UBSAN:" \
+	     $seqres.dmesg
+	if [ $? -eq 0 ]; then
+		echo "something found in dmesg (see $seqres.dmesg)"
+		return 1
+	else
+		if [ "$KEEP_DMESG" != "yes" ]; then
+			rm -f $seqres.dmesg
+		fi
+		return 0
+	fi
+}
+
+log_kernel_fstest_dmesg()
+{
+	export FSTYP="$1"
+	export seqnum="$FSTYP/$2"
+	export date_time=$(date +"%F %T")
+	echo "run fstests $seqnum at $date_time" > /dev/kmsg
+}
+
+modprobe_loop()
+{
+	while true; do
+		call_modprobe > /dev/null 2>&1
+		modprobe -r $TEST_DRIVER > /dev/null 2>&1
+	done > /dev/null 2>&1
+}
+
+write_loop()
+{
+	while true; do
+		set_test_ignore_errors > /dev/null 2>&1
+		TEST_STR=$(( $TEST_STR + 1 ))
+	done > /dev/null 2>&1
+}
+
+write_loop_reset()
+{
+	while true; do
+		set_test_ignore_errors > /dev/null 2>&1
+		debugfs_reset_first_test_dev_ignore_errors > /dev/null 2>&1
+	done > /dev/null 2>&1
+}
+
+write_loop_bg()
+{
+	BG_WRITES=1000 > /dev/null 2>&1
+	while true; do
+		for i in $(seq 1 $BG_WRITES); do
+			set_test_ignore_errors > /dev/null 2>&1 &
+			TEST_STR=$(( $TEST_STR + 1 ))
+		done > /dev/null 2>&1
+		wait
+	done > /dev/null 2>&1
+	wait
+}
+
+reset_loop()
+{
+	while true; do
+		debugfs_reset_first_test_dev_ignore_errors > /dev/null 2>&1
+	done > /dev/null 2>&1
+}
+
+kill_trigger_loop()
+{
+
+	local my_first_loop_pid=$1
+	local my_second_loop_pid=$2
+	local my_sleep_max=$3
+	local my_loop=0
+
+	while true; do
+		sleep 1
+		if [[ $my_loop -ge $my_sleep_max ]]; then
+			break
+		fi
+		let my_loop=$my_loop+1
+	done
+
+	kill -s TERM $my_first_loop_pid 2>&1 > /dev/null
+	kill -s TERM $my_second_loop_pid 2>&1 > /dev/null
+}
+
+_dmesg_since_test_start()
+{
+	# search the dmesg log of last run of $seqnum for possible failures
+	# use sed \cregexpc address type, since $seqnum contains "/"
+	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac
+}
+
+sysfs_test_0001()
+{
+	TARGET="${DIR}/$(get_test_target 0001)"
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0002()
+{
+	TARGET="${DIR}/$(get_test_target 0002)"
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0003()
+{
+	TARGET="${DIR}/$(get_test_target 0003)"
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	config_enable_lock
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0004()
+{
+	TARGET="${DIR}/$(get_test_target 0004)"
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	config_enable_lock
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0005()
+{
+	TARGET="${DIR}/$(get_test_target 0005)"
+	modprobe_reset
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing x while loading/unloading the module... "
+
+	modprobe_loop &
+	modprobe_pid=$!
+
+	write_loop &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0006()
+{
+	TARGET="${DIR}/$(get_test_target 0006)"
+	modprobe_reset
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y while loading/unloading the module... "
+	modprobe_loop &
+	modprobe_pid=$!
+
+	write_loop &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0007()
+{
+	TARGET="${DIR}/$(get_test_target 0007)"
+	modprobe_reset
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y with a larger delay while loading/unloading the module... "
+
+	MODPROBE_ARGS="write_delay_msec_y=1500"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	write_loop &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0008()
+{
+	TARGET="${DIR}/$(get_test_target 0008)"
+	modprobe_reset
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop busy writing x while loading/unloading the module... "
+
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+
+	write_loop_bg > /dev/null 2>&1 &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0009()
+{
+	TARGET="${DIR}/$(get_test_target 0009)"
+	modprobe_reset
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop busy writing y while loading/unloading the module... "
+
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+
+	write_loop_bg > /dev/null 2>&1 &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0010()
+{
+	TARGET="${DIR}/$(get_test_target 0010)"
+	modprobe_reset
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop busy writing y with a larger delay while loading/unloading the module... "
+	modprobe -q -r $TEST_DRIVER > /dev/null 2>&1
+
+	MODPROBE_ARGS="write_delay_msec_y=1500"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	write_loop_bg > /dev/null 2>&1 &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0011()
+{
+	TARGET="${DIR}/$(get_test_target 0011)"
+	modprobe_reset_enable_debugfs
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing x and resetting ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	reset_loop > /dev/null 2>&1 &
+	reset_pid=$!
+
+	kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0012()
+{
+	TARGET="${DIR}/$(get_test_target 0012)"
+	modprobe_reset_enable_debugfs
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y and resetting ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	reset_loop > /dev/null 2>&1 &
+	reset_pid=$!
+
+	kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0013()
+{
+	TARGET="${DIR}/$(get_test_target 0013)"
+	modprobe_reset_enable_debugfs
+	config_reset
+	reset_vals
+	config_write_delay_msec_y 1500
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y with a larger delay and resetting ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	reset_loop > /dev/null 2>&1 &
+	reset_pid=$!
+
+	kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0014()
+{
+	sysfs_test_0001
+}
+
+sysfs_test_0015()
+{
+	sysfs_test_0002
+}
+
+sysfs_test_0016()
+{
+	sysfs_test_0003
+}
+
+sysfs_test_0017()
+{
+	sysfs_test_0004
+}
+
+sysfs_test_0018()
+{
+	sysfs_test_0005
+}
+
+sysfs_test_0019()
+{
+	sysfs_test_0006
+}
+
+sysfs_test_0020()
+{
+	sysfs_test_0007
+}
+
+sysfs_test_0021()
+{
+	sysfs_test_0008
+}
+
+sysfs_test_0022()
+{
+	sysfs_test_0009
+}
+
+sysfs_test_0023()
+{
+	sysfs_test_0010
+}
+
+sysfs_test_0024()
+{
+	sysfs_test_0011
+}
+
+sysfs_test_0025()
+{
+	sysfs_test_0012
+}
+
+sysfs_test_0026()
+{
+	sysfs_test_0013
+}
+
+sysfs_test_0027()
+{
+	TARGET="${DIR}/$(get_test_target 0027)"
+	modprobe_reset_enable_lock_on_rmmod
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Test for possible rmmod deadlock while writing x ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 enable_verbose_writes=1"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0028()
+{
+	TARGET="${DIR}/$(get_test_target 0028)"
+	modprobe_reset_enable_lock_on_rmmod
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Test for possible rmmod deadlock using rtnl_lock while writing x ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 use_rtnl_lock=1 enable_verbose_writes=1"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+test_gen_desc()
+{
+	echo -n "$1 x $(get_test_count $1)"
+}
+
+list_tests()
+{
+	echo "Test ID list:"
+	echo
+	echo "TEST_ID x NUM_TEST"
+	echo "TEST_ID:   Test ID"
+	echo "NUM_TESTS: Number of recommended times to run the test"
+	echo
+	echo "$(test_gen_desc 0001) - misc test writing x in different ways"
+	echo "$(test_gen_desc 0002) - misc test writing y in different ways"
+	echo "$(test_gen_desc 0003) - misc test writing x in different ways using a mutex lock"
+	echo "$(test_gen_desc 0004) - misc test writing y in different ways using a mutex lock"
+	echo "$(test_gen_desc 0005) - misc test writing x load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0006) - misc writing y load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0007) - misc test writing y larger delay, load, remove test_sysfs"
+	echo "$(test_gen_desc 0008) - misc test busy writing x remove test_sysfs module"
+	echo "$(test_gen_desc 0009) - misc test busy writing y remove the test_sysfs module"
+	echo "$(test_gen_desc 0010) - misc test busy writing y larger delay, remove test_sysfs"
+	echo "$(test_gen_desc 0011) - misc test writing x and resetting device"
+	echo "$(test_gen_desc 0012) - misc test writing y and resetting device"
+	echo "$(test_gen_desc 0013) - misc test writing y with a larger delay and resetting device"
+	echo "$(test_gen_desc 0014) - block test writing x in different ways"
+	echo "$(test_gen_desc 0015) - block test writing y in different ways"
+	echo "$(test_gen_desc 0016) - block test writing x in different ways using a mutex lock"
+	echo "$(test_gen_desc 0017) - block test writing y in different ways using a mutex lock"
+	echo "$(test_gen_desc 0018) - block test writing x load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0019) - block test writing y load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0020) - block test writing y larger delay, load, remove test_sysfs"
+	echo "$(test_gen_desc 0021) - block test busy writing x remove the test_sysfs module"
+	echo "$(test_gen_desc 0022) - block test busy writing y remove the test_sysfs module"
+	echo "$(test_gen_desc 0023) - block test busy writing y larger delay, remove test_sysfs"
+	echo "$(test_gen_desc 0024) - block test writing x and resetting device"
+	echo "$(test_gen_desc 0025) - block test writing y and resetting device"
+	echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device"
+	echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... "
+	echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..."
+}
+
+usage()
+{
+	NUM_TESTS=$(grep -o ' ' <<<"$ALL_TESTS" | grep -c .)
+	let NUM_TESTS=$NUM_TESTS+1
+	MAX_TEST=$(printf "%04d\n" $NUM_TESTS)
+	echo "Usage: $0 [ -t <4-number-digit> ] | [ -w <4-number-digit> ] |"
+	echo "		 [ -s <4-number-digit> ] | [ -c <4-number-digit> <test- count>"
+	echo "           [ all ] [ -h | --help ] [ -l ]"
+	echo ""
+	echo "Valid tests: 0001-$MAX_TEST"
+	echo ""
+	echo "    all     Runs all tests (default)"
+	echo "    -t      Run test ID the number amount of times is recommended"
+	echo "    -w      Watch test ID run until it runs into an error"
+	echo "    -c      Run test ID once"
+	echo "    -s      Run test ID x test-count number of times"
+	echo "    -l      List all test ID list"
+	echo " -h|--help  Help"
+	echo
+	echo "If an error every occurs execution will immediately terminate."
+	echo "If you are adding a new test try using -w <test-ID> first to"
+	echo "make sure the test passes a series of tests."
+	echo
+	echo Example uses:
+	echo
+	echo "$TEST_NAME.sh            -- executes all tests"
+	echo "$TEST_NAME.sh -t 0002    -- Executes test ID 0002 number of times is recomended"
+	echo "$TEST_NAME.sh -w 0002    -- Watch test ID 0002 run until an error occurs"
+	echo "$TEST_NAME.sh -s 0002    -- Run test ID 0002 once"
+	echo "$TEST_NAME.sh -c 0002 3  -- Run test ID 0002 three times"
+	echo
+	list_tests
+	exit 1
+}
+
+test_num()
+{
+	re='^[0-9]+$'
+	if ! [[ $1 =~ $re ]]; then
+		usage
+	fi
+}
+
+get_test_count()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $2}'
+}
+
+get_test_enabled()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $3}'
+}
+
+get_test_target()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $4}'
+}
+
+get_test_type()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $5}'
+}
+
+run_all_tests()
+{
+	for i in $ALL_TESTS ; do
+		TEST_ID=$(echo $i | awk -F":" '{print $1}')
+		ENABLED=$(get_test_enabled $TEST_ID)
+		TEST_COUNT=$(get_test_count $TEST_ID)
+		TEST_TARGET=$(get_test_target $TEST_ID)
+		if [[ $ENABLED -eq "1" ]]; then
+			test_case $TEST_ID $TEST_COUNT $TEST_TARGET
+		else
+			echo -n "Skipping test $TEST_ID as its disabled, likely "
+			echo "could crash your system ..."
+		fi
+	done
+}
+
+watch_log()
+{
+	if [ $# -ne 3 ]; then
+		clear
+	fi
+	echo "Running test: $2 - run #$1"
+}
+
+watch_case()
+{
+	i=0
+	while [ 1 ]; do
+		if [ $# -eq 1 ]; then
+			test_num $1
+			watch_log $i ${TEST_NAME}_test_$1
+			log_kernel_fstest_dmesg sysfs $1
+			RUN_TEST=${TEST_NAME}_test_$1
+			$RUN_TEST
+			check_dmesg
+			if [[ $? -ne 0 ]]; then
+				exit 1
+			fi
+		else
+			watch_log $i all
+			run_all_tests
+		fi
+		let i=$i+1
+	done
+}
+
+test_case()
+{
+	NUM_TESTS=$2
+
+	i=0
+
+	load_modreqs $1
+	if target_exists $3 $1; then
+		return
+	fi
+
+	while [[ $i -lt $NUM_TESTS ]]; do
+		test_num $1
+		watch_log $i ${TEST_NAME}_test_$1 noclear
+		log_kernel_fstest_dmesg sysfs $1
+		RUN_TEST=${TEST_NAME}_test_$1
+		$RUN_TEST
+		let i=$i+1
+	done
+	check_dmesg
+	if [[ $? -ne 0 ]]; then
+		exit 1
+	fi
+}
+
+parse_args()
+{
+	if [ $# -eq 0 ]; then
+		run_all_tests
+	else
+		if [[ "$1" = "all" ]]; then
+			run_all_tests
+		elif [[ "$1" = "-w" ]]; then
+			shift
+			watch_case $@
+		elif [[ "$1" = "-t" ]]; then
+			shift
+			test_num $1
+			test_case $1 $(get_test_count $1) $(get_test_target $1)
+			shift
+		elif [[ "$1" = "-c" ]]; then
+			shift
+			test_num $1
+			test_num $2
+			test_case $1 $2 $(get_test_target $1)
+			shift
+			shift
+		elif [[ "$1" = "-s" ]]; then
+			shift
+			test_case $1 1 $(get_test_target $1)
+			shift
+		elif [[ "$1" = "-l" ]]; then
+			list_tests
+			shift
+		elif [[ "$1" = "-h" || "$1" = "--help" ]]; then
+			usage
+		else
+			usage
+		fi
+	fi
+}
+
+test_reqs
+allow_user_defaults
+
+trap "test_finish" EXIT
+
+parse_args $@
+
+exit 0
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 03/12] selftests: add tests_sysfs module
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams

This adds a new selftest module which can be used to test sysfs, which
would otherwise require using an existing driver. This lets us muck
with a template driver to test breaking things without affecting
system behaviour or requiring the dependencies of a real device
driver.

A series of 28 tests are added. Support for using two device types are
supported:

  * misc
  * block

Contrary to sysctls, sysfs requires a full write to happen at once, and
so we reduce the digit tests to single writes. Two main sysfs knobs are
provided for testing reading/storing, one which doesn't inclur any
delays and another which can incur programmed delays. What locks are
held, if any, are configurable, at module load time, or through dynamic
configuration at run time.

Since sysfs is a technically filesystem, but a pseudo one, which
requires a kernel user, our test_sysfs module and respective test script
embraces fstests format for tests in the kernel ring bufffer. Likewise,
a scraper for kernel crashes is provided which matches what fstests does
as well.

Two tests are kept disabled as they currently cause a deadlock, and so
this provides a mechanism to easily show proof and demo how the deadlock
can happen:

Demos the deadlock with a device specific lock
./tools/testing/selftests/sysfs/sysfs.sh -t 0027

Demos the deadlock with rtnl_lock()
./tools/testing/selftests/sysfs/sysfs.sh -t 0028

Two separate solutions to the deadlock issue have been proposed,
and so now its a matter of either documenting this limitation or
eventually adopting a generic fix.

This selftests will shortly be expanded upon with more tests which
require further kernel changes in order to provide better test
coverage.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 MAINTAINERS                            |    7 +
 lib/Kconfig.debug                      |   12 +
 lib/Makefile                           |    1 +
 lib/test_sysfs.c                       |  943 ++++++++++++++++++
 tools/testing/selftests/sysfs/Makefile |   12 +
 tools/testing/selftests/sysfs/config   |    2 +
 tools/testing/selftests/sysfs/sysfs.sh | 1208 ++++++++++++++++++++++++
 7 files changed, 2185 insertions(+)
 create mode 100644 lib/test_sysfs.c
 create mode 100644 tools/testing/selftests/sysfs/Makefile
 create mode 100644 tools/testing/selftests/sysfs/config
 create mode 100755 tools/testing/selftests/sysfs/sysfs.sh

diff --git a/MAINTAINERS b/MAINTAINERS
index 15a5fd8323f7..28a34384f541 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18166,6 +18166,13 @@ L:	linux-mmc@vger.kernel.org
 S:	Maintained
 F:	drivers/mmc/host/sdhci-pci-dwc-mshc.c
 
+SYSFS TEST DRIVER
+M:	Luis Chamberlain <mcgrof@kernel.org>
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+F:	lib/test_sysfs.c
+F:	tools/testing/selftests/sysfs/
+
 SYSTEM CONFIGURATION (SYSCON)
 M:	Lee Jones <lee.jones@linaro.org>
 M:	Arnd Bergmann <arnd@arndb.de>
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index dbff322207ad..ae19bf1a21b8 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2343,6 +2343,18 @@ config TEST_SYSCTL
 
 	  If unsure, say N.
 
+config TEST_SYSFS
+	tristate "sysfs test driver"
+	depends on SYSFS
+	depends on NET
+	depends on BLOCK
+	help
+	  This builds the "test_sysfs" module. This driver enables to test the
+	  sysfs file system safely without affecting production knobs which
+	  might alter system functionality.
+
+	  If unsure, say N.
+
 config BITFIELD_KUNIT
 	tristate "KUnit test bitfield functions at runtime"
 	depends on KUNIT
diff --git a/lib/Makefile b/lib/Makefile
index 2cfd33917ad5..5143d65f90d6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_TEST_FIRMWARE) += test_firmware.o
 obj-$(CONFIG_TEST_BITOPS) += test_bitops.o
 CFLAGS_test_bitops.o += -Werror
 obj-$(CONFIG_TEST_SYSCTL) += test_sysctl.o
+obj-$(CONFIG_TEST_SYSFS) += test_sysfs.o
 obj-$(CONFIG_TEST_HASH) += test_hash.o test_siphash.o
 obj-$(CONFIG_TEST_IDA) += test_ida.o
 obj-$(CONFIG_KASAN_KUNIT_TEST) += test_kasan.o
diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
new file mode 100644
index 000000000000..273fc3f39740
--- /dev/null
+++ b/lib/test_sysfs.c
@@ -0,0 +1,943 @@
+// SPDX-License-Identifier: GPL-2.0-or-later OR copyleft-next-0.3.1
+/*
+ * sysfs test driver
+ *
+ * Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or at your option any
+ * later version; or, when distributed separately from the Linux kernel or
+ * when incorporated into other software packages, subject to the following
+ * license:
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of copyleft-next (version 0.3.1 or later) as published
+ * at http://copyleft-next.org/.
+ */
+
+/*
+ * This module allows us to add race conditions which we can test for
+ * against the sysfs filesystem.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/async.h>
+#include <linux/delay.h>
+#include <linux/vmalloc.h>
+#include <linux/debugfs.h>
+#include <linux/rtnetlink.h>
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+
+static bool enable_lock;
+module_param(enable_lock, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_lock,
+		 "enable locking on reads / stores from the start");
+
+static bool enable_lock_on_rmmod;
+module_param(enable_lock_on_rmmod, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_lock_on_rmmod,
+		 "enable locking on rmmod");
+
+static bool use_rtnl_lock;
+module_param(use_rtnl_lock, bool_enable_only, 0644);
+MODULE_PARM_DESC(use_rtnl_lock,
+		 "use an rtnl_lock instead of the device mutex_lock");
+
+static unsigned int write_delay_msec_y = 500;
+module_param_named(write_delay_msec_y, write_delay_msec_y, uint, 0644);
+MODULE_PARM_DESC(write_delay_msec_y, "msec write delay for writes to y");
+
+static unsigned int test_devtype;
+module_param_named(devtype, test_devtype, uint, 0644);
+MODULE_PARM_DESC(devtype, "device type to register");
+
+static bool enable_busy_alloc;
+module_param(enable_busy_alloc, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_busy_alloc, "do a fake allocation during writes");
+
+static bool enable_debugfs;
+module_param(enable_debugfs, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_debugfs, "enable a few debugfs files");
+
+static bool enable_verbose_writes;
+module_param(enable_verbose_writes, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_debugfs, "enable stores to print verbose information");
+
+static unsigned int delay_rmmod_ms;
+module_param_named(delay_rmmod_ms, delay_rmmod_ms, uint, 0644);
+MODULE_PARM_DESC(delay_rmmod_ms, "if set how many ms to delay rmmod before device deletion");
+
+static bool enable_verbose_rmmod;
+module_param(enable_verbose_rmmod, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod");
+
+static int sysfs_test_major;
+
+/**
+ * test_config - used for configuring how the sysfs test device will behave
+ *
+ * @enable_lock: if enabled a lock will be used when reading/storing variables
+ * @enable_lock_on_rmmod: if enabled a lock will be used when reading/storing
+ *	sysfs attributes, but it will also be used to lock on rmmod. This is
+ *	useful to test for a deadlock.
+ * @use_rtnl_lock: if enabled instead of configuration specific mutex, we'll
+ *	use the rtnl_lock. If your test case is modifying this on the fly
+ *	while doing other stores / reads, things will break as a lock can be
+ *	left contending. Best is that tests use this knob serially, without
+ *	allowing userspace to modify other knobs while this one changes.
+ * @write_delay_msec_y: the amount of delay to use when writing to y
+ * @enable_busy_alloc: if enabled we'll do a large allocation between
+ *	writes. We immediately free right away. We also schedule to give the
+ *	kernel some time to re-use any memory we don't need. This is intened
+ *	to mimic typical driver behaviour.
+ */
+struct test_config {
+	bool enable_lock;
+	bool enable_lock_on_rmmod;
+	bool use_rtnl_lock;
+	unsigned int write_delay_msec_y;
+	bool enable_busy_alloc;
+};
+
+/**
+ * enum sysfs_test_devtype - sysfs device type
+ * @TESTDEV_TYPE_MISC: misc device type
+ * @TESTDEV_TYPE_BLOCK: use a block device for the sysfs test device.
+ */
+enum sysfs_test_devtype {
+	TESTDEV_TYPE_MISC = 0,
+	TESTDEV_TYPE_BLOCK,
+};
+
+/**
+ * sysfs_test_device - test device to help test sysfs
+ *
+ * @devtype: the type of device to use
+ * @config: configuration for the test
+ * @config_mutex: protects configuration of test
+ * @misc_dev: we use a misc device under the hood
+ * @disk: represents a disk when used as a block device
+ * @dev: pointer to misc_dev's own struct device
+ * @dev_idx: unique ID for test device
+ * @x: variable we can use to test read / store
+ * @y: slow variable we can use to test read / store
+ */
+struct sysfs_test_device {
+	enum sysfs_test_devtype devtype;
+	struct test_config config;
+	struct mutex config_mutex;
+	struct miscdevice misc_dev;
+	struct gendisk *disk;
+	struct device *dev;
+	int dev_idx;
+	int x;
+	int y;
+};
+
+static struct sysfs_test_device *first_test_dev;
+
+static struct miscdevice *dev_to_misc_dev(struct device *dev)
+{
+	return dev_get_drvdata(dev);
+}
+
+static struct sysfs_test_device *misc_dev_to_test_dev(struct miscdevice *misc_dev)
+{
+	return container_of(misc_dev, struct sysfs_test_device, misc_dev);
+}
+
+static struct sysfs_test_device *devblock_to_test_dev(struct device *dev)
+{
+	return (struct sysfs_test_device *)dev_to_disk(dev)->private_data;
+}
+
+static struct sysfs_test_device *devmisc_to_testdev(struct device *dev)
+{
+	struct miscdevice *misc_dev;
+
+	misc_dev = dev_to_misc_dev(dev);
+	return misc_dev_to_test_dev(misc_dev);
+}
+
+static struct sysfs_test_device *dev_to_test_dev(struct device *dev)
+{
+	if (test_devtype == TESTDEV_TYPE_MISC)
+		return devmisc_to_testdev(dev);
+	else if (test_devtype == TESTDEV_TYPE_BLOCK)
+		return devblock_to_test_dev(dev);
+	return NULL;
+}
+
+static void test_dev_config_lock(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config;
+
+	config = &test_dev->config;
+	if (config->enable_lock) {
+		if (config->use_rtnl_lock)
+			rtnl_lock();
+		else
+			mutex_lock(&test_dev->config_mutex);
+	}
+}
+
+static void test_dev_config_unlock(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config;
+
+	config = &test_dev->config;
+	if (config->enable_lock) {
+		if (config->use_rtnl_lock)
+			rtnl_unlock();
+		else
+			mutex_unlock(&test_dev->config_mutex);
+	}
+}
+
+static void test_dev_config_lock_rmmod(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config;
+
+	config = &test_dev->config;
+	if (config->enable_lock_on_rmmod)
+		test_dev_config_lock(test_dev);
+}
+
+static void test_dev_config_unlock_rmmod(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config;
+
+	config = &test_dev->config;
+	if (config->enable_lock_on_rmmod)
+		test_dev_config_unlock(test_dev);
+}
+
+static void free_test_dev_sysfs(struct sysfs_test_device *test_dev)
+{
+	if (test_dev) {
+		kfree_const(test_dev->misc_dev.name);
+		test_dev->misc_dev.name = NULL;
+		kfree(test_dev);
+		test_dev = NULL;
+	}
+}
+
+static void test_sysfs_reset_vals(struct sysfs_test_device *test_dev)
+{
+	test_dev->x = 3;
+	test_dev->y = 4;
+}
+
+static ssize_t config_show(struct device *dev,
+			   struct device_attribute *attr,
+			   char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config = &test_dev->config;
+	int len = 0;
+
+	test_dev_config_lock(test_dev);
+
+	len += snprintf(buf, PAGE_SIZE,
+			"Configuration for: %s\n",
+			dev_name(dev));
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"x:\t%d\n",
+			test_dev->x);
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"y:\t%d\n",
+			test_dev->y);
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_lock:\t%s\n",
+			config->enable_lock ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_lock_on_rmmmod:\t%s\n",
+			config->enable_lock_on_rmmod ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"use_rtnl_lock:\t%s\n",
+			config->use_rtnl_lock ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"write_delay_msec_y:\t%d\n",
+			config->write_delay_msec_y);
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_busy_alloc:\t%s\n",
+			config->enable_busy_alloc ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_debugfs:\t%s\n",
+			enable_debugfs ? "true" : "false");
+
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_verbose_writes:\t%s\n",
+			enable_verbose_writes ? "true" : "false");
+
+	test_dev_config_unlock(test_dev);
+
+	return len;
+}
+static DEVICE_ATTR_RO(config);
+
+static ssize_t reset_store(struct device *dev,
+			   struct device_attribute *attr,
+			   const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+
+	config = &test_dev->config;
+
+	/*
+	 * We compromise and simplify this condition and do not use a lock
+	 * here as the lock type can change.
+	 */
+	config->enable_lock = false;
+	config->enable_lock_on_rmmod = false;
+	config->use_rtnl_lock = false;
+	config->enable_busy_alloc = false;
+	test_sysfs_reset_vals(test_dev);
+
+	dev_info(dev, "reset\n");
+
+	return count;
+}
+static DEVICE_ATTR_WO(reset);
+
+static void test_dev_busy_alloc(struct sysfs_test_device *test_dev)
+{
+	struct test_config *config;
+	char *ignore;
+
+	config = &test_dev->config;
+	if (!config->enable_busy_alloc)
+		return;
+
+	ignore = kzalloc(sizeof(struct sysfs_test_device) * 10, GFP_KERNEL);
+	kfree(ignore);
+
+	schedule();
+}
+
+static ssize_t test_dev_x_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	int ret;
+
+	test_dev_busy_alloc(test_dev);
+	test_dev_config_lock(test_dev);
+
+	ret = kstrtoint(buf, 10, &test_dev->x);
+	if (ret)
+		count = ret;
+
+	if (enable_verbose_writes)
+		dev_info(test_dev->dev, "wrote x = %d\n", test_dev->x);
+
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t test_dev_x_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	int ret;
+
+	test_dev_config_lock(test_dev);
+	ret = snprintf(buf, PAGE_SIZE, "%d\n", test_dev->x);
+	test_dev_config_unlock(test_dev);
+
+	return ret;
+}
+static DEVICE_ATTR_RW(test_dev_x);
+
+static ssize_t test_dev_y_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int y;
+	int ret;
+
+	test_dev_busy_alloc(test_dev);
+	test_dev_config_lock(test_dev);
+
+	config = &test_dev->config;
+
+	ret = kstrtoint(buf, 10, &y);
+	if (ret)
+		count = ret;
+
+	msleep(config->write_delay_msec_y);
+	test_dev->y = test_dev->x + y + 7;
+
+	if (enable_verbose_writes)
+		dev_info(test_dev->dev, "wrote y = %d\n", test_dev->y);
+
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t test_dev_y_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	int ret;
+
+	test_dev_config_lock(test_dev);
+	ret = snprintf(buf, PAGE_SIZE, "%d\n", test_dev->y);
+	test_dev_config_unlock(test_dev);
+
+	return ret;
+}
+static DEVICE_ATTR_RW(test_dev_y);
+
+static ssize_t config_enable_lock_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int ret;
+	int val;
+
+	config = &test_dev->config;
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	/*
+	 * We compromise for simplicty and do not lock when changing
+	 * locking configuration, with the assumption userspace tests
+	 * will know this.
+	 */
+	if (val)
+		config->enable_lock = true;
+	else
+		config->enable_lock = false;
+
+	return count;
+}
+
+static ssize_t config_enable_lock_show(struct device *dev,
+				       struct device_attribute *attr,
+				       char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	ssize_t ret;
+
+	config = &test_dev->config;
+
+	test_dev_config_lock(test_dev);
+	ret = snprintf(buf, PAGE_SIZE, "%d\n", config->enable_lock);
+	test_dev_config_unlock(test_dev);
+
+	return ret;
+}
+static DEVICE_ATTR_RW(config_enable_lock);
+
+static ssize_t config_enable_lock_on_rmmod_store(struct device *dev,
+						 struct device_attribute *attr,
+						 const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int ret;
+	int val;
+
+	config = &test_dev->config;
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	test_dev_config_lock(test_dev);
+	if (val)
+		config->enable_lock_on_rmmod = true;
+	else
+		config->enable_lock_on_rmmod = false;
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t config_enable_lock_on_rmmod_show(struct device *dev,
+						struct device_attribute *attr,
+						char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	ssize_t ret;
+
+	config = &test_dev->config;
+
+	test_dev_config_lock(test_dev);
+	ret = snprintf(buf, PAGE_SIZE, "%d\n", config->enable_lock_on_rmmod);
+	test_dev_config_unlock(test_dev);
+
+	return ret;
+}
+static DEVICE_ATTR_RW(config_enable_lock_on_rmmod);
+
+static ssize_t config_use_rtnl_lock_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int ret;
+	int val;
+
+	config = &test_dev->config;
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	/*
+	 * We compromise and simplify this condition and do not use a lock
+	 * here as the lock type can change.
+	 */
+	if (val)
+		config->use_rtnl_lock = true;
+	else
+		config->use_rtnl_lock = false;
+
+	return count;
+}
+
+static ssize_t config_use_rtnl_lock_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+
+	config = &test_dev->config;
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", config->use_rtnl_lock);
+}
+static DEVICE_ATTR_RW(config_use_rtnl_lock);
+
+static ssize_t config_write_delay_msec_y_store(struct device *dev,
+					       struct device_attribute *attr,
+					       const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int ret;
+	int val;
+
+	config = &test_dev->config;
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	test_dev_config_lock(test_dev);
+	config->write_delay_msec_y = val;
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t config_write_delay_msec_y_show(struct device *dev,
+					      struct device_attribute *attr,
+					      char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+
+	config = &test_dev->config;
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", config->write_delay_msec_y);
+}
+static DEVICE_ATTR_RW(config_write_delay_msec_y);
+
+static ssize_t config_enable_busy_alloc_store(struct device *dev,
+					      struct device_attribute *attr,
+					      const char *buf, size_t count)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+	int ret;
+	int val;
+
+	config = &test_dev->config;
+	ret = kstrtoint(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	test_dev_config_lock(test_dev);
+	config->enable_busy_alloc = val;
+	test_dev_config_unlock(test_dev);
+
+	return count;
+}
+
+static ssize_t config_enable_busy_alloc_show(struct device *dev,
+					     struct device_attribute *attr,
+					     char *buf)
+{
+	struct sysfs_test_device *test_dev = dev_to_test_dev(dev);
+	struct test_config *config;
+
+	config = &test_dev->config;
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", config->enable_busy_alloc);
+}
+static DEVICE_ATTR_RW(config_enable_busy_alloc);
+
+#define TEST_SYSFS_DEV_ATTR(name)		(&dev_attr_##name.attr)
+
+static struct attribute *test_dev_attrs[] = {
+	/* Generic driver knobs go here */
+	TEST_SYSFS_DEV_ATTR(config),
+	TEST_SYSFS_DEV_ATTR(reset),
+
+	/* These are used to test sysfs */
+	TEST_SYSFS_DEV_ATTR(test_dev_x),
+	TEST_SYSFS_DEV_ATTR(test_dev_y),
+
+	/*
+	 * These are configuration knobs to modify how we test sysfs when
+	 * doing reads / stores.
+	 */
+	TEST_SYSFS_DEV_ATTR(config_enable_lock),
+	TEST_SYSFS_DEV_ATTR(config_enable_lock_on_rmmod),
+	TEST_SYSFS_DEV_ATTR(config_use_rtnl_lock),
+	TEST_SYSFS_DEV_ATTR(config_write_delay_msec_y),
+	TEST_SYSFS_DEV_ATTR(config_enable_busy_alloc),
+
+	NULL,
+};
+
+ATTRIBUTE_GROUPS(test_dev);
+
+static int sysfs_test_dev_alloc_miscdev(struct sysfs_test_device *test_dev)
+{
+	struct miscdevice *misc_dev;
+
+	misc_dev = &test_dev->misc_dev;
+	misc_dev->minor = MISC_DYNAMIC_MINOR;
+	misc_dev->name = kasprintf(GFP_KERNEL, "test_sysfs%d", test_dev->dev_idx);
+	if (!misc_dev->name) {
+		pr_err("Cannot alloc misc_dev->name\n");
+		return -ENOMEM;
+	}
+	misc_dev->groups = test_dev_groups;
+
+	return 0;
+}
+
+static int testdev_open(struct block_device *bdev, fmode_t mode)
+{
+	return -EINVAL;
+}
+
+static blk_qc_t testdev_submit_bio(struct bio *bio)
+{
+	return BLK_QC_T_NONE;
+}
+
+static void testdev_slot_free_notify(struct block_device *bdev,
+				     unsigned long index)
+{
+}
+
+static int testdev_rw_page(struct block_device *bdev, sector_t sector,
+			   struct page *page, unsigned int op)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct block_device_operations sysfs_testdev_ops = {
+	.open = testdev_open,
+	.submit_bio = testdev_submit_bio,
+	.swap_slot_free_notify = testdev_slot_free_notify,
+	.rw_page = testdev_rw_page,
+	.owner = THIS_MODULE
+};
+
+static int sysfs_test_dev_alloc_blockdev(struct sysfs_test_device *test_dev)
+{
+	int ret = -ENOMEM;
+
+	test_dev->disk = blk_alloc_disk(NUMA_NO_NODE);
+	if (!test_dev->disk) {
+		pr_err("Error allocating disk structure for device %d\n",
+		       test_dev->dev_idx);
+		goto out;
+	}
+
+	test_dev->disk->major = sysfs_test_major;
+	test_dev->disk->first_minor = test_dev->dev_idx + 1;
+	test_dev->disk->fops = &sysfs_testdev_ops;
+	test_dev->disk->private_data = test_dev;
+	snprintf(test_dev->disk->disk_name, 16, "test_sysfs%d",
+		 test_dev->dev_idx);
+	set_capacity(test_dev->disk, 0);
+	blk_queue_flag_set(QUEUE_FLAG_NONROT, test_dev->disk->queue);
+	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, test_dev->disk->queue);
+	blk_queue_physical_block_size(test_dev->disk->queue, PAGE_SIZE);
+	blk_queue_max_discard_sectors(test_dev->disk->queue, UINT_MAX);
+	blk_queue_flag_set(QUEUE_FLAG_DISCARD, test_dev->disk->queue);
+
+	return 0;
+out:
+	return ret;
+}
+
+static struct sysfs_test_device *alloc_test_dev_sysfs(int idx)
+{
+	struct sysfs_test_device *test_dev;
+	int ret;
+
+	switch (test_devtype) {
+	case TESTDEV_TYPE_MISC:
+	       fallthrough;
+	case TESTDEV_TYPE_BLOCK:
+		break;
+	default:
+		return NULL;
+	}
+
+	test_dev = kzalloc(sizeof(struct sysfs_test_device), GFP_KERNEL);
+	if (!test_dev)
+		goto err_out;
+
+	mutex_init(&test_dev->config_mutex);
+	test_dev->dev_idx = idx;
+	test_dev->devtype = test_devtype;
+
+	if (test_dev->devtype == TESTDEV_TYPE_MISC) {
+		ret = sysfs_test_dev_alloc_miscdev(test_dev);
+		if (ret)
+			goto err_out_free;
+	} else if (test_dev->devtype == TESTDEV_TYPE_BLOCK) {
+		ret = sysfs_test_dev_alloc_blockdev(test_dev);
+		if (ret)
+			goto err_out_free;
+	}
+	return test_dev;
+
+err_out_free:
+	kfree(test_dev);
+	test_dev = NULL;
+err_out:
+	return NULL;
+}
+
+static int register_test_dev_sysfs_misc(struct sysfs_test_device *test_dev)
+{
+	int ret;
+
+	ret = misc_register(&test_dev->misc_dev);
+	if (ret)
+		return ret;
+
+	test_dev->dev = test_dev->misc_dev.this_device;
+
+	return 0;
+}
+
+static int register_test_dev_sysfs_block(struct sysfs_test_device *test_dev)
+{
+	device_add_disk(NULL, test_dev->disk, test_dev_groups);
+	test_dev->dev = disk_to_dev(test_dev->disk);
+
+	return 0;
+}
+
+static struct sysfs_test_device *register_test_dev_sysfs(void)
+{
+	struct sysfs_test_device *test_dev = NULL;
+	int ret;
+
+	test_dev = alloc_test_dev_sysfs(0);
+	if (!test_dev)
+		goto out;
+
+	if (test_dev->devtype == TESTDEV_TYPE_MISC) {
+		ret = register_test_dev_sysfs_misc(test_dev);
+		if (ret) {
+			pr_err("could not register misc device: %d\n", ret);
+			goto out_free_dev;
+		}
+	} else if (test_dev->devtype == TESTDEV_TYPE_BLOCK) {
+		ret = register_test_dev_sysfs_block(test_dev);
+		if (ret) {
+			pr_err("could not register block device: %d\n", ret);
+			goto out_free_dev;
+		}
+	}
+
+	dev_info(test_dev->dev, "interface ready\n");
+
+out:
+	return test_dev;
+out_free_dev:
+	free_test_dev_sysfs(test_dev);
+	return NULL;
+}
+
+static struct sysfs_test_device *register_test_dev_set_config(void)
+{
+	struct sysfs_test_device *test_dev;
+	struct test_config *config;
+
+	test_dev = register_test_dev_sysfs();
+	if (!test_dev)
+		return NULL;
+
+	config = &test_dev->config;
+
+	if (enable_lock)
+		config->enable_lock = true;
+	if (enable_lock_on_rmmod)
+		config->enable_lock_on_rmmod = true;
+	if (use_rtnl_lock)
+		config->use_rtnl_lock = true;
+	if (enable_busy_alloc)
+		config->enable_busy_alloc = true;
+
+	config->write_delay_msec_y = write_delay_msec_y;
+	test_sysfs_reset_vals(test_dev);
+
+	return test_dev;
+}
+
+static void unregister_test_dev_sysfs_misc(struct sysfs_test_device *test_dev)
+{
+	misc_deregister(&test_dev->misc_dev);
+}
+
+static void unregister_test_dev_sysfs_block(struct sysfs_test_device *test_dev)
+{
+	del_gendisk(test_dev->disk);
+	blk_cleanup_disk(test_dev->disk);
+}
+
+static void unregister_test_dev_sysfs(struct sysfs_test_device *test_dev)
+{
+	test_dev_config_lock_rmmod(test_dev);
+
+	dev_info(test_dev->dev, "removing interface\n");
+
+	if (test_dev->devtype == TESTDEV_TYPE_MISC)
+		unregister_test_dev_sysfs_misc(test_dev);
+	else if (test_dev->devtype == TESTDEV_TYPE_BLOCK)
+		unregister_test_dev_sysfs_block(test_dev);
+
+	test_dev_config_unlock_rmmod(test_dev);
+
+	free_test_dev_sysfs(test_dev);
+}
+
+static struct dentry *debugfs_dir;
+
+/* When read represents how many times we have reset the first_test_dev */
+static u8 reset_first_test_dev;
+
+static ssize_t read_reset_first_test_dev(struct file *file,
+					 char __user *user_buf,
+					 size_t count, loff_t *ppos)
+{
+	ssize_t len;
+	char buf[32];
+
+	reset_first_test_dev++;
+	len = sprintf(buf, "%d\n", reset_first_test_dev);
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t write_reset_first_test_dev(struct file *file,
+					  const char __user *user_buf,
+					  size_t count, loff_t *ppos)
+{
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	if (!first_test_dev) {
+		module_put(THIS_MODULE);
+		return -ENODEV;
+	}
+
+	dev_info(first_test_dev->dev, "going to reset first interface ...\n");
+
+	unregister_test_dev_sysfs(first_test_dev);
+	first_test_dev = register_test_dev_set_config();
+
+	dev_info(first_test_dev->dev, "first interface reset complete\n");
+
+	module_put(THIS_MODULE);
+
+	return count;
+}
+
+static const struct file_operations fops_reset_first_test_dev = {
+	.read = read_reset_first_test_dev,
+	.write = write_reset_first_test_dev,
+	.open = simple_open,
+	.owner = THIS_MODULE,
+	.llseek = default_llseek,
+};
+
+static int __init test_sysfs_init(void)
+{
+	first_test_dev = register_test_dev_set_config();
+	if (!first_test_dev)
+		return -ENOMEM;
+
+	if (!enable_debugfs)
+		return 0;
+
+	debugfs_dir = debugfs_create_dir("test_sysfs", NULL);
+	if (!debugfs_dir) {
+		unregister_test_dev_sysfs(first_test_dev);
+		return -ENOMEM;
+	}
+
+	debugfs_create_file("reset_first_test_dev", 0600, debugfs_dir,
+			    NULL, &fops_reset_first_test_dev);
+	return 0;
+}
+module_init(test_sysfs_init);
+
+static void __exit test_sysfs_exit(void)
+{
+	if (enable_debugfs)
+		debugfs_remove(debugfs_dir);
+	if (delay_rmmod_ms)
+		msleep(delay_rmmod_ms);
+	unregister_test_dev_sysfs(first_test_dev);
+	if (enable_verbose_rmmod)
+		pr_info("unregister_test_dev_sysfs() completed\n");
+	first_test_dev = NULL;
+}
+module_exit(test_sysfs_exit);
+
+MODULE_AUTHOR("Luis Chamberlain <mcgrof@kernel.org>");
+MODULE_LICENSE("GPL");
diff --git a/tools/testing/selftests/sysfs/Makefile b/tools/testing/selftests/sysfs/Makefile
new file mode 100644
index 000000000000..fde99caa2338
--- /dev/null
+++ b/tools/testing/selftests/sysfs/Makefile
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0-only
+# Makefile for sysfs selftests.
+
+# No binaries, but make sure arg-less "make" doesn't trigger "run_tests".
+all:
+
+TEST_PROGS := sysfs.sh
+
+include ../lib.mk
+
+# Nothing to clean up.
+clean:
diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config
new file mode 100644
index 000000000000..9196f452ecd5
--- /dev/null
+++ b/tools/testing/selftests/sysfs/config
@@ -0,0 +1,2 @@
+CONFIG_SYSFS=m
+CONFIG_TEST_SYSFS=m
diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
new file mode 100755
index 000000000000..b3f4c2236c7f
--- /dev/null
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -0,0 +1,1208 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (C) 2021 Luis Chamberlain <mcgrof@kernel.org>
+#
+# This program is free software; you can redistribute it and/or modify it
+# under the terms of the GNU General Public License as published by the Free
+# Software Foundation; either version 2 of the License, or at your option any
+# later version; or, when distributed separately from the Linux kernel or
+# when incorporated into other software packages, subject to the following
+# license:
+#
+# This program is free software; you can redistribute it and/or modify it
+# under the terms of copyleft-next (version 0.3.1 or later) as published
+# at http://copyleft-next.org/.
+
+# This performs a series tests against the sysfs filesystem.
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+TEST_NAME="sysfs"
+TEST_DRIVER="test_${TEST_NAME}"
+TEST_DIR=$(dirname $0)
+TEST_FILE=$(mktemp)
+
+# This represents
+#
+# TEST_ID:TEST_COUNT:ENABLED:TARGET
+#
+# TEST_ID: is the test id number
+# TEST_COUNT: number of times we should run the test
+# ENABLED: 1 if enabled, 0 otherwise
+# TARGET: test target file required on the test_sysfs module
+#
+# Once these are enabled please leave them as-is. Write your own test,
+# we have tons of space.
+ALL_TESTS="0001:3:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0002:3:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0003:3:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0004:3:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0005:1:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0006:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0007:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0008:1:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0009:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0010:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0011:1:1:test_dev_x:misc"
+ALL_TESTS="$ALL_TESTS 0012:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0013:1:1:test_dev_y:misc"
+ALL_TESTS="$ALL_TESTS 0014:3:1:test_dev_x:block" # block equivalent set
+ALL_TESTS="$ALL_TESTS 0015:3:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0016:3:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0017:3:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0018:1:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0019:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0020:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0021:1:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0022:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0023:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0024:1:1:test_dev_x:block"
+ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
+ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
+ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+
+allow_user_defaults()
+{
+	if [ -z $DIR ]; then
+		case $TEST_DEV_TYPE in
+		misc)
+			DIR="/sys/devices/virtual/misc/${TEST_DRIVER}0"
+			;;
+		block)
+			DIR="/sys/devices/virtual/block/${TEST_DRIVER}0"
+			;;
+		*)
+			DIR="/sys/devices/virtual/misc/${TEST_DRIVER}0"
+			;;
+		esac
+	fi
+	case $TEST_DEV_TYPE in
+		misc)
+			MODPROBE_TESTDEV_TYPE=""
+			;;
+		block)
+			MODPROBE_TESTDEV_TYPE="devtype=1"
+			;;
+		*)
+			MODPROBE_TESTDEV_TYPE=""
+			;;
+	esac
+	if [ -z $SYSFS_DEBUGFS_DIR ]; then
+		SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs"
+	fi
+	if [ -z $PAGE_SIZE ]; then
+		PAGE_SIZE=$(getconf PAGESIZE)
+	fi
+	if [ -z $MAX_DIGITS ]; then
+		MAX_DIGITS=$(($PAGE_SIZE/8))
+	fi
+	if [ -z $INT_MAX ]; then
+		INT_MAX=$(getconf INT_MAX)
+	fi
+	if [ -z $UINT_MAX ]; then
+		UINT_MAX=$(getconf UINT_MAX)
+	fi
+}
+
+test_reqs()
+{
+	uid=$(id -u)
+	if [ $uid -ne 0 ]; then
+		echo $msg must be run as root >&2
+		exit $ksft_skip
+	fi
+
+	if ! which modprobe 2> /dev/null > /dev/null; then
+		echo "$0: You need modprobe installed" >&2
+		exit $ksft_skip
+	fi
+	if ! which getconf 2> /dev/null > /dev/null; then
+		echo "$0: You need getconf installed"
+		exit $ksft_skip
+	fi
+	if ! which diff 2> /dev/null > /dev/null; then
+		echo "$0: You need diff installed"
+		exit $ksft_skip
+	fi
+	if ! which perl 2> /dev/null > /dev/null; then
+		echo "$0: You need perl installed"
+		exit $ksft_skip
+	fi
+}
+
+call_modprobe()
+{
+	modprobe $TEST_DRIVER $MODPROBE_TESTDEV_TYPE $FIRST_MODPROBE_ARGS $MODPROBE_ARGS
+	return $?
+}
+
+modprobe_reset()
+{
+	modprobe -q -r $TEST_DRIVER
+	call_modprobe
+	return $?
+}
+
+modprobe_reset_enable_debugfs()
+{
+	FIRST_MODPROBE_ARGS="enable_debugfs=1"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
+modprobe_reset_enable_lock_on_rmmod()
+{
+	FIRST_MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 enable_verbose_writes=1"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
+modprobe_reset_enable_rtnl_lock_on_rmmod()
+{
+	FIRST_MODPROBE_ARGS="enable_lock=1 use_rtnl_lock=1 enable_lock_on_rmmod=1"
+	FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_writes=1"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
+load_req_mod()
+{
+	modprobe_reset
+	if [ ! -d $DIR ]; then
+		if ! modprobe -q -n $TEST_DRIVER; then
+			echo "$0: module $TEST_DRIVER not found [SKIP]"
+			echo "You must set CONFIG_TEST_SYSFS=m in your kernel" >&2
+			exit $ksft_skip
+		fi
+		call_modprobe
+		if [ $? -ne 0 ]; then
+			echo "$0: modprobe $TEST_DRIVER failed."
+			exit
+		fi
+	fi
+}
+
+config_reset()
+{
+	if ! echo -n "1" >"$DIR"/reset; then
+		echo "$0: reset should have worked" >&2
+		exit 1
+	fi
+}
+
+debugfs_reset_first_test_dev_ignore_errors()
+{
+	echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev
+}
+
+set_orig()
+{
+	if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then
+		if [ -f ${TARGET} ]; then
+			echo "${ORIG}" > "${TARGET}"
+		fi
+	fi
+}
+
+set_test()
+{
+	echo "${TEST_STR}" > "${TARGET}"
+}
+
+set_test_ignore_errors()
+{
+	echo "${TEST_STR}" > "${TARGET}" 2> /dev/null
+}
+
+verify()
+{
+	local seen
+	seen=$(cat "$1")
+	target_short=$(basename $TARGET)
+	case $target_short in
+	test_dev_x)
+		if [ "${seen}" != "${TEST_STR}" ]; then
+			return 1
+		fi
+		;;
+	test_dev_y)
+		DIRNAME=$(dirname $1)
+		EXPECTED_RESULT=""
+		# If our target was the test file then what we write to it
+		# is the same as what that we expect when we read from it.
+		# When we write to test_dev_y directly though we expect
+		# a computed value which is driver specific.
+		if [[ "$DIRNAME" == "/tmp" ]]; then
+			let EXPECTED_RESULT="${TEST_STR}"
+		else
+			x=$(cat ${DIR}/test_dev_x)
+			let EXPECTED_RESULT="$x+${TEST_STR}+7"
+		fi
+
+		if [[ "${seen}" != "${EXPECTED_RESULT}" ]]; then
+			return 1
+		fi
+		;;
+	*)
+		echo "Unsupported target type update test script: $target_short"
+		exit 1
+	esac
+	return 0
+}
+
+verify_diff_w()
+{
+	echo "$TEST_STR" | diff -q -w -u - $1 > /dev/null
+	return $?
+}
+
+test_rc()
+{
+	if [[ $rc != 0 ]]; then
+		echo "Failed test, return value: $rc" >&2
+		exit $rc
+	fi
+}
+
+test_finish()
+{
+	set_orig
+	rm -f "${TEST_FILE}"
+
+	if [ ! -z ${old_strict} ]; then
+		echo ${old_strict} > ${WRITES_STRICT}
+	fi
+	exit $rc
+}
+
+# kernfs requires us to write everything we want in one shot because
+# There is no easy way for us to know if userspace is only doing a partial
+# write, so we don't support them. We expect the entire buffer to come on
+# the first write.  If you're writing a value, first read the file,
+# modify only the value you're changing, then write entire buffer back.
+# Since we are only testing digits we just full single writes and old stuff.
+# For more details, refer to kernfs_fop_write_iter().
+run_numerictests_single_write()
+{
+	echo "== Testing sysfs behavior against ${TARGET} =="
+
+	rc=0
+
+	echo -n "Writing test file ... "
+	echo "${TEST_STR}" > "${TEST_FILE}"
+	if ! verify "${TEST_FILE}"; then
+		echo "FAIL" >&2
+		exit 1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Checking the sysfs file is not set to test value ... "
+	if verify "${TARGET}"; then
+		echo "FAIL" >&2
+		exit 1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Writing to sysfs file from shell ... "
+	set_test
+	if ! verify "${TARGET}"; then
+		echo "FAIL" >&2
+		exit 1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Resetting sysfs file to original value ... "
+	set_orig
+	if verify "${TARGET}"; then
+		echo "FAIL" >&2
+		exit 1
+	else
+		echo "ok"
+	fi
+
+	# Now that we've validated the sanity of "set_test" and "set_orig",
+	# we can use those functions to set starting states before running
+	# specific behavioral tests.
+
+	echo -n "Writing to the entire sysfs file in a single write ... "
+	set_orig
+	dd if="${TEST_FILE}" of="${TARGET}" bs=4096 2>/dev/null
+	if ! verify "${TARGET}"; then
+		echo "FAIL" >&2
+		rc=1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Writing to the sysfs file with multiple long writes ... "
+	set_orig
+	(perl -e 'print "A" x 50;'; echo "${TEST_STR}") | \
+		dd of="${TARGET}" bs=50 2>/dev/null
+	if verify "${TARGET}"; then
+		echo "FAIL" >&2
+		rc=1
+	else
+		echo "ok"
+	fi
+	test_rc
+}
+
+reset_vals()
+{
+	echo -n 3 > $DIR/test_dev_x
+	echo -n 4 > $DIR/test_dev_x
+}
+
+check_failure()
+{
+	echo -n "Testing that $1 fails as expected..."
+	reset_vals
+	TEST_STR="$1"
+	orig="$(cat $TARGET)"
+	echo -n "$TEST_STR" > $TARGET 2> /dev/null
+
+	# write should fail and $TARGET should retain its original value
+	if [ $? = 0 ] || [ "$(cat $TARGET)" != "$orig" ]; then
+		echo "FAIL" >&2
+		rc=1
+	else
+		echo "ok"
+	fi
+	test_rc
+}
+
+load_modreqs()
+{
+	export TEST_DEV_TYPE=$(get_test_type $1)
+	unset DIR
+	allow_user_defaults
+	load_req_mod
+}
+
+target_exists()
+{
+	TARGET="${DIR}/$1"
+	TEST_ID="$2"
+
+	if [ ! -f ${TARGET} ] ; then
+		echo "Target for test $TEST_ID: $TARGET does not exist, skipping test ..."
+		return 0
+	fi
+	return 1
+}
+
+config_enable_lock()
+{
+	if ! echo -n 1 > $DIR/config_enable_lock; then
+		echo "$0: Unable to enable locks" >&2
+		exit 1
+	fi
+}
+
+config_write_delay_msec_y()
+{
+	if ! echo -n $1 > $DIR/config_write_delay_msec_y ; then
+		echo "$0: Unable to set write_delay_msec_y to $1" >&2
+		exit 1
+	fi
+}
+
+# Default filter for dmesg scanning.
+# Ignore lockdep complaining about its own bugginess when scanning dmesg
+# output, because we shouldn't be failing filesystem tests on account of
+# lockdep.
+_check_dmesg_filter()
+{
+	egrep -v -e "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low" \
+		-e "BUG: MAX_STACK_TRACE_ENTRIES too low"
+}
+
+check_dmesg()
+{
+	# filter out intentional WARNINGs or Oopses
+	local filter=${1:-_check_dmesg_filter}
+
+	_dmesg_since_test_start | $filter >$seqres.dmesg
+	egrep -q -e "kernel BUG at" \
+	     -e "WARNING:" \
+	     -e "\bBUG:" \
+	     -e "Oops:" \
+	     -e "possible recursive locking detected" \
+	     -e "Internal error" \
+	     -e "(INFO|ERR): suspicious RCU usage" \
+	     -e "INFO: possible circular locking dependency detected" \
+	     -e "general protection fault:" \
+	     -e "BUG .* remaining" \
+	     -e "UBSAN:" \
+	     $seqres.dmesg
+	if [ $? -eq 0 ]; then
+		echo "something found in dmesg (see $seqres.dmesg)"
+		return 1
+	else
+		if [ "$KEEP_DMESG" != "yes" ]; then
+			rm -f $seqres.dmesg
+		fi
+		return 0
+	fi
+}
+
+log_kernel_fstest_dmesg()
+{
+	export FSTYP="$1"
+	export seqnum="$FSTYP/$2"
+	export date_time=$(date +"%F %T")
+	echo "run fstests $seqnum at $date_time" > /dev/kmsg
+}
+
+modprobe_loop()
+{
+	while true; do
+		call_modprobe > /dev/null 2>&1
+		modprobe -r $TEST_DRIVER > /dev/null 2>&1
+	done > /dev/null 2>&1
+}
+
+write_loop()
+{
+	while true; do
+		set_test_ignore_errors > /dev/null 2>&1
+		TEST_STR=$(( $TEST_STR + 1 ))
+	done > /dev/null 2>&1
+}
+
+write_loop_reset()
+{
+	while true; do
+		set_test_ignore_errors > /dev/null 2>&1
+		debugfs_reset_first_test_dev_ignore_errors > /dev/null 2>&1
+	done > /dev/null 2>&1
+}
+
+write_loop_bg()
+{
+	BG_WRITES=1000 > /dev/null 2>&1
+	while true; do
+		for i in $(seq 1 $BG_WRITES); do
+			set_test_ignore_errors > /dev/null 2>&1 &
+			TEST_STR=$(( $TEST_STR + 1 ))
+		done > /dev/null 2>&1
+		wait
+	done > /dev/null 2>&1
+	wait
+}
+
+reset_loop()
+{
+	while true; do
+		debugfs_reset_first_test_dev_ignore_errors > /dev/null 2>&1
+	done > /dev/null 2>&1
+}
+
+kill_trigger_loop()
+{
+
+	local my_first_loop_pid=$1
+	local my_second_loop_pid=$2
+	local my_sleep_max=$3
+	local my_loop=0
+
+	while true; do
+		sleep 1
+		if [[ $my_loop -ge $my_sleep_max ]]; then
+			break
+		fi
+		let my_loop=$my_loop+1
+	done
+
+	kill -s TERM $my_first_loop_pid 2>&1 > /dev/null
+	kill -s TERM $my_second_loop_pid 2>&1 > /dev/null
+}
+
+_dmesg_since_test_start()
+{
+	# search the dmesg log of last run of $seqnum for possible failures
+	# use sed \cregexpc address type, since $seqnum contains "/"
+	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac
+}
+
+sysfs_test_0001()
+{
+	TARGET="${DIR}/$(get_test_target 0001)"
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0002()
+{
+	TARGET="${DIR}/$(get_test_target 0002)"
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0003()
+{
+	TARGET="${DIR}/$(get_test_target 0003)"
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	config_enable_lock
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0004()
+{
+	TARGET="${DIR}/$(get_test_target 0004)"
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	config_enable_lock
+
+	run_numerictests_single_write
+}
+
+sysfs_test_0005()
+{
+	TARGET="${DIR}/$(get_test_target 0005)"
+	modprobe_reset
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing x while loading/unloading the module... "
+
+	modprobe_loop &
+	modprobe_pid=$!
+
+	write_loop &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0006()
+{
+	TARGET="${DIR}/$(get_test_target 0006)"
+	modprobe_reset
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y while loading/unloading the module... "
+	modprobe_loop &
+	modprobe_pid=$!
+
+	write_loop &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0007()
+{
+	TARGET="${DIR}/$(get_test_target 0007)"
+	modprobe_reset
+	config_reset
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y with a larger delay while loading/unloading the module... "
+
+	MODPROBE_ARGS="write_delay_msec_y=1500"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	write_loop &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0008()
+{
+	TARGET="${DIR}/$(get_test_target 0008)"
+	modprobe_reset
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop busy writing x while loading/unloading the module... "
+
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+
+	write_loop_bg > /dev/null 2>&1 &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0009()
+{
+	TARGET="${DIR}/$(get_test_target 0009)"
+	modprobe_reset
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop busy writing y while loading/unloading the module... "
+
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+
+	write_loop_bg > /dev/null 2>&1 &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0010()
+{
+	TARGET="${DIR}/$(get_test_target 0010)"
+	modprobe_reset
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop busy writing y with a larger delay while loading/unloading the module... "
+	modprobe -q -r $TEST_DRIVER > /dev/null 2>&1
+
+	MODPROBE_ARGS="write_delay_msec_y=1500"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	write_loop_bg > /dev/null 2>&1 &
+	write_pid=$!
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0011()
+{
+	TARGET="${DIR}/$(get_test_target 0011)"
+	modprobe_reset_enable_debugfs
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing x and resetting ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	reset_loop > /dev/null 2>&1 &
+	reset_pid=$!
+
+	kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0012()
+{
+	TARGET="${DIR}/$(get_test_target 0012)"
+	modprobe_reset_enable_debugfs
+	config_reset
+	reset_vals
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y and resetting ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	reset_loop > /dev/null 2>&1 &
+	reset_pid=$!
+
+	kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0013()
+{
+	TARGET="${DIR}/$(get_test_target 0013)"
+	modprobe_reset_enable_debugfs
+	config_reset
+	reset_vals
+	config_write_delay_msec_y 1500
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Loop writing y with a larger delay and resetting ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	reset_loop > /dev/null 2>&1 &
+	reset_pid=$!
+
+	kill_trigger_loop $write_pid $reset_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0014()
+{
+	sysfs_test_0001
+}
+
+sysfs_test_0015()
+{
+	sysfs_test_0002
+}
+
+sysfs_test_0016()
+{
+	sysfs_test_0003
+}
+
+sysfs_test_0017()
+{
+	sysfs_test_0004
+}
+
+sysfs_test_0018()
+{
+	sysfs_test_0005
+}
+
+sysfs_test_0019()
+{
+	sysfs_test_0006
+}
+
+sysfs_test_0020()
+{
+	sysfs_test_0007
+}
+
+sysfs_test_0021()
+{
+	sysfs_test_0008
+}
+
+sysfs_test_0022()
+{
+	sysfs_test_0009
+}
+
+sysfs_test_0023()
+{
+	sysfs_test_0010
+}
+
+sysfs_test_0024()
+{
+	sysfs_test_0011
+}
+
+sysfs_test_0025()
+{
+	sysfs_test_0012
+}
+
+sysfs_test_0026()
+{
+	sysfs_test_0013
+}
+
+sysfs_test_0027()
+{
+	TARGET="${DIR}/$(get_test_target 0027)"
+	modprobe_reset_enable_lock_on_rmmod
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Test for possible rmmod deadlock while writing x ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 enable_verbose_writes=1"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0028()
+{
+	TARGET="${DIR}/$(get_test_target 0028)"
+	modprobe_reset_enable_lock_on_rmmod
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+	WAIT_TIME=2
+
+	echo -n "Test for possible rmmod deadlock using rtnl_lock while writing x ... "
+
+	write_loop > /dev/null 2>&1 &
+	write_pid=$!
+
+	MODPROBE_ARGS="enable_lock=1 enable_lock_on_rmmod=1 use_rtnl_lock=1 enable_verbose_writes=1"
+	modprobe_loop > /dev/null 2>&1 &
+	modprobe_pid=$!
+	unset MODPROBE_ARGS
+
+	kill_trigger_loop $modprobe_pid $write_pid $WAIT_TIME > /dev/null 2>&1 &
+	kill_pid=$!
+
+	wait $kill_pid > /dev/null 2>&1
+
+	if [[ $? -eq 0 ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+test_gen_desc()
+{
+	echo -n "$1 x $(get_test_count $1)"
+}
+
+list_tests()
+{
+	echo "Test ID list:"
+	echo
+	echo "TEST_ID x NUM_TEST"
+	echo "TEST_ID:   Test ID"
+	echo "NUM_TESTS: Number of recommended times to run the test"
+	echo
+	echo "$(test_gen_desc 0001) - misc test writing x in different ways"
+	echo "$(test_gen_desc 0002) - misc test writing y in different ways"
+	echo "$(test_gen_desc 0003) - misc test writing x in different ways using a mutex lock"
+	echo "$(test_gen_desc 0004) - misc test writing y in different ways using a mutex lock"
+	echo "$(test_gen_desc 0005) - misc test writing x load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0006) - misc writing y load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0007) - misc test writing y larger delay, load, remove test_sysfs"
+	echo "$(test_gen_desc 0008) - misc test busy writing x remove test_sysfs module"
+	echo "$(test_gen_desc 0009) - misc test busy writing y remove the test_sysfs module"
+	echo "$(test_gen_desc 0010) - misc test busy writing y larger delay, remove test_sysfs"
+	echo "$(test_gen_desc 0011) - misc test writing x and resetting device"
+	echo "$(test_gen_desc 0012) - misc test writing y and resetting device"
+	echo "$(test_gen_desc 0013) - misc test writing y with a larger delay and resetting device"
+	echo "$(test_gen_desc 0014) - block test writing x in different ways"
+	echo "$(test_gen_desc 0015) - block test writing y in different ways"
+	echo "$(test_gen_desc 0016) - block test writing x in different ways using a mutex lock"
+	echo "$(test_gen_desc 0017) - block test writing y in different ways using a mutex lock"
+	echo "$(test_gen_desc 0018) - block test writing x load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0019) - block test writing y load and remove the test_sysfs module"
+	echo "$(test_gen_desc 0020) - block test writing y larger delay, load, remove test_sysfs"
+	echo "$(test_gen_desc 0021) - block test busy writing x remove the test_sysfs module"
+	echo "$(test_gen_desc 0022) - block test busy writing y remove the test_sysfs module"
+	echo "$(test_gen_desc 0023) - block test busy writing y larger delay, remove test_sysfs"
+	echo "$(test_gen_desc 0024) - block test writing x and resetting device"
+	echo "$(test_gen_desc 0025) - block test writing y and resetting device"
+	echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device"
+	echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... "
+	echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..."
+}
+
+usage()
+{
+	NUM_TESTS=$(grep -o ' ' <<<"$ALL_TESTS" | grep -c .)
+	let NUM_TESTS=$NUM_TESTS+1
+	MAX_TEST=$(printf "%04d\n" $NUM_TESTS)
+	echo "Usage: $0 [ -t <4-number-digit> ] | [ -w <4-number-digit> ] |"
+	echo "		 [ -s <4-number-digit> ] | [ -c <4-number-digit> <test- count>"
+	echo "           [ all ] [ -h | --help ] [ -l ]"
+	echo ""
+	echo "Valid tests: 0001-$MAX_TEST"
+	echo ""
+	echo "    all     Runs all tests (default)"
+	echo "    -t      Run test ID the number amount of times is recommended"
+	echo "    -w      Watch test ID run until it runs into an error"
+	echo "    -c      Run test ID once"
+	echo "    -s      Run test ID x test-count number of times"
+	echo "    -l      List all test ID list"
+	echo " -h|--help  Help"
+	echo
+	echo "If an error every occurs execution will immediately terminate."
+	echo "If you are adding a new test try using -w <test-ID> first to"
+	echo "make sure the test passes a series of tests."
+	echo
+	echo Example uses:
+	echo
+	echo "$TEST_NAME.sh            -- executes all tests"
+	echo "$TEST_NAME.sh -t 0002    -- Executes test ID 0002 number of times is recomended"
+	echo "$TEST_NAME.sh -w 0002    -- Watch test ID 0002 run until an error occurs"
+	echo "$TEST_NAME.sh -s 0002    -- Run test ID 0002 once"
+	echo "$TEST_NAME.sh -c 0002 3  -- Run test ID 0002 three times"
+	echo
+	list_tests
+	exit 1
+}
+
+test_num()
+{
+	re='^[0-9]+$'
+	if ! [[ $1 =~ $re ]]; then
+		usage
+	fi
+}
+
+get_test_count()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $2}'
+}
+
+get_test_enabled()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $3}'
+}
+
+get_test_target()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $4}'
+}
+
+get_test_type()
+{
+	test_num $1
+	TEST_NUM=$(echo $1 | sed 's/^0*//')
+	TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$TEST_NUM'}')
+	echo ${TEST_DATA} | awk -F":" '{print $5}'
+}
+
+run_all_tests()
+{
+	for i in $ALL_TESTS ; do
+		TEST_ID=$(echo $i | awk -F":" '{print $1}')
+		ENABLED=$(get_test_enabled $TEST_ID)
+		TEST_COUNT=$(get_test_count $TEST_ID)
+		TEST_TARGET=$(get_test_target $TEST_ID)
+		if [[ $ENABLED -eq "1" ]]; then
+			test_case $TEST_ID $TEST_COUNT $TEST_TARGET
+		else
+			echo -n "Skipping test $TEST_ID as its disabled, likely "
+			echo "could crash your system ..."
+		fi
+	done
+}
+
+watch_log()
+{
+	if [ $# -ne 3 ]; then
+		clear
+	fi
+	echo "Running test: $2 - run #$1"
+}
+
+watch_case()
+{
+	i=0
+	while [ 1 ]; do
+		if [ $# -eq 1 ]; then
+			test_num $1
+			watch_log $i ${TEST_NAME}_test_$1
+			log_kernel_fstest_dmesg sysfs $1
+			RUN_TEST=${TEST_NAME}_test_$1
+			$RUN_TEST
+			check_dmesg
+			if [[ $? -ne 0 ]]; then
+				exit 1
+			fi
+		else
+			watch_log $i all
+			run_all_tests
+		fi
+		let i=$i+1
+	done
+}
+
+test_case()
+{
+	NUM_TESTS=$2
+
+	i=0
+
+	load_modreqs $1
+	if target_exists $3 $1; then
+		return
+	fi
+
+	while [[ $i -lt $NUM_TESTS ]]; do
+		test_num $1
+		watch_log $i ${TEST_NAME}_test_$1 noclear
+		log_kernel_fstest_dmesg sysfs $1
+		RUN_TEST=${TEST_NAME}_test_$1
+		$RUN_TEST
+		let i=$i+1
+	done
+	check_dmesg
+	if [[ $? -ne 0 ]]; then
+		exit 1
+	fi
+}
+
+parse_args()
+{
+	if [ $# -eq 0 ]; then
+		run_all_tests
+	else
+		if [[ "$1" = "all" ]]; then
+			run_all_tests
+		elif [[ "$1" = "-w" ]]; then
+			shift
+			watch_case $@
+		elif [[ "$1" = "-t" ]]; then
+			shift
+			test_num $1
+			test_case $1 $(get_test_count $1) $(get_test_target $1)
+			shift
+		elif [[ "$1" = "-c" ]]; then
+			shift
+			test_num $1
+			test_num $2
+			test_case $1 $2 $(get_test_target $1)
+			shift
+			shift
+		elif [[ "$1" = "-s" ]]; then
+			shift
+			test_case $1 1 $(get_test_target $1)
+			shift
+		elif [[ "$1" = "-l" ]]; then
+			list_tests
+			shift
+		elif [[ "$1" = "-h" || "$1" = "--help" ]]; then
+			usage
+		else
+			usage
+		fi
+	fi
+}
+
+test_reqs
+allow_user_defaults
+
+trap "test_finish" EXIT
+
+parse_args $@
+
+exit 0
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 04/12] kernfs: add initial failure injection support
       [not found] <20210918050430.3671227-1-mcgrof@kernel.org>
@ 2021-09-18  5:04   ` Luis Chamberlain
  2021-09-18  5:04   ` Luis Chamberlain
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams, senozhatsky,
	hch, joe, hkallweit1, axboe, jpoimboe, tglx, keescook, rostedt,
	peterz, linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, linux-kernel, copyleft-next

This adds initial failure injection support to kernfs. We start
off with debug knobs which when enabled allow test drivers, such as
test_sysfs, to then make use of these to try to force certain
difficult races to take place with a high degree of certainty.

This only adds runtime code *iff* the new bool CONFIG_FAIL_KERNFS_KNOBS is
enabled in your kernel. If you don't have this enabled this provides
no new functional. When CONFIG_FAIL_KERNFS_KNOBS is disabled the new
routine kernfs_debug_should_wait() ends up being transformed to if
(false), and so the compiler should optimize these out as dead code
producing no new effective binary changes.

We start off with enabling failure injections in kernfs by allowing us to
alter the way kernfs_fop_write_iter() behaves. We allow for the routine
kernfs_fop_write_iter() to wait for a certain condition in the kernel to
occur, after which it will sleep a predefined amount of time. This lets
kernfs users to time exactly when it want kernfs_fop_write_iter() to
complete, allowing for developing race conditions and test for correctness
in kernfs.

You'd boot with this enabled on your kernel command line:

fail_kernfs_fop_write_iter=1,100,0,1

The values are <interval,probability,size,times>, we don't care for
size, so for now we ignore it. The above ensures a failure will trigger
only once.

*How* we allow for this routine to change behaviour is left to knobs we
expose under debugfs:

 # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
wait_after_active
wait_after_mutex
wait_at_start
wait_before_mutex

A debugfs entry also exists to allow us to sleep a configurabler amount
of time after the completion:

/sys/kernel/debug/kernfs/sleep_after_wait_ms

These two sets of knobs allow us to construct races and demonstrate
how the kernfs active reference should suffice to project against
races.

Enabling CONFIG_FAULT_INJECTION_DEBUG_FS enables us to configure the
differnt fault injection parametres for the new fail_kernfs_fop_write_iter
fault injection at run time:

ls -1 /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
interval
probability
space
task-filter
times
verbose
verbose_ratelimit_burst
verbose_ratelimit_interval_ms

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 .../fault-injection/fault-injection.rst       | 22 +++++
 MAINTAINERS                                   |  2 +-
 fs/kernfs/Makefile                            |  1 +
 fs/kernfs/failure-injection.c                 | 91 +++++++++++++++++++
 fs/kernfs/file.c                              | 13 +++
 fs/kernfs/kernfs-internal.h                   | 72 +++++++++++++++
 include/linux/kernfs.h                        |  5 +
 lib/Kconfig.debug                             | 10 ++
 8 files changed, 215 insertions(+), 1 deletion(-)
 create mode 100644 fs/kernfs/failure-injection.c

diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst
index 4a25c5eb6f07..d4d34b082f47 100644
--- a/Documentation/fault-injection/fault-injection.rst
+++ b/Documentation/fault-injection/fault-injection.rst
@@ -28,6 +28,28 @@ Available fault injection capabilities
 
   injects kernel RPC client and server failures.
 
+- fail_kernfs_fop_write_iter
+
+  Allows for failures to be enabled inside kernfs_fop_write_iter(). Enabling
+  this does not immediately enable any errors to occur. You must configure
+  how you want this routine to fail or change behaviour by using the debugfs
+  knobs for it:
+
+  # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
+  wait_after_active
+  wait_after_mutex
+  wait_at_start
+  wait_before_mutex
+
+  You can also configure how long to sleep after a wait under
+
+  /sys/kernel/debug/kernfs/sleep_after_wait_ms
+
+  If you enable CONFIG_FAULT_INJECTION_DEBUG_FS the fail_add_disk failure
+  injection parameters are placed under:
+
+  /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
+
 - fail_make_request
 
   injects disk IO errors on devices permitted by setting
diff --git a/MAINTAINERS b/MAINTAINERS
index 28a34384f541..acdbf91058d5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10341,7 +10341,7 @@ M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 M:	Tejun Heo <tj@kernel.org>
 S:	Supported
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git
-F:	fs/kernfs/
+F:	fs/kernfs/*
 F:	include/linux/kernfs.h
 
 KEXEC
diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
index 4ca54ff54c98..bc5b32ca39f9 100644
--- a/fs/kernfs/Makefile
+++ b/fs/kernfs/Makefile
@@ -4,3 +4,4 @@
 #
 
 obj-y		:= mount.o inode.o dir.o file.o symlink.o
+obj-$(CONFIG_FAIL_KERNFS_KNOBS)    += failure-injection.o
diff --git a/fs/kernfs/failure-injection.c b/fs/kernfs/failure-injection.c
new file mode 100644
index 000000000000..4130d202c13b
--- /dev/null
+++ b/fs/kernfs/failure-injection.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/fault-inject.h>
+#include <linux/delay.h>
+
+#include "kernfs-internal.h"
+
+static DECLARE_FAULT_ATTR(fail_kernfs_fop_write_iter);
+struct kernfs_config_fail kernfs_config_fail;
+
+#define kernfs_config_fail(when) \
+	kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+#define kernfs_config_fail(when) \
+	kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+static int __init setup_fail_kernfs_fop_write_iter(char *str)
+{
+	return setup_fault_attr(&fail_kernfs_fop_write_iter, str);
+}
+
+__setup("fail_kernfs_fop_write_iter=", setup_fail_kernfs_fop_write_iter);
+
+struct dentry *kernfs_debugfs_root;
+struct dentry *config_fail_kernfs_fop_write_iter;
+
+static int __init kernfs_init_failure_injection(void)
+{
+	kernfs_config_fail.sleep_after_wait_ms = 100;
+	kernfs_debugfs_root = debugfs_create_dir("kernfs", NULL);
+
+	fault_create_debugfs_attr("fail_kernfs_fop_write_iter",
+				  kernfs_debugfs_root, &fail_kernfs_fop_write_iter);
+
+	config_fail_kernfs_fop_write_iter =
+		debugfs_create_dir("config_fail_kernfs_fop_write_iter",
+				   kernfs_debugfs_root);
+
+	debugfs_create_u32("sleep_after_wait_ms", 0600,
+			   kernfs_debugfs_root,
+			   &kernfs_config_fail.sleep_after_wait_ms);
+
+	debugfs_create_bool("wait_at_start", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(at_start));
+	debugfs_create_bool("wait_before_mutex", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(before_mutex));
+	debugfs_create_bool("wait_after_mutex", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(after_mutex));
+	debugfs_create_bool("wait_after_active", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(after_active));
+	return 0;
+}
+late_initcall(kernfs_init_failure_injection);
+
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate)
+{
+	if (!evaluate)
+		return 0;
+
+	return should_fail(&fail_kernfs_fop_write_iter, 0);
+}
+
+DECLARE_COMPLETION(kernfs_debug_wait_completion);
+EXPORT_SYMBOL_NS_GPL(kernfs_debug_wait_completion, KERNFS_DEBUG_PRIVATE);
+
+void kernfs_debug_wait(void)
+{
+	unsigned long timeout;
+
+	timeout = wait_for_completion_timeout(&kernfs_debug_wait_completion,
+					      msecs_to_jiffies(3000));
+	if (!timeout)
+		pr_info("%s waiting for kernfs_debug_wait_completion timed out\n",
+			__func__);
+	else
+		pr_info("%s received completion with time left on timeout %u ms\n",
+			__func__, jiffies_to_msecs(timeout));
+
+	/**
+	 * The goal is wait for an event, and *then* once we have
+	 * reached it, the other side will try to do something which
+	 * it thinks will break. So we must give it some time to do
+	 * that. The amount of time is configurable.
+	 */
+	msleep(kernfs_config_fail.sleep_after_wait_ms);
+	pr_info("%s ended\n", __func__);
+}
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 60e2a86c535e..4479c6580333 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -259,6 +259,9 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	const struct kernfs_ops *ops;
 	char *buf;
 
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, at_start))
+		kernfs_debug_wait();
+
 	if (of->atomic_write_len) {
 		if (len > of->atomic_write_len)
 			return -E2BIG;
@@ -280,17 +283,27 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	}
 	buf[len] = '\0';	/* guarantee string termination */
 
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, before_mutex))
+		kernfs_debug_wait();
+
 	/*
 	 * @of->mutex nests outside active ref and is used both to ensure that
 	 * the ops aren't called concurrently for the same open file.
 	 */
 	mutex_lock(&of->mutex);
+
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_mutex))
+		kernfs_debug_wait();
+
 	if (!kernfs_get_active(of->kn)) {
 		mutex_unlock(&of->mutex);
 		len = -ENODEV;
 		goto out_free;
 	}
 
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_active))
+		kernfs_debug_wait();
+
 	ops = kernfs_ops(of->kn);
 	if (ops->write)
 		len = ops->write(of, buf, len, iocb->ki_pos);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index f9cc912c31e1..9e3abf597e2d 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -18,6 +18,7 @@
 
 #include <linux/kernfs.h>
 #include <linux/fs_context.h>
+#include <linux/stringify.h>
 
 struct kernfs_iattrs {
 	kuid_t			ia_uid;
@@ -147,4 +148,75 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
  */
 extern const struct inode_operations kernfs_symlink_iops;
 
+/*
+ * failure-injection.c
+ */
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+
+/**
+ * struct kernfs_fop_write_iter_fail - how kernfs_fop_write_iter_fail fails
+ *
+ * This lets you configure what part of kernfs_fop_write_iter() should behave
+ * in a specific way to allow userspace to capture possible failures in
+ * kernfs. The wait knobs are allowed to let you design capture possible
+ * race conditions which would otherwise be difficult to reproduce. A
+ * secondary driver would tell kernfs's wait completion when it is done.
+ *
+ * The point to the wait completion failure injection tests are to confirm
+ * that the kernfs active refcount suffice to ensure other objects in other
+ * layers are also gauranteed to exist, even they are opaque to kernfs. This
+ * includes kobjects, devices, and other objects built on top of this, like
+ * the block layer when using sysfs block device attributes.
+ *
+ * @wait_at_start: waits for completion from a third party at the start of
+ *	the routine.
+ * @wait_before_mutex: waits for completion from a third party before we
+ *	are allowed to continue before the of->mutex is held.
+ * @wait_after_mutex: waits for completion from a third party after we
+ *	have held the of->mutex.
+ * @wait_after_active: waits for completion from a thid party after we
+ *	have refcounted the struct kernfs_node.
+ */
+struct kernfs_fop_write_iter_fail {
+	bool wait_at_start;
+	bool wait_before_mutex;
+	bool wait_after_mutex;
+	bool wait_after_active;
+};
+
+/**
+ * struct kernfs_config_fail - kernfs configuration for failure injection
+ *
+ * You can kernfs failure injection on boot, and in particular we currently
+ * only support failures for kernfs_fop_write_iter(). However, we don't
+ * want to always enable errors on this call when failure injection is enabled
+ * as this routine is used by many parts of the kernel for proper functionality.
+ * The compromise we make is we let userspace start enabling which parts it
+ * wants to fail after boot, if and only if failure injection has been enabled.
+ *
+ * @kernfs_fop_write_iter_fail: configuration for how we want to allow
+ *	for failure injection on kernfs_fop_write_iter()
+ * @sleep_after_wait_ms: how many ms to wait after completion is received.
+ */
+struct kernfs_config_fail {
+	struct kernfs_fop_write_iter_fail kernfs_fop_write_iter_fail;
+	u32 sleep_after_wait_ms;
+};
+
+extern struct kernfs_config_fail kernfs_config_fail;
+
+#define __kernfs_config_wait_var(func, when) \
+	(kernfs_config_fail.  func  ## _fail.wait_  ## when)
+#define __kernfs_debug_should_wait_func_name(func) __kernfs_debug_should_wait_## func
+
+#define kernfs_debug_should_wait(func, when) \
+	__kernfs_debug_should_wait_func_name(func)(__kernfs_config_wait_var(func, when))
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate);
+void kernfs_debug_wait(void);
+#else
+static inline void kernfs_init_failure_injection(void) {}
+#define kernfs_debug_should_wait(func, when) (false)
+static inline void kernfs_debug_wait(void) {}
+#endif
+
 #endif	/* __KERNFS_INTERNAL_H */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3ccce6f24548..cd968ee2b503 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -411,6 +411,11 @@ void kernfs_init(void);
 
 struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 						   u64 id);
+
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+extern struct completion kernfs_debug_wait_completion;
+#endif
+
 #else	/* CONFIG_KERNFS */
 
 static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ae19bf1a21b8..a29b7d398c4e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1902,6 +1902,16 @@ config FAULT_INJECTION_USERCOPY
 	  Provides fault-injection capability to inject failures
 	  in usercopy functions (copy_from_user(), get_user(), ...).
 
+config FAIL_KERNFS_KNOBS
+	bool "Fault-injection support in kernfs"
+	depends on FAULT_INJECTION
+	help
+	  Provide fault-injection capability for kernfs. This only enables
+	  the error injection functionality. To use it you must configure which
+	  which path you want to trigger on error on using debugfs under
+	  /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/. By
+	  default all of these are disabled.
+
 config FAIL_MAKE_REQUEST
 	bool "Fault-injection capability for disk IO"
 	depends on FAULT_INJECTION && BLOCK
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 04/12] kernfs: add initial failure injection support
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams

This adds initial failure injection support to kernfs. We start
off with debug knobs which when enabled allow test drivers, such as
test_sysfs, to then make use of these to try to force certain
difficult races to take place with a high degree of certainty.

This only adds runtime code *iff* the new bool CONFIG_FAIL_KERNFS_KNOBS is
enabled in your kernel. If you don't have this enabled this provides
no new functional. When CONFIG_FAIL_KERNFS_KNOBS is disabled the new
routine kernfs_debug_should_wait() ends up being transformed to if
(false), and so the compiler should optimize these out as dead code
producing no new effective binary changes.

We start off with enabling failure injections in kernfs by allowing us to
alter the way kernfs_fop_write_iter() behaves. We allow for the routine
kernfs_fop_write_iter() to wait for a certain condition in the kernel to
occur, after which it will sleep a predefined amount of time. This lets
kernfs users to time exactly when it want kernfs_fop_write_iter() to
complete, allowing for developing race conditions and test for correctness
in kernfs.

You'd boot with this enabled on your kernel command line:

fail_kernfs_fop_write_iter=1,100,0,1

The values are <interval,probability,size,times>, we don't care for
size, so for now we ignore it. The above ensures a failure will trigger
only once.

*How* we allow for this routine to change behaviour is left to knobs we
expose under debugfs:

 # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
wait_after_active
wait_after_mutex
wait_at_start
wait_before_mutex

A debugfs entry also exists to allow us to sleep a configurabler amount
of time after the completion:

/sys/kernel/debug/kernfs/sleep_after_wait_ms

These two sets of knobs allow us to construct races and demonstrate
how the kernfs active reference should suffice to project against
races.

Enabling CONFIG_FAULT_INJECTION_DEBUG_FS enables us to configure the
differnt fault injection parametres for the new fail_kernfs_fop_write_iter
fault injection at run time:

ls -1 /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
interval
probability
space
task-filter
times
verbose
verbose_ratelimit_burst
verbose_ratelimit_interval_ms

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 .../fault-injection/fault-injection.rst       | 22 +++++
 MAINTAINERS                                   |  2 +-
 fs/kernfs/Makefile                            |  1 +
 fs/kernfs/failure-injection.c                 | 91 +++++++++++++++++++
 fs/kernfs/file.c                              | 13 +++
 fs/kernfs/kernfs-internal.h                   | 72 +++++++++++++++
 include/linux/kernfs.h                        |  5 +
 lib/Kconfig.debug                             | 10 ++
 8 files changed, 215 insertions(+), 1 deletion(-)
 create mode 100644 fs/kernfs/failure-injection.c

diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst
index 4a25c5eb6f07..d4d34b082f47 100644
--- a/Documentation/fault-injection/fault-injection.rst
+++ b/Documentation/fault-injection/fault-injection.rst
@@ -28,6 +28,28 @@ Available fault injection capabilities
 
   injects kernel RPC client and server failures.
 
+- fail_kernfs_fop_write_iter
+
+  Allows for failures to be enabled inside kernfs_fop_write_iter(). Enabling
+  this does not immediately enable any errors to occur. You must configure
+  how you want this routine to fail or change behaviour by using the debugfs
+  knobs for it:
+
+  # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
+  wait_after_active
+  wait_after_mutex
+  wait_at_start
+  wait_before_mutex
+
+  You can also configure how long to sleep after a wait under
+
+  /sys/kernel/debug/kernfs/sleep_after_wait_ms
+
+  If you enable CONFIG_FAULT_INJECTION_DEBUG_FS the fail_add_disk failure
+  injection parameters are placed under:
+
+  /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
+
 - fail_make_request
 
   injects disk IO errors on devices permitted by setting
diff --git a/MAINTAINERS b/MAINTAINERS
index 28a34384f541..acdbf91058d5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10341,7 +10341,7 @@ M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 M:	Tejun Heo <tj@kernel.org>
 S:	Supported
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git
-F:	fs/kernfs/
+F:	fs/kernfs/*
 F:	include/linux/kernfs.h
 
 KEXEC
diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
index 4ca54ff54c98..bc5b32ca39f9 100644
--- a/fs/kernfs/Makefile
+++ b/fs/kernfs/Makefile
@@ -4,3 +4,4 @@
 #
 
 obj-y		:= mount.o inode.o dir.o file.o symlink.o
+obj-$(CONFIG_FAIL_KERNFS_KNOBS)    += failure-injection.o
diff --git a/fs/kernfs/failure-injection.c b/fs/kernfs/failure-injection.c
new file mode 100644
index 000000000000..4130d202c13b
--- /dev/null
+++ b/fs/kernfs/failure-injection.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/fault-inject.h>
+#include <linux/delay.h>
+
+#include "kernfs-internal.h"
+
+static DECLARE_FAULT_ATTR(fail_kernfs_fop_write_iter);
+struct kernfs_config_fail kernfs_config_fail;
+
+#define kernfs_config_fail(when) \
+	kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+#define kernfs_config_fail(when) \
+	kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+static int __init setup_fail_kernfs_fop_write_iter(char *str)
+{
+	return setup_fault_attr(&fail_kernfs_fop_write_iter, str);
+}
+
+__setup("fail_kernfs_fop_write_iter=", setup_fail_kernfs_fop_write_iter);
+
+struct dentry *kernfs_debugfs_root;
+struct dentry *config_fail_kernfs_fop_write_iter;
+
+static int __init kernfs_init_failure_injection(void)
+{
+	kernfs_config_fail.sleep_after_wait_ms = 100;
+	kernfs_debugfs_root = debugfs_create_dir("kernfs", NULL);
+
+	fault_create_debugfs_attr("fail_kernfs_fop_write_iter",
+				  kernfs_debugfs_root, &fail_kernfs_fop_write_iter);
+
+	config_fail_kernfs_fop_write_iter =
+		debugfs_create_dir("config_fail_kernfs_fop_write_iter",
+				   kernfs_debugfs_root);
+
+	debugfs_create_u32("sleep_after_wait_ms", 0600,
+			   kernfs_debugfs_root,
+			   &kernfs_config_fail.sleep_after_wait_ms);
+
+	debugfs_create_bool("wait_at_start", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(at_start));
+	debugfs_create_bool("wait_before_mutex", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(before_mutex));
+	debugfs_create_bool("wait_after_mutex", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(after_mutex));
+	debugfs_create_bool("wait_after_active", 0600,
+			    config_fail_kernfs_fop_write_iter,
+			    &kernfs_config_fail(after_active));
+	return 0;
+}
+late_initcall(kernfs_init_failure_injection);
+
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate)
+{
+	if (!evaluate)
+		return 0;
+
+	return should_fail(&fail_kernfs_fop_write_iter, 0);
+}
+
+DECLARE_COMPLETION(kernfs_debug_wait_completion);
+EXPORT_SYMBOL_NS_GPL(kernfs_debug_wait_completion, KERNFS_DEBUG_PRIVATE);
+
+void kernfs_debug_wait(void)
+{
+	unsigned long timeout;
+
+	timeout = wait_for_completion_timeout(&kernfs_debug_wait_completion,
+					      msecs_to_jiffies(3000));
+	if (!timeout)
+		pr_info("%s waiting for kernfs_debug_wait_completion timed out\n",
+			__func__);
+	else
+		pr_info("%s received completion with time left on timeout %u ms\n",
+			__func__, jiffies_to_msecs(timeout));
+
+	/**
+	 * The goal is wait for an event, and *then* once we have
+	 * reached it, the other side will try to do something which
+	 * it thinks will break. So we must give it some time to do
+	 * that. The amount of time is configurable.
+	 */
+	msleep(kernfs_config_fail.sleep_after_wait_ms);
+	pr_info("%s ended\n", __func__);
+}
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 60e2a86c535e..4479c6580333 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -259,6 +259,9 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	const struct kernfs_ops *ops;
 	char *buf;
 
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, at_start))
+		kernfs_debug_wait();
+
 	if (of->atomic_write_len) {
 		if (len > of->atomic_write_len)
 			return -E2BIG;
@@ -280,17 +283,27 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	}
 	buf[len] = '\0';	/* guarantee string termination */
 
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, before_mutex))
+		kernfs_debug_wait();
+
 	/*
 	 * @of->mutex nests outside active ref and is used both to ensure that
 	 * the ops aren't called concurrently for the same open file.
 	 */
 	mutex_lock(&of->mutex);
+
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_mutex))
+		kernfs_debug_wait();
+
 	if (!kernfs_get_active(of->kn)) {
 		mutex_unlock(&of->mutex);
 		len = -ENODEV;
 		goto out_free;
 	}
 
+	if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_active))
+		kernfs_debug_wait();
+
 	ops = kernfs_ops(of->kn);
 	if (ops->write)
 		len = ops->write(of, buf, len, iocb->ki_pos);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index f9cc912c31e1..9e3abf597e2d 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -18,6 +18,7 @@
 
 #include <linux/kernfs.h>
 #include <linux/fs_context.h>
+#include <linux/stringify.h>
 
 struct kernfs_iattrs {
 	kuid_t			ia_uid;
@@ -147,4 +148,75 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
  */
 extern const struct inode_operations kernfs_symlink_iops;
 
+/*
+ * failure-injection.c
+ */
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+
+/**
+ * struct kernfs_fop_write_iter_fail - how kernfs_fop_write_iter_fail fails
+ *
+ * This lets you configure what part of kernfs_fop_write_iter() should behave
+ * in a specific way to allow userspace to capture possible failures in
+ * kernfs. The wait knobs are allowed to let you design capture possible
+ * race conditions which would otherwise be difficult to reproduce. A
+ * secondary driver would tell kernfs's wait completion when it is done.
+ *
+ * The point to the wait completion failure injection tests are to confirm
+ * that the kernfs active refcount suffice to ensure other objects in other
+ * layers are also gauranteed to exist, even they are opaque to kernfs. This
+ * includes kobjects, devices, and other objects built on top of this, like
+ * the block layer when using sysfs block device attributes.
+ *
+ * @wait_at_start: waits for completion from a third party at the start of
+ *	the routine.
+ * @wait_before_mutex: waits for completion from a third party before we
+ *	are allowed to continue before the of->mutex is held.
+ * @wait_after_mutex: waits for completion from a third party after we
+ *	have held the of->mutex.
+ * @wait_after_active: waits for completion from a thid party after we
+ *	have refcounted the struct kernfs_node.
+ */
+struct kernfs_fop_write_iter_fail {
+	bool wait_at_start;
+	bool wait_before_mutex;
+	bool wait_after_mutex;
+	bool wait_after_active;
+};
+
+/**
+ * struct kernfs_config_fail - kernfs configuration for failure injection
+ *
+ * You can kernfs failure injection on boot, and in particular we currently
+ * only support failures for kernfs_fop_write_iter(). However, we don't
+ * want to always enable errors on this call when failure injection is enabled
+ * as this routine is used by many parts of the kernel for proper functionality.
+ * The compromise we make is we let userspace start enabling which parts it
+ * wants to fail after boot, if and only if failure injection has been enabled.
+ *
+ * @kernfs_fop_write_iter_fail: configuration for how we want to allow
+ *	for failure injection on kernfs_fop_write_iter()
+ * @sleep_after_wait_ms: how many ms to wait after completion is received.
+ */
+struct kernfs_config_fail {
+	struct kernfs_fop_write_iter_fail kernfs_fop_write_iter_fail;
+	u32 sleep_after_wait_ms;
+};
+
+extern struct kernfs_config_fail kernfs_config_fail;
+
+#define __kernfs_config_wait_var(func, when) \
+	(kernfs_config_fail.  func  ## _fail.wait_  ## when)
+#define __kernfs_debug_should_wait_func_name(func) __kernfs_debug_should_wait_## func
+
+#define kernfs_debug_should_wait(func, when) \
+	__kernfs_debug_should_wait_func_name(func)(__kernfs_config_wait_var(func, when))
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate);
+void kernfs_debug_wait(void);
+#else
+static inline void kernfs_init_failure_injection(void) {}
+#define kernfs_debug_should_wait(func, when) (false)
+static inline void kernfs_debug_wait(void) {}
+#endif
+
 #endif	/* __KERNFS_INTERNAL_H */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3ccce6f24548..cd968ee2b503 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -411,6 +411,11 @@ void kernfs_init(void);
 
 struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 						   u64 id);
+
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+extern struct completion kernfs_debug_wait_completion;
+#endif
+
 #else	/* CONFIG_KERNFS */
 
 static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ae19bf1a21b8..a29b7d398c4e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1902,6 +1902,16 @@ config FAULT_INJECTION_USERCOPY
 	  Provides fault-injection capability to inject failures
 	  in usercopy functions (copy_from_user(), get_user(), ...).
 
+config FAIL_KERNFS_KNOBS
+	bool "Fault-injection support in kernfs"
+	depends on FAULT_INJECTION
+	help
+	  Provide fault-injection capability for kernfs. This only enables
+	  the error injection functionality. To use it you must configure which
+	  which path you want to trigger on error on using debugfs under
+	  /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/. By
+	  default all of these are disabled.
+
 config FAIL_MAKE_REQUEST
 	bool "Fault-injection capability for disk IO"
 	depends on FAULT_INJECTION && BLOCK
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 05/12] test_sysfs: add support to use kernfs failure injection
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams, senozhatsky,
	hch, joe, hkallweit1, axboe, jpoimboe, tglx, keescook, rostedt,
	peterz, linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, linux-kernel, copyleft-next

This extends test_sysfs with support for using the failure injection
wait completion and knobs to force a few race conditions which
demonstrates that kernfs active reference protection is sufficient
for kobject / device protection at higher layers.

This adds 4 new tests which tries to remove the device attribute
store operation in 4 different situations:

  1) at the start of kernfs_kernfs_fop_write_iter()
  2) before the of->mutex is held in kernfs_kernfs_fop_write_iter()
  3) after the of->mutex is held in kernfs_kernfs_fop_write_iter()
  4) after the kernfs node active reference is taken

A write fails in call cases except the last one, test number #32. There
is a good explanation for this: *once* kernfs_get_active() gets called
we have a guarantee that the kernfs entry cannot be removed. If
kernfs_get_active() succeeds that entry cannot be removed and so
anything trying to remove that entry will have to wait. It is perhaps
not obvious but since a sysfs write will trigger eventually a
kernfs_get_active() call, and *only* if this succeeds will the sysfs
op be called, this and the fact that you cannot remove the kernfs
entry while the kenfs entry is active implies that a module that
created the respective sysfs / kernfs entry *cannot* possibly be
removed during a sysfs operation. And test number 32 provides us with
proof of this. If it were not true test #32 should crash.

No null dereferences are reproduced, even though this has been observed
in some complex testing cases [0]. If this issue really exists we should
have enough tools on the sysfs_test toolbox now to try to reproduce
this easily without having to poke around other drivers. It very likley
was the case that the issue reported [0] was possibly a side issue after
the first bug which was zram specific. This is why it is important to
isolate the issue and try to reproduce it in a generic form using the
test_sysfs driver.

[0] https://lkml.kernel.org/r/20210623215007.862787-1-mcgrof@kernel.org

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 lib/Kconfig.debug                      |   3 +
 lib/test_sysfs.c                       |  31 +++++
 tools/testing/selftests/sysfs/config   |   3 +
 tools/testing/selftests/sysfs/sysfs.sh | 175 +++++++++++++++++++++++++
 4 files changed, 212 insertions(+)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a29b7d398c4e..176b822654e5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2358,6 +2358,9 @@ config TEST_SYSFS
 	depends on SYSFS
 	depends on NET
 	depends on BLOCK
+	select FAULT_INJECTION
+	select FAULT_INJECTION_DEBUG_FS
+	select FAIL_KERNFS_KNOBS
 	help
 	  This builds the "test_sysfs" module. This driver enables to test the
 	  sysfs file system safely without affecting production knobs which
diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
index 273fc3f39740..391e0af2864a 100644
--- a/lib/test_sysfs.c
+++ b/lib/test_sysfs.c
@@ -38,6 +38,11 @@
 #include <linux/rtnetlink.h>
 #include <linux/genhd.h>
 #include <linux/blkdev.h>
+#include <linux/kernfs.h>
+
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+MODULE_IMPORT_NS(KERNFS_DEBUG_PRIVATE);
+#endif
 
 static bool enable_lock;
 module_param(enable_lock, bool_enable_only, 0644);
@@ -82,6 +87,13 @@ static bool enable_verbose_rmmod;
 module_param(enable_verbose_rmmod, bool_enable_only, 0644);
 MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod");
 
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+static bool enable_completion_on_rmmod;
+module_param(enable_completion_on_rmmod, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_completion_on_rmmod,
+		 "enable sending a kernfs completion on rmmod");
+#endif
+
 static int sysfs_test_major;
 
 /**
@@ -289,6 +301,12 @@ static ssize_t config_show(struct device *dev,
 			"enable_verbose_writes:\t%s\n",
 			enable_verbose_writes ? "true" : "false");
 
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_completion_on_rmmod:\t%s\n",
+			enable_completion_on_rmmod ? "true" : "false");
+#endif
+
 	test_dev_config_unlock(test_dev);
 
 	return len;
@@ -926,10 +944,23 @@ static int __init test_sysfs_init(void)
 }
 module_init(test_sysfs_init);
 
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+/* The goal is to race our device removal with a pending kernfs -> store call */
+static void test_sysfs_kernfs_send_completion_rmmod(void)
+{
+	if (!enable_completion_on_rmmod)
+		return;
+	complete(&kernfs_debug_wait_completion);
+}
+#else
+static inline void test_sysfs_kernfs_send_completion_rmmod(void) {}
+#endif
+
 static void __exit test_sysfs_exit(void)
 {
 	if (enable_debugfs)
 		debugfs_remove(debugfs_dir);
+	test_sysfs_kernfs_send_completion_rmmod();
 	if (delay_rmmod_ms)
 		msleep(delay_rmmod_ms);
 	unregister_test_dev_sysfs(first_test_dev);
diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config
index 9196f452ecd5..2876a229f95b 100644
--- a/tools/testing/selftests/sysfs/config
+++ b/tools/testing/selftests/sysfs/config
@@ -1,2 +1,5 @@
 CONFIG_SYSFS=m
 CONFIG_TEST_SYSFS=m
+CONFIG_FAULT_INJECTION=y
+CONFIG_FAULT_INJECTION_DEBUG_FS=y
+CONFIG_FAIL_KERNFS_KNOBS=y
diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
index b3f4c2236c7f..f928635d0e35 100755
--- a/tools/testing/selftests/sysfs/sysfs.sh
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -62,6 +62,10 @@ ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
 ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
+ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
+ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
+ALL_TESTS="$ALL_TESTS 0032:1:1:test_dev_x:block" # kernfs race removal after active
 
 allow_user_defaults()
 {
@@ -92,6 +96,9 @@ allow_user_defaults()
 	if [ -z $SYSFS_DEBUGFS_DIR ]; then
 		SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs"
 	fi
+	if [ -z $KERNFS_DEBUGFS_DIR ]; then
+		KERNFS_DEBUGFS_DIR="/sys/kernel/debug/kernfs"
+	fi
 	if [ -z $PAGE_SIZE ]; then
 		PAGE_SIZE=$(getconf PAGESIZE)
 	fi
@@ -167,6 +174,14 @@ modprobe_reset_enable_rtnl_lock_on_rmmod()
 	unset FIRST_MODPROBE_ARGS
 }
 
+modprobe_reset_enable_completion()
+{
+	FIRST_MODPROBE_ARGS="enable_completion_on_rmmod=1 enable_verbose_writes=1"
+	FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_rmmod=1 delay_rmmod_ms=0"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
 load_req_mod()
 {
 	modprobe_reset
@@ -197,6 +212,63 @@ debugfs_reset_first_test_dev_ignore_errors()
 	echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev
 }
 
+debugfs_kernfs_kernfs_fop_write_iter_exists()
+{
+	KNOB_DIR="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter"
+	if [[ ! -d $KNOB_DIR ]]; then
+		echo "kernfs debugfs does not exist $KNOB_DIR"
+		return 0;
+	fi
+	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+	if [[ ! -d $KNOB_DEBUGFS ]]; then
+		echo -n "kernfs debugfs for coniguring fail_kernfs_fop_write_iter "
+		echo "does not exist $KNOB_DIR"
+		return 0;
+	fi
+	return 1
+}
+
+debugfs_kernfs_kernfs_fop_write_iter_set_fail_once()
+{
+	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+	echo 1 > $KNOB_DEBUGFS/interval
+	echo 100 > $KNOB_DEBUGFS/probability
+	echo 0 > $KNOB_DEBUGFS/space
+	# Disable verbose messages on the kernel ring buffer which may
+	# confuse developers with a kernel panic.
+	echo 0 > $KNOB_DEBUGFS/verbose
+
+	# Fail only once
+	echo 1 > $KNOB_DEBUGFS/times
+}
+
+debugfs_kernfs_kernfs_fop_write_iter_set_fail_never()
+{
+	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+	echo 0 > $KNOB_DEBUGFS/times
+}
+
+debugfs_kernfs_set_wait_ms()
+{
+	SLEEP_AFTER_WAIT_MS="${KERNFS_DEBUGFS_DIR}/sleep_after_wait_ms"
+	echo $1 > $SLEEP_AFTER_WAIT_MS
+}
+
+debugfs_kernfs_disable_wait_kernfs_fop_write_iter()
+{
+	ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_"
+	for KNOB in ${ENABLE_WAIT_KNOB}*; do
+		echo 0 > $KNOB
+	done
+}
+
+debugfs_kernfs_enable_wait_kernfs_fop_write_iter()
+{
+	ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_$1"
+	echo -n "1" > $ENABLE_WAIT_KNOB
+	return $?
+}
+
 set_orig()
 {
 	if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then
@@ -972,6 +1044,105 @@ sysfs_test_0028()
 	fi
 }
 
+sysfs_race_kernfs_kernfs_fop_write_iter()
+{
+	TARGET="${DIR}/$(get_test_target $1)"
+	WAIT_AT=$2
+	EXPECT_WRITE_RETURNS=$3
+	MSDELAY=$4
+
+	modprobe_reset_enable_completion
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	echo -n "Test racing removal of sysfs store op with kernfs $WAIT_AT ... "
+
+	if debugfs_kernfs_kernfs_fop_write_iter_exists; then
+		echo -n "skipping test as CONFIG_FAIL_KERNFS_KNOBS "
+		echo " or CONFIG_FAULT_INJECTION_DEBUG_FS is disabled"
+		return $ksft_skip
+	fi
+
+	# Allow for failing the kernfs_kernfs_fop_write_iter call once,
+	# we'll provide exact context shortly afterwards.
+	debugfs_kernfs_kernfs_fop_write_iter_set_fail_once
+
+	# First disable all waits
+	debugfs_kernfs_disable_wait_kernfs_fop_write_iter
+
+	# Enable a wait_for_completion(&kernfs_debug_wait_completion) at the
+	# specified location inside the kernfs_fop_write_iter() routine
+	debugfs_kernfs_enable_wait_kernfs_fop_write_iter $WAIT_AT
+
+	# Configure kernfs so that after its wait_for_completion() it
+	# will msleep() this amount of time and schedule(). We figure this
+	# will be sufficient time to allow for our module removal to complete.
+	debugfs_kernfs_set_wait_ms $MSDELAY
+
+	# Now we trigger a kernfs write op, which will run kernfs_fop_write_iter,
+	# but will wait until our driver sends a respective completion
+	set_test_ignore_errors &
+	write_pid=$!
+
+	# At this point kernfs_fop_write_iter() hasn't run our op, its
+	# waiting for our completion at the specified time $WAIT_AT.
+	# We now remove our module which will send a
+	# complete(&kernfs_debug_wait_completion) right before we deregister
+	# our device and the sysfs device attributes are removed.
+	#
+	# After the completion is sent, the test_sysfs driver races with
+	# kernfs to do the device deregistration with the kernfs msleep
+	# and schedule(). This should mean we've forced trying to remove the
+	# module prior to allowing kernfs to run our store operation. If the
+	# race did happen we'll panic with a null dereference on the store op.
+	#
+	# If no race happens we should see no write operation triggered.
+	modprobe -r $TEST_DRIVER > /dev/null 2>&1
+
+	debugfs_kernfs_kernfs_fop_write_iter_set_fail_never
+
+	wait $write_pid
+	if [[ $? -eq $EXPECT_WRITE_RETURNS ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0029()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0029 at_start 1 $delay
+	done
+}
+
+sysfs_test_0030()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0030 before_mutex 1 $delay
+	done
+}
+
+sysfs_test_0031()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0031 after_mutex 1 $delay
+	done
+}
+
+# A write only succeeds *iff* a module removal happens *after* the
+# kernfs active reference is obtained with kernfs_get_active().
+sysfs_test_0032()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0032 after_active 0 $delay
+	done
+}
+
 test_gen_desc()
 {
 	echo -n "$1 x $(get_test_count $1)"
@@ -1013,6 +1184,10 @@ list_tests()
 	echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device"
 	echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... "
 	echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..."
+	echo "$(test_gen_desc 0029) - racing removal of store op with kernfs at start"
+	echo "$(test_gen_desc 0030) - racing removal of store op with kernfs before mutex"
+	echo "$(test_gen_desc 0031) - racing removal of store op with kernfs after mutex"
+	echo "$(test_gen_desc 0032) - racing removal of store op with kernfs after active"
 }
 
 usage()
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 05/12] test_sysfs: add support to use kernfs failure injection
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	minchan-DgEjT+Ai2ygdnm+yROfE0A, jeyu-DgEjT+Ai2ygdnm+yROfE0A,
	shuah-DgEjT+Ai2ygdnm+yROfE0A
  Cc: rdunlap-wEGCiKHe2LqWVfeAwA7xHQ, rafael-DgEjT+Ai2ygdnm+yROfE0A,
	masahiroy-DgEjT+Ai2ygdnm+yROfE0A,
	ndesaulniers-hpIqsD4AKlfQT0dZR+AlfA,
	yzaikin-hpIqsD4AKlfQT0dZR+AlfA, nathan-DgEjT+Ai2ygdnm+yROfE0A,
	ojeda-DgEjT+Ai2ygdnm+yROfE0A,
	penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp,
	vitor-HdruG0TBZYlAfugRpC6u6w, elver-hpIqsD4AKlfQT0dZR+AlfA,
	jarkko-DgEjT+Ai2ygdnm+yROfE0A, glider-hpIqsD4AKlfQT0dZR+AlfA,
	rf-yzvPICuk2AA4QjBA90+/kJqQE7yCjDx5,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	David.Laight-ZS65k/vG3HxXrIkS9f7CXA, bvanassche-HInyCGIudOg,
	jolsa-DgEjT+Ai2ygdnm+yROfE0A,
	andriy.shevchenko-VuQAYsv1563Yd54FQh9/CA,
	trishalfonso-hpIqsD4AKlfQT0dZR+AlfA,
	andreyknvl-Re5JQEeQqe8AvxtiuMwx3w, jikos-DgEjT+Ai2ygdnm+yROfE0A,
	mbenes-IBi9RG/b67k, ngupta-KNmc09w0p+Ednm+yROfE0A,
	sergey.senozhatsky.work-Re5JQEeQqe8AvxtiuMwx3w,
	mcgrof-DgEjT+Ai2ygdnm+yROfE0A,
	reinette.chatre-ral2JQCrhuEAvxtiuMwx3w,
	fenghua.yu-ral2JQCrhuEAvxtiuMwx3w, bp-Gina5bIWoIWzQB+pC5nmwQ,
	x86-DgEjT+Ai2ygdnm+yROfE0A, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	lizefan.x-EC8Uxl6Npydl57MIdRCFDg, hannes-druUgvl0LCNAfugRpC6u6w,
	daniel.vetter-/w4YWyX8dFk, bhelgaas-hpIqsD4AKlfQT0dZR+AlfA,
	kw-vYTEC60ixJUAvxtiuMwx3w, dan.j.williams

This extends test_sysfs with support for using the failure injection
wait completion and knobs to force a few race conditions which
demonstrates that kernfs active reference protection is sufficient
for kobject / device protection at higher layers.

This adds 4 new tests which tries to remove the device attribute
store operation in 4 different situations:

  1) at the start of kernfs_kernfs_fop_write_iter()
  2) before the of->mutex is held in kernfs_kernfs_fop_write_iter()
  3) after the of->mutex is held in kernfs_kernfs_fop_write_iter()
  4) after the kernfs node active reference is taken

A write fails in call cases except the last one, test number #32. There
is a good explanation for this: *once* kernfs_get_active() gets called
we have a guarantee that the kernfs entry cannot be removed. If
kernfs_get_active() succeeds that entry cannot be removed and so
anything trying to remove that entry will have to wait. It is perhaps
not obvious but since a sysfs write will trigger eventually a
kernfs_get_active() call, and *only* if this succeeds will the sysfs
op be called, this and the fact that you cannot remove the kernfs
entry while the kenfs entry is active implies that a module that
created the respective sysfs / kernfs entry *cannot* possibly be
removed during a sysfs operation. And test number 32 provides us with
proof of this. If it were not true test #32 should crash.

No null dereferences are reproduced, even though this has been observed
in some complex testing cases [0]. If this issue really exists we should
have enough tools on the sysfs_test toolbox now to try to reproduce
this easily without having to poke around other drivers. It very likley
was the case that the issue reported [0] was possibly a side issue after
the first bug which was zram specific. This is why it is important to
isolate the issue and try to reproduce it in a generic form using the
test_sysfs driver.

[0] https://lkml.kernel.org/r/20210623215007.862787-1-mcgrof-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org

Signed-off-by: Luis Chamberlain <mcgrof-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 lib/Kconfig.debug                      |   3 +
 lib/test_sysfs.c                       |  31 +++++
 tools/testing/selftests/sysfs/config   |   3 +
 tools/testing/selftests/sysfs/sysfs.sh | 175 +++++++++++++++++++++++++
 4 files changed, 212 insertions(+)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a29b7d398c4e..176b822654e5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2358,6 +2358,9 @@ config TEST_SYSFS
 	depends on SYSFS
 	depends on NET
 	depends on BLOCK
+	select FAULT_INJECTION
+	select FAULT_INJECTION_DEBUG_FS
+	select FAIL_KERNFS_KNOBS
 	help
 	  This builds the "test_sysfs" module. This driver enables to test the
 	  sysfs file system safely without affecting production knobs which
diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
index 273fc3f39740..391e0af2864a 100644
--- a/lib/test_sysfs.c
+++ b/lib/test_sysfs.c
@@ -38,6 +38,11 @@
 #include <linux/rtnetlink.h>
 #include <linux/genhd.h>
 #include <linux/blkdev.h>
+#include <linux/kernfs.h>
+
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+MODULE_IMPORT_NS(KERNFS_DEBUG_PRIVATE);
+#endif
 
 static bool enable_lock;
 module_param(enable_lock, bool_enable_only, 0644);
@@ -82,6 +87,13 @@ static bool enable_verbose_rmmod;
 module_param(enable_verbose_rmmod, bool_enable_only, 0644);
 MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod");
 
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+static bool enable_completion_on_rmmod;
+module_param(enable_completion_on_rmmod, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_completion_on_rmmod,
+		 "enable sending a kernfs completion on rmmod");
+#endif
+
 static int sysfs_test_major;
 
 /**
@@ -289,6 +301,12 @@ static ssize_t config_show(struct device *dev,
 			"enable_verbose_writes:\t%s\n",
 			enable_verbose_writes ? "true" : "false");
 
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+	len += snprintf(buf+len, PAGE_SIZE - len,
+			"enable_completion_on_rmmod:\t%s\n",
+			enable_completion_on_rmmod ? "true" : "false");
+#endif
+
 	test_dev_config_unlock(test_dev);
 
 	return len;
@@ -926,10 +944,23 @@ static int __init test_sysfs_init(void)
 }
 module_init(test_sysfs_init);
 
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+/* The goal is to race our device removal with a pending kernfs -> store call */
+static void test_sysfs_kernfs_send_completion_rmmod(void)
+{
+	if (!enable_completion_on_rmmod)
+		return;
+	complete(&kernfs_debug_wait_completion);
+}
+#else
+static inline void test_sysfs_kernfs_send_completion_rmmod(void) {}
+#endif
+
 static void __exit test_sysfs_exit(void)
 {
 	if (enable_debugfs)
 		debugfs_remove(debugfs_dir);
+	test_sysfs_kernfs_send_completion_rmmod();
 	if (delay_rmmod_ms)
 		msleep(delay_rmmod_ms);
 	unregister_test_dev_sysfs(first_test_dev);
diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config
index 9196f452ecd5..2876a229f95b 100644
--- a/tools/testing/selftests/sysfs/config
+++ b/tools/testing/selftests/sysfs/config
@@ -1,2 +1,5 @@
 CONFIG_SYSFS=m
 CONFIG_TEST_SYSFS=m
+CONFIG_FAULT_INJECTION=y
+CONFIG_FAULT_INJECTION_DEBUG_FS=y
+CONFIG_FAIL_KERNFS_KNOBS=y
diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
index b3f4c2236c7f..f928635d0e35 100755
--- a/tools/testing/selftests/sysfs/sysfs.sh
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -62,6 +62,10 @@ ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
 ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
+ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
+ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
+ALL_TESTS="$ALL_TESTS 0032:1:1:test_dev_x:block" # kernfs race removal after active
 
 allow_user_defaults()
 {
@@ -92,6 +96,9 @@ allow_user_defaults()
 	if [ -z $SYSFS_DEBUGFS_DIR ]; then
 		SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs"
 	fi
+	if [ -z $KERNFS_DEBUGFS_DIR ]; then
+		KERNFS_DEBUGFS_DIR="/sys/kernel/debug/kernfs"
+	fi
 	if [ -z $PAGE_SIZE ]; then
 		PAGE_SIZE=$(getconf PAGESIZE)
 	fi
@@ -167,6 +174,14 @@ modprobe_reset_enable_rtnl_lock_on_rmmod()
 	unset FIRST_MODPROBE_ARGS
 }
 
+modprobe_reset_enable_completion()
+{
+	FIRST_MODPROBE_ARGS="enable_completion_on_rmmod=1 enable_verbose_writes=1"
+	FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_rmmod=1 delay_rmmod_ms=0"
+	modprobe_reset
+	unset FIRST_MODPROBE_ARGS
+}
+
 load_req_mod()
 {
 	modprobe_reset
@@ -197,6 +212,63 @@ debugfs_reset_first_test_dev_ignore_errors()
 	echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev
 }
 
+debugfs_kernfs_kernfs_fop_write_iter_exists()
+{
+	KNOB_DIR="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter"
+	if [[ ! -d $KNOB_DIR ]]; then
+		echo "kernfs debugfs does not exist $KNOB_DIR"
+		return 0;
+	fi
+	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+	if [[ ! -d $KNOB_DEBUGFS ]]; then
+		echo -n "kernfs debugfs for coniguring fail_kernfs_fop_write_iter "
+		echo "does not exist $KNOB_DIR"
+		return 0;
+	fi
+	return 1
+}
+
+debugfs_kernfs_kernfs_fop_write_iter_set_fail_once()
+{
+	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+	echo 1 > $KNOB_DEBUGFS/interval
+	echo 100 > $KNOB_DEBUGFS/probability
+	echo 0 > $KNOB_DEBUGFS/space
+	# Disable verbose messages on the kernel ring buffer which may
+	# confuse developers with a kernel panic.
+	echo 0 > $KNOB_DEBUGFS/verbose
+
+	# Fail only once
+	echo 1 > $KNOB_DEBUGFS/times
+}
+
+debugfs_kernfs_kernfs_fop_write_iter_set_fail_never()
+{
+	KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+	echo 0 > $KNOB_DEBUGFS/times
+}
+
+debugfs_kernfs_set_wait_ms()
+{
+	SLEEP_AFTER_WAIT_MS="${KERNFS_DEBUGFS_DIR}/sleep_after_wait_ms"
+	echo $1 > $SLEEP_AFTER_WAIT_MS
+}
+
+debugfs_kernfs_disable_wait_kernfs_fop_write_iter()
+{
+	ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_"
+	for KNOB in ${ENABLE_WAIT_KNOB}*; do
+		echo 0 > $KNOB
+	done
+}
+
+debugfs_kernfs_enable_wait_kernfs_fop_write_iter()
+{
+	ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_$1"
+	echo -n "1" > $ENABLE_WAIT_KNOB
+	return $?
+}
+
 set_orig()
 {
 	if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then
@@ -972,6 +1044,105 @@ sysfs_test_0028()
 	fi
 }
 
+sysfs_race_kernfs_kernfs_fop_write_iter()
+{
+	TARGET="${DIR}/$(get_test_target $1)"
+	WAIT_AT=$2
+	EXPECT_WRITE_RETURNS=$3
+	MSDELAY=$4
+
+	modprobe_reset_enable_completion
+	ORIG=$(cat "${TARGET}")
+	TEST_STR=$(( $ORIG + 1 ))
+
+	echo -n "Test racing removal of sysfs store op with kernfs $WAIT_AT ... "
+
+	if debugfs_kernfs_kernfs_fop_write_iter_exists; then
+		echo -n "skipping test as CONFIG_FAIL_KERNFS_KNOBS "
+		echo " or CONFIG_FAULT_INJECTION_DEBUG_FS is disabled"
+		return $ksft_skip
+	fi
+
+	# Allow for failing the kernfs_kernfs_fop_write_iter call once,
+	# we'll provide exact context shortly afterwards.
+	debugfs_kernfs_kernfs_fop_write_iter_set_fail_once
+
+	# First disable all waits
+	debugfs_kernfs_disable_wait_kernfs_fop_write_iter
+
+	# Enable a wait_for_completion(&kernfs_debug_wait_completion) at the
+	# specified location inside the kernfs_fop_write_iter() routine
+	debugfs_kernfs_enable_wait_kernfs_fop_write_iter $WAIT_AT
+
+	# Configure kernfs so that after its wait_for_completion() it
+	# will msleep() this amount of time and schedule(). We figure this
+	# will be sufficient time to allow for our module removal to complete.
+	debugfs_kernfs_set_wait_ms $MSDELAY
+
+	# Now we trigger a kernfs write op, which will run kernfs_fop_write_iter,
+	# but will wait until our driver sends a respective completion
+	set_test_ignore_errors &
+	write_pid=$!
+
+	# At this point kernfs_fop_write_iter() hasn't run our op, its
+	# waiting for our completion at the specified time $WAIT_AT.
+	# We now remove our module which will send a
+	# complete(&kernfs_debug_wait_completion) right before we deregister
+	# our device and the sysfs device attributes are removed.
+	#
+	# After the completion is sent, the test_sysfs driver races with
+	# kernfs to do the device deregistration with the kernfs msleep
+	# and schedule(). This should mean we've forced trying to remove the
+	# module prior to allowing kernfs to run our store operation. If the
+	# race did happen we'll panic with a null dereference on the store op.
+	#
+	# If no race happens we should see no write operation triggered.
+	modprobe -r $TEST_DRIVER > /dev/null 2>&1
+
+	debugfs_kernfs_kernfs_fop_write_iter_set_fail_never
+
+	wait $write_pid
+	if [[ $? -eq $EXPECT_WRITE_RETURNS ]]; then
+		echo "ok"
+	else
+		echo "FAIL" >&2
+	fi
+}
+
+sysfs_test_0029()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0029 at_start 1 $delay
+	done
+}
+
+sysfs_test_0030()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0030 before_mutex 1 $delay
+	done
+}
+
+sysfs_test_0031()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0031 after_mutex 1 $delay
+	done
+}
+
+# A write only succeeds *iff* a module removal happens *after* the
+# kernfs active reference is obtained with kernfs_get_active().
+sysfs_test_0032()
+{
+	for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+		echo "Using delay-after-completion: $delay"
+		sysfs_race_kernfs_kernfs_fop_write_iter 0032 after_active 0 $delay
+	done
+}
+
 test_gen_desc()
 {
 	echo -n "$1 x $(get_test_count $1)"
@@ -1013,6 +1184,10 @@ list_tests()
 	echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device"
 	echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... "
 	echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..."
+	echo "$(test_gen_desc 0029) - racing removal of store op with kernfs at start"
+	echo "$(test_gen_desc 0030) - racing removal of store op with kernfs before mutex"
+	echo "$(test_gen_desc 0031) - racing removal of store op with kernfs after mutex"
+	echo "$(test_gen_desc 0032) - racing removal of store op with kernfs after active"
 }
 
 usage()
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 06/12] kernel/module: add documentation for try_module_get()
       [not found] <20210918050430.3671227-1-mcgrof@kernel.org>
@ 2021-09-18  5:04   ` Luis Chamberlain
  2021-09-18  5:04   ` Luis Chamberlain
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams, senozhatsky,
	hch, joe, hkallweit1, axboe, jpoimboe, tglx, keescook, rostedt,
	peterz, linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, linux-kernel, copyleft-next

There is quite a bit of tribal knowledge around proper use of
try_module_get() and that it must be used only in a context which
can ensure the module won't be gone during the operation. Document
this little bit of tribal knowledge.

I'm extending this tribal knowledge with new developments which it
seems some folks do not yet believe to be true: we can be sure a
module will exist during the lifetime of a sysfs file operation.
For proof, refer to test_sysfs test #32:

./tools/testing/selftests/sysfs/sysfs.sh -t 0032

Without this being true, the write would fail or worse,
a crash would happen, in this test. It does not.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/module.h | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index c9f1200b2312..22eacd5e1e85 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -609,10 +609,40 @@ void symbol_put_addr(void *addr);
    to handle the error case (which only happens with rmmod --wait). */
 extern void __module_get(struct module *module);
 
-/* This is the Right Way to get a module: if it fails, it's being removed,
- * so pretend it's not there. */
+/**
+ * try_module_get() - yields to module removal and bumps refcnt otherwise
+ * @module: the module we should check for
+ *
+ * This can be used to try to bump the reference count of a module, so to
+ * prevent module removal. The reference count of a module is not allowed
+ * to be incremented if the module is already being removed.
+ *
+ * Care must be taken to ensure the module cannot be removed during the call to
+ * try_module_get(). This can be done by having another entity other than the
+ * module itself increment the module reference count, or through some other
+ * means which guarantees the module could not be removed during an operation.
+ * An example of this later case is using try_module_get() in a sysfs file
+ * which the module created. The sysfs store / read file operations are
+ * gauranteed to exist through the use of kernfs's active reference (see
+ * kernfs_active()). If a sysfs file operation is being run, the module which
+ * created it must still exist as the module is in charge of removing the same
+ * sysfs file being read. Also, a sysfs / kernfs file removal cannot happen
+ * unless the same file is not active.
+ *
+ * One of the real values to try_module_get() is the module_is_live() check
+ * which ensures this the caller of try_module_get() can yield to userspace
+ * module removal requests and fail whatever it was about to process.
+ */
 extern bool try_module_get(struct module *module);
 
+/**
+ * module_put() - release a reference count to a module
+ * @module: the module we should release a reference count for
+ *
+ * If you successfully bump a reference count to a module with try_module_get(),
+ * when you are finished you must call module_put() to release that reference
+ * count.
+ */
 extern void module_put(struct module *module);
 
 #else /*!CONFIG_MODULE_UNLOAD*/
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 06/12] kernel/module: add documentation for try_module_get()
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams

There is quite a bit of tribal knowledge around proper use of
try_module_get() and that it must be used only in a context which
can ensure the module won't be gone during the operation. Document
this little bit of tribal knowledge.

I'm extending this tribal knowledge with new developments which it
seems some folks do not yet believe to be true: we can be sure a
module will exist during the lifetime of a sysfs file operation.
For proof, refer to test_sysfs test #32:

./tools/testing/selftests/sysfs/sysfs.sh -t 0032

Without this being true, the write would fail or worse,
a crash would happen, in this test. It does not.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/module.h | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index c9f1200b2312..22eacd5e1e85 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -609,10 +609,40 @@ void symbol_put_addr(void *addr);
    to handle the error case (which only happens with rmmod --wait). */
 extern void __module_get(struct module *module);
 
-/* This is the Right Way to get a module: if it fails, it's being removed,
- * so pretend it's not there. */
+/**
+ * try_module_get() - yields to module removal and bumps refcnt otherwise
+ * @module: the module we should check for
+ *
+ * This can be used to try to bump the reference count of a module, so to
+ * prevent module removal. The reference count of a module is not allowed
+ * to be incremented if the module is already being removed.
+ *
+ * Care must be taken to ensure the module cannot be removed during the call to
+ * try_module_get(). This can be done by having another entity other than the
+ * module itself increment the module reference count, or through some other
+ * means which guarantees the module could not be removed during an operation.
+ * An example of this later case is using try_module_get() in a sysfs file
+ * which the module created. The sysfs store / read file operations are
+ * gauranteed to exist through the use of kernfs's active reference (see
+ * kernfs_active()). If a sysfs file operation is being run, the module which
+ * created it must still exist as the module is in charge of removing the same
+ * sysfs file being read. Also, a sysfs / kernfs file removal cannot happen
+ * unless the same file is not active.
+ *
+ * One of the real values to try_module_get() is the module_is_live() check
+ * which ensures this the caller of try_module_get() can yield to userspace
+ * module removal requests and fail whatever it was about to process.
+ */
 extern bool try_module_get(struct module *module);
 
+/**
+ * module_put() - release a reference count to a module
+ * @module: the module we should release a reference count for
+ *
+ * If you successfully bump a reference count to a module with try_module_get(),
+ * when you are finished you must call module_put() to release that reference
+ * count.
+ */
 extern void module_put(struct module *module);
 
 #else /*!CONFIG_MODULE_UNLOAD*/
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 07/12] fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on kernfs_create_link()
       [not found] <20210918050430.3671227-1-mcgrof@kernel.org>
@ 2021-09-18  5:04   ` Luis Chamberlain
  2021-09-18  5:04   ` Luis Chamberlain
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams, senozhatsky,
	hch, joe, hkallweit1, axboe, jpoimboe, tglx, keescook, rostedt,
	peterz, linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, linux-kernel, copyleft-next

If one ends up extending this line checkpatch will complain about the
use of S_IRWXUGO suggesting it is not preferred and that 0777
should be used instead. Take the tip from checkpatch and do that
change before we do our subsequent changes.

This makes no functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/kernfs/symlink.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index c8f8e41b8411..19a6c71c6ff5 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,8 +36,7 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 		gid = target->iattr->ia_gid;
 	}
 
-	kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, uid, gid,
-			     KERNFS_LINK);
+	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 07/12] fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on kernfs_create_link()
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams

If one ends up extending this line checkpatch will complain about the
use of S_IRWXUGO suggesting it is not preferred and that 0777
should be used instead. Take the tip from checkpatch and do that
change before we do our subsequent changes.

This makes no functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/kernfs/symlink.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index c8f8e41b8411..19a6c71c6ff5 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,8 +36,7 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 		gid = target->iattr->ia_gid;
 	}
 
-	kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, uid, gid,
-			     KERNFS_LINK);
+	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 08/12] fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns()
       [not found] <20210918050430.3671227-1-mcgrof@kernel.org>
@ 2021-09-18  5:04   ` Luis Chamberlain
  2021-09-18  5:04   ` Luis Chamberlain
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams, senozhatsky,
	hch, joe, hkallweit1, axboe, jpoimboe, tglx, keescook, rostedt,
	peterz, linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, linux-kernel, copyleft-next

If one ends up expanding on this line checkpatch will complain that the
combination S_IRWXU|S_IRUGO|S_IXUGO should just be replaced with the
octal 0755. Do that.

This makes no functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/sysfs/dir.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 59dffd5ca517..b6b6796e1616 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -56,8 +56,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
 
 	kobject_get_ownership(kobj, &uid, &gid);
 
-	kn = kernfs_create_dir_ns(parent, kobject_name(kobj),
-				  S_IRWXU | S_IRUGO | S_IXUGO, uid, gid,
+	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
 				  kobj, ns);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 08/12] fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns()
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams

If one ends up expanding on this line checkpatch will complain that the
combination S_IRWXU|S_IRUGO|S_IXUGO should just be replaced with the
octal 0755. Do that.

This makes no functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/sysfs/dir.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 59dffd5ca517..b6b6796e1616 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -56,8 +56,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
 
 	kobject_get_ownership(kobj, &uid, &gid);
 
-	kn = kernfs_create_dir_ns(parent, kobject_name(kobj),
-				  S_IRWXU | S_IRUGO | S_IXUGO, uid, gid,
+	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
 				  kobj, ns);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 09/12] sysfs: fix deadlock race with module removal
       [not found] <20210918050430.3671227-1-mcgrof@kernel.org>
@ 2021-09-18  5:04   ` Luis Chamberlain
  2021-09-18  5:04   ` Luis Chamberlain
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams, senozhatsky,
	hch, joe, hkallweit1, axboe, jpoimboe, tglx, keescook, rostedt,
	peterz, linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, linux-kernel, copyleft-next

When sysfs attributes use a lock also used on module removal we can
race to deadlock. This happens when for instance a sysfs file on
a driver is used, then at the same time we have module removal call
trigger. The module removal call code holds a lock, and then the sysfs
file entry waits for the same lock. While holding the lock the module
removal tries to remove the sysfs entries, but these cannot be removed
yet as one is waiting for a lock. This won't complete as the lock is
already held. Likewise module removal cannot complete, and so we deadlock.

This can now be easily reproducible with our sysfs selftest as follows:

./tools/testing/selftests/sysfs/sysfs.sh -t 0027

To fix this we extend the struct kernfs_node with a module reference and
use the try_module_get() after kernfs_get_active() is called which
protects integrity and the existence of the kernfs node during the
operation.

So long as the kernfs node is protected with kernfs_get_active() we know
we can rely on its contents. And, as now just documented in the previous
patch, we also now know that once kernfs_get_active() is called the module
is also guarded to exist and cannot be removed.

If try_module_get() fails we fail the operation on the kernfs node.

We use a try method as a full lock means we'd then make our sysfs
attributes busy us out from possible module removal, and so userspace
could force denying module removal, a silly form of "DOS" against module
removal. A try lock on the module removal ensures we give priority to
module removal and interacting with sysfs attributes only comes second.
Using a full lock could mean for instance that if you don't stop poking
at sysfs files you cannot remove a module.

Races between removal of sysfs files and the module are not possible
given sysfs files are created by the same module, and when a sysfs file
is being used kernfs prevents removal of the sysfs file. So if module
removal is actually happening the removal would have to wait until
the sysfs file operation is complete.

This deadlock was first reported with the zram driver, however the live
patching folks have acknowledged they have observed this as well with
live patching, when a live patch is removed. I was then able to
reproduce easily by creating a dedicated selftests.

A sketch of how this can happen follows:

CPU A                              CPU B
                                   whatever_store()
module_unload
  mutex_lock(foo)
                                   mutex_lock(foo)
   del_gendisk(zram->disk);
     device_del()
       device_remove_groups()

In this situation whatever_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the sysfs file being poked
completes which is waiting for a lock already held.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
 fs/kernfs/dir.c                        | 41 ++++++++++++++++++--
 fs/kernfs/file.c                       |  7 +++-
 fs/kernfs/kernfs-internal.h            |  1 +
 fs/kernfs/symlink.c                    |  3 +-
 fs/sysfs/dir.c                         |  4 +-
 fs/sysfs/file.c                        | 10 +++--
 fs/sysfs/group.c                       |  1 +
 include/linux/kernfs.h                 | 11 ++++--
 include/linux/sysfs.h                  | 52 ++++++++++++++++++++------
 kernel/cgroup/cgroup.c                 |  2 +-
 11 files changed, 106 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..d40c486f46b9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -207,7 +207,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
 	struct kernfs_node *kn;
 	int ret;
 
-	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
+	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode, NULL,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				  0, rft->kf_ops, rft, NULL, NULL);
 	if (IS_ERR(kn))
@@ -2480,7 +2480,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
 	struct kernfs_node *kn;
 	int ret = 0;
 
-	kn = __kernfs_create_file(parent_kn, name, 0444,
+	kn = __kernfs_create_file(parent_kn, name, 0444, NULL,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
 				  &kf_mondata_ops, priv, NULL, NULL);
 	if (IS_ERR(kn))
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..f463878a5bf4 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/hash.h>
+#include <linux/module.h>
 
 #include "kernfs-internal.h"
 
@@ -414,12 +415,32 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
  */
 struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
 {
+	int v;
+
 	if (unlikely(!kn))
 		return NULL;
 
 	if (!atomic_inc_unless_negative(&kn->active))
 		return NULL;
 
+	/*
+	 * If a module created the kernfs_node, the module cannot possibly be
+	 * removed if the above atomic_inc_unless_negative() succeeded. So the
+	 * try_module_get() below is not to protect the lifetime of the module
+	 * as that is already guaranteed. The try_module_get() below is used
+	 * to ensure that we don't deadlock in case a kernfs operation and
+	 * module removal used a shared lock.
+	 *
+	 * We use a try method to avoid userspace DOS'ing module removal with
+	 * access to kernfs ops.
+	 */
+	if (!try_module_get(kn->owner)) {
+		v = atomic_dec_return(&kn->active);
+		if (unlikely(v == KN_DEACTIVATED_BIAS))
+			wake_up_all(&kernfs_root(kn)->deactivate_waitq);
+		return NULL;
+	}
+
 	if (kernfs_lockdep(kn))
 		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
 	return kn;
@@ -442,6 +463,13 @@ void kernfs_put_active(struct kernfs_node *kn)
 	if (kernfs_lockdep(kn))
 		rwsem_release(&kn->dep_map, _RET_IP_);
 	v = atomic_dec_return(&kn->active);
+
+	/*
+	 * We prevent module exit *until* we know for sure all possible
+	 * kernfs ops are done.
+	 */
+	module_put(kn->owner);
+
 	if (likely(v != KN_DEACTIVATED_BIAS))
 		return;
 
@@ -571,6 +599,7 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
 static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 					     struct kernfs_node *parent,
 					     const char *name, umode_t mode,
+					     struct module *owner,
 					     kuid_t uid, kgid_t gid,
 					     unsigned flags)
 {
@@ -606,6 +635,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 
 	kn->name = name;
 	kn->mode = mode;
+	kn->owner = owner;
 	kn->flags = flags;
 
 	if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
@@ -639,13 +669,14 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
+				    struct module *owner,
 				    kuid_t uid, kgid_t gid,
 				    unsigned flags)
 {
 	struct kernfs_node *kn;
 
 	kn = __kernfs_new_node(kernfs_root(parent), parent,
-			       name, mode, uid, gid, flags);
+			       name, mode, owner, uid, gid, flags);
 	if (kn) {
 		kernfs_get(parent);
 		kn->parent = parent;
@@ -926,7 +957,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 		root->id_highbits = 1;
 
 	kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
-			       GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
+			       NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 			       KERNFS_DIR);
 	if (!kn) {
 		idr_destroy(&root->ino_idr);
@@ -965,6 +996,7 @@ void kernfs_destroy_root(struct kernfs_root *root)
  * @parent: parent in which to create a new directory
  * @name: name of the new directory
  * @mode: mode of the new directory
+ * @owner: if set, the module owner of this directory
  * @uid: uid of the new directory
  * @gid: gid of the new directory
  * @priv: opaque data associated with the new directory
@@ -974,6 +1006,7 @@ void kernfs_destroy_root(struct kernfs_root *root)
  */
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
+					 struct module *owner,
 					 kuid_t uid, kgid_t gid,
 					 void *priv, const void *ns)
 {
@@ -982,7 +1015,7 @@ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, mode | S_IFDIR,
-			     uid, gid, KERNFS_DIR);
+			     owner, uid, gid, KERNFS_DIR);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
@@ -1014,7 +1047,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
-			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
+			     NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 4479c6580333..73be39ddc856 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -974,6 +974,7 @@ const struct file_operations kernfs_file_fops = {
  * @uid: uid of the file
  * @gid: gid of the file
  * @size: size of the file
+ * @owner: if set, the module owner of the file
  * @ops: kernfs operations for the file
  * @priv: private data for the file
  * @ns: optional namespace tag of the file
@@ -983,7 +984,9 @@ const struct file_operations kernfs_file_fops = {
  */
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 const char *name,
-					 umode_t mode, kuid_t uid, kgid_t gid,
+					 umode_t mode,
+					 struct module *owner,
+					 kuid_t uid, kgid_t gid,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
@@ -996,7 +999,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 	flags = KERNFS_FILE;
 
 	kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
-			     uid, gid, flags);
+			     owner, uid, gid, flags);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 9e3abf597e2d..3551f62f1449 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -133,6 +133,7 @@ void kernfs_put_active(struct kernfs_node *kn);
 int kernfs_add_one(struct kernfs_node *kn);
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
+				    struct module *owner,
 				    kuid_t uid, kgid_t gid,
 				    unsigned flags);
 
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 19a6c71c6ff5..1d781970bec7 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 		gid = target->iattr->ia_gid;
 	}
 
-	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
+	kn = kernfs_new_node(parent, name, S_IFLNK|0777, target->owner, uid,
+			     gid, KERNFS_LINK);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index b6b6796e1616..01a0a53ee18e 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -56,8 +56,8 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
 
 	kobject_get_ownership(kobj, &uid, &gid);
 
-	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
-				  kobj, ns);
+	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, NULL, uid,
+				  gid, kobj, ns);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, kobject_name(kobj));
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 42dcf96881b6..5edc6e180a97 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -291,8 +291,9 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
 		key = attr->key ?: (struct lock_class_key *)&attr->skey;
 #endif
 
-	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  PAGE_SIZE, ops, (void *)attr, ns, key);
+	kn = __kernfs_create_file(parent, attr->name, mode & 0777, attr->owner,
+				  uid, gid, PAGE_SIZE, ops, (void *)attr, ns,
+				  key);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
@@ -326,8 +327,9 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
 		key = attr->key ?: (struct lock_class_key *)&attr->skey;
 #endif
 
-	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  battr->size, ops, (void *)attr, ns, key);
+	kn = __kernfs_create_file(parent, attr->name, mode & 0777, attr->owner,
+				  uid, gid, battr->size, ops, (void *)attr, ns,
+				  key);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index eeb0e3099421..283607c22eef 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -135,6 +135,7 @@ static int internal_create_group(struct kobject *kobj, int update,
 		} else {
 			kn = kernfs_create_dir_ns(kobj->sd, grp->name,
 						  S_IRWXU | S_IRUGO | S_IXUGO,
+						  grp->owner,
 						  uid, gid, kobj, NULL);
 			if (IS_ERR(kn)) {
 				if (PTR_ERR(kn) == -EEXIST)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index cd968ee2b503..8655abd53c85 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -160,6 +160,7 @@ struct kernfs_node {
 
 	unsigned short		flags;
 	umode_t			mode;
+	struct module           *owner;
 	struct kernfs_iattrs	*iattr;
 };
 
@@ -369,12 +370,14 @@ void kernfs_destroy_root(struct kernfs_root *root);
 
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
+					 struct module *owner,
 					 kuid_t uid, kgid_t gid,
 					 void *priv, const void *ns);
 struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 					    const char *name);
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
+					 struct module *owner,
 					 kuid_t uid, kgid_t gid,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
@@ -471,13 +474,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
 
 static inline struct kernfs_node *
 kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
-		     umode_t mode, kuid_t uid, kgid_t gid,
+		     umode_t mode, struct module *owner,
+		     kuid_t uid, kgid_t gid,
 		     void *priv, const void *ns)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
 __kernfs_create_file(struct kernfs_node *parent, const char *name,
-		     umode_t mode, kuid_t uid, kgid_t gid,
+		     umode_t mode, struct module *owner,
+		     kuid_t uid, kgid_t gid,
 		     loff_t size, const struct kernfs_ops *ops,
 		     void *priv, const void *ns, struct lock_class_key *key)
 { return ERR_PTR(-ENOSYS); }
@@ -564,7 +569,7 @@ static inline struct kernfs_node *
 kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
 		  void *priv)
 {
-	return kernfs_create_dir_ns(parent, name, mode,
+	return kernfs_create_dir_ns(parent, name, mode, parent->owner,
 				    GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				    priv, NULL);
 }
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index e3f1e8ac1f85..babbabb460dc 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -30,6 +30,7 @@ enum kobj_ns_type;
 struct attribute {
 	const char		*name;
 	umode_t			mode;
+	struct module           *owner;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	bool			ignore_lockdep:1;
 	struct lock_class_key	*key;
@@ -80,6 +81,7 @@ do {							\
  * @attrs:	Pointer to NULL terminated list of attributes.
  * @bin_attrs:	Pointer to NULL terminated list of binary attributes.
  *		Either attrs or bin_attrs or both must be provided.
+ * @module:	If set, module responsible for this attribute group
  */
 struct attribute_group {
 	const char		*name;
@@ -89,6 +91,7 @@ struct attribute_group {
 						  struct bin_attribute *, int);
 	struct attribute	**attrs;
 	struct bin_attribute	**bin_attrs;
+	struct module           *owner;
 };
 
 /*
@@ -100,38 +103,52 @@ struct attribute_group {
 
 #define __ATTR(_name, _mode, _show, _store) {				\
 	.attr = {.name = __stringify(_name),				\
-		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode),               \
+		 .owner  = THIS_MODULE,                                 \
+	},                                                             \
 	.show	= _show,						\
 	.store	= _store,						\
 }
 
 #define __ATTR_PREALLOC(_name, _mode, _show, _store) {			\
 	.attr = {.name = __stringify(_name),				\
-		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\
+		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\
+		 .owner  = THIS_MODULE,                                 \
+	},                                                              \
 	.show	= _show,						\
 	.store	= _store,						\
 }
 
 #define __ATTR_RO(_name) {						\
-	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0444,					\
+		    .owner  = THIS_MODULE,				\
+		},                                                     \
 	.show	= _name##_show,						\
 }
 
 #define __ATTR_RO_MODE(_name, _mode) {					\
 	.attr	= { .name = __stringify(_name),				\
-		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),            \
+		    .owner  = THIS_MODULE,				\
+	},                                                              \
 	.show	= _name##_show,						\
 }
 
 #define __ATTR_RW_MODE(_name, _mode) {					\
 	.attr	= { .name = __stringify(_name),				\
-		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),            \
+		    .owner  = THIS_MODULE,                              \
+	},								\
 	.show	= _name##_show,						\
 	.store	= _name##_store,					\
 }
 
 #define __ATTR_WO(_name) {						\
-	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0200,					\
+		    .owner  = THIS_MODULE,				\
+	},                                                              \
 	.store	= _name##_store,					\
 }
 
@@ -141,8 +158,11 @@ struct attribute_group {
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 #define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) {	\
-	.attr = {.name = __stringify(_name), .mode = _mode,	\
-			.ignore_lockdep = true },		\
+	.attr = {.name = __stringify(_name),                    \
+		 .mode = _mode,					\
+		 .ignore_lockdep = true,                        \
+		 .owner  = THIS_MODULE,                         \
+	},							\
 	.show		= _show,				\
 	.store		= _store,				\
 }
@@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = {	\
 #define ATTRIBUTE_GROUPS(_name)					\
 static const struct attribute_group _name##_group = {		\
 	.attrs = _name##_attrs,					\
+	.owner = THIS_MODULE,					\
 };								\
 __ATTRIBUTE_GROUPS(_name)
 
@@ -199,20 +220,29 @@ struct bin_attribute {
 
 /* macros to create static binary attributes easier */
 #define __BIN_ATTR(_name, _mode, _read, _write, _size) {		\
-	.attr = { .name = __stringify(_name), .mode = _mode },		\
+	.attr = { .name = __stringify(_name),                           \
+		   .mode = _mode,					\
+		   .owner = THIS_MODULE,				\
+	},								\
 	.read	= _read,						\
 	.write	= _write,						\
 	.size	= _size,						\
 }
 
 #define __BIN_ATTR_RO(_name, _size) {					\
-	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0444,					\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.read	= _name##_read,						\
 	.size	= _size,						\
 }
 
 #define __BIN_ATTR_WO(_name, _size) {					\
-	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0200,					\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.write	= _name##_write,					\
 	.size	= _size,						\
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8afa8690d288..b9886db9ecdd 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3949,7 +3949,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 	key = &cft->lockdep_key;
 #endif
 	kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name),
-				  cgroup_file_mode(cft),
+				  cgroup_file_mode(cft), NULL,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				  0, cft->kf_ops, cft,
 				  NULL, key);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams

When sysfs attributes use a lock also used on module removal we can
race to deadlock. This happens when for instance a sysfs file on
a driver is used, then at the same time we have module removal call
trigger. The module removal call code holds a lock, and then the sysfs
file entry waits for the same lock. While holding the lock the module
removal tries to remove the sysfs entries, but these cannot be removed
yet as one is waiting for a lock. This won't complete as the lock is
already held. Likewise module removal cannot complete, and so we deadlock.

This can now be easily reproducible with our sysfs selftest as follows:

./tools/testing/selftests/sysfs/sysfs.sh -t 0027

To fix this we extend the struct kernfs_node with a module reference and
use the try_module_get() after kernfs_get_active() is called which
protects integrity and the existence of the kernfs node during the
operation.

So long as the kernfs node is protected with kernfs_get_active() we know
we can rely on its contents. And, as now just documented in the previous
patch, we also now know that once kernfs_get_active() is called the module
is also guarded to exist and cannot be removed.

If try_module_get() fails we fail the operation on the kernfs node.

We use a try method as a full lock means we'd then make our sysfs
attributes busy us out from possible module removal, and so userspace
could force denying module removal, a silly form of "DOS" against module
removal. A try lock on the module removal ensures we give priority to
module removal and interacting with sysfs attributes only comes second.
Using a full lock could mean for instance that if you don't stop poking
at sysfs files you cannot remove a module.

Races between removal of sysfs files and the module are not possible
given sysfs files are created by the same module, and when a sysfs file
is being used kernfs prevents removal of the sysfs file. So if module
removal is actually happening the removal would have to wait until
the sysfs file operation is complete.

This deadlock was first reported with the zram driver, however the live
patching folks have acknowledged they have observed this as well with
live patching, when a live patch is removed. I was then able to
reproduce easily by creating a dedicated selftests.

A sketch of how this can happen follows:

CPU A                              CPU B
                                   whatever_store()
module_unload
  mutex_lock(foo)
                                   mutex_lock(foo)
   del_gendisk(zram->disk);
     device_del()
       device_remove_groups()

In this situation whatever_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the sysfs file being poked
completes which is waiting for a lock already held.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
 fs/kernfs/dir.c                        | 41 ++++++++++++++++++--
 fs/kernfs/file.c                       |  7 +++-
 fs/kernfs/kernfs-internal.h            |  1 +
 fs/kernfs/symlink.c                    |  3 +-
 fs/sysfs/dir.c                         |  4 +-
 fs/sysfs/file.c                        | 10 +++--
 fs/sysfs/group.c                       |  1 +
 include/linux/kernfs.h                 | 11 ++++--
 include/linux/sysfs.h                  | 52 ++++++++++++++++++++------
 kernel/cgroup/cgroup.c                 |  2 +-
 11 files changed, 106 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..d40c486f46b9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -207,7 +207,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
 	struct kernfs_node *kn;
 	int ret;
 
-	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
+	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode, NULL,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				  0, rft->kf_ops, rft, NULL, NULL);
 	if (IS_ERR(kn))
@@ -2480,7 +2480,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
 	struct kernfs_node *kn;
 	int ret = 0;
 
-	kn = __kernfs_create_file(parent_kn, name, 0444,
+	kn = __kernfs_create_file(parent_kn, name, 0444, NULL,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
 				  &kf_mondata_ops, priv, NULL, NULL);
 	if (IS_ERR(kn))
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..f463878a5bf4 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/hash.h>
+#include <linux/module.h>
 
 #include "kernfs-internal.h"
 
@@ -414,12 +415,32 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
  */
 struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
 {
+	int v;
+
 	if (unlikely(!kn))
 		return NULL;
 
 	if (!atomic_inc_unless_negative(&kn->active))
 		return NULL;
 
+	/*
+	 * If a module created the kernfs_node, the module cannot possibly be
+	 * removed if the above atomic_inc_unless_negative() succeeded. So the
+	 * try_module_get() below is not to protect the lifetime of the module
+	 * as that is already guaranteed. The try_module_get() below is used
+	 * to ensure that we don't deadlock in case a kernfs operation and
+	 * module removal used a shared lock.
+	 *
+	 * We use a try method to avoid userspace DOS'ing module removal with
+	 * access to kernfs ops.
+	 */
+	if (!try_module_get(kn->owner)) {
+		v = atomic_dec_return(&kn->active);
+		if (unlikely(v == KN_DEACTIVATED_BIAS))
+			wake_up_all(&kernfs_root(kn)->deactivate_waitq);
+		return NULL;
+	}
+
 	if (kernfs_lockdep(kn))
 		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
 	return kn;
@@ -442,6 +463,13 @@ void kernfs_put_active(struct kernfs_node *kn)
 	if (kernfs_lockdep(kn))
 		rwsem_release(&kn->dep_map, _RET_IP_);
 	v = atomic_dec_return(&kn->active);
+
+	/*
+	 * We prevent module exit *until* we know for sure all possible
+	 * kernfs ops are done.
+	 */
+	module_put(kn->owner);
+
 	if (likely(v != KN_DEACTIVATED_BIAS))
 		return;
 
@@ -571,6 +599,7 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
 static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 					     struct kernfs_node *parent,
 					     const char *name, umode_t mode,
+					     struct module *owner,
 					     kuid_t uid, kgid_t gid,
 					     unsigned flags)
 {
@@ -606,6 +635,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 
 	kn->name = name;
 	kn->mode = mode;
+	kn->owner = owner;
 	kn->flags = flags;
 
 	if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
@@ -639,13 +669,14 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
+				    struct module *owner,
 				    kuid_t uid, kgid_t gid,
 				    unsigned flags)
 {
 	struct kernfs_node *kn;
 
 	kn = __kernfs_new_node(kernfs_root(parent), parent,
-			       name, mode, uid, gid, flags);
+			       name, mode, owner, uid, gid, flags);
 	if (kn) {
 		kernfs_get(parent);
 		kn->parent = parent;
@@ -926,7 +957,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 		root->id_highbits = 1;
 
 	kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
-			       GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
+			       NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 			       KERNFS_DIR);
 	if (!kn) {
 		idr_destroy(&root->ino_idr);
@@ -965,6 +996,7 @@ void kernfs_destroy_root(struct kernfs_root *root)
  * @parent: parent in which to create a new directory
  * @name: name of the new directory
  * @mode: mode of the new directory
+ * @owner: if set, the module owner of this directory
  * @uid: uid of the new directory
  * @gid: gid of the new directory
  * @priv: opaque data associated with the new directory
@@ -974,6 +1006,7 @@ void kernfs_destroy_root(struct kernfs_root *root)
  */
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
+					 struct module *owner,
 					 kuid_t uid, kgid_t gid,
 					 void *priv, const void *ns)
 {
@@ -982,7 +1015,7 @@ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, mode | S_IFDIR,
-			     uid, gid, KERNFS_DIR);
+			     owner, uid, gid, KERNFS_DIR);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
@@ -1014,7 +1047,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
-			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
+			     NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 4479c6580333..73be39ddc856 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -974,6 +974,7 @@ const struct file_operations kernfs_file_fops = {
  * @uid: uid of the file
  * @gid: gid of the file
  * @size: size of the file
+ * @owner: if set, the module owner of the file
  * @ops: kernfs operations for the file
  * @priv: private data for the file
  * @ns: optional namespace tag of the file
@@ -983,7 +984,9 @@ const struct file_operations kernfs_file_fops = {
  */
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 const char *name,
-					 umode_t mode, kuid_t uid, kgid_t gid,
+					 umode_t mode,
+					 struct module *owner,
+					 kuid_t uid, kgid_t gid,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
@@ -996,7 +999,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 	flags = KERNFS_FILE;
 
 	kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
-			     uid, gid, flags);
+			     owner, uid, gid, flags);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 9e3abf597e2d..3551f62f1449 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -133,6 +133,7 @@ void kernfs_put_active(struct kernfs_node *kn);
 int kernfs_add_one(struct kernfs_node *kn);
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
+				    struct module *owner,
 				    kuid_t uid, kgid_t gid,
 				    unsigned flags);
 
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 19a6c71c6ff5..1d781970bec7 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 		gid = target->iattr->ia_gid;
 	}
 
-	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
+	kn = kernfs_new_node(parent, name, S_IFLNK|0777, target->owner, uid,
+			     gid, KERNFS_LINK);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index b6b6796e1616..01a0a53ee18e 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -56,8 +56,8 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
 
 	kobject_get_ownership(kobj, &uid, &gid);
 
-	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
-				  kobj, ns);
+	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, NULL, uid,
+				  gid, kobj, ns);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, kobject_name(kobj));
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 42dcf96881b6..5edc6e180a97 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -291,8 +291,9 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
 		key = attr->key ?: (struct lock_class_key *)&attr->skey;
 #endif
 
-	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  PAGE_SIZE, ops, (void *)attr, ns, key);
+	kn = __kernfs_create_file(parent, attr->name, mode & 0777, attr->owner,
+				  uid, gid, PAGE_SIZE, ops, (void *)attr, ns,
+				  key);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
@@ -326,8 +327,9 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
 		key = attr->key ?: (struct lock_class_key *)&attr->skey;
 #endif
 
-	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  battr->size, ops, (void *)attr, ns, key);
+	kn = __kernfs_create_file(parent, attr->name, mode & 0777, attr->owner,
+				  uid, gid, battr->size, ops, (void *)attr, ns,
+				  key);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index eeb0e3099421..283607c22eef 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -135,6 +135,7 @@ static int internal_create_group(struct kobject *kobj, int update,
 		} else {
 			kn = kernfs_create_dir_ns(kobj->sd, grp->name,
 						  S_IRWXU | S_IRUGO | S_IXUGO,
+						  grp->owner,
 						  uid, gid, kobj, NULL);
 			if (IS_ERR(kn)) {
 				if (PTR_ERR(kn) == -EEXIST)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index cd968ee2b503..8655abd53c85 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -160,6 +160,7 @@ struct kernfs_node {
 
 	unsigned short		flags;
 	umode_t			mode;
+	struct module           *owner;
 	struct kernfs_iattrs	*iattr;
 };
 
@@ -369,12 +370,14 @@ void kernfs_destroy_root(struct kernfs_root *root);
 
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
+					 struct module *owner,
 					 kuid_t uid, kgid_t gid,
 					 void *priv, const void *ns);
 struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 					    const char *name);
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
+					 struct module *owner,
 					 kuid_t uid, kgid_t gid,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
@@ -471,13 +474,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
 
 static inline struct kernfs_node *
 kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
-		     umode_t mode, kuid_t uid, kgid_t gid,
+		     umode_t mode, struct module *owner,
+		     kuid_t uid, kgid_t gid,
 		     void *priv, const void *ns)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
 __kernfs_create_file(struct kernfs_node *parent, const char *name,
-		     umode_t mode, kuid_t uid, kgid_t gid,
+		     umode_t mode, struct module *owner,
+		     kuid_t uid, kgid_t gid,
 		     loff_t size, const struct kernfs_ops *ops,
 		     void *priv, const void *ns, struct lock_class_key *key)
 { return ERR_PTR(-ENOSYS); }
@@ -564,7 +569,7 @@ static inline struct kernfs_node *
 kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
 		  void *priv)
 {
-	return kernfs_create_dir_ns(parent, name, mode,
+	return kernfs_create_dir_ns(parent, name, mode, parent->owner,
 				    GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				    priv, NULL);
 }
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index e3f1e8ac1f85..babbabb460dc 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -30,6 +30,7 @@ enum kobj_ns_type;
 struct attribute {
 	const char		*name;
 	umode_t			mode;
+	struct module           *owner;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	bool			ignore_lockdep:1;
 	struct lock_class_key	*key;
@@ -80,6 +81,7 @@ do {							\
  * @attrs:	Pointer to NULL terminated list of attributes.
  * @bin_attrs:	Pointer to NULL terminated list of binary attributes.
  *		Either attrs or bin_attrs or both must be provided.
+ * @module:	If set, module responsible for this attribute group
  */
 struct attribute_group {
 	const char		*name;
@@ -89,6 +91,7 @@ struct attribute_group {
 						  struct bin_attribute *, int);
 	struct attribute	**attrs;
 	struct bin_attribute	**bin_attrs;
+	struct module           *owner;
 };
 
 /*
@@ -100,38 +103,52 @@ struct attribute_group {
 
 #define __ATTR(_name, _mode, _show, _store) {				\
 	.attr = {.name = __stringify(_name),				\
-		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode),               \
+		 .owner  = THIS_MODULE,                                 \
+	},                                                             \
 	.show	= _show,						\
 	.store	= _store,						\
 }
 
 #define __ATTR_PREALLOC(_name, _mode, _show, _store) {			\
 	.attr = {.name = __stringify(_name),				\
-		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\
+		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\
+		 .owner  = THIS_MODULE,                                 \
+	},                                                              \
 	.show	= _show,						\
 	.store	= _store,						\
 }
 
 #define __ATTR_RO(_name) {						\
-	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0444,					\
+		    .owner  = THIS_MODULE,				\
+		},                                                     \
 	.show	= _name##_show,						\
 }
 
 #define __ATTR_RO_MODE(_name, _mode) {					\
 	.attr	= { .name = __stringify(_name),				\
-		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),            \
+		    .owner  = THIS_MODULE,				\
+	},                                                              \
 	.show	= _name##_show,						\
 }
 
 #define __ATTR_RW_MODE(_name, _mode) {					\
 	.attr	= { .name = __stringify(_name),				\
-		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),            \
+		    .owner  = THIS_MODULE,                              \
+	},								\
 	.show	= _name##_show,						\
 	.store	= _name##_store,					\
 }
 
 #define __ATTR_WO(_name) {						\
-	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0200,					\
+		    .owner  = THIS_MODULE,				\
+	},                                                              \
 	.store	= _name##_store,					\
 }
 
@@ -141,8 +158,11 @@ struct attribute_group {
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 #define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) {	\
-	.attr = {.name = __stringify(_name), .mode = _mode,	\
-			.ignore_lockdep = true },		\
+	.attr = {.name = __stringify(_name),                    \
+		 .mode = _mode,					\
+		 .ignore_lockdep = true,                        \
+		 .owner  = THIS_MODULE,                         \
+	},							\
 	.show		= _show,				\
 	.store		= _store,				\
 }
@@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = {	\
 #define ATTRIBUTE_GROUPS(_name)					\
 static const struct attribute_group _name##_group = {		\
 	.attrs = _name##_attrs,					\
+	.owner = THIS_MODULE,					\
 };								\
 __ATTRIBUTE_GROUPS(_name)
 
@@ -199,20 +220,29 @@ struct bin_attribute {
 
 /* macros to create static binary attributes easier */
 #define __BIN_ATTR(_name, _mode, _read, _write, _size) {		\
-	.attr = { .name = __stringify(_name), .mode = _mode },		\
+	.attr = { .name = __stringify(_name),                           \
+		   .mode = _mode,					\
+		   .owner = THIS_MODULE,				\
+	},								\
 	.read	= _read,						\
 	.write	= _write,						\
 	.size	= _size,						\
 }
 
 #define __BIN_ATTR_RO(_name, _size) {					\
-	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0444,					\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.read	= _name##_read,						\
 	.size	= _size,						\
 }
 
 #define __BIN_ATTR_WO(_name, _size) {					\
-	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0200,					\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.write	= _name##_write,					\
 	.size	= _size,						\
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8afa8690d288..b9886db9ecdd 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3949,7 +3949,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 	key = &cft->lockdep_key;
 #endif
 	kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name),
-				  cgroup_file_mode(cft),
+				  cgroup_file_mode(cft), NULL,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				  0, cft->kf_ops, cft,
 				  NULL, key);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 10/12] test_sysfs: enable deadlock tests by default
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams, senozhatsky,
	hch, joe, hkallweit1, axboe, jpoimboe, tglx, keescook, rostedt,
	peterz, linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, linux-kernel, copyleft-next

Now that sysfs has the deadlock race fixed with module removal,
enable the deadlock tests module removal tests. They were left
disabled by default as otherwise you would deadlock your system

./tools/testing/selftests/sysfs/sysfs.sh -t 0027
Running test: sysfs_test_0027 - run #0
Test for possible rmmod deadlock while writing x ... ok

./tools/testing/selftests/sysfs/sysfs.sh -t 0028
Running test: sysfs_test_0028 - run #0
Test for possible rmmod deadlock using rtnl_lock while writing x ... ok

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 tools/testing/selftests/sysfs/sysfs.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
index f928635d0e35..4047ac48e764 100755
--- a/tools/testing/selftests/sysfs/sysfs.sh
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -60,8 +60,8 @@ ALL_TESTS="$ALL_TESTS 0023:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0024:1:1:test_dev_x:block"
 ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
-ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
-ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+ALL_TESTS="$ALL_TESTS 0027:1:1:test_dev_x:block" # deadlock test
+ALL_TESTS="$ALL_TESTS 0028:1:1:test_dev_x:block" # deadlock test with rntl_lock
 ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
 ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
 ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 10/12] test_sysfs: enable deadlock tests by default
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	minchan-DgEjT+Ai2ygdnm+yROfE0A, jeyu-DgEjT+Ai2ygdnm+yROfE0A,
	shuah-DgEjT+Ai2ygdnm+yROfE0A
  Cc: rdunlap-wEGCiKHe2LqWVfeAwA7xHQ, rafael-DgEjT+Ai2ygdnm+yROfE0A,
	masahiroy-DgEjT+Ai2ygdnm+yROfE0A,
	ndesaulniers-hpIqsD4AKlfQT0dZR+AlfA,
	yzaikin-hpIqsD4AKlfQT0dZR+AlfA, nathan-DgEjT+Ai2ygdnm+yROfE0A,
	ojeda-DgEjT+Ai2ygdnm+yROfE0A,
	penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp,
	vitor-HdruG0TBZYlAfugRpC6u6w, elver-hpIqsD4AKlfQT0dZR+AlfA,
	jarkko-DgEjT+Ai2ygdnm+yROfE0A, glider-hpIqsD4AKlfQT0dZR+AlfA,
	rf-yzvPICuk2AA4QjBA90+/kJqQE7yCjDx5,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	David.Laight-ZS65k/vG3HxXrIkS9f7CXA, bvanassche-HInyCGIudOg,
	jolsa-DgEjT+Ai2ygdnm+yROfE0A,
	andriy.shevchenko-VuQAYsv1563Yd54FQh9/CA,
	trishalfonso-hpIqsD4AKlfQT0dZR+AlfA,
	andreyknvl-Re5JQEeQqe8AvxtiuMwx3w, jikos-DgEjT+Ai2ygdnm+yROfE0A,
	mbenes-IBi9RG/b67k, ngupta-KNmc09w0p+Ednm+yROfE0A,
	sergey.senozhatsky.work-Re5JQEeQqe8AvxtiuMwx3w,
	mcgrof-DgEjT+Ai2ygdnm+yROfE0A,
	reinette.chatre-ral2JQCrhuEAvxtiuMwx3w,
	fenghua.yu-ral2JQCrhuEAvxtiuMwx3w, bp-Gina5bIWoIWzQB+pC5nmwQ,
	x86-DgEjT+Ai2ygdnm+yROfE0A, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	lizefan.x-EC8Uxl6Npydl57MIdRCFDg, hannes-druUgvl0LCNAfugRpC6u6w,
	daniel.vetter-/w4YWyX8dFk, bhelgaas-hpIqsD4AKlfQT0dZR+AlfA,
	kw-vYTEC60ixJUAvxtiuMwx3w, dan.j.williams

Now that sysfs has the deadlock race fixed with module removal,
enable the deadlock tests module removal tests. They were left
disabled by default as otherwise you would deadlock your system

./tools/testing/selftests/sysfs/sysfs.sh -t 0027
Running test: sysfs_test_0027 - run #0
Test for possible rmmod deadlock while writing x ... ok

./tools/testing/selftests/sysfs/sysfs.sh -t 0028
Running test: sysfs_test_0028 - run #0
Test for possible rmmod deadlock using rtnl_lock while writing x ... ok

Signed-off-by: Luis Chamberlain <mcgrof-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 tools/testing/selftests/sysfs/sysfs.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
index f928635d0e35..4047ac48e764 100755
--- a/tools/testing/selftests/sysfs/sysfs.sh
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -60,8 +60,8 @@ ALL_TESTS="$ALL_TESTS 0023:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0024:1:1:test_dev_x:block"
 ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
 ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
-ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
-ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+ALL_TESTS="$ALL_TESTS 0027:1:1:test_dev_x:block" # deadlock test
+ALL_TESTS="$ALL_TESTS 0028:1:1:test_dev_x:block" # deadlock test with rntl_lock
 ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
 ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
 ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 11/12] zram: fix crashes with cpu hotplug multistate
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams, senozhatsky,
	hch, joe, hkallweit1, axboe, jpoimboe, tglx, keescook, rostedt,
	peterz, linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, linux-kernel, copyleft-next

Provide a simple state machine to fix races with driver exit where we
remove the CPU multistate callbacks and re-initialization / creation of
new per CPU instances which should be managed by these callbacks.

The zram driver makes use of cpu hotplug multistate support, whereby it
associates a struct zcomp per CPU. Each struct zcomp represents a
compression algorithm in charge of managing compression streams per
CPU. Although a compiled zram driver only supports a fixed set of
compression algorithms, each zram device gets a struct zcomp allocated
per CPU. The "multi" in CPU hotplug multstate refers to these per
cpu struct zcomp instances. Each of these will have the CPU hotplug
callback called for it on CPU plug / unplug. The kernel's CPU hotplug
multistate keeps a linked list of these different structures so that
it will iterate over them on CPU transitions.

By default at driver initialization we will create just one zram device
(num_devices=1) and a zcomp structure then set for the now default
lzo-rle comrpession algorithm. At driver removal we first remove each
zram device, and so we destroy the associated struct zcomp per CPU. But
since we expose sysfs attributes to create new devices or reset /
initialize existing zram devices, we can easily end up re-initializing
a struct zcomp for a zram device before the exit routine of the module
removes the cpu hotplug callback. When this happens the kernel's CPU
hotplug will detect that at least one instance (struct zcomp for us)
exists. This can happen in the following situation:

CPU 1                            CPU 2

                                disksize_store(...);
class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();

idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);

The warning comes up on cpuhp_remove_multi_state() when it sees that the
state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
In this case, that a struct zcom still exists, the driver allowed its
creation per CPU even though we could have just freed them per CPU
though a call on another CPU, and we are then later trying to remove the
hotplug callback.

Fix all this by providing a zram initialization boolean
protected the shared in the driver zram_index_mutex, which we
can use to annotate when sysfs attributes are safe to use or
not -- once the driver is properly initialized. When the driver
is going down we also are sure to not let userspace muck with
attributes which may affect each per cpu struct zcomp.

This also fixes a series of possible memory leaks. The
crashes and memory leaks can easily be caused by issuing
the zram02.sh script from the LTP project [0] in a loop
in two separate windows:

  cd testcases/kernel/device-drivers/zram
  while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done

You end up with a splat as follows:

kernel: zram: Removed device: zram0
kernel: zram: Added device: zram0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: Adding 104857596k swap on /dev/zram0.  <etc>
kernel: zram0: detected capacitky change from 209715200 to 0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: ------------[ cut here ]------------
kernel: Error: Removing state 63 which has instances left.
kernel: WARNING: CPU: 7 PID: 70457 at \
	kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G            \
	E     5.12.0-rc1-next-20210304 #3
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
	BIOS 1.14.0-2 04/01/2014
kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Code: <etc>
kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
kernel: RBP: 0000000000000000i R08: 0000000000000000 R09: ffffa800c139bcb8
kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
kernel: CS:  0010 DS: 0000 ES 0000 CR0: 0000000080050033
kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: __cpuhp_remove_state+0x2e/0x80
kernel: __do_sys_delete_module+0x190/0x2a0
kernel:  do_syscall_64+0x33/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae

The "Error: Removing state 63 which has instances left" refers
to the zram per CPU struct zcomp instances left.

[0] https://github.com/linux-test-project/ltp.git

Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
 1 file changed, 55 insertions(+), 8 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f61910c65f0f..b26abcb955cc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
 static int zram_major;
 static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
 
+static bool zram_up;
+
 /* Module params (documentation at end) */
 static unsigned int num_devices = 1;
 /*
@@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
 	comp = zram->comp;
 	disksize = zram->disksize;
 	zram->disksize = 0;
+	zram->comp = NULL;
 
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);
@@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev,
 	struct zram *zram = dev_to_zram(dev);
 	int err;
 
+	mutex_lock(&zram_index_mutex);
+
+	if (!zram_up) {
+		err = -ENODEV;
+		goto out;
+	}
+
 	disksize = memparse(buf, NULL);
-	if (!disksize)
-		return -EINVAL;
+	if (!disksize) {
+		err = -EINVAL;
+		goto out;
+	}
 
 	down_write(&zram->init_lock);
 	if (init_done(zram)) {
@@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev,
 	set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
 	up_write(&zram->init_lock);
 
+	mutex_unlock(&zram_index_mutex);
+
 	return len;
 
 out_free_meta:
 	zram_meta_free(zram, disksize);
 out_unlock:
 	up_write(&zram->init_lock);
+out:
+	mutex_unlock(&zram_index_mutex);
 	return err;
 }
 
@@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	if (!do_reset)
-		return -EINVAL;
+	mutex_lock(&zram_index_mutex);
+
+	if (!zram_up) {
+		len = -ENODEV;
+		goto out;
+	}
+
+	if (!do_reset) {
+		len = -EINVAL;
+		goto out;
+	}
 
 	zram = dev_to_zram(dev);
 	bdev = zram->disk->part0;
@@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev,
 	/* Do not reset an active device or claimed device */
 	if (bdev->bd_openers || zram->claim) {
 		mutex_unlock(&bdev->bd_disk->open_mutex);
-		return -EBUSY;
+		len = -EBUSY;
+		goto out;
 	}
 
 	/* From now on, anyone can't open /dev/zram[0-9] */
@@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev,
 	zram->claim = false;
 	mutex_unlock(&bdev->bd_disk->open_mutex);
 
+out:
+	mutex_unlock(&zram_index_mutex);
 	return len;
 }
 
@@ -2010,6 +2038,10 @@ static ssize_t hot_add_show(struct class *class,
 	int ret;
 
 	mutex_lock(&zram_index_mutex);
+	if (!zram_up) {
+		mutex_unlock(&zram_index_mutex);
+		return -ENODEV;
+	}
 	ret = zram_add();
 	mutex_unlock(&zram_index_mutex);
 
@@ -2037,6 +2069,11 @@ static ssize_t hot_remove_store(struct class *class,
 
 	mutex_lock(&zram_index_mutex);
 
+	if (!zram_up) {
+		ret = -ENODEV;
+		goto out;
+	}
+
 	zram = idr_find(&zram_index_idr, dev_id);
 	if (zram) {
 		ret = zram_remove(zram);
@@ -2046,6 +2083,7 @@ static ssize_t hot_remove_store(struct class *class,
 		ret = -ENODEV;
 	}
 
+out:
 	mutex_unlock(&zram_index_mutex);
 	return ret ? ret : count;
 }
@@ -2072,12 +2110,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)
 
 static void destroy_devices(void)
 {
+	mutex_lock(&zram_index_mutex);
+	zram_up = false;
 	class_unregister(&zram_control_class);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
 	zram_debugfs_destroy();
 	idr_destroy(&zram_index_idr);
 	unregister_blkdev(zram_major, "zram");
 	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+	mutex_unlock(&zram_index_mutex);
 }
 
 static int __init zram_init(void)
@@ -2105,15 +2146,21 @@ static int __init zram_init(void)
 		return -EBUSY;
 	}
 
+	mutex_lock(&zram_index_mutex);
+
 	while (num_devices != 0) {
-		mutex_lock(&zram_index_mutex);
 		ret = zram_add();
-		mutex_unlock(&zram_index_mutex);
-		if (ret < 0)
+		if (ret < 0) {
+			mutex_unlock(&zram_index_mutex);
 			goto out_error;
+		}
 		num_devices--;
 	}
 
+	zram_up = true;
+
+	mutex_unlock(&zram_index_mutex);
+
 	return 0;
 
 out_error:
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 11/12] zram: fix crashes with cpu hotplug multistate
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	minchan-DgEjT+Ai2ygdnm+yROfE0A, jeyu-DgEjT+Ai2ygdnm+yROfE0A,
	shuah-DgEjT+Ai2ygdnm+yROfE0A
  Cc: rdunlap-wEGCiKHe2LqWVfeAwA7xHQ, rafael-DgEjT+Ai2ygdnm+yROfE0A,
	masahiroy-DgEjT+Ai2ygdnm+yROfE0A,
	ndesaulniers-hpIqsD4AKlfQT0dZR+AlfA,
	yzaikin-hpIqsD4AKlfQT0dZR+AlfA, nathan-DgEjT+Ai2ygdnm+yROfE0A,
	ojeda-DgEjT+Ai2ygdnm+yROfE0A,
	penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp,
	vitor-HdruG0TBZYlAfugRpC6u6w, elver-hpIqsD4AKlfQT0dZR+AlfA,
	jarkko-DgEjT+Ai2ygdnm+yROfE0A, glider-hpIqsD4AKlfQT0dZR+AlfA,
	rf-yzvPICuk2AA4QjBA90+/kJqQE7yCjDx5,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	David.Laight-ZS65k/vG3HxXrIkS9f7CXA, bvanassche-HInyCGIudOg,
	jolsa-DgEjT+Ai2ygdnm+yROfE0A,
	andriy.shevchenko-VuQAYsv1563Yd54FQh9/CA,
	trishalfonso-hpIqsD4AKlfQT0dZR+AlfA,
	andreyknvl-Re5JQEeQqe8AvxtiuMwx3w, jikos-DgEjT+Ai2ygdnm+yROfE0A,
	mbenes-IBi9RG/b67k, ngupta-KNmc09w0p+Ednm+yROfE0A,
	sergey.senozhatsky.work-Re5JQEeQqe8AvxtiuMwx3w,
	mcgrof-DgEjT+Ai2ygdnm+yROfE0A,
	reinette.chatre-ral2JQCrhuEAvxtiuMwx3w,
	fenghua.yu-ral2JQCrhuEAvxtiuMwx3w, bp-Gina5bIWoIWzQB+pC5nmwQ,
	x86-DgEjT+Ai2ygdnm+yROfE0A, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	lizefan.x-EC8Uxl6Npydl57MIdRCFDg, hannes-druUgvl0LCNAfugRpC6u6w,
	daniel.vetter-/w4YWyX8dFk, bhelgaas-hpIqsD4AKlfQT0dZR+AlfA,
	kw-vYTEC60ixJUAvxtiuMwx3w, dan.j.williams

Provide a simple state machine to fix races with driver exit where we
remove the CPU multistate callbacks and re-initialization / creation of
new per CPU instances which should be managed by these callbacks.

The zram driver makes use of cpu hotplug multistate support, whereby it
associates a struct zcomp per CPU. Each struct zcomp represents a
compression algorithm in charge of managing compression streams per
CPU. Although a compiled zram driver only supports a fixed set of
compression algorithms, each zram device gets a struct zcomp allocated
per CPU. The "multi" in CPU hotplug multstate refers to these per
cpu struct zcomp instances. Each of these will have the CPU hotplug
callback called for it on CPU plug / unplug. The kernel's CPU hotplug
multistate keeps a linked list of these different structures so that
it will iterate over them on CPU transitions.

By default at driver initialization we will create just one zram device
(num_devices=1) and a zcomp structure then set for the now default
lzo-rle comrpession algorithm. At driver removal we first remove each
zram device, and so we destroy the associated struct zcomp per CPU. But
since we expose sysfs attributes to create new devices or reset /
initialize existing zram devices, we can easily end up re-initializing
a struct zcomp for a zram device before the exit routine of the module
removes the cpu hotplug callback. When this happens the kernel's CPU
hotplug will detect that at least one instance (struct zcomp for us)
exists. This can happen in the following situation:

CPU 1                            CPU 2

                                disksize_store(...);
class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();

idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);

The warning comes up on cpuhp_remove_multi_state() when it sees that the
state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
In this case, that a struct zcom still exists, the driver allowed its
creation per CPU even though we could have just freed them per CPU
though a call on another CPU, and we are then later trying to remove the
hotplug callback.

Fix all this by providing a zram initialization boolean
protected the shared in the driver zram_index_mutex, which we
can use to annotate when sysfs attributes are safe to use or
not -- once the driver is properly initialized. When the driver
is going down we also are sure to not let userspace muck with
attributes which may affect each per cpu struct zcomp.

This also fixes a series of possible memory leaks. The
crashes and memory leaks can easily be caused by issuing
the zram02.sh script from the LTP project [0] in a loop
in two separate windows:

  cd testcases/kernel/device-drivers/zram
  while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done

You end up with a splat as follows:

kernel: zram: Removed device: zram0
kernel: zram: Added device: zram0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: Adding 104857596k swap on /dev/zram0.  <etc>
kernel: zram0: detected capacitky change from 209715200 to 0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: ------------[ cut here ]------------
kernel: Error: Removing state 63 which has instances left.
kernel: WARNING: CPU: 7 PID: 70457 at \
	kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G            \
	E     5.12.0-rc1-next-20210304 #3
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
	BIOS 1.14.0-2 04/01/2014
kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Code: <etc>
kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
kernel: RBP: 0000000000000000i R08: 0000000000000000 R09: ffffa800c139bcb8
kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
kernel: CS:  0010 DS: 0000 ES 0000 CR0: 0000000080050033
kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: __cpuhp_remove_state+0x2e/0x80
kernel: __do_sys_delete_module+0x190/0x2a0
kernel:  do_syscall_64+0x33/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae

The "Error: Removing state 63 which has instances left" refers
to the zram per CPU struct zcomp instances left.

[0] https://github.com/linux-test-project/ltp.git

Acked-by: Minchan Kim <minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Signed-off-by: Luis Chamberlain <mcgrof-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
 1 file changed, 55 insertions(+), 8 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f61910c65f0f..b26abcb955cc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
 static int zram_major;
 static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
 
+static bool zram_up;
+
 /* Module params (documentation at end) */
 static unsigned int num_devices = 1;
 /*
@@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
 	comp = zram->comp;
 	disksize = zram->disksize;
 	zram->disksize = 0;
+	zram->comp = NULL;
 
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);
@@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev,
 	struct zram *zram = dev_to_zram(dev);
 	int err;
 
+	mutex_lock(&zram_index_mutex);
+
+	if (!zram_up) {
+		err = -ENODEV;
+		goto out;
+	}
+
 	disksize = memparse(buf, NULL);
-	if (!disksize)
-		return -EINVAL;
+	if (!disksize) {
+		err = -EINVAL;
+		goto out;
+	}
 
 	down_write(&zram->init_lock);
 	if (init_done(zram)) {
@@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev,
 	set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
 	up_write(&zram->init_lock);
 
+	mutex_unlock(&zram_index_mutex);
+
 	return len;
 
 out_free_meta:
 	zram_meta_free(zram, disksize);
 out_unlock:
 	up_write(&zram->init_lock);
+out:
+	mutex_unlock(&zram_index_mutex);
 	return err;
 }
 
@@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	if (!do_reset)
-		return -EINVAL;
+	mutex_lock(&zram_index_mutex);
+
+	if (!zram_up) {
+		len = -ENODEV;
+		goto out;
+	}
+
+	if (!do_reset) {
+		len = -EINVAL;
+		goto out;
+	}
 
 	zram = dev_to_zram(dev);
 	bdev = zram->disk->part0;
@@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev,
 	/* Do not reset an active device or claimed device */
 	if (bdev->bd_openers || zram->claim) {
 		mutex_unlock(&bdev->bd_disk->open_mutex);
-		return -EBUSY;
+		len = -EBUSY;
+		goto out;
 	}
 
 	/* From now on, anyone can't open /dev/zram[0-9] */
@@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev,
 	zram->claim = false;
 	mutex_unlock(&bdev->bd_disk->open_mutex);
 
+out:
+	mutex_unlock(&zram_index_mutex);
 	return len;
 }
 
@@ -2010,6 +2038,10 @@ static ssize_t hot_add_show(struct class *class,
 	int ret;
 
 	mutex_lock(&zram_index_mutex);
+	if (!zram_up) {
+		mutex_unlock(&zram_index_mutex);
+		return -ENODEV;
+	}
 	ret = zram_add();
 	mutex_unlock(&zram_index_mutex);
 
@@ -2037,6 +2069,11 @@ static ssize_t hot_remove_store(struct class *class,
 
 	mutex_lock(&zram_index_mutex);
 
+	if (!zram_up) {
+		ret = -ENODEV;
+		goto out;
+	}
+
 	zram = idr_find(&zram_index_idr, dev_id);
 	if (zram) {
 		ret = zram_remove(zram);
@@ -2046,6 +2083,7 @@ static ssize_t hot_remove_store(struct class *class,
 		ret = -ENODEV;
 	}
 
+out:
 	mutex_unlock(&zram_index_mutex);
 	return ret ? ret : count;
 }
@@ -2072,12 +2110,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)
 
 static void destroy_devices(void)
 {
+	mutex_lock(&zram_index_mutex);
+	zram_up = false;
 	class_unregister(&zram_control_class);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
 	zram_debugfs_destroy();
 	idr_destroy(&zram_index_idr);
 	unregister_blkdev(zram_major, "zram");
 	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+	mutex_unlock(&zram_index_mutex);
 }
 
 static int __init zram_init(void)
@@ -2105,15 +2146,21 @@ static int __init zram_init(void)
 		return -EBUSY;
 	}
 
+	mutex_lock(&zram_index_mutex);
+
 	while (num_devices != 0) {
-		mutex_lock(&zram_index_mutex);
 		ret = zram_add();
-		mutex_unlock(&zram_index_mutex);
-		if (ret < 0)
+		if (ret < 0) {
+			mutex_unlock(&zram_index_mutex);
 			goto out_error;
+		}
 		num_devices--;
 	}
 
+	zram_up = true;
+
+	mutex_unlock(&zram_index_mutex);
+
 	return 0;
 
 out_error:
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 12/12] zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal
       [not found] <20210918050430.3671227-1-mcgrof@kernel.org>
@ 2021-09-18  5:04   ` Luis Chamberlain
  2021-09-18  5:04   ` Luis Chamberlain
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams, senozhatsky,
	hch, joe, hkallweit1, axboe, jpoimboe, tglx, keescook, rostedt,
	peterz, linux-spdx, linux-doc, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, linux-kernel, copyleft-next

The ATTRIBUTE_GROUPS is typically used to avoid boiler plate
code which is used in many drivers. Embracing ATTRIBUTE_GROUPS was
long due on the zram driver, however a recent fix for sysfs allows
users of ATTRIBUTE_GROUPS to also associate a module to the group
attribute.

In zram's case this also means it allows us to fix a race which triggers
a deadlock on the zram driver. This deadlock happens when a sysfs attribute
use a lock also used on module removal. This happens when for instance a
sysfs file on a driver is used, then at the same time we have module
removal call trigger. The module removal call code holds a lock, and then
the sysfs file entry waits for the same lock. While holding the lock the
module removal tries to remove the sysfs entries, but these cannot be
removed yet as one is waiting for a lock. This won't complete as the lock
is already held. Likewise module removal cannot complete, and so we
deadlock.

Sysfs fixes this when the group attributes have a module associated to
it, sysfs will *try* to get a refcount to the module when a shared
lock is used, prior to mucking with a sysfs attribute. If this fails we
just give up right away.

This deadlock was first reported with the zram driver, a sketch of how
this can happen follows:

CPU A                              CPU B
                                   whatever_store()
module_unload
  mutex_lock(foo)
                                   mutex_lock(foo)
   del_gendisk(zram->disk);
     device_del()
       device_remove_groups()

In this situation whatever_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the sysfs file being poked
completes which is waiting for a lock already held.

This issue can be reproduced easily on the zram driver as follows:

Loop 1 on one terminal:

while true;
	do modprobe zram;
	modprobe -r zram;
done

Loop 2 on a second terminal:
while true; do
	echo 1024 >  /sys/block/zram0/disksize;
	echo 1 > /sys/block/zram0/reset;
done

Without this patch we end up in a deadlock, and the following
stack trace is produced which hints to us what the issue was:

INFO: task bash:888 blocked for more than 120 seconds.
      Tainted: G            E 5.12.0-rc1-next-20210304+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash            state:D stack:    0 pid:  888 ppid: 887 flags:<etc>
Call Trace:
 __schedule+0x2e4/0x900
 schedule+0x46/0xb0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.constprop.0+0x2c3/0x490
 ? _kstrtoull+0x35/0xd0
 reset_store+0x6c/0x160 [zram]
 kernfs_fop_write_iter+0x124/0x1b0
 new_sync_write+0x11c/0x1b0
 vfs_write+0x1c2/0x260
 ksys_write+0x5f/0xe0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f34f2c3df33
RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33
RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001
RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001
R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0
INFO: task modprobe:1104 can't die for more than 120 seconds.
task:modprobe        state:D stack:    0 pid: 1104 ppid: 916 flags:<etc>
Call Trace:
 __schedule+0x2e4/0x900
 schedule+0x46/0xb0
 __kernfs_remove.part.0+0x228/0x2b0
 ? finish_wait+0x80/0x80
 kernfs_remove_by_name_ns+0x50/0x90
 remove_files+0x2b/0x60
 sysfs_remove_group+0x38/0x80
 sysfs_remove_groups+0x29/0x40
 device_remove_attrs+0x4a/0x80
 device_del+0x183/0x3e0
 ? mutex_lock+0xe/0x30
 del_gendisk+0x27a/0x2d0
 zram_remove+0x8a/0xb0 [zram]
 ? hot_remove_store+0xf0/0xf0 [zram]
 zram_remove_cb+0xd/0x10 [zram]
 idr_for_each+0x5e/0xd0
 destroy_devices+0x39/0x6f [zram]
 __do_sys_delete_module+0x190/0x2a0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f32adf727d7
RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78
RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78
R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20

[0] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/zram/zram_drv.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b26abcb955cc..60a55ae8cd91 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1902,14 +1902,7 @@ static struct attribute *zram_disk_attrs[] = {
 	NULL,
 };
 
-static const struct attribute_group zram_disk_attr_group = {
-	.attrs = zram_disk_attrs,
-};
-
-static const struct attribute_group *zram_disk_attr_groups[] = {
-	&zram_disk_attr_group,
-	NULL,
-};
+ATTRIBUTE_GROUPS(zram_disk);
 
 /*
  * Allocate and initialize new zram device. the function returns
@@ -1981,7 +1974,7 @@ static int zram_add(void)
 		blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
 	blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);
-	device_add_disk(NULL, zram->disk, zram_disk_attr_groups);
+	device_add_disk(NULL, zram->disk, zram_disk_groups);
 
 	strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v7 12/12] zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal
@ 2021-09-18  5:04   ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-18  5:04 UTC (permalink / raw)
  To: tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, bvanassche, jolsa, andriy.shevchenko, trishalfonso,
	andreyknvl, jikos, mbenes, ngupta, sergey.senozhatsky.work,
	mcgrof, reinette.chatre, fenghua.yu, bp, x86, hpa, lizefan.x,
	hannes, daniel.vetter, bhelgaas, kw, dan.j.williams

The ATTRIBUTE_GROUPS is typically used to avoid boiler plate
code which is used in many drivers. Embracing ATTRIBUTE_GROUPS was
long due on the zram driver, however a recent fix for sysfs allows
users of ATTRIBUTE_GROUPS to also associate a module to the group
attribute.

In zram's case this also means it allows us to fix a race which triggers
a deadlock on the zram driver. This deadlock happens when a sysfs attribute
use a lock also used on module removal. This happens when for instance a
sysfs file on a driver is used, then at the same time we have module
removal call trigger. The module removal call code holds a lock, and then
the sysfs file entry waits for the same lock. While holding the lock the
module removal tries to remove the sysfs entries, but these cannot be
removed yet as one is waiting for a lock. This won't complete as the lock
is already held. Likewise module removal cannot complete, and so we
deadlock.

Sysfs fixes this when the group attributes have a module associated to
it, sysfs will *try* to get a refcount to the module when a shared
lock is used, prior to mucking with a sysfs attribute. If this fails we
just give up right away.

This deadlock was first reported with the zram driver, a sketch of how
this can happen follows:

CPU A                              CPU B
                                   whatever_store()
module_unload
  mutex_lock(foo)
                                   mutex_lock(foo)
   del_gendisk(zram->disk);
     device_del()
       device_remove_groups()

In this situation whatever_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the sysfs file being poked
completes which is waiting for a lock already held.

This issue can be reproduced easily on the zram driver as follows:

Loop 1 on one terminal:

while true;
	do modprobe zram;
	modprobe -r zram;
done

Loop 2 on a second terminal:
while true; do
	echo 1024 >  /sys/block/zram0/disksize;
	echo 1 > /sys/block/zram0/reset;
done

Without this patch we end up in a deadlock, and the following
stack trace is produced which hints to us what the issue was:

INFO: task bash:888 blocked for more than 120 seconds.
      Tainted: G            E 5.12.0-rc1-next-20210304+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash            state:D stack:    0 pid:  888 ppid: 887 flags:<etc>
Call Trace:
 __schedule+0x2e4/0x900
 schedule+0x46/0xb0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.constprop.0+0x2c3/0x490
 ? _kstrtoull+0x35/0xd0
 reset_store+0x6c/0x160 [zram]
 kernfs_fop_write_iter+0x124/0x1b0
 new_sync_write+0x11c/0x1b0
 vfs_write+0x1c2/0x260
 ksys_write+0x5f/0xe0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f34f2c3df33
RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33
RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001
RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001
R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0
INFO: task modprobe:1104 can't die for more than 120 seconds.
task:modprobe        state:D stack:    0 pid: 1104 ppid: 916 flags:<etc>
Call Trace:
 __schedule+0x2e4/0x900
 schedule+0x46/0xb0
 __kernfs_remove.part.0+0x228/0x2b0
 ? finish_wait+0x80/0x80
 kernfs_remove_by_name_ns+0x50/0x90
 remove_files+0x2b/0x60
 sysfs_remove_group+0x38/0x80
 sysfs_remove_groups+0x29/0x40
 device_remove_attrs+0x4a/0x80
 device_del+0x183/0x3e0
 ? mutex_lock+0xe/0x30
 del_gendisk+0x27a/0x2d0
 zram_remove+0x8a/0xb0 [zram]
 ? hot_remove_store+0xf0/0xf0 [zram]
 zram_remove_cb+0xd/0x10 [zram]
 idr_for_each+0x5e/0xd0
 destroy_devices+0x39/0x6f [zram]
 __do_sys_delete_module+0x190/0x2a0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f32adf727d7
RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78
RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78
R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20

[0] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/zram/zram_drv.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b26abcb955cc..60a55ae8cd91 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1902,14 +1902,7 @@ static struct attribute *zram_disk_attrs[] = {
 	NULL,
 };
 
-static const struct attribute_group zram_disk_attr_group = {
-	.attrs = zram_disk_attrs,
-};
-
-static const struct attribute_group *zram_disk_attr_groups[] = {
-	&zram_disk_attr_group,
-	NULL,
-};
+ATTRIBUTE_GROUPS(zram_disk);
 
 /*
  * Allocate and initialize new zram device. the function returns
@@ -1981,7 +1974,7 @@ static int zram_add(void)
 		blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
 	blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);
-	device_add_disk(NULL, zram->disk, zram_disk_attr_groups);
+	device_add_disk(NULL, zram->disk, zram_disk_groups);
 
 	strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
  2021-09-18  5:04   ` Luis Chamberlain
@ 2021-09-20 17:53     ` Tejun Heo
  -1 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2021-09-20 17:53 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael, masahiroy,
	ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel, vitor,
	elver, jarkko, glider, rf, stephen, David.Laight, bvanassche,
	jolsa, andriy.shevchenko, trishalfonso, andreyknvl, jikos,
	mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes, daniel.vetter,
	bhelgaas, kw, dan.j.williams, senozhatsky, hch, joe, hkallweit1,
	axboe, jpoimboe, tglx, keescook, rostedt, peterz, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest, cgroups,
	linux-kernel, copyleft-next

Hello,

On Fri, Sep 17, 2021 at 10:04:27PM -0700, Luis Chamberlain wrote:
> If try_module_get() fails we fail the operation on the kernfs node.
> 
> We use a try method as a full lock means we'd then make our sysfs
> attributes busy us out from possible module removal, and so userspace
> could force denying module removal, a silly form of "DOS" against module
> removal. A try lock on the module removal ensures we give priority to
> module removal and interacting with sysfs attributes only comes second.
> Using a full lock could mean for instance that if you don't stop poking
> at sysfs files you cannot remove a module.

I find this explanation odd because there's no real equivalent to locking
the module (as opposed to try locking) because you can't wait for the
removal to finish and then grant the lock, so any operation which increases
the reference *has* to be a try method unless the caller already holds a
reference to the same module and thus knows that the module is already
pinned. The code isn't wrong, so maybe just drop the related paragraphs in
the commit message?

>  static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
>  					     struct kernfs_node *parent,
>  					     const char *name, umode_t mode,
> +					     struct module *owner,
>  					     kuid_t uid, kgid_t gid,
>  					     unsigned flags)

Is there a particular reason why @owner is added between @mode and @uid?
Sitting between two fs attributes seems a bit awkward. Maybe it can just be
the last one?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-20 17:53     ` Tejun Heo
  0 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2021-09-20 17:53 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael, masahiroy,
	ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel, vitor,
	elver, jarkko, glider, rf, stephen, David.Laight, bvanassche,
	jolsa, andriy.shevchenko, trishalfonso, andreyknvl, jikos,
	mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes

Hello,

On Fri, Sep 17, 2021 at 10:04:27PM -0700, Luis Chamberlain wrote:
> If try_module_get() fails we fail the operation on the kernfs node.
> 
> We use a try method as a full lock means we'd then make our sysfs
> attributes busy us out from possible module removal, and so userspace
> could force denying module removal, a silly form of "DOS" against module
> removal. A try lock on the module removal ensures we give priority to
> module removal and interacting with sysfs attributes only comes second.
> Using a full lock could mean for instance that if you don't stop poking
> at sysfs files you cannot remove a module.

I find this explanation odd because there's no real equivalent to locking
the module (as opposed to try locking) because you can't wait for the
removal to finish and then grant the lock, so any operation which increases
the reference *has* to be a try method unless the caller already holds a
reference to the same module and thus knows that the module is already
pinned. The code isn't wrong, so maybe just drop the related paragraphs in
the commit message?

>  static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
>  					     struct kernfs_node *parent,
>  					     const char *name, umode_t mode,
> +					     struct module *owner,
>  					     kuid_t uid, kgid_t gid,
>  					     unsigned flags)

Is there a particular reason why @owner is added between @mode and @uid?
Sitting between two fs attributes seems a bit awkward. Maybe it can just be
the last one?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 00/12 (RESEND)] syfs: generic deadlock fix with module removal
       [not found] <20210918050430.3671227-1-mcgrof@kernel.org>
@ 2021-09-20 17:55   ` Tejun Heo
  2021-09-18  5:04   ` Luis Chamberlain
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2021-09-20 17:55 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael, masahiroy,
	ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel, vitor,
	elver, jarkko, glider, rf, stephen, David.Laight, bvanassche,
	jolsa, andriy.shevchenko, trishalfonso, andreyknvl, jikos,
	mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes, daniel.vetter,
	bhelgaas, kw, dan.j.williams, senozhatsky, hch, joe, hkallweit1,
	axboe, jpoimboe, tglx, keescook, rostedt, peterz, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest, cgroups,
	linux-kernel, copyleft-next

On Fri, Sep 17, 2021 at 10:04:18PM -0700, Luis Chamberlain wrote:
> In this v7 I've decided it is best to merge all the effort together into
> one patch set because communication was being lost when I split the
> patches up. This was not helping in any way to either fix the zram
> issues or come to consensus on a generic solution. The patches are also
> merged now because they are all related now.

Building up all the testing framewoork is really great. I have no opinions
about the license related stuff but all other changes generally look good to
me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 00/12 (RESEND)] syfs: generic deadlock fix with module removal
@ 2021-09-20 17:55   ` Tejun Heo
  0 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2021-09-20 17:55 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael, masahiroy,
	ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel, vitor,
	elver, jarkko, glider, rf, stephen, David.Laight, bvanassche,
	jolsa, andriy.shevchenko, trishalfonso, andreyknvl, jikos,
	mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes

On Fri, Sep 17, 2021 at 10:04:18PM -0700, Luis Chamberlain wrote:
> In this v7 I've decided it is best to merge all the effort together into
> one patch set because communication was being lost when I split the
> patches up. This was not helping in any way to either fix the zram
> issues or come to consensus on a generic solution. The patches are also
> merged now because they are all related now.

Building up all the testing framewoork is really great. I have no opinions
about the license related stuff but all other changes generally look good to
me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
  2021-09-20 17:53     ` Tejun Heo
@ 2021-09-20 19:15       ` Luis Chamberlain
  -1 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-20 19:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael, masahiroy,
	ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel, vitor,
	elver, jarkko, glider, rf, stephen, David.Laight, bvanassche,
	jolsa, andriy.shevchenko, trishalfonso, andreyknvl, jikos,
	mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes, daniel.vetter,
	bhelgaas, kw, dan.j.williams, senozhatsky, hch, joe, hkallweit1,
	axboe, jpoimboe, tglx, keescook, rostedt, peterz, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest, cgroups,
	linux-kernel, copyleft-next

On Mon, Sep 20, 2021 at 07:53:16AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Sep 17, 2021 at 10:04:27PM -0700, Luis Chamberlain wrote:
> > If try_module_get() fails we fail the operation on the kernfs node.
> > 
> > We use a try method as a full lock means we'd then make our sysfs
> > attributes busy us out from possible module removal, and so userspace
> > could force denying module removal, a silly form of "DOS" against module
> > removal. A try lock on the module removal ensures we give priority to
> > module removal and interacting with sysfs attributes only comes second.
> > Using a full lock could mean for instance that if you don't stop poking
> > at sysfs files you cannot remove a module.
> 
> I find this explanation odd because there's no real equivalent to locking
> the module (as opposed to try locking) 

Actually there is, __module_get() but I suspect some of these users are
probably incorrect and should be be moved to try. The documentation
about "rmmod --wait" for __module_get() is also outdated as that option
is no longer supported. I'll send an update for that later.

> because you can't wait for the
> removal to finish and then grant the lock, so any operation which increases
> the reference *has* to be a try method unless the caller already holds a
> reference to the same module and thus knows that the module is already
> pinned.

Right, the reason I mention the alternative is that we technically don't
need to use try in this case since during a kernfs op it is implied the
module will be pinned, but we have further motivations to use a try
method here: to avoid a possible DOS from module removal by userspace
mucking with ops.

> The code isn't wrong, so maybe just drop the related paragraphs in
> the commit message?

Does it make sense to clarify the above a bit more somehow? Or do think
its not needed?

> >  static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
> >  					     struct kernfs_node *parent,
> >  					     const char *name, umode_t mode,
> > +					     struct module *owner,
> >  					     kuid_t uid, kgid_t gid,
> >  					     unsigned flags)
> 
> Is there a particular reason why @owner is added between @mode and @uid?
> Sitting between two fs attributes seems a bit awkward. Maybe it can just be
> the last one?

No, I just picked an arbitrary place. Sure I'll move it to the end.

  Luis

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-20 19:15       ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-20 19:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael, masahiroy,
	ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel, vitor,
	elver, jarkko, glider, rf, stephen, David.Laight, bvanassche,
	jolsa, andriy.shevchenko, trishalfonso, andreyknvl, jikos,
	mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes

On Mon, Sep 20, 2021 at 07:53:16AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Sep 17, 2021 at 10:04:27PM -0700, Luis Chamberlain wrote:
> > If try_module_get() fails we fail the operation on the kernfs node.
> > 
> > We use a try method as a full lock means we'd then make our sysfs
> > attributes busy us out from possible module removal, and so userspace
> > could force denying module removal, a silly form of "DOS" against module
> > removal. A try lock on the module removal ensures we give priority to
> > module removal and interacting with sysfs attributes only comes second.
> > Using a full lock could mean for instance that if you don't stop poking
> > at sysfs files you cannot remove a module.
> 
> I find this explanation odd because there's no real equivalent to locking
> the module (as opposed to try locking) 

Actually there is, __module_get() but I suspect some of these users are
probably incorrect and should be be moved to try. The documentation
about "rmmod --wait" for __module_get() is also outdated as that option
is no longer supported. I'll send an update for that later.

> because you can't wait for the
> removal to finish and then grant the lock, so any operation which increases
> the reference *has* to be a try method unless the caller already holds a
> reference to the same module and thus knows that the module is already
> pinned.

Right, the reason I mention the alternative is that we technically don't
need to use try in this case since during a kernfs op it is implied the
module will be pinned, but we have further motivations to use a try
method here: to avoid a possible DOS from module removal by userspace
mucking with ops.

> The code isn't wrong, so maybe just drop the related paragraphs in
> the commit message?

Does it make sense to clarify the above a bit more somehow? Or do think
its not needed?

> >  static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
> >  					     struct kernfs_node *parent,
> >  					     const char *name, umode_t mode,
> > +					     struct module *owner,
> >  					     kuid_t uid, kgid_t gid,
> >  					     unsigned flags)
> 
> Is there a particular reason why @owner is added between @mode and @uid?
> Sitting between two fs attributes seems a bit awkward. Maybe it can just be
> the last one?

No, I just picked an arbitrary place. Sure I'll move it to the end.

  Luis

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
  2021-09-20 19:15       ` Luis Chamberlain
@ 2021-09-20 19:22         ` Tejun Heo
  -1 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2021-09-20 19:22 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael, masahiroy,
	ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel, vitor,
	elver, jarkko, glider, rf, stephen, David.Laight, bvanassche,
	jolsa, andriy.shevchenko, trishalfonso, andreyknvl, jikos,
	mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes, daniel.vetter,
	bhelgaas, kw, dan.j.williams, senozhatsky, hch, joe, hkallweit1,
	axboe, jpoimboe, tglx, keescook, rostedt, peterz, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest, cgroups,
	linux-kernel, copyleft-next

Hello,

On Mon, Sep 20, 2021 at 12:15:22PM -0700, Luis Chamberlain wrote:
> > I find this explanation odd because there's no real equivalent to locking
> > the module (as opposed to try locking) 
> 
> Actually there is, __module_get() but I suspect some of these users are
> probably incorrect and should be be moved to try. The documentation

__module_get() is just getting an extra ref when the caller already
has one (or more). It can't be used to freshly acquire a new
reference. There is no equivalence between the relationship between
try_module_get() and __module_get() and the one between spin_trylock()
and spin_lock().

> Right, the reason I mention the alternative is that we technically don't
> need to use try in this case since during a kernfs op it is implied the
> module will be pinned, but we have further motivations to use a try

I'm confused. If the module is already pinned, why are we getting an
extra reference? Also, I don't understand how this has that much to do
with preventing ddoses. I mean, it does cut down the duration of one
operation but the eventual gating is through whoever acquiring the
initial reference through try_module_get(), which again is the *only*
way to acquire a fresh reference.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-20 19:22         ` Tejun Heo
  0 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2021-09-20 19:22 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael, masahiroy,
	ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel, vitor,
	elver, jarkko, glider, rf, stephen, David.Laight, bvanassche,
	jolsa, andriy.shevchenko, trishalfonso, andreyknvl, jikos,
	mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes

Hello,

On Mon, Sep 20, 2021 at 12:15:22PM -0700, Luis Chamberlain wrote:
> > I find this explanation odd because there's no real equivalent to locking
> > the module (as opposed to try locking) 
> 
> Actually there is, __module_get() but I suspect some of these users are
> probably incorrect and should be be moved to try. The documentation

__module_get() is just getting an extra ref when the caller already
has one (or more). It can't be used to freshly acquire a new
reference. There is no equivalence between the relationship between
try_module_get() and __module_get() and the one between spin_trylock()
and spin_lock().

> Right, the reason I mention the alternative is that we technically don't
> need to use try in this case since during a kernfs op it is implied the
> module will be pinned, but we have further motivations to use a try

I'm confused. If the module is already pinned, why are we getting an
extra reference? Also, I don't understand how this has that much to do
with preventing ddoses. I mean, it does cut down the duration of one
operation but the eventual gating is through whoever acquiring the
initial reference through try_module_get(), which again is the *only*
way to acquire a fresh reference.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
  2021-09-20 19:22         ` Tejun Heo
@ 2021-09-20 19:38           ` Luis Chamberlain
  -1 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-20 19:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael, masahiroy,
	ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel, vitor,
	elver, jarkko, glider, rf, stephen, David.Laight, bvanassche,
	jolsa, andriy.shevchenko, trishalfonso, andreyknvl, jikos,
	mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes, daniel.vetter,
	bhelgaas, kw, dan.j.williams, senozhatsky, hch, joe, hkallweit1,
	axboe, jpoimboe, tglx, keescook, rostedt, peterz, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest, cgroups,
	linux-kernel, copyleft-next

On Mon, Sep 20, 2021 at 09:22:32AM -1000, Tejun Heo wrote:
> On Mon, Sep 20, 2021 at 12:15:22PM -0700, Luis Chamberlain wrote:
> 
> > Right, the reason I mention the alternative is that we technically don't
> > need to use try in this case since during a kernfs op it is implied the
> > module will be pinned, but we have further motivations to use a try
> 
> I'm confused. If the module is already pinned, why are we getting an
> extra reference? 

Sorry, I meant its pinned implicitly not through an actual refcount. A module
exit can happen then in that case, it just that the sysfs removal will
have to wait until the active reference for the kernfs ops go down.

> Also, I don't understand how this has that much to do
> with preventing ddoses. I mean, it does cut down the duration of one
> operation but the eventual gating is through whoever acquiring the
> initial reference through try_module_get(), which again is the *only*
> way to acquire a fresh reference.

True. I can just leave that out from the commit log. But it is perhaps
an implicit gain here that using try alows module exit to trump use
of an kernfs op. Maybe's that's sufficient to mention.

  Luis

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-20 19:38           ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-20 19:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael, masahiroy,
	ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel, vitor,
	elver, jarkko, glider, rf, stephen, David.Laight, bvanassche,
	jolsa, andriy.shevchenko, trishalfonso, andreyknvl, jikos,
	mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes

On Mon, Sep 20, 2021 at 09:22:32AM -1000, Tejun Heo wrote:
> On Mon, Sep 20, 2021 at 12:15:22PM -0700, Luis Chamberlain wrote:
> 
> > Right, the reason I mention the alternative is that we technically don't
> > need to use try in this case since during a kernfs op it is implied the
> > module will be pinned, but we have further motivations to use a try
> 
> I'm confused. If the module is already pinned, why are we getting an
> extra reference? 

Sorry, I meant its pinned implicitly not through an actual refcount. A module
exit can happen then in that case, it just that the sysfs removal will
have to wait until the active reference for the kernfs ops go down.

> Also, I don't understand how this has that much to do
> with preventing ddoses. I mean, it does cut down the duration of one
> operation but the eventual gating is through whoever acquiring the
> initial reference through try_module_get(), which again is the *only*
> way to acquire a fresh reference.

True. I can just leave that out from the commit log. But it is perhaps
an implicit gain here that using try alows module exit to trump use
of an kernfs op. Maybe's that's sufficient to mention.

  Luis

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-20 20:52     ` Dan Williams
  0 siblings, 0 replies; 50+ messages in thread
From: Dan Williams @ 2021-09-20 20:52 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Tejun Heo, Greg KH, Andrew Morton, Minchan Kim, jeyu, shuah,
	Randy Dunlap, Rafael J. Wysocki, Masahiro Yamada,
	Nick Desaulniers, yzaikin, Nathan Chancellor, ojeda,
	Tetsuo Handa, vitor, elver, Jarkko Sakkinen, Alexander Potapenko,
	rf, Stephen Hemminger, David Laight, bvanassche, jolsa,
	Andy Shevchenko, trishalfonso, andreyknvl, Jiri Kosina, mbenes,
	Nitin Gupta, Sergey Senozhatsky, Reinette Chatre, Fenghua Yu,
	Borislav Petkov, X86 ML, H. Peter Anvin, lizefan.x,
	Johannes Weiner, Daniel Vetter, Bjorn Helgaas,
	Krzysztof Wilczyński, senozhatsky, Christoph Hellwig,
	Joe Perches, hkallweit1, Jens Axboe, Josh Poimboeuf,
	Thomas Gleixner, Kees Cook, Steven Rostedt, Peter Zijlstra,
	linux-spdx, Linux Doc Mailing List, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, Linux Kernel Mailing List,
	copyleft-next

On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> When sysfs attributes use a lock also used on module removal we can
> race to deadlock. This happens when for instance a sysfs file on
> a driver is used, then at the same time we have module removal call
> trigger. The module removal call code holds a lock, and then the sysfs
> file entry waits for the same lock. While holding the lock the module
> removal tries to remove the sysfs entries, but these cannot be removed
> yet as one is waiting for a lock. This won't complete as the lock is
> already held. Likewise module removal cannot complete, and so we deadlock.
>
> This can now be easily reproducible with our sysfs selftest as follows:
>
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
>
> To fix this we extend the struct kernfs_node with a module reference and
> use the try_module_get() after kernfs_get_active() is called which
> protects integrity and the existence of the kernfs node during the
> operation.
>
> So long as the kernfs node is protected with kernfs_get_active() we know
> we can rely on its contents. And, as now just documented in the previous
> patch, we also now know that once kernfs_get_active() is called the module
> is also guarded to exist and cannot be removed.
>
> If try_module_get() fails we fail the operation on the kernfs node.
>
> We use a try method as a full lock means we'd then make our sysfs
> attributes busy us out from possible module removal, and so userspace
> could force denying module removal, a silly form of "DOS" against module
> removal. A try lock on the module removal ensures we give priority to
> module removal and interacting with sysfs attributes only comes second.
> Using a full lock could mean for instance that if you don't stop poking
> at sysfs files you cannot remove a module.
>
> Races between removal of sysfs files and the module are not possible
> given sysfs files are created by the same module, and when a sysfs file
> is being used kernfs prevents removal of the sysfs file. So if module
> removal is actually happening the removal would have to wait until
> the sysfs file operation is complete.
>
> This deadlock was first reported with the zram driver, however the live
> patching folks have acknowledged they have observed this as well with
> live patching, when a live patch is removed. I was then able to
> reproduce easily by creating a dedicated selftests.
>
> A sketch of how this can happen follows:
>
> CPU A                              CPU B
>                                    whatever_store()
> module_unload
>   mutex_lock(foo)
>                                    mutex_lock(foo)
>    del_gendisk(zram->disk);
>      device_del()
>        device_remove_groups()

This flow seems possible to trigger with:

   echo $dev > /sys/bus/$bus/drivers/$driver/unbind

I am missing why module pinning is part of the solution when it's the
device_del() path that is racing? Module removal is just a more coarse
grained way to trigger unbind => device_del(). Isn't the above a bug
in the driver, not missing synchronization in kernfs? Forgive me if
the unbind question was asked and answered elsewhere, this is my first
time taking a look at this series.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-20 20:52     ` Dan Williams
  0 siblings, 0 replies; 50+ messages in thread
From: Dan Williams @ 2021-09-20 20:52 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Tejun Heo, Greg KH, Andrew Morton, Minchan Kim,
	jeyu-DgEjT+Ai2ygdnm+yROfE0A, shuah, Randy Dunlap,
	Rafael J. Wysocki, Masahiro Yamada, Nick Desaulniers,
	yzaikin-hpIqsD4AKlfQT0dZR+AlfA, Nathan Chancellor,
	ojeda-DgEjT+Ai2ygdnm+yROfE0A, Tetsuo Handa,
	vitor-HdruG0TBZYlAfugRpC6u6w, elver-hpIqsD4AKlfQT0dZR+AlfA,
	Jarkko Sakkinen, Alexander Potapenko,
	rf-yzvPICuk2AA4QjBA90+/kJqQE7yCjDx5, Stephen Hemminger,
	David Laight, bvanassche-HInyCGIudOg,
	jolsa-DgEjT+Ai2ygdnm+yROfE0A, Andy Shevchenko

On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <mcgrof-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> When sysfs attributes use a lock also used on module removal we can
> race to deadlock. This happens when for instance a sysfs file on
> a driver is used, then at the same time we have module removal call
> trigger. The module removal call code holds a lock, and then the sysfs
> file entry waits for the same lock. While holding the lock the module
> removal tries to remove the sysfs entries, but these cannot be removed
> yet as one is waiting for a lock. This won't complete as the lock is
> already held. Likewise module removal cannot complete, and so we deadlock.
>
> This can now be easily reproducible with our sysfs selftest as follows:
>
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
>
> To fix this we extend the struct kernfs_node with a module reference and
> use the try_module_get() after kernfs_get_active() is called which
> protects integrity and the existence of the kernfs node during the
> operation.
>
> So long as the kernfs node is protected with kernfs_get_active() we know
> we can rely on its contents. And, as now just documented in the previous
> patch, we also now know that once kernfs_get_active() is called the module
> is also guarded to exist and cannot be removed.
>
> If try_module_get() fails we fail the operation on the kernfs node.
>
> We use a try method as a full lock means we'd then make our sysfs
> attributes busy us out from possible module removal, and so userspace
> could force denying module removal, a silly form of "DOS" against module
> removal. A try lock on the module removal ensures we give priority to
> module removal and interacting with sysfs attributes only comes second.
> Using a full lock could mean for instance that if you don't stop poking
> at sysfs files you cannot remove a module.
>
> Races between removal of sysfs files and the module are not possible
> given sysfs files are created by the same module, and when a sysfs file
> is being used kernfs prevents removal of the sysfs file. So if module
> removal is actually happening the removal would have to wait until
> the sysfs file operation is complete.
>
> This deadlock was first reported with the zram driver, however the live
> patching folks have acknowledged they have observed this as well with
> live patching, when a live patch is removed. I was then able to
> reproduce easily by creating a dedicated selftests.
>
> A sketch of how this can happen follows:
>
> CPU A                              CPU B
>                                    whatever_store()
> module_unload
>   mutex_lock(foo)
>                                    mutex_lock(foo)
>    del_gendisk(zram->disk);
>      device_del()
>        device_remove_groups()

This flow seems possible to trigger with:

   echo $dev > /sys/bus/$bus/drivers/$driver/unbind

I am missing why module pinning is part of the solution when it's the
device_del() path that is racing? Module removal is just a more coarse
grained way to trigger unbind => device_del(). Isn't the above a bug
in the driver, not missing synchronization in kernfs? Forgive me if
the unbind question was asked and answered elsewhere, this is my first
time taking a look at this series.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
  2021-09-20 20:52     ` Dan Williams
@ 2021-09-20 21:17       ` Luis Chamberlain
  -1 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-20 21:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: Tejun Heo, Greg KH, Andrew Morton, Minchan Kim, jeyu, shuah,
	Randy Dunlap, Rafael J. Wysocki, Masahiro Yamada,
	Nick Desaulniers, yzaikin, Nathan Chancellor, ojeda,
	Tetsuo Handa, vitor, elver, Jarkko Sakkinen, Alexander Potapenko,
	rf, Stephen Hemminger, David Laight, bvanassche, jolsa,
	Andy Shevchenko, trishalfonso, andreyknvl, Jiri Kosina, mbenes,
	Nitin Gupta, Sergey Senozhatsky, Reinette Chatre, Fenghua Yu,
	Borislav Petkov, X86 ML, H. Peter Anvin, lizefan.x,
	Johannes Weiner, Daniel Vetter, Bjorn Helgaas,
	Krzysztof Wilczyński, senozhatsky, Christoph Hellwig,
	Joe Perches, hkallweit1, Jens Axboe, Josh Poimboeuf,
	Thomas Gleixner, Kees Cook, Steven Rostedt, Peter Zijlstra,
	linux-spdx, Linux Doc Mailing List, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, Linux Kernel Mailing List,
	copyleft-next

On Mon, Sep 20, 2021 at 01:52:21PM -0700, Dan Williams wrote:
> On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > This deadlock was first reported with the zram driver, however the live
> > patching folks have acknowledged they have observed this as well with
> > live patching, when a live patch is removed. I was then able to
> > reproduce easily by creating a dedicated selftests.
> >
> > A sketch of how this can happen follows:
> >
> > CPU A                              CPU B
> >                                    whatever_store()
> > module_unload
> >   mutex_lock(foo)
> >                                    mutex_lock(foo)
> >    del_gendisk(zram->disk);
> >      device_del()
> >        device_remove_groups()
> 
> This flow seems possible to trigger with:
> 
>    echo $dev > /sys/bus/$bus/drivers/$driver/unbind
> 
> I am missing why module pinning 

The aspect of try_module_get() which comes to value to prevent the
deadlock is it ensures kernfs ops do not run once exit is on the way.

> is part of the solution when it's the
> device_del() path that is racing?

But its not, the device_del() path will yield until the kernfs op
completes. It is fine to wait there.

The deadlock happens if a module exit routine uses a lock which is
also used on a sysfs op. If the lock was first held by module exit,
and module exit is waiting for the kernfs op to complete, and the
kernfs op is waiting to hold the same lock then the exit will wait
forever.

> Module removal is just a more coarse
> grained way to trigger unbind => device_del().

Right, but the device_del() path is not sharing a lock with the sysfs op.

> Isn't the above a bug
> in the driver, not missing synchronization in kernfs?

We can certainly take the position as an alternative:

  "thou shalt not use a lock on exit which is also used on a syfs op"

However that seems counter intuitive, specially if we can resolve the
issue easily with a try_module_get().

  Luis

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-20 21:17       ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-20 21:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: Tejun Heo, Greg KH, Andrew Morton, Minchan Kim, jeyu, shuah,
	Randy Dunlap, Rafael J. Wysocki, Masahiro Yamada,
	Nick Desaulniers, yzaikin, Nathan Chancellor, ojeda,
	Tetsuo Handa, vitor, elver, Jarkko Sakkinen, Alexander Potapenko,
	rf, Stephen Hemminger, David Laight, bvanassche, jolsa,
	Andy Shevchenko

On Mon, Sep 20, 2021 at 01:52:21PM -0700, Dan Williams wrote:
> On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > This deadlock was first reported with the zram driver, however the live
> > patching folks have acknowledged they have observed this as well with
> > live patching, when a live patch is removed. I was then able to
> > reproduce easily by creating a dedicated selftests.
> >
> > A sketch of how this can happen follows:
> >
> > CPU A                              CPU B
> >                                    whatever_store()
> > module_unload
> >   mutex_lock(foo)
> >                                    mutex_lock(foo)
> >    del_gendisk(zram->disk);
> >      device_del()
> >        device_remove_groups()
> 
> This flow seems possible to trigger with:
> 
>    echo $dev > /sys/bus/$bus/drivers/$driver/unbind
> 
> I am missing why module pinning 

The aspect of try_module_get() which comes to value to prevent the
deadlock is it ensures kernfs ops do not run once exit is on the way.

> is part of the solution when it's the
> device_del() path that is racing?

But its not, the device_del() path will yield until the kernfs op
completes. It is fine to wait there.

The deadlock happens if a module exit routine uses a lock which is
also used on a sysfs op. If the lock was first held by module exit,
and module exit is waiting for the kernfs op to complete, and the
kernfs op is waiting to hold the same lock then the exit will wait
forever.

> Module removal is just a more coarse
> grained way to trigger unbind => device_del().

Right, but the device_del() path is not sharing a lock with the sysfs op.

> Isn't the above a bug
> in the driver, not missing synchronization in kernfs?

We can certainly take the position as an alternative:

  "thou shalt not use a lock on exit which is also used on a syfs op"

However that seems counter intuitive, specially if we can resolve the
issue easily with a try_module_get().

  Luis

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
  2021-09-18  5:04   ` Luis Chamberlain
@ 2021-09-20 21:36     ` Bart Van Assche
  -1 siblings, 0 replies; 50+ messages in thread
From: Bart Van Assche @ 2021-09-20 21:36 UTC (permalink / raw)
  To: Luis Chamberlain, tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, jolsa, andriy.shevchenko, trishalfonso, andreyknvl,
	jikos, mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes, daniel.vetter,
	bhelgaas, kw, dan.j.williams, senozhatsky, hch, joe, hkallweit1,
	axboe, jpoimboe, tglx, keescook, rostedt, peterz, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest, cgroups,
	linux-kernel, copyleft-next

On 9/17/21 10:04 PM, Luis Chamberlain wrote:
> A sketch of how this can happen follows:
> 
> CPU A                              CPU B
>                                     whatever_store()
> module_unload
>    mutex_lock(foo)
>                                     mutex_lock(foo)
>     del_gendisk(zram->disk);
>       device_del()
>         device_remove_groups()
> 
> In this situation whatever_store() is waiting for the mutex foo to
> become unlocked, but that won't happen until module removal is complete.
> But module removal won't complete until the sysfs file being poked
> completes which is waiting for a lock already held.

If I remember correctly I encountered the deadlock scenario described
above for the first time about ten years ago while working on the SCST
project. We solved this deadlock by removing the sysfs attributes from
the module unload code before grabbing mutex_lock(foo), e.g. by calling
sysfs_remove_file(). This works because calling sysfs_remove_file()
multiple times in a row is safe. Is that solution good enough for the
zram driver?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-20 21:36     ` Bart Van Assche
  0 siblings, 0 replies; 50+ messages in thread
From: Bart Van Assche @ 2021-09-20 21:36 UTC (permalink / raw)
  To: Luis Chamberlain, tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, jolsa, andriy.shevchenko, trishalfonso, andreyknvl,
	jikos, mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes, daniel.vetter,
	bhelgaas, kw, dan.j.williams, senozhatsky, hch

On 9/17/21 10:04 PM, Luis Chamberlain wrote:
> A sketch of how this can happen follows:
> 
> CPU A                              CPU B
>                                     whatever_store()
> module_unload
>    mutex_lock(foo)
>                                     mutex_lock(foo)
>     del_gendisk(zram->disk);
>       device_del()
>         device_remove_groups()
> 
> In this situation whatever_store() is waiting for the mutex foo to
> become unlocked, but that won't happen until module removal is complete.
> But module removal won't complete until the sysfs file being poked
> completes which is waiting for a lock already held.

If I remember correctly I encountered the deadlock scenario described
above for the first time about ten years ago while working on the SCST
project. We solved this deadlock by removing the sysfs attributes from
the module unload code before grabbing mutex_lock(foo), e.g. by calling
sysfs_remove_file(). This works because calling sysfs_remove_file()
multiple times in a row is safe. Is that solution good enough for the
zram driver?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 03/12] selftests: add tests_sysfs module
  2021-09-18  5:04   ` Luis Chamberlain
@ 2021-09-20 21:38     ` Bart Van Assche
  -1 siblings, 0 replies; 50+ messages in thread
From: Bart Van Assche @ 2021-09-20 21:38 UTC (permalink / raw)
  To: Luis Chamberlain, tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, jolsa, andriy.shevchenko, trishalfonso, andreyknvl,
	jikos, mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes, daniel.vetter,
	bhelgaas, kw, dan.j.williams, senozhatsky, hch, joe, hkallweit1,
	axboe, jpoimboe, tglx, keescook, rostedt, peterz, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest, cgroups,
	linux-kernel, copyleft-next

On 9/17/21 10:04 PM, Luis Chamberlain wrote:
> +static void test_dev_config_lock(struct sysfs_test_device *test_dev)
> +{
> +	struct test_config *config;
> +
> +	config = &test_dev->config;

Please change the above into a single line (there are multiple
occurrences):

struct test_config *config = &test_dev->config;

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 03/12] selftests: add tests_sysfs module
@ 2021-09-20 21:38     ` Bart Van Assche
  0 siblings, 0 replies; 50+ messages in thread
From: Bart Van Assche @ 2021-09-20 21:38 UTC (permalink / raw)
  To: Luis Chamberlain, tj, gregkh, akpm, minchan, jeyu, shuah
  Cc: rdunlap, rafael, masahiroy, ndesaulniers, yzaikin, nathan, ojeda,
	penguin-kernel, vitor, elver, jarkko, glider, rf, stephen,
	David.Laight, jolsa, andriy.shevchenko, trishalfonso, andreyknvl,
	jikos, mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes, daniel.vetter,
	bhelgaas, kw, dan.j.williams, senozhatsky, hch

On 9/17/21 10:04 PM, Luis Chamberlain wrote:
> +static void test_dev_config_lock(struct sysfs_test_device *test_dev)
> +{
> +	struct test_config *config;
> +
> +	config = &test_dev->config;

Please change the above into a single line (there are multiple
occurrences):

struct test_config *config = &test_dev->config;

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-20 21:43       ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-20 21:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: tj, gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael,
	masahiroy, ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel,
	vitor, elver, jarkko, glider, rf, stephen, David.Laight, jolsa,
	andriy.shevchenko, trishalfonso, andreyknvl, jikos, mbenes,
	ngupta, sergey.senozhatsky.work, reinette.chatre, fenghua.yu, bp,
	x86, hpa, lizefan.x, hannes, daniel.vetter, bhelgaas, kw,
	dan.j.williams, senozhatsky, hch, joe, hkallweit1, axboe,
	jpoimboe, tglx, keescook, rostedt, peterz, linux-spdx, linux-doc,
	linux-block, linux-fsdevel, linux-kselftest, cgroups,
	linux-kernel, copyleft-next

On Mon, Sep 20, 2021 at 02:36:38PM -0700, Bart Van Assche wrote:
> On 9/17/21 10:04 PM, Luis Chamberlain wrote:
> > A sketch of how this can happen follows:
> > 
> > CPU A                              CPU B
> >                                     whatever_store()
> > module_unload
> >    mutex_lock(foo)
> >                                     mutex_lock(foo)
> >     del_gendisk(zram->disk);
> >       device_del()
> >         device_remove_groups()
> > 
> > In this situation whatever_store() is waiting for the mutex foo to
> > become unlocked, but that won't happen until module removal is complete.
> > But module removal won't complete until the sysfs file being poked
> > completes which is waiting for a lock already held.
> 
> If I remember correctly I encountered the deadlock scenario described
> above for the first time about ten years ago while working on the SCST
> project. We solved this deadlock by removing the sysfs attributes from
> the module unload code before grabbing mutex_lock(foo), e.g. by calling
> sysfs_remove_file().

Well the sysfs attributes in zram do tons of funky mucking around so
unfortunately no. It's not the only driver where this can happen. It is
why I decided to work on a generic solution instead.

> This works because calling sysfs_remove_file()
> multiple times in a row is safe. Is that solution good enough for the
> zram driver?

The sysfs attributes are group attributes part of the block, and so are
removed for the driver on a del_gendisk(). So unfortunately no, this
would not be a good solution in this case.

  Luis

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-20 21:43       ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-20 21:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: tj-DgEjT+Ai2ygdnm+yROfE0A,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	minchan-DgEjT+Ai2ygdnm+yROfE0A, jeyu-DgEjT+Ai2ygdnm+yROfE0A,
	shuah-DgEjT+Ai2ygdnm+yROfE0A, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ,
	rafael-DgEjT+Ai2ygdnm+yROfE0A, masahiroy-DgEjT+Ai2ygdnm+yROfE0A,
	ndesaulniers-hpIqsD4AKlfQT0dZR+AlfA,
	yzaikin-hpIqsD4AKlfQT0dZR+AlfA, nathan-DgEjT+Ai2ygdnm+yROfE0A,
	ojeda-DgEjT+Ai2ygdnm+yROfE0A,
	penguin-kernel-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp,
	vitor-HdruG0TBZYlAfugRpC6u6w, elver-hpIqsD4AKlfQT0dZR+AlfA,
	jarkko-DgEjT+Ai2ygdnm+yROfE0A, glider-hpIqsD4AKlfQT0dZR+AlfA,
	rf-yzvPICuk2AA4QjBA90+/kJqQE7yCjDx5,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	David.Laight-JxhZ9S5GRejQT0dZR+AlfA,
	jolsa-DgEjT+Ai2ygdnm+yROfE0A,
	andriy.shevchenko-VuQAYsv1563Yd54FQh9/CA,
	trishalfonso-hpIqsD4AKlfQT0dZR+AlfA,
	andreyknvl-Re5JQEeQqe8AvxtiuMwx3w, jikos-DgEjT+Ai2ygdnm+yROfE0A,
	mbenes-IBi9RG/b67k, ngupta-KNmc09w0p+Ednm+yROfE0A,
	sergey.senozhatsky.work-Re5JQEeQqe8AvxtiuMwx3w,
	reinette.chatre-ral2JQCrhuEAvxtiuMwx3w,
	fenghua.yu-ral2JQCrhuEAvxtiuMwx3w, bp-Gina5bIWoIWzQB+pC5nmwQ,
	x86-DgEjT+Ai2ygdnm+yROfE0A, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	lizefan.x-EC8Uxl6Npydl57MIdRCFDg, hannes

On Mon, Sep 20, 2021 at 02:36:38PM -0700, Bart Van Assche wrote:
> On 9/17/21 10:04 PM, Luis Chamberlain wrote:
> > A sketch of how this can happen follows:
> > 
> > CPU A                              CPU B
> >                                     whatever_store()
> > module_unload
> >    mutex_lock(foo)
> >                                     mutex_lock(foo)
> >     del_gendisk(zram->disk);
> >       device_del()
> >         device_remove_groups()
> > 
> > In this situation whatever_store() is waiting for the mutex foo to
> > become unlocked, but that won't happen until module removal is complete.
> > But module removal won't complete until the sysfs file being poked
> > completes which is waiting for a lock already held.
> 
> If I remember correctly I encountered the deadlock scenario described
> above for the first time about ten years ago while working on the SCST
> project. We solved this deadlock by removing the sysfs attributes from
> the module unload code before grabbing mutex_lock(foo), e.g. by calling
> sysfs_remove_file().

Well the sysfs attributes in zram do tons of funky mucking around so
unfortunately no. It's not the only driver where this can happen. It is
why I decided to work on a generic solution instead.

> This works because calling sysfs_remove_file()
> multiple times in a row is safe. Is that solution good enough for the
> zram driver?

The sysfs attributes are group attributes part of the block, and so are
removed for the driver on a del_gendisk(). So unfortunately no, this
would not be a good solution in this case.

  Luis

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
  2021-09-20 21:17       ` Luis Chamberlain
@ 2021-09-20 21:55         ` Dan Williams
  -1 siblings, 0 replies; 50+ messages in thread
From: Dan Williams @ 2021-09-20 21:55 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Tejun Heo, Greg KH, Andrew Morton, Minchan Kim, jeyu, shuah,
	Randy Dunlap, Rafael J. Wysocki, Masahiro Yamada,
	Nick Desaulniers, yzaikin, Nathan Chancellor, ojeda,
	Tetsuo Handa, vitor, elver, Jarkko Sakkinen, Alexander Potapenko,
	rf, Stephen Hemminger, David Laight, bvanassche, jolsa,
	Andy Shevchenko, trishalfonso, andreyknvl, Jiri Kosina, mbenes,
	Nitin Gupta, Sergey Senozhatsky, Reinette Chatre, Fenghua Yu,
	Borislav Petkov, X86 ML, H. Peter Anvin, lizefan.x,
	Johannes Weiner, Daniel Vetter, Bjorn Helgaas,
	Krzysztof Wilczyński, senozhatsky, Christoph Hellwig,
	Joe Perches, hkallweit1, Jens Axboe, Josh Poimboeuf,
	Thomas Gleixner, Kees Cook, Steven Rostedt, Peter Zijlstra,
	linux-spdx, Linux Doc Mailing List, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, Linux Kernel Mailing List,
	copyleft-next

On Mon, Sep 20, 2021 at 2:17 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Sep 20, 2021 at 01:52:21PM -0700, Dan Williams wrote:
> > On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > > This deadlock was first reported with the zram driver, however the live
> > > patching folks have acknowledged they have observed this as well with
> > > live patching, when a live patch is removed. I was then able to
> > > reproduce easily by creating a dedicated selftests.
> > >
> > > A sketch of how this can happen follows:
> > >
> > > CPU A                              CPU B
> > >                                    whatever_store()
> > > module_unload
> > >   mutex_lock(foo)
> > >                                    mutex_lock(foo)
> > >    del_gendisk(zram->disk);
> > >      device_del()
> > >        device_remove_groups()
> >
> > This flow seems possible to trigger with:
> >
> >    echo $dev > /sys/bus/$bus/drivers/$driver/unbind
> >
> > I am missing why module pinning
>
> The aspect of try_module_get() which comes to value to prevent the
> deadlock is it ensures kernfs ops do not run once exit is on the way.
>
> > is part of the solution when it's the
> > device_del() path that is racing?
>
> But its not, the device_del() path will yield until the kernfs op
> completes. It is fine to wait there.
>
> The deadlock happens if a module exit routine uses a lock which is
> also used on a sysfs op. If the lock was first held by module exit,
> and module exit is waiting for the kernfs op to complete, and the
> kernfs op is waiting to hold the same lock then the exit will wait
> forever.
>
> > Module removal is just a more coarse
> > grained way to trigger unbind => device_del().
>
> Right, but the device_del() path is not sharing a lock with the sysfs op.

The deadlock in the example comes from holding a lock over
device_del() that is also taken in a kernfs op.  For example, the code
above looks like something that runs from driver.remove(), not
exclusively module_exit(). Yes, module_exit() may trigger
driver.remove() via driver_unregister(), but there's other ways to
trigger driver.remove() that do not involve module_exit().

> > Isn't the above a bug
> > in the driver, not missing synchronization in kernfs?
>
> We can certainly take the position as an alternative:
>
>   "thou shalt not use a lock on exit which is also used on a syfs op"
>
> However that seems counter intuitive, specially if we can resolve the
> issue easily with a try_module_get().

Again, I don't see how try_module_get() can affect the ABBA failure
case of holding a lock over device_del() that is also held inside
sysfs op. I agree that the problem is subtle. Does lockdep not
complain about this case? If it's going to be avoided in the core it
seems try_module_get() does not completely cover the hole that
unsuspecting driver writers might fall into.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-20 21:55         ` Dan Williams
  0 siblings, 0 replies; 50+ messages in thread
From: Dan Williams @ 2021-09-20 21:55 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Tejun Heo, Greg KH, Andrew Morton, Minchan Kim, jeyu, shuah,
	Randy Dunlap, Rafael J. Wysocki, Masahiro Yamada,
	Nick Desaulniers, yzaikin, Nathan Chancellor, ojeda,
	Tetsuo Handa, vitor, elver, Jarkko Sakkinen, Alexander Potapenko,
	rf, Stephen Hemminger, David Laight, bvanassche, jolsa,
	Andy Shevchenko

On Mon, Sep 20, 2021 at 2:17 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Sep 20, 2021 at 01:52:21PM -0700, Dan Williams wrote:
> > On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > > This deadlock was first reported with the zram driver, however the live
> > > patching folks have acknowledged they have observed this as well with
> > > live patching, when a live patch is removed. I was then able to
> > > reproduce easily by creating a dedicated selftests.
> > >
> > > A sketch of how this can happen follows:
> > >
> > > CPU A                              CPU B
> > >                                    whatever_store()
> > > module_unload
> > >   mutex_lock(foo)
> > >                                    mutex_lock(foo)
> > >    del_gendisk(zram->disk);
> > >      device_del()
> > >        device_remove_groups()
> >
> > This flow seems possible to trigger with:
> >
> >    echo $dev > /sys/bus/$bus/drivers/$driver/unbind
> >
> > I am missing why module pinning
>
> The aspect of try_module_get() which comes to value to prevent the
> deadlock is it ensures kernfs ops do not run once exit is on the way.
>
> > is part of the solution when it's the
> > device_del() path that is racing?
>
> But its not, the device_del() path will yield until the kernfs op
> completes. It is fine to wait there.
>
> The deadlock happens if a module exit routine uses a lock which is
> also used on a sysfs op. If the lock was first held by module exit,
> and module exit is waiting for the kernfs op to complete, and the
> kernfs op is waiting to hold the same lock then the exit will wait
> forever.
>
> > Module removal is just a more coarse
> > grained way to trigger unbind => device_del().
>
> Right, but the device_del() path is not sharing a lock with the sysfs op.

The deadlock in the example comes from holding a lock over
device_del() that is also taken in a kernfs op.  For example, the code
above looks like something that runs from driver.remove(), not
exclusively module_exit(). Yes, module_exit() may trigger
driver.remove() via driver_unregister(), but there's other ways to
trigger driver.remove() that do not involve module_exit().

> > Isn't the above a bug
> > in the driver, not missing synchronization in kernfs?
>
> We can certainly take the position as an alternative:
>
>   "thou shalt not use a lock on exit which is also used on a syfs op"
>
> However that seems counter intuitive, specially if we can resolve the
> issue easily with a try_module_get().

Again, I don't see how try_module_get() can affect the ABBA failure
case of holding a lock over device_del() that is also held inside
sysfs op. I agree that the problem is subtle. Does lockdep not
complain about this case? If it's going to be avoided in the core it
seems try_module_get() does not completely cover the hole that
unsuspecting driver writers might fall into.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
  2021-09-20 21:55         ` Dan Williams
@ 2021-09-21  0:03           ` Luis Chamberlain
  -1 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-21  0:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Tejun Heo, Greg KH, Andrew Morton, Minchan Kim, jeyu, shuah,
	Randy Dunlap, Rafael J. Wysocki, Masahiro Yamada,
	Nick Desaulniers, yzaikin, Nathan Chancellor, ojeda,
	Tetsuo Handa, vitor, elver, Jarkko Sakkinen, Alexander Potapenko,
	rf, Stephen Hemminger, David Laight, bvanassche, jolsa,
	Andy Shevchenko, trishalfonso, andreyknvl, Jiri Kosina, mbenes,
	Nitin Gupta, Sergey Senozhatsky, Reinette Chatre, Fenghua Yu,
	Borislav Petkov, X86 ML, H. Peter Anvin, lizefan.x,
	Johannes Weiner, Daniel Vetter, Bjorn Helgaas,
	Krzysztof Wilczyński, senozhatsky, Christoph Hellwig,
	Joe Perches, hkallweit1, Jens Axboe, Josh Poimboeuf,
	Thomas Gleixner, Kees Cook, Steven Rostedt, Peter Zijlstra,
	linux-spdx, Linux Doc Mailing List, linux-block, linux-fsdevel,
	linux-kselftest, cgroups, Linux Kernel Mailing List,
	copyleft-next

On Mon, Sep 20, 2021 at 02:55:10PM -0700, Dan Williams wrote:
> On Mon, Sep 20, 2021 at 2:17 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Mon, Sep 20, 2021 at 01:52:21PM -0700, Dan Williams wrote:
> > > On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > > > This deadlock was first reported with the zram driver, however the live
> > > > patching folks have acknowledged they have observed this as well with
> > > > live patching, when a live patch is removed. I was then able to
> > > > reproduce easily by creating a dedicated selftests.
> > > >
> > > > A sketch of how this can happen follows:
> > > >
> > > > CPU A                              CPU B
> > > >                                    whatever_store()
> > > > module_unload
> > > >   mutex_lock(foo)
> > > >                                    mutex_lock(foo)
> > > >    del_gendisk(zram->disk);
> > > >      device_del()
> > > >        device_remove_groups()
> > >
> > > This flow seems possible to trigger with:
> > >
> > >    echo $dev > /sys/bus/$bus/drivers/$driver/unbind
> > >
> > > I am missing why module pinning
> >
> > The aspect of try_module_get() which comes to value to prevent the
> > deadlock is it ensures kernfs ops do not run once exit is on the way.
> >
> > > is part of the solution when it's the
> > > device_del() path that is racing?
> >
> > But its not, the device_del() path will yield until the kernfs op
> > completes. It is fine to wait there.
> >
> > The deadlock happens if a module exit routine uses a lock which is
> > also used on a sysfs op. If the lock was first held by module exit,
> > and module exit is waiting for the kernfs op to complete, and the
> > kernfs op is waiting to hold the same lock then the exit will wait
> > forever.
> >
> > > Module removal is just a more coarse
> > > grained way to trigger unbind => device_del().
> >
> > Right, but the device_del() path is not sharing a lock with the sysfs op.
> 
> The deadlock in the example comes from holding a lock over
> device_del() [...]

No sorry, that is my mistake not making it clear that the mutex held
in the example is on module exit. Or any lock for that matter. That is
these locks are driver specific.

> > > Isn't the above a bug
> > > in the driver, not missing synchronization in kernfs?
> >
> > We can certainly take the position as an alternative:
> >
> >   "thou shalt not use a lock on exit which is also used on a syfs op"
> >
> > However that seems counter intuitive, specially if we can resolve the
> > issue easily with a try_module_get().
> 
> Again, I don't see how try_module_get() can affect the ABBA failure
> case of holding a lock over device_del() that is also held inside
> sysfs op.

It is not device_del(), it is on module exit. Sorry for this not being
clear before. I'll fix the commit log to make it clearer. The subject
at least was clear but I think the example could be clearer.

  Luis

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 09/12] sysfs: fix deadlock race with module removal
@ 2021-09-21  0:03           ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-21  0:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Tejun Heo, Greg KH, Andrew Morton, Minchan Kim, jeyu, shuah,
	Randy Dunlap, Rafael J. Wysocki, Masahiro Yamada,
	Nick Desaulniers, yzaikin, Nathan Chancellor, ojeda,
	Tetsuo Handa, vitor, elver, Jarkko Sakkinen, Alexander Potapenko,
	rf, Stephen Hemminger, David Laight, bvanassche, jolsa,
	Andy Shevchenko

On Mon, Sep 20, 2021 at 02:55:10PM -0700, Dan Williams wrote:
> On Mon, Sep 20, 2021 at 2:17 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Mon, Sep 20, 2021 at 01:52:21PM -0700, Dan Williams wrote:
> > > On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > > > This deadlock was first reported with the zram driver, however the live
> > > > patching folks have acknowledged they have observed this as well with
> > > > live patching, when a live patch is removed. I was then able to
> > > > reproduce easily by creating a dedicated selftests.
> > > >
> > > > A sketch of how this can happen follows:
> > > >
> > > > CPU A                              CPU B
> > > >                                    whatever_store()
> > > > module_unload
> > > >   mutex_lock(foo)
> > > >                                    mutex_lock(foo)
> > > >    del_gendisk(zram->disk);
> > > >      device_del()
> > > >        device_remove_groups()
> > >
> > > This flow seems possible to trigger with:
> > >
> > >    echo $dev > /sys/bus/$bus/drivers/$driver/unbind
> > >
> > > I am missing why module pinning
> >
> > The aspect of try_module_get() which comes to value to prevent the
> > deadlock is it ensures kernfs ops do not run once exit is on the way.
> >
> > > is part of the solution when it's the
> > > device_del() path that is racing?
> >
> > But its not, the device_del() path will yield until the kernfs op
> > completes. It is fine to wait there.
> >
> > The deadlock happens if a module exit routine uses a lock which is
> > also used on a sysfs op. If the lock was first held by module exit,
> > and module exit is waiting for the kernfs op to complete, and the
> > kernfs op is waiting to hold the same lock then the exit will wait
> > forever.
> >
> > > Module removal is just a more coarse
> > > grained way to trigger unbind => device_del().
> >
> > Right, but the device_del() path is not sharing a lock with the sysfs op.
> 
> The deadlock in the example comes from holding a lock over
> device_del() [...]

No sorry, that is my mistake not making it clear that the mutex held
in the example is on module exit. Or any lock for that matter. That is
these locks are driver specific.

> > > Isn't the above a bug
> > > in the driver, not missing synchronization in kernfs?
> >
> > We can certainly take the position as an alternative:
> >
> >   "thou shalt not use a lock on exit which is also used on a syfs op"
> >
> > However that seems counter intuitive, specially if we can resolve the
> > issue easily with a try_module_get().
> 
> Again, I don't see how try_module_get() can affect the ABBA failure
> case of holding a lock over device_del() that is also held inside
> sysfs op.

It is not device_del(), it is on module exit. Sorry for this not being
clear before. I'll fix the commit log to make it clearer. The subject
at least was clear but I think the example could be clearer.

  Luis

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 00/12 (RESEND)] syfs: generic deadlock fix with module removal
  2021-09-20 17:55   ` Tejun Heo
@ 2021-09-27 15:11     ` Luis Chamberlain
  -1 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-27 15:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael, masahiroy,
	ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel, vitor,
	elver, jarkko, glider, rf, stephen, David.Laight, bvanassche,
	jolsa, andriy.shevchenko, trishalfonso, andreyknvl, jikos,
	mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes, daniel.vetter,
	bhelgaas, kw, dan.j.williams, senozhatsky, hch, joe, hkallweit1,
	axboe, jpoimboe, tglx, keescook, rostedt, peterz, linux-spdx,
	linux-doc, linux-block, linux-fsdevel, linux-kselftest, cgroups,
	linux-kernel, copyleft-next

On Mon, Sep 20, 2021 at 07:55:13AM -1000, Tejun Heo wrote:
> On Fri, Sep 17, 2021 at 10:04:18PM -0700, Luis Chamberlain wrote:
> > In this v7 I've decided it is best to merge all the effort together into
> > one patch set because communication was being lost when I split the
> > patches up. This was not helping in any way to either fix the zram
> > issues or come to consensus on a generic solution. The patches are also
> > merged now because they are all related now.
> 
> Building up all the testing framewoork is really great. I have no opinions
> about the license related stuff but all other changes generally look good to
> me.

OK I am going to send a v8 shortly with the changes from folks addressed.
I'm going to trim the Cc list considerbly for that v8 as the list is already
quite large and I think it may appear as spam to some lists, I'll drop the
copyleft-next folks as hopefully the license stuff is out of the way.

  Luis


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v7 00/12 (RESEND)] syfs: generic deadlock fix with module removal
@ 2021-09-27 15:11     ` Luis Chamberlain
  0 siblings, 0 replies; 50+ messages in thread
From: Luis Chamberlain @ 2021-09-27 15:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, akpm, minchan, jeyu, shuah, rdunlap, rafael, masahiroy,
	ndesaulniers, yzaikin, nathan, ojeda, penguin-kernel, vitor,
	elver, jarkko, glider, rf, stephen, David.Laight, bvanassche,
	jolsa, andriy.shevchenko, trishalfonso, andreyknvl, jikos,
	mbenes, ngupta, sergey.senozhatsky.work, reinette.chatre,
	fenghua.yu, bp, x86, hpa, lizefan.x, hannes

On Mon, Sep 20, 2021 at 07:55:13AM -1000, Tejun Heo wrote:
> On Fri, Sep 17, 2021 at 10:04:18PM -0700, Luis Chamberlain wrote:
> > In this v7 I've decided it is best to merge all the effort together into
> > one patch set because communication was being lost when I split the
> > patches up. This was not helping in any way to either fix the zram
> > issues or come to consensus on a generic solution. The patches are also
> > merged now because they are all related now.
> 
> Building up all the testing framewoork is really great. I have no opinions
> about the license related stuff but all other changes generally look good to
> me.

OK I am going to send a v8 shortly with the changes from folks addressed.
I'm going to trim the Cc list considerbly for that v8 as the list is already
quite large and I think it may appear as spam to some lists, I'll drop the
copyleft-next folks as hopefully the license stuff is out of the way.

  Luis


^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2021-09-27 15:12 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20210918050430.3671227-1-mcgrof@kernel.org>
2021-09-18  5:04 ` [PATCH v7 01/12] LICENSES: Add the copyleft-next-0.3.1 license Luis Chamberlain
2021-09-18  5:04   ` Luis Chamberlain
2021-09-18  5:04 ` [PATCH v7 02/12] testing: use the copyleft-next-0.3.1 SPDX tag Luis Chamberlain
2021-09-18  5:04   ` Luis Chamberlain
2021-09-18  5:04 ` [PATCH v7 03/12] selftests: add tests_sysfs module Luis Chamberlain
2021-09-18  5:04   ` Luis Chamberlain
2021-09-20 21:38   ` Bart Van Assche
2021-09-20 21:38     ` Bart Van Assche
2021-09-18  5:04 ` [PATCH v7 04/12] kernfs: add initial failure injection support Luis Chamberlain
2021-09-18  5:04   ` Luis Chamberlain
2021-09-18  5:04 ` [PATCH v7 05/12] test_sysfs: add support to use kernfs failure injection Luis Chamberlain
2021-09-18  5:04   ` Luis Chamberlain
2021-09-18  5:04 ` [PATCH v7 06/12] kernel/module: add documentation for try_module_get() Luis Chamberlain
2021-09-18  5:04   ` Luis Chamberlain
2021-09-18  5:04 ` [PATCH v7 07/12] fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on kernfs_create_link() Luis Chamberlain
2021-09-18  5:04   ` Luis Chamberlain
2021-09-18  5:04 ` [PATCH v7 08/12] fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns() Luis Chamberlain
2021-09-18  5:04   ` Luis Chamberlain
2021-09-18  5:04 ` [PATCH v7 09/12] sysfs: fix deadlock race with module removal Luis Chamberlain
2021-09-18  5:04   ` Luis Chamberlain
2021-09-20 17:53   ` Tejun Heo
2021-09-20 17:53     ` Tejun Heo
2021-09-20 19:15     ` Luis Chamberlain
2021-09-20 19:15       ` Luis Chamberlain
2021-09-20 19:22       ` Tejun Heo
2021-09-20 19:22         ` Tejun Heo
2021-09-20 19:38         ` Luis Chamberlain
2021-09-20 19:38           ` Luis Chamberlain
2021-09-20 20:52   ` Dan Williams
2021-09-20 20:52     ` Dan Williams
2021-09-20 21:17     ` Luis Chamberlain
2021-09-20 21:17       ` Luis Chamberlain
2021-09-20 21:55       ` Dan Williams
2021-09-20 21:55         ` Dan Williams
2021-09-21  0:03         ` Luis Chamberlain
2021-09-21  0:03           ` Luis Chamberlain
2021-09-20 21:36   ` Bart Van Assche
2021-09-20 21:36     ` Bart Van Assche
2021-09-20 21:43     ` Luis Chamberlain
2021-09-20 21:43       ` Luis Chamberlain
2021-09-18  5:04 ` [PATCH v7 10/12] test_sysfs: enable deadlock tests by default Luis Chamberlain
2021-09-18  5:04   ` Luis Chamberlain
2021-09-18  5:04 ` [PATCH v7 11/12] zram: fix crashes with cpu hotplug multistate Luis Chamberlain
2021-09-18  5:04   ` Luis Chamberlain
2021-09-18  5:04 ` [PATCH v7 12/12] zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal Luis Chamberlain
2021-09-18  5:04   ` Luis Chamberlain
2021-09-20 17:55 ` [PATCH v7 00/12 (RESEND)] syfs: generic deadlock fix with " Tejun Heo
2021-09-20 17:55   ` Tejun Heo
2021-09-27 15:11   ` Luis Chamberlain
2021-09-27 15:11     ` Luis Chamberlain

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.